
Local Model Trainer — a neural network that learns to predict text, one character at a time.

A character-level language model trained and run entirely in the browser. No server. No cloud API. Weights are initialised as random noise, adjusted through gradient descent on the text you provide, then used to generate new text one character at a time.

The electricity it uses comes from the device reading this. The heat it generates is real.

This represents one person's current understanding of these topics. Corrections and additions welcome at kristoffer@oerum.org.

A neural network is a function composed of many smaller functions arranged in layers. Each layer transforms its input by multiplying it by a matrix of learned weights, then passing the result through a non-linear function. The "learning" is the process of adjusting those weight matrices so that the overall function produces useful outputs.

The term comes from a loose analogy to biological neurons, but the analogy is mostly historical. What is actually happening is matrix multiplication and gradient descent.

This is not intelligence in the sense of human cognition — there is no understanding, no embodied experience, no intention. It is something more specific and in some ways more legible: a function that has been shaped by exposure to a particular body of text until it produces outputs that are statistically consistent with that text. What it knows is entirely a function of what it was trained on. That constraint is also a possibility. Heikkilä (Viznut) writes that artificial intellects "should not be thought about as competing against humans in human-like terms. Their greatest value is that they are different from human minds and thus able to expand the intellectual diversity of the world." Whether that framing holds at the scale of frontier models is debatable; at the scale of a locally trained character-level model it is at least plausible.

McCulloch, W.S. & Pitts, W. (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, 5, 115–133.

Matrix multiplication is the core operation of every neural network. Given input vector x and weight matrix W, a layer computes Wx + b where b is a bias vector. For a layer with 64 units processing an input of dimension 32, this is a (64 × 32) matrix multiplied by a (32 × 1) vector — roughly 2,048 multiply-accumulate operations per forward pass through that layer alone.
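The layer computation can be sketched in a few lines of plain JavaScript — this is illustrative, not the optimised code TensorFlow.js actually runs on the GPU:

```javascript
// Dense layer forward pass: y = Wx + b, computed naively.
// W has shape (output_dim × input_dim), x has length input_dim,
// b has length output_dim.
function denseForward(W, x, b) {
  return W.map((row, i) =>
    row.reduce((sum, w, j) => sum + w * x[j], b[i])
  );
}

// A 2-unit layer on a 3-dimensional input: (2 × 3) · (3 × 1) + (2 × 1).
const W = [[1, 0, 2],
           [0, 1, 1]];
const x = [3, 4, 5];
const b = [1, -1];
console.log(denseForward(W, x, b)); // → [ 14, 8 ]
```

Each output value costs one multiply-accumulate per input dimension, which is where the operation count in the paragraph above comes from.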

Modern GPUs accelerate this by running thousands of such operations in parallel across shader cores. TensorFlow.js maps these operations onto WebGL shader programs, which is why a GPU matters for training speed here.

Golub, G.H. & Van Loan, C.F. (2013). Matrix Computations (4th ed.). Johns Hopkins University Press.

Gradient descent is an iterative optimisation method. Given a loss function L(θ) measuring how wrong the model is, and the gradient ∇L(θ) pointing in the direction of steepest increase, gradient descent updates the parameters in the opposite direction:

θ ← θ − η · ∇L(θ)

where η is the learning rate. This is repeated for many batches of training data until the loss stops decreasing. The conceptual origin is Legendre and Gauss's method of least squares (1805–1809); the continuous generalisation to non-linear models came much later.
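The update rule above can be sketched as a minimal least-squares fit — the data, learning rate, and step count here are illustrative, not the values this page uses:

```javascript
// Fit y ≈ θ·x by gradient descent on squared error L(θ) = Σ (θ·x − y)².
const xs = [1, 2, 3, 4];
const ys = [2, 4, 6, 8]; // true slope is 2

let theta = 0;           // start from an arbitrary initial value
const eta = 0.01;        // learning rate η

for (let step = 0; step < 200; step++) {
  // ∇L(θ) = Σ 2·(θ·x − y)·x
  const grad = xs.reduce((g, x, i) => g + 2 * (theta * x - ys[i]) * x, 0);
  theta -= eta * grad;   // θ ← θ − η·∇L(θ)
}
console.log(theta.toFixed(4)); // → 2.0000
```

The same loop, with the gradient computed by backpropagation over millions of weights instead of one, is what training does.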

This model uses the Adam optimiser, a variant of gradient descent that adapts the learning rate per parameter.

Cauchy, A. (1847). Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes Rendus, 25, 536–538. (First formal statement of gradient descent.)

Adam (Adaptive Moment Estimation) maintains running estimates of the first moment (mean) and second moment (uncentred variance) of the gradients, and uses these to scale the learning rate individually for each parameter. Parameters with consistently large gradients get smaller updates; parameters with small or noisy gradients get relatively larger ones.

m_t = β₁·m_{t−1} + (1−β₁)·g_t        (first moment)
v_t = β₂·v_{t−1} + (1−β₂)·g_t²       (second moment)
m̂_t = m_t / (1−β₁^t),  v̂_t = v_t / (1−β₂^t)   (bias correction)
θ_t = θ_{t−1} − η · m̂_t / (√v̂_t + ε)

Default values β₁ = 0.9, β₂ = 0.999. Over 200,000 citations as of 2024.
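The update can be sketched on a single parameter — the loss here is an illustrative quadratic, not the model's actual cross-entropy:

```javascript
// Adam on L(θ) = (θ − 3)², so ∇L(θ) = 2(θ − 3); minimum at θ = 3.
function adamMinimise(grad, theta0, steps, eta = 0.1,
                      b1 = 0.9, b2 = 0.999, eps = 1e-8) {
  let theta = theta0, m = 0, v = 0;
  for (let t = 1; t <= steps; t++) {
    const g = grad(theta);
    m = b1 * m + (1 - b1) * g;                // first moment estimate
    v = b2 * v + (1 - b2) * g * g;            // second moment estimate
    const mHat = m / (1 - Math.pow(b1, t));   // bias correction
    const vHat = v / (1 - Math.pow(b2, t));
    theta -= eta * mHat / (Math.sqrt(vHat) + eps);
  }
  return theta;
}

const theta = adamMinimise(t => 2 * (t - 3), 0, 2000);
console.log(theta); // approaches the minimum at 3
```

Note that m̂/√v̂ is roughly scale-invariant in the gradient, which is what makes the effective step size adapt per parameter.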

Kingma, D.P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. ICLR 2015. arXiv:1412.6980.

TensorFlow.js detects your available hardware and selects a backend: WebGL (GPU), WASM (CPU with SIMD extensions), or plain JavaScript. Training on GPU is roughly 10–50× faster than CPU for the matrix operations involved.

The energy consumed by this browser tab during a training run is small but real. At the scale of GPT-3 (175 billion parameters), a single training run consumed an estimated 1,287 MWh of electricity — roughly the annual consumption of 120 average US households. The physics of what happens here is identical. The scale differs by approximately eight orders of magnitude.

That gap is also the space in which something becomes possible. Small models trained on specific corpora on consumer hardware can do things that large generalised models cannot: they can be understood, audited, and owned. They can be trained on material that will never appear in a commercial training set. Running locally, they require no subscription, no API key, no ongoing relationship with an infrastructure provider. This model runs on any device with a modern browser — training is faster with a dedicated GPU but the computations are achievable on modest hardware, just slower. Many electricity providers offer tariffs tied to renewable sources, and some allow time-of-use pricing where off-peak hours correspond to periods of higher renewable generation in the local grid.

This approach — running models locally on available hardware, scheduling intensive computation for off-peak or surplus-energy periods, treating compute as a finite resource — is described in the permacomputing literature as frugal computing. The term permacomputing applies permaculture principles to computation: maximise hardware lifespan, minimise energy use, work with already available resources. Ville-Matias Heikkilä's foundational essay (2020) proposed that AI batch training would ideally occur only when surplus energy is being produced or when electricity-to-heat conversion is needed — that is, when the computation is in phase with the energy system around it rather than drawing from it regardless.

Patterson, D. et al. (2021). Carbon Emissions and Large Neural Network Training. arXiv:2104.10350. Strubell, E. et al. (2019). Energy and Policy Considerations for Deep Learning in NLP. ACL 2019. arXiv:1906.02243. Heikkilä, V.-M. (Viznut) (2020). Permacomputing. viznut.fi. Devine Lu Linvega / Hundred Rabbits (2023). Permacomputing 101. 100r.co.
Preflight

Browser compatibility

Training requires several browser capabilities. The checks below run automatically and report what your environment supports.

WebGL 2.0 is required for GPU-accelerated matrix operations. Without it, training falls back to CPU and will be significantly slower. WebAssembly is required for the WASM fallback backend. SharedArrayBuffer and cross-origin isolation enable WASM threading but require specific HTTP headers (COOP/COEP) that are not set when opening files directly from disk. WebGPU is a newer, higher-performance alternative to WebGL, available in Chrome 113+.

Kocher, P. et al. (2019). Spectre Attacks: Exploiting Speculative Execution. IEEE S&P 2019. SharedArrayBuffer was disabled across browsers following this disclosure. MDN security requirements.
Step 01

Input — the corpus

The model learns from text you provide here. It treats that text as a corpus and converts it into numbers through a process called tokenisation.

In linguistics and computational linguistics, a corpus (plural: corpora) is any collection of text used as data for analysis or training. The statistical properties of what a model learns — the character frequencies, transition probabilities, and sequential patterns — are entirely determined by the corpus. A model trained on legal documents will generate text that resembles legal documents. A model trained on this short demo text will generate text that resembles this short demo text.

The distributional hypothesis — that the meaning of a linguistic unit is constituted by the contexts in which it appears — was formalised by Harris (1954) and Firth (1957). All word embedding and language model approaches are computational implementations of this idea.

Most language models in public use are trained on corpora assembled at enormous scale according to priorities set by a small number of institutions. The resulting models reflect those priorities — in what they know, what they elide, and how they construct plausibility. A model trained on a specific, chosen corpus is a different kind of instrument: narrower in scope, more transparent in its assumptions, and oriented toward questions that you have defined rather than questions anticipated by someone else.

Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford University Press. On the politics of training data and whose interests shape generalised models, see: Bender, E.M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? FAccT 2021. ACM DL. On fine-tuning as a means of domain-specific adaptation without large-scale retraining: Hu, E. et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685.

Zellig Harris (1954) proposed "Distributional Structure" in the journal Word: linguistic elements with similar distributions have similar meanings. He was attempting to formalise structural linguistics as a discovery procedure — a method for deriving grammatical structure from distributional evidence alone, without appeal to meaning or intuition.

J.R. Firth (1957) gave the most-cited formulation: "You shall know a word by the company it keeps." Firth's contextual theory of meaning argued that meaning is not a property of isolated words but of their patterns of co-occurrence across real texts. This sentence is arguably the most consequential in the intellectual history of NLP.

Ferdinand de Saussure (1916, posth.) had earlier argued that linguistic signs have no positive content — their identity is constituted by their difference from other signs in the system. This structuralist position is the deeper theoretical background to the distributional hypothesis.

Harris, Z. (1954). Distributional Structure. Word, 10(2–3), 146–162. Firth, J.R. (1957). A Synopsis of Linguistic Theory 1930–1955. In Studies in Linguistic Analysis. Blackwell. Saussure, F. de (1916). Cours de linguistique générale. Payot.

Neural networks cannot process text directly. Text must be converted into numbers. Tokenisation is the process of dividing text into units (tokens) and mapping each unit to an integer index.

This model uses character-level tokenisation: every unique character in the corpus is assigned an integer. This is the simplest possible strategy, with a vocabulary size equal to the number of distinct characters (typically 50–100 for English text). The main strategies in use are:

Strategy           | Unit              | Vocab    | Used in
Character-level    | Single characters | 50–200   | This model
Word-level         | Whole words       | 10k–100k | Word2Vec
Byte Pair Encoding | Sub-word units    | 30k–100k | GPT, LLaMA, BERT

The character-to-integer mapping ultimately rests on the ASCII and Unicode standards.

Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016. arXiv:1508.07909.

The ASCII standard (1963, formalised 1967) defined the original 128-character mapping of text to 7-bit integers: uppercase and lowercase Latin letters, digits, punctuation, and control characters. RFC 20 (1969) standardised this for network transmission.

Unicode (first published 1991, now at version 15.1) extended this to over 140,000 characters covering most of the world's writing systems. Every character you type is already an integer at the hardware level — tokenisation at the character level is simply making that implicit mapping explicit and using it as input to the embedding layer.
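A character-level tokeniser of the kind described is only a few lines — a sketch, not the exact code this page runs:

```javascript
// Build a vocabulary of unique characters and a character→integer map.
function buildVocab(corpus) {
  const chars = [...new Set(corpus)].sort(); // sorted by code point
  const toIndex = new Map(chars.map((c, i) => [c, i]));
  return { chars, toIndex };
}

const { chars, toIndex } = buildVocab("hello world");
// chars is [" ", "d", "e", "h", "l", "o", "r", "w"] — vocab size 8
const encoded = [..."hello"].map(c => toIndex.get(c));
console.log(encoded);                             // → [ 3, 2, 4, 4, 5 ]
console.log(encoded.map(i => chars[i]).join("")); // → "hello"

// Underneath, every character is already an integer:
console.log("h".codePointAt(0)); // → 104
```

Decoding is the reverse lookup, which is why the round trip above recovers the original string exactly.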

Unicode Consortium. Publication history. RFC 20 (1969): IETF.
Corpus input
Keep the text under 2,000 characters; longer corpora can exhaust browser memory during training. The default text is an original permutational piece written for this corpus input — structurally constrained Danish prose in the tradition of systemdigtning as practised by Klaus Høeck, Inger Christensen, and Per Højholt. The character-level model will learn the closed vocabulary of its permutations.
Step 02

Architecture

The model has three layers: an embedding layer, an LSTM, and a dense output layer.

An embedding layer is a learnable lookup table. Each integer token index maps to a dense vector of fixed dimensionality — here, 32 dimensions. These vectors are initialised randomly and adjusted during training. Over time, the distances between them come to reflect statistical co-occurrence patterns in the corpus.
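"Learnable lookup table" is meant literally: the layer is row indexing into a trainable matrix. A sketch with a toy 4-token vocabulary and 3-dimensional vectors (real layers use the 32 dimensions stated above, and the values start as random noise just as here):

```javascript
// Embedding matrix: one row per vocabulary entry, embeddingDim columns,
// initialised with small random values.
const vocabSize = 4, embeddingDim = 3;
const embedding = Array.from({ length: vocabSize }, () =>
  Array.from({ length: embeddingDim }, () => Math.random() * 0.1 - 0.05)
);

// The forward pass is a lookup: each token index selects its row.
function embed(tokenIds) {
  return tokenIds.map(id => embedding[id]);
}

const vectors = embed([2, 0, 2]);
console.log(vectors[0] === vectors[2]); // → true: same token, same vector
```

During training, gradients flow back into exactly the rows that were looked up, which is how the table's geometry comes to encode co-occurrence.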

The theoretical basis is the distributional hypothesis: tokens that appear in similar contexts should have similar representations. This is what Harris and Firth argued at the level of linguistic theory. The embedding layer implements it computationally.

GPT-3 uses embedding dimensions of 12,288. The increase in dimensionality allows the model to represent finer-grained distinctions between tokens — at a proportionally higher energy cost per forward pass.

Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A Neural Probabilistic Language Model. JMLR, 3, 1137–1155. jmlr.org. Also: Hinton, G.E. (1986). Learning Distributed Representations of Concepts. Proc. 8th Annual Conf. Cognitive Science Society.

An LSTM (Long Short-Term Memory) is a type of recurrent neural network. Unlike feedforward networks that process each input independently, an LSTM processes sequences by maintaining a hidden state — a vector updated at each time step that carries information forward through the sequence.

Standard RNNs suffer from the vanishing gradient problem: when gradients are backpropagated through many time steps, they are multiplied together at each step. If those multiplications involve values less than 1, the gradient shrinks toward zero, and early parts of the sequence stop contributing to learning. Hochreiter identified this mathematically in his 1991 diploma thesis.
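The shrinkage is easy to see numerically — a chain of 100 per-step factors of 0.9 leaves almost nothing of the gradient:

```javascript
// Backpropagating through 100 time steps, where each step
// multiplies the gradient by a factor of 0.9:
let grad = 1.0;
for (let t = 0; t < 100; t++) grad *= 0.9;
console.log(grad); // ≈ 2.66e-5 — the early sequence barely contributes
```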

LSTMs address this through gating mechanisms:

f_t = σ(W_f·[h_{t−1}, x_t] + b_f)                          — forget gate
i_t = σ(W_i·[h_{t−1}, x_t] + b_i)                          — input gate
C_t = f_t * C_{t−1} + i_t * tanh(W_C·[h_{t−1}, x_t] + b_C) — cell state
o_t = σ(W_o·[h_{t−1}, x_t] + b_o)                          — output gate
h_t = o_t * tanh(C_t)                                      — hidden state

The cell state C_t can persist information across many steps without gradient degradation. The LSTM was the dominant architecture for sequence modelling until the Transformer displaced it in 2017–2019.
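A single LSTM step from the gate equations, with scalar state for legibility — real layers use vectors and weight matrices, and all parameter values here are illustrative:

```javascript
const sigmoid = z => 1 / (1 + Math.exp(-z));

// One LSTM time step with scalar gates. The p.w* and p.b* values are
// trainable parameters; [hPrev, x] is the concatenated gate input.
function lstmStep(x, hPrev, cPrev, p) {
  const f = sigmoid(p.wf[0] * hPrev + p.wf[1] * x + p.bf); // forget gate
  const i = sigmoid(p.wi[0] * hPrev + p.wi[1] * x + p.bi); // input gate
  const cTilde = Math.tanh(p.wc[0] * hPrev + p.wc[1] * x + p.bc);
  const c = f * cPrev + i * cTilde;                        // cell state
  const o = sigmoid(p.wo[0] * hPrev + p.wo[1] * x + p.bo); // output gate
  return { h: o * Math.tanh(c), c };
}

// With the forget gate saturated open (large bias) and the input gate
// shut, the cell state carries information across steps almost unchanged:
const p = { wf: [0, 0], bf: 10, wi: [0, 0], bi: -10,
            wc: [0, 0], bc: 0,  wo: [0, 0], bo: 0 };
let state = { h: 0, c: 5 };
for (let t = 0; t < 50; t++) state = lstmStep(1, state.h, state.c, p);
console.log(state.c); // still close to the initial 5 after 50 steps
```

This persistence of C_t through the additive update is exactly what a plain RNN's multiplicative hidden-state update cannot provide.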

Hochreiter, S. & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780. MIT Press. Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich.

The Transformer (Vaswani et al., 2017) replaced sequential recurrence with self-attention: a mechanism that computes pairwise relationships between all positions in a sequence simultaneously.

Property                   | LSTM                    | Transformer
Processing                 | Sequential              | Parallel (all positions)
Long-range dependencies    | Limited by hidden state | Full context window
Training parallelism       | Low                     | High
Compute vs sequence length | Linear                  | Quadratic (standard)

The quadratic cost of attention at long context lengths is the central engineering problem in current LLM development, motivating architectures like Mamba, RWKV, and various linear attention approximations.

Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS 30. arXiv:1706.03762.

The dense output layer maps the LSTM's hidden state (a vector of 64 values) to a vector of length equal to the vocabulary size — one value per possible next character. These raw values (logits) are then passed through a softmax function, which converts them into a probability distribution:

softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

The result is a vector of non-negative values that sum to 1. Each value represents the model's estimated probability that the next character is the corresponding vocabulary item. During training, the model is penalised by cross-entropy loss for assigning low probability to the correct next character.
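The softmax computation is short enough to show in full — with the standard max-subtraction for numerical stability, a detail the formula omits but every implementation uses:

```javascript
function softmax(logits) {
  const max = Math.max(...logits);                 // subtract the max so
  const exps = logits.map(z => Math.exp(z - max)); // exp never overflows
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

const probs = softmax([2.0, 1.0, 0.1]);
console.log(probs);                            // non-negative, ordered like the logits
console.log(probs.reduce((a, b) => a + b, 0)); // → 1
```

Subtracting the maximum leaves the result unchanged (it cancels in the ratio) while keeping every exponent ≤ 0.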

Cross-entropy measures the difference between the model's predicted probability distribution and the true distribution (a one-hot vector pointing at the correct next character):

L = −log q(correct_class)

A perfect prediction assigns probability 1.0 to the correct character: −log(1.0) = 0. A uniform random guess over a 50-character vocabulary gives −log(1/50) ≈ 3.91. Watch the training log to see where your loss starts and how far it descends.
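Both reference points can be computed directly:

```javascript
// Cross-entropy for one prediction: −log of the probability the model
// assigned to the correct next character, written as log(1/q).
const crossEntropy = pCorrect => Math.log(1 / pCorrect);

console.log(crossEntropy(1.0));     // → 0 (perfect prediction)
console.log(crossEntropy(1 / 50));  // ≈ 3.912 (uniform guess, 50-char vocab)
```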

Cross-entropy minimisation is equivalent to maximum likelihood estimation. Its theoretical grounding is in Shannon's information theory (1948): the cross-entropy of a predicted distribution relative to the true distribution equals the entropy of the true distribution plus the KL divergence between them.

Shannon, C.E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379–423. PDF.
Architecture: Input integers (seq_len × 1) → Embedding (seq_len × 32) → LSTM (64 units) → Dense + Softmax (vocab_size) → Probability of next character
Step 03

Training

At initialisation all weights are random noise. Training is the process of adjusting those weights by repeatedly measuring prediction error and applying backpropagation.

The loss function quantifies how wrong the model is. Here we use sparse categorical cross-entropy: given the model's probability distribution over the vocabulary and the true next character, the loss is the negative log probability assigned to the correct character. A lower loss means the model assigns higher probability to correct predictions.

Each training epoch is one complete pass through all the training sequences. With 25 epochs and a small corpus, the model will overfit — it will eventually memorise the specific sequences in the training data rather than learning general patterns. With a short corpus this is expected and unproblematic.

Backpropagation is the application of the chain rule of calculus to compute the gradient of the loss with respect to every weight in the network. Working backwards from the output layer to the input, each weight receives a gradient value indicating the direction and magnitude of adjustment needed to reduce the loss.
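The chain-rule mechanics can be checked on a tiny composition: the analytic backward pass must agree with a numerical finite-difference estimate, the standard sanity check for any backpropagation implementation. The function and values here are illustrative:

```javascript
// Forward pass: y = tanh(w·x); loss L = (y − target)².
function loss(w, x, target) {
  const y = Math.tanh(w * x);
  return (y - target) ** 2;
}

// Backward pass by the chain rule:
// dL/dw = dL/dy · dy/dw = 2(y − target) · (1 − y²) · x
function gradLoss(w, x, target) {
  const y = Math.tanh(w * x);
  return 2 * (y - target) * (1 - y * y) * x;
}

const [w, x, target] = [0.5, 1.5, 0.2];
const analytic = gradLoss(w, x, target);
const eps = 1e-6;
const numeric =
  (loss(w + eps, x, target) - loss(w - eps, x, target)) / (2 * eps);
console.log(analytic, numeric); // the two agree to several decimal places
```

Backpropagation is this same bookkeeping applied layer by layer, from the output back to the embedding.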

The algorithm was described by Rumelhart, Hinton, and Williams in 1986, though precursors existed: Linnainmaa's reverse-mode automatic differentiation (1970 thesis) and Werbos's application to neural networks (1974 PhD thesis at Harvard). The 1986 Nature paper is what established its practical significance for multi-layer networks.

Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). Learning Representations by Back-propagating Errors. Nature, 323, 533–536. nature.com.
Training controls
Step 04

Inference

After training, the model's weights are frozen. Given a seed sequence it produces a probability distribution over the vocabulary. We sample from that distribution using a temperature parameter to control randomness.

At each generation step, the model performs a forward pass through the full network and produces a softmax distribution over all characters in the vocabulary. This is a vector of non-negative values summing to 1 — the model's estimate of how likely each character is to come next, given the preceding sequence.

The model never produces a deterministic answer. It always produces a distribution. Whether you select the most probable character (greedy decoding) or sample from the distribution is a choice made at inference time, not a property of the model itself. This is why the same trained model produces different text on different runs.

The output is not retrieved, recalled, or reasoned. It is generated, character by character, from a probability distribution shaped by training data. There is no moment of comprehension. The text that emerges can be coherent, surprising, or useful — but those qualities arise from the statistical structure of the corpus, not from anything the model understands about them.

Temperature scaling divides the raw output logits by a temperature value T before the softmax is applied:

p_i = exp(logit_i / T) / Σ_j exp(logit_j / T)

At T → 0: the distribution collapses to argmax (deterministic greedy decoding). At T = 1.0: raw model probabilities are used unchanged. At T > 1.0: the distribution flattens — low-probability characters become relatively more likely, producing more varied but less coherent output.
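Temperature scaling followed by sampling can be sketched as follows — the logit values are illustrative:

```javascript
// Softmax with temperature: divide logits by T before normalising.
function softmaxWithTemperature(logits, T) {
  const scaled = logits.map(z => z / T);
  const max = Math.max(...scaled); // max-subtraction for stability
  const exps = scaled.map(z => Math.exp(z - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Draw one index from a probability distribution (inverse CDF sampling).
function sample(probs) {
  let r = Math.random();
  for (let i = 0; i < probs.length; i++) {
    r -= probs[i];
    if (r <= 0) return i;
  }
  return probs.length - 1;
}

const logits = [2.0, 1.0, 0.1];
const cold = softmaxWithTemperature(logits, 0.1); // sharpens toward argmax
const hot  = softmaxWithTemperature(logits, 5.0); // flattens toward uniform
console.log(cold[0], hot[0]);
console.log(sample(softmaxWithTemperature(logits, 1.0)));
```

At T = 0.1 nearly all probability mass sits on the top logit; at T = 5.0 the three options are close to equally likely.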

The term is borrowed from statistical mechanics: in the Boltzmann distribution, temperature governs the probability of a system occupying a given energy state. High temperature = more uniform distribution over states. The analogy is exact at the level of the mathematics.

Beyond temperature, the two main sampling strategies in production LLMs are top-k sampling (sample only from the k most probable tokens) and nucleus (top-p) sampling (sample from the smallest set of tokens whose cumulative probability exceeds p). Holtzman et al. (2020) demonstrated that nucleus sampling produces more coherent text than both greedy decoding and pure temperature sampling.
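Nucleus filtering can be sketched over a toy distribution — p = 0.9 and the probabilities are illustrative:

```javascript
// Keep the smallest set of tokens whose cumulative probability reaches p,
// zero out the rest, and renormalise over the kept set.
function nucleusFilter(probs, p) {
  const order = probs.map((q, i) => [q, i]).sort((a, b) => b[0] - a[0]);
  const kept = [];
  let cum = 0;
  for (const [q, i] of order) {
    kept.push(i);
    cum += q;
    if (cum >= p) break; // nucleus is complete
  }
  const out = new Array(probs.length).fill(0);
  for (const i of kept) out[i] = probs[i] / cum;
  return out;
}

const probs = [0.5, 0.3, 0.15, 0.04, 0.01];
console.log(nucleusFilter(probs, 0.9));
// the 0.04 and 0.01 tail is cut; the remaining mass is renormalised
```

Sampling then proceeds from the filtered distribution exactly as with temperature sampling; the two are routinely combined.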

Holtzman, A., Buys, J., Du, L., Forbes, M., & Choi, Y. (2020). The Curious Case of Neural Text Degeneration. ICLR 2020. arXiv:1904.09751. Ackley, D.H., Hinton, G.E., & Sejnowski, T.J. (1985). A Learning Algorithm for Boltzmann Machines. Cognitive Science, 9(1).
Generation controls
Context

Intellectual history

Neural language modelling emerges from three converging traditions: mathematics and statistics, structural and distributional linguistics, and machine learning research. They developed largely independently and only converged decisively in the 2010s.

1713
Ars Conjectandi — law of large numbers
Jakob Bernoulli (posth.). As trials increase, observed frequency converges on true probability. This is the foundational justification for learning statistical patterns from large corpora.
1763
Bayesian inference
Thomas Bayes (posth., ed. Price). Updating probability estimates given new evidence. Probabilistic language models are Bayesian in this sense: they estimate P(token | context).
1805–1809
Method of least squares
Legendre, Gauss. Minimising squared residuals to fit a model to data — the conceptual origin of loss minimisation. Gradient descent is a continuous generalisation of this to non-linear models.
1906
Markov chains
Andrei Markov introduced chains of dependent trials in 1906; in 1913 he applied them to the sequence of letters in Pushkin's Eugene Onegin, modelling text as a stochastic process where each state depends only on the previous state. The direct mathematical precursor to n-gram language models. Autoregressive training conditions on the full preceding sequence, which relaxes but descends from this assumption.
1948
A Mathematical Theory of Communication
Claude Shannon. Defined information in terms of probability. Introduced entropy. Cross-entropy — the loss function — is a direct application. Shannon also modelled English as a stochastic process and demonstrated n-gram approximations of it. PDF.
1970
Reverse-mode automatic differentiation
Seppo Linnainmaa, MSc thesis, Helsinki. The derivative of a composite function can be computed efficiently by traversing a computational graph in reverse. The mathematical basis for backpropagation. Applied to neural networks by Werbos (1974 PhD, Harvard).
1916
Cours de linguistique générale
Ferdinand de Saussure (posth.). Language as a system of differences — the identity of a sign is constituted by its relations to other signs, not by positive content. The theoretical background to the distributional hypothesis. Word embeddings are a computational implementation of relational meaning.
1954
Distributional Structure
Zellig Harris. Linguistic units with similar distributions have similar meanings. The distributional hypothesis. Harris was trying to formalise linguistics as a discovery procedure; the computational community rediscovered his argument fifty years later as the basis for word embeddings.
1957 (a)
Syntactic Structures
Noam Chomsky. Language is governed by innate rule-based competence, not statistical habit. "Colourless green ideas sleep furiously" is grammatical despite never appearing in any corpus. This was a direct attack on statistical approaches. Neural language models are sometimes framed as an empirical rebuttal to this position, though Chomsky and others argue the two approaches address different questions and are not in direct competition.
1957 (b)
"You shall know a word by the company it keeps"
J.R. Firth. Contextual theory of meaning: meaning is constituted by patterns of co-occurrence across real texts. The most-cited sentence in the intellectual history of NLP. Directly operationalised by every word embedding approach, including Word2Vec and the embedding layer here.
1988
Statistical language modelling at IBM
Frederick Jelinek. N-gram language models for speech recognition brought probabilistic methods back into linguistics after Chomsky's influence had marginalised them for two decades. The line attributed to him — "Every time I fire a linguist, the performance of the speech recogniser improves" — is probably apocryphal.
1943
A Logical Calculus of the Ideas Immanent in Nervous Activity
McCulloch & Pitts. First mathematical model of a neuron as a threshold logic unit. Showed networks of such units could compute any logical function. The origin point of artificial neural networks.
1986
Learning Representations by Back-propagating Errors
Rumelhart, Hinton & Williams. Backpropagation for multi-layer networks. Showed hidden layers develop internal representations not explicitly programmed. Ended the first AI winter. Nature.
1997
Long Short-Term Memory
Hochreiter & Schmidhuber. Gating mechanisms solve the vanishing gradient problem. The architecture used here. MIT Press.
2003
A Neural Probabilistic Language Model
Bengio et al. First neural language model with learned word embeddings. Proposed representing words as dense vectors and learning both embeddings and probability function simultaneously. Direct motivation for the embedding layer here. JMLR.
2013
Word2Vec
Mikolov et al., Google. Shallow networks on large corpora produce vectors with geometric properties: king − man + woman ≈ queen. Made embedding training tractable at scale. Widely treated as surprising; follows directly from Harris and Firth. arXiv.
2017
Attention Is All You Need
Vaswani et al. The Transformer architecture. Self-attention replaces recurrence. Enabled the scaling that produced GPT-3, GPT-4, and successors. The LSTM is the architecture this paper displaced. arXiv.
2020–present
Large Language Models at scale
GPT-3 (175B parameters) demonstrated that scale alone produces qualitative shifts. These models implement the same operations — embedding, matrix multiplication, softmax, cross-entropy, gradient descent — at roughly eight orders of magnitude greater scale. The physics is identical. The outcomes are not.