Local Model Trainer — a neural network that learns to predict text, one character at a time.
A character-level language model trained and run entirely in the browser. No server. No cloud API. Weights are initialised as random noise, adjusted through gradient descent on the text you provide, then used to generate new text one character at a time.
The electricity it uses comes from the device reading this. The heat it generates is real.
This represents one person's current understanding of these topics. Corrections and additions welcome at kristoffer@oerum.org.
A neural network is a function composed of many smaller functions arranged in layers. Each layer transforms its input by multiplying it by a matrix of learned weights, then passing the result through a non-linear function. The "learning" is the process of adjusting those weight matrices so that the overall function produces useful outputs.
The term comes from a loose analogy to biological neurons, but the analogy is mostly historical. What is actually happening is matrix multiplication and gradient descent.
This is not intelligence in the sense of human cognition — there is no understanding, no embodied experience, no intention. It is something more specific and in some ways more legible: a function that has been shaped by exposure to a particular body of text until it produces outputs that are statistically consistent with that text. What it knows is entirely a function of what it was trained on. That constraint is also a possibility. Heikkilä (Viznut) writes that artificial intellects "should not be thought about as competing against humans in human-like terms. Their greatest value is that they are different from human minds and thus able to expand the intellectual diversity of the world." Whether that framing holds at the scale of frontier models is debatable; at the scale of a locally trained character-level model it is at least plausible.
Matrix multiplication is the core operation of every neural network. Given input vector x and weight matrix W, a layer computes Wx + b where b is a bias vector. For a layer with 64 units processing an input of dimension 32, this is a (64 × 32) matrix multiplied by a (32 × 1) vector — roughly 2,048 multiply-accumulate operations per forward pass through that layer alone.
Modern GPUs accelerate this by running thousands of such operations in parallel across shader cores. TensorFlow.js maps these operations onto WebGL shader programs, which is why a GPU matters for training speed here.
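The computation itself is small enough to write out. A minimal sketch in plain JavaScript (the model delegates this arithmetic to TensorFlow.js and the GPU; the dimensions here are shrunk from the article's 64 × 32 for readability):

```javascript
// A single dense-layer forward pass: y = Wx + b.
// W is a matrix (array of rows), x an input vector, b a bias vector.
function layerForward(W, x, b) {
  return W.map((row, i) =>
    row.reduce((sum, w, j) => sum + w * x[j], b[i]));
}

// Tiny worked example: a 2×3 weight matrix applied to a 3-vector.
const W = [[1, 0, 2],
           [0, 1, 1]];
const x = [3, 4, 5];
const b = [0.5, -1];
console.log(layerForward(W, x, b)); // [13.5, 8]
```

Every multiply inside the `reduce` is one of the multiply-accumulate operations counted above; a 64 × 32 layer runs 2,048 of them per forward pass.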
Gradient descent is an iterative optimisation method. Given a loss function L(θ) measuring how wrong the model is, and the gradient ∇L(θ) pointing in the direction of steepest increase, gradient descent updates the parameters in the opposite direction:

θ ← θ − η ∇L(θ)

where η is the learning rate. This is repeated for many batches of training data until the loss stops decreasing. The conceptual origin is Legendre and Gauss's method of least squares (1805–1809); the continuous generalisation to non-linear models came much later.
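A minimal illustration in plain JavaScript, using a one-parameter quadratic loss rather than the model's actual cross-entropy loss:

```javascript
// Gradient descent on L(θ) = (θ − 3)², whose gradient is ∇L(θ) = 2(θ − 3).
// The minimum is at θ = 3.
function gradientDescent(theta, learningRate, steps) {
  for (let i = 0; i < steps; i++) {
    const grad = 2 * (theta - 3);        // ∇L(θ)
    theta = theta - learningRate * grad; // θ ← θ − η ∇L(θ)
  }
  return theta;
}

console.log(gradientDescent(0, 0.1, 100)); // converges toward 3
```

The same loop runs during training here, except that θ is tens of thousands of weights and the gradient comes from backpropagation rather than a hand-derived formula.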
This model uses the Adam optimiser, a variant of gradient descent that adapts the learning rate per parameter.
Adam (Adaptive Moment Estimation) maintains running estimates of the first moment (mean) and second moment (uncentred variance) of the gradients, and uses these to scale the learning rate individually for each parameter. Parameters with consistently large gradients get smaller updates; parameters with small or noisy gradients get relatively larger ones.
The default values are β₁ = 0.9 and β₂ = 0.999. The paper introducing Adam (Kingma and Ba, 2015) has accumulated over 200,000 citations as of 2024.
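The update can be sketched for a single parameter (a plain-JavaScript illustration, not the optimiser implementation TensorFlow.js actually runs):

```javascript
// One Adam update, following the published algorithm with defaults
// β₁ = 0.9, β₂ = 0.999, ε = 1e-8.
function adamStep(state, grad, lr = 0.001, b1 = 0.9, b2 = 0.999, eps = 1e-8) {
  state.t += 1;
  state.m = b1 * state.m + (1 - b1) * grad;           // first moment (mean)
  state.v = b2 * state.v + (1 - b2) * grad * grad;    // second moment (uncentred variance)
  const mHat = state.m / (1 - Math.pow(b1, state.t)); // bias correction
  const vHat = state.v / (1 - Math.pow(b2, state.t));
  state.theta -= lr * mHat / (Math.sqrt(vHat) + eps);
  return state;
}

// First step with gradient 1.0: after bias correction the update is ≈ lr,
// regardless of the raw gradient's magnitude.
const s = adamStep({ theta: 0, m: 0, v: 0, t: 0 }, 1.0);
console.log(s.theta); // ≈ -0.001
```

This is the per-parameter scaling described above: the step size is normalised by the running estimate of gradient magnitude, so consistently large gradients do not produce proportionally large updates.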
TensorFlow.js detects your available hardware and selects a backend: WebGL (GPU), WASM (CPU with SIMD extensions), or plain JavaScript. Training on GPU is roughly 10–50× faster than CPU for the matrix operations involved.
The energy consumed by this browser tab during a training run is small but real. At the scale of GPT-3 (175 billion parameters), a single training run consumed an estimated 1,287 MWh of electricity — roughly the annual consumption of 120 average US households. The physics of what happens here is identical. The scale differs by approximately eight orders of magnitude.
That gap is also the space in which something becomes possible. Small models trained on specific corpora on consumer hardware can do things that large generalised models cannot: they can be understood, audited, and owned. They can be trained on material that will never appear in a commercial training set. Running locally, they require no subscription, no API key, no ongoing relationship with an infrastructure provider. This model runs on any device with a modern browser — training is faster with a dedicated GPU but the computations are achievable on modest hardware, just slower. Many electricity providers offer tariffs tied to renewable sources, and some allow time-of-use pricing where off-peak hours correspond to periods of higher renewable generation in the local grid.
This approach — running models locally on available hardware, scheduling intensive computation for off-peak or surplus-energy periods, treating compute as a finite resource — is described in the permacomputing literature as frugal computing. The term permacomputing applies permaculture principles to computation: maximise hardware lifespan, minimise energy use, work with already available resources. Ville-Matias Heikkilä's foundational essay (2020) proposed that AI batch training would ideally occur only when surplus energy is being produced or when electricity-to-heat conversion is needed — that is, when the computation is in phase with the energy system around it rather than drawing from it regardless.
Training requires several browser capabilities. The checks below run automatically and report what your environment supports.
WebGL 2.0 is required for GPU-accelerated matrix operations. Without it, training falls back to CPU and will be significantly slower. WebAssembly is required for the WASM fallback backend. SharedArrayBuffer and cross-origin isolation enable WASM threading but require specific HTTP headers (COOP/COEP) that are not set when opening files directly from disk. WebGPU is a newer, higher-performance alternative to WebGL, available in Chrome 113+.
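A sketch of how such checks can be written (illustrative, not the page's exact detection code; each check degrades gracefully when the API is absent):

```javascript
// Probe the environment for the capabilities listed above.
function detectCapabilities(env = globalThis) {
  const hasDocument = typeof env.document !== 'undefined';
  return {
    webgl2: hasDocument &&
      !!env.document.createElement('canvas').getContext('webgl2'),
    wasm: typeof env.WebAssembly === 'object',
    sharedArrayBuffer: typeof env.SharedArrayBuffer === 'function',
    crossOriginIsolated: env.crossOriginIsolated === true,
    webgpu: typeof env.navigator !== 'undefined' && 'gpu' in env.navigator,
  };
}

console.log(detectCapabilities());
```

Note that `SharedArrayBuffer` may exist as a constructor while still being unusable for cross-thread sharing when `crossOriginIsolated` is false, which is why both are checked separately.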
The model learns from text you provide here. It treats that text as a corpus and converts it into numbers through a process called tokenisation.
In linguistics and computational linguistics, a corpus (plural: corpora) is any collection of text used as data for analysis or training. The statistical properties of what a model learns — the character frequencies, transition probabilities, and sequential patterns — are entirely determined by the corpus. A model trained on legal documents will generate text that resembles legal documents. A model trained on this short demo text will generate text that resembles this short demo text.
The distributional hypothesis — that the meaning of a linguistic unit is constituted by the contexts in which it appears — was formalised by Harris (1954) and Firth (1957). All word embedding and language model approaches are computational implementations of this idea.
Most language models in public use are trained on corpora assembled at enormous scale according to priorities set by a small number of institutions. The resulting models reflect those priorities — in what they know, what they elide, and how they construct plausibility. A model trained on a specific, chosen corpus is a different kind of instrument: narrower in scope, more transparent in its assumptions, and oriented toward questions that you have defined rather than questions anticipated by someone else.
Zellig Harris (1954) proposed "Distributional Structure" in the journal Word: linguistic elements with similar distributions have similar meanings. He was attempting to formalise structural linguistics as a discovery procedure — a method for deriving grammatical structure from distributional evidence alone, without appeal to meaning or intuition.
J.R. Firth (1957) gave the most-cited formulation: "You shall know a word by the company it keeps." Firth's contextual theory of meaning argued that meaning is not a property of isolated words but of their patterns of co-occurrence across real texts. This sentence is arguably the most consequential in the intellectual history of NLP.
Ferdinand de Saussure (1916, posth.) had earlier argued that linguistic signs have no positive content — their identity is constituted by their difference from other signs in the system. This structuralist position is the deeper theoretical background to the distributional hypothesis.
Neural networks cannot process text directly. Text must be converted into numbers. Tokenisation is the process of dividing text into units (tokens) and mapping each unit to an integer index.
This model uses character-level tokenisation: every unique character in the corpus is assigned an integer. This is the simplest possible strategy, with a vocabulary size equal to the number of distinct characters (typically 50–100 for English text). The alternative approaches used in large language models are:
| Strategy | Unit | Vocab | Used in |
|---|---|---|---|
| Character-level | Single characters | 50–200 | This model |
| Word-level | Whole words | 10k–100k | Word2Vec |
| Subword (BPE, WordPiece) | Sub-word units | 30k–100k | GPT, LLaMA, BERT |
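The character-level strategy is simple enough to sketch in full (a plain-JavaScript illustration; the corpus string here is arbitrary):

```javascript
// Character-level tokenisation: build a vocabulary from the unique
// characters of a corpus, then map text to integer indices and back.
function buildTokenizer(corpus) {
  const vocab = [...new Set(corpus)].sort();
  const charToIndex = new Map(vocab.map((ch, i) => [ch, i]));
  return {
    vocabSize: vocab.length,
    encode: (text) => [...text].map((ch) => charToIndex.get(ch)),
    decode: (indices) => indices.map((i) => vocab[i]).join(''),
  };
}

const tok = buildTokenizer('hello world');
console.log(tok.vocabSize);                   // 8 distinct characters
console.log(tok.decode(tok.encode('hello'))); // round-trips to "hello"
```

The vocabulary is derived entirely from the corpus, which is why a character absent from the training text can never be generated.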
The character-to-integer mapping ultimately rests on the ASCII and Unicode standards.
The ASCII standard (1963, formalised 1967) defined the original 128-character mapping of text to 7-bit integers: uppercase and lowercase Latin letters, digits, punctuation, and control characters. RFC 20 (1969) standardised this for network transmission.
Unicode (first published 1991, now at version 15.1) extended this to over 140,000 characters covering most of the world's writing systems. Every character you type is already an integer at the hardware level — tokenisation at the character level is simply making that implicit mapping explicit and using it as input to the embedding layer.
The model has three layers: an embedding layer, an LSTM, and a dense output layer.
An embedding layer is a learnable lookup table. Each integer token index maps to a dense vector of fixed dimensionality — here, 32 dimensions. These vectors are initialised randomly and adjusted during training. Over time, the distances between them come to reflect statistical co-occurrence patterns in the corpus.
The theoretical basis is the distributional hypothesis: tokens that appear in similar contexts should have similar representations. This is what Harris and Firth argued at the level of linguistic theory. The embedding layer implements it computationally.
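As a data structure, the layer is nothing more than an indexable table (a plain-JavaScript sketch, with random weights standing in for trained ones):

```javascript
// An embedding layer as a lookup table: one learnable vector per token.
// Dimensions are small here for readability; the article's model uses 32.
function makeEmbedding(vocabSize, dim) {
  // Small random initialisation, as at the start of training.
  const table = Array.from({ length: vocabSize }, () =>
    Array.from({ length: dim }, () => Math.random() * 0.1 - 0.05));
  return (tokenIndex) => table[tokenIndex]; // the forward pass is a lookup
}

const embed = makeEmbedding(50, 4);
console.log(embed(7)); // a 4-dimensional vector for token index 7
```

During training, gradients flow back into the rows of `table`, which is what gradually moves co-occurring tokens toward similar vectors.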
GPT-3 uses embedding dimensions of 12,288. The increase in dimensionality allows the model to represent finer-grained distinctions between tokens — at a proportionally higher energy cost per forward pass.
An LSTM (Long Short-Term Memory) is a type of recurrent neural network. Unlike feedforward networks that process each input independently, an LSTM processes sequences by maintaining a hidden state — a vector updated at each time step that carries information forward through the sequence.
Standard RNNs suffer from the vanishing gradient problem: when gradients are backpropagated through many time steps, they are multiplied together at each step. If those multiplications involve values less than 1, the gradient shrinks toward zero, and early parts of the sequence stop contributing to learning. Hochreiter identified this mathematically in his 1991 diploma thesis.
LSTMs address this through gating mechanisms. Each time step computes three sigmoid gates: a forget gate f_t that decides how much of the previous cell state to keep, an input gate i_t that decides how much of a candidate update C̃_t to write, and an output gate o_t that decides how much of the cell state to expose as the hidden state. The cell state update is C_t = f_t ⊙ C_{t−1} + i_t ⊙ C̃_t, where ⊙ denotes element-wise multiplication.
The cell state C_t can persist information across many steps without gradient degradation. The LSTM was the dominant architecture for sequence modelling until the Transformer displaced it in 2017–2019.
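A single-unit LSTM step written out in plain JavaScript, with illustrative weights rather than trained ones, makes the gating visible:

```javascript
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

// One LSTM time step for one unit. w holds [input weight, recurrent
// weight, bias] for each of the gates and the candidate.
function lstmStep(x, hPrev, cPrev, w) {
  const f = sigmoid(w.f[0] * x + w.f[1] * hPrev + w.f[2]);        // forget gate
  const i = sigmoid(w.i[0] * x + w.i[1] * hPrev + w.i[2]);        // input gate
  const o = sigmoid(w.o[0] * x + w.o[1] * hPrev + w.o[2]);        // output gate
  const cTilde = Math.tanh(w.c[0] * x + w.c[1] * hPrev + w.c[2]); // candidate
  const c = f * cPrev + i * cTilde; // cell state: gated carry + gated write
  const h = o * Math.tanh(c);       // hidden state exposed to the next layer
  return { h, c };
}

// With the forget gate saturated open (large bias) and the input gate
// shut, the cell state passes through almost unchanged: the mechanism
// that lets information persist across many steps.
const w = { f: [0, 0, 10], i: [0, 0, -10], o: [0, 0, 0], c: [1, 0, 0] };
const { c } = lstmStep(0.5, 0, 1.0, w);
console.log(c); // ≈ 1.0 (previous cell state preserved)
```

Because the carry path is additive rather than repeatedly multiplied by weights, gradients along it do not vanish the way they do in a standard RNN.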
The Transformer (Vaswani et al., 2017) replaced sequential recurrence with self-attention: a mechanism that computes pairwise relationships between all positions in a sequence simultaneously.
| Property | LSTM | Transformer |
|---|---|---|
| Processing | Sequential | Parallel (all positions) |
| Long-range dependencies | Limited by hidden state | Full context window |
| Training parallelism | Low | High |
| Compute vs sequence length | Linear | Quadratic (standard) |
The quadratic cost of attention at long context lengths is the central engineering problem in current LLM development, motivating architectures like Mamba, RWKV, and various linear attention approximations.
The dense output layer maps the LSTM's hidden state (a vector of 64 values) to a vector of length equal to the vocabulary size — one value per possible next character. These raw values (logits) are then passed through a softmax function, which converts them into a probability distribution:

softmax(z)_i = e^(z_i) / Σ_j e^(z_j)
The result is a vector of non-negative values that sum to 1. Each value represents the model's estimated probability that the next character is the corresponding vocabulary item. During training, the model is penalised by cross-entropy loss for assigning low probability to the correct next character.
Cross-entropy measures the difference between the model's predicted probability distribution p and the true distribution y (a one-hot vector pointing at the correct next character):

L(y, p) = −Σ_i y_i log(p_i)

With a one-hot target, this reduces to the negative log probability assigned to the correct character.
A perfect prediction assigns probability 1.0 to the correct character: −log(1.0) = 0. A uniform random guess over a 50-character vocabulary gives −log(1/50) ≈ 3.91. Watch the training log to see where your loss starts and how far it descends.
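Both operations in plain JavaScript (a sketch; the 50-character uniform case reproduces the 3.91 figure):

```javascript
// Softmax over logits, and the cross-entropy loss for a known target.
function softmax(logits) {
  const maxLogit = Math.max(...logits); // subtract max for numerical stability
  const exps = logits.map((z) => Math.exp(z - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function crossEntropy(probs, targetIndex) {
  return -Math.log(probs[targetIndex]);
}

// A uniform guess over a 50-character vocabulary: every logit equal.
const uniform = softmax(new Array(50).fill(0));
console.log(crossEntropy(uniform, 0).toFixed(2)); // "3.91", i.e. −log(1/50)
```

Subtracting the maximum logit before exponentiating does not change the result but prevents overflow for large logits, a standard trick in softmax implementations.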
Cross-entropy minimisation is equivalent to maximum likelihood estimation. Its theoretical grounding is in Shannon's information theory (1948): the cross-entropy of a predicted distribution relative to the true distribution equals the entropy of the true distribution plus the KL divergence between them.
At initialisation all weights are random noise. Training is the process of adjusting those weights by repeatedly measuring prediction error and applying backpropagation.
The loss function quantifies how wrong the model is. Here we use sparse categorical cross-entropy: given the model's probability distribution over the vocabulary and the true next character, the loss is the negative log probability assigned to the correct character. A lower loss means the model assigns higher probability to correct predictions.
Each training epoch is one complete pass through all the training sequences. With 25 epochs and a small corpus, the model will overfit — it will eventually memorise the specific sequences in the training data rather than learning general patterns. With a short corpus this is expected and unproblematic.
Backpropagation is the application of the chain rule of calculus to compute the gradient of the loss with respect to every weight in the network. Working backwards from the output layer to the input, each weight receives a gradient value indicating the direction and magnitude of adjustment needed to reduce the loss.
The algorithm was described by Rumelhart, Hinton, and Williams in 1986, though precursors existed: Linnainmaa's reverse-mode automatic differentiation (1970 thesis) and Werbos's application to neural networks (1974 PhD thesis at Harvard). The 1986 Nature paper is what established its practical significance for multi-layer networks.
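The chain rule can be checked numerically. A plain-JavaScript sketch with a one-weight "network" (one tanh unit and a squared-error loss, standing in for the real model):

```javascript
const x = 2.0;       // fixed input
const target = 0.5;  // fixed target

// L(w) = loss(activation(w·x))
const f = (w) => {
  const a = Math.tanh(w * x);     // activation
  return 0.5 * (a - target) ** 2; // squared-error loss
};

// Backpropagation: multiply the local derivatives, output to input.
function analyticGrad(w) {
  const a = Math.tanh(w * x);
  const dLda = a - target;      // derivative of the loss w.r.t. the activation
  const dadw = (1 - a * a) * x; // tanh derivative times the inner derivative
  return dLda * dadw;           // chain rule
}

const w = 0.3;
const eps = 1e-6;
const numeric = (f(w + eps) - f(w - eps)) / (2 * eps); // finite difference
console.log(Math.abs(analyticGrad(w) - numeric) < 1e-6); // true
```

In a real network the same multiplication of local derivatives is carried backwards layer by layer, which is exactly where the vanishing-gradient problem described earlier comes from.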
After training, the model's weights are frozen. Given a seed sequence it produces a probability distribution over the vocabulary. We sample from that distribution using a temperature parameter to control randomness.
At each generation step, the model performs a forward pass through the full network and produces a softmax distribution over all characters in the vocabulary. This is a vector of non-negative values summing to 1 — the model's estimate of how likely each character is to come next, given the preceding sequence.
The model never produces a deterministic answer. It always produces a distribution. Whether you select the most probable character (greedy decoding) or sample from the distribution is a choice made at inference time, not a property of the model itself. This is why the same trained model produces different text on different runs.
The output is not retrieved, recalled, or reasoned. It is generated, character by character, from a probability distribution shaped by training data. There is no moment of comprehension. The text that emerges can be coherent, surprising, or useful — but those qualities arise from the statistical structure of the corpus, not from anything the model understands about them.
Temperature scaling divides the raw output logits by a temperature value T before the softmax is applied:

p_i = e^(z_i / T) / Σ_j e^(z_j / T)
At T → 0: the distribution collapses to argmax (deterministic greedy decoding). At T = 1.0: raw model probabilities are used unchanged. At T > 1.0: the distribution flattens — low-probability characters become relatively more likely, producing more varied but less coherent output.
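A sketch of the scaled softmax in plain JavaScript, showing both extremes:

```javascript
// Temperature-scaled softmax: divide the logits by T, then normalise.
function softmaxWithTemperature(logits, T) {
  const scaled = logits.map((z) => z / T);
  const maxLogit = Math.max(...scaled); // numerical stability
  const exps = scaled.map((z) => Math.exp(z - maxLogit));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

const logits = [2.0, 1.0, 0.1];
console.log(softmaxWithTemperature(logits, 0.1)); // near one-hot on index 0
console.log(softmaxWithTemperature(logits, 5.0)); // close to uniform
```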
The term is borrowed from statistical mechanics: in the Boltzmann distribution, temperature governs the probability of a system occupying a given energy state. High temperature = more uniform distribution over states. The analogy is exact at the level of the mathematics.
Beyond temperature, the two main sampling strategies in production LLMs are top-k sampling (sample only from the k most probable tokens) and nucleus (top-p) sampling (sample from the smallest set of tokens whose cumulative probability exceeds p). Holtzman et al. (2020) demonstrated that nucleus sampling produces more coherent text than both greedy decoding and pure temperature sampling.
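The nucleus filter itself is short. A plain-JavaScript sketch (the probability vector here is illustrative):

```javascript
// Nucleus (top-p) filtering: keep the smallest set of tokens whose
// cumulative probability exceeds p, then renormalise within that set.
function nucleusFilter(probs, p) {
  const indexed = probs.map((prob, i) => ({ prob, i }))
                       .sort((a, b) => b.prob - a.prob);
  const kept = [];
  let cumulative = 0;
  for (const entry of indexed) {
    kept.push(entry);
    cumulative += entry.prob;
    if (cumulative >= p) break; // smallest set reaching p
  }
  const total = kept.reduce((s, e) => s + e.prob, 0);
  return kept.map((e) => ({ index: e.i, prob: e.prob / total }));
}

// A skewed distribution: the long tail is cut off at p = 0.9.
const probs = [0.5, 0.3, 0.1, 0.05, 0.03, 0.02];
console.log(nucleusFilter(probs, 0.9)); // keeps indices 0, 1, 2
```

Generation then samples from the renormalised set, so the low-probability tail that tends to derail purely temperature-sampled text is never drawn from at all.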
Neural language modelling emerges from three converging traditions: mathematics and statistics, structural and distributional linguistics, and machine learning research. They developed largely independently and only converged decisively in the 2010s.