
LLMs Part 4: Put the pieces together. The simplest LM imaginable

This software was partly created by AI Vibe Coding
Created by YouMinds
Let us now demystify the intricacies of Large Language Models (LLMs) and bring theory to life with an interactive simulation.
Watch in real-time as this basic yet fascinating transformer model runs a live simulation, piecing together language constructs with the elegance of simplicity.
The following model is very small, so we had better call it an SLM (small language model). Nevertheless, it has the same architecture as its much larger role models.
Create and train your own language model
On the left side, you have the input prompt transformed into a series of numerical sequences or vectors. Each step through the model enriches these sequences, embedding them with layers of meaning, word positioning, and contextual nuances.
On the right side, the magic unfolds through a series of mathematical operations—primarily matrix multiplications—that dynamically shape and refine the data. These operations are the heartbeat of the model, turning raw inputs into coherent and contextually aware outputs.
Click Create and train Model to build the transformer model and train it on the text corpus specified above. Watch the weights and embeddings update as the model improves with each training cycle. Click Predict next Token to calculate the next token for the prompt, or go to the very end and click Append and predict next token to generate text continuously. Use the temperature slider to search more creatively for the next token. You can also enter your own text corpus or play around with the model and training configuration parameters.
Model and training configuration: Embedding dim., Sequence len., Batch size, Learning rate.
Training progress: Finished (%), Loss, Accuracy (%).
Flow of information and math operations (interactive diagram): Input → look up meaning in the vocabulary table (token index and embeddings) → add positional embeddings → multiply with the Query, Key and Value weights → multiply the Value matrix with the attention scores → optionally split the embeddings and stack the self-attention heads → optionally stack the transformer block and use the output values as input for the next block → project the last embedding to the vocabulary (neural network weights) → select token. Repeat until the EOS token is reached; an EOS (end of sequence) token is not defined in this model.
What is a Large Language Model anyway
A large language model (LLM) is a sophisticated type of artificial intelligence that uses advanced mathematical techniques to process and generate human language. At its core, an LLM is not capable of thinking or understanding like a human; it operates purely through mathematical computations.
Where is the intelligence in the LLM
Large language models (LLMs) like Transformer models store their "intelligence" in the form of learned weights and the Q (Query), K (Key), and V (Value) matrices, which are parameters adjusted or optimized during the training process. These weights and matrices are crucial in determining how the model processes and generates language.
During training, the model "tries out" different values to minimize the error in its predictions. By adjusting these values through many iterations, the model learns and improves. The final set of values represents the "solution" that allows the model to process and generate language effectively. In essence, the optimization process tunes the model’s intelligence, allowing it to understand and respond to language better.
In essence, large language models (LLMs) like Transformers are performing a form of curve fitting. They take sequences of tokens (words or subwords) as input and output probabilities for the next token in the sequence.
Ultimately, the "intelligence" of the model is the result of this finely-tuned hyper surface fitting in a hyper space, enabling it to interpolate word meanings, construct sentences, understand text, generate software, and, in huge LLMs, even perform logical reasoning, allowing it to process and generate coherent language. The curve serves as a lookup for the next token, where the input is a token sequence. Once this hyper surface is established, the model can predict text it has not explicitly learned from the training data. This is achieved through curve interpolation, allowing the model to generate plausible and coherent language even for new, unseen inputs. By capturing the underlying patterns and relationships in the data, the model can generalize and create meaningful predictions beyond its training examples.
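As a loose analogy (nothing more than that), even fitting a one-dimensional curve shows this behaviour: once the parameters have been fitted to a few sample points, the curve also yields values for inputs it never saw. The data points below are invented for illustration.

  // Curve-fitting analogy: least-squares fit of y = a*x + b to a few points,
  // then "predict" a value the fit never saw, by interpolation.
  const points = [[1, 2.1], [2, 3.9], [3, 6.1]];   // invented training data
  const n = points.length;
  const sumX = points.reduce((s, [x]) => s + x, 0);
  const sumY = points.reduce((s, [, y]) => s + y, 0);
  const sumXY = points.reduce((s, [x, y]) => s + x * y, 0);
  const sumXX = points.reduce((s, [x]) => s + x * x, 0);
  const a = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);  // fitted slope
  const b = (sumY - a * sumX) / n;                                  // fitted intercept
  console.log(a * 2.5 + b);   // interpolated prediction for an unseen input x = 2.5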
Why do LLMs hallucinate
Definition: In the context of LLMs, hallucination refers to generating outputs that are plausible-sounding but factually incorrect or not based on the training data.
Causes: Hallucination occurs when there is not enough training data to shape the hypersurface precisely enough. In such cases, the model may generate outputs that sound plausible but are not grounded in the training data: it still interpolates the patterns it has learned, but with insufficient data this interpolation can produce imaginative or incorrect content.
Why is the architecture called transformer
The name "Transformer" comes from the core concept of transforming sequences of data. Introduced by Vaswani et al. in their 2017 paper "Attention Is All You Need," the Transformer architecture fundamentally changed how we process sequential data like text.
In essence, the Transformer architecture is named for its revolutionary approach to transforming data sequences with attention mechanisms, allowing for more efficient and effective processing of information.
The transformer process step by step
  1. Lookup Meaning from Vocabulary Table (Meaning Encoded)
    When a word (or token) is fed into the model, it gets converted into a numerical vector, called an embedding. Embeddings capture semantic meaning, allowing the model to understand relationships between words (e.g., "king" is related to "queen"). These embeddings are stored in a lookup table, which can be thought of as a large matrix where each row corresponds to a token and contains its embedding vector. When a token is encountered in the input, its embedding is retrieved from this lookup table.
  2. Add Sine/Cosine Positional Embeddings (Position and Meaning Encoded)
    To understand the order of words, positional encodings are added to the embeddings. This helps the model differentiate between "The cat sat on the mat" and "The mat sat on the cat." The positional encodings are vectors added to the word embeddings, incorporating information about the position of each token in the sequence. (A small sketch of steps 1 and 2 follows after this list.)
  3. Multiply with Attention Score (Position, Meaning and Context Encoded)
    Attention mechanisms, particularly self-attention, allow the model to weigh the importance of different words in the context of a sentence. For example, in "The cat sat on the mat," the word "cat" might pay more attention to "sat" and "mat" than to the word "the." The attention mechanism calculates attention scores through matrix multiplications, enhancing embeddings with context from other relevant tokens. (A sketch of this step follows after this list.)
  4. Project (Position, Meaning and Context further Encoded)
    The embeddings, enriched with positional and contextual information, pass through multiple ANN layers. These layers apply a series of linear transformations and non-linear activations (like ReLU) to further process and refine the embeddings. Each layer extracts higher-level features, making the embeddings more informative. (A sketch of this feed-forward step follows after this list.)
  5. Project last Embedding to Vocabulary (Vocabulary Prediction)
    The final layer projects the refined embeddings to a vector the size of the vocabulary. This is done through a linear transformation, resulting in logits. The logits are then passed through a softmax function, which converts them into probabilities. This probability distribution indicates the likelihood of each word in the vocabulary being the next token. The temperature parameter T is introduced here to control the randomness of the predictions.
  6. Apply Temperature and Top-k Sampling (Select Token)
    The model uses these probabilities to sample the next token. This can be done using different strategies such as greedy sampling, top-k sampling, or nucleus (top-p) sampling. Greedy sampling selects the token with the highest probability. Top-k sampling considers only the top k tokens and samples from them. Nucleus sampling considers tokens until the cumulative probability exceeds a certain threshold (e.g., 0.9). (A sketch of steps 5 and 6 follows after this list.)
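To make steps 1 and 2 concrete, here is a minimal plain-JavaScript sketch of the embedding lookup and the sine/cosine positional encoding. The vocabulary, the embedding values and the dimensions are invented for illustration and are independent of the TensorFlow.js model used in the simulation above.

  // Step 1: embedding lookup, each token index selects one row of the embedding table.
  const vocab = ["the", "cat", "sat"];                 // toy vocabulary (invented)
  const embeddingTable = [                             // one row per token, embedding dim = 4
    [0.10, -0.30, 0.22, 0.05],
    [0.81, 0.12, -0.44, 0.30],
    [-0.25, 0.60, 0.07, -0.18],
  ];
  const lookup = tokenIds => tokenIds.map(id => embeddingTable[id].slice());

  // Step 2: sine/cosine positional encoding, added element-wise to each embedding.
  function positionalEncoding(seqLen, dim) {
    const pe = [];
    for (let pos = 0; pos < seqLen; pos++) {
      const row = [];
      for (let i = 0; i < dim; i++) {
        const angle = pos / Math.pow(10000, (2 * Math.floor(i / 2)) / dim);
        row.push(i % 2 === 0 ? Math.sin(angle) : Math.cos(angle));
      }
      pe.push(row);
    }
    return pe;
  }

  const tokenIds = [0, 1, 2];                          // "the cat sat"
  const embedded = lookup(tokenIds);
  const pe = positionalEncoding(tokenIds.length, 4);
  const withPositions = embedded.map((vec, pos) => vec.map((v, i) => v + pe[pos][i]));
  console.log(withPositions);                          // position and meaning encoded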
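Next, a sketch of step 3: single-head scaled dot-product self-attention over the position-aware embeddings from the previous sketch. In the real model the Query, Key and Value weight matrices are learned during training; here identity matrices stand in for them.

  // Step 3: single-head self-attention (scaled dot-product) in plain JavaScript.
  const matMul = (A, B) => A.map(row =>
    B[0].map((_, j) => row.reduce((s, a, k) => s + a * B[k][j], 0)));
  const transpose = M => M[0].map((_, j) => M.map(row => row[j]));
  const softmaxRow = row => {
    const m = Math.max(...row);
    const exps = row.map(v => Math.exp(v - m));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map(e => e / sum);
  };

  function selfAttention(X, Wq, Wk, Wv) {      // X: [seqLen][dim]
    const Q = matMul(X, Wq), K = matMul(X, Wk), V = matMul(X, Wv);
    const dk = K[0].length;
    // Attention scores: Q times K transposed, scaled by sqrt(dk),
    // softmax row by row, then used to weight the Value matrix.
    const scores = matMul(Q, transpose(K)).map(row => row.map(v => v / Math.sqrt(dk)));
    const weights = scores.map(softmaxRow);
    return matMul(weights, V);                 // context-enriched embeddings
  }

  // Identity matrices as placeholder weights (untrained, for illustration only).
  const I4 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]];
  const contextualized = selfAttention(withPositions, I4, I4, I4);
  console.log(contextualized);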
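A sketch of step 4, the position-wise feed-forward layer that further refines each embedding, reusing the matMul helper from the previous sketch. The hidden layer size and the random weights are invented; in the real model they are learned during training.

  // Step 4: linear transformation, ReLU, linear transformation for each position.
  const relu = v => v.map(x => Math.max(0, x));
  function feedForward(X, W1, b1, W2, b2) {    // X: [seqLen][dim]
    return X.map(row => {
      const hidden = relu(matMul([row], W1)[0].map((v, i) => v + b1[i]));
      return matMul([hidden], W2)[0].map((v, i) => v + b2[i]);
    });
  }

  // Untrained random weights, dim = 4, hidden units = 8 (values for illustration only).
  const rand = (rows, cols) =>
    Array.from({ length: rows }, () => Array.from({ length: cols }, () => Math.random() - 0.5));
  const refined = feedForward(contextualized, rand(4, 8), new Array(8).fill(0), rand(8, 4), new Array(4).fill(0));
  console.log(refined);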
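Finally, a sketch of steps 5 and 6: turning logits into a probability distribution with a temperature parameter and then selecting a token, either greedily or with top-k sampling. The logits below are invented; in the model they come from projecting the last refined embedding onto the vocabulary.

  // Step 5: temperature-scaled softmax over the vocabulary logits.
  function softmaxWithTemperature(logits, T = 1.0) {
    const scaled = logits.map(v => v / T);     // T < 1: sharper, T > 1: more random
    const m = Math.max(...scaled);
    const exps = scaled.map(v => Math.exp(v - m));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map(e => e / sum);
  }

  // Step 6a: greedy sampling, simply pick the most likely token.
  const greedySample = probs => probs.indexOf(Math.max(...probs));

  // Step 6b: top-k sampling, sample only among the k most likely tokens.
  function topKSample(probs, k) {
    const top = probs.map((p, i) => [p, i]).sort((a, b) => b[0] - a[0]).slice(0, k);
    const total = top.reduce((s, [p]) => s + p, 0);
    let r = Math.random() * total;
    for (const [p, i] of top) { r -= p; if (r <= 0) return i; }
    return top[top.length - 1][1];
  }

  const logits = [2.1, 0.3, -1.0, 1.4];        // one logit per vocabulary token (invented)
  const probs = softmaxWithTemperature(logits, 0.8);
  console.log(greedySample(probs), topKSample(probs, 2));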
What is the difference from a large language model
This simulation shows a small language model, but its architecture corresponds exactly to that of its big brothers.
If we increase the embedding dimension from 4 to 768 and the sequence length from 3 to 1024 (giving an attention score table with 1,048,576 cells), grow the vocabulary to 50,257 tokens, add another network layer with 3,072 neurons, split the self-attention into 12 heads, and stack the transformer block a further 11 times in a row, then we are already at GPT-2. This could be achieved with a few lines of additional code and some configuration changes.
It's mostly a question of configuration rather than architecture, if we ignore small improvements such as residual connections and normalization layers. However, a model of this size could no longer be trained efficiently in the browser.
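As a rough sketch of what such a configuration change looks like, compare two illustrative configuration objects. The property names are invented for this sketch; the numbers are the ones mentioned above, and the toy values assume the defaults of the simulation (a single head and a single transformer block).

  // Configuration of the toy model in this simulation (assumed defaults).
  const toySLM = {
    embeddingDim: 4,
    sequenceLength: 3,
    numHeads: 1,
    numTransformerBlocks: 1,
  };

  // Roughly the configuration of GPT-2 (small).
  const gpt2 = {
    embeddingDim: 768,
    sequenceLength: 1024,     // attention score table: 1024 x 1024 = 1,048,576 cells
    vocabularySize: 50257,
    mlpHiddenUnits: 3072,
    numHeads: 12,
    numTransformerBlocks: 12, // the block above, repeated 11 more times
  };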
For even larger models, the aim is to reduce the number of active parameters using techniques such as mixture of experts (MoE) or multi-head latent attention (MLA), methods that may seem complex but essentially boil down to matrix operations. The selection and preparation of the training data also play an important role.
But wait, there's this Think Deeper button
The "Think Deeper" feature might give the impression that the LLM (Large Language Model) is thinking more deeply, but in reality, it leverages multiple passes through the same model to improve the quality of the response. It doesn't change the underlying architecture or fundamentally alter how the model processes information.
In other words, the model isn't "thinking" in the human sense but is using its existing capabilities iteratively to refine and enhance its output. This process leads to more accurate and contextually relevant responses without actually changing the way the model operates.
Here's a brief explanation of how it works: the model produces a first answer, the same model is then prompted again with that draft so it can review and refine it, and this loop can be repeated several times before the final answer is returned.
So, while the architecture remains unchanged, the iterative process improves the model's reasoning and response quality. It's like taking multiple looks at a problem to come up with a well-rounded solution.
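A hypothetical sketch of such an iterative loop follows. The generate() function stands in for one complete pass of prompting the unchanged model; it is a made-up placeholder (stubbed here so the sketch runs), not Copilot's or any vendor's actual implementation.

  // Stand-in for one pass through the model (hypothetical stub for illustration).
  const generate = async prompt => "(model output for: " + prompt.slice(0, 40) + "...)";

  // Hypothetical "Think Deeper" loop: feed the model's own draft back to it several times.
  async function thinkDeeper(question, passes = 3) {
    let answer = await generate(question);                     // first draft
    for (let i = 1; i < passes; i++) {
      answer = await generate(
        "Question: " + question + "\nDraft answer: " + answer +
        "\nReview the draft, fix mistakes and give an improved answer.");
    }
    return answer;                                             // refined final answer
  }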
How was it built
This software was created using Vibe Coding with a large language model (LLM) chatbot and then reworked for look & feel.
It turned out that neither Copilot nor Gemini nor Chat-GPT (as of February 2025) could create the LLM itself, only framework code. Many parts, such as backpropagation through the attention layer or the convolution used to generate the MLP mapping over the sequence, had to be programmed by hand.
Nevertheless, the bots were a great help.
It should be noted that the final implementation is based on TensorFlow.js, for which there is significantly less example code that the bots could have learned from. They had to find solutions creatively using analogies.
The following prompts were used on Copilot to create a Python example, llm5.py:
"create a python tensorflow model for an llm. Use 2 dimenstional embeddings. Train the model on a sample text corpus. implement a predict function that takes a text as input and outputs a probability distribution for the next token."
"run a softmax on the predictions and print them line by line"
"where is the q,k,v attention layer in the createmodel"
"Iterating over a symbolic KerasTensor is not supported."
"should the predict sequence not hold the whole senence"
"use a CustomMultiHeadAttention class instead"
"shapes used to initialize variables must be fully-defined (no `None` dimensions). Received: shape=(None, 128) for variable path='dense_4/kernel'"
"you need to change dense_layer = Dense(128, activation='relu')(flatten_layer) instead"
"dense_layer = Dense(128, activation='relu')(flatten_layer) throws shapes used to initialize variables must be fully-defined"
"graph execution error Only one input size may be -1, not both 0 and 1 in model.fit"
"only one input size may be -1, not both 0 and 1 [[{{node functional_1/flatten_1/Reshape}}]] [Op:__inference_multi_step_on_iterator_2309] File "/Users/ichapple/Documents/Python/llm5.py", line 90, in model.fit(X, y, epochs=100, verbose=1) tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error: Detected at node functional_1/flatten_1/Reshape"
"CustomMultiHeadAttention returns shape with null"
"CustomMultiHeadAttention has _shape = (None, None, 8)"

The following prompts were used on Gemini to translate the Python code to JavaScript and to solve several problems:
"create a single webpage translate llm5.py to javascript and insert it to the webpage"
"llm5.py has its own training methods there is no need to specify weights and bias. translate the python one to one to javascript so it will work exactly the same"
"it throws: Class being registered does not have the static className property defined."
"generic_utils.js:243 Uncaught n: Unknown initializer: glorot_uniform. This may be due to one of the following reasons: The initializer is defined in Python, in which case it needs to be ported to TensorFlow.js or your JavaScript code. The custom initializer is defined in JavaScript, but is not registered properly with tf.serialization.registerClass()."
"topology.js:143 Uncaught TypeError: Cannot read properties of null (reading 'length') in model.add(new CustomMultiHeadAttention({key_dim: embedding_dim, num_heads: num_heads, name: 'mha', kernel_initializer: 'glorotNormal', bias_initializer: 'zeros'})); // tf.customLayer"
"topology.js:773 Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'rank')"
"util_base.js:153 Uncaught (in promise) Error: Error in matMul: inner shapes (8) and (2) of Tensors with shapes 8,3,8,8 and 8,3,2,8 and transposeA=false and transposeB=false"
"tensor_util_env.js:92 Uncaught (in promise) Error: Argument 'x' passed to 'floor' must be float32 tensor, but got int32 tensor"
"display a html table with the trained vocabulary use one row for each token, display the index, name and the embeddings of each token."
"for each head display html tables for the q,v,k matrices. Also display a html table with the attention scores"
"tensor.js:461 Uncaught (in promise) Error: Tensor is disposed. at e.value (tensor.js:461:13) at r5.slice (slice.js:32:8) at displayAttentionVisualizations (a4.html:408:80) at async predict_next_token (a4.html:309:14)"
"add position embeddings to the code and display them as a html table"
"you did not add positionEmbeddingLayer to the model so it is not part of the training process"
"but you apply the position_tensor only in the predict_next_token method but it is not used in the train method"
"instead of manually adding the position embeddings in the prodict and train method it would be better to make it part of the model. Can you do that"
"const positionTensor = tf.tensor2d([positions], [batchSize, seqLength], 'int32'); throws Uncaught (in promise) Error: Based on the provided shape, [8,3], the tensor should have 24 values but has 3"
"why are you doing this: const predicted_prob = model.predict(embeddingsAfterTensor).dataSync(); instead of const predicted_prob = model.predict(input_tensor).dataSync();"
"But the positional embeddings are already added in the model with the custom AddPositionEmbedding class"
"it is still redundant because also the norm layer is part of the model so model.predict(input_tensor).dataSync() should be enough"
"for the positions could you implement a sinus / cosines curve"
"positionEmbeddingsTensor = addPosLayer.positionEmbeddingLayer.getWeights()[0]; does not work"
"explaint the CustomMultiHeadAttention class with respect to self attention"
"If the sequence_len = 3 then for 2 heads there should be 2 attention_scores matrices with 3 by 3 values, giving a total of 18 scores in attention_scores right ?"
"so for a batch_size of 1 the shape would be [1,3,2,3] ?"
"but when I run the code it gives me [1,3,2,2] attention_scores"
"query_reshaped and key_reshaped both have shape 1,3,2,2 is this correct ?"
"but this.attention_scores = tf.matMul(this.query_reshaped, this.key_reshaped, true); computes the score matrix which we agree should have shape 1,3,2,3 but it turns out to be 1,3,2,2. Is there anything wrong with the matMul ?"
"The problem could be solved by changing the cols (batch_size, seq_len, num_heads, key_dim) to (batch_size, num_heads, seq_len, key_dim) then when transposing it would give [1, 2, 3, 2] matmul [1, 2, 2, 3] which would yield [1, 3, 2, 3] is that right ?"
"implement a Top-k Sampling sampling for a given array of softmax values jusing javascript only"

At this point the code was not yet executable, and the lengthy process of troubleshooting and implementing missing features began, with the support of Gemini, Copilot and Chat-GPT.