What is a Large Language Model anyway
A large language model (LLM) is a sophisticated type of artificial intelligence that uses advanced
mathematical techniques to process and generate human language. At its core, an LLM is not capable of
thinking or understanding like a human; it operates purely through mathematical computations.
- Transforming Text into Numbers (Embeddings):
The first step involves encoding words or tokens into numerical vectors, known as embeddings. These
embeddings capture various aspects of the words, such as their meanings, relationships, and context
within the text.
Think of embeddings as a way to represent words in a high-dimensional space where similar words are
closer together, enabling the model to grasp contextual relationships.
- Mathematical Operations:
The model processes these embeddings through multiple layers of mathematical operations, primarily
matrix multiplications. These layers include attention mechanisms that allow the model to focus on
relevant parts of the input sequence.
Each layer transforms the embeddings, enriching them with more contextual information and refining
their representations.
- Predicting the Next Token:
Given a sequence of tokens, the model's task is to predict the next token in the sequence. It
does this by considering the transformed embeddings and calculating probabilities for each
possible next token in the vocabulary.
The softmax function converts these logits (raw scores) into a probability distribution,
indicating the likelihood of each token being the next one (a small numeric sketch follows after this list).
- Autoregression:
In a large language model, the process of generating text involves predicting and appending one
token at a time to the input sequence. This process, known as autoregression, ensures that the input
sequence gradually grows and improves the context for predicting subsequent tokens.
- Mathematical Basis:
The entire process is grounded in linear algebra and probability theory. The embeddings, matrix
multiplications, and softmax function are all mathematical constructs.
The model doesn't understand or think about the text in a human sense; it simply applies learned
patterns to generate the most probable continuation based on its training data.
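To make the last two steps concrete, here is a minimal sketch in Python/NumPy (an illustration with made-up numbers, not the code behind this page) that turns raw logits for a tiny vocabulary into a probability distribution with softmax and picks the most likely next token:

import numpy as np

# Made-up raw scores (logits) for the next token after a short input sequence.
vocab = ["the", "cat", "sat", "on", "mat"]
logits = np.array([0.2, 0.1, 0.4, 2.3, 1.1])

def softmax(x):
    # Subtract the maximum for numerical stability, then normalize to probabilities.
    e = np.exp(x - np.max(x))
    return e / e.sum()

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token:>4}: {p:.3f}")

# Greedy choice: the most probable token becomes the next token in the sequence.
print("next token:", vocab[int(np.argmax(probs))])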
Where is the intelligence in the LLM
Large language models (LLMs) like Transformer models store their "intelligence" in the form of learned
weights, including the Q (Query), K (Key), and V (Value) projection matrices. These are parameters that are
adjusted, or optimized, during the training process, and they are crucial in determining how the model
processes and generates language.
During training, the model "tries out" different values
to minimize the error in its predictions.
By adjusting these values through many iterations, the model learns and improves. The final set of values
represents the "solution" that allows the model to process and generate language effectively.
In short, the optimization process tunes the model's "intelligence", allowing it to understand and respond
to language more effectively.
In essence, large language models (LLMs) like Transformers perform a form of curve fitting. They take
sequences of tokens (words or subwords) as input and output probabilities for the next token in the
sequence.
Ultimately, the "intelligence" of the model is the result of this finely tuned hypersurface fitted in a
high-dimensional space. It enables the model to interpolate word meanings, construct sentences, understand
text, generate software, and, in very large LLMs, even perform logical reasoning, and so to process and
generate coherent language. The curve serves as a lookup for the next token, where the input is a token
sequence.
Once this hypersurface is established, the model can predict text it has not explicitly seen in the training
data. This is achieved through interpolation on the curve, allowing the model to generate plausible and
coherent language even for new, unseen inputs (a minimal curve-fitting sketch follows below).
By capturing the underlying patterns and relationships in the data, the model can generalize and make
meaningful predictions beyond its training examples.
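The curve-fitting analogy can be made tangible with a deliberately simple, hypothetical example: the sketch below fits a cubic polynomial to noisy data points by gradient descent, the same "adjust parameters to minimize prediction error" loop described above, and then interpolates at an x value that never appeared in the training data. It illustrates the principle only; it is not an LLM.

import numpy as np

# Training data: noisy samples of an unknown underlying curve.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = np.sin(2 * x) + 0.05 * rng.standard_normal(20)

# Model: a cubic polynomial y ≈ w3*x^3 + w2*x^2 + w1*x + w0 (a "hypersurface" in miniature).
X = np.stack([x**3, x**2, x, np.ones_like(x)], axis=1)
w = np.zeros(4)

learning_rate = 0.1
for step in range(5000):
    pred = X @ w
    grad = 2 * X.T @ (pred - y) / len(x)   # gradient of the mean squared error
    w -= learning_rate * grad              # "try out" slightly better values, step by step

# Interpolation: predict at a point that was never part of the training data.
x_new = 0.1234
features = np.array([x_new**3, x_new**2, x_new, 1.0])
print("fitted weights:", np.round(w, 3))
print("prediction at x =", x_new, ":", round(float(features @ w), 3))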
Why do LLMs hallucinate
Definition: In the context of LLMs, hallucination refers to generating outputs that are
plausible-sounding but factually incorrect or not based on the training data.
Causes: Hallucination occurs when there is not enough training data to shape the hypersurface precisely
enough. The model still interpolates the patterns it has learned, but with insufficient data this
interpolation can produce plausible-sounding content that is not grounded in the training data, including
imaginative or simply incorrect output.
It can occur due to:
- Ambiguity: The model might fill in gaps with fabricated details.
- Pattern Synthesis: The model combines known patterns in unexpected ways.
- Imagination: The model generates creative but incorrect content.
Why is the architecture called "Transformer"
The name "Transformer" comes from the core concept of transforming sequences of data. Introduced by Vaswani
et al. in their 2017 paper "Attention Is All You Need," the Transformer architecture fundamentally changed
how we process sequential data like text.
- Attention Mechanisms:
Transformers use attention mechanisms to transform input sequences into output sequences by focusing
on relevant parts of the input.
Unlike previous models that processed data sequentially, Transformers use attention to allow for
parallel processing, thus transforming how sequences are handled.
- Sequential Data Transformation:
The architecture is designed to transform entire sequences of input data (e.g., sentences) into
meaningful output sequences (e.g., translated sentences) in one go.
This transformation is done through layers of self-attention and feedforward neural networks.
- Context window:
The context window plays a crucial role in this process. It defines the span of input tokens that
the model considers when processing each token. By leveraging the context window, the model can
focus on relevant parts of the input sequence, ensuring that the transformation captures
dependencies and relationships across the entire sequence.
During the transformation, the self-attention mechanism looks at all tokens within the context
window to compute attention scores. These scores determine the relevance of each token in relation
to every other token within the window. The calculated attention weights are then used to generate
context vectors, which aggregate information from the entire context window. This allows the model
to produce coherent and contextually relevant output sequences, effectively capturing the essence of
the input data (a minimal attention sketch follows at the end of this section).
- Flexibility and Versatility:
Transformers are not limited to a specific type of data; they can transform various types of
sequential data, including text, audio, and even images.
This ability to handle diverse data types highlights their transformative nature.
In essence, the Transformer architecture is named for its revolutionary approach to transforming data
sequences with attention mechanisms, allowing for more efficient and effective processing of information.
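As a rough sketch of this mechanism (a single attention head over a three-token context window with tiny, made-up 2-dimensional embeddings; not the actual weights of this page), scaled dot-product self-attention can be computed as follows:

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Three tokens in the context window, each with a made-up 2-dimensional embedding.
X = np.array([[1.0, 0.0],    # "the"
              [0.0, 1.0],    # "cat"
              [1.0, 1.0]])   # "sat"

# Learned Q, K, V projection matrices (random stand-ins for illustration).
rng = np.random.default_rng(1)
Wq, Wk, Wv = (rng.standard_normal((2, 2)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Attention scores: every token against every other token in the window,
# scaled by the square root of the key dimension and normalized with softmax.
scores = Q @ K.T / np.sqrt(K.shape[-1])
weights = softmax(scores, axis=-1)

# Context vectors: a weighted mix of the value vectors of all tokens in the window.
context = weights @ V

print("attention weights (3x3):")
print(np.round(weights, 3))
print("context vectors (3x2):")
print(np.round(context, 3))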
The transformer process step by step
- Lookup Meaning from Vocabulary Table → Meaning Encoded
When a word (or token) is fed into the model, it gets converted into a numerical vector, called an
embedding. Embeddings capture semantic meaning, allowing the model to understand relationships
between words (e.g., "king" is related to "queen").
These embeddings are stored in a lookup table, which can be thought of as a large matrix where each
row corresponds to a token and contains its embedding vector.
When a token is encountered in the input, its embedding is retrieved from this lookup table.
- Add Sine/Cosine Positional Embeddings → Position and Meaning Encoded
To understand the order of words, positional encodings are added to the embeddings. This helps the
model differentiate between "The cat sat on the mat" and "The mat sat on the cat." The positional
encodings are vectors added to the word embeddings, incorporating information about the position of
each token in the sequence.
- Multiply with Attention Score → Position, Meaning and Context Encoded
Attention mechanisms, particularly self-attention, allow the model to weigh the importance of
different words in the context of a sentence. For example, in "The cat sat on the mat," the word
"cat" might pay more attention to "sat" and "mat" than to the word "the." The attention mechanism
calculates attention scores through matrix multiplications, enhancing embeddings with context from
other relevant tokens.
- Project → Position, Meaning and Context further Encoded
The embeddings, enriched with positional and contextual information, pass through multiple ANN
layers. These layers apply a series of linear transformations and non-linear activations (like ReLU)
to further process and refine the embeddings. Each layer extracts higher-level features, making the
embeddings more informative.
- Project last Embedding to Vocabulary → Vocabulary Prediction
The final layer projects the refined embeddings to a vector the size of the vocabulary. This is done
through a linear transformation, resulting in logits.
The logits are then passed through a softmax function, which converts them into probabilities. This
probability distribution indicates the likelihood of each word in the vocabulary being the next
token. The temperature parameter T is introduced here to control the randomness of the predictions.
- Apply Temperature and Top-k Sampling → Select Token
The model uses these probabilities to sample the next token. This can be done using different
strategies such as greedy sampling, top-k sampling, or nucleus (top-p) sampling (see the sketch
after this list):
Greedy Sampling: Selects the token with the highest probability.
Top-k Sampling: Considers only the top k tokens and samples from them.
Nucleus Sampling: Considers tokens until their cumulative probability exceeds a certain threshold
(e.g., 0.9).
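The steps above can be traced end to end in one small, self-contained Python/NumPy sketch. All dimensions and the random weights are made up for illustration; the sketch mirrors the listed steps (embedding lookup, sine/cosine positions, single-head self-attention, a small feedforward projection, vocabulary logits, temperature, top-k sampling) and is not the model actually trained on this page.

import numpy as np

rng = np.random.default_rng(42)
vocab = ["the", "cat", "sat", "on", "mat"]      # toy vocabulary
V, D, T = len(vocab), 4, 3                      # vocab size, embedding dim, sequence length

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# 1) Lookup meaning from the vocabulary table: one embedding row per token (random stand-ins).
E = rng.standard_normal((V, D)) * 0.5
tokens = [vocab.index(w) for w in ["the", "cat", "sat"]]
x = E[tokens]                                    # shape (3, 4)

# 2) Add sine/cosine positional embeddings.
pos = np.arange(T)[:, None]
i = np.arange(D)[None, :]
angles = pos / np.power(10000.0, (2 * (i // 2)) / D)
x = x + np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# 3) Multiply with attention scores: weigh each token against the others in the window.
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.5 for _ in range(3))
Q, K, Val = x @ Wq, x @ Wk, x @ Wv
x = softmax(Q @ K.T / np.sqrt(D)) @ Val

# 4) Project: a small feedforward layer with a ReLU non-linearity.
W1, W2 = rng.standard_normal((D, 8)) * 0.5, rng.standard_normal((8, D)) * 0.5
x = np.maximum(x @ W1, 0.0) @ W2

# 5) Project the last embedding to the vocabulary to obtain logits.
W_out = rng.standard_normal((D, V)) * 0.5
logits = x[-1] @ W_out

# 6) Apply temperature and top-k sampling to select the next token.
def sample_top_k(logits, k=3, temperature=0.8):
    probs = softmax(logits / temperature)
    top = np.argsort(probs)[-k:]                 # indices of the k most likely tokens
    top_probs = probs[top] / probs[top].sum()    # renormalize over the top-k set
    return int(rng.choice(top, p=top_probs))

print("next token:", vocab[sample_top_k(logits)])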
What is the difference from a large language model
This simulation shows a small language model, but its architecture corresponds exactly to that of its big brothers.
If we increase the embedding dimensions from 4 to 768,
the sequence length from 3 to 1024
(giving an attention score table with 1,048,576 cells),
and the vocabulary to 50,257 tokens,
add another network layer with 3,072 neurons,
divide the self-attention into 12 heads,
and place the transformer block another 11 times in a row (12 blocks in total),
then we are already at GPT-2.
This could be achieved with a few lines of additional code and some configuration changes. It is mostly a
question of configuration rather than architecture, if we ignore small refinements such as residual
connections and normalization layers. However, such a model could no longer be trained efficiently in the
browser.
For even larger models, the aim is to reduce the number of active parameters using techniques such as
mixture of experts (MoE) or multi-head latent attention (MLA). These methods may seem complex, but they
essentially boil down to matrix operations.
The selection and preparation of the training data also play an important role.
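To illustrate the "it is mostly configuration" point, the jump from the toy model to GPT-2 can be written down as little more than a change of hyperparameters. The GPT-2 values below are the published settings of GPT-2 small; in the toy configuration, only the embedding dimension and sequence length come from the text above, and the fields marked "assumed" are illustrative guesses.

# The same Transformer architecture expressed as two configurations.
toy_config = dict(
    embedding_dim=4,            # from the simulation described above
    sequence_length=3,          # from the simulation described above
    vocab_size=30,              # assumed: a small demo vocabulary
    mlp_neurons=16,             # assumed
    num_heads=2,                # assumed
    num_transformer_blocks=1,   # assumed
)

gpt2_small_config = dict(
    embedding_dim=768,
    sequence_length=1024,       # attention score table: 1024 * 1024 = 1,048,576 cells
    vocab_size=50257,
    mlp_neurons=3072,
    num_heads=12,
    num_transformer_blocks=12,
)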
But wait, there's this Think Deeper button
The "Think Deeper" feature might give the impression that the LLM (Large Language Model) is thinking more
deeply, but in reality, it leverages multiple passes through the same model to improve the quality of the
response. It doesn't change the underlying architecture or fundamentally alter how the model processes
information.
In other words, the model isn't "thinking" in the human sense but is using its existing capabilities
iteratively to refine and enhance its output. This process leads to more accurate and contextually relevant
responses without actually changing the way the model operates.
Here's a brief explanation of how it works:
- Initial Response:
The model generates an initial response based on the input query.
- Iterative Refinement:
The response is then analyzed, and the model is re-invoked with additional context or specific
instructions to improve and refine the output. This may involve breaking down the question into
smaller components or considering alternative angles.
- Enhanced Answer:
By iterating through the LLM multiple times, the final answer becomes more accurate, detailed, and
contextually relevant.
So, while the architecture remains unchanged, the iterative process improves the model's reasoning and
response quality. It's like taking multiple looks at a problem to come up with a well-rounded solution.
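A purely illustrative sketch of such a loop might look like the following. The generate() function is a hypothetical placeholder for a single pass through the unchanged underlying model (whatever model or API that is); nothing about how the actual Think Deeper feature is implemented is claimed here.

def generate(prompt: str) -> str:
    # Hypothetical placeholder: one ordinary pass through the unchanged LLM.
    raise NotImplementedError("wire this to whichever model or API you use")

def think_deeper(question: str, rounds: int = 3) -> str:
    # 1) Initial response: a single ordinary pass through the model.
    answer = generate(question)
    # 2) Iterative refinement: re-invoke the same model with its own draft as extra context.
    for _ in range(rounds):
        critique = generate(f"List weaknesses or missing aspects of this answer:\n{answer}")
        answer = generate(
            f"Question: {question}\nDraft answer: {answer}\n"
            f"Identified weaknesses: {critique}\nWrite an improved answer."
        )
    # 3) Enhanced answer: the architecture never changed, only the number of passes.
    return answer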
How was it built
This software was created using Vibe Coding by a Large Language Model LLM / chatbot
and reworked in look & feel.
As a result, neither Copilot nor Gemini nor Chat-GPT (as of February 2025) could create the LLM, but only a
framework code. Many problems such as backpropagation in the attention layer or convolution to generate the
MLP mapping for the sequence had to be programmed on foot.
Nevertheless, the bots were a great help.
It should be noted that the final implementation was based on JavaScript Tensorflow, for which there is
significantly less trainable example code. The bots had to creatively find solutions using analogies.
The following prompts were used with Copilot to create a Python example, llm5.py:
"create a python tensorflow model for an llm. Use 2 dimenstional embeddings. Train the model on a
sample text corpus. implement a predict function that takes a text as input and outputs a probability
distribution for the next token."
"run a softmax on the predictions and print them line by line"
"where is the q,k,v attention layer in the createmodel"
"Iterating over a symbolic KerasTensor is not supported."
"should the predict sequence not hold the whole senence"
"use a CustomMultiHeadAttention class instead"
"shapes used to initialize variables must be fully-defined (no `None` dimensions). Received:
shape=(None, 128) for variable path='dense_4/kernel'"
"you need to change dense_layer = Dense(128, activation='relu')(flatten_layer) instead"
"dense_layer = Dense(128, activation='relu')(flatten_layer) throws shapes used to initialize variables
must be fully-defined"
"graph execution error Only one input size may be -1, not both 0 and 1 in model.fit"
"only one input size may be -1, not both 0 and 1 [[{{node functional_1/flatten_1/Reshape}}]]
[Op:__inference_multi_step_on_iterator_2309] File "/Users/ichapple/Documents/Python/llm5.py", line 90,
in model.fit(X, y, epochs=100, verbose=1)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error: Detected at
node functional_1/flatten_1/Reshape"
"CustomMultiHeadAttention returns shape with null"
"CustomMultiHeadAttention has _shape = (None, None, 8)"
The following prompts were used with Gemini to translate the Python code to JavaScript and to solve several
problems:
"create a single webpage translate llm5.py to javascript and insert it to the webpage"
"llm5.py has its own training methods there is no need to specify weights and bias. translate the
python one to one to javascript so it will work exactly the same"
"it throws: Class being registered does not have the static className property defined."
"generic_utils.js:243 Uncaught n: Unknown initializer: glorot_uniform. This may be due to one of the
following reasons:
The initializer is defined in Python, in which case it needs to be ported to TensorFlow.js or your
JavaScript code.
The custom initializer is defined in JavaScript, but is not registered properly with
tf.serialization.registerClass()."
"topology.js:143 Uncaught TypeError: Cannot read properties of null (reading 'length') in model.add(new
CustomMultiHeadAttention({key_dim: embedding_dim, num_heads: num_heads, name: 'mha', kernel_initializer:
'glorotNormal', bias_initializer: 'zeros'})); // tf.customLayer"
"topology.js:773 Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'rank')"
"util_base.js:153 Uncaught (in promise) Error: Error in matMul: inner shapes (8) and (2) of Tensors
with shapes 8,3,8,8 and 8,3,2,8 and transposeA=false and transposeB=false"
"tensor_util_env.js:92 Uncaught (in promise) Error: Argument 'x' passed to 'floor' must be float32
tensor, but got int32 tensor"
"display a html table with the trained vocabulary use one row for each token, display the index, name
and the embeddings of each token."
"for each head display html tables for the q,v,k matrices. Also display a html table with the attention
scores"
"tensor.js:461 Uncaught (in promise) Error: Tensor is disposed.
at e.value (tensor.js:461:13)
at r5.slice (slice.js:32:8)
at displayAttentionVisualizations (a4.html:408:80)
at async predict_next_token (a4.html:309:14)"
"add position embeddings to the code and display them as a html table"
"you did not add positionEmbeddingLayer to the model so it is not part of the training process"
"but you apply the position_tensor only in the predict_next_token method but it is not used in the
train method"
"instead of manually adding the position embeddings in the prodict and train method it would be better
to make it part of the model. Can you do that"
"const positionTensor = tf.tensor2d([positions], [batchSize, seqLength], 'int32'); throws Uncaught (in
promise) Error: Based on the provided shape, [8,3], the tensor should have 24 values but has 3"
"why are you doing this: const predicted_prob = model.predict(embeddingsAfterTensor).dataSync();
instead of const predicted_prob = model.predict(input_tensor).dataSync();"
"But the positional embeddings are already added in the model with the custom AddPositionEmbedding
class"
"it is still redundant because also the norm layer is part of the model so
model.predict(input_tensor).dataSync() should be enough"
"for the positions could you implement a sinus / cosines curve"
"positionEmbeddingsTensor = addPosLayer.positionEmbeddingLayer.getWeights()[0]; does not work"
"explaint the CustomMultiHeadAttention class with respect to self attention"
"If the sequence_len = 3 then for 2 heads there should be 2 attention_scores matrices with 3 by 3
values, giving a total of 18 scores in attention_scores right ?"
"so for a batch_size of 1 the shape would be [1,3,2,3] ?"
"but when I run the code it gives me [1,3,2,2] attention_scores"
"query_reshaped and key_reshaped both have shape 1,3,2,2 is this correct ?"
"but this.attention_scores = tf.matMul(this.query_reshaped, this.key_reshaped, true); computes the
score matrix which we agree should have shape 1,3,2,3 but it turns out to be 1,3,2,2. Is there anything
wrong with the matMul ?"
"The problem could be solved by changing the cols (batch_size, seq_len, num_heads, key_dim) to
(batch_size, num_heads, seq_len, key_dim) then when transposing it would give [1, 2, 3, 2] matmul [1, 2,
2, 3] which would yield [1, 3, 2, 3] is that right ?"
"implement a Top-k Sampling sampling for a given array of softmax values jusing javascript only"
At this point the code was not yet executable, and the lengthy troubleshooting and implementation of missing
features began with the support of Gemini, Copilot, and ChatGPT.