
LLMs Part 3: Understand long context through AI

This software was largely created by AI Vibe Coding
Created by YouMinds
The attention mechanism revolutionized NLP by enabling models to focus on important parts of the input sequence, allowing for contextually accurate responses even in long sequences.
Introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al., self-attention has become a cornerstone of modern AI and the transformer architecture of LLMs.
Self-attention simulation
In the previous chapters we learned how each word in a sentence is converted into a vector representation, a so-called embedding.
At this point, the embeddings only carry information about the individual word's meaning and its position in the sequence.
Now self-attention comes into play, where each word's embedding is updated with context from surrounding words. This means the model can better capture the nuances and relationships within the sentence, leading to a more enriched and context-aware representation of the input tokens.
For this simulation we use 2-dimensional embeddings (so each embedding can be drawn as an arrow), a context window of only 4 tokens, and a simple text corpus.
Click Create and train Model to create the transformer model and train it on the text corpus above.


Watch how embedding vectors (arrows) that point in the same direction receive a high score. In other words, the query Q has found a matching key K, which indicates a contextual connection.
What is self-attention anyway
Self-attention is a mechanism in neural networks that helps the model focus on different parts of the input when generating an output. In short, it calculates the importance of each word in the context of every other word by taking the dot product of their embedding vectors. The context window is visualized as a matrix of attention scores, where higher scores indicate greater relevance.
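As a rough sketch of this idea in Python (with made-up embedding values, not the ones trained above, and skipping the query/key projections introduced further down), the score matrix is simply all pairwise dot products:

import numpy as np

# Hypothetical 2-D embeddings for a 4-token context window (made-up values)
embeddings = np.array([
    [0.2, 0.9],   # "The"
    [0.8, 0.3],   # "brown"
    [0.5, 0.5],   # "wolf"
    [0.7, 0.4],   # "likes"
])

# Score between word i and word j = dot product of their embedding vectors;
# arrows pointing in roughly the same direction produce large scores.
scores = embeddings @ embeddings.T
print(np.round(scores, 2))   # 4x4 matrix; larger values = stronger relation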
What can I see here
The rows show the words for which a contextual reference is being sought; the columns provide the answer. If two vectors run nearly in parallel, which shows up as a large dot product, then the row word is strongly related to the corresponding column word in the sequence.
The words "likes" and "air" are highly related to the word brown. From the text corpus used in training, we see that the words "likes" and "air" can only be predicted based on the word "brown". Accordingly, the model finds a contextual reference,as indicated by the light table cells.
The text corpus used during training (also shown above):
The red   wolf loves  sun
The blue  wolf eats   meat
The white wolf drinks water
The brown wolf likes  air
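A small check of the claim above in Python (illustration only, not part of the page's code): in this corpus the last two words of each sentence are fully determined by the color word, so the model can only predict "likes" and "air" by attending back to "brown".

corpus = [
    "The red wolf loves sun",
    "The blue wolf eats meat",
    "The white wolf drinks water",
    "The brown wolf likes air",
]

# Map each color word to the verb and noun that follow it
follows_color = {}
for line in corpus:
    _, color, _, verb, noun = line.split()
    follows_color[color] = (verb, noun)

print(follows_color["brown"])   # ('likes', 'air') -- determined by "brown" alone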
Why is the score for future words always zero
In language generation tasks, the model predicts the next word based on the previous words. By masking out future words, the model is forced to only consider the context of the words it has already seen, ensuring a causal dependency. This way, the prediction of each word depends only on the preceding words and not on the words that come after.
To summarize, masking out scores for words that come after the current word in self-attention ensures that the model generates text in a logical and sequential manner, maintaining causal dependencies and preventing information leakage.
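A minimal sketch, in Python, of how this causal masking is commonly implemented: future positions are set to minus infinity before a softmax, so their weights come out as exactly zero (the simplified math below omits the softmax; the simulation simply displays these cells as zero).

import numpy as np

scores = np.random.rand(4, 4)                       # stand-in attention scores for 4 tokens
future = np.triu(np.ones((4, 4), dtype=bool), k=1)  # True above the diagonal = future words

masked = np.where(future, -np.inf, scores)          # future positions get -infinity ...
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)      # ... so after softmax their weight is exactly 0
print(np.round(weights, 2))                         # lower-triangular: no look-ahead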
It's just math
Each token in the input sequence (X) is transformed into the query (Q), key (K) and value (V) vectors using the learned weight matrices (Wq), (Wk) and (Wv):
Q = Wq · X
K = Wk · X
V = Wv · X
Taking the dot product of queries and keys yields the attention scores (A), which indicate the relevance between words.
A = Q · Kᵀ
To obtain the output embeddings (O), you perform a matrix multiplication of the attention score matrix (A) with the value matrix (V). This operation combines the value vectors using the attention scores as weights, resulting in the final context vectors, which are the output embeddings.
O = A · V
In essence, the matrix multiplication is a compact way of expressing the summation operation where each context vector (Ci) is the weighted sum of the value vectors (Vj), with weights given by the attention scores (Aij). This process integrates the information from the entire input sequence, influenced by the attention mechanism, to form the final context vectors or output embeddings (O).
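The whole pipeline fits in a few lines of Python; this is a minimal sketch with random stand-in weights (a real transformer would additionally scale the scores by the square root of the key dimension and apply the causal softmax from above):

import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 2                     # 4-token context window, 2-D embeddings

X  = rng.normal(size=(seq_len, d))    # input embeddings, one token per row
Wq = rng.normal(size=(d, d))          # learned weight matrices (random stand-ins here)
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))

Q = X @ Wq                            # Q = Wq · X in the notation above (row-vector convention)
K = X @ Wk
V = X @ Wv

A = Q @ K.T                           # attention scores  A = Q · Kᵀ
O = A @ V                             # output embeddings O = A · V
# Row i of O is the weighted sum  Ci = sum_j Aij · Vj  described above.
print(O.shape)                        # (4, 2): one context vector per token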
How was it built
This software was created using vibe coding with a Large Language Model (LLM) chatbot and then reworked in look & feel.
In the end, neither Copilot nor Gemini nor ChatGPT (as of February 2025) could create the LLM itself, only framework code. Many parts, such as backpropagation in the attention layer or the convolution used to generate the MLP mapping for the sequence, had to be programmed by hand.
Nevertheless, the bots were a great help.
It should be noted that the final implementation was based on TensorFlow.js, for which significantly less example code was available in the bots' training data. The bots had to find solutions creatively by analogy.
The following prompts were used with Copilot to create a Python example, llm5.py:
"create a python tensorflow model for an llm. Use 2 dimenstional embeddings. Train the model on a sample text corpus. implement a predict function that takes a text as input and outputs a probability distribution for the next token."
"run a softmax on the predictions and print them line by line"
"where is the q,k,v attention layer in the createmodel"
"Iterating over a symbolic KerasTensor is not supported."
"should the predict sequence not hold the whole senence"
"use a CustomMultiHeadAttention class instead"
"shapes used to initialize variables must be fully-defined (no `None` dimensions). Received: shape=(None, 128) for variable path='dense_4/kernel'"
"you need to change dense_layer = Dense(128, activation='relu')(flatten_layer) instead"
"dense_layer = Dense(128, activation='relu')(flatten_layer) throws shapes used to initialize variables must be fully-defined"
"graph execution error Only one input size may be -1, not both 0 and 1 in model.fit"
"only one input size may be -1, not both 0 and 1 [[{{node functional_1/flatten_1/Reshape}}]] [Op:__inference_multi_step_on_iterator_2309] File "/Users/ichapple/Documents/Python/llm5.py", line 90, in model.fit(X, y, epochs=100, verbose=1) tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error: Detected at node functional_1/flatten_1/Reshape"
"CustomMultiHeadAttention returns shape with null"
"CustomMultiHeadAttention has _shape = (None, None, 8)"

The following prompts were used with Gemini to translate the Python code to JavaScript and to solve several problems:
"create a single webpage translate llm5.py to javascript and insert it to the webpage"
"llm5.py has its own training methods there is no need to specify weights and bias. translate the python one to one to javascript so it will work exactly the same"
"it throws: Class being registered does not have the static className property defined."
"generic_utils.js:243 Uncaught n: Unknown initializer: glorot_uniform. This may be due to one of the following reasons: The initializer is defined in Python, in which case it needs to be ported to TensorFlow.js or your JavaScript code. The custom initializer is defined in JavaScript, but is not registered properly with tf.serialization.registerClass()."
"topology.js:143 Uncaught TypeError: Cannot read properties of null (reading 'length') in model.add(new CustomMultiHeadAttention({key_dim: embedding_dim, num_heads: num_heads, name: 'mha', kernel_initializer: 'glorotNormal', bias_initializer: 'zeros'})); // tf.customLayer"
"topology.js:773 Uncaught (in promise) TypeError: Cannot read properties of undefined (reading 'rank')"
"util_base.js:153 Uncaught (in promise) Error: Error in matMul: inner shapes (8) and (2) of Tensors with shapes 8,3,8,8 and 8,3,2,8 and transposeA=false and transposeB=false"
"tensor_util_env.js:92 Uncaught (in promise) Error: Argument 'x' passed to 'floor' must be float32 tensor, but got int32 tensor"
"display a html table with the trained vocabulary use one row for each token, display the index, name and the embeddings of each token."
"for each head display html tables for the q,v,k matrices. Also display a html table with the attention scores"
"tensor.js:461 Uncaught (in promise) Error: Tensor is disposed. at e.value (tensor.js:461:13) at r5.slice (slice.js:32:8) at displayAttentionVisualizations (a4.html:408:80) at async predict_next_token (a4.html:309:14)"
"add position embeddings to the code and display them as a html table"
"you did not add positionEmbeddingLayer to the model so it is not part of the training process"
"but you apply the position_tensor only in the predict_next_token method but it is not used in the train method"
"instead of manually adding the position embeddings in the prodict and train method it would be better to make it part of the model. Can you do that"
"const positionTensor = tf.tensor2d([positions], [batchSize, seqLength], 'int32'); throws Uncaught (in promise) Error: Based on the provided shape, [8,3], the tensor should have 24 values but has 3"
"why are you doing this: const predicted_prob = model.predict(embeddingsAfterTensor).dataSync(); instead of const predicted_prob = model.predict(input_tensor).dataSync();"
"But the positional embeddings are already added in the model with the custom AddPositionEmbedding class"
"it is still redundant because also the norm layer is part of the model so model.predict(input_tensor).dataSync() should be enough"
"for the positions could you implement a sinus / cosines curve"
"positionEmbeddingsTensor = addPosLayer.positionEmbeddingLayer.getWeights()[0]; does not work"
"explaint the CustomMultiHeadAttention class with respect to self attention"
"If the sequence_len = 3 then for 2 heads there should be 2 attention_scores matrices with 3 by 3 values, giving a total of 18 scores in attention_scores right ?"
"so for a batch_size of 1 the shape would be [1,3,2,3] ?"
"but when I run the code it gives me [1,3,2,2] attention_scores"
"query_reshaped and key_reshaped both have shape 1,3,2,2 is this correct ?"
"but this.attention_scores = tf.matMul(this.query_reshaped, this.key_reshaped, true); computes the score matrix which we agree should have shape 1,3,2,3 but it turns out to be 1,3,2,2. Is there anything wrong with the matMul ?"
"The problem could be solved by changing the cols (batch_size, seq_len, num_heads, key_dim) to (batch_size, num_heads, seq_len, key_dim) then when transposing it would give [1, 2, 3, 2] matmul [1, 2, 2, 3] which would yield [1, 3, 2, 3] is that right ?"
"implement a Top-k Sampling sampling for a given array of softmax values jusing javascript only"

At this point the code was still not executable, and the lengthy troubleshooting and implementation of missing features began, with the support of Gemini, Copilot and ChatGPT.