
LLMs Part 2: Understand token position through AI

This software was largely created by AI Vibe Coding
Created by YouMinds
Unlike traditional sequence models like Recurrent Neural Networks (RNNs), which process tokens sequentially, Transformers process all tokens in parallel and thus lack an inherent understanding of token order.
Positional encoding addresses this challenge by adding information about each token's position to the token embeddings.
Consider the sentence: "The bear likes sun"
Here is an example with two-dimensional embeddings, as we saw in LLMs Part 1: Understand the meaning of words through AI.
original embeddings
Pos. word X Y
0 the -0.30 0.68
1 bear -0.23 -0.45
2 likes -0.34 -0.51
3 sun -0.46 0.47
+
positional embeddings
Pos. X Y
0 0.00 1.00
1 0.84 0.54
2 0.91 -0.42
3 0.14 -0.99
=
resulting embeddings
Pos. X Y
0 -0.30 1.68
1 0.61 0.09
2 0.57 -0.93
3 -0.32 -0.52
But what happens if we change the order of the words and form the sentence "The sun likes bear" instead?
Through positional encoding, the changed meaning of the reordered sentence becomes visible in the resulting embeddings.
original embeddings
Pos. word X Y
0 the -0.30 0.68
1 sun -0.46 0.47
2 likes -0.34 -0.51
3 bear -0.23 -0.45
+
positional embeddings
Pos. X Y
0 0.00 1.00
1 0.84 0.54
2 0.91 -0.42
3 0.14 -0.99
=
resulting embeddings
Pos. X Y
0 -0.30 1.68
1 0.38 1.01
2 0.57 -0.93
3 -0.09 -1.44
You can think of positional encoding as a way to "twist" or "turn" the original embedding vector by adding sine and cosine values that are functions of the position in the sequence. This gives each position a unique representation and allows the model to understand the order and relative positions of tokens in the sequence. Notice how the values from the sine and cosine curves below are used here.
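As a small illustration of the addition shown in the tables above, here is a minimal Python sketch (the dictionary layout and helper function are my own choices; the numbers are taken directly from the tables). It reproduces the resulting embeddings for both word orders:

# Element-wise addition of word embeddings and positional encodings,
# reproducing the tables above.
embeddings = {               # 2-D word embeddings from the tables above
    "the":   (-0.30,  0.68),
    "bear":  (-0.23, -0.45),
    "likes": (-0.34, -0.51),
    "sun":   (-0.46,  0.47),
}
positional = [               # 2-D positional encodings for positions 0..3
    (0.00, 1.00), (0.84, 0.54), (0.91, -0.42), (0.14, -0.99),
]

def encode(sentence):
    """Add the positional vector for each position to the word's embedding."""
    return [
        (round(ex + px, 2), round(ey + py, 2))
        for (ex, ey), (px, py) in zip((embeddings[w] for w in sentence), positional)
    ]

print(encode(["the", "bear", "likes", "sun"]))
# [(-0.3, 1.68), (0.61, 0.09), (0.57, -0.93), (-0.32, -0.52)]
print(encode(["the", "sun", "likes", "bear"]))
# [(-0.3, 1.68), (0.38, 1.01), (0.57, -0.93), (-0.09, -1.44)]

The words "sun" and "bear" now get different resulting vectors in the two sentences, even though their word embeddings are unchanged.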
This is how positional encodings are calculated



Change the parameters and press recalculate to display the corresponding positional encodings. Notice that for high-dimensional embedding vectors the wavelengths increase, resulting in unique encodings for each position.
Resulting sine and cosine curves.
Encodings on a grayscale map
The following graphic illustrates the encoding values on a grayscale map. The dimensions of the embedding vectors are shown vertically, and the horizontal axis shows the token position. The shade of gray reflects the value added to the embedding.
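If you want to reproduce a similar map offline, here is a minimal sketch using NumPy and Matplotlib (both are assumptions on my part; the original page draws onto an HTML canvas instead):

# Grayscale map of positional encodings: rows are vector dimensions,
# columns are token positions, cell values are the encodings added
# to that embedding component.
import numpy as np
import matplotlib.pyplot as plt

d, n_positions, N = 64, 100, 10000.0
pos = np.arange(n_positions)[np.newaxis, :]   # shape (1, n_positions)
i = np.arange(0, d, 2)[:, np.newaxis]         # even dimension indices, shape (d/2, 1)
angles = pos / N ** (i / d)                   # one frequency per dimension pair

pe = np.zeros((d, n_positions))
pe[0::2, :] = np.sin(angles)                  # even rows: sine
pe[1::2, :] = np.cos(angles)                  # odd rows: cosine

plt.imshow(pe, cmap="gray", aspect="auto")
plt.xlabel("Token Position Index")
plt.ylabel("Vector Dimension")
plt.show()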
What is positional encoding anyway?
Positional encoding is a technique used in the architecture of large language models (LLMs) to provide the model with information about the position of each token in an input sequence.
In the standard sinusoidal positional encodings, each position in the input sequence is mapped to a vector of a fixed dimension, where each component of the vector is a function of both the position and the dimension index, computed with sine and cosine functions of different frequencies. This allows the model to learn to attend to tokens at specific positions.
\(\text{PE}(pos, 2i) = \sin \left( \frac{pos}{N^{2i/d}} \right)\)
for even dimension indices
\(\text{PE}(pos, 2i+1) = \cos \left( \frac{pos}{N^{2i/d}} \right)\)
for odd dimension indices
pos = Token Position Index
i = Vector Dimension Pair Index (covers dimensions 2i and 2i+1)
d = Embedding Dimension (number of vector components)
N = free parameter (10000 by default)
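Here is a minimal Python sketch of these formulas (the function name and the rounding in the example are my own choices). With d = 2 it reproduces exactly the 2-D positional encodings used in the tables above:

# Sinusoidal positional encoding for a single position, following the
# formulas above: even dimensions use sine, odd dimensions use cosine.
import math

def positional_encoding(pos: int, d: int, N: float = 10000.0) -> list[float]:
    """Return the d-dimensional positional encoding vector for one position."""
    pe = [0.0] * d
    for i in range(0, d, 2):                  # i runs over the even dimension indices (2i in the formula)
        angle = pos / (N ** (i / d))          # frequency decreases as the dimension index grows
        pe[i] = math.sin(angle)               # even dimension index: sine
        if i + 1 < d:
            pe[i + 1] = math.cos(angle)       # odd dimension index: cosine
    return pe

# With d = 2 the exponent is 0, so the frequency is 1 regardless of N:
for pos in range(4):
    print(pos, [round(v, 2) for v in positional_encoding(pos, d=2)])
# 0 [0.0, 1.0]
# 1 [0.84, 0.54]
# 2 [0.91, -0.42]
# 3 [0.14, -0.99]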
So the positional encoding repeats according to the wavelength?
In large language models (LLMs), the choice of wavelengths for positional encoding is designed to cover a wide range of positional information, from short-term to long-term dependencies. The wavelengths are implicitly defined by the positional encoding formulas using sine and cosine functions with different frequencies.
The sine and cosine functions used in positional encoding will repeat their values periodically due to their nature. The period of repetition, or wavelength, is determined by the functions' parameters. This repetition is intentional and ensures that the positional encoding can accommodate sequences of various lengths. The choice of using different frequencies for sine and cosine functions helps the model capture both short-term and long-term dependencies in the input sequence. By encoding positions in this way, the model can effectively learn the relative positions of tokens and understand the order of the sequence.
The more dimensions the embedding vector has, the longer the sequence before the patterns repeat.

In essence, as the dimensionality of the embedding vector increases, the wavelengths of the sine and cosine functions span a broader range, thereby extending the sequence length before the patterns start to repeat.
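To make this concrete, here is a small Python sketch (the example dimensions are my own choice): the wavelength of the sine/cosine pair at dimension indices 2i and 2i+1 is 2π · N^(2i/d), so the longest wavelength in the vector grows with d.

# The slowest (longest) wave in the encoding gets longer as the
# embedding dimension d grows, approaching 2*pi*N.
import math

N = 10000.0
for d in (2, 8, 32, 128):
    # wavelength of the last sine/cosine pair (dimension indices d-2 and d-1)
    longest = 2 * math.pi * N ** ((d - 2) / d)
    print(f"d={d:3d}  longest wavelength ≈ {longest:,.0f} positions")
# d=2 → ≈6, d=8 → ≈6,283, d=32 → ≈35,333, d=128 → ≈54,410 (upper bound 2*pi*N ≈ 62,832)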
Are there other methods for positional encoding?
Yes. Besides the sinusoidal encodings shown here, common alternatives include learned (trainable) positional embeddings, relative positional encodings, rotary position embeddings (RoPE), and attention biases such as ALiBi. Each of these methods has its own advantages and trade-offs, and the choice of positional encoding can depend on the specific task and model architecture.
How was it built?
This software was created using Vibe Coding with a large language model (LLM) chatbot and then reworked for look & feel.
Some features had to be implemented manually, and corrections and improvements had to be made.
The following Vibe Coding prompts were used on Copilot:
"create a single page html website using javascript and chart.js. Create positional encoding for D dimension embeddings and indices from 0 to 100. Draw the embeddings on the chart. Use a graph for each of the D dimensions. Let the user set the D value via an input control."
"add a input field for the number of indices. Add a canvas that displays a color map. Use the indices horizontally and the position vertically. Color the dots on the canvas according to the embedding value."
"the number of indices or dimensions field does not update the actual dimensions of the vectors. The canvas is to small, make it 100% wide and 300 pt high."