Consider the sentence: "The bear likes sun"
An example with two-dimensional embeddings, as we have seen in LLMs Part 1: Understand the meaning of words through AI.
original embeddings

Pos. | word  | X     | Y
0    | the   | -0.30 |  0.68
1    | bear  | -0.23 | -0.45
2    | likes | -0.34 | -0.51
3    | sun   | -0.46 |  0.47
+
positional embeddings

Pos. | X    | Y
0    | 0.00 |  1.00
1    | 0.84 |  0.54
2    | 0.91 | -0.42
3    | 0.14 | -0.99
=
resulting embeddings

Pos. | X     | Y
0    | -0.30 |  1.68
1    |  0.61 |  0.09
2    |  0.57 | -0.93
3    | -0.32 | -0.52
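As a rough sketch (not the code behind this page), the addition can be reproduced with a few lines of Python, using the token and positional values from the tables above:

```python
import numpy as np

# Token embeddings for "The bear likes sun" (values from the table above)
tokens = {"the": [-0.30, 0.68], "bear": [-0.23, -0.45],
          "likes": [-0.34, -0.51], "sun": [-0.46, 0.47]}

# Positional embeddings for positions 0..3 (values from the table above)
positions = np.array([[0.00, 1.00], [0.84, 0.54], [0.91, -0.42], [0.14, -0.99]])

sentence = ["the", "bear", "likes", "sun"]
word_vectors = np.array([tokens[w] for w in sentence])

# Element-wise addition yields the "resulting embeddings" table
print(word_vectors + positions)
# [[-0.30  1.68]
#  [ 0.61  0.09]
#  [ 0.57 -0.93]
#  [-0.32 -0.52]]
```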
But what happens if we change the order of the words in the sentence and create "The sun likes bear" instead?
Through positional encoding, the new meaning of the reordered sentence becomes visible in the resulting embeddings.
original embeddings

Pos. | word  | X     | Y
0    | the   | -0.30 |  0.68
1    | sun   | -0.46 |  0.47
2    | likes | -0.34 | -0.51
3    | bear  | -0.23 | -0.45
+
positional embeddings

Pos. | X    | Y
0    | 0.00 |  1.00
1    | 0.84 |  0.54
2    | 0.91 | -0.42
3    | 0.14 | -0.99
=
resulting embeddings

Pos. | X     | Y
0    | -0.30 |  1.68
1    |  0.38 |  1.01
2    |  0.57 | -0.93
3    | -0.09 | -1.44
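Reusing the same sketch with the reordered sentence shows how positions 1 and 3 now pick up different sums, even though the word vectors themselves are unchanged:

```python
sentence = ["the", "sun", "likes", "bear"]
print(np.array([tokens[w] for w in sentence]) + positions)
# [[-0.30  1.68]
#  [ 0.38  1.01]
#  [ 0.57 -0.93]
#  [-0.09 -1.44]]
```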
You can think of positional encoding as a way to "twist" or "turn" the original embedding vector by adding sine and cosine values, which are functions of the position in the sequence. This gives each position a unique representation and allows the model to understand the order and relative positions of tokens in the sequence.
Notice how the values from the sine and cosine curves below are used here.

This is how positional encodings are calculated
Change the parameters and press recalculate to display the positional encodings according to the chosen parameters. Notice that for high-dimensional embedding vectors the wavelengths increase, resulting in unique embeddings for each position.

Resulting sine and cosine curves.
Encodings on a gray scale map
The following graphic illustrates the encoding values on a gray scale map. The dimensions of the embedding vectors are shown vertically, the horizontal axis shows the token position, and the color reflects the value added to the embedding.
[Grayscale map: Vector Dimension (vertical) vs. Token Position Index (horizontal)]
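A minimal sketch of how such a map could be drawn, assuming NumPy and Matplotlib (the page itself renders it with JavaScript on a canvas):

```python
import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(num_positions, d, N=10000):
    """Sinusoidal positional encodings with shape (num_positions, d)."""
    pe = np.zeros((num_positions, d))
    pos = np.arange(num_positions)[:, None]      # token position index
    i = np.arange(0, d, 2)[None, :]              # even dimension indices 0, 2, 4, ...
    pe[:, 0::2] = np.sin(pos / N ** (i / d))     # even dimensions: sine
    pe[:, 1::2] = np.cos(pos / N ** (i / d))     # odd dimensions: cosine
    return pe

pe = positional_encoding(num_positions=100, d=64)
# Dimensions vertically, token positions horizontally, value as gray level
plt.imshow(pe.T, cmap="gray", aspect="auto", origin="lower")
plt.xlabel("Token Position Index")
plt.ylabel("Vector Dimension")
plt.show()
```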
What is positional encoding anyway?
Positional encoding is a technique used in the architecture of large language models (LLMs)
to provide the model with information about the position of each token in an input sequence.
In the standard sinusoidal positional encodings, each position in the input sequence is mapped to a vector
of a fixed dimension, where each dimension in the vector is a function of both the position and the
frequency, using sine and cosine functions of different frequencies. This allows the model to learn to
attend to tokens at specific positions.
\(\text{PE}(pos, 2i) = \sin \left( \frac{pos}{N^{2i/d}} \right)\) for even dimension indices

\(\text{PE}(pos, 2i+1) = \cos \left( \frac{pos}{N^{2i/d}} \right)\) for odd dimension indices

pos = Token Position Index
i = Vector Dimension Pair Index (dimensions 2i and 2i+1 share one frequency)
d = Embedding Dimension (number of vector dimensions)
N = free parameter (10000 default)
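As an illustrative sketch, the two formulas can be written directly in Python; with d = 2 there is only one sine/cosine pair, the exponent 2i/d is zero, and the values match the positional embeddings used in the tables above:

```python
import math

def pe(pos, k, d, N=10000):
    """Sinusoidal positional encoding for dimension index k at position pos."""
    i = k // 2                          # pair index
    angle = pos / N ** (2 * i / d)      # frequency depends on the pair index
    return math.sin(angle) if k % 2 == 0 else math.cos(angle)

for pos in range(4):
    print(pos, round(pe(pos, 0, d=2), 2), round(pe(pos, 1, d=2), 2))
# 0 0.0 1.0
# 1 0.84 0.54
# 2 0.91 -0.42
# 3 0.14 -0.99
```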
So the positional encoding is repeated according to the wavelength?
In large language models (LLMs), the choice of wavelengths for positional encoding is designed to cover a
wide range of positional information, from short-term to long-term dependencies. The wavelengths are
implicitly defined by the positional encoding formulas using sine and cosine functions with different
frequencies.
The sine and cosine functions used in positional encoding will repeat their values periodically due to their
nature. The period of repetition, or wavelength, is determined by the functions' parameters. This repetition
is intentional and ensures that the positional encoding can accommodate sequences of various lengths.
The choice of using different frequencies for sine and cosine functions helps the model capture both
short-term and long-term dependencies in the input sequence. By encoding positions in this way, the model
can effectively learn the relative positions of tokens and understand the order of the sequence.
The more dimensions the embedding vector has, the longer the sequence before the patterns repeat.
In essence, as the dimensionality of the embedding vector increases, the wavelengths of the sine and cosine
functions span a broader range, thereby extending the sequence length before the patterns start to repeat.
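The wavelengths can be read directly off the formulas above: the sine in dimension pair \(i\) repeats whenever its argument grows by \(2\pi\), so its wavelength is \(\lambda_i = 2\pi \cdot N^{2i/d}\). This ranges from \(2\pi\) for the first pair up to roughly \(2\pi N\) for the last pair, which is why higher-dimensional vectors stay unique over longer sequences.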
Are there other methods for positional encoding?
- Learnable Positional Embeddings: Instead of using fixed sine and cosine functions, learnable positional embeddings use trainable parameters to represent positional information. Each position in the sequence has its own embedding vector, which is learned during training. This allows the model to potentially capture more complex positional relationships.
- Relative Positional Encoding: Relative positional encoding represents the position of tokens relative to each other rather than as absolute positions. This method allows the model to focus on the relative distance between tokens, which can be useful for handling long sequences where absolute positions may not be as important.
- Rotary Positional Embedding (RoPE): Rotary Positional Embedding incorporates rotational matrix transformations to encode position information. It modifies the token embeddings through rotations determined by their positions, preserving the model's ability to generalize across different sequence lengths and enhancing its performance in capturing contextual relationships (a rough sketch of the rotation idea follows after this list).
- Explicit Position Indicators: In some cases, models may use explicit position indicators or flags to mark the position of tokens. This can be done by concatenating position indicators to the token embeddings or by using separate position-specific tokens.
- Bucketed Positional Embeddings: Bucketed positional embeddings group positions into fixed-size buckets and assign the same positional embedding to all positions within a bucket. This approach is useful for handling very long sequences where precise positional information may be less critical.
- Gaussian Positional Embedding: Gaussian positional embedding uses Gaussian distributions centered around each token to represent positional information. The embedding for a position is determined by a Gaussian function, providing a smooth and continuous way to encode positions.
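As a rough, self-contained sketch of the rotation idea behind RoPE (illustrative only, not any particular library's implementation; it assumes the same frequency schedule as the sinusoidal encodings above):

```python
import numpy as np

def rope(x, pos, N=10000):
    """Rotate consecutive (even, odd) pairs of vector x by position-dependent angles."""
    d = x.shape[-1]
    out = x.astype(float).copy()
    for i in range(d // 2):
        theta = pos / N ** (2 * i / d)    # same frequency schedule as sinusoidal PE
        c, s = np.cos(theta), np.sin(theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s      # 2D rotation of the pair
        out[2 * i + 1] = x0 * s + x1 * c
    return out

q = np.array([-0.23, -0.45])   # e.g. the "bear" vector from the tables above
print(rope(q, pos=1))          # the vector is rotated by its position instead of shifted
```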
Each of these methods has its own advantages and trade-offs, and the choice of positional encoding can
depend on the specific task and model architecture.
How was it built?
This software was created using Vibe Coding with a large language model (LLM) / chatbot and was then reworked in look & feel. Some features had to be implemented manually, and corrections and improvements had to be made.
The following Vibe Coding prompts were used on Copilot:
"create a single page html website using javascript and chart.js. Create positional encoding for D
dimension embeddings and indices from 0 to 100. Draw the embeddings on the chart. Use a graph for each
of the D dimensions. Let the user set the D value via an input control."
"add a input field for the number of indices. Add a canvas that displays a color map. Use the indices
horizontally and the position vertically. Color the dots on the canvas according to the embedding
value."
"the number of indices or dimensions field does not update the actual dimensions of the vectors. The
canvas is to small, make it 100% wide and 300 pt high."