Consider the sentence: "The bear likes sun"
An example with two-dimensional embeddings, as we have seen in LLMs Part 1: Understand the meaning of words through AI.
original embeddings

Pos. | word  | X     | Y
0    | the   | -0.30 |  0.68
1    | bear  | -0.23 | -0.45
2    | likes | -0.34 | -0.51
3    | sun   | -0.46 |  0.47
+
positional embeddings

Pos. | X    | Y
0    | 0.00 |  1.00
1    | 0.84 |  0.54
2    | 0.91 | -0.42
3    | 0.14 | -0.99
=
resulting embeddings

Pos. | X     | Y
0    | -0.30 |  1.68
1    |  0.61 |  0.09
2    |  0.57 | -0.93
3    | -0.32 | -0.52
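As a rough sketch (not the code behind this page), the addition can be reproduced with a few lines of Python, using the token and positional values from the tables above:

```python
import numpy as np

# Token embeddings for "The bear likes sun" (values from the table above)
tokens = {"the": [-0.30, 0.68], "bear": [-0.23, -0.45],
          "likes": [-0.34, -0.51], "sun": [-0.46, 0.47]}

# Positional embeddings for positions 0..3 (values from the table above)
positions = np.array([[0.00, 1.00], [0.84, 0.54], [0.91, -0.42], [0.14, -0.99]])

sentence = ["the", "bear", "likes", "sun"]
word_vectors = np.array([tokens[w] for w in sentence])

# Element-wise addition yields the "resulting embeddings" table
print(word_vectors + positions)
# [[-0.30  1.68]
#  [ 0.61  0.09]
#  [ 0.57 -0.93]
#  [-0.32 -0.52]]
```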
But what happens if we change the order of the words in the sentence and create "The sun likes bear" instead?
Through positional encoding, the new meaning of the reordered sentence becomes visible in the resulting embeddings.
original embeddings

Pos. | word  | X     | Y
0    | the   | -0.30 |  0.68
1    | sun   | -0.46 |  0.47
2    | likes | -0.34 | -0.51
3    | bear  | -0.23 | -0.45
+
positional embeddings

Pos. | X    | Y
0    | 0.00 |  1.00
1    | 0.84 |  0.54
2    | 0.91 | -0.42
3    | 0.14 | -0.99
=
resulting embeddings

Pos. | X     | Y
0    | -0.30 |  1.68
1    |  0.38 |  1.01
2    |  0.57 | -0.93
3    | -0.09 | -1.44
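Reusing the same sketch with the reordered sentence shows how positions 1 and 3 now pick up different sums, even though the word vectors themselves are unchanged:

```python
sentence = ["the", "sun", "likes", "bear"]
print(np.array([tokens[w] for w in sentence]) + positions)
# [[-0.30  1.68]
#  [ 0.38  1.01]
#  [ 0.57 -0.93]
#  [-0.09 -1.44]]
```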
You can think of positional encoding as a way to "twist" or "turn" the original embedding vector by adding sine and cosine values, which are functions of the position in the sequence. This gives each position a unique representation and allows the model to understand the order and relative positions of tokens in the sequence.
Notice how the values from the sine and cosine curves below are used here.

This is how positional encodings are calculated
Change the parameters and press recalculate to display the positional encodings according to the chosen parameters. Notice that for high-dimensional embedding vectors the wavelengths increase, resulting in unique embeddings for each position.

Resulting sine and cosine curves.
Encodings on a gray scale map
The following graphic illustrates the encoding values on a gray scale map. The dimensions of the embedding vectors are shown vertically, the horizontal axis shows the token position, and the color reflects the value added to the embedding.
[Grayscale map: Vector Dimension (vertical) vs. Token Position Index (horizontal)]
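A minimal sketch of how such a map could be drawn, assuming NumPy and Matplotlib (the page itself renders it with JavaScript on a canvas):

```python
import numpy as np
import matplotlib.pyplot as plt

def positional_encoding(num_positions, d, N=10000):
    """Sinusoidal positional encodings with shape (num_positions, d)."""
    pe = np.zeros((num_positions, d))
    pos = np.arange(num_positions)[:, None]      # token position index
    i = np.arange(0, d, 2)[None, :]              # even dimension indices 0, 2, 4, ...
    pe[:, 0::2] = np.sin(pos / N ** (i / d))     # even dimensions: sine
    pe[:, 1::2] = np.cos(pos / N ** (i / d))     # odd dimensions: cosine
    return pe

pe = positional_encoding(num_positions=100, d=64)
# Dimensions vertically, token positions horizontally, value as gray level
plt.imshow(pe.T, cmap="gray", aspect="auto", origin="lower")
plt.xlabel("Token Position Index")
plt.ylabel("Vector Dimension")
plt.show()
```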
What is positional encoding anyway?
Positional encoding is a technique used in the architecture of large language models (LLMs)
to provide the model with information about the position of each token in an input sequence.
In the standard sinusoidal positional encodings, each position in the input sequence is mapped to a vector
of a fixed dimension, where each dimension in the vector is a function of both the position and the
frequency, using sine and cosine functions of different frequencies. This allows the model to learn to
attend to tokens at specific positions.
\(\text{PE}(pos, 2i) = \sin \left( \frac{pos}{N^{2i/d}} \right)\) for even dimension indices

\(\text{PE}(pos, 2i+1) = \cos \left( \frac{pos}{N^{2i/d}} \right)\) for odd dimension indices

pos = Token Position Index
i = Vector Dimension Pair Index (dimensions 2i and 2i+1 share one frequency)
d = Embedding Dimension (number of vector dimensions)
N = free parameter (10000 default)
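As an illustrative sketch, the two formulas can be written directly in Python; with d = 2 there is only one sine/cosine pair, the exponent 2i/d is zero, and the values match the positional embeddings used in the tables above:

```python
import math

def pe(pos, k, d, N=10000):
    """Sinusoidal positional encoding for dimension index k at position pos."""
    i = k // 2                          # pair index
    angle = pos / N ** (2 * i / d)      # frequency depends on the pair index
    return math.sin(angle) if k % 2 == 0 else math.cos(angle)

for pos in range(4):
    print(pos, round(pe(pos, 0, d=2), 2), round(pe(pos, 1, d=2), 2))
# 0 0.0 1.0
# 1 0.84 0.54
# 2 0.91 -0.42
# 3 0.14 -0.99
```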
So the positional encoding is repeated according to the wavelength?
In large language models (LLMs), the choice of wavelengths for positional encoding is designed to cover a
wide range of positional information, from short-term to long-term dependencies. The wavelengths are
implicitly defined by the positional encoding formulas using sine and cosine functions with different
frequencies.
The sine and cosine functions used in positional encoding will repeat their values periodically due to their
nature. The period of repetition, or wavelength, is determined by the functions' parameters. This repetition
is intentional and ensures that the positional encoding can accommodate sequences of various lengths.
The choice of using different frequencies for sine and cosine functions helps the model capture both
short-term and long-term dependencies in the input sequence. By encoding positions in this way, the model
can effectively learn the relative positions of tokens and understand the order of the sequence.
The more dimensions the embedding vector has, the longer the sequence before the patterns repeat.
In essence, as the dimensionality of the embedding vector increases, the wavelengths of the sine and cosine
functions span a broader range, thereby extending the sequence length before the patterns start to repeat.
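The wavelengths can be read directly off the formulas above: the sine in dimension pair \(i\) repeats whenever its argument grows by \(2\pi\), so its wavelength is \(\lambda_i = 2\pi \cdot N^{2i/d}\). This ranges from \(2\pi\) for the first pair up to roughly \(2\pi N\) for the last pair, which is why higher-dimensional vectors stay unique over longer sequences.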
Are there other methods for positional encoding?
- Learnable Positional Embeddings: Instead of using fixed sine and cosine functions, learnable positional embeddings use trainable parameters to represent positional information. Each position in the sequence has its own embedding vector, which is learned during training. This allows the model to potentially capture more complex positional relationships.
- Relative Positional Encoding: Relative positional encoding represents the position of tokens relative to each other rather than as absolute positions. This method allows the model to focus on the relative distance between tokens, which can be useful for handling long sequences where absolute positions may not be as important.
- Rotary Positional Embedding (RoPE): Rotary Positional Embedding incorporates rotational matrix transformations to encode position information. It modifies the token embeddings through rotations determined by their positions, preserving the model's ability to generalize across different sequence lengths and enhancing its performance in capturing contextual relationships (a rough sketch of the rotation idea follows after this list).
- Explicit Position Indicators: In some cases, models may use explicit position indicators or flags to mark the position of tokens. This can be done by concatenating position indicators to the token embeddings or by using separate position-specific tokens.
- Bucketed Positional Embeddings: Bucketed positional embeddings group positions into fixed-size buckets and assign the same positional embedding to all positions within a bucket. This approach is useful for handling very long sequences where precise positional information may be less critical.
- Gaussian Positional Embedding: Gaussian positional embedding uses Gaussian distributions centered around each token to represent positional information. The embedding for a position is determined by a Gaussian function, providing a smooth and continuous way to encode positions.
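As a rough, self-contained sketch of the rotation idea behind RoPE (illustrative only, not any particular library's implementation; it assumes the same frequency schedule as the sinusoidal encodings above):

```python
import numpy as np

def rope(x, pos, N=10000):
    """Rotate consecutive (even, odd) pairs of vector x by position-dependent angles."""
    d = x.shape[-1]
    out = x.astype(float).copy()
    for i in range(d // 2):
        theta = pos / N ** (2 * i / d)    # same frequency schedule as sinusoidal PE
        c, s = np.cos(theta), np.sin(theta)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s      # 2D rotation of the pair
        out[2 * i + 1] = x0 * s + x1 * c
    return out

q = np.array([-0.23, -0.45])   # e.g. the "bear" vector from the tables above
print(rope(q, pos=1))          # the vector is rotated by its position instead of shifted
```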
Each of these methods has its own advantages and trade-offs, and the choice of positional encoding can
depend on the specific task and model architecture.
How was it built?
This software was created using Vibe Coding with a large language model (LLM) / chatbot and was then reworked in look & feel. Some features had to be implemented manually, and corrections and improvements had to be made.
The following Vibe Coding prompts were used on Copilot:
"create a single page html website using javascript and chart.js. Create positional encoding for D
dimension embeddings and indices from 0 to 100. Draw the embeddings on the chart. Use a graph for each
of the D dimensions. Let the user set the D value via an input control."
"add a input field for the number of indices. Add a canvas that displays a color map. Use the indices
horizontally and the position vertically. Color the dots on the canvas according to the embedding
value."
"the number of indices or dimensions field does not update the actual dimensions of the vectors. The
canvas is to small, make it 100% wide and 300 pt high."