LLMs Part 1: Understand the meaning of words through AI
This software was largely created by AI Vibe Coding
Created by YouMinds
This AI embedding algorithm
visualizes the relationships and meanings of words in a 2D vector space,
where every word tells a story through its relative position.
This deep comprehension of word meanings forms the basic building block for large language models (LLMs),
enabling them to generate coherent, contextually relevant text.
The meaning of words
The following text contains words whose meanings relative to one another are to be determined and represented
as machine-readable numerical values.
Press the Tokenize Text button below to split the given text
into tokens and create a vocabulary.
Optionally enter your own text as input.
That's the vocabulary.
Tokens are the fundamental units of text, and they can be words or smaller parts of words.
Notice how each token is given a sequential number.
Now the text is expressed as a sequence of token indices.
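A minimal Python sketch of this tokenization step (the demo itself runs in the browser; simple word-level splitting is assumed here, whereas real tokenizers may also produce sub-word tokens):

```python
import re

text = "The cat sat on the mat. The dog sat on the rug."

# Split into lowercase word tokens (a simple word-level tokenizer).
tokens = re.findall(r"[a-z]+", text.lower())

# Build the vocabulary: each distinct token gets a sequential index.
vocab = {}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)

# Express the text as a sequence of token indices.
token_ids = [vocab[tok] for tok in tokens]

print(vocab)      # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5, 'rug': 6}
print(token_ids)  # [0, 1, 2, 3, 0, 4, 0, 5, 2, 3, 0, 6]
```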
Press the Train Embeddings button below to start the learning
process for the given token sequence using a skip-gram model.
Skip-gram predicts the context words (surrounding words) given a target word or token.
The loss value characterizes the quality of the training process.
Training loss value:
The position of the individual words on the map indicates their meaning.
These positions can be viewed as coordinates or vectors whose directions and lengths encode the
meanings of the words relative to each other.
The skip-gram model uses these vectors as input to predict the probability of the context words
(surrounding words).
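A rough NumPy sketch of one skip-gram prediction and its loss (the vocabulary size, dimensions, and random initialization here are placeholders, not the demo's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 7, 2                      # 2-D embeddings, as on the map

# Input (target) embeddings and output (context) weights, randomly initialized.
W_in = rng.normal(scale=0.1, size=(vocab_size, dim))
W_out = rng.normal(scale=0.1, size=(vocab_size, dim))

def context_probabilities(target_id):
    """Softmax over the vocabulary: how likely each word is to appear
    in the context of the given target word."""
    v = W_in[target_id]                     # the target word's embedding vector
    scores = W_out @ v                      # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

# Cross-entropy loss for one (target, context) pair; averaging this over
# many pairs gives the training loss value shown above.
def pair_loss(target_id, context_id):
    return -np.log(context_probabilities(target_id)[context_id])

print(pair_loss(target_id=2, context_id=1))
```

Training nudges the vectors so that this loss shrinks, which moves words that share contexts closer together on the map.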
Notice how words with similar meanings form clusters on the map.
The final embedding
These vectors can now be represented in a table as groups of numbers.
Now we have an encoded version for the meaning of each word.
Follow the instructions above to create the embeddings.
Wait for the process to complete.
Click on a word
to view the probability distribution of context words for that target word.
What is word embedding or encoding anyway?
Children learn by observing relationships between objects and their context without needing to know
their exact meanings initially. They understand how objects interact and relate to each other, much
like
word embeddings capture the semantic relationships between words based solely on context and co-occurrence
patterns.
This relational learning allows for a deep and flexible understanding even without explicit
definitions.
Word embeddings are a type of word representation that allows words to be represented as vectors in a
continuous vector space. This representation captures semantic relationships between words, such that words
with similar meanings are located closer together in the vector space. Word embeddings are generated using
techniques such as Word2Vec, GloVe, or more advanced methods like BERT.
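"Closer together" is usually measured with cosine similarity; here is a tiny illustration with made-up 2-D vectors (the words and numbers are hypothetical, not learned values):

```python
import numpy as np

# Hypothetical 2-D embeddings (illustrative values only).
vectors = {
    "cat": np.array([0.90, 0.80]),
    "dog": np.array([0.85, 0.75]),
    "mat": np.array([-0.70, 0.60]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["cat"], vectors["dog"]))  # close to 1: similar meanings
print(cosine(vectors["cat"], vectors["mat"]))  # much lower: dissimilar
```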
Here is how it works in general:
Training Data: Word embeddings are trained on a large corpus of text. This corpus could be
anything from books, articles, or any text where words occur in meaningful contexts.
Context Window: A context window is defined, usually a fixed number of words before and after
the target word. For example, in the sentence "The cat sat on the mat," with a context window of 2,
the context for the word "sat" would be ["The", "cat", "on", "the"] (see the code sketch after this list).
Co-occurrence: The model records how often words appear together within these context
windows. Words that frequently appear together in similar contexts are assumed to have similar
meanings.
Vector Representation: Each word in the vocabulary is represented as a vector in a
high-dimensional space. Initially, these vectors are randomly assigned.
Optimization: The model uses algorithms like Word2Vec, GloVe, or others to adjust the word
vectors. This is done by maximizing the probability of the target word given its context words
(skip-gram model) or vice versa (CBOW - Continuous Bag of Words model). During this process, vectors
of words that appear in similar contexts are moved closer together.
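As referenced in step 2, here is a small Python sketch of steps 2 and 3 for the example sentence (word-level tokens and a window size of 2 are assumed):

```python
from collections import Counter

sentence = "The cat sat on the mat".lower().split()
window = 2

# Step 2: collect (target, context) pairs within the context window.
pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# For the target "sat" this yields the context ["the", "cat", "on", "the"].
print([c for t, c in pairs if t == "sat"])

# Step 3: count how often each (target, context) pair co-occurs.
co_occurrence = Counter(pairs)
print(co_occurrence.most_common(3))
```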
Why do the embeddings change with each training process?
Word2Vec, like many machine learning models, is initialized with random weights. When you restart the
training process, these weights are re-randomized, leading to different embeddings each time. Additionally,
factors like stochastic gradient descent and the shuffling of the training data can introduce variability.
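A toy illustration of where this variability comes from: the same embedding matrix initialized with two different random seeds already starts from different values, so the optimization also ends in different (but usually equally good) embeddings. Shapes and scale here are arbitrary:

```python
import numpy as np

vocab_size, dim = 7, 2

# Two "training runs": the only difference is the random seed used
# to initialize the embedding matrix.
run_1 = np.random.default_rng(1).normal(scale=0.1, size=(vocab_size, dim))
run_2 = np.random.default_rng(2).normal(scale=0.1, size=(vocab_size, dim))

print(run_1[0], run_2[0])  # different starting vectors for the same word
```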
Why it matters for LLMs
Word embeddings enable large language models (LLMs) to understand the context and meaning of words in a way
that is similar to human understanding, enhancing their ability to perform natural language
processing tasks such as translation, sentiment analysis, and text generation.
Word embeddings are a foundational concept for LLMs, bridging the gap between raw text data and the model's
ability to interpret and generate human-like language.
However, in a large language model, much higher dimensional embeddings are used. These embeddings can be
represented in vector spaces with thousands of dimensions. Thus, embedding tables can have thousands of
values for each word. In this example, there were only 2 dimensions, which can be easily visualized on
a map.
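In terms of the embedding table, the difference is only the width of the matrix; a hypothetical comparison of shapes (the vocabulary size and the 4096 dimensions are illustrative assumptions):

```python
import numpy as np

vocab_size = 10_000

demo_table = np.zeros((vocab_size, 2))      # this demo: 2 values per word
llm_table = np.zeros((vocab_size, 4096))    # a typical LLM: thousands of values per word

print(demo_table.shape, llm_table.shape)    # (10000, 2) (10000, 4096)
```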
What do the vector dimensions say?
Different dimensions of the embedding vector may capture different attributes of words.
For example:
Gender: Dimensions can capture gender differences in words (e.g., "king" vs. "queen").
Size: Dimensions can capture notions of size (e.g., "small" vs. "large").
Color: Dimensions can capture color attributes (e.g., "red" vs. "blue").
Nationality: Dimensions can encode cultural or national attributes (e.g., "Italian" vs. "French").
If we were to visualize embedding vectors using techniques like t-SNE or PCA (dimensionality reduction
methods), we could see clusters of semantically related words. For instance:
Words related to "Italy" would form a cluster.
Words related to "colors" (e.g., "red", "blue", "green") would form another cluster.
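A sketch of such a projection with scikit-learn's PCA, assuming a matrix of higher-dimensional embeddings is already available (the words and vectors below are random placeholders, not learned embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Placeholder: 6 words with 300-dimensional embeddings (random here,
# learned vectors in a real model).
words = ["italy", "pasta", "pizza", "red", "blue", "green"]
embeddings = rng.normal(size=(len(words), 300))

# Reduce to 2 dimensions so the words can be drawn on a map.
coords = PCA(n_components=2).fit_transform(embeddings)

for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```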
Here’s a simplified conceptual representation:
Word     1. Gender   2. Size   3. Color   4. "Italianness"
King        1.0        0.5       0.2          0.1
Queen      -1.0        0.5       0.2          0.1
Red         0.0        0.2       1.0          0.0
Italy       0.0        0.3       0.1          1.0
Pasta       0.0        0.2       0.1          0.9
However, it is important to understand that the encoded attributes do not necessarily correspond to
real-world attributes.
Here are a few key points to consider:
Abstract Representations: Some dimensions may encode abstract features that are useful for
the model’s tasks but do not have a clear real-world interpretation. These could be patterns or
dependencies that the model has learned from the data.
Contextual Nuances: Embeddings can capture contextual nuances that are specific to the
dataset or the linguistic patterns in the data. For example, certain dimensions might encode
specific syntactic or grammatical rules.
Task-Specific Features: During fine-tuning on specific tasks, the model might learn
dimensions that are particularly relevant to those tasks, even if they are not easily interpretable
in a real-world context.
High-Dimensional Space: In a high-dimensional embedding space, different dimensions can
interact in complex ways. The resulting embeddings might capture a blend of meaningful attributes
and more abstract, task-specific features.
The famous vector operation King – Man + Woman = Queen often works as a compelling demonstration of
how word embeddings can capture relationships and analogies.
But it may not always be accurate.
In other words, the model finds the attributes (dimensions) based on optimization, not on real meanings.
This is because the model doesn't truly understand the meanings; it only recognizes the relationships
between the words. As a result, the attributes it discovers can, but do not necessarily have to, hold
meaning for people.
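Using the illustrative table above (with assumed vectors for "man" and "woman", which are not in the table), the analogy can be checked with simple vector arithmetic and cosine similarity; this only demonstrates the idea and is not how a real model would be evaluated:

```python
import numpy as np

# Illustrative 4-D vectors, roughly (gender, size, color, "Italianness").
vectors = {
    "king":  np.array([ 1.0, 0.5, 0.2, 0.1]),
    "queen": np.array([-1.0, 0.5, 0.2, 0.1]),
    "man":   np.array([ 1.0, 0.4, 0.1, 0.0]),   # assumed values, not in the table
    "woman": np.array([-1.0, 0.4, 0.1, 0.0]),   # assumed values, not in the table
    "red":   np.array([ 0.0, 0.2, 1.0, 0.0]),
    "italy": np.array([ 0.0, 0.3, 0.1, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

result = vectors["king"] - vectors["man"] + vectors["woman"]

# Find the word closest to the result vector, excluding the inputs.
best = max(
    (w for w in vectors if w not in ("king", "man", "woman")),
    key=lambda w: cosine(result, vectors[w]),
)
print(best)  # "queen" here; in real models this is not guaranteed
```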
How was it built?
This software was created using Vibe Coding with a Large Language Model (LLM) chatbot
and then reworked in look and feel.
Some features had to be implemented manually, and
corrections and improvements had to be made.
The following Vibe Coding prompts were used on Copilot:
"create a single html page with tensorflow that takes a piece of text and creates 2d embeddings.
display
the embeddings on a canvas."