
LLMs Part 1: Understand the meaning of words through AI

This software was largely created by AI Vibe Coding
Created by YouMinds
This AI embedding algorithm visualizes the relationships and meanings of words in a 2D vector space, where every word tells a story through its relative position.
This deep comprehension of word meanings forms the basic building block for large language models (LLMs), enabling them to generate coherent, contextually relevant text.
The meaning of words
The following text contains words whose meanings relative to one another are to be determined and represented as machine-readable numerical values.
Press the Tokenize Text button below to split the given text into tokens and create a vocabulary. Optionally enter your own text as input.
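To give a rough idea of what the Tokenize Text step does, here is a minimal sketch in Python of splitting text into tokens and building a vocabulary. The example sentence and the simple regular expression are illustrative assumptions; the tokenizer used on this page may work differently.

    import re

    text = "The cat sat on the mat. The dog sat on the rug."

    # Very simple tokenizer: lowercase the text and keep only runs of letters.
    # Real tokenizers (and the one behind this page) may be more sophisticated.
    tokens = re.findall(r"[a-z]+", text.lower())

    # The vocabulary assigns every distinct word an index into the embedding table.
    vocab = sorted(set(tokens))
    word_to_id = {word: i for i, word in enumerate(vocab)}

    print(tokens)      # ['the', 'cat', 'sat', 'on', 'the', 'mat', ...]
    print(word_to_id)  # {'cat': 0, 'dog': 1, 'mat': 2, ...}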
The position of the individual words on the map indicates their meaning. These positions can be viewed as coordinates or vectors whose directions and lengths encode the meanings of the words relative to each other. The skip-gram model uses these vectors as input to predict the probability of the context words (the surrounding words). Notice how words with similar meanings form clusters on the map.
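As a rough illustration of how 2D vectors can drive the skip-gram prediction, the sketch below scores every other word by the dot product with the target word's vector and turns the scores into probabilities with a softmax. The word vectors are made-up values, and a real skip-gram model keeps separate input and output vectors; this sketch reuses one set for brevity.

    import numpy as np

    # Made-up 2D word vectors (in the real demo these come from training).
    vectors = {
        "cat": np.array([0.9, 0.2]),
        "dog": np.array([0.8, 0.3]),
        "mat": np.array([0.1, 0.9]),
        "sat": np.array([0.4, 0.7]),
    }

    def context_probabilities(target):
        # Score every other word by how well its vector aligns with the target's.
        words = [w for w in vectors if w != target]
        scores = np.array([vectors[target] @ vectors[w] for w in words])
        probs = np.exp(scores) / np.exp(scores).sum()   # softmax
        return dict(zip(words, probs.round(3)))

    print(context_probabilities("cat"))
    # "dog" gets the highest probability because its vector points the same way.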
The final embedding
These vectors can now be represented in a table as groups of numbers, giving us an encoded version of the meaning of each word.
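Here is a minimal sketch of such a table, with made-up numbers: each vocabulary word owns one row, and looking up a word simply returns its row.

    import numpy as np

    vocab = ["the", "cat", "sat", "on", "mat"]
    word_to_id = {w: i for i, w in enumerate(vocab)}

    # Hypothetical final embedding table: one row (group of numbers) per word.
    embedding_table = np.array([
        [0.05, -0.10],  # the
        [0.92,  0.21],  # cat
        [0.40,  0.71],  # sat
        [0.02, -0.15],  # on
        [0.10,  0.88],  # mat
    ])

    def embed(word):
        # The encoded meaning of a word is just its row in the table.
        return embedding_table[word_to_id[word]]

    print(embed("cat"))  # [0.92 0.21]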
What is word embedding or encoding, anyway?
Children learn by observing relationships between objects and their context without needing to know their exact meanings initially. They understand how objects interact and relate to each other, much like word embeddings capture the semantic relationships between words based solely on context and co-occurrence patterns.
This relational learning allows for a deep and flexible understanding even without explicit definitions.
Word embeddings are a type of word representation that allows words to be represented as vectors in a continuous vector space. This representation captures semantic relationships between words, such that words with similar meanings are located closer together in the vector space. Word embeddings are generated using techniques such as Word2Vec, GloVe, or more advanced methods like BERT.
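For reference, libraries such as gensim expose Word2Vec directly. The snippet below is a sketch assuming gensim 4.x is installed; the toy sentences and parameter values are illustrative choices, not what this page uses internally (the demo itself runs on TensorFlow in the browser).

    from gensim.models import Word2Vec

    # A toy corpus of pre-tokenized sentences; real embeddings need far more text.
    sentences = [
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
        ["the", "cat", "chased", "the", "dog"],
    ]

    # Skip-gram Word2Vec (sg=1) with 2-dimensional vectors, as in this demo.
    model = Word2Vec(sentences, vector_size=2, window=2, min_count=1, sg=1, epochs=200)

    print(model.wv["cat"])               # the two numbers encoding "cat"
    print(model.wv.most_similar("cat"))  # words whose vectors are closest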
Here is how it works in general:
  1. Training Data: Word embeddings are trained on a large corpus of text. This corpus could be anything from books and articles to any other text where words occur in meaningful contexts.
  2. Context Window: A context window is defined, usually a fixed number of words before and after the target word. For example, in the sentence "The cat sat on the mat," with a context window of 2, the context for the word "sat" would be ["The", "cat", "on", "the"].
  3. Co-occurrence: The model records how often words appear together within these context windows. Words that frequently appear together in similar contexts are assumed to have similar meanings.
  4. Vector Representation: Each word in the vocabulary is represented as a vector in a high-dimensional space. Initially, these vectors are randomly assigned.
  5. Optimization: The model uses algorithms like Word2Vec, GloVe, or others to adjust the word vectors. This is done by maximizing the probability of the context words given the target word (skip-gram model) or vice versa (CBOW - Continuous Bag of Words model). During this process, vectors of words that appear in similar contexts are moved closer together; a minimal sketch of the whole procedure follows after this list.
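The sketch below runs steps 1-5 on a toy corpus in plain NumPy. The corpus, window size, learning rate, and epoch count are arbitrary illustrative choices, and real implementations (Word2Vec with negative sampling, GloVe, the TensorFlow code behind this page) are considerably more efficient.

    import numpy as np

    rng = np.random.default_rng(0)

    # 1. Training data: a tiny toy corpus.
    corpus = "the cat sat on the mat the dog sat on the rug".split()
    vocab = sorted(set(corpus))
    w2i = {w: i for i, w in enumerate(vocab)}
    V, D = len(vocab), 2                      # vocabulary size, embedding dimensions

    # 2. + 3. Context window: collect (target, context) pairs within +/- 2 words.
    window = 2
    pairs = [(w2i[corpus[i]], w2i[corpus[j]])
             for i in range(len(corpus))
             for j in range(max(0, i - window), min(len(corpus), i + window + 1))
             if i != j]

    # 4. Vector representation: start from small random vectors.
    W_in = rng.normal(scale=0.1, size=(V, D))   # input (word) vectors
    W_out = rng.normal(scale=0.1, size=(V, D))  # output (context) vectors

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # 5. Optimization: nudge the vectors so each target word assigns
    #    high probability to the context words actually observed next to it.
    lr = 0.1
    for epoch in range(200):
        for target, context in pairs:
            v = W_in[target]
            probs = softmax(W_out @ v)        # predicted context distribution
            grad = probs.copy()
            grad[context] -= 1.0              # gradient of the cross-entropy loss
            grad_in = W_out.T @ grad
            W_out -= lr * np.outer(grad, v)   # update context vectors
            W_in[target] -= lr * grad_in      # update the target word's vector

    for word in vocab:
        print(f"{word:4s} {np.round(W_in[w2i[word]], 2)}")
    # Words used in similar contexts ("mat"/"rug", "cat"/"dog") end up close together.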
Why do the embeddings change with each training process?
Word2Vec, like many machine learning models, is initialized with random weights. When you restart the training process, these weights are re-randomized, leading to different embeddings each time. Additionally, factors like stochastic gradient descent and the shuffling of the training data can introduce variability.
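A tiny sketch of that effect: two runs with different random seeds start from different vectors, so training ends somewhere different each time, even on identical data (the seeds and sizes below are arbitrary).

    import numpy as np

    V, D = 10, 2  # arbitrary vocabulary size and embedding dimensions

    # Each restart re-randomizes the starting vectors, so the optimization
    # converges to a different (but usually similarly structured) embedding.
    run_1 = np.random.default_rng(1).normal(scale=0.1, size=(V, D))
    run_2 = np.random.default_rng(2).normal(scale=0.1, size=(V, D))

    print(np.allclose(run_1, run_2))  # False: the two runs start in different places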
Why it matters for LLMs
Word embeddings enable large language models (LLMs) to understand the context and meaning of words in a way that is similar to human understanding, enhancing their ability to perform natural language processing tasks such as translation, sentiment analysis, and text generation. Word embeddings are a foundational concept for LLMs, bridging the gap between raw text data and the model's ability to interpret and generate human-like language.
However, in a large language model, much higher dimensional embeddings are used. These embeddings can be represented in vector spaces with thousands of dimensions. Thus, embedding tables can have thousands of values for each word. In this example, there were only 2 dimensions, which can be easily visualized on a map.
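As a sketch of scale only (the sizes below are made up, not taken from any particular model), an LLM-sized embedding table is simply a much larger matrix:

    import numpy as np

    # Hypothetical LLM-scale sizes: tens of thousands of tokens,
    # each encoded by a vector with thousands of dimensions.
    vocab_size, d_model = 50_000, 4_096
    embedding_table = np.zeros((vocab_size, d_model), dtype=np.float32)

    print(embedding_table.shape)      # (50000, 4096)
    print(embedding_table[123].size)  # one token -> 4096 numbers instead of 2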
What do the vector dimensions say?
Different dimensions of the embedding vector may capture different attributes of words. If we were to visualize embedding vectors using dimensionality reduction techniques like t-SNE or PCA, we could see clusters of semantically related words. Here is a simplified conceptual representation:

4 dimensions: 1. Gender, 2. Size, 3. Color, 4. "Italianness"

          Gender   Size   Color   "Italianness"
King        1.0    0.5     0.2        0.1
Queen      -1.0    0.5     0.2        0.1
Red         0.0    0.2     1.0        0.0
Italy       0.0    0.3     0.1        1.0
Pasta       0.0    0.2     0.1        0.9
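Continuing the t-SNE/PCA idea mentioned above, the sketch below projects the conceptual 4-dimensional vectors from this table down to 2 dimensions so they could be drawn on a map. It assumes scikit-learn is available; with only five words the picture is purely illustrative.

    import numpy as np
    from sklearn.decomposition import PCA

    words = ["King", "Queen", "Red", "Italy", "Pasta"]
    vectors = np.array([
        [ 1.0, 0.5, 0.2, 0.1],
        [-1.0, 0.5, 0.2, 0.1],
        [ 0.0, 0.2, 1.0, 0.0],
        [ 0.0, 0.3, 0.1, 1.0],
        [ 0.0, 0.2, 0.1, 0.9],
    ])

    # Reduce 4 dimensions to 2 for plotting; related words ("Italy", "Pasta")
    # should land close to each other in the 2D projection.
    coords = PCA(n_components=2).fit_transform(vectors)
    for word, (x, y) in zip(words, coords):
        print(f"{word:6s} {x:+.2f} {y:+.2f}")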
However, it is important to understand that the encoded attributes do not necessarily correspond to real-world attributes.
Here are a few key points to consider:
The famous vector operation King – Man + Woman = Queen often works as a compelling demonstration of how word embeddings can capture relationships and analogies. But it may not always be accurate.
In other words, the model finds the attributes (dimensions) based on optimization, not on real meanings. This is because the model doesn't truly understand the meanings; it only recognizes the relationships between the words. As a result, the attributes it discovers may, but do not have to, be meaningful to people.
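Here is a small numerical sketch of that vector operation, using made-up embeddings; real models learn theirs from data, and the analogy does not always resolve this cleanly.

    import numpy as np

    # Hypothetical 3-dimensional embeddings chosen so the analogy works out.
    emb = {
        "king":  np.array([ 1.0, 0.9, 0.1]),
        "queen": np.array([-1.0, 0.9, 0.1]),
        "man":   np.array([ 1.0, 0.1, 0.0]),
        "woman": np.array([-1.0, 0.1, 0.0]),
        "apple": np.array([ 0.0, 0.0, 1.0]),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # king - man + woman should land closest to queen (excluding the input words).
    target = emb["king"] - emb["man"] + emb["woman"]
    best = max((w for w in emb if w not in ("king", "man", "woman")),
               key=lambda w: cosine(target, emb[w]))
    print(best)  # queen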
How it was built
This software was created using Vibe Coding with a large language model (LLM) chatbot and then reworked in look & feel.
Some features had to be implemented manually, and corrections and improvements had to be made.
The following Vibe Coding prompts were used on Copilot:
"create a single html page with tensorflow that takes a piece of text and creates 2d embeddings. display the embeddings on a canvas."
"does not work also include a progress bar"
"the process gets stuck"