Tokens, Embeddings, and Vectors: Key Concepts Every AI Developer Should Know

When you start working seriously with LLMs, three concepts appear constantly in documentation, papers, and technical conversations: tokens, embeddings, and vectors. They're frequently used without explanation, on the assumption that the reader already knows them.

This guide explains all three from scratch, with concrete examples and working code. It skips abstract theory and focuses on what you need to understand to make sound technical decisions when building with AI.

Tokens: How Models Read Text

LLMs don't process text character by character or word by word. They process tokens — fragments of text that can be complete words, parts of words, punctuation marks, or spaces.

Tokenization is the process of converting text into tokens before the model processes it. Each model uses its own tokenizer, though many share the same base system.

How Tokenization Works

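The most common tokenization in modern LLMs uses an algorithm called Byte Pair Encoding (BPE). BPE builds its vocabulary by repeatedly merging the most frequent pair of adjacent symbols in a training corpus. The practical result: frequent words become a single token, while rare words are split into several smaller, more common pieces.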
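
To make the merge idea concrete, here is a toy sketch of BPE training on a tiny made-up corpus. It works on characters rather than bytes and skips the pre-tokenization that production tokenizers apply, so treat it as an illustration of the principle, not as how tiktoken is implemented. The function names (learn_merges, merge_pair, tokenize) are invented for this example.

```python
from collections import Counter

Symbols = tuple[str, ...]

def merge_pair(symbols: Symbols, pair: tuple[str, str]) -> Symbols:
    """Fuse every adjacent occurrence of `pair` into a single symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    # Every word starts out as a sequence of single characters.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best_pair)] += freq
        vocab = new_vocab
    return merges

def tokenize(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

# "hello" is frequent in this toy corpus, "help" is rare.
corpus = ["hello"] * 10 + ["help"]
merges = learn_merges(corpus, num_merges=4)
print(tokenize("hello", merges))  # ['hello']
print(tokenize("help", merges))   # ['hel', 'p']
```

The frequent word ends up as one token, while the rare word is covered by a shared prefix plus a leftover character. Real tokenizers show the same pattern with a far larger vocabulary.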

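Illustrative examples with the GPT-4o tokenizer (exact splits depend on the tokenizer version and on surrounding whitespace, so verify them with the tool or code below):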

  • "hello" → 1 token
  • "programming" → 1 token
  • "embeddings" → 2 tokens (embed + dings)
  • "tokenization" → 3 tokens (token + iz + ation)
  • " " (space) → often part of the following token

You can experiment with any text in the official OpenAI tokenizer to see exactly how it gets split.

Why Tokenization Matters

Cost: commercial models charge per token, not per word or character. Understanding tokenization helps estimate costs accurately.

Context limits: the context window is measured in tokens. A 10,000-word document in English can be 13,000-15,000 tokens.

Model behavior: models reason over tokens, not words. This explains some strange behaviors. For example, models sometimes miscount the letters in a word, because they see multi-character tokens rather than individual letters.
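
The cost and context-limit points can be turned into quick back-of-the-envelope helpers. This is a rough sketch: the price used below is a made-up placeholder (check your provider's current pricing), the 1.4 tokens-per-word ratio is just the midpoint of the range quoted above, and the function names are invented for this example.

```python
def estimate_tokens_from_words(word_count: int, tokens_per_word: float = 1.4) -> int:
    """Rough token estimate for English prose; use a real tokenizer for exact counts."""
    return round(word_count * tokens_per_word)

def estimate_cost_usd(num_tokens: int, price_per_million_usd: float) -> float:
    """Cost of processing num_tokens at a given price per one million tokens."""
    return num_tokens / 1_000_000 * price_per_million_usd

def fits_in_context(prompt_tokens: int, context_window: int, reserved_for_output: int = 0) -> bool:
    """Check that the prompt plus the space reserved for the answer fits in the window."""
    return prompt_tokens + reserved_for_output <= context_window

HYPOTHETICAL_PRICE_PER_MILLION = 2.50  # placeholder value, not a real price

doc_tokens = estimate_tokens_from_words(10_000)
print(f"Estimated tokens: {doc_tokens}")
print(f"Estimated input cost: ${estimate_cost_usd(doc_tokens, HYPOTHETICAL_PRICE_PER_MILLION):.4f}")
print(f"Fits in an 8,000-token window: {fits_in_context(doc_tokens, 8_000)}")
print(f"Fits in a 128,000-token window: {fits_in_context(doc_tokens, 128_000, reserved_for_output=4_000)}")
```

For real budgeting, replace the word-based estimate with an exact count from the model's tokenizer.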

Counting Tokens with Code

import tiktoken

# tiktoken is OpenAI's tokenization library
# pip install tiktoken

encoder = tiktoken.encoding_for_model("gpt-4o")

text = "Embeddings are vector representations of the meaning of text."
tokens = encoder.encode(text)

print(f"Text: {text}")
print(f"Number of tokens: {len(tokens)}")
print(f"Token IDs: {tokens}")

# Decode individual tokens to see them
for token_id in tokens:
    print(f"  {token_id} → '{encoder.decode([token_id])}'")

The tiktoken library is open source and the same one OpenAI uses internally.

Embeddings: Representing Meaning as Numbers

An embedding is a numerical representation of the meaning of a piece of text. Instead of working with words, the model works with vectors — lists of numbers — where texts with similar meanings produce similar vectors.

The Intuition Behind Embeddings

Imagine you had to represent the meaning of words in a two-dimensional space. You might place "dog" and "cat" close together (both are pets), far from "automobile", and "puppy" very close to "dog" (related concept).

Embeddings do exactly this, but in spaces of hundreds or thousands of dimensions. Each dimension captures some aspect of meaning — though individual dimensions don't have a direct human interpretation.

What does have a direct interpretation is the distance between vectors: two texts with similar meaning have nearby vectors. Two texts with different meaning have distant vectors.
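
The two-dimensional intuition can be played with directly. In this sketch the coordinates are invented by hand purely for illustration; real embeddings come from a trained model and have far more dimensions.

```python
import math

# Hand-picked 2D "embeddings" (made-up values, only for illustration).
toy_vectors = {
    "dog": (1.0, 1.0),
    "puppy": (1.1, 1.2),
    "cat": (1.5, 0.5),
    "automobile": (8.0, 7.0),
}

def distance(word_a: str, word_b: str) -> float:
    """Euclidean distance between two words in the toy space."""
    return math.dist(toy_vectors[word_a], toy_vectors[word_b])

def nearest(word: str) -> str:
    """Return the other word whose vector is closest to `word`."""
    others = (w for w in toy_vectors if w != word)
    return min(others, key=lambda other: distance(word, other))

for other in ("puppy", "cat", "automobile"):
    print(f"dog -> {other}: {distance('dog', other):.2f}")

print(nearest("dog"))  # puppy
```

Related concepts sit close together and unrelated ones sit far apart, which is the only property the rest of this guide relies on.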

Generating Embeddings with OpenAI

import os
from openai import OpenAI
import numpy as np

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def generate_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Generate embeddings for several sentences
sentences = [
    "The dog plays in the garden.",
    "The puppy runs through the park.",
    "Artificial intelligence is transforming industry.",
    "Language models process text sequences."
]

embeddings = [generate_embedding(s) for s in sentences]

print(f"Dimensions of each embedding: {len(embeddings[0])}")
# text-embedding-3-small produces vectors of 1536 dimensions

The text-embedding-3-small model produces vectors of 1536 dimensions. The text-embedding-3-large model produces 3072 dimensions with higher semantic precision. Pricing and specifications are in the OpenAI embeddings documentation.

Measuring Similarity Between Embeddings

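The most widely used metric for comparing embeddings is cosine similarity, which measures the angle between two vectors and ignores their length. A value of 1 means the vectors point in the same direction, 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions.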

def cosine_similarity(vec1: list, vec2: list) -> float:
    a = np.array(vec1)
    b = np.array(vec2)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare sentences against each other
for i, sentence_a in enumerate(sentences):
    for j, sentence_b in enumerate(sentences):
        if i < j:
            sim = cosine_similarity(embeddings[i], embeddings[j])
            print(f"Similarity between sentence {i+1} and sentence {j+1}: {sim:.3f}")
            print(f"  '{sentence_a[:50]}'")
            print(f"  '{sentence_b[:50]}'")

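Expected result: sentences 1 and 2 (about the dog/puppy) should score clearly higher with each other than with sentences 3 and 4, and sentences 3 and 4 (about AI) should also be relatively similar to each other. The absolute values depend on the embedding model, so the ranking is more meaningful than any fixed threshold.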
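
The reference values of cosine similarity (1, 0, and -1) are easy to check on hand-made vectors. This sketch uses only the standard library and made-up 2D vectors. The helper name cosine_similarity_plain is invented here to avoid clashing with the NumPy version above, and it assumes neither vector is all zeros.

```python
import math

def cosine_similarity_plain(a: list[float], b: list[float]) -> float:
    """Cosine similarity without NumPy; assumes non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Same direction, different length: length is ignored.
same_direction = cosine_similarity_plain([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: no shared direction at all.
orthogonal = cosine_similarity_plain([1.0, 0.0], [0.0, 3.0])
# Exactly opposite directions.
opposite = cosine_similarity_plain([1.0, 2.0], [-1.0, -2.0])

print(f"{same_direction:.1f}")  # 1.0
print(f"{orthogonal:.1f}")      # 0.0
print(f"{opposite:.1f}")        # -1.0
```

In practice, similarities between real text embeddings rarely reach these extremes, which is why comparing scores against each other is usually more informative than reading a single score in isolation.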

Open Source Embedding Models

If you don't want to depend on the OpenAI API for embeddings, there are high-quality open source models, such as all-MiniLM-L6-v2 (used in the example below). You can run them locally with the sentence-transformers library:

# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

sentences = ["The dog plays in the garden.", "The puppy runs through the park."]
embeddings = model.encode(sentences)

# cosine_similarity() is the helper defined in the previous section
sim = cosine_similarity(embeddings[0], embeddings[1])
print(f"Similarity: {sim:.3f}")

Vectors and Vector Databases

Once you have embeddings, you need to store them and search through them efficiently. Traditional relational databases aren't designed for this out of the box: finding the most similar vectors among millions requires nearest-neighbor operations that standard SQL indexes don't handle well.

Vector databases are optimized specifically for similarity search at scale.

How Vector Search Works

The basic process:

  1. You index your documents: generate an embedding for each one and store it in the vector database
  2. When a query arrives: generate the embedding of the query
  3. Search for the N most similar vectors to the query (nearest neighbor search)
  4. Return the corresponding documents

Exact nearest neighbor search compares the query against every stored vector, which becomes too slow for interactive use once a collection holds millions of vectors. Vector databases use approximate nearest neighbor (ANN) algorithms that trade a small amount of precision for speed. One of the most widely used algorithms is HNSW (Hierarchical Navigable Small World).
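
For reference, the exact (brute-force) version of the search steps above fits in a few lines. This sketch uses made-up 3D vectors and invented document IDs in place of real embeddings, and it scans every vector on each query, which is precisely the linear cost that ANN indexes such as HNSW are designed to avoid.

```python
import heapq
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; assumes non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def exact_top_k(query: list[float], index: dict[str, list[float]], k: int) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = ((cosine(query, vector), doc_id) for doc_id, vector in index.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]

# Step 1: "index" documents (toy vectors stand in for real embeddings).
toy_index = {
    "doc_python": [0.9, 0.1, 0.0],
    "doc_javascript": [0.8, 0.6, 0.0],
    "doc_transformers": [0.0, 0.2, 0.9],
}

# Step 2: embed the query (again a made-up vector).
query_vector = [0.7, 0.7, 0.0]

# Steps 3 and 4: find the most similar vectors and return their documents.
for doc_id, score in exact_top_k(query_vector, toy_index, k=2):
    print(f"{doc_id}: {score:.3f}")
```

A full scan like this is perfectly reasonable for a few thousand vectors. The approximate indexes only start to pay off as the collection grows.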

Example with Chroma

Chroma is the simplest open source vector database to get started with:

# pip install chromadb
import chromadb
from openai import OpenAI
import os

openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
chroma_client = chromadb.Client()  # in-memory client: data is not persisted between runs

collection = chroma_client.create_collection(
    name="documents",
    metadata={"hnsw:space": "cosine"}
)

# Documents to index
documents = [
    {"id": "doc1", "text": "Python is an interpreted, high-level programming language."},
    {"id": "doc2", "text": "JavaScript is the primary language for frontend web development."},
    {"id": "doc3", "text": "Transformers are the base architecture of modern LLMs."},
    {"id": "doc4", "text": "Supervised learning requires labeled data to train models."},
]

# Generate embeddings and index
for doc in documents:
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=doc["text"]
    )
    embedding = response.data[0].embedding

    collection.add(
        ids=[doc["id"]],
        embeddings=[embedding],
        documents=[doc["text"]]
    )

# Semantic search
query = "What language is used to build websites?"
query_response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=query
)
query_embedding = query_response.data[0].embedding

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=2
)

print(f"Query: {query}")
print("Most relevant results:")
for doc in results["documents"][0]:
    print(f"  - {doc}")

The search finds "JavaScript is the primary language for frontend web development" even though the query uses different words — because embeddings capture meaning, not exact words.

Vector Databases for Production

For production at larger scale, the main options:

  • Pinecone: managed service, easy to scale, free tier available
  • Qdrant: open source with cloud option, excellent performance and advanced filtering
  • Weaviate: open source with cloud option, good LangChain integration
  • pgvector: PostgreSQL extension for vectors, ideal if you already use Postgres and volume isn't massive
  • Milvus: open source designed for very high scale, more complex to operate

How These Concepts Connect

The three concepts form a chain:

Tokens → the model reads text as tokens
Embeddings → converts those tokens into vectors that represent meaning
Vectors → stored in vector databases for efficient semantic search

In a complete RAG system:

Document → Tokenization → Embedding → Vector store
Query → Tokenization → Embedding → Similarity search → Relevant documents → LLM → Response

Each step transforms text into a more useful representation for the next:

# The complete flow in code
def complete_rag_pipeline(query: str, collection, openai_client, llm_client) -> str:

    # 1. Convert query to embedding
    emb_response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = emb_response.data[0].embedding

    # 2. Search for similar documents in the vector store
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=3
    )
    context = "\n".join(results["documents"][0])

    # 3. Generate response with the LLM using retrieved context
    llm_response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "Answer using only the information in the context. If it's not there, say so."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}"
            }
        ]
    )

    return llm_response.choices[0].message.content

Resources for Going Deeper

If you want to understand the math behind embeddings, the original Word2Vec paper — the model that popularized word embeddings — is at arxiv.org/abs/1301.3781. The transformer paper, which is the foundation of modern LLMs, is Attention Is All You Need.

For multilingual embeddings and comparative benchmarks, the MTEB Leaderboard on Hugging Face is the most up-to-date reference.