Embeddings and Vector Stores
What Are Embeddings
Concept
An embedding is a dense, fixed-length numerical vector that represents the semantic meaning of text (or any other modality). The key property: semantically similar inputs map to nearby points in the embedding space, so similarity in vector space approximates similarity in meaning.
How embeddings are created: A neural network (the embedding model) is trained on a large corpus with an objective that forces semantically related text to produce similar output vectors. For sentence transformers, the training objective is typically contrastive learning: similar sentences have high cosine similarity, dissimilar sentences have low cosine similarity.
Dimensionality:
- text-embedding-004 (Google): 768 dimensions
- text-embedding-3-large (OpenAI): 3072 dimensions (also supports Matryoshka compression)
- bge-large-en-v1.5 (open source): 1024 dimensions
- Larger dimensions ≠ always better. Matryoshka Representation Learning (MRL) allows truncating to lower dimensions with minimal quality loss.
Bi-encoder vs Cross-encoder:
| Bi-encoder | Cross-encoder | |
|---|---|---|
| How | Query and document embedded independently | Query+document concatenated, scored together |
| Speed | Fast — embed query once, search all docs | Slow — must run model on each (query, doc) pair |
| Quality | Good | Excellent |
| Use in RAG | First-stage retrieval | Re-ranking top-k results |
| Example | text-embedding-004 |
cross-encoder/ms-marco-MiniLM-L-6-v2 |
Why this architecture matters: You can't afford to run a cross-encoder at query time against 10M documents. Bi-encoder is fast enough for ANN search; cross-encoder is fast enough to re-score only the top-100 retrieved candidates.
Code
from google.cloud import aiplatform
from vertexai.language_models import TextEmbeddingModel
import numpy as np
# Google's text-embedding-004 via Vertex AI
model = TextEmbeddingModel.from_pretrained("text-embedding-004")
def embed_text(texts: list[str]) -> list[list[float]]:
embeddings = model.get_embeddings(texts)
return [e.values for e in embeddings]
# Embed query and document
query_vec = embed_text(["What is the refund policy?"])[0]
doc_vec = embed_text(["Customers may request a full refund within 30 days."])[0]
# Cosine similarity
def cosine_sim(a, b):
a, b = np.array(a), np.array(b)
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Similarity: {cosine_sim(query_vec, doc_vec):.4f}")
How Semantic Search Works
Concept
Semantic search is approximate nearest neighbor (ANN) search in embedding space. The full pipeline:
Query text → embedding model → query vector (768-dim)
↓
ANN index (HNSW / IVF)
↓
top-k most similar vectors → retrieve document chunks
Why ANN and not exact nearest neighbor? Exact k-NN requires computing the distance from the query to every vector in the index — O(n × d) where n is corpus size and d is dimensions. At 10M documents with 768 dimensions, that's 7.68 billion multiplications per query. ANN algorithms trade a small recall loss (typically 1-5%) for 10-1000x speedup.
The Recall@k metric: For a given query, Recall@k = (number of relevant docs in top-k) / (total relevant docs). At Recall@10 = 0.85, 15% of relevant documents are missed by the first-stage ANN retrieval.
Indexing trade-off (build time vs query time vs recall):
- More connections in HNSW (higher M parameter) → better recall, slower build
- More IVF clusters → better recall at scale, slower build
- Adding PQ compression → smaller memory footprint, slightly lower recall
Code
import faiss
import numpy as np
# Build an HNSW index (good default for most production use cases)
d = 768 # embedding dimensions
M = 32 # number of connections per node (higher = better recall, more memory)
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = 200 # search depth during construction (higher = better recall)
# Add document vectors
doc_vectors = np.array(embed_text(documents)).astype('float32')
index.add(doc_vectors)
# Search at query time
index.hnsw.efSearch = 64 # search depth at query time (can be tuned after build)
query_vector = np.array(embed_text(["refund policy"])[0]).astype('float32').reshape(1, -1)
distances, indices = index.search(query_vector, k=4) # returns top 4 nearest neighbors
retrieved_docs = [documents[i] for i in indices[0]]
Similarity Metrics
Concept
Three similarity/distance metrics are commonly used. Knowing which to use when is a genuine interview question.
Cosine Similarity
Measures the angle between vectors, ignoring magnitude. Two vectors pointing in the same direction have cosine similarity = 1, regardless of their length. Best for: text embeddings, because the magnitude of an embedding often reflects the "strength" of features, not relevance.Dot Product (Inner Product)
Measures both angle AND magnitude. If vectors are L2-normalized (unit vectors), dot product equals cosine similarity. Best for: embeddings that have been explicitly trained with dot product (e.g., some bi-encoder training regimes optimize dot product directly).Euclidean Distance (L2)
Measures straight-line distance in the vector space. For embeddings with unit norm, L2 distance and cosine similarity are equivalent (monotonically related). Best for: low-dimensional embeddings, or when magnitude differences are meaningful (e.g., image embeddings).When cosine similarity outperforms dot product for text:
Text embedding models like text-embedding-004 output vectors with varying L2 norms. The norm often correlates with confidence or lexical density, not semantic relevance. Normalizing to unit vectors (which cosine similarity does implicitly) removes this bias. Using raw dot product would unfairly rank verbose, information-dense documents higher regardless of topical relevance.
Practical note: Most vector databases use cosine similarity as default for text. Always check what metric your embedding model was trained with — using a different metric degrades recall.
Code
import numpy as np
def cosine_similarity(a: list[float], b: list[float]) -> float:
a, b = np.array(a), np.array(b)
return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
def dot_product(a: list[float], b: list[float]) -> float:
return float(np.dot(np.array(a), np.array(b)))
def euclidean_distance(a: list[float], b: list[float]) -> float:
return float(np.linalg.norm(np.array(a) - np.array(b)))
# When vectors are L2-normalized: cosine ≈ dot product
a_norm = a / np.linalg.norm(a)
b_norm = b / np.linalg.norm(b)
assert abs(cosine_similarity(a_norm, b_norm) - dot_product(a_norm, b_norm)) < 1e-6
ANN Algorithms: HNSW, IVF, and PQ
Concept
HNSW (Hierarchical Navigable Small World) A graph-based ANN algorithm. Builds a multilayer graph where nodes are vectors and edges connect neighbors. Search navigates this graph greedily, starting from an entry point at the top layer and descending to find nearest neighbors.
- Pros: Very high recall at low k, fast query time, works well in-memory
- Cons: Build is slower for large corpora; memory-heavy (each vector has M outgoing edges stored)
- When to use: Default choice for most RAG systems up to ~10M vectors
IVF (Inverted File Index)
Clusters vectors into nlist Voronoi cells (typically using k-means). At query time, searches only nprobe of the closest clusters rather than all clusters.
- Pros: More memory-efficient than HNSW, scales to hundreds of millions of vectors
- Cons: Lower recall unless
nprobeis set high; recall degrades for out-of-distribution queries - When to use: Very large corpora (>10M vectors) where HNSW's memory footprint is prohibitive
PQ (Product Quantization) Compresses each vector by dividing it into M sub-vectors and quantizing each sub-vector to a codebook entry. Reduces memory 4-16x at the cost of 5-10% recall drop.
- Typically combined with IVF:
IndexIVFPQin FAISS = clustering + compression - When to use: When memory is the primary constraint and 5-10% recall loss is acceptable
Vertex AI Vector Search uses HNSW internally and handles replication, scaling, and updates for you.
Code
import faiss
d = 768 # embedding dimension
n_vecs = 1_000_000 # 1M documents
# HNSW — best recall for moderate scale
hnsw_index = faiss.IndexHNSWFlat(d, 32) # M=32
# IVF — for large scale
nlist = 4096 # number of clusters (rule of thumb: sqrt(n_vecs) to 4*sqrt(n_vecs))
quantizer = faiss.IndexFlatL2(d)
ivf_index = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf_index.train(doc_vectors) # must train on a representative sample
ivf_index.nprobe = 64 # search 64 of 4096 clusters (tune for recall vs speed)
# IVF + PQ — maximum compression
M_pq = 8 # number of sub-vectors
nbits = 8 # bits per sub-vector (256 centroids)
ivfpq_index = faiss.IndexIVFPQ(quantizer, d, nlist, M_pq, nbits)
ivfpq_index.train(doc_vectors)
# Memory: d*4 bytes/vector (flat) → M_pq bytes/vector (PQ) → 96x compression for d=768, M_pq=8
Vector Databases
Concept
A vector database stores embedding vectors alongside metadata and provides ANN search, filtering, and CRUD operations at scale. Choosing the right one is a legitimate system design decision.
| Database | Type | Scale | Filter support | Managed | When to use |
|---|---|---|---|---|---|
| Chroma | In-process / local | Small (<1M) | Basic | No | Local dev, prototypes |
| Weaviate | Self-hosted / cloud | Medium-Large | Rich (GraphQL) | Yes (cloud) | Multi-tenancy, complex filters |
| Pinecone | Fully managed | Very large | Metadata filters | Yes | Teams wanting no infra |
| Qdrant | Self-hosted / cloud | Large | Rich (JSON payload) | Yes (cloud) | Full control + payload filtering |
| pgvector | PostgreSQL extension | Medium | Full SQL | No (self-host) | Existing Postgres infra |
| Vertex AI Vector Search | GCP managed | Very large (1B+) | Numeric/string | Yes | GCP-native, Gemini integration |
| AlloyDB (pgvector) | GCP managed Postgres | Medium-Large | Full SQL | Yes | SQL + vector, GCP |
Key capabilities to evaluate: - Metadata pre-filtering — filter by document source, date, category BEFORE ANN search (significant performance win) - Multi-tenancy — namespace/tenant isolation so one tenant's data isn't retrievable by another - Hybrid search — built-in BM25 + vector search (Weaviate, Qdrant) - Streaming updates — can you upsert vectors in real-time or only in batch?
Code (Chroma local + Pinecone production pattern)
# Local development with Chroma
from langchain_community.vectorstores import Chroma
local_vs = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
collection_metadata={"hnsw:space": "cosine"} # explicit metric
)
# Production with Pinecone
from langchain_pinecone import PineconeVectorStore
import pinecone
pc = pinecone.Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
index = pc.Index("knowledge-base")
prod_vs = PineconeVectorStore(
index=index,
embedding=embeddings,
namespace="enterprise-docs" # tenant isolation
)
# Metadata pre-filtering
results = prod_vs.similarity_search(
query="refund policy",
k=4,
filter={"department": "legal", "year": {"$gte": 2023}} # pre-filter before ANN
)
Embedding Model Selection
Concept
Picking the wrong embedding model is one of the most common production RAG mistakes. The model determines the ceiling of retrieval quality — no amount of downstream reranking recovers from poor embeddings.
MTEB (Massive Text Embedding Benchmark) is the authoritative benchmark for evaluating embedding models. Key tasks: retrieval, reranking, classification, clustering. Always check MTEB leaderboard scores in your domain.
Selection framework:
| Criterion | Questions to Ask |
|---|---|
| Quality | What's the MTEB Retrieval score? Does it have a domain-specific eval? |
| Dimension | 768 is practical; 1536+ improves quality but increases storage/compute cost |
| Latency | Can you embed queries within your SLA? (typical: <50ms for bi-encoder) |
| Language | Multilingual required? → multilingual-e5-large or text-multilingual-embedding-002 |
| Cost | API calls per query → cached? Off-the-shelf? Self-hosted? |
| Domain | Medical, legal, code — domain-specific models often outperform general-purpose by 5-15% |
Key models (2024-2025):
| Model | Dims | MTEB Avg | Notes |
|---|---|---|---|
text-embedding-004 (Google) |
768 | 66.3 | Solid all-around, GCP native |
text-embedding-3-large (OpenAI) |
3072 | 64.6 | MRL: can truncate to 256 |
bge-large-en-v1.5 (BAAI) |
1024 | 64.2 | Best open-source for English |
e5-mistral-7b-instruct |
4096 | 66.6 | Instruction-tuned, best quality |
multilingual-e5-large |
1024 | 62.2 | Best open multilingual |
text-multilingual-embedding-002 (Google) |
768 | ~62 | GCP multilingual |
Matryoshka Representation Learning (MRL): Models trained with MRL (like OpenAI's text-embedding-3 series) encode information hierarchically — truncating the vector to fewer dimensions preserves most quality. This lets you store compressed vectors at low cost and expand to full dimensions for quality-critical reranking.
Code
# Comparing embedding models for domain-specific retrieval
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from sentence_transformers import SentenceTransformer
import numpy as np
# Google text-embedding-004
google_model = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
# BGE-large (open source, self-hosted)
bge_model = SentenceTransformer("BAAI/bge-large-en-v1.5")
def evaluate_retrieval(model, query: str, docs: list[str], relevant_idx: int) -> float:
"""Returns rank of the relevant document (lower = better)."""
if hasattr(model, 'embed_query'):
q_vec = model.embed_query(query)
d_vecs = model.embed_documents(docs)
else:
q_vec = model.encode([query])[0]
d_vecs = model.encode(docs)
scores = [np.dot(q_vec, d) / (np.linalg.norm(q_vec) * np.linalg.norm(d)) for d in d_vecs]
rank = sorted(range(len(scores)), key=lambda i: -scores[i]).index(relevant_idx) + 1
return rank
# Run on your domain-specific test set before committing to a model
Interview Q&A
Q: Why must you use the same embedding model for indexing and querying? [Easy]
A: Each embedding model defines its own vector space — a specific mapping from text to points in n-dimensional space. The similarity scores (cosine, dot product) are only meaningful within the same coordinate system. If you index with model A and query with model B, the query vector lives in a completely different space, making similarity scores meaningless. The result: the system silently returns random-looking results with no error or warning.
Q: When does cosine similarity outperform dot product for text retrieval? [Medium]
A: When embedding vectors have non-unit norms that don't correlate with relevance. Text embedding models often produce vectors with varying magnitudes — verbose, information-dense chunks get higher magnitude vectors. Dot product unfairly boosts these high-magnitude vectors even when they're less topically relevant. Cosine similarity normalizes away magnitude, so only the angular direction (semantic direction) matters. Exception: if your model was explicitly trained with dot product as the objective (some DPR variants), use dot product.
Q: How does HNSW achieve fast approximate nearest neighbor search? [Hard]
A: HNSW builds a multilayer graph where each node represents a vector. The top layers are sparse (long-range connections for fast navigation), and lower layers are dense (local connections for precision). At query time, search starts at the top layer and greedily navigates toward the query vector, then descends to lower layers for refinement. The key insight is that greedy graph navigation in high-dimensional space finds approximate nearest neighbors much faster than exhaustive search. The efSearch parameter controls how many candidate neighbors to explore at each layer — higher values give better recall at the cost of query latency.
Q: Your RAG system needs to serve 10 tenants with 5M documents each (50M total) with strict data isolation. What vector database architecture would you use? [Hard]
A: Namespace/tenant-isolated indices. Most vector databases (Pinecone, Qdrant, Weaviate) support namespace or collection-level isolation where data and access controls are separated per tenant. Architecture: one namespace per tenant, metadata filtering enforces tenant boundaries at the application layer. Alternative: separate indices per tenant — stronger isolation, but higher operational overhead. For 50M vectors, Pinecone or Vertex AI Vector Search are good managed options; Weaviate or Qdrant if you want self-hosted control. Always benchmark: at this scale, namespace isolation in a single index is often better than 10 separate indices due to index build costs.
Q: What is Matryoshka Representation Learning and why is it useful for production RAG? [Hard]
A: MRL trains embedding models to encode information in a nested way: the first N dimensions of the vector capture the most important semantic information, and adding more dimensions adds progressively finer detail. This means you can truncate an MRL-trained vector (like OpenAI's text-embedding-3 series) to 256 or 512 dimensions with minimal quality loss instead of storing the full 3072 dimensions. In production, this enables a two-stage retrieval: first-stage retrieval with compressed 256-dim vectors (fast, cheap), followed by re-scoring with the full 3072-dim vectors for the top-k candidates (precision). This cuts storage and query costs significantly.
Q: What is the difference between a bi-encoder and a cross-encoder, and when would you use each? [Medium]
A: A bi-encoder embeds the query and document independently, producing two separate vectors that are compared by cosine similarity. This is fast because documents can be pre-embedded offline — at query time, you only embed the query. A cross-encoder takes a (query, document) pair as a single input and produces a scalar relevance score; it sees the interaction between query and document terms, giving much higher precision. Cross-encoders are 10-100x slower because they must process each (query, document) pair at query time. The standard pattern: bi-encoder for first-stage ANN retrieval (fast, moderate precision), cross-encoder for reranking top-k candidates (slow, high precision).
Q: How would you reduce embedding costs in a RAG system with 10 million documents and 100K daily queries? [Medium]
A: Three levers: (1) Batch indexing — embed documents in batches of 100-2048 (most API providers give bulk discounts and lower per-call overhead), not one by one. (2) Query embedding cache — hash the query string and cache the embedding; many users ask near-identical questions. A TTL of 24 hours with an LRU cache can hit 30-60% cache rates for customer support use cases. (3) Smaller embedding model — if quality is acceptable, a 256-dim MRL-truncated model costs 6x less to store and compute than the full 1536-dim model. (4) Embedding model self-hosting — open-source models (BGE, E5) hosted on Cloud Run or GKE eliminate per-token API costs at the cost of infrastructure management.