Overview
This overview covers text-based embedding models. LangChain does not currently support multimodal embeddings. See top embedding models.
How it works
- Vectorization — The model encodes each input string as a high-dimensional vector.
- Similarity scoring — Vectors are compared using mathematical metrics to measure how closely related the underlying texts are.
Similarity metrics
Several metrics are commonly used to compare embeddings:
- Cosine similarity — measures the angle between two vectors.
- Euclidean distance — measures the straight-line distance between points.
- Dot product — measures how much one vector projects onto another.
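As a quick illustration, here is a minimal sketch of the three metrics using plain NumPy on two toy vectors (real embeddings have hundreds or thousands of dimensions):

```python
import numpy as np

# Two toy embedding vectors; real embeddings are much higher-dimensional.
a = np.array([0.1, 0.3, 0.8])
b = np.array([0.2, 0.1, 0.9])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # higher = more similar
euclidean = np.linalg.norm(a - b)                                # lower = more similar
dot = np.dot(a, b)                                               # equals cosine when vectors are unit-normalized

print(f"cosine={cosine:.3f}  euclidean={euclidean:.3f}  dot={dot:.3f}")
```

Most vector stores compute one of these for you; cosine similarity over normalized vectors is a common default.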
Interface
LangChain provides a standard interface for text embedding models (e.g., OpenAI, Cohere, Hugging Face) via the Embeddings interface. Two main methods are available:
- embed_documents(texts: List[str]) → List[List[float]]: Embeds a list of documents.
- embed_query(text: str) → List[float]: Embeds a single query.
The interface allows queries and documents to be embedded with different strategies, though most providers handle them the same way in practice.
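For example, a sketch using the OpenAI integration (any Embeddings implementation exposes the same two methods; this assumes langchain-openai is installed and OPENAI_API_KEY is set):

```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# One vector per input string.
doc_vectors = embeddings.embed_documents(
    ["LangChain supports many embedding providers.", "Embeddings map text to vectors."]
)

# A single vector for the query.
query_vector = embeddings.embed_query("Which embedding providers does LangChain support?")

print(len(doc_vectors), len(doc_vectors[0]))  # 2 documents, 1536 dimensions each
print(len(query_vector))                      # 1536
```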
Top integrations
Common deployment patterns
In practice, most teams converge on one of four patterns:
- Hosted, flagship: OpenAI text-embedding-3-large, Cohere embed-english-v3, Google gemini-embedding-001, Voyage voyage-3. One API call, best-in-class quality out of the box, no local infrastructure. Per-call cost and a data-egress dependency.
- Local, open-source: BAAI/bge-*, mixedbread-ai/mxbai-embed-*, Qwen/Qwen3-Embedding-*, nomic-ai/modernbert-embed-*, sentence-transformers/all-*. Download once, run anywhere. No per-call cost, data never leaves your environment. Likely slower on CPU than a hosted API at small scale; competitive or faster with a GPU.
- Local, open-source, specialist: a fine-tuned model targeting your specific domain, language, or task. Starting from a strong open base (e.g. BAAI/bge-m3) and fine-tuning on even a few thousand in-domain query/document pairs often beats hosted flagships on retrieval accuracy for that domain.
- Self-hosted at production scale: the same open models (base or fine-tuned) served via Text Embeddings Inference (TEI) or Ollama. Gives you the economics of local inference with the horizontal scaling and API ergonomics of a hosted provider.

Whichever pattern you choose, you instantiate an Embeddings subclass and hand it to your vector store or retriever. Patterns (2) and (3) use HuggingFaceEmbeddings; pattern (4) uses HuggingFaceEndpointEmbeddings or OllamaEmbeddings, as sketched below.
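A minimal sketch of patterns (2)-(4), assuming the langchain-huggingface and langchain-ollama packages are installed, a TEI server is already running at the URL shown, and the bge-m3 model has been pulled in Ollama:

```python
# Patterns (2) and (3): local open-source (or fine-tuned) model.
from langchain_huggingface import HuggingFaceEmbeddings, HuggingFaceEndpointEmbeddings
from langchain_ollama import OllamaEmbeddings

local = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    encode_kwargs={"normalize_embeddings": True},
)

# Pattern (4): the same model served behind Text Embeddings Inference or Ollama.
tei = HuggingFaceEndpointEmbeddings(model="http://localhost:8080")  # URL of your TEI deployment (assumed)
ollama = OllamaEmbeddings(model="bge-m3")                           # requires `ollama pull bge-m3`
```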
Factors to weigh
Quality
Start from the MTEB leaderboard. MTEB benchmarks embedding models across retrieval, clustering, classification, and reranking tasks, and is the de-facto industry reference. Filter by your language(s) and by task (retrieval is the most common for RAG). Leaderboard numbers don't always transfer, so run a small evaluation on your own data before committing. LangSmith has tooling for this; see the evaluation guides.
Cost
Hosted embeddings typically price in the range of a few cents to ~$0.15 per million tokens. For a corpus embedded once and queried thousands of times a day, cost is often dominated by the query side. Local inference has zero per-call cost but requires CPU (slow) or GPU (capital or cloud cost). The crossover is workload-dependent: low-volume personal projects are essentially free on CPU; for mid-volume production, a single GPU serving a local model via TEI often beats hosted on unit economics.
Latency
Hosted embedding APIs add roughly 50-200ms of network latency per request. Local models on CPU take 10-100ms for a short query with a small model (all-MiniLM-L6-v2-class), and 50-500ms for larger models. On GPU, local inference is typically faster than a round-trip to a hosted API.
For batch indexing, latency per request matters less than throughput. TEI and multi-process local inference batch aggressively. Consider e.g. encode_kwargs={"batch_size": 64} or higher on HuggingFaceEmbeddings when running on GPU.
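For example, a sketch of GPU batch indexing with HuggingFaceEmbeddings (assumes a CUDA device and the langchain-huggingface package; the chunk list is a placeholder for your pre-split corpus):

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="BAAI/bge-m3",
    model_kwargs={"device": "cuda"},  # assumes a CUDA GPU is available
    encode_kwargs={"batch_size": 64, "normalize_embeddings": True},
)

chunks = ["first document chunk ...", "second document chunk ..."]  # your pre-split corpus
vectors = embeddings.embed_documents(chunks)
```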
Dimensionality
Embedding dimension affects vector store storage and query compute. Typical sizes:
- 384 (small Sentence Transformers models, all-MiniLM-L6-v2)
- 768 (mid-size ST models, all-mpnet-base-v2, bge-base)
- 1024 (bge-large, Cohere v3, Voyage)
- 1536 (OpenAI text-embedding-3-small, Qwen3-Embedding-0.6B)
- 3072+ (OpenAI text-embedding-3-large, Qwen3-Embedding-4B/8B)

Some models (OpenAI text-embedding-3-*, mixedbread-ai/mxbai-embed-large-v1, Matryoshka-trained ST models, Qwen3-Embedding) support truncation: slice the vector to a smaller dimension with graceful quality degradation. Useful for fitting more vectors into a smaller index.
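Two hedged examples: the OpenAI integration exposes a dimensions parameter for server-side truncation, and for local Matryoshka-trained models you can slice and re-normalize the vector yourself (the truncate helper below is illustrative, not a library function):

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings

# Hosted: text-embedding-3-* accept a `dimensions` argument for server-side truncation.
compact = OpenAIEmbeddings(model="text-embedding-3-large", dimensions=1024)

# Local Matryoshka-trained models: truncate client-side and re-normalize.
def truncate(vector: list[float], dim: int = 512) -> list[float]:
    v = np.asarray(vector[:dim])
    return (v / np.linalg.norm(v)).tolist()
```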
Context length
Most classic embedding models cap out at 512 tokens (all-mpnet-base-v2, classic BGE). Newer models support longer contexts:
- nomic-ai/modernbert-embed-base: 8192 tokens
- Alibaba-NLP/gte-multilingual-base: 8192 tokens
- BAAI/bge-m3: 8192 tokens
- OpenAI text-embedding-3-*: 8191 tokens
Multilingual support
For multilingual retrieval, pick a model trained on your languages. Strong defaults:
- Open: BAAI/bge-m3, intfloat/multilingual-e5-*, Alibaba-NLP/gte-multilingual-*, Qwen/Qwen3-Embedding-* (via HuggingFaceEmbeddings)
- Hosted: Cohere embed-multilingual-v3, OpenAI text-embedding-3-*
Query and document prompts
Several modern open models (E5, BGE, Qwen3-Embedding, GTE) are trained with different text prefixes for queries versus documents. Using the wrong prefix at query time is a common quality regression. When using HuggingFaceEmbeddings, pass prompts explicitly:
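(A sketch using E5-style prefixes; it assumes a recent langchain-huggingface release that exposes query_encode_kwargs and a sentence-transformers version whose encode() accepts a prompt argument.)

```python
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/multilingual-e5-large",
    # Prefix applied to documents by embed_documents()
    encode_kwargs={"prompt": "passage: ", "normalize_embeddings": True},
    # Prefix applied to queries by embed_query()
    query_encode_kwargs={"prompt": "query: ", "normalize_embeddings": True},
)
```

Check each model's card for the exact prefixes it expects; BGE and Qwen3-Embedding use instruction-style prompts rather than the E5 prefixes shown here.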
Licensing
Most popular open embedding models are permissively licensed (Apache 2.0, MIT). A few recent specialist models require a commercial license for production use. Check each model's license before shipping.
Beyond single-vector dense embeddings
A single dense vector per chunk is the default, but not the only option.
Sparse and hybrid retrieval
Dense embeddings don't handle exact-match queries (product codes, named entities, code identifiers) as well as keyword-based indexes. Hybrid retrieval combines a dense index with BM25 or a sparse neural index (SPLADE, BAAI/bge-m3's sparse output) to cover both cases.
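A minimal hybrid sketch, assuming langchain, langchain-community, langchain-huggingface, and the rank_bm25 package are installed; the weights are a starting point, not a recommendation:

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever   # needs the rank_bm25 package
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_huggingface import HuggingFaceEmbeddings

texts = ["SKU-4821 stainless bolt, M6", "Stainless fastener for outdoor decking"]

# Dense index over the same texts as the sparse keyword index.
dense = InMemoryVectorStore.from_texts(texts, HuggingFaceEmbeddings(model_name="BAAI/bge-m3"))
sparse = BM25Retriever.from_texts(texts)

hybrid = EnsembleRetriever(retrievers=[dense.as_retriever(), sparse], weights=[0.5, 0.5])
docs = hybrid.invoke("SKU-4821")  # exact-match query where BM25 helps most
```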
Late-interaction and multi-vector
ColBERT-style models produce a vector per token rather than per chunk, then score queries against documents via late interaction. This is typically more accurate than single-vector dense retrieval on complex queries, at the cost of higher storage and more complex indexing. Current open models in this space include jinaai/jina-colbert-v2, answerdotai/answerai-colbert-small-v1, and newer late-interaction variants such as lightonai/LateOn. LangChain's built-in retrievers target single-vector embeddings; late interaction typically requires a specialist index (Vespa, Qdrant's multi-vector support, or PyLate).
Starting points
If you just want a working starting point:
- Quick prototype, hosted: OpenAIEmbeddings(model="text-embedding-3-small")
- Quick prototype, local, no API key: HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2", encode_kwargs={"normalize_embeddings": True})
- Production, hosted, quality-first: VoyageAIEmbeddings(model="voyage-3") or OpenAIEmbeddings(model="text-embedding-3-large")
- Production, open, quality-first: HuggingFaceEmbeddings(model_name="BAAI/bge-m3", encode_kwargs={"normalize_embeddings": True}) served via TEI
- Multilingual, open: HuggingFaceEmbeddings(model_name="intfloat/multilingual-e5-large") with query and document prompts configured
Caching
Embeddings can be stored or temporarily cached to avoid recomputing them. Caching embeddings can be done using a CacheBackedEmbeddings. This wrapper stores embeddings in a key-value store, where the text is hashed and the hash is used as the key in the cache.
The main supported way to initialize a CacheBackedEmbeddings is from_bytes_store. It takes the following parameters:
- underlying_embedder: The embedder to use for embedding.
- document_embedding_cache: Any ByteStore for caching document embeddings.
- batch_size: (optional, defaults to None) The number of documents to embed between store updates.
- namespace: (optional, defaults to "") The namespace to use for the document cache. Helps avoid collisions (e.g., set it to the embedding model name).
- query_embedding_cache: (optional, defaults to None) A ByteStore for caching query embeddings, or True to reuse the same store as document_embedding_cache.
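For example, a sketch assuming langchain and langchain-openai are installed; LocalFileStore is one of several ByteStore implementations you could use here:

```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings

underlying = OpenAIEmbeddings(model="text-embedding-3-small")
store = LocalFileStore("./embedding_cache/")  # any ByteStore works here

cached = CacheBackedEmbeddings.from_bytes_store(
    underlying,
    store,
    namespace=underlying.model,   # avoid collisions if you switch models later
    query_embedding_cache=True,   # reuse the same store for embed_query calls
)

vectors = cached.embed_documents(["repeated text is embedded only once"])
```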
All embedding models
Aleph Alpha
Anyscale
Ascend
AI/ML API
AwaDB
AzureOpenAI
Baichuan Text Embeddings
Baidu Qianfan
Baseten
Bedrock
BGE on Hugging Face
Bookend AI
Clarifai
Cloudflare Workers AI
Clova Embeddings
Cohere
DashScope
Databricks
DeepInfra
EDEN AI
Elasticsearch
Embaas
Fake Embeddings
FastEmbed by Qdrant
Fireworks
Google Gemini
Google Vertex AI
GPT4All
Gradient
GreenNode
Hugging Face
IBM watsonx.ai
Infinity
Instruct Embeddings
IPEX-LLM CPU
IPEX-LLM GPU
Isaacus
Intel Extension for Transformers
Jina
John Snow Labs
LASER
Lindorm
Llama.cpp
LLMRails
LocalAI
MiniMax
MistralAI
Model2Vec
ModelScope
MosaicML
Naver
Nebius
Netmind
NLP Cloud
Nomic
NVIDIA NIMs
Oracle Cloud Infrastructure
Ollama
OpenClip
OpenAI
OpenVINO
Optimum Intel
Oracle AI Database
OVHcloud
Pinecone Embeddings
PredictionGuard
Perplexity
PremAI
SageMaker
SambaNova
Self Hosted
Sentence Transformers
Solar
SpaCy
SparkLLM
TensorFlow Hub
Text Embeddings Inference
TextEmbed
Titan Takeoff
Together AI
Upstage
Volc Engine
Voyage AI
Xinference
YandexGPT
ZhipuAI

