Embeddings & Spaces

pgVectorDB introduces the concept of Spaces to handle multi-embedding and multimodal RAG. Instead of being restricted to one vector per document, you can configure multiple Spaces — each encoding a different signal (text, price, category, recency) into its own vector column.

For the full multimodal search pipeline, see Multimodal Search.


Space Types Overview

| Space | Import | Input | Best For |
| --- | --- | --- | --- |
| TextSpace | pgvectordb.spaces | Text string | Semantic text similarity |
| VectorSpace | pgvectordb.spaces | Raw float list | Pre-computed embeddings (images, audio) |
| NumberSpace | pgvectordb.spaces | float / int | Numeric fields (price, rating, views) |
| CategorySpace | pgvectordb.spaces | Category string | Categorical fields (city, product type) |
| RecencySpace | pgvectordb.spaces | datetime / timestamp | Time-based relevance |

TextSpace

Encodes text using the pgVectorDB embedding model. If dimensions=0, the dimension count is auto-detected from the model at register_spaces() time.

from pgvectordb.spaces import TextSpace

text_space = TextSpace(
    name="description",   # Creates column: embedding_description
    field="content",      # "content" maps to document.page_content; any other value maps to metadata
    dimensions=384,       # Set to 0 for auto-detection
    weight=1.0
)

VectorSpace

Accepts arbitrary pre-computed dense vectors. Use for image embeddings (CLIP), audio embeddings, or any external embedding pipeline.

from pgvectordb.spaces import VectorSpace

image_space = VectorSpace(
    name="image_clip",    # Creates column: embedding_image_clip
    dimensions=512,       # Must match your model's output size
    weight=1.5            # Higher weight = more influence in fusion
)

Inserting with pre-computed vectors:

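# Compute externally. clip_model stands in for your own image embedding pipeline.
image_embeddings = clip_model.embed_images(["shoe.jpg", "jacket.jpg"])
# text_embeddings is assumed to hold pre-computed text vectors for the same
# items, in the same order (computation not shown).

# One dict per text; each key names the space the vector belongs to
# ("image_clip" is the VectorSpace defined above).
multi_embeddings = [
    {"text_desc": text_embeddings[0], "image_clip": image_embeddings[0]},
    {"text_desc": text_embeddings[1], "image_clip": image_embeddings[1]},
]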

await db.add_embeddings(
    texts=["Red running shoe", "Blue jacket"],
    embeddings=multi_embeddings
)
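
Because each pre-computed vector must match the dimensions declared for its space, it can help to check the payload before inserting. The helper below is a minimal sketch in plain Python; `validate_multi_embeddings` and the `expected_dims` mapping are hypothetical names for illustration and are not part of pgVectorDB.

```python
from typing import Dict, List, Sequence


def validate_multi_embeddings(
    rows: Sequence[Dict[str, Sequence[float]]],
    expected_dims: Dict[str, int],
) -> List[str]:
    """Return a list of problems; an empty list means every row is consistent.

    Hypothetical helper, not a pgVectorDB API.
    """
    problems: List[str] = []
    for i, row in enumerate(rows):
        for space_name, dims in expected_dims.items():
            if space_name not in row:
                problems.append(f"row {i}: missing vector for space '{space_name}'")
            elif len(row[space_name]) != dims:
                problems.append(
                    f"row {i}: space '{space_name}' has "
                    f"{len(row[space_name])} dims, expected {dims}"
                )
    return problems


# Example: the second row carries an image vector of the wrong size.
rows = [
    {"text_desc": [0.0] * 384, "image_clip": [0.0] * 512},
    {"text_desc": [0.0] * 384, "image_clip": [0.0] * 256},
]
problems = validate_multi_embeddings(rows, {"text_desc": 384, "image_clip": 512})
for p in problems:
    print(p)
```

Running a check like this before `add_embeddings` surfaces a size mismatch as a readable message rather than as a database error.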

NumberSpace

Encodes numeric metadata fields (price, rating, age, views) into a vector representation using a configurable mode.

from pgvectordb.spaces import NumberSpace, NumberMode

price_space = NumberSpace(
    name="price",
    field="price",            # Key in document.metadata
    min_value=0,
    max_value=1_000_000,
    mode=NumberMode.NORMALIZED,
    weight=0.3
)

NumberMode Values

| Mode | Behavior | Use When |
| --- | --- | --- |
| NORMALIZED | Linearly scales to [0, 1] | You want balanced numeric similarity |
| MINIMUM | Encodes "smaller is better" bias | Price (cheaper is better) |
| MAXIMUM | Encodes "larger is better" bias | Rating, views (higher is better) |

# Price: cheaper is better
price_space = NumberSpace(
    name="price", field="price",
    min_value=0, max_value=5_000_000,
    mode=NumberMode.MINIMUM,   # Prefers lower prices
    weight=0.3
)

# Rating: higher is better
rating_space = NumberSpace(
    name="rating", field="rating",
    min_value=1, max_value=5,
    mode=NumberMode.MAXIMUM,   # Prefers higher ratings
    weight=0.2
)
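
The intuition behind the three modes can be pictured with a small standalone sketch. This is an illustration only and is an assumption about the general idea, not pgVectorDB's actual encoding: `normalize` and `preference_score` are hypothetical functions, and the modes are passed as plain strings so the snippet runs without the library.

```python
def normalize(value: float, min_value: float, max_value: float) -> float:
    """Linearly scale value into [0, 1], clamping out-of-range inputs."""
    if max_value <= min_value:
        raise ValueError("max_value must be greater than min_value")
    clamped = min(max(value, min_value), max_value)
    return (clamped - min_value) / (max_value - min_value)


def preference_score(value: float, min_value: float, max_value: float, mode: str) -> float:
    """Illustrative score in [0, 1]; higher means more preferred (assumed behavior)."""
    scaled = normalize(value, min_value, max_value)
    if mode == "MINIMUM":      # smaller is better, e.g. price
        return 1.0 - scaled
    if mode == "MAXIMUM":      # larger is better, e.g. rating
        return scaled
    if mode == "NORMALIZED":   # no directional bias, just the scaled value
        return scaled
    raise ValueError(f"unknown mode: {mode}")


print(normalize(250_000, 0, 1_000_000))             # a price a quarter of the way up the range
print(preference_score(0, 0, 5_000_000, "MINIMUM"))  # cheapest possible item
print(preference_score(1, 1, 5, "MAXIMUM"))          # lowest possible rating
```

Note that values outside `[min_value, max_value]` are clamped in this sketch, which is one reason to choose bounds that cover the real range of the field.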

CategorySpace

Encodes categorical string labels as dense vectors using one-hot-style encoding. Documents with matching categories will have lower distance between them.

from pgvectordb.spaces import CategorySpace

city_space = CategorySpace(
    name="city",
    field="city",                                 # Key in document.metadata
    categories=["NYC", "LA", "Chicago", "Houston"],
    weight=0.2
)

Note

If a document's category value is not in the categories list, it is treated as an unknown category. Extend the list when you add new categories.
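
To see why matching categories end up closer together, consider a plain one-hot encoding. The sketch below is an assumption for illustration, not the library's implementation: `one_hot` and `squared_distance` are hypothetical helpers, and reserving a trailing slot for unknown values is one possible way to handle categories that are not in the list.

```python
from typing import List, Sequence


def one_hot(value: str, categories: Sequence[str]) -> List[float]:
    """One slot per known category, plus a trailing slot for unknown values."""
    vec = [0.0] * (len(categories) + 1)
    if value in categories:
        vec[list(categories).index(value)] = 1.0
    else:
        vec[-1] = 1.0  # assumed handling of an unknown category
    return vec


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


cities = ["NYC", "LA", "Chicago", "Houston"]
nyc, la, boston = one_hot("NYC", cities), one_hot("LA", cities), one_hot("Boston", cities)

print(squared_distance(nyc, one_hot("NYC", cities)))  # same category
print(squared_distance(nyc, la))                      # different categories
print(boston)                                         # value not in the list
```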


RecencySpace

Encodes timestamps so that more recent documents receive higher similarity scores. Useful for news, feed ranking, or any freshness-sensitive search.

from pgvectordb.spaces import RecencySpace, TimeUnit

recency_space = RecencySpace(
    name="recency",
    field="published_at",    # Key in document.metadata (datetime or Unix timestamp)
    time_unit=TimeUnit.DAYS,
    weight=0.15
)

TimeUnit Values

| TimeUnit | Granularity | Best For |
| --- | --- | --- |
| SECONDS | Very fine | Real-time feeds |
| MINUTES | Fine | Intraday content |
| HOURS | Medium | Hourly updates |
| DAYS | Coarse | News, blog posts |
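
The effect of the time unit can be illustrated with a short standalone sketch. Everything here is an assumption for illustration: `age_in_units` and `recency_score` are hypothetical functions, units are passed as plain strings, and the `1 / (1 + age)` decay is just one simple way to make newer items score higher. It is not the formula pgVectorDB uses.

```python
from datetime import datetime, timezone
from typing import Union

SECONDS_PER_UNIT = {"SECONDS": 1, "MINUTES": 60, "HOURS": 3600, "DAYS": 86400}


def age_in_units(published_at: Union[datetime, int, float], now: datetime, unit: str) -> float:
    """Age of a document in the chosen unit; future timestamps count as age 0.

    Accepts a timezone-aware datetime or a Unix timestamp (seconds since epoch).
    """
    if isinstance(published_at, (int, float)):
        published_at = datetime.fromtimestamp(published_at, tz=timezone.utc)
    elapsed = (now - published_at).total_seconds()
    return max(elapsed, 0.0) / SECONDS_PER_UNIT[unit]


def recency_score(published_at: Union[datetime, int, float], now: datetime, unit: str) -> float:
    """Assumed decay: 1.0 for brand-new items, approaching 0 as age grows."""
    return 1.0 / (1.0 + age_in_units(published_at, now, unit))


now = datetime(2024, 1, 11, tzinfo=timezone.utc)
published = datetime(2024, 1, 1, tzinfo=timezone.utc)

print(age_in_units(published, now, "DAYS"))   # ten days old
print(age_in_units(published, now, "HOURS"))  # the same age at finer granularity
print(recency_score(published, now, "DAYS"))
```

With a coarse unit, two documents published a few hours apart get almost the same score; with a fine unit, the same gap produces a large difference. That is the trade-off the table above summarizes.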

Using Multiple Spaces Together

Initialization with Spaces

from pgvectordb import pgVectorDB
from pgvectordb.spaces import TextSpace, NumberSpace, CategorySpace

db = pgVectorDB(
    collection_name="products",
    embedding_model=my_embeddings,
    connection_string="postgresql+asyncpg://user:pass@localhost/db"
)
await db.initialize()

spaces = [
    TextSpace(name="description", field="content",   weight=0.5),
    NumberSpace(name="price",     field="price",
                min_value=0, max_value=1_000_000,    weight=0.3),
    CategorySpace(name="category", field="category",
                  categories=["Electronics", "Clothing", "Home"], weight=0.2),
]
db.register_spaces(spaces)

Inserting Documents

from langchain_core.documents import Document

docs = [
    Document(
        page_content="Wireless noise-cancelling headphones",
        metadata={"price": 299.99, "category": "Electronics"}
    ),
    Document(
        page_content="Slim-fit cotton dress shirt",
        metadata={"price": 79.99, "category": "Clothing"}
    ),
]

await db.add_documents_multimodal(docs)

Searching

results = await db.multimodal_search(
    query_params={
        "description": "headphones for working from home",
        "price": 200.0,
        "category": "Electronics",
    },
    weights={"description": 0.5, "price": 0.3, "category": 0.2},
    k=5
)

for r in results:
    print(f"{r['score']:.4f} | {r['content']}")

See Multimodal Search for the complete pipeline including index creation and monitoring.


Weight Tuning Tips

Weights control how much influence each space has on the final ranking. They are normalized internally, so they don't need to sum to 1.0; only their relative sizes matter.

# Text-dominant: semantic similarity drives ranking
weights = {"description": 0.7, "price": 0.2, "category": 0.1}

# Price-dominant: use for e-commerce filtered browsing
weights = {"description": 0.3, "price": 0.6, "category": 0.1}

# Balanced: equal contribution from all signals
weights = {"description": 0.33, "price": 0.33, "category": 0.34}
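
To make the normalization concrete, here is a minimal sketch of how weights that do not sum to 1.0 could be rescaled and applied. The weighted-sum fusion and the helper names `normalize_weights` and `fuse` are assumptions for illustration; the source only states that weights are normalized internally.

```python
from typing import Dict


def normalize_weights(weights: Dict[str, float]) -> Dict[str, float]:
    """Rescale weights so they sum to 1.0 while keeping their ratios."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: w / total for name, w in weights.items()}


def fuse(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Assumed fusion: weighted sum of per-space similarity scores."""
    norm = normalize_weights(weights)
    return sum(norm[name] * scores[name] for name in norm)


# Integer weights with the same ratios as 0.7 / 0.2 / 0.1
raw = {"description": 7, "price": 2, "category": 1}
print(normalize_weights(raw))

# Per-space similarity scores for one hypothetical document
scores = {"description": 0.9, "price": 0.5, "category": 1.0}
print(fuse(scores, raw))
```

Under this assumption, `{"description": 7, "price": 2, "category": 1}` ranks documents exactly as `{"description": 0.7, "price": 0.2, "category": 0.1}` would.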

Use RAGEvaluator to find optimal weights

Define a ground-truth dataset and use metrics.py's RAGEvaluator to compare different weight configurations by Hit Rate, MRR, and NDCG. See Metrics & Evaluation.
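
For readers who want to see what two of these metrics measure, the sketch below computes Hit Rate and MRR by hand for a tiny ground-truth set. It is a standalone illustration and does not use or reproduce the RAGEvaluator API; the function names and the data layout (one ranked ID list and one relevant ID per query) are assumptions.

```python
from typing import Sequence


def hit_rate(ranked: Sequence[Sequence[str]], relevant: Sequence[str], k: int) -> float:
    """Fraction of queries whose relevant ID appears in the top k results."""
    hits = sum(1 for ids, target in zip(ranked, relevant) if target in ids[:k])
    return hits / len(relevant)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], relevant: Sequence[str]) -> float:
    """Average of 1/rank of the relevant ID; a query with no match contributes 0."""
    total = 0.0
    for ids, target in zip(ranked, relevant):
        if target in ids:
            total += 1.0 / (list(ids).index(target) + 1)
    return total / len(relevant)


# Two queries: the first finds its relevant document at rank 2, the second misses.
ranked_results = [["a", "b", "c"], ["d", "e", "f"]]
relevant_ids = ["b", "x"]

print(hit_rate(ranked_results, relevant_ids, k=3))
print(mean_reciprocal_rank(ranked_results, relevant_ids))
```

Computing these numbers once per candidate weight dictionary, on the same ground-truth queries, gives a like-for-like comparison between configurations.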