The Architecture of Modern Information Retrieval Systems

Information retrieval is a field of computer science concerned with the organization,
storage, retrieval, and evaluation of information from document repositories, particularly
textual information. The history of information retrieval stretches back to the earliest
libraries and card catalogs, but the digital revolution transformed it into one of the most
critical technologies underlying modern computing infrastructure.

At its core, a retrieval system must solve a deceptively simple problem: given a collection
of documents and a user query, return the documents most relevant to that query. The
difficulty lies in the fact that relevance is a deeply subjective concept. What is relevant
to one user may be entirely irrelevant to another, and even the same user's notion of
relevance shifts over time and with context.

Early retrieval systems relied on exact keyword matching. A query for the word "river"
would retrieve only documents containing exactly that string. While simple to implement,
this approach fails in a number of important ways. It cannot handle synonyms, so a search
for "stream" would miss documents about rivers that use only that synonym. It cannot handle
morphological variation, so "running" would not match "run" unless stemming was applied.
And it provides no ranking — every matching document is treated as equally relevant.

The introduction of the vector space model in the 1970s represented a significant conceptual
advance. In this model, both documents and queries are represented as vectors in a
high-dimensional space, where each dimension corresponds to a term in the vocabulary. The
relevance of a document to a query is computed as the cosine similarity between their
respective vectors. Documents are ranked by this similarity score, so users receive results
in decreasing order of estimated relevance rather than an unordered set.
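The ranking step can be illustrated with a small sketch. The term weights below are invented for illustration, not derived from any real corpus:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vocabulary: ("river", "bank", "money"); one dimension per term.
doc_geography = [0.9, 0.4, 0.0]
doc_finance   = [0.0, 0.6, 0.8]
query         = [0.8, 0.3, 0.0]   # a query about rivers

# Rank documents by similarity to the query, highest first.
ranked = sorted(
    [("geography", doc_geography), ("finance", doc_finance)],
    key=lambda item: cosine_similarity(query, item[1]),
    reverse=True,
)
```

Because the query shares its dominant dimension with the geography document, that document ranks first.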

The term frequency-inverse document frequency weighting scheme, commonly known as TF-IDF,
became the standard approach for populating these vectors. A term's weight in a document
is proportional to how often it appears in that document, but inversely proportional to
how common it is across the entire corpus. This has the intuitive effect of giving high
weight to terms that are characteristic of a particular document but rare in the collection
as a whole.
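A minimal implementation of the unsmoothed weighting makes this concrete; production systems typically apply smoothed variants of the IDF term, but the basic scheme is:

```python
import math

corpus = [
    "the river flows past the bank".split(),
    "the bank raised interest rates".split(),
    "the river bank eroded".split(),
]

def tf_idf(term, doc, corpus):
    """TF-IDF weight of a term in one document of a corpus."""
    tf = doc.count(term) / len(doc)           # term frequency in this document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    idf = math.log(len(corpus) / df)          # inverse document frequency
    return tf * idf

# "the" appears in every document, so its IDF (and hence its weight) is zero;
# "flows" appears in only one document, so it receives a positive weight there.
w_common = tf_idf("the", corpus[0], corpus)
w_rare = tf_idf("flows", corpus[0], corpus)
```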

The emergence of neural networks and deep learning over the past decade has produced yet
another paradigm shift. Dense retrieval systems represent documents and queries as dense
low-dimensional vectors, typically with 384 or 768 dimensions, generated by transformer
models trained on large text corpora. These dense vectors capture semantic meaning in a
way that sparse bag-of-words representations cannot. Two sentences that express the same
idea in completely different words will have similar dense vector representations, enabling
true semantic search rather than mere keyword matching.

Sentence transformers, a family of models built on the BERT architecture, have proven
particularly effective for this task. Models such as all-MiniLM-L6-v2 are designed to
produce embeddings well suited to semantic similarity tasks. They are trained using
contrastive learning objectives on large datasets of sentence pairs labeled for similarity,
which teaches the model to place semantically similar sentences close together in the
embedding space and dissimilar sentences far apart.

The storage and querying of dense vector embeddings is handled by specialized vector
databases. These systems implement approximate nearest neighbor search algorithms that
can retrieve the most similar vectors from a collection of millions or billions in
milliseconds. Qdrant is one such system, notable for its Rust implementation, which
delivers strong performance, and for its support for embedded operation, in which the
database runs as a library within the host process rather than as a separate server. This makes it
ideal for applications that need the power of vector search without the operational
complexity of a separate database process.

Chunking strategy is a critical design decision in any retrieval system that operates over
long documents. A large language model can only process a limited number of tokens at once,
so documents must be divided into smaller pieces before embedding. A naive approach would
split documents at fixed character counts, but this risks cutting sentences mid-thought
and yields chunks whose lengths in tokens are unpredictable, since character positions
rarely coincide with token boundaries. The
correct approach is to split at token boundaries, using the same tokenizer as the embedding
model, and to apply a sliding window with overlap to ensure that information near chunk
boundaries is captured by multiple chunks.
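A sliding-window chunker over token ids might look like the following sketch; the integer ids stand in for output of the embedding model's own tokenizer, and the chunk size and overlap are illustrative defaults:

```python
def chunk_token_ids(token_ids, chunk_size=256, overlap=32):
    """Split a token-id sequence into overlapping windows.

    The stride is chunk_size - overlap, so each chunk shares `overlap`
    tokens with its predecessor and no token near a boundary appears
    in only a truncated context.
    """
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + chunk_size])
        if start + chunk_size >= len(token_ids):
            break  # last window already reached the end of the sequence
    return chunks

# Illustrative run over a 600-token "document".
ids = list(range(600))
chunks = chunk_token_ids(ids, chunk_size=256, overlap=32)
```

In practice the ids would come from the embedding model's tokenizer, and each chunk would be decoded back to text (or fed directly as ids) before embedding.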

The retrieval pipeline in an agentic RAG system typically involves multiple stages. The
initial retrieval stage uses approximate nearest neighbor search to identify a candidate
set of chunks that are semantically similar to the query. A reranking stage then applies
a more expensive but more accurate model, such as a cross-encoder, to reorder the candidates
by relevance. The final stage assembles the top-ranked chunks into a context window that
is passed to the language model along with the original query.
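The three stages can be sketched with stand-in scorers; here a brute-force dot product plays the role of the approximate vector search, and `overlap_score` is a deliberately crude, hypothetical stand-in for a cross-encoder:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def overlap_score(query_text, doc_text):
    """Stand-in for a cross-encoder: fraction of query words found in the doc."""
    q, d = set(query_text.lower().split()), set(doc_text.lower().split())
    return len(q & d) / len(q)

def retrieve_candidates(query_vec, index, k):
    """Stage 1: cheap vector search (brute force here; ANN in practice)."""
    return sorted(index, key=lambda doc: dot(query_vec, doc["vector"]), reverse=True)[:k]

def rerank(query_text, candidates, k):
    """Stage 2: slower, more accurate scoring of the small candidate set."""
    return sorted(candidates, key=lambda doc: overlap_score(query_text, doc["text"]),
                  reverse=True)[:k]

index = [
    {"text": "rivers carve valleys", "vector": [0.9, 0.1]},
    {"text": "banks set interest rates", "vector": [0.2, 0.8]},
    {"text": "the river floods in spring", "vector": [0.8, 0.2]},
]
candidates = retrieve_candidates([1.0, 0.0], index, k=2)
top = rerank("river floods", candidates, k=2)
context = "\n\n".join(doc["text"] for doc in top)  # Stage 3: assemble the context
```

The design point is the funnel shape: the cheap first stage narrows millions of chunks to dozens, so the expensive second stage only ever scores a handful of pairs.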

Metadata filtering adds another dimension to retrieval quality. In many use cases, not
all documents in a collection are relevant to every query. A user may want results only
from a specific time range, from a particular author, or from documents carrying a certain tag.
Vector databases support filtering on metadata fields alongside the vector similarity
search, allowing these constraints to be applied efficiently without post-processing.

The measurement of retrieval quality is a research area in its own right. Standard metrics
include precision at k, which measures the fraction of the top k retrieved documents that
are relevant, and recall at k, which measures the fraction of all relevant documents that
appear in the top k results. Mean reciprocal rank captures where in the ranked list the
first relevant result appears. These metrics require human judgments of relevance, which
are expensive to collect and vary across annotators.
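These three metrics are short enough to state directly in code; the document ids in the example are invented:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    return sum(1 for doc in relevant if doc in retrieved[:k]) / len(relevant)

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1/rank of the first relevant item in each ranked list."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Two of the top four results are relevant, and one relevant document is missed.
p = precision_at_k(["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"}, k=4)
r = recall_at_k(["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"}, k=4)
mrr = mean_reciprocal_rank([["d3", "d1"], ["d2"]], [{"d1"}, {"d2"}])
```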

Modern RAG systems are increasingly agentic, meaning that the retrieval process itself
is driven by the reasoning capabilities of a language model rather than a fixed algorithm.
The model may decompose a complex query into multiple sub-queries, evaluate the relevance
of retrieved chunks, and decide whether additional retrieval is needed before generating
a final response. This closes the gap between the retrieval system and the reasoning
system, enabling more sophisticated and accurate answers to complex questions.
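The control flow can be sketched as a loop in which stub callables stand in for the model and database calls; `llm_decompose`, `llm_sufficient`, `llm_generate`, and `retrieve` are all hypothetical hooks supplied by the caller, not any particular library's API:

```python
def agentic_answer(question, llm_decompose, retrieve, llm_sufficient,
                   llm_generate, max_rounds=3):
    """Retrieval driven by model decisions rather than a fixed pipeline.

    Each round, the model breaks the question into sub-queries (possibly
    reformulated in light of what has been retrieved so far), fetches
    chunks for each, and decides whether the context suffices to answer.
    """
    context = []
    for _ in range(max_rounds):
        for sub_query in llm_decompose(question, context):
            context.extend(retrieve(sub_query))
        if llm_sufficient(question, context):
            break
    return llm_generate(question, context)

# Deterministic stubs, just to exercise the loop.
facts = {"q": ["fact-a"]}
answer = agentic_answer(
    "q",
    llm_decompose=lambda q, ctx: [q],
    retrieve=lambda q: facts.get(q, []),
    llm_sufficient=lambda q, ctx: len(ctx) > 0,
    llm_generate=lambda q, ctx: " ".join(ctx),
)
```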

Privacy is a paramount concern in many applications of RAG systems. Users who work with
sensitive documents — legal contracts, medical records, proprietary research — cannot
afford to send their data to a cloud service for embedding or storage. Fully local RAG
systems, where all computation occurs on the user's own hardware, are essential for these
use cases. The combination of local embedding models like sentence-transformers and
embedded vector databases like Qdrant makes it possible to build fully local RAG
pipelines that provide cloud-grade retrieval quality without any data leaving the machine.

The packaging and distribution of such systems must account for the significant size of
the machine learning models they depend on. A base installation of a sentence-transformers
model may require downloading hundreds of megabytes of weights. Best practice is to
download models lazily on first use rather than bundling them with the package, and to
cache them in a standard location such as the HuggingFace cache directory so they are
shared across applications and not re-downloaded unnecessarily.
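One common pattern is to defer both the import and the download until the model is first requested; `get_embedding_model` below is an illustrative helper, not part of any library:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_embedding_model(name="all-MiniLM-L6-v2"):
    """Load the embedding model on first use and memoize it.

    The import is deferred so that merely importing this module stays
    cheap. sentence-transformers stores the downloaded weights in the
    shared HuggingFace cache, so other applications on the same machine
    reuse them rather than re-downloading.
    """
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(name)
```

Because of the `lru_cache`, repeated calls within a process return the same loaded model instance, so the cost of construction is paid at most once.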

Future directions in information retrieval include multimodal retrieval over mixed
collections of text, images, and audio; retrieval-augmented generation for code; and
the use of structured knowledge graphs to supplement dense vector retrieval with explicit
relational reasoning. Each of these extensions builds on the foundational principles of
semantic similarity and dense vector representations that have proven so effective for
text retrieval.
