Retrieval quality without retraining

When a RAG system gives wrong answers, the instinct is to reach for fine-tuning. In production, the cheaper and more durable fix is almost always upstream: better retrieval.

Three levers move the needle before you touch a model: chunking strategy, a reranking pass, and an evaluation harness that tells you whether either helped.

A cross-encoder rerank pass

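Bi-encoder vector search is fast but lossy: the query and each document are embedded independently, so fine-grained interactions between them are lost. A cross-encoder scores each (query, candidate) pair jointly, so reranking only the top-k candidates recovers precision at modest cost: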

rerank.py

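from sentence_transformers import CrossEncoder

# Load the cross-encoder once at import time; it scores (query, passage) pairs jointly.
reranker = CrossEncoder("BAAI/bge-reranker-base")

def rerank(query: str, candidates: list[str], k: int = 5) -> list[str]:
    """Return the k candidates the cross-encoder scores highest for the query."""
    if not candidates:
        return []
    pairs = [(query, c) for c in candidates]
    scores = reranker.predict(pairs)  # one relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:k]]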

Pair this with an eval set of question/answer/expected-source triples and you can measure whether each change helps instead of guessing.
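
A minimal sketch of such an eval loop, assuming each triple names exactly one expected source id and that the retriever returns a ranked list of unique source ids. The names `EvalCase` and `evaluate` are hypothetical, chosen for illustration; they are not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    question: str
    answer: str           # reference answer; not used by the retrieval metrics below
    expected_source: str  # id of the document the retriever should surface


def evaluate(
    retrieve: Callable[[str], Sequence[str]],
    cases: Sequence[EvalCase],
    k: int = 5,
) -> dict[str, float]:
    """Score a retriever that maps a question to a ranked list of source ids."""
    if not cases:
        raise ValueError("eval set is empty")
    if k <= 0:
        raise ValueError("k must be positive")

    hits = 0
    for case in cases:
        top_k = list(retrieve(case.question))[:k]
        if case.expected_source in top_k:
            hits += 1

    hit_rate = hits / len(cases)
    # Assumption: one expected source per question and no duplicate ids in
    # the ranking, so at most one of the k slots can be relevant and
    # precision@k reduces to hit rate divided by k.
    return {"hit_rate": hit_rate, "precision": hit_rate / k}
```

Run `evaluate` once with the plain vector-search retriever and once with the reranked one, on the same cases and the same `k`, and compare the two results. With a single expected source per question, precision@k cannot exceed 1/k, so the hit rate is often the easier number to read.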

Figure: precision@5 by retrieval method. Precision@5 climbs with reranking and an eval loop.