When a RAG system gives wrong answers, the instinct is to reach for fine-tuning. In production, the cheaper and more durable fix is almost always upstream: better retrieval.
Three levers move the needle before you touch a model: chunking strategy, a reranking pass, and an evaluation harness that tells you whether either helped.
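To make the first lever concrete, here is a minimal sketch of one common chunking strategy: fixed-size word windows with overlap, so a sentence that straddles a boundary still appears whole in at least one chunk. The function name `chunk_words` and the default `size` and `overlap` values are illustrative assumptions, not recommendations from this article; tune them against your own eval set.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Splitting on whitespace is the simplest possible unit; splitting on sentences or on the embedding model's tokens follows the same windowing pattern.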
A cross-encoder rerank pass
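Bi-encoder vector search is fast but lossy: query and document are embedded separately, so the similarity score never sees how their tokens interact. A cross-encoder reads each query/candidate pair together, so a rerank over the top-k recovers precision cheaply: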

    from sentence_transformers import CrossEncoder

    # Load the reranker once; the same model instance is reused across calls.
    reranker = CrossEncoder("BAAI/bge-reranker-base")

    def rerank(query: str, candidates: list[str], k: int = 5) -> list[str]:
        # Score each (query, candidate) pair jointly, then keep the k best.
        pairs = [(query, c) for c in candidates]
        scores = reranker.predict(pairs)
        ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
        return [c for c, _ in ranked[:k]]

Pair this with an eval set of question/answer/expected-source triples and you can prove each change helps instead of guessing.
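One way such a harness might look, as a sketch under stated assumptions: each triple's third element is taken to be a source id, and `retrieve` is a hypothetical callable that maps a question to a ranked list of source ids. The names `EvalTriple` and `evaluate_retrieval`, and the choice of hit rate and mean reciprocal rank as metrics, are illustrative and not something this article prescribes.

```python
from typing import Callable

# Assumed shape of one eval item: (question, reference answer, expected source id).
EvalTriple = tuple[str, str, str]

def evaluate_retrieval(
    eval_set: list[EvalTriple],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> dict[str, float]:
    """Return hit rate@k and MRR@k for a retriever that returns ranked source ids."""
    hits = 0
    reciprocal_ranks = 0.0
    for question, _answer, expected_source in eval_set:
        top_k = retrieve(question)[:k]
        if expected_source in top_k:
            hits += 1
            # Rank is 1-based: a match in first position contributes 1.0.
            reciprocal_ranks += 1.0 / (top_k.index(expected_source) + 1)
    n = len(eval_set)
    if n == 0:
        return {"hit_rate": 0.0, "mrr": 0.0}
    return {"hit_rate": hits / n, "mrr": reciprocal_ranks / n}
```

Run it once on the baseline retriever and once with the change under test (new chunking, rerank on or off), using the same `eval_set` and `k`, and compare the two result dicts. The reference answer is unused here because this sketch scores retrieval only; grading generated answers would need a separate check.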