Knowledge base search API relevance ranking - keyword matching vs semantic search

We’re experiencing poor search relevance in our knowledge base API and evaluating whether to enhance traditional keyword matching or move to semantic search with embeddings.

Our current implementation uses basic keyword matching with some TF-IDF weighting, but users complain they can’t find relevant articles even when they search with exact terminology from the content. For example, searching “refund processing time” ranks articles about payment methods above our actual refund policy article.

I’ve been researching BM25 keyword ranking algorithms and semantic search using embeddings. BM25 seems like a natural evolution of our current approach, while embeddings promise better conceptual matching but require significant infrastructure for indexing and vector similarity search.
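For concreteness, BM25's scoring formula is compact enough to sketch in pure Python (this is a minimal sketch with naive whitespace tokenization; k1 and b are the standard defaults):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against query_terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency for each query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "refund policy refund processing time details".split(),
    "payment methods credit card processing".split(),
]
print(bm25_scores("refund processing time".split(), docs))
```

Unlike raw TF-IDF, the k1 term saturates repeated keywords and the b term penalizes long documents, which already helps with the "payment methods ranked above refund policy" failure mode.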

Has anyone implemented hybrid ranking strategies that combine both approaches? Also curious about A/B testing methodology for measuring search relevance improvements - what metrics actually correlate with user satisfaction?

A/B testing methodology is critical for validating improvements. We ran a 50/50 traffic split for 4 weeks comparing old vs. new search. Primary metrics: click-through rate on the top 3 results, zero-result query rate, search refinement rate (users modifying their query), and time to article view. We also tracked session-level metrics like successful case resolution without escalation. The semantic search variant reduced zero-result queries by 42% and search refinements by 35%, both strong signals that relevance improved significantly.
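Roughly how we compute the per-query metrics from search event logs, as a sketch; the event fields (results, clicked_rank, refined) are illustrative, not our real schema:

```python
def search_metrics(events):
    """Aggregate per-query search events into relevance metrics."""
    n = len(events)
    # CTR@3: fraction of searches with a click on one of the top 3 results
    ctr_top3 = sum(1 for e in events
                   if e["clicked_rank"] is not None and e["clicked_rank"] <= 3) / n
    # Fraction of searches returning no results at all
    zero_result_rate = sum(1 for e in events if e["results"] == 0) / n
    # Fraction of searches where the user reformulated the query afterwards
    refinement_rate = sum(1 for e in events if e["refined"]) / n
    return {"ctr@3": ctr_top3,
            "zero_result_rate": zero_result_rate,
            "refinement_rate": refinement_rate}

events = [
    {"results": 10, "clicked_rank": 1, "refined": False},
    {"results": 0,  "clicked_rank": None, "refined": True},
    {"results": 5,  "clicked_rank": 4, "refined": True},
    {"results": 8,  "clicked_rank": 2, "refined": False},
]
print(search_metrics(events))  # {'ctr@3': 0.5, 'zero_result_rate': 0.25, 'refinement_rate': 0.5}
```

Compare these per-variant and run a significance test before declaring a winner; refinement rate in particular is noisy on low-traffic query segments.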

Embedding indexing was our biggest challenge. With 50K knowledge articles, the initial embedding generation took 6 hours. We optimized by batching articles (32 per batch) and running parallel workers; a full reindex now completes in 45 minutes. For incremental updates, we generate embeddings on article publish/update events and upsert them into our vector store (we use Pinecone). Query-time embedding generation is fast, 25-35ms for typical search queries. The infrastructure investment is real, but the relevance improvement justifies it.
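A minimal sketch of the batch-and-parallelize pipeline, under stated assumptions: embed_batch is a stand-in for your embedding model call, and the upsert callable stands in for the vector store client (neither is a real Pinecone API):

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 32  # matches the batch size described above

def chunked(items, size):
    """Yield fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_batch(articles):
    # Placeholder: call your embedding model/API here for the whole batch.
    return [[0.0, 0.0, 0.0] for _ in articles]

def index_articles(articles, upsert, workers=4):
    """Embed articles in parallel batches, then upsert (id, vector) pairs."""
    batches = list(chunked(articles, BATCH_SIZE))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves batch order, so ids and vectors stay aligned
        for batch, vectors in zip(batches, pool.map(embed_batch, batches)):
            upsert([(a["id"], v) for a, v in zip(batch, vectors)])

store = []
articles = [{"id": f"kb-{i}", "text": "..."} for i in range(100)]
index_articles(articles, store.extend)
print(len(store))  # one vector per article
```

Threads work here because the time is spent waiting on the embedding API; if you embed locally on CPU, use processes instead. The same index_articles path handles both the full reindex and the publish/update event hook.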

Implementation tip for hybrid ranking: use a weighted scoring approach so you can tune the balance between keyword and semantic signals. We use a 0.6 weight for the BM25 score and 0.4 for embedding similarity, but the split varies by query characteristics: short queries (1-2 words) weight BM25 higher (0.75/0.25), while longer natural-language queries weight semantic similarity higher (0.35/0.65). This adaptive weighting improved relevance a further 8% beyond static weighting. Query classification happens at search time using simple heuristics: query length, presence of boolean operators, and exact-phrase quotes.
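In sketch form, the adaptive weighting looks like this; the length thresholds and the keyword-heavy treatment of boolean/quoted queries are illustrative assumptions, and both scores should be normalized to comparable ranges (e.g. [0, 1]) before mixing:

```python
def hybrid_weights(query):
    """Choose (BM25 weight, semantic weight) from simple query heuristics."""
    terms = query.split()
    if '"' in query or any(op in terms for op in ("AND", "OR", "NOT")):
        return 0.75, 0.25   # exact phrases / boolean operators: favor keywords
    if len(terms) <= 2:
        return 0.75, 0.25   # short queries: favor BM25
    if len(terms) >= 5:
        return 0.35, 0.65   # longer natural-language queries: favor semantic
    return 0.6, 0.4         # default static split

def hybrid_score(bm25_norm, sem_norm, query):
    """Blend normalized BM25 and embedding-similarity scores per query."""
    w_kw, w_sem = hybrid_weights(query)
    return w_kw * bm25_norm + w_sem * sem_norm

print(hybrid_weights("refund"))                                # (0.75, 0.25)
print(hybrid_weights("how long does a refund take to process"))  # (0.35, 0.65)
```

Normalization matters more than the exact weights: raw BM25 scores are unbounded while cosine similarity is not, so min-max normalize per result set or the keyword signal will dominate regardless of the split.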