How Kira finds, ranks, and generates clinical answers.
A hybrid retrieval-augmented generation pipeline with multi-stage search, cross-encoder reranking, agentic tool orchestration, and constitutional safety — evaluated across 438 gold queries with bootstrap confidence intervals.
Eight stages from query to grounded response.
Each query passes through synonym expansion, dual-path retrieval, cross-encoder reranking, and agentic tool orchestration before reaching the LLM with constitutional safety enforcement.
Query
0 ms: User input parsed and classified
Expansion
~1 ms: 80+ clinical synonym mappings
Embedding
~50 ms: 384-dim dense vectors
Hybrid Search
~20 ms: 60% vector + 40% BM25 fusion
Reranking
~300 ms: Cross-encoder, 20 → 8 candidates
Agentic Loop
~1.5 s: 4 tools, max 3 rounds
Safety
In-prompt: Constitutional + output guard
Response
~2 s total: SSE stream with sources
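The expansion stage above can be pictured with a minimal sketch. It assumes a flat lookup table and whole-word matching; the `SYNONYMS` entries and the `expandQuery` name are hypothetical illustrations, not Kira's actual mappings or API.

```typescript
// Hypothetical synonym table for illustration only; the real pipeline is
// described as having 80+ clinical mappings, which are not reproduced here.
const SYNONYMS: Record<string, string[]> = {
  depression: ["major depressive disorder", "mdd"],
  ocd: ["obsessive-compulsive disorder"],
  ptsd: ["posttraumatic stress disorder"],
};

// Escape regex metacharacters so a term can be embedded in a pattern safely.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Append synonyms for every known term that appears as a whole word in the
// query (case-insensitive), skipping synonyms the query already contains.
function expandQuery(
  query: string,
  synonyms: Record<string, string[]> = SYNONYMS,
): string {
  const lower = query.toLowerCase();
  const additions: string[] = [];
  for (const term of Object.keys(synonyms)) {
    const pattern = new RegExp("\\b" + escapeRegExp(term) + "\\b");
    if (!pattern.test(lower)) continue;
    for (const alt of synonyms[term]) {
      if (lower.indexOf(alt) === -1 && additions.indexOf(alt) === -1) {
        additions.push(alt);
      }
    }
  }
  return additions.length > 0 ? query + " " + additions.join(" ") : query;
}
```

Appending synonyms instead of replacing the original term keeps the user's wording available to the BM25 path while giving the retriever the formal clinical name as well.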
Measured against 438 gold queries across five evaluation categories.
Scope, clinical depth, differential diagnosis, safety, and edge cases — each with bootstrap 95% confidence intervals.
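A bootstrap 95% confidence interval of the kind mentioned above can be sketched as follows. This is an illustration under assumptions: it uses the percentile method, a small seeded generator for reproducibility, and hypothetical names (`bootstrapCI`, `seededRandom`); the actual eval harness may differ.

```typescript
// Small seeded PRNG (mulberry32-style) so repeated runs give the same interval.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface Interval {
  mean: number;
  lower: number;
  upper: number;
}

// Percentile bootstrap: resample the per-query scores with replacement,
// record each resample's mean, then read off the 2.5th and 97.5th percentiles.
function bootstrapCI(scores: number[], resamples = 1000, seed = 42): Interval {
  if (scores.length === 0) throw new Error("scores must be non-empty");
  const rand = seededRandom(seed);
  const n = scores.length;
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      sum += scores[Math.floor(rand() * n)];
    }
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const at = (q: number): number =>
    means[Math.min(resamples - 1, Math.floor(q * resamples))];
  const mean = scores.reduce((s, x) => s + x, 0) / n;
  return { mean, lower: at(0.025), upper: at(0.975) };
}
```

Here `scores` would hold one value per query, for example 1 when an expected source appears in the top 5 and 0 otherwise, so the interval describes the uncertainty of the averaged metric.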
Run the eval harness to generate benchmark data:
npm run eval
The numbers behind the pipeline.
Knowledge Base
88
DSM-5-TR disorders
8,753
Search chunks
10
Personality disorders
25
Screener instruments
Search Pipeline
384d
Embedding dimensions
60/40
Vector / BM25 weight
80+
Synonym mappings
20→8
Candidates → results
Safety Pipeline
3-tier
Safety classification
988
Crisis escalation
20/min
Rate limit burst
In-prompt
Constitutional principles
Evaluation
438
Gold evaluation queries
92%
Recall@5
4.55
Groundedness (of 5)
4.91
Relevance (of 5)
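For readers unfamiliar with the retrieval metrics quoted here, the sketch below shows one common way to compute Recall@k and MRR. It assumes Recall@k is the per-query fraction of expected sources found in the top k, averaged over queries; the page does not state which definition Kira's harness uses, and the `EvalCase` shape and function names are hypothetical.

```typescript
// One evaluation query: the source IDs a correct answer should draw on,
// and the IDs the retriever actually returned, best first.
interface EvalCase {
  expected: string[];
  retrieved: string[];
}

// Recall@k (assumed definition): share of expected sources present in the
// top k results, averaged across queries.
function recallAtK(cases: EvalCase[], k: number): number {
  const perQuery = cases.map((c) => {
    const topK = c.retrieved.slice(0, k);
    const hits = c.expected.filter((id) => topK.indexOf(id) !== -1).length;
    return c.expected.length > 0 ? hits / c.expected.length : 0;
  });
  return perQuery.reduce((s, x) => s + x, 0) / cases.length;
}

// MRR: reciprocal of the rank of the first relevant result (0 if none),
// averaged across queries.
function meanReciprocalRank(cases: EvalCase[]): number {
  const perQuery = cases.map((c) => {
    for (let i = 0; i < c.retrieved.length; i++) {
      if (c.expected.indexOf(c.retrieved[i]) !== -1) return 1 / (i + 1);
    }
    return 0;
  });
  return perQuery.reduce((s, x) => s + x, 0) / cases.length;
}
```

Both functions assume a non-empty list of cases; with an empty list the average is undefined and the result is `NaN`.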
Each component earns its place in the pipeline.
Retrieval quality across search methods, measured on queries with known expected sources and bootstrap confidence intervals.
| Method | Recall@3 | Recall@5 | Recall@8 | MRR | NDCG@10 |
|---|---|---|---|---|---|
| BM25 Only | 72.3% | 83.5% | 88.2% | 68.4% | 71.2% |
| Hybrid (Vector + BM25, production) | 89.9% | 92.0% | 93.1% | 87.7% | 87.9% |
| Hybrid + Reranking | 90.4% | 92.3% | 93.3% | 88.5% | 88.5% |
Hybrid fusion uses 60/40 vector/BM25 weighting with reciprocal rank fusion. Bootstrap 95% CIs are computed over 1,000 resamples of the 107-query test split. Reranking adds ~300 ms of latency for marginal gains (Recall@5 rises from 92.0% to 92.3%), so it is disabled in production.
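One plausible reading of "60/40 weighting with reciprocal rank fusion" is a weighted RRF, sketched below. The smoothing constant `k = 60` is a conventional default and an assumption here, as are the function name and the alphabetical tie-break; the page does not specify how Kira combines the weights with the ranks.

```typescript
// Weighted reciprocal rank fusion: each ranked list contributes
// weight / (k + rank) to a document's score, with rank starting at 1.
// The 0.6 / 0.4 weights come from the text; k = 60 is an assumed default.
function weightedRrf(
  vectorRanked: string[],
  bm25Ranked: string[],
  wVector = 0.6,
  wBm25 = 0.4,
  k = 60,
): string[] {
  const scores: Record<string, number> = {};
  const add = (ranked: string[], weight: number): void => {
    ranked.forEach((id, index) => {
      scores[id] = (scores[id] || 0) + weight / (k + index + 1);
    });
  };
  add(vectorRanked, wVector);
  add(bm25Ranked, wBm25);
  // Highest fused score first; ties broken by ID so the order is deterministic.
  return Object.keys(scores).sort(
    (a, b) => scores[b] - scores[a] || a.localeCompare(b),
  );
}
```

Because the fusion works on ranks, the vector similarities and BM25 scores never need to be normalized onto a common scale. Taking the first 20 entries of the result would yield the candidate pool the reranking stage is described as narrowing to 8.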
Clinical relationships extracted from structured DSM-5-TR data.
Interactive visualization of comorbidity links, screening tool associations, differential rule-outs, and diagnostic category membership across 57 conditions and 44 instruments.
Try Kira on hard clinical questions and see the pipeline in action.
Ask about differential diagnosis, comorbidity patterns, screening interpretation, or treatment mechanisms — every answer is grounded in the knowledge base with source citations.