Technical architecture

How Kira finds, ranks, and generates clinical answers.

A hybrid retrieval-augmented generation pipeline with multi-stage search, cross-encoder reranking, agentic tool orchestration, and constitutional safety — evaluated across 438 gold queries with bootstrap confidence intervals.

Search pipeline

Eight stages from query to grounded response.

Each query passes through synonym expansion, embedding, hybrid vector and BM25 retrieval, cross-encoder reranking, and agentic tool orchestration. The LLM then generates the response under constitutional safety enforcement.

Query

0ms

User input parsed and classified

Expansion

~1ms

80+ clinical synonym mappings

Embedding

~50ms

384-dim dense vectors

Hybrid Search

~20ms

60% vector + 40% BM25 fusion

Reranking

~300ms

Cross-encoder 20 → 8

Agentic Loop

~1.5s

4 tools, max 3 rounds

Safety

In-prompt

Constitutional + output guard

Response

~2s total

SSE stream with sources
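The agentic loop stage above is bounded: at most 3 rounds of tool calls before the response is generated. A minimal Python sketch of that bound follows. It is an illustration under stated assumptions, not Kira's implementation: the `Tool` and `Planner` types, the `run_agentic_loop` name, and the idea that a planner returns `None` to stop early are all hypothetical. Only the 3-round cap comes from the stage description.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types: a tool maps a string argument to a text observation.
Tool = Callable[[str], str]
# Hypothetical planner: returns (tool_name, argument), or None to stop early.
Planner = Callable[[str, List[str]], Optional[Tuple[str, str]]]

MAX_ROUNDS = 3  # from the stage description: "max 3 rounds"


def run_agentic_loop(
    query: str,
    tools: Dict[str, Tool],
    plan: Planner,
    max_rounds: int = MAX_ROUNDS,
) -> List[str]:
    """Collect tool observations until the planner stops or rounds run out."""
    observations: List[str] = []
    for _ in range(max_rounds):
        step = plan(query, observations)
        if step is None:  # planner decided it has enough context
            break
        name, arg = step
        if name not in tools:  # record the miss instead of crashing
            observations.append(f"unknown tool: {name}")
            continue
        observations.append(tools[name](arg))
    return observations
```

The hard cap keeps worst-case latency predictable: even a planner that never stops can trigger only `max_rounds` tool calls.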

Retrieval benchmarks

Measured against 438 gold queries across five evaluation categories.

Scope, clinical depth, differential diagnosis, safety, and edge cases — each with bootstrap 95% confidence intervals.

Run the eval harness to generate benchmark data:

npm run eval

System architecture

The numbers behind the pipeline.

Knowledge Base

88

DSM-5-TR disorders

8,753

Search chunks

10

Personality disorders

25

Screener instruments

Search Pipeline

384d

Embedding dimensions

60/40

Vector / BM25 weight

80+

Synonym mappings

20→8

Candidates → results
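One way to apply a 60/40 vector/BM25 weight is weighted reciprocal rank fusion, where each ranker contributes `weight / (k + rank)` to a document's score. The Python sketch below is illustrative only: the function name `weighted_rrf`, the smoothing constant `k = 60`, and the tie-breaking rule are assumptions, not details confirmed by the figures above. Only the 60/40 weights and the 20-candidate cutoff are taken from them.

```python
from typing import Dict, List


def weighted_rrf(
    vector_ranked: List[str],
    bm25_ranked: List[str],
    w_vector: float = 0.6,  # 60% vector weight, from the figures above
    w_bm25: float = 0.4,    # 40% BM25 weight, from the figures above
    k: int = 60,            # assumed smoothing constant, a common RRF default
    top_n: int = 20,        # candidate count kept before any reranking
) -> List[str]:
    """Fuse two best-first rankings of document ids into one ranked list."""
    scores: Dict[str, float] = {}
    for weight, ranking in ((w_vector, vector_ranked), (w_bm25, bm25_ranked)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]
```

Because the fusion works on ranks rather than raw scores, the cosine similarities and BM25 scores never need to be normalized onto a shared scale.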

Safety Pipeline

3-tier

Safety classification

988

Crisis escalation

20/min

Rate limit burst

In-prompt

Constitutional principles
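A burst limit of 20 requests per minute is often enforced with a token bucket. The sketch below is one possible reading of that figure, not a description of Kira's limiter: the `TokenBucket` class, its refill policy, and the injectable `clock` parameter are hypothetical. Only the 20/min figure comes from the stats above.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled continuously over time."""

    def __init__(
        self,
        capacity: int = 20,                 # burst size, from "20/min"
        refill_per_sec: float = 20 / 60,    # assumed: full refill in 60 s
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Consume one token if available; return whether the call may proceed."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

In practice one bucket would be kept per client key (for example, per session or IP address), which is also an assumption here.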

Evaluation

438

Gold evaluation queries

92%

Recall@5

4.55

Groundedness (of 5)

4.91

Relevance (of 5)
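For reference, Recall@k and reciprocal rank can be computed per query as sketched below. This is a hedged Python illustration using common textbook definitions. The exact definitions used by the eval harness are not stated here, so treating Recall@k as the fraction of expected sources found in the top k is an assumption, as are the function names.

```python
from typing import List, Set


def recall_at_k(ranked: List[str], expected: Set[str], k: int) -> float:
    """Fraction of expected sources that appear in the top-k results."""
    if not expected:
        return 0.0
    return len(set(ranked[:k]) & expected) / len(expected)


def reciprocal_rank(ranked: List[str], expected: Set[str]) -> float:
    """1 / rank of the first expected source, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0


def mean(values: List[float]) -> float:
    """Average per-query scores; MRR is the mean of reciprocal ranks."""
    return sum(values) / len(values) if values else 0.0
```

Averaging `recall_at_k(..., k=5)` over all queries gives a corpus-level Recall@5, and averaging `reciprocal_rank` gives MRR.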

Ablation study

Each component earns its place in the pipeline.

Retrieval quality across search methods, measured on queries with known expected sources and bootstrap confidence intervals.

Method                               Recall@3  Recall@5  Recall@8  MRR    NDCG@10
BM25 Only                            72.3%     83.5%     88.2%     68.4%  71.2%
Hybrid (Vector + BM25) [Production]  89.9%     92.0%     93.1%     87.7%  87.9%
Hybrid + Reranking                   90.4%     92.3%     93.3%     88.5%  88.5%

Hybrid fusion combines the vector and BM25 rankings with reciprocal rank fusion, weighted 60/40. Bootstrap 95% CIs are computed over 1,000 resamples of the 107-query test split. Reranking adds ~300ms of latency for marginal gains (+0.3 points Recall@5, +0.8 points MRR), so it is disabled in production.
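A percentile bootstrap over per-query scores is one standard way to obtain such intervals. The Python sketch below assumes that method. The function name `bootstrap_ci`, the fixed seed, and the percentile indexing are illustrative choices, and the source does not say which bootstrap variant the harness uses. The 1,000-resample default matches the note above.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    per_query: List[float],
    n_resamples: int = 1000,  # matches the 1,000 resamples noted above
    alpha: float = 0.05,      # 95% interval
    seed: int = 0,            # fixed seed so the sketch is reproducible
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-query metric values."""
    if not per_query:
        raise ValueError("need at least one per-query score")
    rng = random.Random(seed)
    n = len(per_query)
    # Resample queries with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choices(per_query, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

Passing the 107 per-query Recall@5 values for one method would yield that method's interval. Resampling whole queries keeps each query's score intact, so the interval reflects variation across queries.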

Knowledge graph

Clinical relationships extracted from structured DSM-5-TR data.

Interactive visualization of comorbidity links, screening tool associations, differential rule-outs, and diagnostic category membership across 57 conditions and 44 instruments.


Explore the system

Try Kira on hard clinical questions and see the pipeline in action.

Ask about differential diagnosis, comorbidity patterns, screening interpretation, or treatment mechanisms — every answer is grounded in the knowledge base with source citations.