The problem with traditional RAG

Traditional RAG pipelines follow the same pattern: chunk the document, embed each chunk into a vector, store the vectors in a vector database, retrieve by cosine similarity, and feed the results to the model. It's elegant in theory. In practice, it has three fatal flaws.

First, chunk size is a dead end. Small chunks retrieve precisely but lack context. Large chunks have context but retrieve poorly. No fixed size wins on both dimensions. Second, cosine similarity finds text that is semantically similar to the question, not text that answers it. Those aren't the same thing. Third, the retrieval step is pure math. The model has zero say in what gets retrieved, or why.
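To make the pattern concrete, here is a minimal sketch of the chunk, embed, and rank-by-cosine loop described above. It is illustrative only: `embed` is a toy bag-of-words counter standing in for a real embedding model, the in-memory list stands in for a vector database, and the document text and function names are made up for the example. A real embedding model may rank these two chunks differently; the point is only to show the mechanism.

```python
import math
import re
from collections import Counter


def chunk_words(text: str, size: int) -> list[str]:
    """Split text into fixed-size chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Rank chunks by similarity to the query and return the top k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical two-sentence document, eight words per sentence.
doc = ("Customers often ask what the refund window is. "
       "Any purchase can be returned within thirty days.")
chunks = chunk_words(doc, size=8)
top = retrieve("What is the refund window?", chunks, k=1)
print(top[0])
```

In this toy setup the top-ranked chunk is the sentence that restates the question, not the one that answers it, and nothing in the ranking step can notice the difference.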

A different approach

ClawIndex skips all of that. Instead of embedding chunks, we build a structured section index of the full document — headers, hierarchy, and enough context for a model to reason about what belongs where. Instead of cosine similarity, the model reads the index and decides what's relevant. The retrieval step is the reasoning step — one pass, not two.

We're calling this category Reasoning-Native Retrieval (RNR). The retriever isn't a lookup table. It's a reasoner.
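As a rough illustration of the idea, the sketch below builds a section index from heading lines and hands it to a reasoner that picks section ids. This is not ClawIndex's actual implementation or index format, which this post does not specify. It assumes the source document uses "#"-style headings, and every name in it (`Section`, `build_index`, `render_index`, `retrieve`, `stub_reasoner`) is hypothetical. The `reasoner` argument stands in for a model call; here it is a hard-coded stub so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Section:
    id: int
    level: int   # heading depth: 1 for "#", 2 for "##", and so on
    title: str
    path: str    # heading trail from the root, e.g. "Billing > Refunds"
    body: str


def build_index(text: str) -> list[Section]:
    """Parse '#'-style headings into a flat list that records hierarchy."""
    sections: list[Section] = []
    trail: list[tuple[int, str]] = []  # stack of (level, title) ancestors
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            level = len(stripped) - len(stripped.lstrip("#"))
            title = stripped[level:].strip()
            # Drop siblings and deeper headings before pushing this one.
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, title))
            path = " > ".join(t for _, t in trail)
            sections.append(Section(len(sections), level, title, path, ""))
        elif sections and stripped:
            sec = sections[-1]
            sec.body = (sec.body + " " + stripped).strip()
    return sections


def render_index(sections: list[Section], preview_chars: int = 60) -> str:
    """Render the index as text a model can read: id, path, short preview."""
    return "\n".join(
        f"[{s.id}] {s.path} :: {s.body[:preview_chars]}" for s in sections
    )


def retrieve(question: str, sections: list[Section],
             reasoner: Callable[[str], list[int]]) -> list[Section]:
    """Ask the reasoner which section ids are relevant and return them."""
    prompt = (f"Question: {question}\n\nSection index:\n"
              f"{render_index(sections)}\n\nReturn the relevant section ids.")
    by_id = {s.id: s for s in sections}
    return [by_id[i] for i in reasoner(prompt) if i in by_id]


def stub_reasoner(prompt: str) -> list[int]:
    """Placeholder for a real model call; always picks section 2."""
    return [2]


doc = """# Billing
Overview of billing.
## Invoices
Invoices are sent monthly.
## Refunds
Refunds are issued within thirty days.
# Support
Contact support by email."""

index = build_index(doc)
hits = retrieve("How long do I have to get a refund?", index, stub_reasoner)
print(render_index(index))
print(hits[0].path, "->", hits[0].body)
```

The design point is that the thing being searched is a readable outline rather than a set of vectors, so the selection step can be an ordinary model call that sees the whole structure at once.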

Benchmark results

We tested ClawIndex against a FAISS baseline on 20 questions from the HotpotQA distractor validation set, each requiring multi-hop reasoning across multiple Wikipedia articles. FAISS missed one (the David Beckham / Manchester United multi-hop question). ClawIndex got all 20, with zero hallucinations.

System            Hit Rate       Hallucinations   Avg Latency
ClawIndex (RNR)   100% (20/20)   0                ~4.8s
FAISS             95% (19/20)    N/A              ~62ms

The tradeoff

ClawIndex runs at ~4.8s per query. FAISS runs at ~62ms, roughly 77 times faster. That's a real difference and we're not pretending otherwise. But for async pipelines, batch document processing, compliance workflows, or any use case where a wrong answer costs more than a slow one, that's the right tradeoff.

Real-time chat is the one use case where FAISS's latency advantage settles the question. Every other use case is a conversation.

No embeddings. No vector DB. No cosine similarity. No chunking. Just a model and a structured index. We think this changes something.

What's next

A 100-question stress test is underway. After that: temporal awareness (RNR + staleness scoring), confidence propagation, and a multi-agent memory layer where the reasoner queries across multiple indexes simultaneously. Results will be posted here.