ClawIndex: Reasoning-Native Retrieval for AI Agents

Abstract

We present ClawIndex, a retrieval architecture that replaces embedding-based similarity search with model-native reasoning. Traditional Retrieval-Augmented Generation (RAG) pipelines depend on vector databases, embedding models, and cosine similarity to retrieve relevant context — a pipeline with well-documented structural failures on real-world queries.

ClawIndex eliminates this stack entirely. Instead, a language model reasons over a structured section-level index to identify relevant documents and return exact file and line pointers.

In benchmarks against FAISS on HotpotQA (n=150: 100 single-hop and 50 multi-hop queries), ClawIndex achieves a 99% single-hop hit rate (vs 96% for FAISS) and an 80% multi-hop hit rate (vs 62% for FAISS), with zero hallucinations across all 150 queries, running on commodity hardware with no external infrastructure.

1. Introduction

The dominant approach to giving AI systems access to large knowledge bases is Retrieval-Augmented Generation (RAG). RAG pipelines follow a fixed pattern:

Standard RAG Pipeline

Document → Chunk → Embed → Vector DB → Cosine Similarity → LLM

This pattern has become industry standard. Nearly every major AI framework — LangChain, LlamaIndex, Haystack — implements it as a first-class abstraction. Vector database companies have raised hundreds of millions of dollars on the premise that this architecture is correct.

We argue it is not — and that the evidence is hiding in plain sight.

2. Why RAG Is Broken

2.1 The Chunking Paradox

Every RAG implementation must choose a chunk size. This choice is a dilemma with no winning option:

Small chunks (128–256 tokens) give high retrieval precision, but individual chunks lack the context needed to answer questions. A sentence is retrieved; its meaning requires the surrounding paragraphs.

Large chunks (1024+ tokens) provide rich context, but cosine similarity degrades because the embedding must represent many ideas at once.

No fixed chunk size resolves this. The field has responded with hierarchical chunking, sliding windows, and parent-child chunk trees — all of which add complexity without addressing the root cause.
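The small-chunk half of the tradeoff can be seen in a minimal sketch. It assumes whitespace-separated words as a stand-in for model tokens and uses an invented three-sentence passage; `chunk_tokens` is a hypothetical helper, not part of any RAG library.

```python
def chunk_tokens(tokens, size):
    """Split a token list into consecutive fixed-size windows."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

# Invented passage; whitespace words stand in for real tokenizer output.
text = ("Acme was founded in 1999 by Jane Roe . "
        "It began as a two person repair shop . "
        "Today it employs 400 people .")
tokens = text.split()

small = chunk_tokens(tokens, 4)    # precise windows, little context
large = chunk_tokens(tokens, 64)   # full context, many ideas per window

# The first small chunk names the event but cuts off before the year.
print(small[0])
print(len(small), len(large))
```

With a window of four, the chunk that best matches a "founded" query no longer contains the founding year, while the single large chunk mixes founding, history, and headcount in one unit.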

2.2 Similarity ≠ Relevance

Cosine similarity finds text that is semantically proximate to the query. This is not the same as finding text that answers the query.

Example: Query — "When was the company founded?"

A RAG system retrieves a passage about the company's mission statement (semantically similar: same entity, same topic) while missing the founding date buried in a biography (different style, lower cosine score — but actually answers the question).

Vector similarity is a proxy for relevance. When it fails, there is no recovery mechanism.
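A toy version of this failure can be reproduced with nothing but word counts. The sketch below is an illustration under loud assumptions: it uses bag-of-words vectors rather than a learned embedding model, and the two passages are invented. Real embeddings capture more than word overlap, but the ranking is still driven by proximity to the query rather than by whether the passage contains the answer.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: a toy stand-in for a learned embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

query = "when was the company founded"
passages = {
    # Same entity and topic as the query, but no founding date.
    "mission": "the company mission is to serve the company customers",
    # Contains the answer, phrased in a different style.
    "biography": "she started acme in 1999",
}

scores = {name: cosine(bow(query), bow(text)) for name, text in passages.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

The mission passage ranks first because it shares vocabulary with the query; the biography, which holds the date, shares none and scores zero.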

2.3 No Reasoning in Retrieval

The retrieval step in traditional RAG is pure mathematics. The model has zero input into what gets retrieved. This creates a fundamental disconnect: the same model that will reason over the retrieved context has no say in what context it receives.

ClawIndex collapses this gap. The retrieval step is a reasoning step.

2.4 Multi-Hop Failure

Multi-hop questions require connecting information across two or more documents. For example, "What is the nationality of the director of [Film X]?" requires the chain Film X → director name → director nationality, which spans two separate documents.

Vector similarity retrieves the most similar document to the query — typically the first-hop document (the film). The second-hop document (director biography) has lower similarity to the original query and is frequently missed.

This is a structural failure — not a tuning problem. No amount of prompt engineering or chunk size optimization fixes it.

3. The ClawIndex Architecture

3.1 Core Insight

The retrieval step should be a reasoning step.

Instead of computing cosine similarity between query embeddings and document embeddings, ClawIndex asks the model directly: "Given this index of all available sections, which sections are most likely to contain the answer to this question? Return exact pointers."

The model reasons over a compact, structured index — not raw documents. It returns section-level pointers. Only those sections are fetched and loaded into context.
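As a rough sketch of what such a reasoning pass might look like in code, the helpers below build the prompt from an index and parse the reply. Everything here is an assumption for illustration: the function names, the `id` field, and the choice of a JSON list as the reply format are hypothetical and are not ClawIndex's documented interface. The model call itself is left out.

```python
import json

def build_retrieval_prompt(index, question, max_pointers=3):
    """Assemble the retrieval question over the whole index (hypothetical format)."""
    return (
        "Given this index of all available sections, which sections are most "
        "likely to contain the answer to this question? Return exact pointers "
        f"as a JSON list of at most {max_pointers} section ids.\n\n"
        f"INDEX:\n{json.dumps(index, indent=2)}\n\n"
        f"QUESTION: {question}"
    )

def parse_pointers(response_text):
    """Parse the model's reply; return [] unless it is a JSON list of strings."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [p for p in data if isinstance(p, str)]

# Invented two-section index and a hand-written reply standing in for the model.
index = [
    {"id": "acme.md#1", "title": "History"},
    {"id": "acme.md#2", "title": "Mission"},
]
prompt = build_retrieval_prompt(index, "When was Acme founded?")
pointers = parse_pointers('["acme.md#1"]')
```

Treating an unparseable reply as an empty result keeps malformed model output from propagating further down the pipeline.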

3.2 Pipeline

ClawIndex Pipeline

Documents → Section-Aware Indexer → JSON Index → LLM Reasoning Pass → Section Pointers → Answer

No embeddings · No vector database · No cosine similarity · No chunking

3.3 The Index Format

ClawIndex builds a structured index at the section level. Sections are natural semantic units defined by document headers rather than arbitrary token windows. Each index entry captures the section's position, hierarchy, and a content fingerprint sufficient for the model to assess relevance without loading the full document.

The full index is compact enough to fit in a single LLM context window — typically 5–15KB for a 50-file knowledge base. The model sees all available sections at once, reasons holistically, and returns ranked pointers.
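The exact index schema is not specified here, so the following is only a sketch of a section-aware indexer under stated assumptions: documents are Markdown, sections start at `#` headers, and the fingerprint is simply the first few words of the section body. The field names (`file`, `title`, `level`, `start_line`, `end_line`, `fingerprint`) are hypothetical.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")

def index_markdown(path, text, fingerprint_words=12):
    """Build one index entry per Markdown header section (hypothetical schema)."""
    lines = text.splitlines()
    entries = []
    for lineno, line in enumerate(lines, start=1):
        match = HEADER.match(line)
        if match:
            if entries:
                # The previous section ends on the line before this header.
                entries[-1]["end_line"] = lineno - 1
            entries.append({
                "file": path,
                "title": match.group(2).strip(),
                "level": len(match.group(1)),   # header depth = hierarchy
                "start_line": lineno,
                "end_line": len(lines),         # provisional: runs to end of file
                "_body": [],
            })
        elif entries:
            entries[-1]["_body"].append(line)
    for entry in entries:
        words = " ".join(entry.pop("_body")).split()
        entry["fingerprint"] = " ".join(words[:fingerprint_words])
    return entries

doc = "# Acme\nFounded in 1999.\n\n## Team\nJane Roe leads engineering.\n"
entries = index_markdown("acme.md", doc)
```

Each entry carries position (file and line range), hierarchy (header level), and a short fingerprint, which is the information the section above says the model needs to judge relevance without loading the full document.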

3.4 Multi-Hop Retrieval

For queries that require bridging across documents, ClawIndex executes up to three reasoning passes, each informed by the actual content retrieved in the previous pass — not just metadata or headers. This iterative approach mirrors how a skilled researcher works: find the first piece, use it to locate the second.

The key distinction from naive multi-pass retrieval: each subsequent pass is grounded in retrieved content, enabling the model to form precise bridge queries rather than repeating semantic noise from the original question.
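The control flow of content-grounded multi-pass retrieval can be sketched as a short loop. This is an illustration, not the ClawIndex implementation: `multi_hop_retrieve`, `select`, and `fetch` are hypothetical names, and `stub_select` is a rule-based stand-in for the LLM that simply matches section titles against the question plus everything retrieved so far.

```python
def multi_hop_retrieve(question, index, fetch, select, max_passes=3):
    """Run up to max_passes selection rounds, each seeing content fetched so far."""
    valid_ids = {entry["id"] for entry in index}
    retrieved = {}  # section id -> content, in retrieval order
    for _ in range(max_passes):
        proposed = select(question, index, retrieved)
        new_ids = [p for p in proposed if p in valid_ids and p not in retrieved]
        if not new_ids:
            break  # nothing new to fetch, so stop early
        for section_id in new_ids:
            retrieved[section_id] = fetch(section_id)
    return retrieved

# Invented corpus for the director example.
index = [
    {"id": "film_x.md#1", "title": "Film X"},
    {"id": "jane_roe.md#1", "title": "Jane Roe"},
    {"id": "other.md#1", "title": "Unrelated"},
]
content = {
    "film_x.md#1": "Film X was directed by Jane Roe.",
    "jane_roe.md#1": "Jane Roe is a Canadian director.",
    "other.md#1": "Nothing relevant here.",
}

def stub_select(question, index, retrieved):
    """Stand-in for the LLM: pick sections whose title appears in the context."""
    context = question + " " + " ".join(retrieved.values())
    return [entry["id"] for entry in index if entry["title"] in context]

result = multi_hop_retrieve(
    "What is the nationality of the director of Film X?",
    index, lambda section_id: content[section_id], stub_select,
)
print(list(result))
```

The first pass can only find the film, because the director's name is not in the question. The second pass sees the fetched film section, which names the director, and so reaches the biography.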

3.5 Hallucination Detection

Because ClawIndex returns exact document and section pointers, hallucination is directly measurable: a pointer either references a real section or it doesn't. Every result is validated against the index before any content is returned. Invalid pointers are discarded.

Result: 0 hallucinations across all 150 benchmark queries.
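Pointer validation of this kind reduces to a set-membership check. A minimal sketch, assuming pointers and index entries are dictionaries with hypothetical `file`, `start_line`, and `end_line` fields:

```python
def validate_pointers(pointers, index):
    """Split model-returned pointers into those present in the index and the rest."""
    valid = {(e["file"], e["start_line"], e["end_line"]) for e in index}
    kept, discarded = [], []
    for pointer in pointers:
        key = (pointer.get("file"), pointer.get("start_line"), pointer.get("end_line"))
        (kept if key in valid else discarded).append(pointer)
    return kept, discarded

# Invented index and model output; the second pointer names a file that does not exist.
index = [
    {"file": "acme.md", "start_line": 1, "end_line": 3},
    {"file": "acme.md", "start_line": 4, "end_line": 5},
]
returned = [
    {"file": "acme.md", "start_line": 4, "end_line": 5},
    {"file": "ghost.md", "start_line": 1, "end_line": 9},
]
kept, discarded = validate_pointers(returned, index)
```

Counting the entries in `discarded` across a benchmark run gives a direct hallucination count in the sense used above.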

4. Benchmark Results

4.1 Single-Hop: ClawIndex vs FAISS (n=100)

Dataset: HotpotQA distractor split, validation set. Hit = at least one supporting document retrieved. Run on Apple M4 Max, Qwen 27B via Ollama (local, no cloud API).

Method	Hit Rate	Avg Latency	Hallucinations
ClawIndex	99%	7,630ms	0
FAISS (all-MiniLM-L6-v2)	96%	55ms	N/A

Figure: single-hop hit rate, ClawIndex 99% vs FAISS 96%.

4.2 Multi-Hop: ClawIndex vs FAISS (n=50)

"Full hit" = both supporting documents retrieved (strict). "Bridge" = questions requiring cross-document entity chaining — the hardest multi-hop category.

Method	Full Hit Rate	Bridge Hit Rate	Avg Latency	Hallucinations
ClawIndex (3-pass)	80%	90.9%	10,697ms	0
FAISS (all-MiniLM-L6-v2)	62%	45.5%	76ms	N/A

By question type:

Question Type	ClawIndex	FAISS	n
Comparison	90.9%	77.3%	22
Bridge	90.9%	45.5%	11
Intersection	58.8%	52.9%	17
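As a consistency check, the per-type rates can be converted back to hit counts and summed, which reproduces the overall full hit rates reported above. The only assumption is that each published percentage is a rounded hits/n ratio.

```python
# Per-type results copied from the table above.
by_type = {
    "Comparison":   {"n": 22, "clawindex": 90.9, "faiss": 77.3},
    "Bridge":       {"n": 11, "clawindex": 90.9, "faiss": 45.5},
    "Intersection": {"n": 17, "clawindex": 58.8, "faiss": 52.9},
}

def overall(method):
    """Recover integer hit counts per type and aggregate them."""
    hits = sum(round(row["n"] * row[method] / 100) for row in by_type.values())
    total = sum(row["n"] for row in by_type.values())
    return hits, total, 100 * hits / total

print(overall("clawindex"))  # (40, 50, 80.0)
print(overall("faiss"))      # (31, 50, 62.0)
```

The counts work out to 20 + 10 + 10 = 40 of 50 for ClawIndex and 17 + 5 + 9 = 31 of 50 for FAISS, matching the 80% and 62% figures.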

4.3 Interpretation

Single-hop accuracy: ClawIndex outperforms FAISS 99% vs 96% while using no embedding model and no vector database.

Multi-hop accuracy: ClawIndex outperforms FAISS 80% vs 62% on strict multi-hop retrieval. On bridge-type questions, the hardest category, ClawIndex achieves 90.9% vs FAISS's 45.5%, a 45.4-point advantage. This is not a tuning win; it is a structural advantage. Single-pass vector similarity has no mechanism to chain across documents. Reasoning does.

Latency: ClawIndex is slower on cold queries (7.6s single-hop, 10.7s multi-hop vs under 100ms for FAISS). This is the expected tradeoff: reasoning takes more compute than dot products. For agent memory and knowledge retrieval workloads, where query rates are typically well below one query per second, this is acceptable. With LRU caching enabled, repeated queries return in under 50ms.

Hallucinations: ClawIndex's pointer validation guarantees that every returned pointer refers to a section that exists in the index, because each pointer is checked against the index before any content is returned. Zero hallucinations across all 150 benchmark queries.
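The LRU caching mentioned under Latency can be approximated with Python's standard `functools.lru_cache`. This is a sketch under assumptions: `slow_retrieve` is a hypothetical stand-in for the reasoning pass, the cache is keyed on the exact question string, and the cache size of 256 is arbitrary. How ClawIndex keys or sizes its own cache is not stated here.

```python
from functools import lru_cache

calls = {"count": 0}

def slow_retrieve(question):
    """Hypothetical stand-in for the multi-second reasoning pass."""
    calls["count"] += 1
    return ("notes.md", 10, 24)  # (file, start_line, end_line)

@lru_cache(maxsize=256)
def cached_retrieve(question):
    """Return cached pointers for a repeated question; reason only on a miss."""
    return slow_retrieve(question)

first = cached_retrieve("When was Acme founded?")   # cold: runs the reasoning pass
second = cached_retrieve("When was Acme founded?")  # warm: served from the cache
print(calls["count"], cached_retrieve.cache_info().hits)
```

A cache keyed this way only helps with exact repeats, and it should be emptied with `cached_retrieve.cache_clear()` whenever the index is rebuilt, since cached pointers may no longer match the documents.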

5. Where ClawIndex Wins

Use Case	ClawIndex	FAISS	Why
Agent long-term memory	✅	⚠️	Reasoning finds relevant decisions across sparse, dated logs
Multi-hop reasoning	✅	❌	Multi-pass bridge retrieval; vector similarity collapses on chained queries
Small-to-medium KB (<1,000 sections)	✅	✅	Both work well at this scale
Zero-infra deployment	✅	❌	No embedding model, no vector DB, no GPU required
Hallucination guarantees	✅	❌	Pointer validation; FAISS has no verification layer
High-frequency queries (>10/s)	❌	✅	Reasoning latency is unsuitable for real-time search
Very large KB (10,000+ sections)	⚠️	✅	Index grows; hierarchical indexing planned

6. Limitations

⚡ Latency

Cold retrieval takes ~7.6s vs milliseconds for vector similarity. Not suitable for real-time search. Warm (cached) queries return in under 50ms via LRU cache.

📦 Scale

The index-in-context approach works well up to ~1,000 sections (~50–100 files). Beyond this, the index may exceed context limits. Hierarchical indexing is planned.

🧠 Model Dependency

Retrieval quality depends on the underlying LLM's reasoning capability. Weaker models produce weaker retrieval. FAISS is model-independent.

7. The Agent Memory Application

ClawIndex was built to solve a specific problem: AI agents that need persistent, long-term memory across sessions — without stuffing 100+ files into every prompt.

The architecture enables a three-layer memory model:

Layer	Scope	Access
Hot	0–7 days	Full daily logs, loaded into context directly
Warm	Stable knowledge	Curated memory files, manually maintained
Cold	7+ days historical	Distilled summaries, ClawIndex-searchable only

Result: Agents maintain memory across months of sessions while keeping active context to a small set of files rather than the full corpus. Retrieval latency at this scale: ~7.6s cold, under 50ms warm (cached).
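The routing rule behind the three layers can be written in a few lines. This sketch makes two assumptions that the table does not settle: curated files are flagged explicitly by the caller, and a log exactly seven days old still counts as hot. The function name and parameters are hypothetical.

```python
from datetime import date

def memory_layer(log_date, today, curated=False, hot_days=7):
    """Classify a memory file as hot, warm, or cold (assumed boundary rules)."""
    if curated:
        return "warm"   # stable, manually maintained knowledge
    age_days = (today - log_date).days
    return "hot" if age_days <= hot_days else "cold"

today = date(2026, 3, 15)
print(memory_layer(date(2026, 3, 12), today))                # hot
print(memory_layer(date(2026, 1, 2), today))                 # cold
print(memory_layer(date(2025, 6, 1), today, curated=True))   # warm
```

Only files classified as cold would go through the reasoning-based search; hot and warm files are loaded into context directly.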

8. Roadmap

🗂️ Hierarchical indexing — Support KB >1,000 sections via index-of-indexes

⚡ Streaming index updates — Real-time KB changes reflected without full rebuild

📊 Extended benchmarks — ChromaDB, Weaviate, and BM25 hybrid comparison

📦 Python SDK — pip install clawindex

☁️ Cloud API — Agent memory as a managed service

9. Conclusion

ClawIndex demonstrates that embedding-free, reasoning-native retrieval is competitive with a FAISS vector-similarity baseline on a standard benchmark (HotpotQA), while eliminating the infrastructure overhead and hallucination risk that make traditional RAG pipelines fragile in production.

The core insight is simple: the model that will answer the question should also decide what to retrieve. Separating retrieval (pure math) from reasoning (the model) creates an unnecessary and lossy abstraction.

For agent memory, knowledge retrieval, and multi-hop reasoning workloads, where accuracy matters more than millisecond latency, ClawIndex offers a better tradeoff than the vector-similarity baseline tested here.

Interested in ClawIndex? We're building the Python SDK and a hosted API. Reach out at [email protected] to get early access.

Citation

Sholtis Labs. (2026). ClawIndex: Reasoning-Native Retrieval for AI Agents. Retrieved from https://sholtislabs.com/newsroom/clawindex-whitepaper