How It Works

The Cognitive Pipeline

Seven deterministic stages that transform raw documents into validated, evidence-backed, structured intelligence.

01

Ingestion

Any source. Any format.

Recora accepts PDFs, Word documents, audio transcripts, video captions, web content, and raw text. The pipeline begins the moment a document enters the system — no preprocessing required from the user.

PDF, DOCX, TXT, HTML
Audio/video transcript ingestion
Batch upload support
Source metadata preserved
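
As an illustration of batch ingestion with preserved source metadata, here is a minimal sketch. It is not Recora's actual API: the names `SourceDocument`, `ingest`, `ingest_batch`, and the `SUPPORTED` table are hypothetical, and the sketch only records metadata without parsing the file contents.

```python
import hashlib
from dataclasses import dataclass
from pathlib import PurePath

# Hypothetical suffix-to-media-type table; the real list of accepted formats may differ.
SUPPORTED = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".txt": "text/plain",
    ".html": "text/html",
}


@dataclass(frozen=True)
class SourceDocument:
    filename: str
    media_type: str
    sha256: str       # content hash, so later stages can identify the exact bytes
    size_bytes: int
    payload: bytes


def ingest(filename: str, payload: bytes) -> SourceDocument:
    """Wrap raw bytes with source metadata; reject unknown formats early."""
    suffix = PurePath(filename).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported format: {suffix or filename}")
    return SourceDocument(
        filename=filename,
        media_type=SUPPORTED[suffix],
        sha256=hashlib.sha256(payload).hexdigest(),
        size_bytes=len(payload),
        payload=payload,
    )


def ingest_batch(items: list[tuple[str, bytes]]) -> list[SourceDocument]:
    """Ingest several (filename, bytes) pairs in order."""
    return [ingest(name, data) for name, data in items]
```

Hashing the payload at the door is one way to keep later citations tied to a specific version of a file, even if the file is renamed or re-uploaded.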

02

Normalization

Clean. Structured. Consistent.

Raw input is normalized into a clean, consistent text layer. Formatting artifacts, OCR noise, and irrelevant metadata are stripped. What remains is a reliable foundation for downstream reasoning.

OCR correction & cleanup
Format-agnostic normalization
Language detection
Encoding standardization
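
A rough sketch of what encoding standardization and cleanup can look like, using only the Python standard library. The function name `normalize_text` and the specific cleanup rules are assumptions for illustration; real OCR correction and language detection need dedicated tooling and are not shown.

```python
import re
import unicodedata


def normalize_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes and reduce them to a consistent plain-text layer."""
    # Undecodable bytes become U+FFFD instead of raising.
    text = raw.decode(encoding, errors="replace")
    # NFKC folds compatibility characters: ligatures, non-breaking spaces, etc.
    text = unicodedata.normalize("NFKC", text)
    # Standardize line endings.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Heuristic: rejoin words hyphenated across a line break ("agree-\nment").
    # This can also merge genuinely hyphenated compounds, so treat it as optional.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, trim spaces around newlines,
    # and cap consecutive blank lines at one.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

For example, `normalize_text("ﬁnal  agree-\r\nment".encode("utf-8"))` returns `"final agreement"`.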

03

Chunking & Indexing

Semantic segmentation at scale.

The normalized text is segmented into semantically coherent chunks rather than cut at arbitrary character limits. Each chunk retains its position in the document hierarchy, so its surrounding context stays attached.

Semantic boundary detection
Hierarchical position tracking
Cross-chunk context preservation
Vector index generation
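
To make the idea of position-aware chunking concrete, here is a simplified sketch. It substitutes paragraph boundaries for true semantic boundary detection, and it assumes headings appear as standalone paragraphs starting with `# `; both are assumptions of this example, as are the names `Chunk` and `chunk_by_paragraph`. Vector indexing is not shown.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int     # order of the chunk within the document
    section: str   # nearest preceding heading, "" if none
    start: int     # character offsets into the normalized text
    end: int
    text: str


def chunk_by_paragraph(text: str) -> list[Chunk]:
    """Split on blank lines and remember where each chunk sits in the document."""
    chunks: list[Chunk] = []
    section = ""
    # Each match is one run of consecutive non-empty lines (a paragraph).
    for match in re.finditer(r"[^\n]+(?:\n[^\n]+)*", text):
        block = match.group()
        if block.startswith("# "):
            section = block[2:].strip()   # heading updates the current section
            continue
        chunks.append(Chunk(len(chunks), section, match.start(), match.end(), block))
    return chunks
```

Because every chunk carries its offsets and section, a later stage can always map a chunk back to the exact span of the source it came from.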

04

Evidence Extraction

Every claim. Every source.

Recora identifies and extracts direct evidence — quotes, numerical data, clauses, obligations, risks — and anchors each piece to its exact location in the source document. Nothing is inferred at this stage.

Direct quote extraction
Entity & obligation detection
Risk signal identification
Source location anchoring
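
The anchoring idea can be sketched with plain pattern matching: each piece of evidence is a verbatim span of the source plus its character offsets, with nothing paraphrased or inferred. The `Evidence` type, the `PATTERNS` table, and the two toy patterns below are hypothetical stand-ins; a production extractor would use far richer detection.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    kind: str
    quote: str   # verbatim text copied from the source
    start: int   # character offsets of the quote in the source
    end: int


# Toy patterns, for illustration only.
PATTERNS = {
    # Dollar amounts such as $1,200.50
    "amount": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
    # A clause containing "shall" or "must", up to the next period.
    "obligation": re.compile(r"[^.\n]*\b(?:shall|must)\b[^.\n]*\."),
}


def extract_evidence(text: str) -> list[Evidence]:
    """Return verbatim matches with their offsets, ordered by position."""
    found = [
        Evidence(kind, m.group(), m.start(), m.end())
        for kind, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(found, key=lambda e: e.start)
```

The useful invariant is that `text[e.start:e.end] == e.quote` holds for every result, which is what allows a citation to be checked mechanically against the source.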

05

Structured IR

Schema-first intermediate representation.

All extracted evidence is mapped to a schema-first Intermediate Representation — a typed data model where every fact has a defined structure. This is the core of what makes Recora different from any chat-based AI tool.

Domain-specific schemas
Typed data models
Relationship mapping
Cross-document linking
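
A typed data model of this kind might look like the following sketch. The classes `SourceAnchor` and `Fact`, the field names, and the set of allowed kinds are invented for illustration and do not describe Recora's real schemas.

```python
from dataclasses import dataclass

# Hypothetical set of fact kinds; a domain-specific schema would define its own.
ALLOWED_KINDS = ("amount", "obligation", "risk")


@dataclass(frozen=True)
class SourceAnchor:
    document_id: str
    start: int   # character offsets into the normalized source text
    end: int


@dataclass(frozen=True)
class Fact:
    fact_id: str
    kind: str
    value: str
    anchor: SourceAnchor

    def __post_init__(self) -> None:
        # Reject anything that does not fit the schema at construction time,
        # so malformed facts never reach later stages.
        if self.kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown fact kind: {self.kind!r}")
        if not self.value:
            raise ValueError("fact value must not be empty")
        if not 0 <= self.anchor.start < self.anchor.end:
            raise ValueError("anchor must be a non-empty, non-negative span")


fee = Fact("f1", "amount", "$1,200.50", SourceAnchor("contract-a", 49, 58))
```

Validating in the constructor means a fact either conforms to the schema or does not exist, which is the property the later stages rely on.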

06

LLM Transformation

LLM as engine — not oracle.

The LLM is applied only to the structured IR, never to raw text. It acts as a transformation engine that reshapes, summarizes, or analyzes structured data. Its output is constrained to the IR, so any statement without a matching fact there can be detected and rejected.

IR-constrained generation
LLM-agnostic architecture
Hallucination prevention
Configurable reasoning depth
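
One way to enforce IR-constrained generation is to check model output against the IR after the fact. The sketch below assumes, purely for illustration, that the model is instructed to cite facts inline as `[fact:ID]`; that citation format and both function names are hypothetical, and no LLM is called here.

```python
import re

# Assumed inline citation format: [fact:some-id]
CITATION = re.compile(r"\[fact:([A-Za-z0-9_-]+)\]")


def unsupported_citations(generated: str, known_ids: set[str]) -> set[str]:
    """Fact IDs cited in the output that do not exist in the IR."""
    return {fid for fid in CITATION.findall(generated) if fid not in known_ids}


def uncited_sentences(generated: str) -> list[str]:
    """Sentences that carry no citation at all (naive sentence splitting)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", generated) if s.strip()]
    return [s for s in sentences if not CITATION.search(s)]


draft = (
    "The fee is fixed [fact:f1]. "
    "Delivery is due in June [fact:f9]. "
    "Penalties are severe."
)
ir_ids = {"f1", "f2"}
print(unsupported_citations(draft, ir_ids))  # {'f9'}
print(uncited_sentences(draft))              # ['Penalties are severe.']
```

A draft that returns anything from either check can be rejected or regenerated before it reaches the artifact stage. Because the check works on text alone, it stays the same regardless of which model produced the draft.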

07

Artifact Generation

Outputs that can be trusted.

The final output is a structured artifact — a versioned, JSON-backed document with full citation chains linking every assertion back to its source. Artifacts are queryable, diffable, and integrable with your existing systems.

JSON-schema validated output
Full citation chains
Version control
API-ready artifact delivery
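
The shape of such an artifact can be sketched as follows. The field names (`assertions`, `citations`, `content_hash`), the integer versioning scheme, and the function `build_artifact` are assumptions of this example rather than Recora's actual output format.

```python
import hashlib
import json


def build_artifact(assertions: list[dict], facts: dict[str, dict], previous_version: int = 0) -> str:
    """Serialize assertions with their citation chains as canonical JSON.

    assertions: [{"text": str, "fact_ids": [str, ...]}, ...]
    facts:      {fact_id: {"document_id": str, "start": int, "end": int}}
    """
    body = []
    for item in assertions:
        missing = [fid for fid in item["fact_ids"] if fid not in facts]
        if missing or not item["fact_ids"]:
            # Every assertion must link to at least one known fact.
            raise ValueError(f"assertion lacks valid citations: {item['text']!r}")
        body.append({
            "text": item["text"],
            "citations": [{"fact_id": fid, **facts[fid]} for fid in item["fact_ids"]],
        })
    payload = {"version": previous_version + 1, "assertions": body}
    # Sorted keys and fixed separators give a byte-stable form for hashing.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    payload["content_hash"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return json.dumps(payload, sort_keys=True, indent=2)
```

Serializing with sorted keys keeps two versions of an artifact line-comparable, so an ordinary text diff shows exactly which assertions or citations changed.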

Why the IR Layer Changes Everything

Most chat-based AI tools pass raw text directly to an LLM and ask it to reason. Recora never does this. The Intermediate Representation (IR) layer means the LLM only ever sees structured, validated data. It is given no room to invent facts: the only facts available to it are those already extracted and verified from your source documents.

This is the architectural difference between probabilistic AI and deterministic reasoning infrastructure.