How It Works

The Cognitive Pipeline

Seven deterministic stages that transform raw documents into validated, evidence-backed, structured intelligence.

Ingestion

Any source. Any format.

Recora accepts PDFs, Word documents, audio transcripts, video captions, web content, and raw text. The pipeline begins the moment a document enters the system — no preprocessing required from the user.

PDF, DOCX, TXT, HTML

Audio/video transcript ingestion

Batch upload support

Source metadata preserved
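
As an illustration of the ingestion step described above, here is a minimal Python sketch that routes uploads by file extension and keeps source metadata alongside the raw bytes. Every name in it (`IngestedDocument`, `ingest`, `ingest_batch`, the `SUPPORTED` table) is hypothetical and chosen for this example; it is not Recora's actual API.

```python
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical mapping from file extension to an internal document kind.
SUPPORTED = {".pdf": "pdf", ".docx": "docx", ".txt": "text", ".html": "html"}


@dataclass
class IngestedDocument:
    source_name: str
    kind: str
    raw_bytes: bytes
    metadata: dict = field(default_factory=dict)


def ingest(name: str, payload: bytes) -> IngestedDocument:
    """Accept one upload and record where it came from."""
    suffix = Path(name).suffix.lower()
    if suffix not in SUPPORTED:
        raise ValueError(f"unsupported format: {suffix or name}")
    metadata = {"extension": suffix, "size_bytes": len(payload)}
    return IngestedDocument(name, SUPPORTED[suffix], payload, metadata)


def ingest_batch(files: dict[str, bytes]) -> list[IngestedDocument]:
    """Batch upload: ingest every (name, payload) pair in order."""
    return [ingest(name, payload) for name, payload in files.items()]
```

The payload is stored untouched, so later stages can always return to the original bytes.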

Normalization

Clean. Structured. Consistent.

Raw input is normalized into a clean, consistent text layer. Formatting artifacts, OCR noise, and irrelevant metadata are stripped. What remains is a reliable foundation for downstream reasoning.

OCR correction & cleanup

Format-agnostic normalization

Language detection

Encoding standardization
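
A rough Python sketch of what a normalization pass can look like, using only the standard library. The function name `normalize_text` and the specific cleanup rules are assumptions made for illustration; real OCR correction and language detection would need considerably more than this.

```python
import re
import unicodedata


def normalize_text(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes and reduce them to a clean, consistent text layer."""
    # Encoding standardization: undecodable bytes become U+FFFD.
    text = raw.decode(encoding, errors="replace")
    # NFKC folds compatibility characters (e.g. ligatures) to plain forms.
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop control characters except newline and tab.
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Heuristic: re-join words hyphenated across a line break.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces/tabs and trim spaces around newlines.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

The hyphen re-joining rule is a heuristic and can merge genuine hyphenated compounds that happen to break at a line end, so a production pass would likely check candidates against a dictionary.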

Chunking & Indexing

Semantic segmentation at scale.

The normalized text is segmented into semantically coherent chunks rather than split at arbitrary character limits. Each chunk retains its position in the document hierarchy, so context is never severed from meaning.

Semantic boundary detection

Hierarchical position tracking

Cross-chunk context preservation

Vector index generation
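
To make hierarchical position tracking concrete, here is a simplified Python sketch. It uses blank lines and Markdown-style `#` headings as a stand-in for semantic boundary detection, which is an assumption of this example and not a description of how Recora detects boundaries. The names `Chunk` and `chunk_by_structure` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    heading_path: tuple[str, ...]  # e.g. ("Terms", "Payment")
    start: int  # character offset of the block in the normalized text
    end: int


def chunk_by_structure(text: str) -> list[Chunk]:
    """Split on blank lines, tagging each chunk with its heading path."""
    chunks: list[Chunk] = []
    path: list[str] = []
    offset = 0
    for block in text.split("\n\n"):
        start = offset
        offset += len(block) + 2  # +2 for the "\n\n" separator
        stripped = block.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            # A heading of level N replaces everything at depth >= N.
            level = len(stripped) - len(stripped.lstrip("#"))
            path = path[: level - 1] + [stripped.lstrip("#").strip()]
            continue
        chunks.append(Chunk(stripped, tuple(path), start, start + len(block)))
    return chunks
```

Because each chunk carries both its heading path and its character offsets, a later stage can cite the section and the exact span a passage came from.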

Evidence Extraction

Every claim. Every source.

Recora identifies and extracts direct evidence — quotes, numerical data, clauses, obligations, risks — and anchors each piece to its exact location in the source document. Nothing is inferred at this stage.

Direct quote extraction

Entity & obligation detection

Risk signal identification

Source location anchoring
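
The idea of anchoring evidence to an exact source location can be sketched in a few lines of Python. The two regular expressions below (dollar amounts, and sentences containing "shall" or "must") are deliberately naive placeholders, and `Evidence`, `PATTERNS`, and `extract_evidence` are hypothetical names. What matters is that every extracted item is a verbatim span with offsets and nothing is paraphrased or inferred.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    kind: str
    text: str   # verbatim span copied from the source
    start: int  # character offsets into the source text
    end: int


# Illustrative patterns only; a real extractor would be far richer.
PATTERNS = {
    "amount": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
    "obligation": re.compile(r"[^.\s][^.]*\b(?:shall|must)\b[^.]*\."),
}


def extract_evidence(text: str) -> list[Evidence]:
    """Return every pattern match together with its exact location."""
    found = [
        Evidence(kind, match.group(0), match.start(), match.end())
        for kind, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return sorted(found, key=lambda item: item.start)
```

Since `text[item.start:item.end]` always equals `item.text`, any downstream claim can be checked against the source by slicing.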

Structured IR

Schema-first intermediate representation.

All extracted evidence is mapped to a schema-first Intermediate Representation — a typed data model where every fact has a defined structure. This is the core of what makes Recora different from any chat-based AI tool.

Domain-specific schemas

Typed data models

Relationship mapping

Cross-document linking
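
A schema-first representation can be pictured as typed records that refuse malformed data at construction time. The Python sketch below shows one hypothetical schema, `PaymentObligation`, with a `SourceAnchor` pointing back to the source span. The field names and validation rules are assumptions for illustration, not Recora's actual schemas.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class SourceAnchor:
    document_id: str
    start: int
    end: int


@dataclass(frozen=True)
class PaymentObligation:
    payer: str
    amount: Decimal
    currency: str
    due_days: int
    anchor: SourceAnchor  # every fact keeps a pointer to its evidence

    def __post_init__(self) -> None:
        # Reject values that cannot describe a real obligation.
        if self.amount <= 0:
            raise ValueError("amount must be positive")
        if self.due_days < 0:
            raise ValueError("due_days cannot be negative")
        if self.anchor.end <= self.anchor.start:
            raise ValueError("anchor must cover a non-empty span")
```

Frozen dataclasses make each fact immutable once created, so later stages can read the IR but cannot silently alter it.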

LLM Transformation

LLM as engine — not oracle.

The LLM is applied only to the structured IR — not to raw text. It acts as a transformation engine that reshapes, summarizes, or analyzes structured data. It cannot hallucinate facts that don't exist in the IR.

IR-constrained generation

LLM-agnostic architecture

Hallucination prevention

Configurable reasoning depth
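
One way to picture IR-constrained generation is a wrapper that hands the model only IR facts and rejects any output statement that does not cite them. The Python sketch below is a simplified illustration: `transform_ir` is a hypothetical name, the model is any callable from prompt string to JSON string (which keeps the sketch LLM-agnostic), and the prompt wording is invented for this example.

```python
import json
from typing import Callable


def transform_ir(facts: dict[str, str], llm: Callable[[str], str]) -> list[dict]:
    """Ask the model to work from IR facts only, then verify its citations."""
    prompt = (
        "Using ONLY the facts below, return a JSON list of objects with "
        "keys 'text' and 'fact_ids'.\n" + json.dumps(facts, indent=2)
    )
    statements = json.loads(llm(prompt))
    for statement in statements:
        cited = statement.get("fact_ids", [])
        unknown = [fact_id for fact_id in cited if fact_id not in facts]
        # A statement with no citation, or citing an unknown fact, is rejected.
        if not cited or unknown:
            raise ValueError(f"statement not grounded in the IR: {statement!r}")
    return statements
```

This check confirms that every statement points at facts that exist in the IR. It does not by itself prove that the wording of a statement is faithful to those facts, so a fuller implementation would presumably add further validation.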

Artifact Generation

Outputs that can be trusted.

The final output is a structured artifact — a versioned, JSON-backed document with full citation chains linking every assertion back to its source. Artifacts are queryable, diffable, and integrable with your existing systems.

JSON-schema validated output

Full citation chains

Version control

API-ready artifact delivery
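
The "versioned, JSON-backed, diffable" properties can be sketched with the Python standard library. In this illustration, `build_artifact` and `diff_artifacts` are hypothetical helpers, and the artifact layout (a `version`, a list of `statements` with `citations`, and a content hash) is an assumed shape, not Recora's actual output format.

```python
import hashlib
import json


def build_artifact(statements: list[dict], version: int) -> dict:
    """Wrap cited statements in a versioned, hash-stamped JSON structure."""
    body = {"version": version, "statements": statements}
    # Canonical serialization: identical content always hashes identically.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {**body, "sha256": digest}


def diff_artifacts(old: dict, new: dict) -> dict:
    """Report which statement texts were added or removed between versions."""
    old_texts = {s["text"] for s in old["statements"]}
    new_texts = {s["text"] for s in new["statements"]}
    return {
        "added": sorted(new_texts - old_texts),
        "removed": sorted(old_texts - new_texts),
    }
```

Because the artifact is plain JSON, it can be stored, queried, and compared with ordinary tooling, and the hash gives a cheap way to detect whether two versions differ at all.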

Why the IR Layer Changes Everything

Most chat-based AI tools pass raw text directly to an LLM and ask it to reason. Recora never does this. The Intermediate Representation (IR) layer means the LLM only ever sees structured, validated data. It cannot invent facts, because the only facts available to it are those already extracted and verified from your source documents.

This is the architectural difference between probabilistic AI and deterministic reasoning infrastructure.

See It in Action →
