Embedding Pipeline

Visual walkthrough of the 13-stage oracle pipeline. Each uploaded document passes through every stage in order. Stages marked skippable can be toggled off per request.

Phase 1 — Extraction
Stage 1 — Extract (extract.py)

Converts binary or structured input (PDF, DOCX, HTML, TXT) into clean UTF-8 text. Uses format-specific parsers to pull raw content before any normalization.

Before
%PDF-1.7
3 0 obj << /Type /Page
/Contents 4 0 R >>
stream
BT /F1 12 Tf (Section 1.2 ...)Tj ET
endstream
After
Section 1.2 Access Control

Organizations shall implement access
control policies in accordance with
the risk assessment framework...
Phase 2 — Normalization
Stage 2 — Cleanup Extracted Markdown (cleanup_extracted_markdown.py)

Strips page numbers, copyright lines, dot-leader TOC entries, horizontal rules, and excess blank lines left behind by the extraction layer.

Before
Page 42

© 2024 Standards Corp.

Overview...............1
Scope..................3

---

## Overview
Organizations shall implement...
After
## Overview
Organizations shall implement...
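The filters this stage applies can be sketched with a handful of line-level regexes. The patterns below are illustrative only; the actual rules in cleanup_extracted_markdown.py are not shown in this document:

```python
import re

# Illustrative noise patterns; the real cleanup rules may differ.
_NOISE_PATTERNS = [
    re.compile(r"^Page \d+$"),        # bare page numbers
    re.compile(r"^©"),                # copyright lines
    re.compile(r"^.+\.{3,}\s*\d+$"),  # dot-leader TOC entries ("Scope......3")
    re.compile(r"^-{3,}$"),           # horizontal rules
]

def cleanup_extracted_markdown(text: str) -> str:
    """Drop noise lines, then collapse runs of blank lines."""
    kept = [
        line for line in text.splitlines()
        if not any(p.match(line.strip()) for p in _NOISE_PATTERNS)
    ]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
```

Run against the Before example above, this drops the page number, copyright line, TOC entries, and horizontal rule, leaving only the heading and body.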
Stage 3 — Clean Standards OCR (clean-standards-ocr.py) [skippable]

Repairs OCR artifacts: rejoins split words (cont rol → control), compacts spaced numbers (2 0 1 6 / 6 7 9 → 2016/679), and cleans up residual LaTeX fragments.

Before
The re qui re ments for in formation
se cu ri ty cont rol frameworks
ref er ence Reg u la tion 2 0 1 6 / 6 7 9
as am end ed by Di rec tive 2 0 2 2 / 2 5 5 5.
After
The requirements for information
security control frameworks
reference Regulation 2016/679
as amended by Directive 2022/2555.
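Two of these repairs are easy to sketch. The snippet below is a hypothetical approximation, not the actual clean-standards-ocr.py logic: digit compaction is a pair of regexes, and word rejoining assumes some vocabulary set is available to validate candidate merges:

```python
import re

def compact_spaced_numbers(text: str) -> str:
    """Collapse OCR-spaced digit runs: '2 0 1 6 / 6 7 9' -> '2016/679'."""
    text = re.sub(r"(?<=\d) (?=\d)", "", text)       # 2 0 1 6 -> 2016
    return re.sub(r"(?<=\d) ?/ ?(?=\d)", "/", text)  # 2016 / 679 -> 2016/679

def rejoin_split_words(text: str, vocabulary: set[str]) -> str:
    """Rejoin fragments when their concatenation is a known word."""
    out: list[str] = []
    for word in text.split(" "):
        if out and (out[-1] + word).lower() in vocabulary:
            out[-1] = out[-1] + word                 # 'cont' + 'rol' -> 'control'
        else:
            out.append(word)
    return " ".join(out)
```

Note the digit compaction is greedy: it would also merge legitimately separate numbers ("pages 3 4"), which is why this stage is skippable.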
Stage 4 — Cleanup Standards for Embedding (cleanup_standards_for_embedding.py)

Removes Table of Contents blocks, normative reference boilerplate, and other structural noise that would dilute embedding quality.

Before
## Table of Contents
Overview.......1
Scope..........3
Normative References.......5

## Overview
Organizations shall implement
access control policies...
After
## Overview
Organizations shall implement
access control policies...
Stage 5 — Structure Standards Markdown (structure_standards_markdown.py) [skippable]

Detects bare section markers (numbered clauses without heading syntax) and promotes them to ## headings. Uses heuristics for numbering patterns like 1.2, A.3.1, Annex B.

Before
1.2 Security Controls

Organizations shall implement
access control policies.

1.3 Risk Assessment

Risk assessments shall be
conducted annually.
After
## 1.2 Security Controls

Organizations shall implement
access control policies.

## 1.3 Risk Assessment

Risk assessments shall be
conducted annually.
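A sketch of the promotion heuristic, assuming patterns like those named above (1.2, A.3.1, Annex B). The real structure_standards_markdown.py heuristics are not shown here, and a naive pattern like this one is prone to false positives on body lines that merely begin with a number:

```python
import re

# Illustrative clause-marker patterns: "1.2", "A.3.1", "Annex B".
_SECTION_RE = re.compile(r"^(?:\d+(?:\.\d+)*|[A-Z](?:\.\d+)+|Annex [A-Z])\s+\S")

def promote_bare_sections(text: str) -> str:
    """Promote bare numbered clause lines to ## headings."""
    return "\n".join(
        "## " + line if _SECTION_RE.match(line) and not line.startswith("#") else line
        for line in text.splitlines()
    )
```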
Stage 6 — Demote Headings for Chunking (demote_headings_for_chunking.py) [skippable]

Shifts every heading down one # level so that ## becomes the reliable split boundary for the chunker. Prevents # title headings from creating oversized top-level chunks.

Before
# ISO 27001 Overview
## 1.2 Security Controls
### 1.2.1 Physical Access
After
## ISO 27001 Overview
### 1.2 Security Controls
#### 1.2.1 Physical Access
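The demotion itself is essentially a one-line transform (a sketch; demote_headings_for_chunking.py may also cap the maximum depth):

```python
import re

def demote_headings(markdown: str) -> str:
    """Shift every ATX heading down one level: '#' -> '##', '###' -> '####'."""
    return re.sub(r"(?m)^(#{1,5}) ", r"#\1 ", markdown)
```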
Stage 7 — Split Long Paragraphs (split_long_paragraphs_for_chunking.py) [skippable]

Breaks paragraphs exceeding chunk_max_words (default 450), splitting at the sentence boundary closest to chunk_target_words (default 220). Prevents any single paragraph from dominating a chunk's token budget.

Before
/* ~500 word paragraph */
Organizations shall implement access
control policies in accordance with
the risk assessment framework. The
framework shall define roles ...
... periodic reviews of all access
rights granted to personnel. [500w]
After
/* split near word 220 */
Organizations shall implement access
control policies in accordance with
the risk assessment framework. [~220w]

The framework shall define roles ...
... periodic reviews of all access
rights granted to personnel. [~280w]
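A simplified version of the splitter, assuming a naive regex sentence tokenizer. The real split_long_paragraphs_for_chunking.py may pick the boundary nearest the target rather than the first one past it, as this sketch does:

```python
import re

def split_long_paragraph(paragraph: str,
                         chunk_target_words: int = 220,
                         chunk_max_words: int = 450) -> list[str]:
    """Split an over-long paragraph at sentence boundaries near the target."""
    if len(paragraph.split()) <= chunk_max_words:
        return [paragraph]                     # under the cap: leave intact
    sentences = re.split(r"(?<=[.!?]) +", paragraph)
    pieces: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= chunk_target_words:        # first boundary past the target
            pieces.append(" ".join(current))
            current, count = [], 0
    if current:
        pieces.append(" ".join(current))
    return pieces
```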
Phase 3 — Metadata Enrichment
Stage 8 — Bootstrap Frontmatter (bootstrap-local-oracle-frontmatter.py)

Injects a YAML frontmatter block at the top of the document with canonical oracle metadata: oracle_id, title, tier, frameworks, and source provenance fields. Existing frontmatter is merged, not overwritten.

Before
## ISO 27001 Overview

Organizations shall implement
access control policies...
After
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
frameworks:
  - iso_27001
  - iso_27002
source_format: pdf
---
## ISO 27001 Overview

Organizations shall implement
access control policies...
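The merge-not-overwrite behavior can be sketched as follows. This is hypothetical: the real bootstrap-local-oracle-frontmatter.py presumably uses a YAML parser, whereas this approximation only handles simple scalar keys:

```python
def bootstrap_frontmatter(body: str, metadata: dict) -> str:
    """Prepend a YAML frontmatter block; existing frontmatter keys win
    over the injected defaults (merged, not overwritten)."""
    existing: dict = {}
    if body.startswith("---\n"):
        block, _, body = body[4:].partition("\n---\n")
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep and value.strip() and not line.startswith(" "):
                existing[key.strip()] = value.strip()
    merged = {**metadata, **existing}          # existing values take precedence
    lines = ["---"]
    for key, value in merged.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"  - {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n" + body
```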
Stage 9 — Infer S.I.R.E. (infer-sire-from-keywords.py)

Keyword-frequency extraction (NLP) that classifies the document's content into four routing buckets: Subject (what it is about), Included (up to 24 terms it covers), Relevant (up to 12 related terms), and Excluded (up to 8 terms it explicitly does not address). The buckets are used downstream by the RAG retriever for precision routing.

Before
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
frameworks:
  - iso_27001
---
## ISO 27001 Overview
...
After
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
frameworks:
  - iso_27001
sire_subject: "information security"
sire_included:
  - access control
  - risk assessment
  - asset management
  # ... up to 24 terms
sire_relevant:
  - business continuity
  - incident response
  # ... up to 12 terms
sire_excluded:
  - financial auditing
  - marketing analytics
  # ... up to 8 terms
---
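The bucket contents come from keyword-frequency extraction. A toy version of the term-ranking step is shown below; the actual classifier in infer-sire-from-keywords.py, including how Relevant and Excluded terms are chosen, is not shown in this document:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much larger.
_STOPWORDS = {"the", "and", "shall", "of", "in", "to", "with", "for", "a", "be"}

def top_terms(text: str, n: int) -> list[str]:
    """Rank bigrams of non-stopword tokens by frequency."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in _STOPWORDS]
    bigrams = Counter(" ".join(pair) for pair in zip(words, words[1:]))
    return [term for term, _ in bigrams.most_common(n)]
```

Under this sketch, sire_included would be something like top_terms(body, 24); how the remaining buckets are derived is an open question here.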
Stage 10 — Semantic Description (pipeline.py, inline)

Auto-generates a dense prose semantic_description from the oracle's title, frameworks, tier, and S.I.R.E. fields. This paragraph acts as a routing signal for the RAG retriever — it's embedded alongside the chunks so the retriever can match intent, not just keywords.

Before
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
sire_subject: "information security"
# no semantic_description
---
After
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
sire_subject: "information security"
semantic_description: >-
  ISO 27001 Overview is a tier_1 oracle
  covering information security within
  the iso_27001, iso_27002 frameworks.
  Primary subjects include access control,
  risk assessment, and asset management.
---
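Generation is essentially string templating over the frontmatter. A sketch matching the shape of the example above (field names are taken from this page; the exact wording produced by pipeline.py may differ):

```python
def build_semantic_description(meta: dict) -> str:
    """Render the routing paragraph from title, tier, frameworks,
    and S.I.R.E. fields."""
    frameworks = ", ".join(meta["frameworks"])
    subjects = ", ".join(meta["sire_included"][:3])
    return (
        f"{meta['title']} is a {meta['tier']} oracle covering "
        f"{meta['sire_subject']} within the {frameworks} frameworks. "
        f"Primary subjects include {subjects}."
    )
```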
Stage 11 — Ensure Chunk Heading (pipeline.py, inline)

Safety net for documents that contain no ## headings after normalization. Inserts ## {title} at the top of the body so the chunker always has at least one split boundary. Without this, the entire document would become a single oversized chunk.

Before
---
title: "ISO 27001 Overview"
---
Organizations shall implement
access control policies in
accordance with the risk
assessment framework...
After
---
title: "ISO 27001 Overview"
---
## ISO 27001 Overview

Organizations shall implement
access control policies in
accordance with the risk
assessment framework...
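A sketch of the safety net, assuming frontmatter is delimited by --- lines as in the examples above:

```python
import re

def ensure_chunk_heading(markdown: str, title: str) -> str:
    """Insert '## {title}' after the frontmatter if the body
    contains no ## heading."""
    head, body = "", markdown
    if markdown.startswith("---\n"):
        front, sep, body = markdown[4:].partition("\n---\n")
        head = "---\n" + front + sep
    if not re.search(r"(?m)^## ", body):
        body = f"## {title}\n\n" + body
    return head + body
```

The check is idempotent: running it on a document that already has a ## heading returns the input unchanged.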
Phase 4 — Provenance & Export
Stage 12 — Inject Chunk Watermarks (inject-chunk-watermarks.py)

Appends a deterministic HTML comment to each ## section. The watermark encodes provenance: oracle_id, chunk_id, section title, authority tier, and S.I.R.E. subject. Survives round-trips through markdown renderers and enables downstream traceability.

Before
## 1.2 Security Controls

Organizations shall implement
access control policies in
accordance with the risk
assessment framework.
After
## 1.2 Security Controls

Organizations shall implement
access control policies in
accordance with the risk
assessment framework.

<!-- watermark:
  oracle_id: iso-27001-overview
  chunk_id: iso-27001-overview:0
  section: 1.2 Security Controls
  authority: tier_1
  sire: information security
-->
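The watermark format above can be reproduced with a small helper. This is illustrative; the deterministic chunk_id scheme shown ({oracle_id}:{index}) is inferred from the examples on this page:

```python
import re

def inject_chunk_watermarks(markdown: str, meta: dict) -> str:
    """Append a provenance comment to every ## section."""
    parts = re.split(r"(?m)^(?=## )", markdown)  # split before each ## heading
    out, index = [], 0
    for part in parts:
        if part.startswith("## "):
            title = part.splitlines()[0][3:]
            watermark = (
                "\n\n<!-- watermark:\n"
                f"  oracle_id: {meta['oracle_id']}\n"
                f"  chunk_id: {meta['oracle_id']}:{index}\n"
                f"  section: {title}\n"
                f"  authority: {meta['tier']}\n"
                f"  sire: {meta['sire_subject']}\n"
                "-->\n"
            )
            out.append(part.rstrip("\n") + watermark)
            index += 1
        else:
            out.append(part)   # preamble (e.g. frontmatter) passes through
    return "".join(out)
```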
Stage 13 — Export Chunks JSONL (export-local-chunks-jsonl.py)

Splits the watermarked markdown on ## boundaries into individual chunks. Each chunk gets a deterministic chunk_id, SHA-256 text_hash, estimated token_count, and full metadata envelope. Output format is newline-delimited JSON (JSONL).

Before
/* full watermarked markdown */
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
...
---
## 1.2 Security Controls
Organizations shall ...
<!-- watermark: ... -->

## 1.3 Risk Assessment
Risk assessments shall ...
<!-- watermark: ... -->
After
/* chunks.jsonl */
{
  "chunk_id": "iso-27001-overview:0",
  "text": "## 1.2 Security Controls\n...",
  "text_hash": "a1b2c3...",
  "token_count": 187,
  "metadata": {
    "oracle_id": "iso-27001-overview",
    "tier": "tier_1",
    "section": "1.2 Security Controls"
  }
}
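Putting the split, hashing, and metadata envelope together (a sketch; the token_count here uses a rough 4-tokens-per-3-words heuristic, which is an assumption, not the estimator export-local-chunks-jsonl.py actually uses):

```python
import hashlib
import json
import re

def export_chunks_jsonl(markdown: str, meta: dict) -> str:
    """Split watermarked markdown on ## boundaries and emit one JSON
    object per chunk, newline-delimited."""
    sections = [
        part.strip()
        for part in re.split(r"(?m)^(?=## )", markdown)
        if part.startswith("## ")          # frontmatter/preamble is excluded
    ]
    records = []
    for index, text in enumerate(sections):
        records.append(json.dumps({
            "chunk_id": f"{meta['oracle_id']}:{index}",
            "text": text,
            "text_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            # Assumed heuristic: roughly 4 tokens per 3 words.
            "token_count": max(1, round(len(text.split()) * 4 / 3)),
            "metadata": {
                "oracle_id": meta["oracle_id"],
                "tier": meta["tier"],
                "section": text.splitlines()[0][3:],
            },
        }))
    return "\n".join(records)
```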
Request Configuration

Skip Flags

Toggle individual normalization stages per request.

| Field | Default | Skips Stage |
| --- | --- | --- |
| skip_ocr_cleanup | false | Stage 3 — Clean Standards OCR |
| skip_structure_headings | false | Stage 5 — Structure Standards Markdown |
| skip_demote_headings | false | Stage 6 — Demote Headings |
| skip_split_paragraphs | false | Stage 7 — Split Long Paragraphs |

Tuning Parameters

Control chunking and heading behavior.

| Field | Default | Effect |
| --- | --- | --- |
| chunk_target_words | 220 | Target word count for paragraph splitting (Stage 7). |
| chunk_max_words | 450 | Maximum words before a paragraph is force-split (Stage 7). |
| max_heading_depth | 5 | Deepest heading level used in structure promotion (Stage 5). |

Output Contract

Job envelope returned by POST /api/ingest.

{
  "envelope_id": "c4fe90d3117f4a2b",
  "request_id": "c4fe90d3117f4a2b",
  "status": "completed",
  "input_filenames": ["iso-27001.pdf"],
  "input_count": 1,
  "input_bytes": 284610,
  "pipeline_options": { /* frozen config snapshot */ },
  "pipeline_profile": "aws-canonical-v1",
  "build_sha": "abc1234",
  "triggered_by": "ui",
  "results": [
    {
      "job_id": "a1b2c3d4e5f67890",
      "normalized_markdown": "---\noracle_id: ...\n---\n## ...",
      "chunks": [ /* JSONL chunk objects */ ],
      "steps": [
        { "name": "extract", "duration_ms": 142 },
        { "name": "cleanup_extracted_markdown", "duration_ms": 18 },
        /* ... all 13 stages ... */
      ],
      "normalization_degraded": false
    }
  ],
  "total_files": 1,
  "total_chunks": 42,
  "total_tokens": 18500,
  "duration_ms": 3200,
  "normalization_degraded": false,
  "stage_timing": { "extract": 142, "cleanup_extracted_markdown": 18, /* ... */ },
  "s3_envelope_key": "envelopes/c4fe90d3117f4a2b.json",
  "created_at": "2026-03-04T15:22:31.000Z",
  "completed_at": "2026-03-04T15:22:34.200Z",
  "errors": [],
  "warnings": []
}
Job Envelope

What Is the Job Envelope?

Every POST /api/ingest call returns a job envelope — a single JSON document that records everything about the request for auditability and downstream traceability.

The envelope follows the same provenance pattern used by the Goober chat system (goober_chat_envelopes) and the oracle ingestion pipeline (oracle_pipeline_envelopes). It captures:

  • Identity — a unique envelope_id and request_id tied to the X-Request-ID header.
  • Request context — the original filenames, file count, and a frozen snapshot of all pipeline options exactly as they were at invocation time.
  • Per-file results — each file gets its own PipelineResult with job ID, chunks, stage timings, normalized markdown, and error state.
  • Aggregated telemetry — total chunks, total tokens, and wall-clock duration summed across all files in the request.
  • S3 provenance — the S3 bucket, the envelope object key, and per-file chunk JSONL keys so you can trace every artifact back to its source request.
  • Timing — ISO 8601 timestamps for when the request was received (created_at) and when it finished (completed_at).
  • Error and warning aggregation — errors from all per-file results are surfaced at the envelope level; non-fatal issues (e.g., S3 upload retries) appear in warnings.

How It Works

Lifecycle of a job envelope through a single ingest request.

1. Client sends POST /api/ingest with files and/or text. The request-ID middleware assigns a deterministic X-Request-ID header (or honors the one the client sent).
2. A JobEnvelope is created with status running, the frozen pipeline_options snapshot, the build SHA, and the pipeline profile. No results yet.
3. Each file (or pasted text) is processed through the 13-stage pipeline. As each file completes, its PipelineResult is appended to the envelope and its JSONL chunk file is uploaded to S3.
4. After all files finish, the envelope aggregates totals (chunks, tokens, duration), collects errors, and sets status to completed or failed.
5. The finalized envelope JSON is uploaded to S3 at envelopes/{envelope_id}.json for long-term audit storage.
6. The envelope is returned as the HTTP response body. The results array is at the same path as before, so existing consumers continue to work.
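Consumers can sanity-check the aggregation invariants described above. A hypothetical validator over the returned envelope dict (check_envelope is not part of the API; field names follow this document):

```python
def check_envelope(envelope: dict) -> list[str]:
    """Return a list of invariant violations (empty list = consistent)."""
    problems = []
    if envelope["status"] == "completed" and envelope.get("completed_at") is None:
        problems.append("completed envelope missing completed_at")
    chunk_sum = sum(len(result["chunks"]) for result in envelope["results"])
    if envelope["total_chunks"] != chunk_sum:
        problems.append("total_chunks does not match per-file results")
    if envelope["errors"] and envelope["status"] == "completed":
        problems.append("errors present on a completed envelope")
    return problems
```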

Envelope Field Reference

Every field in the top-level job envelope.

| Field | Type | Description |
| --- | --- | --- |
| envelope_id | string | Unique envelope identifier (matches request_id). |
| request_id | string | The X-Request-ID for this ingest call. |
| status | string | One of pending, running, completed, or failed. |
| input_filenames | string[] | Original filenames as uploaded (includes pasted.md for text input). |
| input_count | int | Number of input items processed. |
| input_bytes | int | Total size of all uploaded files and pasted text, in bytes. |
| pipeline_options | object | Frozen snapshot of all request parameters: oracle_id, title, skip flags, chunk targets, heading depth. |
| pipeline_profile | string | Server-side pipeline profile (e.g. aws-canonical-v1). |
| build_sha | string | Git SHA of the deployed harness build. |
| triggered_by | string | Request origin: ui, api, or a custom value from the X-Triggered-By header. |
| results | PipelineResult[] | Per-file results array; each entry contains job_id, chunks, steps, normalized markdown, and errors. |
| total_files | int | Count of files successfully processed. |
| total_chunks | int | Sum of all chunks across all files. |
| total_tokens | int | Sum of estimated tokens across all chunks. |
| duration_ms | int | Total wall-clock time for the entire ingest request, in milliseconds. |
| normalization_degraded | bool | true if any file's normalization fell back to a degraded path (e.g. OCR heuristics failed). |
| stage_timing | object | Aggregated per-stage durations in ms across all files, e.g. {"extract": 284, "chunk": 52}. |
| s3_bucket | string? | S3 bucket used for artifact storage (null if S3 is not configured). |
| s3_envelope_key | string? | S3 object key where this envelope was persisted. |
| s3_chunk_keys | string[] | S3 object keys for each per-file chunks.jsonl upload. |
| created_at | string | ISO 8601 timestamp when the envelope was created (request received). |
| completed_at | string? | ISO 8601 timestamp when processing finished (null while running). |
| errors | string[] | Aggregated errors from all per-file results. |
| warnings | string[] | Non-fatal issues (e.g. S3 envelope upload failure). |