Embedding Pipeline

Visual walkthrough of the 13-stage oracle pipeline. Each uploaded document passes through every stage in order. Stages marked skippable can be toggled off per request.

Phase 1 — Extraction
Stage 1 — Extract (extract.py)

Converts binary or structured input (PDF, DOCX, HTML, TXT) into clean UTF-8 text. Uses format-specific parsers to pull raw content before any normalization.

Before
%PDF-1.7
3 0 obj << /Type /Page
/Contents 4 0 R >>
stream
BT /F1 12 Tf (Section 1.2 ...)Tj ET
endstream
After
Section 1.2 Access Control

Organizations shall implement access
control policies in accordance with
the risk assessment framework...
Phase 2 — Normalization
Stage 2 — Cleanup Extracted Markdown (cleanup_extracted_markdown.py)

Strips page numbers, copyright lines, dot-leader TOC entries, horizontal rules, and excess blank lines left behind by the extraction layer.

Before
Page 42

© 2024 Standards Corp.

Overview...............1
Scope..................3

---

## Overview
Organizations shall implement...
After
## Overview
Organizations shall implement...
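The filters this stage applies can be sketched with a handful of line-level regexes. The patterns below are illustrative only; the actual rules in cleanup_extracted_markdown.py are not shown in this document:

```python
import re

# Illustrative noise patterns; the real cleanup rules may differ.
_NOISE_PATTERNS = [
    re.compile(r"^Page \d+$"),        # bare page numbers
    re.compile(r"^©"),                # copyright lines
    re.compile(r"^.+\.{3,}\s*\d+$"),  # dot-leader TOC entries ("Scope......3")
    re.compile(r"^-{3,}$"),           # horizontal rules
]

def cleanup_extracted_markdown(text: str) -> str:
    """Drop noise lines, then collapse runs of blank lines."""
    kept = [
        line for line in text.splitlines()
        if not any(p.match(line.strip()) for p in _NOISE_PATTERNS)
    ]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
```

Run against the Before example above, this drops the page number, copyright line, TOC entries, and horizontal rule, leaving only the heading and body.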
Stage 3 — Clean Standards OCR (clean-standards-ocr.py) [skippable]

Repairs OCR artifacts: rejoins split words (cont rol → control), compacts spaced numbers (2 0 1 6 / 6 7 9 → 2016/679), and cleans up residual LaTeX fragments.

Before
The re qui re ments for in formation
se cu ri ty cont rol frameworks
ref er ence Reg u la tion 2 0 1 6 / 6 7 9
as am end ed by Di rec tive 2 0 2 2 / 2 5 5 5.
After
The requirements for information
security control frameworks
reference Regulation 2016/679
as amended by Directive 2022/2555.
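Two of these repairs are easy to sketch. The snippet below is a hypothetical approximation, not the actual clean-standards-ocr.py logic: digit compaction is a pair of regexes, and word rejoining assumes some vocabulary set is available to validate candidate merges:

```python
import re

def compact_spaced_numbers(text: str) -> str:
    """Collapse OCR-spaced digit runs: '2 0 1 6 / 6 7 9' -> '2016/679'."""
    text = re.sub(r"(?<=\d) (?=\d)", "", text)       # 2 0 1 6 -> 2016
    return re.sub(r"(?<=\d) ?/ ?(?=\d)", "/", text)  # 2016 / 679 -> 2016/679

def rejoin_split_words(text: str, vocabulary: set[str]) -> str:
    """Rejoin fragments when their concatenation is a known word."""
    out: list[str] = []
    for word in text.split(" "):
        if out and (out[-1] + word).lower() in vocabulary:
            out[-1] = out[-1] + word                 # 'cont' + 'rol' -> 'control'
        else:
            out.append(word)
    return " ".join(out)
```

Note the digit compaction is greedy: it would also merge legitimately separate numbers ("pages 3 4"), which is why this stage is skippable.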
Stage 4 — Cleanup Standards for Embedding (cleanup_standards_for_embedding.py)

Removes Table of Contents blocks, normative reference boilerplate, and other structural noise that would dilute embedding quality.

Before
## Table of Contents
Overview.......1
Scope..........3
Normative References.......5

## Overview
Organizations shall implement
access control policies...
After
## Overview
Organizations shall implement
access control policies...
Stage 5 — Structure Standards Markdown (structure_standards_markdown.py) [skippable]

Detects bare section markers (numbered clauses without heading syntax) and promotes them to ## headings. Uses heuristics for numbering patterns like 1.2, A.3.1, Annex B.

Before
1.2 Security Controls

Organizations shall implement
access control policies.

1.3 Risk Assessment

Risk assessments shall be
conducted annually.
After
## 1.2 Security Controls

Organizations shall implement
access control policies.

## 1.3 Risk Assessment

Risk assessments shall be
conducted annually.
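A sketch of the promotion heuristic, assuming patterns like those named above (1.2, A.3.1, Annex B). The real structure_standards_markdown.py heuristics are not shown here, and a naive pattern like this one is prone to false positives on body lines that merely begin with a number:

```python
import re

# Illustrative clause-marker patterns: "1.2", "A.3.1", "Annex B".
_SECTION_RE = re.compile(r"^(?:\d+(?:\.\d+)*|[A-Z](?:\.\d+)+|Annex [A-Z])\s+\S")

def promote_bare_sections(text: str) -> str:
    """Promote bare numbered clause lines to ## headings."""
    return "\n".join(
        "## " + line if _SECTION_RE.match(line) and not line.startswith("#") else line
        for line in text.splitlines()
    )
```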
Stage 6 — Demote Headings for Chunking (demote_headings_for_chunking.py) [skippable]

Shifts every heading down one # level so that ## becomes the reliable split boundary for the chunker. Prevents # title headings from creating oversized top-level chunks.

Before
# ISO 27001 Overview
## 1.2 Security Controls
### 1.2.1 Physical Access
After
## ISO 27001 Overview
### 1.2 Security Controls
#### 1.2.1 Physical Access
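The demotion itself is essentially a one-line transform (a sketch; demote_headings_for_chunking.py may also cap the maximum depth):

```python
import re

def demote_headings(markdown: str) -> str:
    """Shift every ATX heading down one level: '#' -> '##', '###' -> '####'."""
    return re.sub(r"(?m)^(#{1,5}) ", r"#\1 ", markdown)
```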
Stage 7 — Split Long Paragraphs (split_long_paragraphs_for_chunking.py) [skippable]

Breaks paragraphs exceeding chunk_max_words (default 450), splitting at the sentence boundary closest to chunk_target_words (default 220). Prevents any single paragraph from dominating a chunk's token budget.

Before
/* ~500 word paragraph */
Organizations shall implement access
control policies in accordance with
the risk assessment framework. The
framework shall define roles ...
... periodic reviews of all access
rights granted to personnel. [500w]
After
/* split near word 220 */
Organizations shall implement access
control policies in accordance with
the risk assessment framework. [~220w]

The framework shall define roles ...
... periodic reviews of all access
rights granted to personnel. [~280w]
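A simplified version of the splitter, assuming a naive regex sentence tokenizer. The real split_long_paragraphs_for_chunking.py may pick the boundary nearest the target rather than the first one past it, as this sketch does:

```python
import re

def split_long_paragraph(paragraph: str,
                         chunk_target_words: int = 220,
                         chunk_max_words: int = 450) -> list[str]:
    """Split an over-long paragraph at sentence boundaries near the target."""
    if len(paragraph.split()) <= chunk_max_words:
        return [paragraph]                     # under the cap: leave intact
    sentences = re.split(r"(?<=[.!?]) +", paragraph)
    pieces: list[str] = []
    current: list[str] = []
    count = 0
    for sentence in sentences:
        current.append(sentence)
        count += len(sentence.split())
        if count >= chunk_target_words:        # first boundary past the target
            pieces.append(" ".join(current))
            current, count = [], 0
    if current:
        pieces.append(" ".join(current))
    return pieces
```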
Phase 3 — Metadata Enrichment
Stage 8 — Bootstrap Frontmatter (bootstrap-local-oracle-frontmatter.py)

Injects a YAML frontmatter block at the top of the document with canonical oracle metadata: oracle_id, title, tier, frameworks, and source provenance fields. Existing frontmatter is merged, not overwritten.

Before
## ISO 27001 Overview

Organizations shall implement
access control policies...
After
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
frameworks:
  - iso_27001
  - iso_27002
source_format: pdf
---
## ISO 27001 Overview

Organizations shall implement
access control policies...
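The merge-not-overwrite behavior can be sketched as follows. This is hypothetical: the real bootstrap-local-oracle-frontmatter.py presumably uses a YAML parser, whereas this approximation only handles simple scalar keys:

```python
def bootstrap_frontmatter(body: str, metadata: dict) -> str:
    """Prepend a YAML frontmatter block; existing frontmatter keys win
    over the injected defaults (merged, not overwritten)."""
    existing: dict = {}
    if body.startswith("---\n"):
        block, _, body = body[4:].partition("\n---\n")
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep and value.strip() and not line.startswith(" "):
                existing[key.strip()] = value.strip()
    merged = {**metadata, **existing}          # existing values take precedence
    lines = ["---"]
    for key, value in merged.items():
        if isinstance(value, list):
            lines.append(f"{key}:")
            lines.extend(f"  - {item}" for item in value)
        else:
            lines.append(f"{key}: {value}")
    lines.append("---")
    return "\n".join(lines) + "\n" + body
```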
Stage 9 — Infer S.I.R.E. (infer-sire-from-keywords.py)

Keyword-frequency extraction (NLP) that classifies the document's content into four routing buckets: Subject (what it is about), Included (up to 24 terms it covers), Relevant (up to 12 related terms), and Excluded (up to 8 terms it explicitly does not address). The buckets are used downstream by the RAG retriever for precision routing.

Before
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
frameworks:
  - iso_27001
---
## ISO 27001 Overview
...
After
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
frameworks:
  - iso_27001
sire_subject: "information security"
sire_included:
  - access control
  - risk assessment
  - asset management
  # ... up to 24 terms
sire_relevant:
  - business continuity
  - incident response
  # ... up to 12 terms
sire_excluded:
  - financial auditing
  - marketing analytics
  # ... up to 8 terms
---
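The bucket contents come from keyword-frequency extraction. A toy version of the term-ranking step is shown below; the actual classifier in infer-sire-from-keywords.py, including how Relevant and Excluded terms are chosen, is not shown in this document:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much larger.
_STOPWORDS = {"the", "and", "shall", "of", "in", "to", "with", "for", "a", "be"}

def top_terms(text: str, n: int) -> list[str]:
    """Rank bigrams of non-stopword tokens by frequency."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in _STOPWORDS]
    bigrams = Counter(" ".join(pair) for pair in zip(words, words[1:]))
    return [term for term, _ in bigrams.most_common(n)]
```

Under this sketch, sire_included would be something like top_terms(body, 24); how the remaining buckets are derived is an open question here.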
Stage 10 — Semantic Description (pipeline.py, inline)

Auto-generates a dense prose semantic_description from the oracle's title, frameworks, tier, and S.I.R.E. fields. This paragraph acts as a routing signal for the RAG retriever — it's embedded alongside the chunks so the retriever can match intent, not just keywords.

Before
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
sire_subject: "information security"
# no semantic_description
---
After
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
sire_subject: "information security"
semantic_description: >-
  ISO 27001 Overview is a tier_1 oracle
  covering information security within
  the iso_27001, iso_27002 frameworks.
  Primary subjects include access control,
  risk assessment, and asset management.
---
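Generation is essentially string templating over the frontmatter. A sketch matching the shape of the example above (field names are taken from this page; the exact wording produced by pipeline.py may differ):

```python
def build_semantic_description(meta: dict) -> str:
    """Render the routing paragraph from title, tier, frameworks,
    and S.I.R.E. fields."""
    frameworks = ", ".join(meta["frameworks"])
    subjects = ", ".join(meta["sire_included"][:3])
    return (
        f"{meta['title']} is a {meta['tier']} oracle covering "
        f"{meta['sire_subject']} within the {frameworks} frameworks. "
        f"Primary subjects include {subjects}."
    )
```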
Stage 11 — Ensure Chunk Heading (pipeline.py, inline)

Safety net for documents that contain no ## headings after normalization. Inserts ## {title} at the top of the body so the chunker always has at least one split boundary. Without this, the entire document would become a single oversized chunk.

Before
---
title: "ISO 27001 Overview"
---
Organizations shall implement
access control policies in
accordance with the risk
assessment framework...
After
---
title: "ISO 27001 Overview"
---
## ISO 27001 Overview

Organizations shall implement
access control policies in
accordance with the risk
assessment framework...
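A sketch of the safety net, assuming frontmatter is delimited by --- lines as in the examples above:

```python
import re

def ensure_chunk_heading(markdown: str, title: str) -> str:
    """Insert '## {title}' after the frontmatter if the body
    contains no ## heading."""
    head, body = "", markdown
    if markdown.startswith("---\n"):
        front, sep, body = markdown[4:].partition("\n---\n")
        head = "---\n" + front + sep
    if not re.search(r"(?m)^## ", body):
        body = f"## {title}\n\n" + body
    return head + body
```

The check is idempotent: running it on a document that already has a ## heading returns the input unchanged.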
Phase 4 — Provenance & Export
Stage 12 — Inject Chunk Watermarks (inject-chunk-watermarks.py)

Appends a deterministic HTML comment to each ## section. The watermark encodes provenance: oracle_id, chunk_id, section title, authority tier, and S.I.R.E. subject. Survives round-trips through markdown renderers and enables downstream traceability.

Before
## 1.2 Security Controls

Organizations shall implement
access control policies in
accordance with the risk
assessment framework.
After
## 1.2 Security Controls

Organizations shall implement
access control policies in
accordance with the risk
assessment framework.

<!-- watermark:
  oracle_id: iso-27001-overview
  chunk_id: iso-27001-overview:0
  section: 1.2 Security Controls
  authority: tier_1
  sire: information security
-->
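The watermark format above can be reproduced with a small helper. This is illustrative; the deterministic chunk_id scheme shown ({oracle_id}:{index}) is inferred from the examples on this page:

```python
import re

def inject_chunk_watermarks(markdown: str, meta: dict) -> str:
    """Append a provenance comment to every ## section."""
    parts = re.split(r"(?m)^(?=## )", markdown)  # split before each ## heading
    out, index = [], 0
    for part in parts:
        if part.startswith("## "):
            title = part.splitlines()[0][3:]
            watermark = (
                "\n\n<!-- watermark:\n"
                f"  oracle_id: {meta['oracle_id']}\n"
                f"  chunk_id: {meta['oracle_id']}:{index}\n"
                f"  section: {title}\n"
                f"  authority: {meta['tier']}\n"
                f"  sire: {meta['sire_subject']}\n"
                "-->\n"
            )
            out.append(part.rstrip("\n") + watermark)
            index += 1
        else:
            out.append(part)   # preamble (e.g. frontmatter) passes through
    return "".join(out)
```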
Stage 13 — Export Chunks JSONL (export-local-chunks-jsonl.py)

Splits the watermarked markdown on ## boundaries into individual chunks. Each chunk gets a deterministic chunk_id, SHA-256 text_hash, estimated token_count, and full metadata envelope. Output format is newline-delimited JSON (JSONL).

Before
/* full watermarked markdown */
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
...
---
## 1.2 Security Controls
Organizations shall ...
<!-- watermark: ... -->

## 1.3 Risk Assessment
Risk assessments shall ...
<!-- watermark: ... -->
After
/* chunks.jsonl */
{
  "chunk_id": "iso-27001-overview:0",
  "text": "## 1.2 Security Controls\n...",
  "text_hash": "a1b2c3...",
  "token_count": 187,
  "metadata": {
    "oracle_id": "iso-27001-overview",
    "tier": "tier_1",
    "section": "1.2 Security Controls"
  }
}
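Putting the split, hashing, and metadata envelope together (a sketch; the token_count here uses a rough 4-tokens-per-3-words heuristic, which is an assumption, not the estimator export-local-chunks-jsonl.py actually uses):

```python
import hashlib
import json
import re

def export_chunks_jsonl(markdown: str, meta: dict) -> str:
    """Split watermarked markdown on ## boundaries and emit one JSON
    object per chunk, newline-delimited."""
    sections = [
        part.strip()
        for part in re.split(r"(?m)^(?=## )", markdown)
        if part.startswith("## ")          # frontmatter/preamble is excluded
    ]
    records = []
    for index, text in enumerate(sections):
        records.append(json.dumps({
            "chunk_id": f"{meta['oracle_id']}:{index}",
            "text": text,
            "text_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            # Assumed heuristic: roughly 4 tokens per 3 words.
            "token_count": max(1, round(len(text.split()) * 4 / 3)),
            "metadata": {
                "oracle_id": meta["oracle_id"],
                "tier": meta["tier"],
                "section": text.splitlines()[0][3:],
            },
        }))
    return "\n".join(records)
```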
Request Configuration

Skip Flags

Toggle individual normalization stages per request.

| Field | Default | Skips Stage |
| --- | --- | --- |
| skip_ocr_cleanup | false | Stage 3 — Clean Standards OCR |
| skip_structure_headings | false | Stage 5 — Structure Standards Markdown |
| skip_demote_headings | false | Stage 6 — Demote Headings |
| skip_split_paragraphs | false | Stage 7 — Split Long Paragraphs |

Tuning Parameters

Control chunking and heading behavior.

| Field | Default | Effect |
| --- | --- | --- |
| chunk_target_words | 220 | Target word count for paragraph splitting (Stage 7). |
| chunk_max_words | 450 | Maximum words before a paragraph is force-split (Stage 7). |
| max_heading_depth | 5 | Deepest heading level used in structure promotion (Stage 5). |

Output Contract

Job envelope returned by POST /api/ingest.

{
  "envelope_id": "c4fe90d3117f4a2b",
  "request_id": "c4fe90d3117f4a2b",
  "status": "completed",
  "input_filenames": ["iso-27001.pdf"],
  "input_count": 1,
  "input_bytes": 284610,
  "pipeline_options": { /* frozen config snapshot */ },
  "pipeline_profile": "aws-canonical-v1",
  "build_sha": "abc1234",
  "triggered_by": "ui",
  "results": [
    {
      "job_id": "a1b2c3d4e5f67890",
      "normalized_markdown": "---\noracle_id: ...\n---\n## ...",
      "chunks": [ /* JSONL chunk objects */ ],
      "steps": [
        { "name": "extract", "duration_ms": 142 },
        { "name": "cleanup_extracted_markdown", "duration_ms": 18 },
        /* ... all 13 stages ... */
      ],
      "normalization_degraded": false
    }
  ],
  "total_files": 1,
  "total_chunks": 42,
  "total_tokens": 18500,
  "duration_ms": 3200,
  "normalization_degraded": false,
  "stage_timing": { "extract": 142, "cleanup_extracted_markdown": 18, /* ... */ },
  "s3_envelope_key": "envelopes/c4fe90d3117f4a2b.json",
  "created_at": "2026-03-04T15:22:31.000Z",
  "completed_at": "2026-03-04T15:22:34.200Z",
  "errors": [],
  "warnings": []
}
Job Envelope

What Is the Job Envelope?

Every POST /api/ingest call returns a job envelope — a single JSON document that records everything about the request for auditability and downstream traceability.

The envelope follows the same provenance pattern used by the Goober chat system (goober_chat_envelopes) and the oracle ingestion pipeline (oracle_pipeline_envelopes). It captures:

  • Identity — a unique envelope_id and request_id tied to the X-Request-ID header.
  • Request context — the original filenames, file count, and a frozen snapshot of all pipeline options exactly as they were at invocation time.
  • Per-file results — each file gets its own PipelineResult with job ID, chunks, stage timings, normalized markdown, and error state.
  • Aggregated telemetry — total chunks, total tokens, and wall-clock duration summed across all files in the request.
  • S3 provenance — the S3 bucket, the envelope object key, and per-file chunk JSONL keys so you can trace every artifact back to its source request.
  • Timing — ISO 8601 timestamps for when the request was received (created_at) and when it finished (completed_at).
  • Error and warning aggregation — errors from all per-file results are surfaced at the envelope level; non-fatal issues (e.g., S3 upload retries) appear in warnings.

How It Works

Lifecycle of a job envelope through a single ingest request.

1. Client sends POST /api/ingest with files and/or text. The request-ID middleware assigns a deterministic X-Request-ID header (or honors the one the client sent).
2. A JobEnvelope is created with status running, the frozen pipeline_options snapshot, the build SHA, and the pipeline profile. No results yet.
3. Each file (or pasted text) is processed through the 13-stage pipeline. As each file completes, its PipelineResult is appended to the envelope and its JSONL chunk file is uploaded to S3.
4. After all files finish, the envelope aggregates totals (chunks, tokens, duration), collects errors, and sets status to completed or failed.
5. The finalized envelope JSON is uploaded to S3 at envelopes/{envelope_id}.json for long-term audit storage.
6. The envelope is returned as the HTTP response body. The results array is at the same path as before, so existing consumers continue to work.
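Consumers can sanity-check the aggregation invariants described above. A hypothetical validator over the returned envelope dict (check_envelope is not part of the API; field names follow this document):

```python
def check_envelope(envelope: dict) -> list[str]:
    """Return a list of invariant violations (empty list = consistent)."""
    problems = []
    if envelope["status"] == "completed" and envelope.get("completed_at") is None:
        problems.append("completed envelope missing completed_at")
    chunk_sum = sum(len(result["chunks"]) for result in envelope["results"])
    if envelope["total_chunks"] != chunk_sum:
        problems.append("total_chunks does not match per-file results")
    if envelope["errors"] and envelope["status"] == "completed":
        problems.append("errors present on a completed envelope")
    return problems
```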

Envelope Field Reference

Every field in the top-level job envelope.

| Field | Type | Description |
| --- | --- | --- |
| envelope_id | string | Unique envelope identifier (matches request_id). |
| request_id | string | The X-Request-ID for this ingest call. |
| status | string | One of pending, running, completed, or failed. |
| input_filenames | string[] | Original filenames as uploaded (includes pasted.md for text input). |
| input_count | int | Number of input items processed. |
| input_bytes | int | Total size of all uploaded files and pasted text, in bytes. |
| pipeline_options | object | Frozen snapshot of all request parameters: oracle_id, title, skip flags, chunk targets, heading depth. |
| pipeline_profile | string | Server-side pipeline profile (e.g. aws-canonical-v1). |
| build_sha | string | Git SHA of the deployed harness build. |
| triggered_by | string | Request origin: ui, api, or a custom value from the X-Triggered-By header. |
| results | PipelineResult[] | Per-file results array; each entry contains job_id, chunks, steps, normalized markdown, and errors. |
| total_files | int | Count of files successfully processed. |
| total_chunks | int | Sum of all chunks across all files. |
| total_tokens | int | Sum of estimated tokens across all chunks. |
| duration_ms | int | Total wall-clock time for the entire ingest request, in milliseconds. |
| normalization_degraded | bool | true if any file's normalization fell back to a degraded path (e.g. OCR heuristics failed). |
| stage_timing | object | Aggregated per-stage durations in ms across all files, e.g. {"extract": 284, "chunk": 52}. |
| s3_bucket | string? | S3 bucket used for artifact storage (null if S3 is not configured). |
| s3_envelope_key | string? | S3 object key where this envelope was persisted. |
| s3_chunk_keys | string[] | S3 object keys for each per-file chunks.jsonl upload. |
| created_at | string | ISO 8601 timestamp when the envelope was created (request received). |
| completed_at | string? | ISO 8601 timestamp when processing finished (null while running). |
| errors | string[] | Aggregated errors from all per-file results. |
| warnings | string[] | Non-fatal issues (e.g. S3 envelope upload failure). |