Embedding Pipeline
Visual walkthrough of the 13-stage oracle pipeline. Each uploaded document passes through every stage in order. Stages marked skippable can be toggled off per request.
Converts binary or structured input (PDF, DOCX, HTML, TXT) into clean UTF-8 text. Uses format-specific parsers to pull raw content before any normalization.
%PDF-1.7
3 0 obj
<< /Type /Page /Contents 4 0 R >>
stream
BT /F1 12 Tf (Section 1.2 ...)Tj ET
endstream
Section 1.2 Access Control Organizations shall implement access control policies in accordance with the risk assessment framework...
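The dispatch step can be sketched as follows. The parser names are placeholders for whatever libraries back each format (e.g. pdfminer.six for PDF, python-docx for DOCX); only the plain-text fallback is implemented here:

```python
from pathlib import Path

# Format-specific parser stubs; a real pipeline would back these with
# libraries such as pdfminer.six, python-docx, or BeautifulSoup.
def _parse_pdf(data: bytes) -> str:
    raise NotImplementedError

def _parse_docx(data: bytes) -> str:
    raise NotImplementedError

def _parse_html(data: bytes) -> str:
    raise NotImplementedError

PARSERS = {".pdf": _parse_pdf, ".docx": _parse_docx, ".html": _parse_html}

def extract_text(filename: str, data: bytes) -> str:
    """Route an upload to its parser; TXT and unknown formats fall
    through to a lossy-safe UTF-8 decode."""
    parser = PARSERS.get(Path(filename).suffix.lower())
    if parser is not None:
        return parser(data)
    return data.decode("utf-8", errors="replace")
```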
Strips page numbers, copyright lines, dot-leader TOC entries, horizontal rules, and excess blank lines left behind by the extraction layer.
Page 42
© 2024 Standards Corp.
Overview...............1
Scope..................3
---
## Overview
Organizations shall implement...
## Overview
Organizations shall implement...
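A minimal sketch of this stage as line-level regular expressions. The patterns are illustrative, not the pipeline's exact rule set:

```python
import re

# Line-level noise left behind by extraction.
NOISE_PATTERNS = [
    re.compile(r"^Page \d+$"),        # bare page numbers
    re.compile(r"^©.*$"),             # copyright lines
    re.compile(r"^.+\.{3,}\s*\d+$"),  # dot-leader TOC entries
    re.compile(r"^-{3,}$"),           # horizontal rules
]

def cleanup_extracted(text: str) -> str:
    """Drop noise lines, then collapse runs of blank lines."""
    lines = [
        line for line in text.splitlines()
        if not any(p.match(line.strip()) for p in NOISE_PATTERNS)
    ]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
```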
Repairs OCR artefacts: rejoins split words (cont rol → control),
compacts spaced numbers (2 0 1 6 / 6 7 9 → 2016/679),
and cleans up residual LaTeX fragments.
The re qui re ments for in formation se cu ri ty cont rol frameworks ref er ence Reg u la tion 2 0 1 6 / 6 7 9 as am end ed by Di rec tive 2 0 2 2 / 2 5 5 5.
The requirements for information security control frameworks reference Regulation 2016/679 as amended by Directive 2022/2555.
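Both repairs can be sketched in a few lines. The vocabulary set below is a stand-in for the real dictionary the pipeline would consult, and the word-rejoin pass normalizes whitespace as a side effect:

```python
import re

def compact_spaced_numbers(text: str) -> str:
    """Collapse digits split by OCR: '2 0 1 6 / 6 7 9' -> '2016/679'."""
    return re.sub(r"\b\d(?:\s[\d/])+\b",
                  lambda m: m.group(0).replace(" ", ""), text)

# Rejoining split words needs a vocabulary; this tiny set stands in
# for the pipeline's real dictionary.
VOCAB = {"control", "requirements", "information", "security"}

def rejoin_split_words(text: str) -> str:
    """Merge adjacent fragments whose concatenation is a known word."""
    words = text.split()
    out, i = [], 0
    while i < len(words):
        if i + 1 < len(words) and (words[i] + words[i + 1]).lower() in VOCAB:
            out.append(words[i] + words[i + 1])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return " ".join(out)
```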
Removes Table of Contents blocks, normative reference boilerplate, and other structural noise that would dilute embedding quality.
## Table of Contents
Overview.......1
Scope..........3
Normative References.......5
## Overview
Organizations shall implement access control policies...
## Overview
Organizations shall implement access control policies...
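As a sketch, a whole Table of Contents block can be dropped with one multiline regex, assuming the block runs from its heading up to the next ## heading (if no later heading exists, this pattern leaves the block untouched):

```python
import re

# Everything from '## Table of Contents' up to, but not including,
# the next '##' heading.
TOC_BLOCK = re.compile(r"^## Table of Contents\n.*?(?=^## )",
                       re.MULTILINE | re.DOTALL)

def strip_structural_noise(text: str) -> str:
    return TOC_BLOCK.sub("", text)
```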
Detects bare section markers (numbered clauses without heading syntax) and
promotes them to ## headings. Uses heuristics for numbering patterns
like 1.2, A.3.1, Annex B.
1.2 Security Controls
Organizations shall implement access control policies.
1.3 Risk Assessment
Risk assessments shall be conducted annually.
## 1.2 Security Controls
Organizations shall implement access control policies.
## 1.3 Risk Assessment
Risk assessments shall be conducted annually.
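One way to express the promotion heuristic is a single multiline regex over the three numbering shapes mentioned above (dotted clause numbers, letter-dotted clauses, and annexes). Real boundary cases need more care than this sketch takes, e.g. body sentences that happen to start with a number:

```python
import re

# A line starting with '1.2', 'A.3.1', or 'Annex B' followed by a title.
CLAUSE = re.compile(
    r"^((?:\d+(?:\.\d+)*|[A-Z](?:\.\d+)+|Annex [A-Z])\s+\S.*)$",
    re.MULTILINE,
)

def promote_headings(text: str) -> str:
    """Promote bare clause markers to '##' headings."""
    return CLAUSE.sub(r"## \1", text)
```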
Shifts every heading down one # level so that ## becomes
the reliable split boundary for the chunker. Prevents # title headings
from creating oversized top-level chunks.
# ISO 27001 Overview
## 1.2 Security Controls
### 1.2.1 Physical Access
## ISO 27001 Overview
### 1.2 Security Controls
#### 1.2.1 Physical Access
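The demotion itself is a one-line transform; capping the match at five #s keeps the result within markdown's six-level limit:

```python
import re

def demote_headings(text: str) -> str:
    """Shift every ATX heading down one level: '# X' -> '## X'.
    Headings already at level six are left alone."""
    return re.sub(r"^(#{1,5}) ", r"#\1 ", text, flags=re.MULTILINE)
```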
Breaks paragraphs exceeding chunk_max_words (default 450) at the nearest
sentence boundary to chunk_target_words (default 220). Prevents any single
paragraph from dominating a chunk's token budget.
/* ~500 word paragraph */
Organizations shall implement access
control policies in accordance with
the risk assessment framework. The
framework shall define roles ...
... periodic reviews of all access
rights granted to personnel. [500w]
/* split near word 220 */
Organizations shall implement access
control policies in accordance with
the risk assessment framework. [~220w]
The framework shall define roles ...
... periodic reviews of all access
rights granted to personnel. [~280w]
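A recursive sketch of the splitter, using a naive sentence-boundary regex (the real stage may segment sentences more carefully). It cuts at the first sentence boundary at or past the target, then recurses on the remainder:

```python
import re

def split_long_paragraph(paragraph: str,
                         target_words: int = 220,
                         max_words: int = 450) -> list[str]:
    """Split a paragraph over max_words at the first sentence boundary
    at or past target_words, then recurse on the remainder."""
    if len(paragraph.split()) <= max_words:
        return [paragraph]
    sentences = re.split(r"(?<=[.!?]) ", paragraph)
    head, count = [], 0
    for i, sentence in enumerate(sentences):
        head.append(sentence)
        count += len(sentence.split())
        if count >= target_words and i + 1 < len(sentences):
            rest = " ".join(sentences[i + 1:])
            return [" ".join(head)] + split_long_paragraph(
                rest, target_words, max_words)
    return [paragraph]  # no usable sentence boundary found
```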
Injects a YAML frontmatter block at the top of the document with canonical oracle
metadata: oracle_id, title, tier,
frameworks, and source provenance fields. Existing frontmatter is
merged, not overwritten.
## ISO 27001 Overview
Organizations shall implement access control policies...
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
frameworks:
  - iso_27001
  - iso_27002
source_format: pdf
---
## ISO 27001 Overview
Organizations shall implement access control policies...
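The injection step can be sketched without a YAML library by pre-rendering the canonical fields as key: value lines; a real implementation would parse any existing frontmatter and merge into it rather than prepending blindly:

```python
def inject_frontmatter(body: str, fields: dict[str, str]) -> str:
    """Prepend a YAML frontmatter block built from pre-rendered fields.
    (Sketch only: assumes the body has no existing frontmatter.)"""
    rendered = "\n".join(f"{key}: {value}" for key, value in fields.items())
    return f"---\n{rendered}\n---\n{body}"
```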
Classifies the document's content, via NLP keyword-frequency extraction, into the four S.I.R.E. routing buckets: Subject (what the document is about), Included (up to 24 terms it covers), Relevant (up to 12 related terms), and Excluded (up to 8 terms it explicitly does not address). The buckets are used downstream by the RAG retriever for precision routing.
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
frameworks:
  - iso_27001
---
## ISO 27001 Overview
...
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
frameworks:
  - iso_27001
sire_subject: "information security"
sire_included:
  - access control
  - risk assessment
  - asset management
  # ... up to 24 terms
sire_relevant:
  - business continuity
  - incident response
  # ... up to 12 terms
sire_excluded:
  - financial auditing
  - marketing analytics
  # ... up to 8 terms
---
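A stand-in for the extraction core: rank terms by frequency after stopword filtering. The real stage also scores multi-word phrases and derives the Relevant and Excluded buckets, which plain frequency counting alone cannot do:

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "and", "shall", "of", "in", "to", "a", "be", "with"}

def top_terms(text: str, n: int) -> list[str]:
    """Return the n most frequent non-stopword terms, most common first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS and len(t) > 3)
    return [term for term, _ in counts.most_common(n)]
```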
Auto-generates a dense prose semantic_description from the oracle's title,
frameworks, tier, and S.I.R.E. fields. This paragraph acts as a routing signal
for the RAG retriever — it's embedded alongside the chunks so the retriever
can match intent, not just keywords.
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
sire_subject: "information security"
# no semantic_description
---
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
tier: tier_1
sire_subject: "information security"
semantic_description: >-
  ISO 27001 Overview is a tier_1 oracle covering information security
  within the iso_27001, iso_27002 frameworks. Primary subjects include
  access control, risk assessment, and asset management.
---
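Generation can be as simple as a template over the existing metadata fields; the wording below mirrors the example output and is illustrative, not the pipeline's exact template:

```python
def build_semantic_description(meta: dict) -> str:
    """Render the routing paragraph from frontmatter fields.
    Field names follow the frontmatter examples above."""
    frameworks = ", ".join(meta.get("frameworks", []))
    subjects = ", ".join(meta.get("sire_included", [])[:3])
    return (f"{meta['title']} is a {meta['tier']} oracle covering "
            f"{meta['sire_subject']} within the {frameworks} frameworks. "
            f"Primary subjects include {subjects}.")
```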
Safety net for documents that contain no ## headings after normalization.
Inserts ## {title} at the top of the body so the chunker always has
at least one split boundary. Without this, the entire document would become a
single oversized chunk.
---
title: "ISO 27001 Overview"
---
Organizations shall implement access control policies in accordance with the risk assessment framework...
---
title: "ISO 27001 Overview"
---
## ISO 27001 Overview
Organizations shall implement access control policies in accordance with the risk assessment framework...
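The safety net is a small guard. The sketch below assumes the frontmatter has already been separated from the body:

```python
def ensure_heading(body: str, title: str) -> str:
    """Insert '## {title}' when the body has no '##' split boundary."""
    if any(line.startswith("## ") for line in body.splitlines()):
        return body
    return f"## {title}\n\n{body}"
```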
Appends a deterministic HTML comment to each ## section.
The watermark encodes provenance: oracle_id, chunk_id,
section title, authority tier, and S.I.R.E. subject. Survives round-trips through
markdown renderers and enables downstream traceability.
## 1.2 Security Controls
Organizations shall implement access control policies in accordance with the risk assessment framework.
## 1.2 Security Controls
Organizations shall implement
access control policies in
accordance with the risk
assessment framework.
<!-- watermark:
oracle_id: iso-27001-overview
chunk_id: iso-27001-overview:0
section: 1.2 Security Controls
authority: tier_1
sire: information security
-->
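A sketch of the watermark writer. The field layout and key names are taken from the example above, so treat them as assumptions rather than the canonical format:

```python
def watermark(section_text: str, meta: dict, index: int) -> str:
    """Append the provenance HTML comment to one '##' section."""
    comment = (
        "<!-- watermark:\n"
        f"  oracle_id: {meta['oracle_id']}\n"
        f"  chunk_id: {meta['oracle_id']}:{index}\n"
        f"  section: {meta['section']}\n"
        f"  authority: {meta['tier']}\n"
        f"  sire: {meta['sire_subject']}\n"
        "-->"
    )
    return section_text.rstrip() + "\n\n" + comment
```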
Splits the watermarked markdown on ## boundaries into individual chunks.
Each chunk gets a deterministic chunk_id, SHA-256 text_hash,
estimated token_count, and full metadata envelope. Output format is
newline-delimited JSON (JSONL).
/* full watermarked markdown */
---
oracle_id: iso-27001-overview
title: "ISO 27001 Overview"
...
---
## 1.2 Security Controls
Organizations shall ...
<!-- watermark: ... -->
## 1.3 Risk Assessment
Risk assessments shall ...
<!-- watermark: ... -->
/* chunks.jsonl (one JSON object per line) */
{ "chunk_id": "iso-27001-overview:0", "text": "## 1.2 Security Controls\n...", "text_hash": "a1b2c3...", "token_count": 187, "metadata": { "oracle_id": "iso-27001-overview", "tier": "tier_1", "section": "1.2 Security Controls" } }
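The chunker can be sketched as a split-and-serialize pass. The token estimate below (word count divided by 0.75) is a common rule of thumb, not necessarily what the pipeline uses, and the metadata envelope is abbreviated:

```python
import hashlib
import json
import re

def chunk_document(markdown_body: str, oracle_id: str) -> str:
    """Split on '##' boundaries and emit newline-delimited JSON."""
    sections = re.split(r"(?=^## )", markdown_body, flags=re.MULTILINE)
    lines = []
    for i, text in enumerate(s for s in sections if s.strip()):
        lines.append(json.dumps({
            "chunk_id": f"{oracle_id}:{i}",
            "text": text,
            "text_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
            "token_count": round(len(text.split()) / 0.75),
            "metadata": {"oracle_id": oracle_id},
        }))
    return "\n".join(lines)
```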
Skip Flags
Toggle individual normalization stages per request.
| Field | Default | Skips Stage |
|---|---|---|
| skip_ocr_cleanup | false | Stage 3 — Clean Standards OCR |
| skip_structure_headings | false | Stage 5 — Structure Standards Markdown |
| skip_demote_headings | false | Stage 6 — Demote Headings |
| skip_split_paragraphs | false | Stage 7 — Split Long Paragraphs |
Tuning Parameters
Control chunking and heading behaviour.
| Field | Default | Effect |
|---|---|---|
| chunk_target_words | 220 | Target word count for paragraph splitting (Stage 7). |
| chunk_max_words | 450 | Maximum words before a paragraph is force-split (Stage 7). |
| max_heading_depth | 5 | Deepest heading level used in structure promotion (Stage 5). |
Output Contract
Job envelope returned by POST /api/ingest.
{
"envelope_id": "c4fe90d3117f4a2b",
"request_id": "c4fe90d3117f4a2b",
"status": "completed",
"input_filenames": ["iso-27001.pdf"],
"input_count": 1,
"input_bytes": 284610,
"pipeline_options": { /* frozen config snapshot */ },
"pipeline_profile": "aws-canonical-v1",
"build_sha": "abc1234",
"triggered_by": "ui",
"results": [
{
"job_id": "a1b2c3d4e5f67890",
"normalized_markdown": "---\noracle_id: ...\n---\n## ...",
"chunks": [ /* JSONL chunk objects */ ],
"steps": [
{ "name": "extract", "duration_ms": 142 },
{ "name": "cleanup_extracted_markdown", "duration_ms": 18 },
/* ... all 13 stages ... */
],
"normalization_degraded": false
}
],
"total_files": 1,
"total_chunks": 42,
"total_tokens": 18500,
"duration_ms": 3200,
"normalization_degraded": false,
"stage_timing": { "extract": 142, "cleanup_extracted_markdown": 18, /* ... */ },
"s3_envelope_key": "envelopes/c4fe90d3117f4a2b.json",
"created_at": "2026-03-04T15:22:31.000Z",
"completed_at": "2026-03-04T15:22:34.200Z",
"errors": [],
"warnings": []
}
What Is the Job Envelope?
Every POST /api/ingest call returns a job envelope — a single JSON document that records everything about the request for auditability and downstream traceability.
The envelope follows the same provenance pattern used by the Goober chat system
(goober_chat_envelopes) and the oracle ingestion pipeline
(oracle_pipeline_envelopes). It captures:
- Identity — a unique envelope_id and request_id tied to the X-Request-ID header.
- Request context — the original filenames, file count, and a frozen snapshot of all pipeline options exactly as they were at invocation time.
- Per-file results — each file gets its own PipelineResult with job ID, chunks, stage timings, normalized markdown, and error state.
- Aggregated telemetry — total chunks, total tokens, and wall-clock duration aggregated across all files in the request.
- S3 provenance — the S3 bucket, the envelope object key, and per-file chunk JSONL keys so you can trace every artifact back to its source request.
- Timing — ISO 8601 timestamps for when the request was received (created_at) and when it finished (completed_at).
- Error and warning aggregation — errors from all per-file results are surfaced at the envelope level; non-fatal issues (e.g., S3 upload retries) appear in warnings.
How It Works
Lifecycle of a job envelope through a single ingest request.
| Step | What Happens |
|---|---|
| 1 | Client sends POST /api/ingest with files and/or text. The request-ID middleware assigns a deterministic X-Request-ID header (or honours the one the client sent). |
| 2 | A JobEnvelope is created with status running, the frozen pipeline_options snapshot, the build SHA, and the pipeline profile. No results yet. |
| 3 | Each file (or pasted text) is processed through the 13-stage pipeline. As each file completes, its PipelineResult is appended to the envelope and its JSONL chunk file is uploaded to S3. |
| 4 | After all files finish, the envelope aggregates totals (chunks, tokens, duration), collects errors, and sets status to completed or failed. |
| 5 | The finalized envelope JSON is uploaded to S3 at envelopes/{envelope_id}.json for long-term audit storage. |
| 6 | The envelope is returned as the HTTP response body. The results array is at the same path as before, so existing consumers continue to work. |
Envelope Field Reference
Every field in the top-level job envelope.
| Field | Type | Description |
|---|---|---|
| envelope_id | string | Unique envelope identifier (matches request_id). |
| request_id | string | The X-Request-ID for this ingest call. |
| status | string | One of pending, running, completed, or failed. |
| input_filenames | string[] | Original filenames as uploaded (includes pasted.md for text input). |
| input_count | int | Number of input items processed. |
| input_bytes | int | Total size of all uploaded files and pasted text in bytes. |
| pipeline_options | object | Frozen snapshot of all request parameters: oracle_id, title, skip flags, chunk targets, heading depth. |
| pipeline_profile | string | Server-side pipeline profile (e.g. aws-canonical-v1). |
| build_sha | string | Git SHA of the deployed harness build. |
| triggered_by | string | Request origin: ui, api, or a custom value from the X-Triggered-By header. |
| results | PipelineResult[] | Per-file results array; each entry contains job_id, chunks, steps, normalized markdown, and errors. |
| total_files | int | Count of files successfully processed. |
| total_chunks | int | Sum of all chunks across all files. |
| total_tokens | int | Sum of estimated tokens across all chunks. |
| duration_ms | int | Total wall-clock time for the entire ingest request in milliseconds. |
| normalization_degraded | bool | true if any file's normalization fell back to a degraded path (e.g. OCR heuristics failed). |
| stage_timing | object | Aggregated per-stage durations in ms across all files, e.g. {"extract": 284, "chunk": 52}. |
| s3_bucket | string? | S3 bucket used for artifact storage (null if S3 is not configured). |
| s3_envelope_key | string? | S3 object key where this envelope was persisted. |
| s3_chunk_keys | string[] | S3 object keys for each per-file chunks.jsonl upload. |
| created_at | string | ISO 8601 timestamp when the envelope was created (request received). |
| completed_at | string? | ISO 8601 timestamp when processing finished (null while running). |
| errors | string[] | Aggregated errors from all per-file results. |
| warnings | string[] | Non-fatal issues (e.g., S3 envelope upload failure). |