API Reference
REST API for the Ontic Embedding Harness. All endpoints accept and return JSON. Interactive documentation is also available at /api/docs (Swagger UI) and /api/openapi.json (OpenAPI spec).
Authentication
All write endpoints require an API key passed via the X-API-Key header.
The key is stored in AWS Secrets Manager and configured as the ONTIC_HARNESS_API_KEY environment variable.
In development mode (no key configured), authentication is bypassed.
# Example header
X-API-Key: your-api-key-here
POST /api/ingest
Upload one or more files and/or paste raw text. Each input is processed through the full script pipeline (extract, normalisation, frontmatter bootstrap, S.I.R.E., watermarking, local JSONL export). Returns RAG-ready chunks with timing traces and normalised markdown.
Request body is multipart/form-data.
Parameters
| Field | Type | Default | Description |
|---|---|---|---|
| files | File[] | — | One or more files (PDF, DOCX, YAML, JSON, MD, TXT). Max 50 MB each. |
| text | string | — | Raw text input (alternative to file upload). |
| oracle_id | string | "" | Identifier for the source document/oracle. |
| title | string | "" | Document title, prepended to each chunk preamble. |
| chunk_target_words | integer | 220 | Target word count per chunk (50–2000). |
| chunk_max_words | integer | 450 | Maximum word count per chunk (100–5000). |
| max_heading_depth | integer | 5 | Max dotted-section depth to promote to headings (1–8). |
| skip_ocr_cleanup | boolean | false | Skip the OCR artifact cleanup stage. |
| skip_structure_headings | boolean | false | Skip auto-heading promotion for section markers. |
| skip_demote_headings | boolean | false | Skip heading level demotion. |
| skip_split_paragraphs | boolean | false | Skip splitting long paragraphs at sentence boundaries. |
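The option ranges above can be enforced client-side before uploading anything. A minimal sketch; the helper name is illustrative, and encoding booleans as "true"/"false" form strings is an assumption about the server's form parsing, not a documented requirement:

```python
# Hypothetical client-side helper (not part of the API): assembles the
# form fields for POST /api/ingest and enforces the documented ranges
# before any bytes are uploaded.
def build_ingest_form(oracle_id="", title="",
                      chunk_target_words=220, chunk_max_words=450,
                      max_heading_depth=5, **skip_flags):
    if not 50 <= chunk_target_words <= 2000:
        raise ValueError("chunk_target_words must be in 50-2000")
    if not 100 <= chunk_max_words <= 5000:
        raise ValueError("chunk_max_words must be in 100-5000")
    if not 1 <= max_heading_depth <= 8:
        raise ValueError("max_heading_depth must be in 1-8")
    form = {
        "oracle_id": oracle_id,
        "title": title,
        "chunk_target_words": str(chunk_target_words),
        "chunk_max_words": str(chunk_max_words),
        "max_heading_depth": str(max_heading_depth),
    }
    # Assumed encoding: booleans sent as "true"/"false" form strings.
    for flag in ("skip_ocr_cleanup", "skip_structure_headings",
                 "skip_demote_headings", "skip_split_paragraphs"):
        form[flag] = "true" if skip_flags.get(flag, False) else "false"
    return form
```

The returned dict can be passed as the `data=` argument of a multipart POST alongside the file parts.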
Response (200)
The API returns a job envelope wrapping per-file results with request context, telemetry, and S3 provenance.
{
"envelope_id": "c4fe90d3117f4a2b",
"request_id": "c4fe90d3117f4a2b",
"status": "completed",
"input_filenames": ["SOC2-report.pdf"],
"input_count": 1,
"input_bytes": 284610,
"pipeline_options": {
"oracle_id": "SOC2-2024",
"title": "SOC 2 Type II Report",
"chunk_target_words": 220,
"chunk_max_words": 450,
"max_heading_depth": 5
},
"pipeline_profile": "aws-canonical-v1",
"build_sha": "abc1234",
"triggered_by": "ui",
"results": [
{
"job_id": "a1b2c3d4e5f67890",
"request_id": "c4fe90d3117f...",
"status": "completed",
"input_format": "pdf",
"oracle_id": "SOC2-2024",
"title": "SOC 2 Type II Report",
"chunks": [
{
"chunk_id": "SOC2-2024_chunk_0000",
"oracle_id": "SOC2-2024",
"section_title": "1 Scope",
"text": "--- frontmatter ... ---\n## 1 Scope\n...",
"text_hash": "e3b0c44298fc1c149...",
"word_count": 215,
"token_estimate": 287,
"chunk_index": 0,
"total_chunks": 12,
"metadata": { "title": "SOC 2 Type II Report", "section_title": "1 Scope" }
}
],
"total_chunks": 12,
"total_tokens": 3440,
"normalized_markdown": "---\noracle_id: SOC2-2024\n...",
"normalization_degraded": false,
"steps": [
{ "name": "extract", "duration_ms": 42 },
{ "name": "cleanup_extracted_markdown", "duration_ms": 8 },
{ "name": "clean_standards_ocr", "duration_ms": 31 },
{ "name": "cleanup_standards_for_embedding", "duration_ms": 5 },
{ "name": "structure_standards_markdown", "duration_ms": 3 },
{ "name": "demote_headings_for_chunking", "duration_ms": 2 },
{ "name": "split_long_paragraphs_for_chunking", "duration_ms": 4 },
{ "name": "bootstrap_frontmatter", "duration_ms": 15 },
{ "name": "infer_sire", "duration_ms": 25 },
{ "name": "inject_chunk_watermarks", "duration_ms": 7 },
{ "name": "export_local_chunks_jsonl", "duration_ms": 9 },
{ "name": "total", "duration_ms": 151 }
],
"created_at": "2026-03-04T15:22:31.000Z",
"errors": [],
"s3_key": "results/a1b2c3d4e5f67890/chunks.jsonl",
"s3_error": null
}
],
"total_files": 1,
"total_chunks": 42,
"total_tokens": 18500,
"duration_ms": 3200,
"normalization_degraded": false,
"stage_timing": {
"extract": 142,
"cleanup_extracted_markdown": 18,
"clean_standards_ocr": 12,
"bootstrap_frontmatter": 15,
"export_local_chunks_jsonl": 9
},
"s3_bucket": "ontic-harness-prod",
"s3_envelope_key": "envelopes/c4fe90d3117f4a2b.json",
"s3_chunk_keys": ["results/a1b2c3d4e5f67890/chunks.jsonl"],
"created_at": "2026-03-04T15:22:31.000Z",
"completed_at": "2026-03-04T15:22:34.200Z",
"errors": [],
"warnings": []
}
GET /api/health
Readiness probe. Returns service status, version, and uptime. No authentication required.
Response (200)
{
"status": "ok",
"version": "0.2.0",
"uptime_seconds": 1234.5,
"active_pipelines": 0,
"max_pipelines": 3,
"s3_status": "ok"
}
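A load balancer or deploy script can turn this payload into a readiness decision. A small sketch; treating "at least one free pipeline slot" as part of readiness is a policy choice made here, not something the endpoint mandates:

```python
# Hypothetical readiness predicate over the /api/health payload.
# "Ready" here means: service reports ok, S3 is reachable, and at least
# one pipeline slot is free (active_pipelines < max_pipelines).
def is_ready(health: dict) -> bool:
    return (
        health.get("status") == "ok"
        and health.get("s3_status") == "ok"
        and health.get("active_pipelines", 0) < health.get("max_pipelines", 1)
    )
```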
Data Models
Chunk
| Field | Type | Description |
|---|---|---|
| chunk_id | string | Deterministic ID: {oracle_id}_chunk_{index}. |
| oracle_id | string | Source document identifier. |
| section_title | string | Heading text for this chunk's section. |
| text | string | Full chunk text with preamble, ready for embedding. |
| text_hash | string | SHA-256 hex digest of the chunk text. |
| word_count | integer | Word count of the full chunk text. |
| token_estimate | integer | Estimated token count (words ÷ 0.75). |
| chunk_index | integer | Zero-based index within the document. |
| total_chunks | integer | Total chunks produced from this document. |
| metadata | object | Arbitrary key-value metadata (title, section, etc.). |
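A consumer can re-derive text_hash and token_estimate to detect corruption in transit. A sketch under two assumptions not stated above: the hash is taken over the UTF-8 bytes of the text field, and token_estimate rounds word_count ÷ 0.75 to the nearest integer:

```python
import hashlib

# Hypothetical integrity check for a Chunk record. Assumes text_hash is
# SHA-256 over the UTF-8 bytes of `text`, and token_estimate is
# round(word_count / 0.75) -- both assumptions about the server's
# exact computation.
def verify_chunk(chunk: dict) -> bool:
    digest = hashlib.sha256(chunk["text"].encode("utf-8")).hexdigest()
    expected_tokens = round(chunk["word_count"] / 0.75)
    return (digest == chunk["text_hash"]
            and expected_tokens == chunk["token_estimate"])
```

For the example chunk above, round(215 / 0.75) gives 287, matching the reported token_estimate.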
StepTrace
| Field | Type | Description |
|---|---|---|
| name | string | Script stage name (e.g. cleanup_extracted_markdown, infer_sire, inject_chunk_watermarks). |
| duration_ms | integer | Execution time in milliseconds. |
| detail | string | Optional detail string (e.g. format, char count). |
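The steps array on each result includes a synthetic total entry, and the individual stage durations should account for it: in the example response earlier, the eleven stage entries sum to 151 ms, matching the reported total. A small illustrative helper (the name is not part of the API):

```python
# Sums the non-"total" step durations from a StepTrace list, so the
# result can be compared against the synthetic "total" entry.
def stage_sum(steps: list) -> int:
    return sum(s["duration_ms"] for s in steps if s["name"] != "total")
```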
InputFormat (enum)
One of: pdf, docx, markdown, yaml, json, text
JobStatus (enum)
One of: pending, running, completed, failed
JobEnvelope
The top-level response wrapper for every POST /api/ingest call.
Follows the same provenance pattern as goober_chat_envelopes
and oracle_pipeline_envelopes — a single immutable audit record
capturing request context, pipeline configuration, per-file results, aggregated telemetry,
and S3 artifact locations.
| Field | Type | Description |
|---|---|---|
| envelope_id | string | Unique envelope identifier, equal to the X-Request-ID. |
| request_id | string | Request ID assigned by the middleware (or forwarded from the client header). |
| status | JobStatus | Final disposition: completed if all files succeeded, failed if any had errors. |
| input_filenames | string[] | Original upload filenames. Pasted text appears as pasted.md. |
| input_count | integer | Total number of input items processed. |
| input_bytes | integer | Total size of all uploaded files and pasted text, in bytes. |
| pipeline_options | PipelineOptions | Frozen snapshot of every request parameter at invocation time. |
| pipeline_profile | string | Server-side pipeline profile name (e.g. aws-canonical-v1). |
| build_sha | string | Git commit SHA of the deployed harness build. |
| triggered_by | string | Request origin: ui, api, or a custom value from the X-Triggered-By header. |
| results | PipelineResult[] | Per-file results, each containing job_id, chunks, step timings, and normalised markdown. |
| total_files | integer | Number of files in results. |
| total_chunks | integer | Aggregate chunk count across all files. |
| total_tokens | integer | Aggregate estimated token count across all chunks. |
| duration_ms | integer | End-to-end wall-clock duration for this request, in milliseconds. |
| normalization_degraded | boolean | true if any file's normalisation fell back to a degraded path. |
| stage_timing | object | Aggregated per-stage durations in ms across all files, e.g. {"extract": 284}. |
| s3_bucket | string? | S3 bucket name, or null if S3 is not configured. |
| s3_envelope_key | string? | S3 key where the envelope JSON was persisted for audit. |
| s3_chunk_keys | string[] | Per-file S3 keys for the chunks.jsonl uploads. |
| created_at | string | ISO 8601 timestamp when the request was received. |
| completed_at | string? | ISO 8601 timestamp when processing finished. |
| errors | string[] | Aggregated error messages from all per-file results. |
| warnings | string[] | Non-fatal warnings (e.g. S3 upload retries). |
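Because the envelope repeats its per-file data in aggregate form, clients can cross-check it cheaply. A hypothetical consistency check based on the aggregation semantics in the table (the helper and its messages are illustrative, not part of the API):

```python
# Hypothetical JobEnvelope sanity check: verifies that the aggregate
# fields agree with the per-file results and that a "completed"
# envelope carries no errors.
def check_envelope(env: dict) -> list:
    problems = []
    if env["total_files"] != len(env["results"]):
        problems.append("total_files does not match len(results)")
    if env["total_chunks"] != sum(r["total_chunks"] for r in env["results"]):
        problems.append("total_chunks does not match per-file sum")
    if env["status"] == "completed" and env["errors"]:
        problems.append("completed envelope carries errors")
    return problems
```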
Error Handling
All errors follow the standard FastAPI error format. The response includes an HTTP status code
and a JSON body with a detail field.
| Code | Meaning |
|---|---|
| 400 | Bad request — no input provided, unsupported file type, or invalid options. |
| 401 | Missing or invalid X-API-Key header. |
| 413 | File too large (exceeds max upload size). |
| 422 | Validation error — invalid field values. |
| 500 | Internal server error. |
Error response example:

{
  "detail": "Unsupported file type: .exe. Allowed: ['.doc', '.docx', '.json', ...]"
}
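On the client side, the status code and the detail field are all that is needed to produce a readable message. A sketch; the hint strings and function name are illustrative, only the codes and the {"detail": ...} shape come from the API:

```python
# Hypothetical client-side error formatting for the FastAPI-style
# {"detail": ...} body returned by the harness.
_ERROR_HINTS = {
    400: "bad request",
    401: "missing or invalid API key",
    413: "file too large",
    422: "validation error",
    500: "internal server error",
}

def describe_error(status_code: int, body: dict) -> str:
    hint = _ERROR_HINTS.get(status_code, "unexpected error")
    return f"{status_code} ({hint}): {body.get('detail', '<no detail>')}"
```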
Examples
cURL — Upload a PDF
curl -X POST https://oracle.onticlabs.ai/api/ingest \
  -H "X-API-Key: $ONTIC_API_KEY" \
  -F "files=@document.pdf" \
  -F "oracle_id=SOC2-2024" \
  -F "title=SOC 2 Type II Report"
cURL — Paste text
curl -X POST https://oracle.onticlabs.ai/api/ingest \
  -H "X-API-Key: $ONTIC_API_KEY" \
  -F "text=## Section 1\n\nThis is a sample document..." \
  -F "oracle_id=TEST-001"
Python
# pip install httpx
import httpx

with open("report.pdf", "rb") as f:
    resp = httpx.post(
        "https://oracle.onticlabs.ai/api/ingest",
        headers={"X-API-Key": API_KEY},
        files={"files": ("report.pdf", f)},
        data={"oracle_id": "SOC2-2024", "title": "SOC 2 Type II"},
    )

result = resp.json()
print(f"Envelope {result['envelope_id']}: {result['status']}")
print(f"{result['total_chunks']} chunks, {result['total_tokens']} tokens in {result['duration_ms']}ms")

first = result["results"][0]
print(f"First file: {first['total_chunks']} chunks, {first['total_tokens']} tokens")
TypeScript / Node.js
import fs from "node:fs";

// Node 18+: the built-in fetch/FormData requires a Blob, not a stream.
const form = new FormData();
form.append("files", new Blob([fs.readFileSync("report.pdf")]), "report.pdf");
form.append("oracle_id", "SOC2-2024");

const res = await fetch("https://oracle.onticlabs.ai/api/ingest", {
  method: "POST",
  headers: { "X-API-Key": process.env.ONTIC_API_KEY! },
  body: form,
});

const data = await res.json();
console.log(`Envelope ${data.envelope_id}: ${data.status}`);
console.log(`${data.total_chunks} chunks, ${data.total_tokens} tokens in ${data.duration_ms}ms`);

const first = data.results?.[0];
console.log(`First file: ${first?.total_chunks ?? 0} chunks`);
Pipeline Stages
The script pipeline runs these stages in order (some are optional based on request flags):
| # | Stage | Description |
|---|---|---|
| 1 | extract | Binary/structured format → UTF-8 plain text. |
| 2 | cleanup_extracted_markdown | Normalize raw extracted text and remove extraction noise. |
| 3 | clean_standards_ocr | Repair OCR artifacts (optional: skip_ocr_cleanup). |
| 4 | cleanup_standards_for_embedding | Standards-specific cleanup prior to structuring/chunking. |
| 5 | structure_standards_markdown | Promote structural markers into markdown headings (optional: skip_structure_headings). |
| 6 | demote_headings_for_chunking | Normalize heading levels for stable chunk split points (optional: skip_demote_headings). |
| 7 | split_long_paragraphs_for_chunking | Split oversized paragraphs at sentence boundaries (optional: skip_split_paragraphs). |
| 8 | bootstrap_frontmatter | Ensure frontmatter fields are present/normalized. |
| 9 | infer_sire | Infer and apply S.I.R.E. metadata from content. |
| 10 | inject_chunk_watermarks | Inject provenance watermark comments per chunk section. |
| 11 | export_local_chunks_jsonl | Emit deterministic JSONL chunk records. |
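The skip flags map one-to-one onto the optional stages in the table, so the stage plan for a given request can be derived up front. An illustrative sketch based on the "optional:" annotations above; the function itself is not a server API:

```python
# Stage table from the docs: (stage_name, skip_flag or None).
_STAGES = [
    ("extract", None),
    ("cleanup_extracted_markdown", None),
    ("clean_standards_ocr", "skip_ocr_cleanup"),
    ("cleanup_standards_for_embedding", None),
    ("structure_standards_markdown", "skip_structure_headings"),
    ("demote_headings_for_chunking", "skip_demote_headings"),
    ("split_long_paragraphs_for_chunking", "skip_split_paragraphs"),
    ("bootstrap_frontmatter", None),
    ("infer_sire", None),
    ("inject_chunk_watermarks", None),
    ("export_local_chunks_jsonl", None),
]

def planned_stages(**flags) -> list:
    """Return the stages expected to run, given the request's skip flags."""
    return [name for name, flag in _STAGES
            if flag is None or not flags.get(flag, False)]
```

Comparing this plan against the names in a response's steps array (ignoring the synthetic total entry) is one way to confirm the server honoured the requested flags.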