Quickstart - Ingestor Docs

Get started in five steps

Convert any document into high-quality, LLM-ready data.

Step 1: Install the Ingestor SDK

Our SDK is currently available for Python and Node, with REST support coming soon.

pip install ingestorai

Step 2: Initialize client

Navigate to your dashboard to obtain a copy of your API key.

from ingestor import Ingestor

client = Ingestor(api_key="INGESTOR_API_KEY")
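
Hardcoding a key in source files makes it easy to leak. As a minimal sketch, the key can be read from an environment variable instead. The helper name `load_api_key` and the variable name `INGESTOR_API_KEY` are assumptions made for illustration (the name is borrowed from the placeholder above) and are not part of the SDK.

```python
import os


def load_api_key(env=None, name="INGESTOR_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing.

    `name` is an assumed variable name, not one the SDK is known to read.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first")
    return key


# Usage with the client from this step (commented out so the sketch runs
# without the SDK installed):
# client = Ingestor(api_key=load_api_key())
```

Failing early with a clear message tends to be easier to debug than an authentication error returned later by the API.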

Step 3: Call the document parsing API

We use distributed systems with high concurrency to process your documents. Each document passes through multiple pipeline stages, and every stage builds on the last: splitting files, analyzing layout, extracting content, removing noise, formatting content, classifying, enriching content, and so on. The example below shows how to parse a loan application scan that may include multiple documents, such as the form, ID, and supporting materials.

# input formats: eml, pdf, docx, doc, xlsx, xls, csv, xml, jpg, jpeg, png, gif, bmp, tiff, pptx, ppt
# input can be a local path or file-like object

response = client.content.parse(
    input="loan_application_john_smith.pdf",  
    processing_options={
        "ocr_mode": "layout_aware", # Extracts tables, paragraphs, forms, signatures, etc.
        "classify_content": ["bank_statement", "loan_application", "id_document", "other"],  # Choose optional classifications to assign to docs
        "generate_document_title_metadescription": True, # Generates a title and meta description for semantic search, RAG, and frontend UI.
        "split_into_documents": True, # Splits a single document into its individual components (e.g., bank statement, ID document, etc.)
        "chunking_strategy": "page", # Or title_section / paragraph
        "contextualize_chunks": True, # Agentic process that adds missing context to each chunk so it can stand alone.
        "generate_table_summary": True, # Generates a summary of any tables found. Helps with search retrieval accuracy.
        "generate_figure_summary": True, # Generates a summary of any images found. Helps with search retrieval accuracy.
        "content_unique_id": True, # Helps prevent duplicate content being stored in your database (it happens). 
        "extract_figures": True, # Extracts images (photos, logos, etc.) embedded in documents
        "remove_blocks": ["FOOTER", "HEADER"] # Because some elements are almost always noise that can distract an LLM
    }
)

job_id = response["job_id"]
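
Before submitting a file, it can help to check its extension locally against the input formats listed in the comment above. The sketch below is a client-side convenience only: the helper `is_supported` is hypothetical, and how the API itself validates file types is not described here.

```python
from pathlib import Path

# Extensions copied from the "input formats" comment in the example above.
SUPPORTED_EXTENSIONS = {
    "eml", "pdf", "docx", "doc", "xlsx", "xls", "csv", "xml",
    "jpg", "jpeg", "png", "gif", "bmp", "tiff", "pptx", "ppt",
}


def is_supported(path):
    """Return True if the file extension is in the documented input list."""
    suffix = Path(path).suffix.lower().lstrip(".")
    return suffix in SUPPORTED_EXTENSIONS


# Example: skip files the parser is not documented to accept.
candidates = ["loan_application_john_smith.pdf", "notes.txt"]
to_parse = [p for p in candidates if is_supported(p)]
```

The comparison is case-insensitive by choice, so `SCAN.TIFF` passes the local check. Whether the service treats uppercase extensions the same way is an assumption.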

Step 4: Poll for job completion

Because jobs run through a multi-step pipeline, they may take several minutes to complete. Additional processing_options can further increase this time.

documents = client.jobs.get_document_response(job_id, include=["layout_bbox", "blocks", "document_images"])
print(documents)
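
The call above fetches the job once. A polling loop around it could look like the sketch below. It assumes that fetching an unfinished job returns a dict whose "status" is something other than "COMPLETED". The failure status names, the helper `wait_for_job`, and its default interval and timeout are all assumptions, not documented SDK behavior.

```python
import time


def wait_for_job(fetch, job_id, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Call fetch(job_id) until the job reports a terminal status.

    "COMPLETED" appears in the sample response; the failure statuses below
    are assumed names, not documented values.
    """
    failed_statuses = {"FAILED", "ERROR", "CANCELLED"}  # assumed names
    waited = 0.0
    while True:
        response = fetch(job_id)
        status = response.get("status")
        if status == "COMPLETED":
            return response
        if status in failed_statuses:
            raise RuntimeError(f"Job {job_id} ended with status {status}")
        if waited >= timeout:
            raise TimeoutError(f"Job {job_id} not finished after {timeout}s")
        sleep(interval)
        waited += interval


# Usage with the client from this guide (commented out so the sketch runs
# without the SDK installed):
# documents = wait_for_job(
#     lambda j: client.jobs.get_document_response(j, include=["blocks"]),
#     job_id,
# )
```

Passing the fetch function in as an argument keeps the loop independent of the SDK, which also makes it easy to exercise with a stub.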

response

{
    "job_id": "b7e8c2a1-4f3d-4e2a-9c1b-2d5f6a7e8c2b",
    "status": "COMPLETED",
    "usage": { "credits": 400, "pages": 200 },
    "has_more": false,
    "result": [
        // One object per split document
        {
            "doc_id": "b1234567-89ab-4cde-f012-3456789abcde",
            "title": "JPMorgan Chase Bank Statement 10/31/24",
            "meta_description": "Bank statement for John Smith covering transactions from October 1-31, 2024, with closing balance of $3,247.85",
            "classification": "bank_statement",
            "file_type": "pdf",
            "chunks": [
                {
                    "id": "b7e8c2a1-4f3d-4e2a-9c1b-2d5f6a7e8c2b",
                    "chunk_type": "page",
                    "order": 1,
                    "orig_page_number": 1,
                    "orig_page_image_url": "https://...signed", // present when include contains "document_images"
                    "content": "Statement of account...",
                    "md_content": "# Statement of account...",
                    "contextualized_content": "This chunk is from a JPMorgan Chase bank statement for John Smith... # Statement of account...",
                    "blocks": [
                        { "type": "title", "text": "Statement of account...", "bbox": { "top": 30, "left": 150, "width": 300, "height": 50 }, "confidence": 0.98 },
                        { "type": "figure", "summary": "JPMorgan Chase logo", "image_url": "https://...signed", "bbox": { "top": 35, "left": 30, "width": 80, "height": 40 }, "confidence": 0.89 },
                        { 
                            "type": "table", 
                            "summary": "This table shows the opening and closing balance for October 2024, with the opening balance being $2,500.00 and the closing balance being $3,247.85.", 
                            "bbox": { "top": 90, "left": 400, "width": 320, "height": 120 }, 
                            "cells": [
                                // … truncated for brevity …
                            ] 
                        },
                        { "type": "paragraph", "text": "Your profile is 100% complete...", "bbox": { "top": 220, "left": 30, "width": 600, "height": 40 }, "confidence": 0.96 },
                        // … truncated for brevity …
                    ]
                }
            ],
            "source": {
                "start_index": 0
            }
        },
        // Additional documents truncated for brevity …
    ]
}
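
To load this response into a vector store or search index, the nested structure usually needs flattening into one record per chunk. The sketch below uses only field names that appear in the sample response above. The helper `iter_chunk_records` and the text fallback order are illustrative choices, not part of the SDK.

```python
def iter_chunk_records(documents):
    """Yield one flat record per chunk, ready for embedding or indexing."""
    for doc in documents.get("result", []):
        for chunk in doc.get("chunks", []):
            # Prefer the contextualized text, then Markdown, then plain text.
            text = (
                chunk.get("contextualized_content")
                or chunk.get("md_content")
                or chunk.get("content", "")
            )
            yield {
                "doc_id": doc.get("doc_id"),
                "classification": doc.get("classification"),
                "chunk_id": chunk.get("id"),
                "page": chunk.get("orig_page_number"),
                "text": text,
            }


# A trimmed-down stand-in for the response shown above.
sample = {
    "result": [
        {
            "doc_id": "doc-1",
            "classification": "bank_statement",
            "chunks": [
                {"id": "chunk-1", "orig_page_number": 1,
                 "content": "Statement of account...",
                 "md_content": "# Statement of account..."},
            ],
        }
    ]
}
records = list(iter_chunk_records(sample))
```

The fallback matters when `contextualize_chunks` is off, since `contextualized_content` may then be absent from each chunk.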

Step 5: Extract key-value fields (optional)

Use our SDK to extract specific key-value pairs from the parsed document response.
You have two options:

Use our hosted endpoint (default): We’ll handle the extraction and bill you for usage.
Bring your own OpenAI API key: Pass your own openai_api_key and we won’t charge you for extraction; OpenAI will bill you directly.

Best practices are applied to ensure reliable structured outputs.

import os
from pydantic import BaseModel, Field

class BankStatement(BaseModel):
    account_holder: str = Field(..., description="First and last name")
    closing_balance: float
    statement_period_start: str = Field(..., description="ISO date")
    statement_period_end: str = Field(..., description="ISO date")

for doc in documents["result"]:
    if doc.get("classification") == "bank_statement":
        extracted = client.content.extract(
            # To use your own OpenAI key and avoid Ingestor charges, uncomment the next line:
            # openai_api_key=os.getenv("OPENAI_API_KEY"),
            parsed_document=doc,
            extraction_schema=BankStatement,
        )
        print(extracted)

response

{
    "status": "succeeded",
    "extracted": {
        "account_holder": "John Smith",
        "closing_balance": 3247.85,
        "statement_period_start": "2024-10-01",
        "statement_period_end": "2024-10-31"
    },
    "usage": { "credits": 2, "pages": 4 }
}
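
Extracted values come from a model, so a quick sanity check before storing them can catch obvious problems. The sketch below checks the fields of the `BankStatement` schema using only the standard library. The helper `check_statement` and its rules are illustrative assumptions, not something the SDK provides.

```python
from datetime import date


def check_statement(extracted):
    """Return a list of problems found in an extracted bank statement."""
    problems = []
    try:
        start = date.fromisoformat(extracted["statement_period_start"])
        end = date.fromisoformat(extracted["statement_period_end"])
        if start > end:
            problems.append("statement period starts after it ends")
    except (KeyError, TypeError, ValueError) as exc:
        problems.append(f"bad statement period: {exc}")

    balance = extracted.get("closing_balance")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(balance, bool) or not isinstance(balance, (int, float)):
        problems.append("closing_balance is not a number")

    holder = extracted.get("account_holder")
    if not isinstance(holder, str) or not holder.strip():
        problems.append("account_holder is empty")
    return problems


# Values taken from the sample response above.
sample = {
    "account_holder": "John Smith",
    "closing_balance": 3247.85,
    "statement_period_start": "2024-10-01",
    "statement_period_end": "2024-10-31",
}
issues = check_statement(sample)
```

An empty list means the record passed every check. Anything else can be routed to manual review rather than silently stored.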

Next steps

Now that you’ve parsed your first document, explore these key concepts:

Parse content

Learn about different document parsing functions.

Extract structured outputs

Learn best practices to ensure reliable structured outputs.

Agentic chunking

Learn how to get contextually rich chunks so your agent never misses critical context.

API reference

Explore endpoints, schemas, and examples to integrate programmatically.

Need help? Join our Discord.