Extract text from complex documents — tables, forms, handwriting, and full page layouts — using Chandra, a high-accuracy OCR engine built for real-world document complexity.

Overview

Chandra OCR handles the document types that trip up standard OCR: multi-column tables with merged cells, mixed print and handwriting, and complex page layouts. It outputs structured data (DataFrames, JSON) and supports GPU acceleration for batch processing.

Instructions

Installation

bash

pip install chandra-ocr

For GPU acceleration (recommended for batch processing):

bash

pip install chandra-ocr[gpu]

Basic Text Extraction

python

from chandra import OCR

ocr = OCR()

result = ocr.read("document.png")
print(result.text)

# From a PDF
result = ocr.read("report.pdf")
for page in result.pages:
    print(f"--- Page {page.number} ---")
    print(page.text)

Layout-Preserved Extraction

python

result = ocr.read("document.png", preserve_layout=True)

for block in result.blocks:
    print(f"Type: {block.type}")  # paragraph, table, header, handwriting
    print(f"Text: {block.text}")
    print(f"Confidence: {block.confidence:.2f}")

Table Extraction

python

result = ocr.read("invoice.png", extract_tables=True)

for table in result.tables:
    print(f"Table: {table.rows} rows x {table.cols} columns")
    df = table.to_dataframe()
    print(df.head())
    table.to_csv("extracted_table.csv")

Handwriting Recognition

python

result = ocr.read("handwritten_form.jpg", mode="handwriting")

for block in result.blocks:
    if block.type == "handwriting":
        print(f"Handwritten: {block.text} (conf: {block.confidence:.2f})")

Mixed Documents (Print + Handwriting)

python

result = ocr.read("filled_form.png", mode="mixed")

for block in result.blocks:
    print(f"[{block.type}] {block.text} (conf: {block.confidence:.2f})")

Batch Processing

python

import glob
from chandra import OCR
import json

ocr = OCR(device="cuda")
files = glob.glob("documents/*.pdf")

for file_path in files:
    result = ocr.read(file_path, extract_tables=True)
    output = {
        "file": file_path,
        "pages": len(result.pages),
        "text": result.text,
        "tables": [t.to_dict() for t in result.tables],
    }
    with open(file_path.replace(".pdf", ".json"), "w") as f:
        json.dump(output, f, indent=2)

Examples

Example 1: Extract Invoice Tables to CSV

python

from chandra import OCR

ocr = OCR()
result = ocr.read("invoice-2025-0342.pdf", extract_tables=True)

for i, table in enumerate(result.tables):
    df = table.to_dataframe()
    df.to_csv(f"invoice_table_{i}.csv", index=False)
    print(f"Table {i}: {table.rows} rows — columns: {list(df.columns)}")
# Output:
# Table 0: 12 rows — columns: ['Item', 'Qty', 'Unit Price', 'Total']
# Table 1: 3 rows — columns: ['Tax Type', 'Rate', 'Amount']

Example 2: Process Handwritten Medical Forms

python

from chandra import OCR
import requests

ocr = OCR()

result = ocr.read("patient_intake_form.jpg", mode="mixed", extract_tables=True)

extracted = {}
for block in result.blocks:
    extracted[block.label] = {
        "value": block.text,
        "confidence": block.confidence,
        "needs_review": block.confidence < 0.85,
    }

review_fields = {k: v for k, v in extracted.items() if v["needs_review"]}
print(f"Fields needing review: {list(review_fields.keys())}")
# Output:
# Fields needing review: ['allergies', 'signature']

Guidelines

Use device="cuda" for batch processing — 5-10x faster than CPU
Set dpi=300 or higher for scanned documents to improve accuracy
For forms with checkboxes, use mode="mixed" to detect both print and marks
Confidence threshold of 0.85 is a good default for human review routing
Pre-process images (deskew, denoise) for better results on poor-quality scans

Option	Default	Description
`mode`	`"auto"`	Detection mode: `auto`, `print`, `handwriting`, `mixed`
`preserve_layout`	`False`	Maintain spatial positioning of text
`extract_tables`	`False`	Detect and extract tables as structured data
`device`	`"cpu"`	Processing device: `cpu` or `cuda`
`language`	`"en"`	Primary language hint
`dpi`	`300`	DPI for PDF rasterization

chandra-ocr

Validation

Getting Started

Example Prompts

Documentation

Overview

Instructions

Installation

Basic Text Extraction

Layout-Preserved Extraction

Table Extraction

Handwriting Recognition

Mixed Documents (Print + Handwriting)

Batch Processing

Examples

Example 1: Extract Invoice Tables to CSV

Example 2: Process Handwritten Medical Forms

Guidelines

Information

Use Cases