Extract and Validate Structured Data from PDF Financial Reports

The Problem

An accounting firm processes 150 PDF financial statements per month from clients who send quarterly reports in different formats. Each PDF contains balance sheets, income statements, and cash flow tables that need to be extracted into a standardized Excel format for analysis. Currently, two junior analysts spend 3 days per month manually retyping numbers from PDFs. They make errors about 4% of the time -- transposed digits, missed rows, wrong decimal places -- and those errors compound through downstream tax calculations and audit reports, causing expensive corrections weeks later.

The Solution

Using the table-extractor, data-extractor, and data-validator skills, the workflow extracts tables from PDFs with camelot for structured tabular data, pulls non-tabular fields like company name and reporting period with the data extractor, and runs validation checks on every extracted value to catch errors before they enter the accounting system.

Step-by-Step Walkthrough

1. Extract tables from the PDF financial statements

Use camelot to detect and extract all tables from a batch of PDF reports, handling merged cells and multi-line headers that are common in financial statements.

Extract all tables from the 12 PDF files in ./client_reports/q3_2025/. Use lattice mode for bordered tables and stream mode as fallback for borderless tables. Output each table as a separate CSV with the filename format {company}_{table_number}.csv.

The table extractor identifies table boundaries, handles merged header cells spanning multiple columns, and preserves the row/column structure. Financial statements often have subtotal rows with bold formatting and indented line items -- the extractor preserves hierarchy through indentation detection.

2. Extract metadata and non-tabular fields

Pull structured fields that appear outside of tables: company name, fiscal year, reporting period, currency, auditor name, and filing date.

Extract these fields from each PDF: company_name, fiscal_year_end, reporting_currency, auditor_name, report_date, and total_pages. Output as a JSON file per document. Handle variations -- some reports say "Year Ended December 31, 2025" while others say "FY2025" or "12 months to 31/12/2025".

The data extractor handles the inconsistent formatting across different accounting firms. Date formats, currency symbols, and naming conventions vary widely, but the extractor normalizes everything to a consistent schema.

3. Validate extracted data for quality and consistency

Run comprehensive validation checks on every extracted value to catch extraction errors, missing data, and inconsistencies between tables.

Validate the extracted financial data. Check that: balance sheets balance (assets = liabilities + equity, within 0.01% tolerance), all currency values are numeric and positive where expected, no rows have null values in amount columns, year-over-year changes are within reasonable bounds (flag changes over 200%), and cross-reference totals between the income statement and balance sheet (net income should appear in both).

The validator produces a structured report for each company file, flagging every issue by severity:

VALIDATION REPORT — Acme Corp Q3 2025
======================================
Status: 3 errors, 5 warnings

ERRORS:
  [E01] Balance sheet imbalance (page 4, row 12)
        Assets: $14,892,341   Liabilities+Equity: $14,829,341
        Difference: $63,000 (0.42%) — exceeds 0.01% tolerance
  [E02] Revenue mismatch (income stmt vs balance sheet)
        Income statement total revenue:  $8,234,576
        Balance sheet retained earnings delta: $8,234,567
        Difference: $9 — likely transposed digit
  [E03] Null value in amount column (page 6, row 8)
        Field: "Other current liabilities" — blank cell

WARNINGS:
  [W01] YoY change exceeds 200%: R&D expenses $120K → $480K (+300%)
  [W02] Negative value in "Accounts receivable": -$3,200
  [W03] Currency symbol mismatch: page 3 uses "USD", page 5 uses "$"
  ...

Passed: 847 of 855 checks (99.1%)

The validator catches errors that manual review misses. A transposed digit in a $1,234,567 revenue figure becomes $1,234,576 -- a $9 difference that looks correct at a glance but fails the cross-reference check against the income statement total.

4. Generate the standardized output with an error report

Combine validated tables and metadata into the final standardized format, with a clear report of any issues found.

Merge all validated extractions into a single Excel workbook with one sheet per company. Add a "Validation Results" sheet listing every warning and error found, grouped by company and severity. Flag any values that were auto-corrected (like removing extra spaces or fixing decimal alignment) so an analyst can verify them.

The output workbook includes conditional formatting that highlights cells with validation warnings in yellow and errors in red, making it easy for an analyst to review the 5% of values that need attention rather than checking all 100%.

For recurring clients who submit reports in the same format every quarter, save the extraction configuration as a template. This eliminates the table-detection calibration step on subsequent runs and improves accuracy because the extractor knows exactly where to find each table on each page.

Real-World Example

A mid-size accounting firm processing 150 client PDFs per quarter deployed this pipeline and reduced extraction time from 3 analyst-days to 4 hours. The table extractor handled 89% of tables perfectly on the first pass; the remaining 11% needed minor manual corrections for unusual formatting like rotated headers or nested sub-tables. The validator caught 23 extraction errors in the first batch that would have gone undetected in the manual process -- including a $2.3M revenue figure where camelot misread a comma as a period, turning it into $2.3K. That single catch paid for the entire setup effort. The firm now processes quarterly batches in a single morning instead of dedicating half a week to data entry.

LIVETry this use case on your own files

Skills stack · 3 skills

table-extractor

data-extractor

data-validator