Extract Intelligence from PDF Archives — AI Workflow

Process large collections of scanned and digital PDFs to extract structured data, answer questions across documents, and build searchable knowledge bases.

The Problem

A real estate investment firm has 2,400 PDF documents accumulated over 8 years: property appraisals, lease agreements, inspection reports, and financial statements. Half are scanned paper documents with no searchable text.

When an analyst needs to compare cap rates across 15 properties or find every lease expiring in the next 6 months, they manually open documents one by one. A single due diligence review takes 3 days of reading PDFs to find the relevant data points buried across 40-60 documents. The firm missed a lease renewal deadline last year because nobody found the expiration clause buried on page 14 of a scanned agreement.

The Solution

Use the pdf-ocr skill to convert scanned documents into searchable text, the pdf-analyzer skill to extract structured data fields from each document type, and the chat-with-pdf skill to query across the processed archive for specific answers.

Step-by-Step Walkthrough

1. OCR the scanned document backlog

Process all scanned PDFs to extract machine-readable text:

Run OCR on all PDF files in the /property-docs/scanned/ directory. These are scanned property appraisals and inspection reports, mostly single-column layouts with occasional tables. Preserve the document structure including headers, tables, and numbered lists. Output each processed file as a searchable PDF alongside a plain text extraction. Flag any documents where OCR confidence is below 85% for manual review.

Scanned documents from the early years tend to have lower quality scans. Flagging low-confidence results prevents bad OCR from polluting the extracted data with garbled numbers, which is especially dangerous for financial figures where a misread "3" versus "8" changes an appraisal by millions.

2. Extract structured data from property appraisals

Parse appraisal documents to pull key financial and property metrics into structured records:

Analyze all PDF files matching "appraisal" in the /property-docs/ directory. From each appraisal, extract: property address, appraisal date, appraised value, cap rate, net operating income, gross rental income, vacancy rate, total square footage, year built, and appraiser name. Output as a CSV with one row per property. For documents where a field cannot be found, mark it as "NOT_FOUND" rather than guessing.

The "NOT_FOUND" approach is critical. A system that guesses missing values creates false confidence. Analysts need to know which data points are reliable extractions and which require manual verification.

3. Parse lease agreements for expiration tracking

Extract lease terms from all tenant agreements to build an expiration calendar:

Process all lease agreement PDFs in /property-docs/leases/. Extract from each: tenant name, property address, unit number, lease start date, lease end date, monthly rent, annual escalation percentage, security deposit amount, renewal option terms, and any early termination clauses. Output as structured JSON. For leases expiring within the next 12 months, flag them with priority "urgent" in the output.

Lease data extraction feeds directly into revenue forecasting. Knowing that 8 leases representing $47,000 in monthly rent expire in the next quarter lets the asset management team plan renewals or market vacant units before revenue gaps appear.

4. Query across the processed archive

Ask complex questions that span multiple documents to support investment decisions:

Using the processed document archive, answer these questions: What is the average cap rate across all properties appraised in 2025? Which 5 properties have the highest vacancy rates? Are there any inspection reports that mention structural concerns or deferred maintenance exceeding $50,000? List all leases with annual escalation below 3% that expire in the next 18 months, as these are candidates for renegotiation.

Cross-document queries reveal insights that no single document contains. Combining appraisal data with lease terms and inspection findings gives a complete picture of property performance that would take days to assemble manually.

5. Generate a due diligence summary for a specific property

Compile all available information about one property from every document that mentions it:

Compile a due diligence summary for the property at 2847 Industrial Parkway, Columbus OH. Search all documents in the archive for any mention of this address. Pull the most recent appraisal value, current lease terms for all tenants, last inspection findings, historical financial performance from any annual reports, and any environmental assessments. Organize into sections: Property Overview, Financial Summary, Tenant Roster, Physical Condition, and Risk Factors.

A due diligence package pulls from 40-60 documents across multiple types. The summary should cite the source document and page number for every data point so the analyst can verify critical figures.

Real-World Example

The firm is evaluating whether to sell a 12-property industrial portfolio. Previously, the analyst would spend 3 days opening 180 PDFs to compile the data. With the automated pipeline, the OCR pass processes 87 scanned documents in under an hour, the structured extraction pulls appraisal data from all 12 properties and lease terms from 34 active tenants, and the cross-document query identifies that 3 properties have cap rates below the target threshold and 8 leases expire within 14 months.

The due diligence package that took 3 days now takes an afternoon, and the analyst spends their time analyzing the data rather than hunting for it. The portfolio analysis reveals that one property has an inspection report flagging $120,000 in deferred roof maintenance that was not reflected in the operating budget, a finding that adjusts the asking price downward by $400,000.