pdf-analyzer
Extract text, tables, metadata, and structured data from PDF files. Use when a user asks to read a PDF, parse a PDF, extract data from a PDF, summarize a PDF document, pull tables from a PDF, or convert PDF content to structured formats like JSON or CSV. Handles single and multi-page documents, scanned PDFs, and PDFs with complex table layouts.
Usage
Getting Started
- Install the skill using the command above
- Open your AI coding agent (Claude Code, Codex, Gemini CLI, or Cursor)
- Reference the skill in your prompt
- The AI will use the skill's capabilities automatically
Example Prompts
- "Summarize the key findings in quarterly-report.pdf"
- "Extract all tables and figures from this research paper"
Documentation
Overview
Extract text, tables, and structured data from PDF files and convert them into usable formats. This skill handles text extraction, table detection, metadata reading, and output formatting for single or multi-page PDFs.
Instructions
When a user asks you to analyze, read, parse, or extract data from a PDF file, follow these steps:
Step 1: Identify the PDF and goal
Determine the file path and what the user wants extracted:
- Full text: All readable text from every page
- Tables: Structured tabular data
- Metadata: Title, author, creation date, page count
- Specific sections: Targeted content from certain pages
- Summary: A condensed version of the document contents
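When the goal is specific sections, the requested pages usually have to be turned into concrete page numbers before extraction. A minimal sketch, assuming the user gives a spec such as "1-3,7"; the helper name `parse_page_spec` and the spec format are illustrative assumptions, not part of the skill:

```python
def parse_page_spec(spec, page_count):
    """Turn a 1-based page spec like "1-3,7" into a sorted list of page numbers.

    Hypothetical helper: the spec format is an assumption for illustration.
    Pages outside 1..page_count are dropped.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            lo, hi = int(start), int(end)
        else:
            lo = hi = int(part)
        if lo > hi:
            # Accept reversed ranges such as "5-4".
            lo, hi = hi, lo
        pages.update(p for p in range(lo, hi + 1) if 1 <= p <= page_count)
    return sorted(pages)
```

Because `pdf.pages` in pdfplumber is a zero-indexed list, page number `n` from this helper corresponds to `pdf.pages[n - 1]`.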
Step 2: Choose the extraction method
Write a Python script using one of these libraries (prefer pdfplumber for tables, PyMuPDF for speed):
For text extraction:
import pdfplumber

def extract_text(pdf_path):
    text_by_page = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text:
                text_by_page.append({"page": i + 1, "text": text.strip()})
    return text_by_page
For table extraction:
import pdfplumber
import csv

def extract_tables(pdf_path, output_csv=None):
    all_tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            for table in page.extract_tables():
                if not table:
                    continue  # skip empty detections so table[0] cannot fail
                all_tables.append({
                    "page": i + 1,
                    "headers": table[0],  # first row is treated as the header
                    "rows": table[1:]
                })
    if output_csv and all_tables:
        # Writes the first table's headers once, then every table's rows;
        # this assumes all tables share the same columns.
        with open(output_csv, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(all_tables[0]["headers"])
            for table in all_tables:
                writer.writerows(table["rows"])
    return all_tables
For metadata:
import pdfplumber

def extract_metadata(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        return {
            "pages": len(pdf.pages),
            "metadata": pdf.metadata
        }
Step 3: Run the script and format output
Execute the script, then present results in the format the user needs (plain text, JSON, CSV, markdown table, or summary).
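As an illustration of the formatting step, the sketch below converts one extracted table (the `headers` and `rows` values produced by the table extraction script) into JSON or a markdown table. The function names are assumptions made for this example, and empty cells are assumed to come back as `None`:

```python
import json

def table_to_records(headers, rows):
    """Pair each row with the headers to get one dict per row."""
    return [dict(zip(headers, row)) for row in rows]

def table_to_json(headers, rows):
    """Serialize the table as a JSON array of row objects."""
    return json.dumps(table_to_records(headers, rows), indent=2)

def table_to_markdown(headers, rows):
    """Render headers and rows as a pipe-delimited markdown table."""
    def cell(value):
        # Show missing cells as blanks and keep each cell on one line.
        text = "" if value is None else str(value)
        return text.replace("\n", " ").replace("|", "\\|")

    lines = [
        "| " + " | ".join(cell(h) for h in headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell(c) for c in row) + " |")
    return "\n".join(lines)
```

Note that `zip` silently truncates rows that are longer or shorter than the header row, so check column counts before relying on the JSON output.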
Step 4: Handle errors gracefully
If extraction fails, try these fallback approaches:
- Switch from pdfplumber to PyMuPDF (fitz)
- For scanned PDFs, suggest OCR with pytesseract
- For encrypted PDFs, inform the user a password is needed
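The fallback order above can be sketched as a small chain that tries each extractor in turn. This is one possible arrangement, not the skill's required implementation: the function names are hypothetical, and the PyMuPDF calls (`fitz.open`, `doc.needs_pass`, `page.get_text`) assume a reasonably recent PyMuPDF release.

```python
def text_with_pdfplumber(pdf_path):
    import pdfplumber  # imported lazily so a missing library only fails this attempt
    with pdfplumber.open(pdf_path) as pdf:
        return [(page.extract_text() or "") for page in pdf.pages]

def text_with_pymupdf(pdf_path):
    import fitz  # PyMuPDF
    with fitz.open(pdf_path) as doc:
        if doc.needs_pass:
            raise PermissionError("PDF is encrypted; a password is needed")
        return [page.get_text() for page in doc]

def extract_with_fallback(pdf_path, extractors):
    """Try each (name, function) pair in order.

    Returns the first result that contains any text, along with the
    errors collected from the attempts that came before it.
    """
    errors = []
    for name, extract in extractors:
        try:
            pages = extract(pdf_path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
            continue
        if any(p.strip() for p in pages):
            return {"method": name, "pages": pages, "errors": errors}
        errors.append(f"{name}: no text found (possibly a scanned PDF; consider OCR)")
    return {"method": None, "pages": [], "errors": errors}
```

A call such as `extract_with_fallback(path, [("pdfplumber", text_with_pdfplumber), ("pymupdf", text_with_pymupdf)])` returns `method: None` when every attempt fails; the collected `errors` then indicate whether to suggest OCR or ask for a password.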
Examples
Example 1: Extract all tables from an invoice PDF
User request: "Pull the line items table from invoice.pdf and save as CSV"
Actions taken:
- Open invoice.pdf with pdfplumber
- Detect tables on each page
- Extract headers and rows
- Write to invoice_items.csv
Output:
Found 1 table on page 1 with 5 columns: Item, Quantity, Unit Price, Tax, Total
Extracted 12 rows. Saved to invoice_items.csv.
Preview:
| Item | Quantity | Unit Price | Tax | Total |
|-----------------|----------|------------|-------|---------|
| Widget A | 100 | $2.50 | $25.00| $275.00 |
| Widget B | 50 | $4.00 | $20.00| $220.00 |
Example 2: Summarize a research paper
User request: "Summarize the key findings from paper.pdf"
Actions taken:
- Extract full text from all pages
- Identify abstract, introduction, results, and conclusion sections
- Produce a structured summary
Output:
Document: paper.pdf (14 pages)
Title: Effects of Sleep Duration on Cognitive Performance
Authors: Smith et al., 2024
Key Findings:
- Participants sleeping < 6 hours showed 23% slower reaction times
- Optimal cognitive performance observed at 7-8 hours of sleep
- No significant benefit observed beyond 9 hours
Methods: Randomized controlled trial, n=240, over 12 weeks
Conclusion: 7-8 hours of sleep optimizes cognitive performance in adults
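The section-identification step in this example can be approximated with a simple heading heuristic. The sketch below is only a rough starting point: the list of section names and the heading pattern are assumptions, and real papers often use different or numbered headings that need a looser match.

```python
import re

# Assumed heading names; adjust for the document at hand.
SECTION_NAMES = ("abstract", "introduction", "methods", "results", "discussion", "conclusion")
HEADING = re.compile(
    r"^\s*(?:\d+\.?\s+)?(" + "|".join(SECTION_NAMES) + r")s?\s*$",
    re.IGNORECASE,
)

def split_sections(text):
    """Group lines under the most recent recognised heading.

    Text before the first recognised heading is ignored.
    """
    sections = {}
    current = None
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1).lower()
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}
```

If no headings are recognised the result is empty, in which case summarizing from the full text is the safer fallback.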
Guidelines
- Always check if the PDF is readable before attempting extraction. Some PDFs are image-only and require OCR.
- For large PDFs (100+ pages), process in batches and show progress.
- When extracting tables, validate that column counts are consistent across rows. Merged cells often cause misalignment.
- Preserve the original page numbers in output so the user can cross-reference.
- If a PDF has both text and scanned pages, extract text where available and flag scanned pages for OCR.
- Never invent or guess table headers. Treat the first row as the header unless the user specifies otherwise.
- For multi-column layouts (academic papers), extract text in reading order, not left-to-right across columns.
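Several of these guidelines can be checked with plain Python on the extracted data. The helpers below are an illustrative sketch: their names, the `min_chars` threshold, and the default batch size are assumptions, and `text_by_page` is expected in the `{"page": n, "text": ...}` shape returned by the text extraction script.

```python
def find_ragged_rows(table):
    """Return (row_index, cell_count) for rows whose width differs from the first row."""
    if not table:
        return []
    expected = len(table[0])
    return [(i, len(row)) for i, row in enumerate(table) if len(row) != expected]

def pages_needing_ocr(text_by_page, page_count, min_chars=20):
    """List 1-based page numbers with little or no extractable text.

    Pages missing from text_by_page count as having no text.
    min_chars is an arbitrary threshold chosen for illustration.
    """
    chars = {entry["page"]: len(entry["text"].strip()) for entry in text_by_page}
    return [n for n in range(1, page_count + 1) if chars.get(n, 0) < min_chars]

def page_batches(page_count, batch_size=25):
    """Yield (first_page, last_page) ranges, 1-based and inclusive."""
    for start in range(1, page_count + 1, batch_size):
        yield start, min(start + batch_size - 1, page_count)
```

For a large document, looping over `page_batches(page_count)` and reporting each finished range is one simple way to show progress while keeping original page numbers in the output.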
Information
- Version: 1.0.0
- Author: terminal-skills
- Category: Documents
- License: Apache-2.0