
data-extractor

Extract structured data from any document format using unified document processing. Use when a user asks to extract data from a document, parse a PDF, pull structured data from files, convert documents to JSON or CSV, extract fields from invoices or forms, or scrape data from documents.

#data-extraction #document-parsing #pdf #structured-data #etl
terminal-skills v1.0.0
Works with: claude-code, openai-codex, gemini-cli, cursor

Usage

$
✓ Installed data-extractor v1.0.0

Getting Started

  1. Install the skill using the command above
  2. Open your AI coding agent (Claude Code, Codex, Gemini CLI, or Cursor)
  3. Reference the skill in your prompt
  4. The AI will use the skill's capabilities automatically

Example Prompts

  • "Analyze the sales data in revenue.csv and identify trends"
  • "Create a visualization comparing Q1 vs Q2 performance metrics"

Documentation

Overview

Extract structured data from documents in any format: PDF, DOCX, HTML, TXT, images, and more. Converts unstructured or semi-structured content into clean JSON, CSV, or other structured formats. Handles invoices, forms, reports, and free-text documents.

Instructions

When a user asks you to extract data from a document, follow this process:

Step 1: Identify the document format and install dependencies

bash
# Determine file type
file document.pdf

# Install dependencies based on format
pip install pdfplumber python-docx beautifulsoup4 lxml openpyxl

Library selection by format:

  • PDF: pdfplumber (text + tables), PyMuPDF (fitz) for complex layouts
  • DOCX: python-docx
  • HTML: beautifulsoup4 with lxml
  • Excel: openpyxl or pandas
  • Images: pytesseract (OCR) with Pillow (see the OCR sketch after this list)
  • JSON/XML: Python standard library
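
For image-based inputs, OCR has to turn pixels into text before any parsing can happen. A minimal sketch, assuming the Tesseract binary is installed and using a hypothetical scanned_page.png:

python
from PIL import Image
import pytesseract

# OCR a scanned page; the resulting text feeds the same parsing steps below
image = Image.open("scanned_page.png")  # hypothetical input file
text = pytesseract.image_to_string(image)
print(text)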

Step 2: Extract raw content

PDF extraction:

python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        text = page.extract_text() or ""  # extract_text() returns None for empty pages
        print(f"--- Page {i+1} ---")
        print(text)

        # Extract tables if present
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)

DOCX extraction:

python
from docx import Document

doc = Document("document.docx")
for para in doc.paragraphs:
    print(f"[{para.style.name}] {para.text}")

# Extract tables
for table in doc.tables:
    for row in table.rows:
        print([cell.text for cell in row.cells])

HTML extraction:

python
from bs4 import BeautifulSoup

with open("document.html") as f:
    soup = BeautifulSoup(f, "lxml")

# Extract specific elements
for table in soup.find_all("table"):
    rows = table.find_all("tr")
    for row in rows:
        cells = [td.get_text(strip=True) for td in row.find_all(["td", "th"])]
        print(cells)

Step 3: Parse and structure the data

Once you have raw text, extract the target fields:

Pattern-based extraction:

python
import re
import json

text = "..."  # extracted text

# Define patterns for common fields
patterns = {
    "invoice_number": r"Invoice\s*#?\s*:?\s*(\w+[-/]?\w+)",
    "date": r"Date\s*:?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})",
    "total": r"Total\s*:?\s*\$?([\d,]+\.?\d*)",
    "email": r"[\w.-]+@[\w.-]+\.\w+",
}

extracted = {}
for field, pattern in patterns.items():
    match = re.search(pattern, text, re.IGNORECASE)
    if match:
        extracted[field] = match.group(1) if match.lastindex else match.group(0)

print(json.dumps(extracted, indent=2))

Line-item extraction from tables:

python
import pandas as pd

# table_data: one table from Step 2, e.g. page.extract_tables()[0]
headers = table_data[0]
rows = table_data[1:]
df = pd.DataFrame(rows, columns=headers)

# Clean up
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df = df.dropna(how="all")

Step 4: Validate and clean the output

python
# Type conversion (guard against fields the patterns did not find)
if "total" in extracted:
    extracted["total"] = float(extracted["total"].replace(",", ""))

# Date normalization
from datetime import datetime

if "date" in extracted:
    extracted["date"] = datetime.strptime(extracted["date"], "%m/%d/%Y").isoformat()

# Validate required fields
required = ["invoice_number", "date", "total"]
missing = [f for f in required if f not in extracted]
if missing:
    print(f"Warning: missing fields: {missing}")

Step 5: Output in the desired format

python
# JSON output
with open("extracted_data.json", "w") as f:
    json.dump(extracted, f, indent=2)

# CSV output
df.to_csv("extracted_items.csv", index=False)

# Pretty print summary
print(f"Extracted {len(extracted)} fields from document")
print(f"Line items: {len(df)} rows")

Examples

Example 1: Extract invoice data from a PDF

User request: "Extract the invoice details from this PDF"

Actions:

  1. Open the PDF with pdfplumber and extract text
  2. Use regex patterns to find invoice number, date, vendor, subtotal, tax, total
  3. Extract the line items table into a DataFrame
  4. Output a JSON file with header fields and a CSV with line items

Output:

json
{
  "invoice_number": "INV-2025-0042",
  "date": "2025-03-15",
  "vendor": "Acme Corp",
  "subtotal": 1250.00,
  "tax": 100.00,
  "total": 1350.00,
  "line_items": [
    {"description": "Widget A", "qty": 10, "unit_price": 75.00, "amount": 750.00},
    {"description": "Widget B", "qty": 5, "unit_price": 100.00, "amount": 500.00}
  ]
}
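
To produce the nested structure above, the line-item rows from the DataFrame in Step 3 can be folded into the header dict before writing. A short sketch, reusing the extracted and df names from the earlier steps:

python
# Nest line items under the header fields, then write one JSON document
extracted["line_items"] = df.to_dict(orient="records")
with open("invoice.json", "w") as f:
    json.dump(extracted, f, indent=2)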

Example 2: Extract contacts from a DOCX directory

User request: "Pull all names and email addresses from this company directory document"

Actions:

  1. Parse the DOCX file, iterate through paragraphs and tables
  2. Use regex to find email addresses and associated names
  3. Deduplicate and output as CSV

Output: A CSV file with columns: name, email, department, phone.
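
A minimal sketch of this flow, assuming a hypothetical directory.docx where each entry lists a name followed by an email address:

python
import csv
import re
from docx import Document

doc = Document("directory.docx")  # hypothetical input file
contacts = {}

# Match lines like "Jane Doe - jane.doe@example.com" (layout is an assumption)
entry = re.compile(r"(.+?)\s*[-:]\s*([\w.-]+@[\w.-]+\.\w+)")
for para in doc.paragraphs:
    match = entry.search(para.text)
    if match:
        name, email = match.group(1).strip(), match.group(2).lower()
        contacts[email] = name  # keying on email deduplicates

with open("contacts.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "email"])
    writer.writerows((name, email) for email, name in contacts.items())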

Example 3: Convert an HTML report to structured data

User request: "Extract the quarterly results table from this HTML page"

Actions:

  1. Parse the HTML with BeautifulSoup
  2. Find the target table by heading or class
  3. Extract headers and rows into a DataFrame
  4. Clean column names and convert numeric values
  5. Export as CSV and provide summary statistics

Output: A clean CSV with quarterly metrics and a summary of key figures.
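
One way to sketch this is pandas.read_html, which parses every table on the page and can filter by matching text; the "Quarterly Results" match string and the report.html file name are assumptions:

python
import pandas as pd

# Parse all tables from the page, keeping only those containing the match string
tables = pd.read_html("report.html", match="Quarterly Results")
df = tables[0]

# Clean column names and convert columns that are fully numeric
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
for col in df.columns:
    converted = pd.to_numeric(df[col], errors="coerce")
    if converted.notna().all():
        df[col] = converted

df.to_csv("quarterly_results.csv", index=False)
print(df.describe())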

Guidelines

  • Always inspect the raw extracted text before writing parsers. Understanding the layout saves time.
  • Use pdfplumber for most PDF extraction. Fall back to PyMuPDF for complex multi-column layouts.
  • For scanned PDFs (image-based), use OCR with pytesseract before parsing.
  • Validate extracted data types: convert strings to numbers, normalize dates.
  • Report extraction confidence: note any fields that could not be found or seem incorrect.
  • Handle multi-page documents by accumulating results across pages.
  • For batch extraction (many documents of the same type), build a reusable extraction function and apply it across all files (see the batch sketch after this list).
  • Always preserve the original document alongside extracted data for verification.
  • When patterns fail, fall back to positional extraction based on text layout (see the positional sketch after this list).
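
For the batch case, a minimal sketch of a reusable extraction function applied across a hypothetical invoices/ folder; it also accumulates text across pages, and the patterns echo Step 3:

python
import json
import re
from pathlib import Path

import pdfplumber

PATTERNS = {
    "invoice_number": r"Invoice\s*#?\s*:?\s*(\w+(?:[-/]\w+)*)",
    "total": r"Total\s*:?\s*\$?([\d,]+\.?\d*)",
}

def extract_invoice(path):
    """Extract header fields from one PDF, accumulating text across all pages."""
    with pdfplumber.open(path) as pdf:
        text = "\n".join(page.extract_text() or "" for page in pdf.pages)
    record = {"source": path.name}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text, re.IGNORECASE)
        record[field] = match.group(1) if match else None
    return record

results = [extract_invoice(p) for p in sorted(Path("invoices").glob("*.pdf"))]
with open("batch_results.json", "w") as f:
    json.dump(results, f, indent=2)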
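
And a sketch of the positional fallback, using pdfplumber's word coordinates to grab the value printed to the right of a label on the same visual line; the 3-point vertical tolerance is an assumption:

python
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    words = pdf.pages[0].extract_words()

# Find the word immediately to the right of a "Total" label on the same line
for w in words:
    if w["text"].rstrip(":").lower() == "total":
        same_line = [
            v for v in words
            if abs(v["top"] - w["top"]) < 3 and v["x0"] > w["x1"]
        ]
        if same_line:
            value = min(same_line, key=lambda v: v["x0"])
            print("total:", value["text"])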

Information

Version: 1.0.0
Author: terminal-skills
Category: Data & AI
License: Apache-2.0