Terminal.skills
Use Cases/Transform Messy Spreadsheets Into Business Intelligence and Save 20 Hours Weekly

Transform Messy Spreadsheets Into Business Intelligence and Save 20 Hours Weekly

Clean, analyze, and automate complex Excel data processing to eliminate manual work and generate actionable business insights.

Data & AI#excel#csv#spreadsheet#data#pandas
Works with:claude-codeopenai-codexgemini-clicursor
$

The Problem

Every Monday, Ana at a 67-person manufacturing company faces the same nightmare: 23 Excel files from different departments, each with its own formatting conventions, merged cells, and data quality issues. Sales uses mm/dd/yyyy dates. Operations uses dd-mm-yyyy. Finance embeds subtotals randomly throughout data tables. She needs to combine all of it into a unified executive dashboard.

The manual process takes 20 hours every week. Product codes like "SKU-001" in one sheet become "SKU_001" in another and "SKU 001" in a third. Customer names have spelling variations — "ACME Corp", "Acme Corporation", "ACME Corp." are the same company appearing as three different entities. Phone numbers come in every imaginable format: "(555) 123-4567", "555.123.4567", "5551234567". Duplicate detection is entirely manual and consistently error-prone.

The business cost goes beyond Ana's time. At $45/hour, the manual processing costs $46,800 annually. But the bigger problem is the 3-day delay between data collection and executive reporting. Leadership makes strategic decisions on stale information. Last quarter, they missed a 15% uptick in product returns because the data was 4 days old by the time anyone analyzed it. By then, another 200 units had shipped with the same defect.

The Solution

Using the excel-processor, data-analysis, and report-generator skills, the approach is to ingest all 23 files, intelligently normalize formats, detect and merge duplicates using fuzzy matching, run data quality checks, generate business insights through statistical analysis, and build automated dashboards that refresh without manual intervention.

Step-by-Step Walkthrough

Step 1: Multi-File Ingestion and Standardization

Process all 23 department Excel files. Detect data types, normalize formats,
and combine into a unified dataset.

Each file gets parsed, profiled, and normalized. The files range from simple (HR's 156-row headcount sheet) to chaotic (Finance's reconciliation file with embedded formulas, merged header cells, and subtotal rows mixed into the data):

  • Sales_Q4_2024.xlsx — 3,247 rows of product sales and customer info with inconsistent dates
  • Operations_Weekly.xlsx — 1,891 rows of inventory data with multiple SKU format variations
  • Finance_Reconciliation.xlsx — 892 rows with revenue, costs, and embedded calculations that reference other sheets
  • HR_Headcount.xlsx — 156 rows of employee data across departments
  • Plus 19 more files with similar issues

The format normalization handles the messy reality of multi-department data:

IssueScopeResolution
Date formats7 different patterns across filesStandardized to YYYY-MM-DD
Phone numbers12 format variationsNormalized to (XXX) XXX-XXXX
CurrencyMixed "$1,234.56" and bare "1234.56"Consistent $X,XXX.XX
Product codes23 naming variationsUnified SKU-XXXXX format
Company names89 spelling variationsDeduplicated to 34 unique entities

The combined dataset: 47,832 rows across 31 columns with 99.7% data integrity. The 0.3% that can't be auto-resolved gets flagged for human review rather than silently guessed — a critical distinction. A phone number with 11 digits could be a typo or an international number. That's Ana's call, not an algorithm's.

Step 2: Duplicate Detection and Cleaning

Identify duplicate records across files using fuzzy matching and business logic
rules. Clean inconsistent data entries.

Simple exact-match deduplication misses most of the real duplicates because the same entity is spelled differently across files. Fuzzy matching catches what exact matching can't:

Customer matching: "ACME Corp" = "Acme Corporation" = "ACME Corp." get merged using a combination of name similarity scoring and email/phone confirmation. "Smith, John" vs "John Smith" vs "J. Smith" get matched through their shared phone number. The matching is conservative — a name-only match flags for review, but a name + phone match merges automatically.

Product deduplication: SKU-001, SKU_001, and SKU 001 consolidate to a canonical SKU-001. Description variations like "Widget (Red)" vs "Red Widget" normalize to a consistent format. Same-SKU records with different prices get flagged for manual review — that's a business decision, not a data cleaning decision. The price discrepancy might be a regional difference, a bulk discount, or a data entry error.

Cleaning results:

  • 2,847 duplicate customer records merged (most recent data retained)
  • 834 product variations consolidated
  • 156 address standardizations applied
  • 23 pricing inconsistencies flagged for manual review
  • 4,891 phone number format fixes

Final dataset: 44,985 unique records — a 6% duplicate removal rate. That 6% was silently inflating every report Ana produced. Revenue per customer, order frequency, average order value — all of these metrics were wrong because the same customer appeared multiple times under different names.

Step 3: Data Quality Analysis

Analyze data quality issues, detect outliers, and generate data health reports
with recommendations.

With clean, deduplicated data, the quality analysis reveals what's missing and what looks wrong:

Completeness gaps:

  • Customer Email: 89.4% populated (4,782 missing) — worth running through an email append service
  • Product Cost: 94.2% populated (2,601 missing) — can inherit from category averages
  • Region: 78.3% populated (9,847 missing) — recoverable via ZIP code geo-lookup

Outliers that need investigation:

  • 23 orders valued over $50K (legitimate large orders or data entry errors?)
  • 12 products showing negative stock (definitely data errors — someone entered returns wrong)
  • 156 products with margins exceeding 90% (pricing review needed)
  • 4 customers placing over 100 orders per day (bot activity or bulk purchasing program?)

Statistical validation: Revenue follows a normal distribution. Order frequency shows a clear seasonal pattern with a December peak and February trough. Customer lifetime value follows a power law — the top 12% of customers drive 67% of revenue. Geographic spread covers 47 states with concentration in Texas, California, and New York.

Step 4: Business Insight Generation

Generate business insights through statistical analysis, trend detection, and
predictive modeling.

This is where the cleaned data starts paying for itself. Patterns that were invisible in 23 separate spreadsheets become obvious in the unified dataset:

Revenue insights:

  • Year-over-year growth is 23.4% overall, but 8 product categories are declining — the aggregate number masks a diversification problem
  • The Southwest region is outperforming by 31%, suggesting an expansion opportunity
  • Top 12% of customers drive 67% of revenue — a concentration risk worth monitoring

Product performance:

  • Two products are growing fast: SKU-1847 (+89%) and SKU-2103 (+76%)
  • The Accessories category is down 23% and needs intervention
  • $1.2M in slow-moving inventory identified across 340 SKUs — capital tied up in products that aren't selling

Customer intelligence:

  • 234 customers are showing a purchase decline pattern — early churn signals that sales can act on now, not after the customer leaves
  • Customers who buy from 3+ product categories have 67% higher lifetime value — a cross-sell opportunity
  • Untapped potential in the Pacific Northwest represents roughly $2.1M in addressable opportunity

Forecast: Q1 revenue projected at $3.8M with a 12% confidence interval. Inventory recommendations: increase SKU-1847 stock by 45%, reduce Accessories by 30%.

Step 5: Automated Dashboard and Reporting

Create executive dashboards with automated data refresh, alert systems, and
drill-down capabilities.

The final piece replaces Ana's manual reporting entirely. The system watches a shared folder for new Excel files and processes them automatically:

Executive dashboard — KPI summary with traffic-light status indicators, 13-week rolling averages, year-over-year comparisons, and revenue forecasts. Updated within 2 hours of new data arriving, not 3 days.

Department views — Sales gets pipeline and territory performance. Operations gets inventory turns and capacity utilization. Finance gets P&L trends and budget variance. Marketing gets campaign ROI and acquisition costs. Each view pulls from the same unified dataset, so the numbers always agree.

Alert system — significant changes trigger stakeholder notifications within 1 hour. A 15% spike in returns doesn't wait 4 days to surface — it appears in Slack the same day. Inventory dropping below safety stock triggers an immediate procurement alert.

Scheduled reports — weekly summaries auto-generate and distribute every Monday morning. By the time the executive team opens their email, the latest numbers are waiting.

Real-World Example

The transformation happens in stages. Week one: the 23 files are ingested, cleaned, and unified for the first time. The duplicate removal alone changes the numbers — 6% of records were inflating every metric. Customer counts, average order values, revenue per account — all slightly wrong, for years. Week two: the outlier detection catches 12 negative-stock products that had been causing fulfillment errors for months.

But the real payoff is the insight that falls out of clean data. The discovery that the Accessories category is down 23% — invisible when the data lived in separate spreadsheets — leads to a pricing adjustment that recovers $180K in annual revenue. The identification of $1.2M in slow-moving inventory triggers a clearance campaign that frees up warehouse space and working capital. Route optimization based on the geographic analysis saves $67,000 in logistics costs.

Ana's Monday morning shifts from 20 hours of copy-paste to 45 minutes of reviewing dashboards and flagged anomalies. The 3-day reporting delay drops to 2 hours. And the executive team stops making decisions on stale data — which turns out to be worth far more than the $46,800 in saved labor.