Overview
Read, transform, analyze, and generate Excel and CSV files using Python. This skill covers data loading, cleaning, filtering, aggregation, formula generation, and export to multiple formats.
Instructions
When a user asks you to work with spreadsheets, Excel files, or CSV data, follow these steps:
Step 1: Load the data
import pandas as pd
# For Excel files
df = pd.read_excel("data.xlsx", sheet_name=0) # or sheet_name="Sheet1"
# For CSV files
df = pd.read_csv("data.csv")
# Show shape and preview
print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
print(f"Columns: {list(df.columns)}")
print(df.head())
Always print the shape, column names, and first few rows so the user can verify the data loaded correctly.
Step 2: Assess data quality
# Check for issues
print(f"Missing values:\n{df.isnull().sum()}")
print(f"\nDuplicates: {df.duplicated().sum()}")
print(f"\nData types:\n{df.dtypes}")
Report any issues found before proceeding with transformations.
Step 3: Apply requested transformations
Common operations:
Filtering:
filtered = df[df["status"] == "active"]
filtered = df[df["amount"] > 1000]
filtered = df[df["date"].between("2024-01-01", "2024-12-31")]
Aggregation:
summary = df.groupby("category").agg(
count=("id", "count"),
total=("amount", "sum"),
average=("amount", "mean")
).reset_index()
Pivot tables:
pivot = df.pivot_table(
values="revenue",
index="region",
columns="quarter",
aggfunc="sum",
margins=True
)
Cleaning:
df["name"] = df["name"].str.strip().str.title()
df["email"] = df["email"].str.lower()
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df = df.drop_duplicates(subset=["id"])
df = df.dropna(subset=["required_field"])
Step 4: Export results
# To Excel with formatting
with pd.ExcelWriter("output.xlsx", engine="openpyxl") as writer:
df.to_excel(writer, sheet_name="Data", index=False)
summary.to_excel(writer, sheet_name="Summary", index=False)
# To CSV
df.to_csv("output.csv", index=False)
Step 5: Report what was done
Always summarize the operations performed, rows affected, and output file location.
Examples
Example 1: Clean and deduplicate a customer list
User request: "Clean up customers.xlsx -- remove duplicates, fix phone formatting, and split into active/inactive sheets"
Actions taken:
- Load
customers.xlsx(2,340 rows, 8 columns) - Remove 156 duplicate rows based on email
- Standardize phone numbers to (XXX) XXX-XXXX format
- Split into active (1,847 rows) and inactive (337 rows)
- Export to
customers_clean.xlsxwith two sheets
Output:
Loaded: 2,340 rows x 8 columns
Removed: 156 duplicate rows (by email)
Fixed: 892 phone numbers reformatted
Split: 1,847 active, 337 inactive
Saved to customers_clean.xlsx:
- Sheet "Active": 1,847 rows
- Sheet "Inactive": 337 rows
Example 2: Generate a monthly sales summary
User request: "Create a pivot table from sales.csv showing revenue by region and month"
Actions taken:
- Load
sales.csv(15,200 rows) - Parse date column, extract month
- Build pivot table: regions as rows, months as columns, sum of revenue
- Add row and column totals
- Export to
sales_summary.xlsx
Output:
| Region | Jan | Feb | Mar | Total |
|-----------|----------|----------|----------|-----------|
| North | $45,200 | $52,100 | $48,900 | $146,200 |
| South | $38,700 | $41,300 | $44,600 | $124,600 |
| East | $51,900 | $49,800 | $55,200 | $156,900 |
| West | $42,100 | $46,700 | $43,500 | $132,300 |
| Total | $177,900 | $189,900 | $192,200 | $560,000 |
Saved to sales_summary.xlsx
Guidelines
- Always show data shape and column names after loading so the user can verify correctness.
- When dealing with dates, explicitly parse them with
pd.to_datetime()and specify the format when ambiguous (e.g., is 01/02/03 Jan 2 or Feb 1?). - For large files (100k+ rows), warn about memory and suggest chunked processing.
- Preserve original files. Write output to new files unless the user explicitly asks to overwrite.
- When merging multiple files, check that column names match and report any discrepancies before proceeding.
- If the user asks for "formulas", generate the Excel formula strings (e.g.,
=SUM(B2:B100)) rather than computing values, so the spreadsheet stays dynamic. - Default to UTF-8 encoding for CSV. If the data contains special characters, test with
encoding="utf-8-sig"for Excel compatibility.