Terminal.skills
Use Cases/Allocate Infrastructure Costs Across Teams with AI

Allocate Infrastructure Costs Across Teams with AI

Break down cloud infrastructure costs by team, service, and environment to enable accurate chargeback and budgeting.

Data & AI#data-analysis#pandas#visualization#statistics#csv
Works with:claude-codeopenai-codexgemini-clicursor
$

The Problem

Your cloud bill is $47,000 per month and growing 15% quarter over quarter, but nobody can explain which team or service is driving the increase. Engineering says it is the data team's Spark jobs. The data team blames staging environments that never get torn down. Finance wants per-team cost breakdowns for budget planning, but your tagging strategy is inconsistent — only 60% of resources have proper cost-allocation tags. The CFO is asking hard questions, and the one person who understood the billing dashboard left the company three months ago.

Without per-team cost allocation, every budget conversation devolves into finger-pointing. The VP of Engineering cannot defend the cloud budget to the board without knowing what the money buys. And "we need to reduce cloud spend" becomes an impossible mandate because nobody knows where to cut.

The AWS Cost Explorer does not help. It shows spend by service — $8,400 on EC2, $6,200 on RDS, $3,100 on S3 — but that tells you nothing about who is responsible. The EKS cluster runs workloads for three teams. The RDS instance serves both the product and the data team. Shared infrastructure like VPCs, NAT Gateways, and CloudWatch costs $4,800 per month with no obvious owner.

The Solution

Use data-analysis to parse and categorize cloud billing exports, excel-processor to merge billing data with team ownership mappings, and report-generator to produce monthly cost allocation reports with trend analysis. The workflow takes a raw billing CSV and a team ownership spreadsheet, cross-references them, fills in gaps using resource tags and metadata, and produces a ready-to-present breakdown with optimization recommendations.

Step-by-Step Walkthrough

The workflow starts with two inputs: a billing export and a team ownership mapping. From there, every dollar gets traced to a team, anomalies get surfaced, optimization opportunities get quantified, and the whole thing gets packaged into a report that Finance and Engineering can both use.

Step 1: Export Your Billing Data and Team Mappings

Here's our AWS Cost and Usage Report for January (cost-usage-2026-01.csv.gz),
our team-service ownership spreadsheet (service-owners.xlsx), and our tagging
policy document. Break down costs by team, identify untagged spend, and show
month-over-month trends for the last 3 months.

The billing export contains 340,000 line items across 47 AWS services. Each line item is a usage record — one hour of an EC2 instance, one GB of S3 storage, one million Lambda invocations. The raw data is overwhelming, which is why most teams never look at it directly. The AWS Cost Explorer aggregates it into service-level summaries, but that hides the per-resource detail needed for team allocation.

The ownership spreadsheet maps each service to a team, but it was last updated two months ago and is missing the new data pipeline resources that the Data Team spun up in December. This is where the cross-referencing starts — and where the gaps in your tagging strategy become painfully visible.

Step 2: Map Services to Teams

The mapping process works in layers. First pass: match billing line items to the ownership spreadsheet by resource name and AWS account. Second pass: for resources not in the spreadsheet, check resource tags for team, project, or owner. Third pass: for anything still unmatched, infer from VPC placement, IAM creator, and naming conventions. Everything that cannot be matched goes into an "untagged" bucket for review.

The results after all three passes:

TeamResources OwnedMapping Method
PlatformEKS cluster, RDS primary, ElastiCache, CloudFrontSpreadsheet + confirmed tags
Product3 ECS services (api, web, worker), S3 media bucketSpreadsheet + confirmed tags
DataEMR cluster, Redshift, Glue jobs, S3 data-lake bucketSpreadsheet + confirmed tags
SharedVPC, Route53, CloudWatch, IAMAllocated proportionally by team headcount

But 23% of the spend does not match anything in the spreadsheet or tags. That is $10,810 per month with no owner:

  • 4 EC2 instances (i-0a3f..., i-0b7c...) — $1,840/mo, likely staging leftovers from the Q3 project. The instance names contain "staging-test" but no team tag. Creation date: 4 months ago, last SSH login: 3 months ago.
  • 2 RDS snapshots older than 90 days — $320/mo, forgotten after the database migration to Aurora last quarter. The source instances no longer exist.
  • 1 NAT Gateway in ap-southeast-1 — $180/mo. There are no services running in that region. Someone probably created it during a disaster recovery test and forgot to tear it down.

These orphaned resources are the kind of waste that only surfaces when someone actually maps every dollar to an owner. Nobody notices $180/month on a $47,000 bill — but $2,340/month in orphaned resources adds up to $28,080 per year burning quietly in the background.

The mapping process also reveals cost concentration risks. The Platform Team's RDS primary instance accounts for $4,800/month — a single resource representing 10% of the total bill. If that instance needs to be upgraded for the next growth phase, the cost jump will be significant and should be budgeted for in advance rather than appearing as a surprise on next quarter's bill.

The shared cost allocation is transparent: Finance gets a worksheet showing exactly how VPC, Route53, CloudWatch, and IAM costs ($4,800 total) get split across the three teams based on headcount. If the teams prefer a different allocation method (CPU hours, request counts), the model can be adjusted — the important thing is that shared costs are allocated somewhere instead of sitting in an unaccountable bucket.

Step 3: Calculate Per-Team Cost Allocation

Shared costs get distributed proportionally by team headcount (Platform: 8, Product: 12, Data: 5). This is a common FinOps allocation method — not perfect, but far better than ignoring shared costs entirely. An alternative approach uses resource consumption metrics (CPU hours, network egress), but headcount-based allocation gets you 90% of the accuracy with 10% of the effort. With services mapped, the January breakdown tells a clear story:

TeamDirect CostShared CostTotal% of BillMoM Change
Platform$14,200$3,100$17,30036.8%+4.2%
Product$11,800$2,900$14,70031.3%+8.1%
Data$9,600$2,400$12,00025.5%+31.2%
Untagged$3,000--$3,0006.4%-12.0%
Total$47,000100%+11.3%

The Data Team's 31.2% spike jumps out immediately. Digging into the line items reveals the cause: an EMR cluster named emr-etl-migration was spun up on January 12 and left running 24/7 with 8 m5.xlarge instances. It was meant for a one-time data migration job that finished in 3 days. Nobody turned it off, and it has been burning $2,800/month since.

The month-over-month trends tell different stories for each team. Platform is growing at a steady 4.2% — healthy for a scaling product, and roughly in line with customer growth. Product's 8.1% growth correlates with the new enterprise features they shipped in December (additional ECS tasks for the new reporting API). These are explainable, expected increases. The Data Team's 31.2% spike is neither — it is a forgotten resource masquerading as organic growth.

Without the allocation breakdown, the $47,000 bill just looks like "infrastructure costs went up 11.3%." With it, the story becomes: "Platform and Product grew as expected, the Data Team has a $2,800 waste problem, and there is $3,000 in orphaned resources nobody owns."

Step 4: Identify Optimization Opportunities

With every dollar mapped, the low-hanging fruit becomes obvious:

RecommendationMonthly SavingsEffortRisk
Convert EMR cluster to spot instances + auto-scaling$2,800MediumLow — batch jobs tolerate interruption
Delete 4 untagged EC2 instances (no traffic in 30 days)$1,840LowLow — verified no inbound connections
Right-size Product Team ECS tasks (avg CPU at 12%)$1,500MediumMedium — needs load testing first
Move old RDS snapshots to S3 Glacier$290LowNone — snapshots are read-only backups
Delete unused NAT Gateway in ap-southeast-1$180LowNone — no resources in the region

Total potential savings: $6,610/month — 14% of the current bill, or $79,320 annually. The EMR fix alone recovers $2,800/month and can be done in an afternoon by switching to auto-scaling with spot instances for the batch ETL jobs. The EC2 and NAT Gateway deletions are even simpler — verify they are truly unused and terminate them.

There is also a longer-term recommendation: the Product Team's ECS tasks are running at 12% average CPU utilization, meaning they are provisioned for roughly 8x the load they actually handle. Right-sizing those tasks from 2 vCPU / 4GB to 0.5 vCPU / 1GB would save $1,500/month, but needs load testing under peak traffic to confirm the smaller tasks can handle spikes without throttling.

The distinction between quick wins and longer-term optimizations matters for sprint planning. The quick wins (EMR fix, EC2 deletions, snapshot archival, NAT Gateway removal) total $5,110/month and can be done this week. The medium-effort item (ECS right-sizing) adds another $1,500/month but requires a load test sprint first. Presenting both categories lets engineering leadership choose how aggressively to optimize. Most teams start with the quick wins to demonstrate value, then tackle the medium-effort items in a follow-up sprint once the initial savings are visible in the next month's bill.

Step 5: Generate the Monthly Report

The final deliverable is an Excel workbook and a PDF for the CFO:

cost-allocation-january-2026.xlsx contains 6 sheets:

  1. Executive Summary — one-page overview with charts, traffic-light status per team, and a "key finding" callout box highlighting the EMR cluster issue
  2. Team Breakdown — per-team costs with month-over-month trend lines going back 3 months
  3. Service Detail — line-item costs by AWS service, sortable and filterable, with each line mapped to its owning team
  4. Untagged Resources — every orphaned resource with suggested owner based on VPC placement, IAM creator, and resource naming patterns
  5. Optimization Actions — prioritized savings with effort estimates and risk ratings
  6. Tagging Compliance — 77% tagged this month, up from 60% last month, trending toward the 95% target

The PDF version strips the raw data and keeps the executive summary, team breakdown charts, and optimization actions — exactly what Finance needs for the board deck without the overwhelming detail. Two deliverables for two audiences: the engineering team gets the Excel with full line-item detail for follow-up, and Finance gets the PDF with the story and the charts.

The report becomes a repeatable process. Next month, drop in the February billing export and the agent produces the same breakdown with updated trends, automatically flagging any team whose costs jumped more than 15% or any new untagged resources that appeared. Over time, the monthly reports build a cost history that reveals patterns invisible in any single snapshot — seasonal spikes from holiday traffic, step-function increases from new feature launches, and gradual drift from resources that slowly accumulate without cleanup.

The historical data also transforms budget conversations. Instead of guessing next quarter's cloud spend, the team extrapolates from actual per-team growth rates. Platform costs are growing 4% per month, Product at 8% — next quarter's budget can be set with confidence instead of finger-in-the-air estimates.

Real-World Example

Marco is the VP of Engineering at a 40-person SaaS startup heading into Series B. The investor pitch deck claims "capital-efficient infrastructure" but the due diligence team is going to ask hard questions about unit economics. The $47,000 monthly cloud bill is a black box. Marco cannot answer "what does it cost to serve one customer?" because nobody knows which resources belong to which product. He cannot answer "how does infrastructure cost scale with revenue?" because there is no per-team breakdown to correlate with business metrics.

He exports three months of AWS Cost and Usage Reports alongside the service-ownership spreadsheet and feeds them to the agent. Within an hour, he has his answer: 94% of costs mapped to specific teams, $3,000 in orphaned resources flagged, and the Data Team's 31% spike traced to a single forgotten EMR cluster. The three-month trend data is particularly revealing: the cost increase is not organic growth — it is a step function that started on exactly one date in January when the EMR cluster was created. This distinction matters for investor conversations. Organic growth tied to customer acquisition is healthy. Forgotten infrastructure is waste.

The optimization report identifies $6,600/month in easy savings. Marco's team implements the changes in one sprint: the EMR cluster gets auto-scaling with spot instances, the orphaned EC2 instances get terminated after confirming zero traffic, and the ECS tasks get right-sized after a load test confirms they can handle 3x current peak traffic at the smaller size. The February bill drops to $41,200 — a $5,800 reduction that drops straight to the bottom line.

The monthly process is now repeatable. Marco's ops lead runs the allocation on the first business day of each month — drop in the new billing export, get the updated breakdown, review the anomalies, and publish the report. The February report catches a new untagged resource immediately (a developer spun up an EC2 instance for testing and forgot to tag it), and the trend data shows the optimization savings landing exactly as projected.

For the investor meeting, Marco presents something no previous cloud bill could show: per-team spend with trend lines, cost-per-customer metrics derived from the product team's allocation divided by active accounts, and a downward cost trajectory. The due diligence team asks about infrastructure scalability, and Marco shows the ECS headroom analysis — current spend supports 3x growth before the next cost step. The investors see a team that understands its infrastructure economics, not one that throws money at AWS and hopes for the best.