FOCUS billing: the dedup trap
I was teaching my AI coding assistant how to query our FOCUS billing exports with DuckDB. I asked for March ELB costs. It came back with a number in the tens of thousands. Cost Explorer said $311.
FOCUS exports live in S3, partitioned by billing period. DuckDB reads them directly with the httpfs extension — no downloading, no schlepping to /tmp. I have a zsh wrapper that runs aws-export-credentials --exec duckdb so the credentials are already there:
SELECT round(sum(BilledCost), 2) as total
FROM read_parquet(
's3://ramparts-finops/focus/data/billing_period=2026-03/*/*.parquet'
);
That glob returns $140,785.45. The real number is $3,104.83. Off by 45x.
Each billing period partition doesn't contain one file. It contains 84. Every few hours, AWS writes a new complete snapshot into a timestamped subdirectory: the full month's data as AWS knows it at that moment. By end of month, billing_period=2026-03/ has 84 subdirectories, each holding one parquet file with every line item the month has accumulated up to that snapshot. The glob reads all 84 and sums them, counting every charge once for every snapshot it appears in.
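A toy model makes the overcounting obvious. This is a sketch, not the real export: the snapshot names and dollar amounts are made up, but the shape is the same — each snapshot is a cumulative copy of the month so far.

```python
# Hypothetical snapshots: each one is a cumulative copy of the month's
# charges at the moment it was written.
snapshots = {
    "20260305T000000Z": [100.0, 50.0],             # charges known by Mar 5
    "20260315T000000Z": [100.0, 50.0, 75.0],       # same charges plus new ones
    "20260331T000000Z": [100.0, 50.0, 75.0, 25.0], # final month-end snapshot
}

# What the naive glob does: read every file, sum everything.
glob_total = sum(sum(rows) for rows in snapshots.values())

# What you actually want: only the latest snapshot.
latest = max(snapshots)  # string max = newest timestamp for this naming
true_total = sum(snapshots[latest])

print(glob_total)  # early charges counted once per snapshot they appear in
print(true_total)
```

With three snapshots the glob triple-counts the oldest charges; with 84 it's ruinous.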
When I told it to fix the overcounting, it tried downloading the files locally instead of reading from S3. Each download clobbered the one before it, because every snapshot's parquet file shares the same name. It ended up with the earliest, smallest snapshot — and reported $2.63. Two mistakes, opposite directions.
AWS puts a manifest JSON at the root of each billing period’s metadata partition that points to the latest snapshot. You could parse that, extract the path, and query one file at a time. Or you could let DuckDB do the dedup in SQL:
SELECT
billing_period,
round(sum(BilledCost), 2) AS total_cost
FROM (
SELECT
*,
max(split_part(filename, '/', -2)) OVER (
PARTITION BY billing_period
) AS latest_snapshot
FROM read_parquet(
's3://ramparts-finops/focus/data/billing_period=2026-*/*/*.parquet',
hive_partitioning = true,
filename = true
)
)
WHERE split_part(filename, '/', -2) = latest_snapshot
GROUP BY billing_period
ORDER BY billing_period;
┌────────────────┬────────────┐
│ billing_period │ total_cost │
│ varchar │ double │
├────────────────┼────────────┤
│ 2026-01 │ 2965.17 │
│ 2026-02 │ 2194.47 │
│ 2026-03 │ 3104.83 │
│ 2026-04 │ 1148.62 │
└────────────────┴────────────┘
hive_partitioning=true gives you billing_period as a column. filename=true exposes the S3 path. split_part extracts the timestamp directory, and the window max picks the latest snapshot per partition: a plain string max works because the timestamp directory names are fixed-width and sort lexicographically in time order. One query, all months, self-deduplicating.
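If you want to sanity-check the filename mechanics, they're easy to mimic in plain Python. The paths below are hypothetical, but snapshot_of is exactly what split_part(filename, '/', -2) does, and the latest dict plays the role of the window max:

```python
# Hypothetical S3 keys in the shape the export uses:
paths = [
    "s3://bucket/focus/data/billing_period=2026-03/20260310T010203Z/run-1.parquet",
    "s3://bucket/focus/data/billing_period=2026-03/20260331T230000Z/run-1.parquet",
    "s3://bucket/focus/data/billing_period=2026-04/20260402T120000Z/run-1.parquet",
]

def snapshot_of(path: str) -> str:
    # split_part(filename, '/', -2): second-to-last path component,
    # i.e. the timestamped snapshot directory.
    return path.split("/")[-2]

def period_of(path: str) -> str:
    # What hive_partitioning=true recovers from the path.
    part = [p for p in path.split("/") if p.startswith("billing_period=")][0]
    return part.removeprefix("billing_period=")

# The window function's job: latest snapshot per billing_period.
latest: dict[str, str] = {}
for p in paths:
    period, snap = period_of(p), snapshot_of(p)
    if snap > latest.get(period, ""):
        latest[period] = snap

# The WHERE clause: keep only rows from their period's latest snapshot.
kept = [p for p in paths if snapshot_of(p) == latest[period_of(p)]]
print(kept)
```

The stale March snapshot drops out; the newest file per period survives.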
790,106 rows per month. Not 34,549,923.
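For completeness, the manifest route from earlier looks roughly like this. I'm making up the JSON shape here — the field names are placeholders, not AWS's actual schema, so read a real manifest before relying on it:

```python
import json

# ILLUSTRATIVE manifest: the real file's keys may differ. The idea is
# the same either way: it names the latest snapshot's data file(s).
manifest_json = """
{
  "dataFiles": [
    {"key": "focus/data/billing_period=2026-03/20260331T230000Z/run-1.parquet"}
  ]
}
"""

manifest = json.loads(manifest_json)
keys = [f["key"] for f in manifest["dataFiles"]]

# Build a query per file: no glob, no dedup needed.
queries = [
    f"SELECT round(sum(BilledCost), 2) FROM read_parquet('s3://ramparts-finops/{k}');"
    for k in keys
]
print(queries[0])
```

It works, but it's one more moving part per billing period; the window-function query gets the same answer from the glob alone.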