Cost Engineering January 14, 2025

Why ETL Copy Jobs Are Eating Your Cloud Bill

Every ETL job that copies data into a warehouse adds to your egress bill, duplicates your storage bill, and puts a floor under your data freshness lag. Here's what the math looks like, and what you can do about it.

Ryan Whitaker, Co-founder & CEO, DataLynxr

[Image: abstract data flow diagram illustrating ETL cost overhead on cloud lakehouse storage]

The hidden bill in your data architecture

Most data teams track their Snowflake or BigQuery invoices closely. They benchmark query times and argue about warehouse sizing. What they rarely audit is the upstream bill — the ETL pipeline that copies data into the warehouse in the first place.

That pipeline has three cost components most teams don't separate out:

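Egress charges. Every byte you move from S3, GCS, or ADLS to a warehouse in another region or cloud crosses a billing boundary. AWS's list price for data transfer out to the internet from us-east-1 is $0.09/GB for the first 10 TB per month; inter-region transfer within AWS is priced lower, and same-region transfer is generally free. At the $0.09/GB rate, 10 TB/day of ingestion is roughly $900/day in egress alone before the warehouse ever runs a query.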
Duplicate storage. You're now storing the same dataset twice: once in your object storage (original), once in the warehouse (copy). The warehouse copy is often stored in a proprietary columnar format that can't be accessed independently, so you can't reclaim it without losing SQL access.
Engineering time. Someone has to write, test, monitor, and fix the pipelines that perform the copy. On teams we've talked to, this occupies anywhere from 30% to 60% of data engineering bandwidth — time that's not going toward analytics or product features.
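
As a rough check on the egress arithmetic above, the sketch below converts a daily ingestion volume into a daily transfer charge. The function name and the decimal conversion (1 TB = 1,000 GB) are assumptions made for illustration; substitute the per-GB rate from your own bill.

```python
def daily_egress_cost(tb_per_day: float, rate_per_gb: float) -> float:
    """Daily transfer charge in dollars, assuming 1 TB = 1,000 GB."""
    gb_per_day = tb_per_day * 1000
    return round(gb_per_day * rate_per_gb, 2)


# 10 TB/day at the $0.09/GB rate quoted above.
print(daily_egress_cost(10, 0.09))  # 900.0

# Same volume at a hypothetical lower rate, to show how much the
# result depends on which transfer path your data actually takes.
print(daily_egress_cost(10, 0.02))  # 200.0
```

The volume is usually known; the rate is the number worth verifying against your invoice, since it varies by destination.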

What the math actually looks like

Consider a 50 TB dataset refreshed nightly via an incremental ETL job that moves 500 GB per day into a cloud warehouse.

Monthly cost estimate — ETL-first approach, 50TB dataset

Cost component	Rate	Monthly estimate
S3 → warehouse egress (500 GB/day × 30 days)	$0.09/GB	~$1,350
Warehouse storage (50 TB active copy)	~$23/TB/mo	~$1,150
Object storage (same 50 TB, S3 Standard)	~$23/TB/mo	~$1,150
Engineering labor: 3 engineers × 40% ETL time	$120K/yr loaded, each	~$12,000

Illustrative estimates based on AWS us-east-1 public pricing and industry-observed staffing patterns. Not DataLynxr-specific data.
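
Each line item in the table is a quantity multiplied by a rate, so the estimate is easy to recompute with your own numbers. The sketch below does that. The function and parameter names are hypothetical, a 30-day month is assumed, and the object storage rate is passed in explicitly because it depends on your storage class.

```python
def monthly_etl_costs(
    egress_gb_per_day: float,
    egress_rate_per_gb: float,
    dataset_tb: float,
    warehouse_rate_per_tb: float,
    object_rate_per_tb: float,
    engineers: int,
    etl_time_fraction: float,
    loaded_cost_per_year: float,
    days_per_month: int = 30,
) -> dict[str, float]:
    """Return each monthly line item in dollars, rounded to cents."""
    costs = {
        "egress": egress_gb_per_day * days_per_month * egress_rate_per_gb,
        "warehouse_storage": dataset_tb * warehouse_rate_per_tb,
        "object_storage": dataset_tb * object_rate_per_tb,
        # Annual loaded cost, scaled by the share of time spent on ETL, per month.
        "engineering_labor": engineers * etl_time_fraction * loaded_cost_per_year / 12,
    }
    return {name: round(value, 2) for name, value in costs.items()}


estimate = monthly_etl_costs(
    egress_gb_per_day=500,
    egress_rate_per_gb=0.09,
    dataset_tb=50,
    warehouse_rate_per_tb=23.0,
    # Assumption: roughly the S3 Standard list price per TB-month.
    # Replace with the rate for the storage class you actually use.
    object_rate_per_tb=23.0,
    engineers=3,
    etl_time_fraction=0.40,
    loaded_cost_per_year=120_000,
)

total = sum(estimate.values())
for name, dollars in estimate.items():
    print(f"{name:<20} ${dollars:>10,.2f}  ({dollars / total:.0%} of total)")
```

Printing each item as a share of the total makes it easy to see which line dominates for your own inputs.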

The engineering labor line is the one that shocks people. When you price it at loaded cost, the "free" ETL pipeline that comes with your warehouse isn't free at all. It's often the largest single line item in the data platform budget.

Why traditional ETL architectures accumulate this cost

The core problem is architectural: a SQL warehouse is optimized for queries against its own internal storage, not against your cloud object storage. So to run fast SQL, you have to get the data in. That requires a pipeline. The pipeline requires maintenance. Maintenance requires engineers.

The Delta Lake and Apache Iceberg table formats changed the feasibility equation. A vectorized query engine can now run ANSI SQL against Parquet files in S3 with performance that competes with warehouse-native queries on many workloads, without requiring the data to leave your bucket. Predicate pushdown, partition pruning, and data clustering such as Z-ordering on Iceberg and Delta tables let the engine skip Parquet files whose partition values or column statistics rule out a match, so it reads only the files that could contain rows answering your query.
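
To make the file-skipping idea concrete, here is a minimal sketch of scan planning from table metadata. It is a simplified model, not DataLynxr's planner or the Iceberg or Delta implementation: the `DataFile` class, its fields, and the file names are all hypothetical, and real table formats track many more statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    event_date: str   # partition value recorded in table metadata
    min_user_id: int  # per-file column statistics
    max_user_id: int


def plan_scan(files: list[DataFile], event_date: str, user_id: int) -> list[DataFile]:
    """Pick files that might hold rows WHERE event_date = ? AND user_id = ?."""
    candidates = []
    for f in files:
        # Partition pruning: a different partition value rules the file out.
        if f.event_date != event_date:
            continue
        # Statistics-based skipping: the value lies outside the file's min/max range.
        if not (f.min_user_id <= user_id <= f.max_user_id):
            continue
        candidates.append(f)
    return candidates


files = [
    DataFile("a.parquet", "2025-01-13", 1, 500),
    DataFile("b.parquet", "2025-01-14", 1, 500),
    DataFile("c.parquet", "2025-01-14", 501, 1000),
    DataFile("d.parquet", "2025-01-14", 400, 600),
]

to_read = plan_scan(files, event_date="2025-01-14", user_id=450)
print([f.path for f in to_read])  # ['b.parquet', 'd.parquet']
```

Statistics can only rule files out: `d.parquet` is kept because its range covers 450, even though it may not contain that user. Clustering data so that ranges overlap less is what makes this kind of skipping more effective.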

That's the architecture DataLynxr is built around. The SQL engine, the streaming ingestion runtime, and the ML feature layer all point at the same storage layer — your S3/GCS/ADLS bucket — without copying data between them.

What changes when you eliminate the copy

Beyond cost, removing the ETL copy step changes the data freshness equation. A scheduled ETL pipeline creates a staleness floor: data is always at least as old as the last successful run. With a sub-5s streaming ingest path to the same tables your SQL queries read, that floor drops from the batch interval to a few seconds, and freshness stops being a scheduling problem.
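
The staleness floor can be put into numbers with a small sketch. It assumes a simple model in which a row that lands just after a batch run starts must wait one full interval for the next run, plus that run's duration, before it is queryable. The one-hour run duration is a hypothetical value, and treating the streaming path as a continuous process with a 5-second commit latency is a simplification.

```python
from datetime import timedelta


def worst_case_staleness(run_interval: timedelta, run_duration: timedelta) -> timedelta:
    """Longest time a newly arrived row can wait before queries can see it."""
    return run_interval + run_duration


# Nightly batch job; the 1-hour run duration is an assumed figure.
nightly = worst_case_staleness(timedelta(hours=24), timedelta(hours=1))

# Continuous ingest modeled as a zero interval plus a 5-second commit latency.
streaming = worst_case_staleness(timedelta(0), timedelta(seconds=5))

print(nightly)    # 1 day, 1:00:00
print(streaming)  # 0:00:05
```

A failed batch run adds another full interval to the first figure, which is why the floor is defined by the last successful run rather than the last scheduled one.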

It also changes the schema drift problem. When you copy data into a warehouse, you're translating it: from Parquet to the warehouse's internal format, through a schema the ETL job inferred or you defined manually. When the source schema evolves, the copy breaks. With lakehouse-native queries, the SQL engine reads the Iceberg or Delta schema directly. Schema evolution (adding columns, renaming columns, and in Iceberg's case evolving partition layouts) is handled by the table format spec, not by your ETL job.
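
One reason a table format can absorb a rename is that Iceberg identifies columns by a stable field ID rather than by name (Delta offers a comparable column mapping mode). The sketch below is a simplified model of that idea, not the actual reader in either format; the function name, field IDs, and column names are made up for illustration.

```python
from typing import Any


def project_row(stored_values: dict[int, Any], current_schema: dict[int, str]) -> dict[str, Any]:
    """Resolve a stored row against the current schema by field ID.

    Renamed columns keep their ID, so old values surface under the new name.
    Columns added after the file was written come back as None.
    """
    return {name: stored_values.get(field_id) for field_id, name in current_schema.items()}


# Row written when the schema was {1: "user_id", 2: "amount"}.
old_row = {1: 42, 2: 9.99}

# Later, field 2 was renamed to "total" and field 3 "currency" was added.
current_schema = {1: "user_id", 2: "total", 3: "currency"}

print(project_row(old_row, current_schema))
# {'user_id': 42, 'total': 9.99, 'currency': None}
```

No data file is rewritten in this model: only the schema metadata changes, which is the property that a name-based ETL copy lacks.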

What DataLynxr does not do

DataLynxr is not a BI tool. It doesn't generate dashboards or visualizations — it provides the query interface that your existing BI tools (Tableau, Superset, Metabase) connect to via JDBC/ODBC. It is not a managed Spark cluster: there is no Spark runtime involved. It is not a data warehouse in the traditional sense — your data is never stored in DataLynxr's infrastructure. If you're looking for a managed OLAP warehouse with its own proprietary storage, DataLynxr is the wrong product.

It is specifically a lakehouse compute layer: stateless compute nodes that run SQL, streaming ingestion, and ML feature reads against your existing S3/GCS/ADLS storage in open table formats.

Where to go from here

If you have data in S3 Iceberg or Delta tables today, the fastest way to see the cost delta is to run your most expensive warehouse query against the same data via DataLynxr and compare both the execution time and the billing. The benchmarks page has reproducible TPC-DS methodology you can apply to your own dataset. The quickstart guide gets you connected to S3 in under 10 minutes — no data migration required.