Cost Engineering December 9, 2025

Reducing Cloud Egress Costs with Lakehouse-Native Queries

Pushing compute to where data lives — instead of moving data to compute — eliminates cross-region egress on queries you're already paying for. A practical breakdown for data platform engineers.

Jordan Park Software Engineer, DataLynxr

[Figure: Network diagram showing data flow optimized to avoid cross-region cloud egress charges]

The egress bill most teams don't audit

Cloud providers charge for data leaving their network, which the industry calls "egress." The headline rates are well known: AWS charges $0.09/GB for the first 10 TB per month transferred out to the internet from us-east-1, and about $0.02/GB for transfer between regions. GCP charges comparable rates. Azure structures its pricing differently, but the cost is similarly material.

What's less well-understood is where egress actually appears in a data platform budget. Most teams audit their query compute bills closely. They look at their S3 storage bills. They sometimes notice their CDN costs. What they routinely miss is the egress generated by their own ETL pipelines — specifically, the cost of moving data from their object storage to a cloud data warehouse in another service tier or another region.

Three places ETL generates egress

1. Cross-service transfer. AWS S3 to Redshift, GCS to BigQuery, ADLS to Azure Synapse — all of these cross a service boundary that AWS, Google, and Microsoft price as billable data transfer in many configurations. The specific rate depends on whether the services are in the same region and the same VPC, but "same region" is not always sufficient to avoid charges depending on how the warehouse service routes traffic internally.

2. Cross-region replication. Many data teams run their source data in one region (where their application is) and their analytics warehouse in another (where their analysts or BI tools are configured). A 10 TB/day ETL pipeline replicating cross-region generates roughly $200/day in AWS transfer charges at the $0.02/GB inter-region rate, and roughly $900/day if the traffic is billed at the $0.09/GB internet egress rate. Either way, that is before a single query runs.

3. Data export for ML. Training a machine learning model typically requires exporting a training dataset from the warehouse to an S3 bucket accessible to the training cluster, or directly to the training instance. That export is another round of egress that analysts and data scientists often don't think about as a platform cost.

What "compute co-located with storage" actually means

The lakehouse-native architecture eliminates most of these paths by pushing compute to the data instead of the data to compute. DataLynxr's query nodes run in the same AWS region as your S3 bucket. The compute reads Parquet files directly from S3 over S3's internal network path — no cross-service, no cross-region transfer for the query itself.

More precisely: DataLynxr uses IAM roles and S3 bucket policies you control to read your data. The compute nodes are assigned to the same availability zone as your primary S3 bucket. AWS does not charge for data transfer between S3 and compute in the same region ($0.00/GB under intra-region data transfer pricing). The egress charges that appeared on your ETL-to-warehouse invoices disappear, because the data is queried in place and never copied out of S3 into a separate warehouse.
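To make the "IAM roles and S3 bucket policies you control" point concrete, here is a minimal sketch of a read-only bucket policy built in Python. The bucket name and role ARN are hypothetical placeholders, and this is not DataLynxr's documented policy; it only illustrates the general shape of granting a query engine's role list and read access without any write permissions.

```python
import json


def read_only_bucket_policy(bucket_name, role_arn):
    """Build an S3 bucket policy granting one IAM role read-only access."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Sid": "AllowQueryEngineList",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Reading applies to the objects inside the bucket.
                "Sid": "AllowQueryEngineRead",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


policy = read_only_bucket_policy(
    "example-analytics-bucket",                           # hypothetical bucket
    "arn:aws:iam::123456789012:role/query-engine-read",   # hypothetical role
)
print(json.dumps(policy, indent=2))
```

Because the policy lives on your bucket, revoking access is a change you make on your side rather than a request to the vendor.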

Calculating the egress delta for your workload

The calculation is straightforward:

1. Find the daily volume of data your ETL jobs write to your warehouse. Your pipeline orchestrator (Airflow DAG logs, dbt run metadata) should have this. Alternatively, check your warehouse's data ingestion logs.
2. Multiply by the AWS/GCP/Azure egress rate for your source region. For AWS us-east-1 to a managed service, the conservative estimate is $0.09/GB.
3. Multiply by 30 for a monthly estimate.
4. Check whether your warehouse and source S3 are in the same region. If not, the cross-region rate applies instead of the cross-service rate.
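The steps above can be sketched as a small Python function. The function name and parameters are illustrative, and the rates passed in are placeholders you should replace with the figures from your provider's current price list for your configuration.

```python
def monthly_egress_cost(daily_gb, same_region, cross_service_rate,
                        cross_region_rate, days=30):
    """Estimate monthly ETL egress cost in USD.

    daily_gb           -- GB per day your ETL jobs write to the warehouse
    same_region        -- True if the warehouse and source bucket share a region
    cross_service_rate -- USD/GB applied when they share a region
    cross_region_rate  -- USD/GB applied when they do not
    """
    rate = cross_service_rate if same_region else cross_region_rate
    return round(daily_gb * rate * days, 2)


# Example: a 500 GB/day pipeline, using $0.02/GB for both paths (assumed rates).
same = monthly_egress_cost(500, True, cross_service_rate=0.02, cross_region_rate=0.02)
cross = monthly_egress_cost(500, False, cross_service_rate=0.02, cross_region_rate=0.02)
# Upper bound: the same volume billed at the $0.09/GB conservative rate.
upper = monthly_egress_cost(500, True, cross_service_rate=0.09, cross_region_rate=0.02)
print(same, cross, upper)
```

Keeping both rates as explicit arguments makes it easy to rerun the estimate once you have confirmed which rate your provider actually bills for your setup.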

Egress cost model — 500 GB/day ETL pipeline

Scenario	Daily egress	Rate	Monthly cost
Same-region, same-AZ (lakehouse-native)	0 GB billed	$0.00/GB	$0
Cross-service, same region (S3 → Redshift)	500 GB	~$0.02/GB	~$300
Cross-region (S3 us-east-1 → warehouse us-west-2)	500 GB	$0.02/GB	~$300
Cross-region to public internet tier	500 GB	$0.09/GB	~$1,350

AWS us-east-1 public pricing as of 2025. Actual charges depend on VPC configuration, service endpoints, and transit gateway usage.

What this doesn't fix

Eliminating ETL egress doesn't eliminate all cloud data transfer costs. If your data analysts query DataLynxr from a laptop in a different city, the query results are returned over the public internet — those bytes are billed at the standard egress rate. For typical analytical query results (a few MB of aggregated output), this is negligible. For workloads that extract large datasets to client-side tools (e.g., exporting a 10 GB CSV), the egress bill for the query results applies regardless of whether the source data stayed in S3.
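To put rough numbers on that difference, here is a small sketch comparing the two kinds of result traffic. The workloads (200 dashboard queries a day at about 5 MB each, one 10 GB export a day) are hypothetical, and the $0.09/GB rate is the public-internet figure used elsewhere in this article.

```python
INTERNET_EGRESS_RATE = 0.09  # USD/GB, assumed public-internet egress rate


def result_egress_cost(result_mb, queries_per_day, days=30,
                       rate=INTERNET_EGRESS_RATE):
    """Monthly USD cost of returning query results over the public internet."""
    gb_per_month = result_mb / 1000 * queries_per_day * days
    return round(gb_per_month * rate, 2)


# Hypothetical workload: 200 dashboard queries/day returning ~5 MB each.
dashboards = result_egress_cost(result_mb=5, queries_per_day=200)
# Hypothetical workload: one 10 GB CSV export per day.
exports = result_egress_cost(result_mb=10_000, queries_per_day=1)
print(dashboards, exports)
```

Under these assumptions a single daily export costs roughly ten times as much in egress as the entire dashboard workload, which is why large client-side extracts are the case worth watching.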

DataLynxr is also not a CDN or a cache tier. It is a compute layer that reads from your object storage. If your primary objective is reducing read latency for end users globally, a CDN or a read replica architecture is what you need. DataLynxr's strength is analytical query performance and streaming freshness, not geographic distribution of results.

Audit your pipeline first

Before switching anything, pull three months of your cloud provider's data transfer billing line items. Filter for data transfer from your primary S3/GCS/ADLS bucket. You may be surprised by the number — most data teams have never done this audit because the egress cost is buried in a "Data Transfer" line item alongside CDN, API calls, and other services.
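For AWS, one way to run this audit is to export billing data and group the data transfer line items by usage type. The sketch below assumes a CSV whose columns follow AWS Cost and Usage Report naming (`lineItem/ProductCode`, `lineItem/UsageType`, `product/productFamily`, `lineItem/UnblendedCost`). The sample rows are made up for illustration, so check the column names and usage-type strings against your own export.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the assumed shape of a Cost and Usage Report export.
SAMPLE_CUR = """lineItem/ProductCode,lineItem/UsageType,product/productFamily,lineItem/UsageAmount,lineItem/UnblendedCost
AmazonS3,USE1-USW2-AWS-Out-Bytes,Data Transfer,15000,300.00
AmazonS3,DataTransfer-Out-Bytes,Data Transfer,120,10.80
AmazonS3,TimedStorage-ByteHrs,Storage,50000,1150.00
AmazonCloudFront,US-DataTransfer-Out-Bytes,Data Transfer,800,68.00
"""


def s3_transfer_costs(rows, product_code="AmazonS3"):
    """Sum data transfer cost per usage type for a single product."""
    totals = defaultdict(float)
    for row in rows:
        if row["lineItem/ProductCode"] != product_code:
            continue  # skip CDN, API, and other services
        if row["product/productFamily"] != "Data Transfer":
            continue  # skip storage and request charges
        totals[row["lineItem/UsageType"]] += float(row["lineItem/UnblendedCost"])
    return dict(totals)


costs = s3_transfer_costs(csv.DictReader(io.StringIO(SAMPLE_CUR)))
# Print the largest transfer categories first.
for usage_type, cost in sorted(costs.items(), key=lambda kv: -kv[1]):
    print(f"{usage_type:30s} ${cost:,.2f}")
```

Splitting the total by usage type is what separates ETL-driven transfer from the CDN and API traffic that shares the same "Data Transfer" line on the invoice.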

If you want to model the cost savings for your specific workload, the contact form asks about your current architecture and data volume. We can help you estimate the delta before you run a trial.