SQL Analytics February 18, 2025

SQL on Lakehouse vs. Copy-to-Warehouse: An Honest Benchmark

We ran TPC-DS at 10 TB against a popular cloud data warehouse loaded via ETL from S3, and directly against the same S3 Iceberg tables. The results surprised us in places.

Maya Chen Head of Engineering, DataLynxr

Side-by-side architecture comparison: SQL on lakehouse versus copy-to-warehouse pipeline

Why we ran this benchmark

The data engineering community has been arguing about lakehouse-vs-warehouse performance for years, but most published comparisons are vendor-sponsored, cherry-picked, or based on configurations that don't reflect real deployments. We wanted to run a clean, reproducible test against a representative analytical workload and share the full methodology, not just the headline number.

We chose TPC-DS because it's an industry-standard decision support benchmark designed to test complex analytical queries: multi-table joins, window aggregations, correlated sub-queries, and range predicates across large fact tables. It's not perfect, but it's honest in the sense that it's hard for a vendor to optimize specifically for it without the optimization generalizing to real workloads.

Test setup

Dataset: TPC-DS scale factor 10,000 (10 TB). Generated and stored in Apache Iceberg format with Parquet file encoding on S3 (us-east-1). Z-ordering applied on the primary fact table partition columns after initial load.

DataLynxr configuration: 4-node compute cluster (r6g.2xlarge equivalent), single-region co-located with S3 bucket. Query cache disabled for all runs — cold reads only.

Copy-to-warehouse baseline: The same Iceberg dataset exported to a popular cloud SQL warehouse (columnar format, same region). ETL pipeline: nightly incremental load via Parquet copy, no additional clustering applied.

Methodology: Each query was run three times and the median is reported. The warehouse cache was flushed between runs. All queries were run in sequence, with no concurrency.
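The methodology above can be sketched as a small timing loop. This is a minimal illustration, not the actual DataLynxr harness: `run_query` and `flush_cache` are hypothetical callables you would wire to your own engine client, and the function names are ours.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_query(run_query: Callable[[str], None], sql: str, runs: int = 3,
               flush_cache: Callable[[], None] = lambda: None) -> float:
    """Return the median wall-clock time in seconds over `runs` executions."""
    timings: List[float] = []
    for _ in range(runs):
        flush_cache()  # hypothetical hook: clear engine-side caches before each run
        start = time.perf_counter()
        run_query(sql)  # hypothetical: executes the query and blocks until it finishes
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def run_suite(run_query: Callable[[str], None], queries: Dict[str, str], runs: int = 3,
              flush_cache: Callable[[], None] = lambda: None) -> Dict[str, float]:
    """Run the queries one at a time, in order, with no concurrency."""
    return {name: time_query(run_query, sql, runs, flush_cache)
            for name, sql in queries.items()}
```

Reporting the median rather than the mean keeps a single slow run (for example, a transient S3 hiccup) from skewing the result for a query.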

Results — selected queries

TPC-DS query execution time (seconds) — 10 TB scale

Query	DataLynxr (lakehouse-native)	Copy-to-warehouse	Ratio (warehouse ÷ DataLynxr)
Q4 (complex join, 12 tables)	7.2s	9.1s	1.3×
Q17 (window + correlated subq)	4.9s	8.4s	1.7×
Q47 (multi-window rank)	3.8s	14.2s	3.7×
Q52 (range predicate, high selectivity)	1.1s	0.9s	0.8× (warehouse wins)
Q72 (large fact, 3-way join)	11.3s	18.7s	1.7×

Results from DataLynxr internal testing on equivalent hardware. Reproducible test harness available on request.
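For readers checking the arithmetic, the Ratio column is the warehouse time divided by the DataLynxr time, rounded to one decimal place. The snippet below recomputes it from the timings in the table; the variable and function names are ours.

```python
# Timings in seconds, copied from the table above: (DataLynxr, copy-to-warehouse).
timings = {
    "Q4": (7.2, 9.1),
    "Q17": (4.9, 8.4),
    "Q47": (3.8, 14.2),
    "Q52": (1.1, 0.9),
    "Q72": (11.3, 18.7),
}


def ratio(lakehouse_s: float, warehouse_s: float) -> float:
    """Warehouse time over lakehouse time; values above 1 favor the lakehouse."""
    return warehouse_s / lakehouse_s


ratios = {q: round(ratio(lake, wh), 1) for q, (lake, wh) in timings.items()}

for q, r in ratios.items():
    winner = "lakehouse" if r > 1 else "warehouse"
    print(f"{q}: {r}x ({winner} faster)")
```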

Where the warehouse wins

Q52 is worth discussing. It's a highly selective range predicate against a narrow date column. The cloud warehouse had tightly clustered micro-partitions on this column from a previous optimization pass, and was able to answer the query by scanning a very small number of blocks. DataLynxr's Z-ordering had the same column in the sort order, but at 10 TB scale the Parquet files were larger and the S3 LIST overhead added latency.
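To make the block-skipping effect concrete, here is a toy sketch of min/max pruning for a date range predicate. It is not how either engine is implemented; the file names, sizes, and layouts are made up purely to show why smaller, tightly clustered storage units mean fewer bytes read for a narrow range.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class FileStats:
    path: str        # hypothetical file or micro-partition name
    min_date: date   # smallest value of the date column in this unit
    max_date: date   # largest value of the date column in this unit
    size_mb: int     # made-up size, for illustration only


def prune(files: List[FileStats], lo: date, hi: date) -> List[FileStats]:
    """Keep only units whose [min_date, max_date] range overlaps [lo, hi]."""
    return [f for f in files if f.max_date >= lo and f.min_date <= hi]


# Two made-up layouts of the same year of data.
coarse = [
    FileStats("big-0", date(2024, 1, 1), date(2024, 6, 30), 512),
    FileStats("big-1", date(2024, 7, 1), date(2024, 12, 31), 512),
]
fine = [FileStats(f"small-{m}", date(2024, m, 1), date(2024, m, 28), 64)
        for m in range(1, 13)]

lo, hi = date(2024, 3, 10), date(2024, 3, 20)
coarse_scan_mb = sum(f.size_mb for f in prune(coarse, lo, hi))
fine_scan_mb = sum(f.size_mb for f in prune(fine, lo, hi))
print(coarse_scan_mb, fine_scan_mb)
```

Both layouts prune correctly, but the coarse layout still has to read one large unit to answer a ten-day range, while the fine layout reads a single small one.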

This is a real tradeoff: warehouse-native formats are often better optimized for highly selective lookups and narrow range scans. Lakehouse-native formats win on wide analytical scans and workloads where the ETL overhead makes freshness a bigger variable than raw scan speed. Teams need to benchmark their specific query patterns, not accept industry averages.

The ETL lag factor

These numbers don't include the most important time component: the ETL pipeline itself. The copy-to-warehouse approach requires data to be ingested before any query can run against it. At 10 TB scale with a nightly batch load, a query asking "what happened in the last 4 hours" generally can't be answered, because most or all of that data hasn't reached the warehouse yet.
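A quick way to reason about this lag is to compute how much of a trailing time window is missing from the warehouse at a given moment. The sketch below assumes the nightly load completes at 02:00 and captures everything up to that instant; both are simplifying assumptions of ours, since the schedule is not specified beyond "nightly".

```python
from datetime import datetime, timedelta


def last_load_time(now: datetime, load_hour: int = 2) -> datetime:
    """Most recent nightly load at or before `now` (assumed to land at load_hour)."""
    todays_load = now.replace(hour=load_hour, minute=0, second=0, microsecond=0)
    return todays_load if now >= todays_load else todays_load - timedelta(days=1)


def staleness(now: datetime, load_hour: int = 2) -> timedelta:
    """Age of the newest data visible in the warehouse copy."""
    return now - last_load_time(now, load_hour)


def missing_from_window(now: datetime, window: timedelta, load_hour: int = 2) -> timedelta:
    """How much of the trailing window [now - window, now] is not yet loaded."""
    return min(window, staleness(now, load_hour))


# Mid-afternoon: the whole "last 4 hours" window is missing from the warehouse.
print(missing_from_window(datetime(2025, 2, 18, 15, 0), timedelta(hours=4)))
# One hour after the load: only the most recent hour is missing.
print(missing_from_window(datetime(2025, 2, 18, 3, 0), timedelta(hours=4)))
```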

The lakehouse-native approach has a freshness advantage that no amount of warehouse optimization can close: if your streaming pipeline is writing to the Iceberg table, the SQL engine sees those writes as soon as the Iceberg transaction commits. That's architectural freshness, not an optimization.

What these results actually mean for an ETL migration decision

Our benchmark question was narrow: given the same data, does querying it directly from S3 Iceberg tables deliver acceptable query latency compared to querying a warehouse copy? For most of the TPC-DS suite, the answer is yes. Among the selected queries above, the lakehouse-native path was faster on four of five, by 1.3× to 3.7×, which we attribute to vectorized execution against Parquet with predicate pushdown. Those timings also exclude the ETL pipeline; counting ETL plus query, the warehouse path does more total work.

But the migration decision for most teams isn't primarily a latency question. It's a total cost question. The ETL pipeline that keeps the warehouse copy fresh has direct costs (cloud egress, warehouse storage, compute for ETL jobs) and indirect costs (engineering time, on-call for failed jobs, schema drift incidents). These often dwarf the query latency differential.

The teams we've talked to who are on the fence about lakehouse-native SQL are worried about two things: (1) will analytical query latency regress for their specific queries, and (2) will the BI tools their analysts use work correctly against a new query endpoint. For (1), we recommend running your own workload samples before committing — the TPC-DS results give you a signal but not a guarantee for your data distribution and query patterns. For (2), DataLynxr exposes a JDBC/ODBC endpoint that's been tested against Tableau, Metabase, Superset, and standard SQL clients — BI tool compatibility hasn't been a blocker for the teams we've onboarded.

Reproducibility

The test harness is available on request via [email protected]. We can share the TPC-DS data generation scripts, the Iceberg table DDL, the Z-ordering configuration, and the query list. We'd welcome external validation — if your numbers differ, we want to know why.

See the full benchmark suite at the benchmarks page, which includes 1 TB and 50 TB scale results and a streaming latency breakdown.