Benchmarks July 15, 2025

Benchmarking TPC-DS at 10 TB on Lakehouse Storage

Our engineering team ran the full TPC-DS suite at 10 TB scale. Here's the methodology, the raw numbers, and what they mean for teams considering a lakehouse migration.

Maya Chen Head of Engineering, DataLynxr

Bar chart comparing TPC-DS query runtimes at 10TB scale across lakehouse and warehouse platforms

What we tested and why

TPC-DS is the Transaction Processing Performance Council's Decision Support benchmark. It models a retail company's reporting environment across 99 queries that span complex multi-table joins, window aggregations, correlated sub-queries, and CASE WHEN logic. It's not the only benchmark that matters, but it's the one that most closely mirrors real-world analytical workloads at data warehouse scale.

We ran the full 99-query suite at 10 TB scale (TPC-DS scale factor 10,000) against Apache Iceberg tables on S3. We publish these numbers because data teams deserve reproducible numbers they can interrogate — not just headline claims. The methodology is fully documented and the test harness is available on request.

Test environment

Dataset: TPC-DS SF10000 (10 TB). Generated with the official TPC-DS data generator. Stored as Apache Iceberg tables in Parquet format (Snappy compression) on S3 us-east-1.
Table optimization: Z-ordering applied on primary fact tables using the partition key columns most frequently used in range predicates (ss_sold_date_sk for store_sales, ws_sold_date_sk for web_sales).
Compute: 4-node cluster, r6g.2xlarge per node (8 vCPU, 64 GB RAM), same AWS availability zone as the S3 bucket to eliminate cross-AZ egress.
Cache: Query result cache disabled. File metadata cache enabled (table-level metadata only, not row data). Each query run 3 times — median reported.
Catalog: DataLynxr internal Iceberg catalog backed by DynamoDB.

Full results — all 99 queries

We won't paste all 99 rows here, but the distribution is more interesting than the headline number. Here are the notable bands:

TPC-DS query time distribution — 10 TB, DataLynxr lakehouse-native

Latency band	Query count	% of suite	Representative queries
Under 2 seconds	31	31%	Q1, Q6, Q14, Q33, Q52
2–5 seconds	38	38%	Q17, Q19, Q47, Q55, Q72
5–15 seconds	22	22%	Q4, Q11, Q23, Q64, Q82
Over 15 seconds	8	8%	Q5, Q31, Q74, Q77

Queries over 15 seconds are predominantly multi-CTE queries with 3+ self-joins on large fact tables without effective partition pruning. These improve significantly with additional Z-order coverage.

Where Z-ordering made the biggest difference

Z-ordering (also called multi-dimensional clustering in Delta Lake) is a technique that co-locates related rows in the same Parquet files by sorting on multiple columns simultaneously. For Iceberg tables with the Z-ordering applied on ss_sold_date_sk and ss_store_sk, queries with predicates on both columns scanned approximately 12% of total table data instead of the full 10 TB. This is the mechanism that makes "sub-second on 10TB" credible — you're not reading 10TB, you're reading ~1.2TB.

Queries that don't benefit from Z-ordering (no range predicates on the ordered columns, or queries requiring full table scans) don't show the same speedup. Q5 and Q74 fall into this category.

Reproducibility — run it yourself

We made a deliberate choice to publish methodology rather than just numbers. The reasons:

Your Z-ordering configuration, your S3 region, and your cluster sizing all affect results significantly. Our numbers are ours.
We want to be honest about which queries are slow and why, not just showcase the best 20.
We believe the lakehouse architecture position is strong enough to survive honest benchmarking. If it isn't, we should know.

The test harness includes: TPC-DS data generation scripts, Iceberg DDL with Z-order configuration, all 99 query files, and a runner script. Email [email protected] to request the full harness.

What happens inside the query planner on the slow queries

The 8% of queries running over 15 seconds share a common structure: three or more CTEs with self-joins on the same large fact table, no partition predicates in the inner loop, and aggregations that can't be pushed below the join. In a traditional warehouse, these queries benefit from pre-computed statistics that the internal storage format tracks per-column per-micropartition. In Iceberg, per-column min/max statistics are stored in the Parquet row group metadata — but the planner's ability to use them depends on whether the query engine's cost model accounts for Iceberg's data skipping API.

DataLynxr's query planner reads Iceberg manifest files before generating the execution plan. For queries with effective partition predicates, this eliminates entire Parquet files before a single byte of actual data is read. For the full-scan queries — Q5, Q74, and a handful of others — no amount of metadata optimization helps because the query genuinely needs most of the data. These queries run slower than they would in a warehouse with an internal columnar format optimized for sequential scan throughput. We don't hide this in the benchmark output.

The practical implication: if your query workload is dominated by full-scan aggregations over 10TB+ of un-partitioned data, a warehouse with internal columnar storage is likely the better fit. DataLynxr performs best on workloads where predicate pushdown can do real work — date-range analytics, event stream aggregations by time window, and joins that can be pushed down to the Parquet file level via partition pruning.

What this means for migration decisions

If your workload is primarily OLAP queries with date range predicates and multi-table joins (most reporting and analytics workloads are), the 10TB TPC-DS numbers suggest lakehouse-native query performance is competitive with warehouse-native on most queries. The cases where warehouse internal formats win are highly selective point lookups and queries without effective partition pruning — a minority of typical analytical workloads.

The more interesting question for most teams isn't "is the query latency acceptable?" but "does eliminating the ETL pipeline change the economics enough to make it worth switching?" That calculation depends on your pipeline maintenance burden, your data volume, and your egress bill. See the ETL cost breakdown post for the cost side of that calculation.

One practical note on migration sequencing: we've found teams get better results by migrating individual query workloads incrementally rather than doing a full cutover. Start with the workloads that have the most effective partition pruning — these will show the biggest latency improvement and cost reduction. Leave the full-scan aggregations for last, and run them in parallel across both systems for a few weeks before committing.