Machine Learning May 6, 2025

Point-in-Time Correct ML Features Without a Dedicated Feature Store

Most training/serving skew comes from using different feature pipelines for batch training and real-time serving. Apache Iceberg time-travel gives you a better architectural option.

Ryan Whitaker Co-founder & CEO, DataLynxr

Figure: Timeline diagram showing point-in-time correct ML feature lookup from lakehouse table snapshots

What training/serving skew actually is

Training/serving skew occurs when the features your model trains on and the features it serves predictions against are computed differently. The most common cause isn't a bug. It's a structural choice that seems reasonable at the time: batch features computed by a Spark job for training, and online features computed by a separate service for real-time serving.

The divergence accumulates from different code paths, different aggregation windows, different null handling, or simply from data arriving in one system before the other. A model trained on a "user's average order value over the last 30 days" feature computed by a weekly Spark batch job will see different values than a serving endpoint computing the same feature from a Redis cache populated by a different pipeline.
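To make the divergence concrete, here is a minimal sketch in which two code paths compute the same 30-day average order value but disagree on null handling. The function names and the sample orders are hypothetical and exist only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical order history for one user: (order_time, order_value or None).
ORDERS = [
    (datetime(2025, 3, 1), 40.0),
    (datetime(2025, 3, 10), None),  # value was missing at ingestion time
    (datetime(2025, 3, 20), 80.0),
]


def batch_avg_order_value_30d(orders, as_of):
    """Batch-style path: drops null order values before averaging."""
    window_start = as_of - timedelta(days=30)
    values = [v for t, v in orders if window_start < t <= as_of and v is not None]
    return sum(values) / len(values) if values else None


def online_avg_order_value_30d(orders, as_of):
    """Online-style path: treats null order values as 0.0."""
    window_start = as_of - timedelta(days=30)
    values = [v or 0.0 for t, v in orders if window_start < t <= as_of]
    return sum(values) / len(values) if values else None


as_of = datetime(2025, 3, 25)
batch_value = batch_avg_order_value_30d(ORDERS, as_of)    # averages 40 and 80
online_value = online_avg_order_value_30d(ORDERS, as_of)  # averages 40, 0 and 80
```

Both functions are defensible on their own, yet the model would train on one value and be served the other.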

Fixing this usually involves one of: (a) making the batch and online pipelines share code, which is hard to maintain across language boundaries; (b) pre-computing all features in batch and serving from the batch results, which adds serving latency and creates freshness problems; or (c) adopting a dedicated feature store (Feast, Tecton, Hopsworks) that manages the dual-write problem. Option (c) is what most growing ML teams default to — and it works, but it adds significant infrastructure to manage.

What Iceberg time-travel makes possible

Apache Iceberg's time-travel semantics give you another option, one that avoids the dual-write problem entirely: read historical feature values from the same table that serves real-time features, by specifying the exact snapshot timestamp you want.

Iceberg maintains a metadata log of every committed snapshot, each with a commit timestamp. In Spark SQL, a query such as SELECT ... FROM user_features TIMESTAMP AS OF '2025-03-01 14:22:00' reads the snapshot that was current at that moment, so it returns the rows as they existed then, including any rows that have since been updated or deleted. This is not a backup or a separate archive: it's a read against the same Parquet files, with Iceberg's snapshot metadata selecting only the files that were live at the requested timestamp.
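The lookup can be pictured with a toy model of the snapshot log. This is a simplified sketch, not Iceberg's actual metadata format (which uses manifest lists and manifests); the file names and the files_as_of helper are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

# Toy snapshot log, sorted by commit time: (commit_time, live data files).
SNAPSHOT_LOG = [
    (datetime(2025, 3, 1, 9, 0), {"a.parquet"}),
    (datetime(2025, 3, 1, 14, 0), {"a.parquet", "b.parquet"}),
    (datetime(2025, 3, 2, 9, 0), {"b.parquet", "c.parquet"}),  # a.parquet replaced
]


def files_as_of(snapshot_log, ts):
    """Return the data files of the latest snapshot committed at or before ts."""
    commit_times = [commit_time for commit_time, _ in snapshot_log]
    idx = bisect_right(commit_times, ts) - 1
    if idx < 0:
        raise ValueError(f"no snapshot committed at or before {ts}")
    return snapshot_log[idx][1]


# A read "as of" 14:22 on March 1 sees the 14:00 snapshot's files.
files = files_as_of(SNAPSHOT_LOG, datetime(2025, 3, 1, 14, 22))
```

Note that a.parquet is still readable for the earlier timestamp even though the latest snapshot no longer references it, which is why old data files must be retained for time travel to work.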

Point-in-time correct training dataset construction

A training dataset with point-in-time correct features looks like this: for each training label (e.g., "user 12345 churned at 2025-03-15"), you want the feature values that were available at that moment in time — not the feature values as computed today.
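The difference between a point-in-time value and today's value is easy to see on a single feature. The sketch below uses a hypothetical value history for one user and a hypothetical value_as_of helper; it illustrates the idea and is not part of any SDK.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical history of avg_order_value_30d for user 12345: (valid_from, value).
AVG_ORDER_VALUE_HISTORY = [
    (datetime(2025, 3, 1), 52.0),
    (datetime(2025, 3, 10), 47.5),
    (datetime(2025, 3, 20), 12.0),  # computed after the churn label
]


def value_as_of(history, ts):
    """Return the latest value that was valid at or before ts, or None."""
    valid_from_times = [valid_from for valid_from, _ in history]
    idx = bisect_right(valid_from_times, ts) - 1
    return history[idx][1] if idx >= 0 else None


label_ts = datetime(2025, 3, 15)  # "user 12345 churned at 2025-03-15"
point_in_time_value = value_as_of(AVG_ORDER_VALUE_HISTORY, label_ts)
todays_value = AVG_ORDER_VALUE_HISTORY[-1][1]  # includes post-label information
```

Training on todays_value would let the model see the drop in spending that happened after the user churned, which is exactly the leakage a point-in-time lookup avoids.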

With Iceberg time-travel, you request the feature values at the label timestamp for each entity. In DataLynxr's Python SDK:

training_dataset.py

from datalynxr import LakehouseClient

client = LakehouseClient(workspace="ml-platform")

# label_df: entity_id, label, label_timestamp
train_features = client.get_feature_values(
    table="s3://my-bucket/user_features",
    entity_ids=label_df["entity_id"],
    timestamp=label_df["label_timestamp"],  # per-row timestamps
    features=["avg_order_value_30d", "session_count_7d", "last_category"]
)

# Returns a DataFrame with features as they existed
# at each entity's label_timestamp. Zero future leakage.

The SDK translates this into a set of Iceberg time-travel reads, one per unique timestamp bucket (timestamps are rounded down to the nearest committed snapshot, so a lookup never reads data committed after the label). The resulting DataFrame is structurally identical to what the serving path returns, because the serving path calls the same get_feature_values() method without a timestamp argument, which defaults to the latest snapshot.
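One way to picture the bucketing step is as a read plan that groups label rows by snapshot. This is a sketch under the assumption that each label timestamp maps to the latest snapshot committed at or before it; plan_reads and the sample data are hypothetical and do not describe the SDK's internals.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime

# Hypothetical snapshot commit times for the feature table, sorted ascending.
SNAPSHOT_TIMES = [
    datetime(2025, 3, 1),
    datetime(2025, 3, 8),
    datetime(2025, 3, 15),
]


def plan_reads(labels, snapshot_times):
    """Group (entity_id, label_timestamp) rows by the snapshot current at label time."""
    plan = defaultdict(list)
    for entity_id, label_ts in labels:
        idx = bisect_right(snapshot_times, label_ts) - 1
        if idx < 0:
            raise ValueError(f"no snapshot at or before {label_ts} for {entity_id}")
        plan[snapshot_times[idx]].append(entity_id)
    return dict(plan)


labels = [
    (12345, datetime(2025, 3, 15, 10, 0)),
    (222, datetime(2025, 3, 9)),
    (333, datetime(2025, 3, 14, 23, 59)),
]
read_plan = plan_reads(labels, SNAPSHOT_TIMES)
```

Three label rows collapse into two time-travel reads here, since entities 222 and 333 both fall in the March 8 snapshot.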

What this eliminates

You don't need a separate batch feature pipeline. The features already exist in your lakehouse table, written there by your SQL jobs or streaming ingestion. You don't need to maintain a separate online feature store: the same lakehouse table serves real-time requests via the Python SDK's point-in-time read. You don't need a backfill pipeline to populate historical feature values for training: the Iceberg snapshot log is the history.

The things this does not replace: a dedicated feature store may still be the right choice if you need sub-millisecond serving latency (Iceberg reads from S3 are in the low single-digit seconds, not sub-millisecond), or if you need per-feature access control that's more granular than table-level RBAC. Those are real constraints DataLynxr doesn't solve today.

Practical considerations

Iceberg snapshot retention is configurable. By default, DataLynxr retains 7 days of snapshots before expiring them. For point-in-time training datasets that require historical lookbacks longer than 7 days, you'd increase the retention window, at the cost of more S3 storage for the historical Parquet files. As an upper bound, the extra cost is snapshot retention × table size × S3 price; because snapshots share unchanged data files, the real figure scales with how much of the table is rewritten during the retention window.
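The retention tradeoff can be estimated with a few lines of arithmetic. This is a rough sketch: daily_rewrite_fraction is a hypothetical parameter for the share of the table rewritten each day (1.0 reproduces the plain retention × table size × price product), and the default price of 0.023 USD per GB-month is an assumed figure that you should replace with your own storage rate.

```python
def retained_snapshot_storage_cost(
    table_gb,
    retention_days,
    daily_rewrite_fraction=1.0,
    price_per_gb_month=0.023,  # assumed price; substitute your actual rate
):
    """Estimate the monthly cost (USD) of data files kept only for time travel."""
    extra_gb = table_gb * daily_rewrite_fraction * retention_days
    return extra_gb * price_per_gb_month


# 500 GB table, 90-day retention.
worst_case = retained_snapshot_storage_cost(500, 90)
low_churn = retained_snapshot_storage_cost(500, 90, daily_rewrite_fraction=0.05)
```

With these assumed inputs the worst case is roughly 1,035 USD per month, while a table that rewrites 5% of its data per day comes to roughly 52 USD per month.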

Schema evolution on feature tables is handled natively by Iceberg. If you add a new feature column, older snapshots don't have it — get_feature_values() returns NULL for that column on any timestamp before the column was added. This matches the behavior of a real-world system where the feature didn't exist yet.
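The NULL-filling behavior can be sketched as a projection against each snapshot's schema. The column sets, sample rows, and project_row helper below are hypothetical; Python's None stands in for NULL.

```python
# Hypothetical schemas: session_count_7d was added after the first snapshot.
OLD_SNAPSHOT_COLUMNS = {"entity_id", "avg_order_value_30d"}
NEW_SNAPSHOT_COLUMNS = {"entity_id", "avg_order_value_30d", "session_count_7d"}


def project_row(row, snapshot_columns, requested_features):
    """Return the requested features, using None for columns the snapshot lacks."""
    return {
        name: row[name] if name in snapshot_columns else None
        for name in requested_features
    }


features = ["avg_order_value_30d", "session_count_7d"]

old_row = project_row(
    {"entity_id": 12345, "avg_order_value_30d": 47.5},
    OLD_SNAPSHOT_COLUMNS,
    features,
)
new_row = project_row(
    {"entity_id": 12345, "avg_order_value_30d": 12.0, "session_count_7d": 3},
    NEW_SNAPSHOT_COLUMNS,
    features,
)
```

Training rows drawn from before the column existed will carry None for that feature, so the training code needs an explicit imputation or filtering policy for them.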

Try it on your data

If you have an existing Delta or Iceberg table with timestamped feature data, you can test the SDK's point-in-time reads without migrating anything. See the API reference for the full Python SDK documentation and the ML feature store use case page for architecture diagrams.