// Muhammad Rasyad Caesarardhi

Taming the Data Swamp: The Impact of OPTIMIZE, ZORDER, and VACUUM on Delta Tables

Data Engineering · Delta Lake · Delta OSS · Performance Tuning

When working with large-scale telemetry or IoT data, the physical arrangement of data on disk can make or break your pipeline’s query and update performance.

Let’s explore a common scenario: managing a Delta Lake table that continuously ingests readings from a distributed network of IoT weather sensors. To avoid the overhead of over-partitioning—a common pitfall with time-series logs—this table is unpartitioned. Over time, however, our table statistics look like this:

  • Total Rows: Approx. 100 million rows
  • Size on Disk: Roughly 2 GB
  • Number of Files: Over 2,000 files
  • Format: Delta

The Schema:

root
 |-- ID: long  -- MERGE KEY
 |-- DEVICE_ID: string
 |-- SESSION_ID: long
 |-- TIMESTAMP: timestamp
 |-- LAT: decimal(10,7)
 |-- LONG: decimal(10,7)
 |-- TEMPERATURE: decimal(5,2)
 |-- HUMIDITY: decimal(5,2)
 ...

At first glance, roughly 2 GB for ~100 million rows seems incredibly storage-efficient. However, spreading that data across more than 2,000 files means our average file size is under 1 MB.

We have stumbled directly into the classic “Small File Problem”. When a compute engine like Spark reads this table, it spends more wall-clock time opening, reading metadata, and closing file handles than it does processing the actual data.

To fix this, we need to bring out three of Delta Lake’s superpowers: OPTIMIZE, ZORDER, and VACUUM. But as we will see, executing these commands blindly can introduce severe Write Amplification during MERGE operations. Let’s break down exactly what each command does, evaluate different optimization scenarios against a live dataset, and unpack the “Why” behind their unexpected behaviors.


1. The Core Operations

OPTIMIZE: The Bin-Packing Hero

The OPTIMIZE command is designed to solve the small file problem. It performs a process known as “bin-packing” by systematically reading tiny, fragmented files and compacting them into larger, highly efficient Parquet files (Delta typically targets 1 GB files by default).
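
In Delta OSS, compaction can be triggered through SQL or the Python DeltaTable API. A minimal sketch (assuming table_path points at our weather table):

from delta.tables import DeltaTable

# Plain bin-packing compaction; row order is left untouched.
DeltaTable.forPath(spark, table_path).optimize().executeCompaction()

# Equivalent SQL form:
spark.sql(f"OPTIMIZE delta.`{table_path}`")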

ZORDER: The Multi-Dimensional Sorter

Since our table is unpartitioned, how can we make queries faster when a user searches for a specific DEVICE_ID across a certain TIMESTAMP range without forcing a full table scan? ZORDER maps multi-dimensional data onto one dimension while preserving data locality. Running OPTIMIZE table ZORDER BY (DEVICE_ID, TIMESTAMP) sorts rows along a space-filling curve so that related values land in the same files; the min/max statistics Delta records per file in the Delta Log then become tight enough that, when querying for specific devices in specific time ranges, Spark skips files that cannot match those predicates entirely.
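
The same builder API exposes Z-Ordering directly. A sketch for the query pattern above:

from delta.tables import DeltaTable

# Compact and cluster by the columns our queries filter on.
(DeltaTable.forPath(spark, table_path)
    .optimize()
    .executeZOrderBy("DEVICE_ID", "TIMESTAMP"))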

VACUUM: The Garbage Collector

Because Delta Lake maintains immutable data structures for Time Travel guarantees, OPTIMIZE does not immediately delete the old fragmented files; it merely marks them as removed in the transaction log. VACUUM acts as the garbage collector, permanently deleting these no-longer-referenced data files from your cloud storage.
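
A sketch of a routine cleanup, keeping the default seven-day retention window so Time Travel within that window keeps working:

from delta.tables import DeltaTable

# Deletes files that are no longer referenced by the table and are
# older than the retention period (default: 168 hours / 7 days).
DeltaTable.forPath(spark, table_path).vacuum(168)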


2. Observing The Merge Paradox: A Controlled Experiment

To understand the true impact of these operations, we tested an UPSERT operation (using the MERGE command) of 140 new/updated records against our ~100M row base table under various optimization conditions.

Environment Context & Architecture:

  • Data Layout & Optimization: Initial file generation, bin-packing, and OPTIMIZE commands were executed on a managed cloud Spark cluster.
  • MERGE Benchmark Engine: The actual heavy lifting (the UPSERT benchmarks and the shuffle measurements shown below) ran locally on an Apple Silicon M2 Mac (16 GB RAM) using pure Apache Spark 3.4.4 / Delta Lake 2.4.0 (OSS). Forcing a laptop to churn through gigabytes of shuffle data exposes the brutal real-world cost of Write Amplification.

Our core MERGE logic was based on joining by the ID column: ON target.ID = source.ID
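
In PySpark terms, the benchmark boils down to roughly the following (a sketch; updates_df stands in for the 140-row batch):

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, table_path)
(target.alias("target")
    .merge(updates_df.alias("source"), "target.ID = source.ID")
    .whenMatchedUpdateAll()      # overwrite existing readings
    .whenNotMatchedInsertAll()   # insert brand-new ones
    .execute())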

Scenario A: The Base (Unoptimized)

Before any optimization, our table consisted of over 2,000 tiny files (~0.98 MB average).

When we executed the MERGE of 140 rows:

  • Shuffle Data: Spark registered 1.2 GiB of shuffle read/write, but suffered badly on execution latency (task time hit 6.7 minutes).
  • Why? Because the files were so fragmented, the read phase had to open thousands of file handles, which dominated the runtime even though the amount of data actually rewritten was small.

[Figures: Scenario A Execution Plan; Scenario A Shuffle Stats]

Scenario B: OPTIMIZE (Without Z-Order)

We compacted our table using standard bin-packing.

Optimization Metric Details (Delta Log):

Removed Files | Added Files | Min File Size | Max File Size | Deletion Vectors Removed
2,227         | 1           | 327.08 MB     | 327.08 MB     | 0
  • Files: Reduced from 2,229 to 3.
  • Average File Size: Spiked to ~720.51 MB.
  • MERGE Shuffle: Task duration dropped heavily (3.9 minutes), but shuffle data increased to 1.9 GiB!

[Figure: Scenario B Shuffle Stats]

Scenario C: OPTIMIZE with ZORDER (BY ID)

Since our MERGE joins on ID, we naturally expected Z-Ordering by ID to make the MERGE scale effortlessly.

OPTIMIZE delta.`{table_path}` ZORDER BY (ID)

Optimization Metric Details (Delta Log):

Removed Files | Added Files | Min File Size | Max File Size | ZORDER
16            | 2           | 1.09 GB       | 1.19 GB       | ["ID"]
  • Files: Radically reduced to just 2 massive Parquet files (~1.08 GB each).
  • MERGE Result: Counter-intuitively, the Shuffle Read and Write spiked to an astonishing 3.7 GiB, and task time suffered again (6.6 minutes).
  • With VACUUM: Running VACUUM afterward predictably deleted the old files to save storage, but it didn’t fundamentally fix the core issue—our shuffle read/write remained brutally high at 3.6 GiB.

[Figure: Scenario C Shuffle Stats]

Scenario D: The Danger of Manual Repartitioning

Sometimes, engineers attempt to bypass Delta commands using plain Spark code to force files into smaller chunks:

# Forcing splits purely via the DataFrame API.
# dataChange=false marks the commit as a pure rearrangement of existing
# data, so streaming consumers of the table will not re-process it.
(spark.read.format("delta").load(table_path)
    .repartition(num_partitions)  # here: num_partitions = 16
    .write.option("dataChange", "false")
    .format("delta").mode("overwrite").save(table_path))

Repartition Metric Details (Delta Log):

Operational Mode | Files Added | Rows Output  | Total Bytes Added | Engine
Overwrite        | 16          | ~100,000,000 | 3,100,000,720     | Apache-Spark/3.4.3
  • Files: Re-split into 16 files averaging ~184 MB.

File Size Distribution:

size_bucket  | file_count | total_size_mb | min_size_mb | max_size_mb | avg_size_mb | total_rows
5_128–256 MB | 16         | 2,956.39      | 184.60      | 184.93      | 184.77      | ~100,000,000
  • MERGE Result & The Performance Trap: This was disastrous. Upon merge, Spark spawned 100 disjointed files; Shuffle Read and Write skyrocketed to 7.2 GiB each, punishing execution with a 16-minute task time!
  • The “Sleeping Optimize”: When we subsequently ran a standard OPTIMIZE to fix it, Delta silently skipped the operation: the raw file sizes already fell within its optimization thresholds, so bin-packing saw nothing to compact.

[Figure: Scenario D Shuffle Stats]

Scenario E: The Ultimate Compromise (ZORDER + 128 MB Limits)

To get the best of both worlds—efficient data clustering and a small rewrite blast radius—we applied both a strict size cap and deterministic Z-Ordering by ID.

# Note: Executed on managed cloud Spark cluster, which supports passing this Databricks configuration limit
spark.conf.set("spark.databricks.delta.optimize.maxFileSize", 134217728) # 128MB Limit
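
With the cap in place, we re-ran the same Z-Order compaction as in Scenario C:

spark.sql(f"OPTIMIZE delta.`{table_path}` ZORDER BY (ID)")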

Optimization Metric Details (Delta Log):

Removed Files | Added Files | Min File Size | Max File Size | Avg File Size
2             | 24          | 110.98 MB     | 140.78 MB     | 118.24 MB
  • Files: Compacted cleanly into 24 files.
  • Average File Size: a sleek ~118 MB.

File Size Distribution (Post-Optimization):

size_bucket  | file_count | total_size_mb | min_size_mb | max_size_mb | avg_size_mb | total_rows
4_50–128 MB  | 21         | 2,441.56      | 105.84      | 127.28      | 116.26      | 96,880,028
5_128–256 MB | 3          | 396.09        | 129.54      | 134.26      | 132.03      | 15,607,411
  • MERGE Result: The sweet spot. Task time plummeted to a blazing 2.9 minutes, and shuffle metrics crashed to just 1.5 GiB. We gained the rapid file-skipping power of Z-Ordering with a much safer update footprint.

Final File Layout (After Successful Merge):

size_bucket  | file_count | total_size_mb | min_size_mb | max_size_mb | avg_size_mb | total_rows
2_1–10 MB    | 1          | 2.16          | 2.16        | 2.16        | 2.16        | 93,655
3_10–50 MB   | 8          | 84.39         | 10.30       | 10.78       | 10.55       | 4,600,193
4_50–128 MB  | 20         | 2,323.77      | 105.84      | 127.28      | 116.19      | 92,186,180
5_128–256 MB | 3          | 396.09        | 129.54      | 134.26      | 132.03      | 15,607,411

[Figure: Scenario E Shuffle Stats]


3. Unpacking the “Why”: The High Shuffle Paradox

If you observed the execution logs closely across our scenarios, you likely noticed a glaring paradox: While standard ZORDER drastically reduced the number of files and accelerated pure SELECT reading, it simultaneously triggered massive surges in Shuffle Data and execution times during MERGE operations.

Why did perfect Z-Ordering (Scenario C) skyrocket our shuffle to nearly 4 GiB just to update 140 isolated rows?

The Under-the-Hood Write Amplification

According to the official Delta Lake OSS specification (index.mdx), “Z-Ordering aims to produce evenly-balanced data files… writing fewer large files is more efficient for reads, but you might still see an increase in write latency.”

Our experiment perfectly proves this architectural warning. Delta Lake does not update individual rows “in place.” When a MERGE operation identifies that a row needs to be updated (or deleted), Delta must structurally recreate and rewrite the entire Parquet file that contains that row.

In our Z-Ordered scenario (Scenario C), the engine aggressively balanced our ~100 million rows into massive files containing over 1 GB of data each. Now, look closely at our 140 upsert rows: their IDs represent a perfectly sequential block.

Because we Z-Ordered by ID, Delta’s data skipping worked perfectly! Spark successfully bypassed the rest of the table and isolated these updates to exactly one single 1.2 GB file (you can see "numTargetFilesRemoved":"1" in the logs). But here is where the Write Amplification traps you:

  1. Read: To change those 140 perfectly grouped rows, Spark still had to read the entire targeted 1.2 GB Parquet file into memory.
  2. Perform the Shuffle Join: It joined the incoming 140-row batch against that massive 1.2 GB memory footprint.
  3. Rewrite: It was forced to physically rewrite the entire gigabyte-scale block back out to cloud storage.

This dynamic is Write Amplification at its worst. Even with perfect Z-Ordering and flawless data skipping picking out the single best-matching file, a tiny 140-row update forced Spark to shuffle and grind through gigabytes of I/O simply because the destination file was a monolith.
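
You can observe this blast radius directly in the table history. A diagnostic sketch (the metric keys are those Delta emits for MERGE commits; the latest commit is assumed to be the MERGE):

from delta.tables import DeltaTable

# The most recent commit's metrics show how many target files the MERGE
# rewrote, and therefore how much data a 140-row update dragged along.
last_op = DeltaTable.forPath(spark, table_path).history(1).collect()[0]
metrics = last_op["operationMetrics"]
print(metrics["numTargetFilesRemoved"],
      metrics["numTargetFilesAdded"],
      metrics["numTargetRowsUpdated"])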

File Skipping vs. The Rewrite Blast Radius

When we look at Scenario E (ZORDER + 128 MB Limit), we find the architectural balance. We capped files at 128 MB: large enough to solve the overarching “Small File Problem” for analytics, yet small enough that an UPSERT forced Delta to rewrite only a localized ~128 MB segment instead of a 1 GB monolith. The “blast radius” of any single record update was severely contained.

Furthermore, we saw that Spark’s bare .repartition (Scenario D) fails badly: it scatters records across partitions with no logical ordering, destroying the data locality that keeps a MERGE’s rewrite footprint bounded.


4. Modern Solutions to Write Amplification

To actively mitigate write amplification on large Lakehouse architectures—especially those sustaining frequent, high-volume MERGE loads—embrace these modern feature sets:

  1. Precision File Tuning (targetFileSize) (Supported in: OSS & Databricks): As Scenario E demonstrated, setting an intentional size ceiling alongside Z-Ordering controls the blast radius of every rewrite. In Delta OSS this is exposed natively as the table property delta.targetFileSize; with the ceiling configured, the engine sizes files down toward that target instead of defaulting blindly to 1 GB.
  2. Low Shuffle Merge (Supported in: Databricks Only): A proprietary engine optimization enabled natively in Databricks Runtime 10.4 LTS+; it does not exist in open-source Delta. It leaves unmodified rows in place, preserving their existing file placement instead of dragging them through a full re-shuffle.
  3. Deletion Vectors (Supported in: OSS Delta 2.3+ & Databricks): Defined in Delta’s PROTOCOL.md specification, Deletion Vectors attack the root cause of the write penalty. Instead of a hard rewrite cycle, they let Delta mark physical rows as deleted/updated by writing a compact sidecar deletion-vector file (often just a few bytes), cleanly bypassing the rewrite of the parent 1 GB Parquet file during the MERGE. The heavy physical Parquet pruning is deferred until you explicitly trigger an OPTIMIZE. (Note: enabling this in OSS requires explicitly running ALTER TABLE target SET TBLPROPERTIES('delta.enableDeletionVectors' = true), as sketched below.)
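
For the two options available in OSS (1 and 3), here is a combined sketch of the table-level setup (table_path assumed as before; adjust the size ceiling for your workload):

# Cap rewritten files at 128 MB and enable Deletion Vectors on the table.
spark.sql(f"""
    ALTER TABLE delta.`{table_path}` SET TBLPROPERTIES (
        'delta.targetFileSize' = '134217728',
        'delta.enableDeletionVectors' = 'true'
    )
""")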

Conclusion

Transforming a fragmented data swamp into a resilient Lakehouse pipeline demands far more than running OPTIMIZE on autopilot. By charting how your physical storage blocks collide with MERGE mechanics under the hood, you can tune your architecture to perform well for both ingestion and user analytics simultaneously.


5. References & Further Reading

The insights and architecture behaviors mapped out in this deep dive are grounded in official Delta Lake protocols. If you’re building a highly-scaled Lakehouse, I recommend reading through the official specifications:

  • Z-Ordering & Data Skipping Constraints: Delta Lake OSS Optimizations (index.mdx) - Details on how Z-Ordering colocates data and the architectural limits of sorting by high-cardinality values.
  • Deletion Vectors Protocol: Delta Transaction Log Protocol (PROTOCOL.md) - The official spec on how Deletion Vectors physically decouple MERGE operations from immediate Parquet file rewrites via sidecar deletion-vector files.
  • Update & MERGE Architectures: Delta Update Operations (delta-update.mdx) - Specifications detailing how the MERGE INTO operation executes against existing file pools.
  • Databricks Low Shuffle Merge Engine: Databricks Official Documentation - A deep dive on how Databricks Runtime 10.4 LTS+ restricts shuffling to modified rows during MERGE.