Work in progress - The overall flow of this note is being fixed
If you're a new architect, or simply someone new to Databricks/Delta Lake, you'll want to go through this note a few times. Each time, think about how your existing data setup could be improved using the ideas explained here.
There is plenty of official documentation on the topic, but this note organizes it all in one place and forms a definite, actionable structure in your brain.
Gist
Things you should know in 1 line:
- Auto compaction : solves small file problem by merging small files into fewer larger files, reducing file count overhead
- Liquid clustering : continuously and incrementally groups rows with similar key values into same files, improving file-level data skipping
- ZORDER : re-writes data by sorting rows along space-filling curve, allowing multi-column filters to skip more files
- Partitioning : splits and organizes data into directory-based chunks, enabling coarse data pruning before files are scanned (works for single-column filter statements)
- OPTIMIZE : executes a family of optimization operations that rewrite data files to improve read performance - file compaction, plus z-ordering or liquid clustering (whichever is enabled)
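Most of the mechanisms above boil down to the same trick: each data file carries min/max statistics per column, and a query can skip any file whose range cannot contain matching rows. A minimal sketch of that file-skipping decision (file names and date ranges are made up):

```python
# Hypothetical per-file min/max stats for an "order_date" column.
file_stats = {
    "part-0001.parquet": ("2024-01-01", "2024-01-31"),
    "part-0002.parquet": ("2024-02-01", "2024-02-28"),
    "part-0003.parquet": ("2024-01-15", "2024-03-10"),  # poorly clustered: wide range
}

def files_to_scan(stats, lo, hi):
    """Keep only files whose [min, max] range overlaps the filter range [lo, hi]."""
    return [f for f, (fmin, fmax) in stats.items() if fmax >= lo and fmin <= hi]

# A filter on February can skip part-0001, but the badly clustered
# part-0003 must still be scanned because its range is too wide.
print(files_to_scan(file_stats, "2024-02-01", "2024-02-28"))
# → ['part-0002.parquet', 'part-0003.parquet']
```

Tighter per-file ranges (which is what clustering/z-ordering produce) mean more files skipped.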
On a delta table, you can either use (Liquid Clustering), or (z-ordering+partitioning). Liquid clustering isn't compatible with partitioning and z-ordering.
Liquid clustering and auto compaction are two different mechanisms, independent of each other. However, they are co-executed inside the same spark write jobs. So in a way, these two work hand-in-hand.
- Operationally
  - Automated management will automatically trigger liquid clustering and auto compaction (via OPTIMIZE)
  - If automated management is not available/enabled, periodically run OPTIMIZE via scheduled job runs
- Tech choices
- To prevent the small file problem - enable delta auto-compaction if not already enabled
- To make sure the files are organized well - prefer liquid clustering over zorder+partitioning
Automated mechanisms
Liquid clustering - Faster reads via data skipping
NOTE: Prefer using liquid clustering instead of the older hive-partitioning & Z-ORDER mechanisms
Liquid clustering cannot be combined with partitioning or bucketing - you hand full control of the data layout over to liquid clustering.
Preferred for tables that are being loaded by streaming, or by small frequently run batches of appends. Liquid clustering especially shines when read patterns are not predictable.
Liquid clustering optimizes the data layout, aiming for a balanced data layout - uniform file size, and an appropriate number of files depending on the size of the table. It will take care of cardinality, data skew and small file problem.
Use for:
- skewed data
- tables that regularly ingest new data (streaming data, frequent batch insertions)
- query patterns are constantly changing with time
- partitioning leads to the small file problem
Liquid clustering comes in 2 flavours:
- Automated liquid clustering (databricks only feature, part of Predictive Optimization) - the runtime will automatically choose columns to cluster on based on history of queries executed on the table. This is very similar to Snowflake's automated re-clustering.
- Non-automated liquid clustering - you'll have to specify the columns to cluster on
When is liquid clustering triggered?
TRIGGER ON WRITE
Liquid clustering happens at write-time. Every time Delta writes to a table (INSERT, MERGE, COPY INTO etc.), liquid clustering is performed on it. So, all liquid clustering work happens inside the spark jobs that are touching the table. There is no separate background service that continuously optimizes the tables.
MANUAL/EXPLICIT TRIGGER
Liquid clustering can be manually triggered via the optimize command
from delta.tables import *
deltaTable = DeltaTable.forPath(spark, "path/to/table")
deltaTable.optimize().executeCompaction()
OPTIMIZE table_name;
AUTOMATED TRIGGER (databricks only)
When predictive optimization is enabled (requires the databricks premium plan), databricks will automatically run ANALYZE, OPTIMIZE and VACUUM operations - by queueing them as jobs to be run on serverless compute instances.
Since OPTIMIZE is being run, it will also end up triggering liquid clustering.
What happens during liquid-clustering?
- Delta reads clustering metadata - looks at existing files, their min/max ranges on clustering keys, clustering quality scores
- Incoming rows are sorted and grouped by the clustering keys.
- Delta will append rows into files whose key ranges are a good match for that particular row. If there is no existing file with a good match, it will create a new file with a tight range.
There is no global-sort. No full table scan (only file statistics are read). The computational cost is proportional only to the new data.
If liquid clustering finds that the existing data is horribly clustered, it won't re-organize and re-write the whole table. Instead, it will only rewrite a few of the worst files. If it sees diminishing returns, it will stop early.
Why? Because it doesn't want to break streaming SLAs. Re-organizing the entire table may take a long time, breaking the SLA. So instead, it will fix only a few existing files on every run. Over time, as more and more writes trigger liquid clustering, the table will eventually become well-clustered.
So, in short, liquid clustering will touch a few files on every run, keeping the rewrite size small.
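The routing step described above can be sketched as follows - this is an illustration of the concept only, not Delta's actual implementation; file names and key ranges are invented:

```python
# Each clustered file tracks a [min, max] range on the clustering key.
# Incoming rows go to a file whose range already covers their key value;
# otherwise a new file is opened with a tight (single-value) range.

existing_files = [
    {"name": "f1", "min": 0, "max": 99},
    {"name": "f2", "min": 100, "max": 199},
]

def route_row(files, key):
    """Return the file whose key range covers `key`, or open a new tight-range file."""
    for f in files:
        if f["min"] <= key <= f["max"]:
            return f["name"]
    new_file = {"name": f"f{len(files) + 1}", "min": key, "max": key}
    files.append(new_file)
    return new_file["name"]

print(route_row(existing_files, 150))  # lands in f2
print(route_row(existing_files, 500))  # no match: opens new file f3 with a tight range
print(route_row(existing_files, 500))  # subsequent similar rows reuse f3
```

Only file statistics are consulted, never the file contents - which is why the cost stays proportional to the new data.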
Automated liquid clustering
(Databricks only feature)
When automated liquid clustering is enabled, the runtime will analyze the historical query workload, and automatically identify the best columns for clustering. It will figure out the keys which will lead to cost savings from data skipping. In this case, the liquid clustering columns won't need to be manually declared.
-- Enable automatic liquid clustering on an existing table,
-- including tables that previously had manually specified keys.
ALTER TABLE table1 CLUSTER BY AUTO;
-- Disable automatic liquid clustering on an existing table.
ALTER TABLE table1 CLUSTER BY NONE;
-- Disable automatic liquid clustering by setting the clustering keys
-- to chosen clustering columns or new columns.
ALTER TABLE table1 CLUSTER BY (column01, column02);
Source: Databricks: Delta Clustering: Enable/disable automatic liquid clustering
What triggers reclustering?
A re-clustering is triggered when:
- Files overlap on clustering keys
- There is fragmentation
- Query
Creating/modifying tables for liquid clustering
Liquid clustering can be enabled when creating a delta table with the CLUSTER BY clause, or turned on later for an existing unpartitioned delta table via ALTER TABLE ... CLUSTER BY.
How to choose columns to cluster on
Source - delta.io: liquid-clustering
Clustering can be done on multiple columns. Order of columns doesn't matter. Cardinality of the column doesn't matter.
If existing tables are being migrated to liquid clustering based delta tables, choose liquid clustering columns as follows:
- Hive-style partitioning - use columns used for partitioning the data
- Z-order - use ZORDER BY columns
- Hive-style partitioning & zorder - use columns used for partitioning the data and ZORDER BY columns
- Generated columns to reduce cardinality - use original column, don't create a generated column
Create new table with liquid clustering
from delta.tables import DeltaTable

(
    DeltaTable.create(spark)
    .tableName("sales")
    .addColumn("customer", dataType="STRING")
    .addColumn("city", dataType="STRING")
    .addColumn("sales", dataType="INT")
    .clusterBy("customer", "city")  # enables liquid clustering
    .execute()
)
CREATE TABLE table1(customer string, city string, sales int) USING DELTA CLUSTER BY (customer);
New table from existing table
CREATE TABLE sales_clustered CLUSTER BY (customer, city)
AS SELECT * FROM sales_raw;
Enable liquid clustering on existing table
ALTER TABLE events
CLUSTER BY (user_id, event_date);
Change liquid-clustering columns
ALTER TABLE table_name CLUSTER BY (new_column1, new_column2);
Disable liquid clustering
ALTER TABLE table_name CLUSTER BY NONE;
Auto-compaction - Control file size
Auto compaction kicks in when the number of small files is above a threshold (the spark configuration variable is spark.databricks.delta.autoCompact.minNumFiles). The output file size is controlled using spark.databricks.delta.autoCompact.maxFileSize.
Like liquid clustering, auto compaction comes in 2 flavours:
- Automated auto-compaction (Databricks only feature, part of Predictive Optimization) - the runtime will run OPTIMIZE, which will end up performing auto-compaction
- Non-automated auto-compaction - auto compaction will have to be performed by manually running OPTIMIZE via a scheduled spark job, or by a user
Auto compaction executes an OPTIMIZE command at the end of a transaction. Keep in mind that since OPTIMIZE is being run - if liquid clustering was enabled, then liquid clustering will also be performed on the table.
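A rough sketch of this trigger-then-bin-pack behaviour - the two config names are real, but the greedy packing below is a simplification for illustration, not Delta's internal algorithm:

```python
# Fire only when enough small files exist, then greedily bin-pack them
# into output files no larger than maxFileSize.

MIN_NUM_FILES = 50             # mirrors spark.databricks.delta.autoCompact.minNumFiles
MAX_FILE_SIZE = 128 * 1024**2  # mirrors spark.databricks.delta.autoCompact.maxFileSize

def plan_compaction(small_file_sizes, min_num_files=MIN_NUM_FILES,
                    max_file_size=MAX_FILE_SIZE):
    """Group small file sizes (bytes) into bins of at most max_file_size each."""
    if len(small_file_sizes) < min_num_files:
        return []  # below threshold: leave the files alone
    bins, current, current_size = [], [], 0
    for size in sorted(small_file_sizes, reverse=True):
        if current and current_size + size > max_file_size:
            bins.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# 60 one-megabyte files compact down to a single ~60 MB output file.
print(len(plan_compaction([1024**2] * 60)))  # → 1
```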
Enabling auto compaction
To enable auto compaction, the following settings can be used:
- Table property: delta.autoOptimize.autoCompact
- Spark session setting: spark.databricks.delta.autoCompact.enabled
These two settings accept the following values:
| Option | Behavior |
|---|---|
| auto (recommended) | Tunes target file size while respecting other autotuning functionality. Requires Databricks Runtime 10.4 LTS or above. |
| legacy | Alias for true. Requires Databricks Runtime 10.4 LTS or above. |
| true | Use 128 MB as the target file size. No dynamic sizing. |
| false | Turns off auto compaction. Can be set at the session level to override auto compaction for all Delta tables modified in the workload. |
Source: Databricks: Auto compaction for Delta Lake on Databricks
Auto tuning
Write heavy tables
If the table is write-heavy and many MERGE and DML operations are executed on it, set table property delta.tuneFileSizesForRewrites to true.
When true, the target file size is set to a lower threshold, which makes write-intensive operations faster.
If not set, databricks will detect if 9 of the last 10 ops were MERGE, and set the table property to true. To disable this behaviour, the value has to be explicitly set to false.
Estimating number of files
Databricks aims for smaller file sizes for smaller tables, and larger file sizes for larger tables. The aim is to ensure that the number of files doesn't grow too large.
However, it won't set the file size dynamically if you've explicitly set a target file size (the delta table property delta.targetFileSize; its default value is None).
| Table size | Target file size | Approximate number of files in table |
|---|---|---|
| 10 GB | 256 MB | 40 |
| 1 TB | 256 MB | 4096 |
| 2.56 TB | 256 MB | 10240 |
| 3 TB | 307 MB | 12108 |
| 5 TB | 512 MB | 17339 |
| 7 TB | 716 MB | 20784 |
| 10 TB | 1 GB | 24437 |
| 20 TB | 1 GB | 34437 |
| 50 TB | 1 GB | 64437 |
| 100 TB | 1 GB | 114437 |
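The "approximate number of files" column follows directly from dividing the table size by the target file size, which you can sanity-check for the smaller rows:

```python
# Sanity-check the sizing table: file count ≈ table size / target file size.
GB = 1024**3
TB = 1024**4

def approx_num_files(table_size_bytes, target_file_size_bytes):
    return table_size_bytes // target_file_size_bytes

print(approx_num_files(10 * GB, 256 * 1024**2))  # → 40, matching the 10 GB row
print(approx_num_files(1 * TB, 256 * 1024**2))   # → 4096, matching the 1 TB row
```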
Enable background auto compaction
This is a 2 step process:
- Remove the spark config spark.databricks.delta.autoCompact.enabled from cluster/notebook settings
- For each table, run ALTER TABLE <table_name> UNSET TBLPROPERTIES (delta.autoOptimize.autoCompact). This removes legacy auto-compaction settings.

After removing these, auto compaction will be triggered automatically (for all unity catalogue managed tables)
Automating table maintenance (Databricks only)
Predictive optimization
This is a feature available for Unity Catalog managed tables.
When predictive optimization is enabled (requires databricks premium plan), databricks will automatically run ANALYZE, OPTIMIZE, VACUUM operations - by queueing them as job to be run on serverless compute (you will be billed for this compute).
Since OPTIMIZE is being run, it will end up triggering both liquid clustering and auto-compaction on tables.
Observability statistics for predictive optimization executions are stored in the table system.storage.predictive_optimization_operations_history.
To explicitly enable/disable predictive optimization on a table:
ALTER CATALOG [catalog_name] { ENABLE | DISABLE | INHERIT } PREDICTIVE OPTIMIZATION;
ALTER { SCHEMA | DATABASE } schema_name { ENABLE | DISABLE | INHERIT } PREDICTIVE OPTIMIZATION;
Manual intervention
Rough bullet points:
- Aim for 128 MB-1 GB per file
- Liquid clustering
- Small file problem
OPTIMIZE - data layout optimization
Source - Databricks: Delta Optimize
Performs bin-packing (file compaction) and, if requested, z-ordering.
Merges small files into larger ones. Good for read heavy tables.
Auto-optimization - wip - WHERE TO PUT THIS IN THIS NOTE?
The term has been split into 2:
- https://docs.databricks.com/aws/en/delta/tune-file-size#optimized-writes
- https://docs.databricks.com/aws/en/delta/tune-file-size#auto-compact
ALTER TABLE events SET TBLPROPERTIES (
delta.autoOptimize.optimizeWrite = true,
delta.autoOptimize.autoCompact = true
)
Z-ORDER - faster reads by data-skipping
NOTE: As of now (2025), liquid clustering is recommended over Z-ORDERING.
WARNING: Very costly operation to perform. Z-ordering leads to write amplification - every OPTIMIZE ZORDER will rewrite large chunks of data.
- Improves file-level data skipping when queries are filtering on multiple columns
- While writing - writes rows with similar values close together, and stores min/max stats per file
Now, spark can skip entire files by looking at the min/max stats. Very effective for range and equality filters
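The space-filling curve here is a Z-order (Morton) curve, built by interleaving the bits of the clustering columns. A toy sketch with two integer columns (values are made up):

```python
# Minimal Z-order (Morton) key: interleave the bits of two columns so that
# rows close in BOTH dimensions get nearby sort keys.

def z_order_key(x, y, bits=16):
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # x bits land at even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # y bits land at odd positions
    return key

# Sorting by this key groups rows that are similar on both columns, which
# keeps per-file min/max stats tight on both columns at once.
rows = [(5, 9), (5, 10), (100, 2), (6, 9)]
print(sorted(rows, key=lambda r: z_order_key(*r)))
# → [(5, 9), (6, 9), (5, 10), (100, 2)]
```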
Z-ORDER has to be triggered manually (in pipelines or scheduled jobs). So, use Z-ORDER only for legacy tables, and one-off historical backfills.
By default, simply prefer liquid clustering.
To run z-order:
OPTIMIZE orders
ZORDER BY (customer_id, order_date)
Zorder vs bucketing
Bucketing is not supported for delta tables.
Similar to partitioning+bucketing for parquet tables, partitioning+z-ordering is the way to go for delta tables.
However, it is still recommended to simply use liquid clustering instead of partitioning+z-ordering.
Zorder and bucketing solve completely different problems. Zorder helps speed up reads through data-skipping. Bucketing helps speed up execution by preventing shuffles during joins and aggregations.
Z-order doesn't prevent the data-shuffle problem that bucketing does.
Partitioning - WIP
If liquid clustering is not available, Partitioning + Z-ORDER is the way to go, for enabling faster reads via file-skipping.
Also, partitioning only fails to help when queries don't filter on any low-cardinality columns. This almost never happens in practice, so partitioning the table is almost always the go-to strategy.
-- event_id is an example column; partition by the low-cardinality event_date
CREATE TABLE events (event_id BIGINT, event_date DATE)
USING DELTA
PARTITIONED BY (event_date);
VACUUM - Removing stale files by trimming history
VACUUM will:
- Trim time-travel history by deleting data files that are no longer referenced by the table and are older than the specified retention period.
- Physically delete dropped rows (that were soft-deleted via deletion vectors)
Unless predictive optimization is handling the table, it is never run automatically - it has to be executed by a scheduled job or a user.
Delete old data files (loses history)
VACUUM events RETAIN 168 HOURS -- 168 hours = 7 days, the default retention