This is a classic data lake design / partitioning dilemma. The implications are subtle and can have performance and maintainability impacts.
Current approach (Option 1)

```
raw/
  marketing/
    affiliate/
      2022/
        01/
          12/
            file1.parquet
```
- Pros:
  - Easy for humans to understand: the chronological order is clear.
  - Works fine for batch reads where you want all the data at once.
  - No strict schema needed in folder names; just hierarchical organization.
- Cons:
  - Spark/Databricks and Synapse will treat this as just a folder hierarchy, not as partition columns. That means:
    - You cannot easily filter by year/month/day in queries using partition pruning, which can dramatically affect performance.
    - Optimizations like Delta Lake's Z-order clustering or partition pruning are harder to leverage.
  - Adding new filtering dimensions later (e.g., `entity`, `source`) requires reshaping the folder structure, which can be costly.
Suggested approach (Option 2)

```
marketing/
  entity=affiliate/
    year=2022/
      month=01/
        day=12/
          file1.parquet
```
- Pros:
  - Each folder level is explicitly a partition key (`entity`, `year`, `month`, `day`), so Spark and Synapse can discover partitions automatically.
  - Queries are much faster because filters on partition columns skip whole folders instead of scanning all files:

    ```sql
    SELECT * FROM silver_table WHERE year = 2022 AND month = 1
    ```

  - Easier to maintain with multiple sources/entities over time.
  - Compatible with Delta Lake and other optimization features in Databricks (compaction, Z-ordering, caching).
- Cons:
  - Slightly more verbose folder naming.
  - Requires consistent naming (`year=YYYY`, `month=MM`, etc.).
  - If you change your mind later, re-partitioning existing data may be required.
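To make the key=value convention concrete, here is a minimal plain-Python sketch (all paths, entities, and values are illustrative, not from your environment) that builds the Hive-style layout and then prunes folders by directory name alone. This is essentially what Spark's partition pruning does before it opens a single Parquet file:

```python
import tempfile
from pathlib import Path

# Build a tiny key=value (Hive-style) layout on local disk.
# Entities and dates are made up for the example.
root = Path(tempfile.mkdtemp())
for entity, year, month, day in [
    ("affiliate", 2022, 1, 12),
    ("affiliate", 2022, 2, 3),
    ("display", 2022, 1, 5),
]:
    part = root / f"entity={entity}" / f"year={year}" / f"month={month:02d}" / f"day={day:02d}"
    part.mkdir(parents=True)
    (part / "file1.parquet").touch()  # empty placeholder file

def prune(root: Path, **wanted) -> list[Path]:
    """Keep only files whose folder names match the wanted key=value pairs,
    i.e. partition pruning done purely on directory names."""
    hits = []
    for f in root.rglob("*.parquet"):
        parts = dict(p.split("=", 1) for p in f.parent.relative_to(root).parts)
        if all(parts.get(k) == v for k, v in wanted.items()):
            hits.append(f)
    return hits

# Equivalent of WHERE year = 2022 AND month = 1:
# only two of the three files are ever touched.
print(len(prune(root, year="2022", month="01")))  # 2
```

With the Option 1 layout there are no `key=` names to match against, so this kind of folder-level filtering has to be hand-rolled per query.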
Impact of changing your current structure
Since your data is already in the silver layer:
- If you restructure to key-value partitions (`entity=.../year=.../month=...`):
  - You may need to rewrite existing Parquet files into the new folder layout.
- Any downstream jobs referencing the old path need to be updated.
- If the dataset is huge, rewriting can be costly in terms of time and compute.
- If you keep the current structure:
- Queries on specific days/months will scan unnecessary files unless you implement manual filtering logic (based on folder names).
- Optimizations like partition pruning or automatic discovery in Spark/Databricks won’t work efficiently.
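The "manual filtering logic" mentioned above usually means encoding the date filter in a path glob yourself. A small sketch with a local stand-in for the current layout (paths are illustrative):

```python
import tempfile
from pathlib import Path

# Recreate the Option 1 layout (raw/marketing/affiliate/YYYY/MM/DD) locally;
# the dates are made up for the example.
root = Path(tempfile.mkdtemp())
for year, month, day in [(2022, 1, 12), (2022, 1, 13), (2022, 2, 3)]:
    d = root / "marketing" / "affiliate" / str(year) / f"{month:02d}" / f"{day:02d}"
    d.mkdir(parents=True)
    (d / "file1.parquet").touch()

# Without partition columns, "January 2022 only" has to be emulated by hand
# with a path pattern; the engine cannot derive it from a WHERE clause.
january = sorted(root.glob("marketing/affiliate/2022/01/*/*.parquet"))
print(len(january))  # 2
```

This works, but every consumer has to know the folder convention, and any query that forgets the glob silently scans everything.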
What I'd recommend is the following:
- For long-term scalability and performance, Option 2 is the industry standard:
  - Especially if you anticipate queries by date, entity, or other dimensions.
  - It works well with Delta Lake / Synapse / Spark.
- For small datasets with infrequent querying, Option 1 is okay:
  - Easier to maintain initially.
  - No need to rewrite files.
Practical compromise:
- Keep the landing/raw layer in Option 1 style (easy ingestion).
- Use the bronze/silver layers in Option 2 style (key-value partitions):
  - You can write a one-time job to re-partition the silver files according to `entity/year/month/day`.
  - Then all downstream jobs benefit from partition pruning.
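In practice that one-time job would be a Spark job: read the old paths, derive the `entity`/`year`/`month`/`day` columns, and write them back with `df.write.partitionBy(...)`, which rewrites the Parquet contents. The path mapping itself can be sketched in plain Python (all paths are hypothetical; this only copies files, it does not rewrite Parquet):

```python
import tempfile
from pathlib import Path

# Old Option 1 layout: <root>/<source>/<entity>/<year>/<month>/<day>/file.parquet
old = Path(tempfile.mkdtemp())
src = old / "marketing" / "affiliate" / "2022" / "01" / "12"
src.mkdir(parents=True)
(src / "file1.parquet").touch()

# Map each file into the Option 2 key=value layout. A real job would let
# Spark rewrite the data instead of copying files byte-for-byte.
new = Path(tempfile.mkdtemp())
for f in old.rglob("*.parquet"):
    source, entity, year, month, day = f.relative_to(old).parts[:5]
    target = new / source / f"entity={entity}" / f"year={year}" / f"month={month}" / f"day={day}"
    target.mkdir(parents=True, exist_ok=True)
    target.joinpath(f.name).write_bytes(f.read_bytes())

print((new / "marketing" / "entity=affiliate" / "year=2022"
       / "month=01" / "day=12" / "file1.parquet").exists())  # True
```

After the rewrite, downstream jobs point at the new root and get partition discovery and pruning for free.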
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin