This is a classic data lake design / partitioning dilemma. The implications are subtle and can have performance and maintainability impacts.
Current approach (Option 1)

```
raw/
  marketing/
    affiliate/
      2022/
        01/
          12/
            file1.parquet
```
- Pros:
  - Easy for humans to understand: the chronological order is clear.
  - Works fine for batch reads where you want all the data at once.
  - No strict schema needed in folder names; just hierarchical organization.
- Cons:
  - Spark/Databricks and Synapse will treat this as just a folder hierarchy, not as partition columns. That means:
    - You cannot easily filter by year/month/day in queries using partition pruning, which can dramatically affect performance.
    - Optimizations like Delta Lake's Z-order clustering or partition pruning are harder to leverage.
  - Adding new filtering dimensions later (e.g., `entity`, `source`) requires reshaping the folder structure, which can be costly.
Suggested approach (Option 2)

```
marketing/
  entity=affiliate/
    year=2022/
      month=01/
        day=12/
          file1.parquet
```
- Pros:
  - Each folder level is explicitly a partition key (`entity`, `year`, `month`, `day`), so Spark and Synapse can discover partitions automatically.
  - Queries are much faster because filters on partition columns skip whole folders instead of scanning all files:

    ```sql
    SELECT * FROM silver_table WHERE year = 2022 AND month = 1
    ```

  - Easier to maintain with multiple sources/entities over time.
  - Compatible with Delta Lake and other optimization features in Databricks (compaction, Z-ordering, caching).
- Cons:
  - Slightly more verbose folder naming.
  - Requires consistent naming (`year=YYYY`, `month=MM`, etc.).
  - If you change your mind later, re-partitioning existing data may be required.
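To make the key=value convention concrete, here is a minimal plain-Python sketch (all paths, entities, and values are illustrative, not from your environment) that builds the Hive-style layout and then prunes folders by directory name alone. This is essentially what Spark's partition pruning does before it opens a single Parquet file:

```python
import tempfile
from pathlib import Path

# Build a tiny key=value (Hive-style) layout on local disk.
# Entities and dates are made up for the example.
root = Path(tempfile.mkdtemp())
for entity, year, month, day in [
    ("affiliate", 2022, 1, 12),
    ("affiliate", 2022, 2, 3),
    ("display", 2022, 1, 5),
]:
    part = root / f"entity={entity}" / f"year={year}" / f"month={month:02d}" / f"day={day:02d}"
    part.mkdir(parents=True)
    (part / "file1.parquet").touch()  # empty placeholder file

def prune(root: Path, **wanted) -> list[Path]:
    """Keep only files whose folder names match the wanted key=value pairs,
    i.e. partition pruning done purely on directory names."""
    hits = []
    for f in root.rglob("*.parquet"):
        parts = dict(p.split("=", 1) for p in f.parent.relative_to(root).parts)
        if all(parts.get(k) == v for k, v in wanted.items()):
            hits.append(f)
    return hits

# Equivalent of WHERE year = 2022 AND month = 1:
# only two of the three files are ever touched.
print(len(prune(root, year="2022", month="01")))  # 2
```

With the Option 1 layout there are no `key=` names to match against, so this kind of folder-level filtering has to be hand-rolled per query.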
Impact of changing your current structure
Since your data is already in the silver layer:
- If you restructure to key-value partitions (`entity=.../year=.../month=...`):
  - You may need to rewrite existing Parquet files into the new folder layout.
- Any downstream jobs referencing the old path need to be updated.
- If the dataset is huge, rewriting can be costly in terms of time and compute.
- If you keep the current structure:
- Queries on specific days/months will scan unnecessary files unless you implement manual filtering logic (based on folder names).
- Optimizations like partition pruning or automatic discovery in Spark/Databricks won’t work efficiently.
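The "manual filtering logic" mentioned above usually means encoding the date filter in a path glob yourself. A small sketch with a local stand-in for the current layout (paths are illustrative):

```python
import tempfile
from pathlib import Path

# Recreate the Option 1 layout (raw/marketing/affiliate/YYYY/MM/DD) locally;
# the dates are made up for the example.
root = Path(tempfile.mkdtemp())
for year, month, day in [(2022, 1, 12), (2022, 1, 13), (2022, 2, 3)]:
    d = root / "marketing" / "affiliate" / str(year) / f"{month:02d}" / f"{day:02d}"
    d.mkdir(parents=True)
    (d / "file1.parquet").touch()

# Without partition columns, "January 2022 only" has to be emulated by hand
# with a path pattern; the engine cannot derive it from a WHERE clause.
january = sorted(root.glob("marketing/affiliate/2022/01/*/*.parquet"))
print(len(january))  # 2
```

This works, but every consumer has to know the folder convention, and any query that forgets the glob silently scans everything.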
What I'd recommend is the following:
- For long-term scalability and performance, Option 2 is the industry standard:
  - Especially if you anticipate queries by date, entity, or other dimensions.
  - It works well with Delta Lake / Synapse / Spark.
- For small datasets with infrequent querying, Option 1 is okay:
  - Easier to maintain initially.
  - No need to rewrite files.
Practical compromise:
- Keep the landing/raw layer in Option 1 style (easy ingestion).
- Use the bronze/silver layers in Option 2 style (key-value partitions):
  - You can write a one-time job to re-partition the silver files according to `entity/year/month/day`.
  - Then all downstream jobs benefit from partition pruning.
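In practice that one-time job would be a Spark job: read the old paths, derive the `entity`/`year`/`month`/`day` columns, and write them back with `df.write.partitionBy(...)`, which rewrites the Parquet contents. The path mapping itself can be sketched in plain Python (all paths are hypothetical; this only copies files, it does not rewrite Parquet):

```python
import tempfile
from pathlib import Path

# Old Option 1 layout: <root>/<source>/<entity>/<year>/<month>/<day>/file.parquet
old = Path(tempfile.mkdtemp())
src = old / "marketing" / "affiliate" / "2022" / "01" / "12"
src.mkdir(parents=True)
(src / "file1.parquet").touch()

# Map each file into the Option 2 key=value layout. A real job would let
# Spark rewrite the data instead of copying files byte-for-byte.
new = Path(tempfile.mkdtemp())
for f in old.rglob("*.parquet"):
    source, entity, year, month, day = f.relative_to(old).parts[:5]
    target = new / source / f"entity={entity}" / f"year={year}" / f"month={month}" / f"day={day}"
    target.mkdir(parents=True, exist_ok=True)
    target.joinpath(f.name).write_bytes(f.read_bytes())

print((new / "marketing" / "entity=affiliate" / "year=2022"
       / "month=01" / "day=12" / "file1.parquet").exists())  # True
```

After the rewrite, downstream jobs point at the new root and get partition discovery and pruning for free.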
If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.
hth
Marcin