ADLS structure approach

azure_learner, 2025-08-23

Hi friends, we are pulling data from an API endpoint and saving it in Parquet format. Since the files are Parquet, I designed a hierarchical ADLS layout for the landing zone, bronze, and silver layers, and the datasets are now all fully populated.

1. Current:

    raw/
        marketing/
            affiliate/
                2022/
                    01/
                        12/
                            file1.parquet

But my colleague argues that it is better to structure it along these lines:

2. Suggested:

    source=market/
        entity=affiliate/
            year=2022/
                month=01/
                    day=12/
                        file1.parquet

I have already implemented option 1 because this is not a database, and the data files are saved in Parquet format at the landing, bronze, and silver layers. Is this the right approach going forward?

But for option 2, as I mentioned, my colleague says the key=value approach is much better because it provides:

  • Automatic partition discovery in Synapse
  • Built-in optimization in Spark/Databricks  

and that otherwise the Databricks engine would not explicitly recognize 2022 as the year, 01 as the month, and 12 as the day. Please guide and help me here. Since the data already resides in the silver layer, what impact would making changes have on the data and other aspects? Or is my option 1 correct, given that this is not a database and it is better to read the whole structure because of the Parquet file format? I will appreciate your help immensely. Thank you in advance for your generous help.

Azure Data Lake Storage

Accepted answer

Marcin Policht (MVP, Volunteer Moderator), 2025-08-23

    This is a classic data lake design / partitioning dilemma. The implications are subtle and can have performance and maintainability impacts.

    Current approach (Option 1)

    raw/
       marketing/
           affiliate/
               2022/
                   01/
                       12/
                           file1.parquet
    
    • Pros:
      1. Easy to understand for humans: chronological order is clear.
      2. Works fine for batch reads where you want all data at once.
      3. No strict schema needed in folder names; just hierarchical organization.
    • Cons:
      1. Spark/Databricks and Synapse will treat this as just a folder hierarchy, not as partition columns. That means:
        • You cannot easily filter by year/month/day in queries using partition pruning, which can dramatically affect performance.
        • Optimizations like Delta Lake’s Z-order clustering or partition pruning are harder to leverage.
      2. Adding new filtering dimensions later (e.g., entity, source) requires reshaping the folder structure, potentially costly.
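The "manual filtering logic" this implies can be sketched with a small stdlib helper (the function name and base path are illustrative, not any real API): with positional folders, the reader must know by convention which path segment is the year, month, or day, and must enumerate every candidate folder itself.

```python
from datetime import date, timedelta

# Hypothetical helper: with positional folders (2022/01/12), nothing in the
# path tells the engine which segment is which. A job that wants a date range
# has to build every candidate folder path explicitly, by convention.
def paths_for_range(base: str, start: date, end: date) -> list[str]:
    paths = []
    d = start
    while d <= end:
        paths.append(f"{base}/{d.year:04d}/{d.month:02d}/{d.day:02d}")
        d += timedelta(days=1)
    return paths

# Example: the three folders a job would have to list for Jan 12-14, 2022.
candidates = paths_for_range("raw/marketing/affiliate",
                             date(2022, 1, 12), date(2022, 1, 14))
# candidates[0] == "raw/marketing/affiliate/2022/01/12"
```

Any change to the folder convention silently breaks this logic, which is exactly the fragility the key=value layout avoids.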

    Suggested approach (Option 2)

    source=market/
       entity=affiliate/
           year=2022/
               month=01/
                   day=12/
                       file1.parquet
    
    • Pros:
      1. Each folder level is explicitly a partition key (entity, year, month, day). Spark and Synapse can automatically discover partitions.
      2. Queries are much faster because you can filter on partitions without scanning all files.
             SELECT * FROM silver_table
             WHERE year = 2022 AND month = 1
        
      3. Easier to maintain with multiple sources/entities over time.
      4. Compatible with Delta Lake and other optimization features in Databricks (compaction, Z-ordering, caching).
    • Cons:
      1. Slightly more verbose folder naming.
      2. Requires consistent naming (year=YYYY, month=MM, etc.).
      3. If you change your mind later, re-partitioning existing data may be required.
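The automatic partition discovery described above can be sketched in plain Python (the function names are illustrative, not Spark's actual API): the engine parses each `key=value` path segment into a column/value pair, so a filter on those columns lets it skip whole folders without listing or reading their files.

```python
# Minimal sketch of Hive-style partition discovery: key=value segments in a
# path are parsed into column/value pairs, roughly what Spark and Synapse do
# when they register partition columns from the folder layout.
def parse_partitions(path: str) -> dict[str, str]:
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

def prune(paths: list[str], **filters: str) -> list[str]:
    """Keep only paths whose discovered partition values match the filters;
    everything else is skipped without reading any file contents."""
    return [p for p in paths
            if all(parse_partitions(p).get(k) == v for k, v in filters.items())]

paths = [
    "market/entity=affiliate/year=2022/month=01/day=12/file1.parquet",
    "market/entity=affiliate/year=2022/month=02/day=03/file2.parquet",
    "market/entity=display/year=2022/month=01/day=12/file3.parquet",
]
# prune(paths, entity="affiliate", month="01") -> only the first path survives
```

Note that real engines also infer partition column types from the values (e.g. `year=2022` becomes an integer), which is why consistent `year=YYYY` / `month=MM` naming matters.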

    Impact of changing your current structure

    Since your data is already in the silver layer:

    • If you restructure to key-value partitions (entity=.../year=.../month=...):
      1. You may need to rewrite existing Parquet files to the new folder layout.
      2. Any downstream jobs referencing the old path need to be updated.
      3. If the dataset is huge, rewriting can be costly in terms of time and compute.
    • If you keep the current structure:
      1. Queries on specific days/months will scan unnecessary files unless you implement manual filtering logic (based on folder names).
      2. Optimizations like partition pruning or automatic discovery in Spark/Databricks won’t work efficiently.

    What I'd recommend is the following:

    1. For long-term scalability and performance, Option 2 is the industry standard:
      • Especially if you anticipate queries by date, entity, or other dimensions.
      • It works well with Delta Lake / Synapse / Spark.
    2. For small datasets with infrequent querying, Option 1 is okay:
      • Easier to maintain initially.
      • No need to rewrite files.

    Practical compromise:

    • Keep landing/raw layer in Option 1 style (easy ingestion).
    • Use bronze/silver layer for Option 2 style (key-value partitions):
      • You can write a one-time job to re-partition silver files according to entity/year/month/day.
      • Then all downstream jobs benefit from partition pruning.
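One observation that makes the one-time job cheaper than it sounds: since the dates live only in the folder names, the Parquet files themselves do not need to be rewritten; each file can simply be moved into the equivalent key=value layout. A stdlib sketch (paths and the `entity` value are hypothetical):

```python
import re
import shutil
from pathlib import Path

# Matches trailing yyyy/mm/dd folder segments of the old layout.
DATE_DIRS = re.compile(r"(\d{4})/(\d{2})/(\d{2})$")

def migrate(old_root: str, new_root: str, entity: str) -> int:
    """Move every Parquet file from old_root/yyyy/mm/dd/... into
    new_root/entity=.../year=.../month=.../day=.../ and return the count."""
    moved = 0
    for f in Path(old_root).rglob("*.parquet"):
        m = DATE_DIRS.search(str(f.parent).replace("\\", "/"))
        if not m:
            continue  # skip files outside the yyyy/mm/dd convention
        year, month, day = m.groups()
        dest = (Path(new_root) / f"entity={entity}" / f"year={year}"
                / f"month={month}" / f"day={day}")
        dest.mkdir(parents=True, exist_ok=True)
        shutil.move(str(f), str(dest / f.name))
        moved += 1
    return moved
```

In Databricks itself the equivalent is to read the old folders into a DataFrame, add `year`/`month`/`day` columns, and write with `partitionBy("year", "month", "day")`, which rewrites the files but produces the same layout and also lets you compact small files along the way.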

    If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.

    hth

    Marcin

