Timestamp column from Parquet file (created with Pandas) becomes NULL values in Data Flow
Hello,
I'm facing an issue where a timestamp column in a Parquet file is consistently read as NULL within an Azure Data Flow. Critically, the Data Flow does not throw any errors, making the problem difficult to debug.
Here is a summary of my environment and the problem:
Scenario:
- I am using a Data Flow within Azure Data Factory to read a Parquet file.
- In the Data Flow's data preview and subsequent transformations, all values in the target date/timestamp column are NULL.
Parquet File Generation Details:
- The Parquet file is generated using a Python script.
- I build a Pandas DataFrame and export it to Parquet with `to_parquet()`.
- The source column in the Pandas DataFrame has the `datetime64[ns]` data type.
- When writing the file, I provide a PyArrow schema that maps this column to `pyarrow.timestamp('s')`.
- The date format is `YYYY-MM-DD hh:mm:ss`.
How Azure Data Factory Interprets the Schema:
- In the source Dataset within Data Factory, the schema for the column is inferred as TIMESTAMP_MILLIS.
- Within the Data Flow's source projection, the column is recognized as a timestamp type.
My Question:
Given that I'm writing the timestamp with second-level precision (`timestamp('s')`) but Data Factory seems to be interpreting it as millisecond-level precision (`TIMESTAMP_MILLIS`), could this mismatch be causing the values to be silently dropped and replaced with NULLs?
How can I configure my process to ensure the date values are loaded correctly? Should I change my PyArrow schema to `timestamp('ms')` to align with Data Factory, or is there a setting within the Data Flow that I need to adjust?
This is a critical issue for us as it blocks all our date-based data processing. Any help or guidance would be greatly appreciated.
Thank you.