Timestamp column from Parquet file (created with Pandas) becomes NULL values in Data Flow
Hello,
I'm facing an issue where a timestamp column in a Parquet file is consistently read as NULL within an Azure Data Flow. Critically, the Data Flow does not throw any errors, making the problem difficult to debug.
Here is a summary of my environment and the problem:
Scenario:
- I am using a Data Flow within Azure Data Factory to read a Parquet file.
- In the Data Flow's data preview and subsequent transformations, all values in the target date/timestamp column are NULL.
Parquet File Generation Details:
- The Parquet file is generated using a Python script.
- I build a Pandas DataFrame and export it to Parquet with `to_parquet()`.
- The source column in the Pandas DataFrame has the `datetime64[ns]` data type.
- When writing the file, I provide a PyArrow schema that maps this column to `pyarrow.timestamp('s')`.
- The date format is `YYYY-MM-DD hh:mm:ss`.
How Azure Data Factory Interprets the Schema:
- In the source Dataset within Data Factory, the schema for the column is inferred as TIMESTAMP_MILLIS.
- Within the Data Flow's source projection, the column is recognized as a timestamp type.
My Question:
Given that I'm writing the timestamp with second-level precision (`timestamp('s')`) but Data Factory seems to be interpreting it as millisecond-level precision (`TIMESTAMP_MILLIS`), could this mismatch be causing the values to be silently dropped and replaced with NULLs?
How can I configure my process to ensure the date values are loaded correctly? Should I change my PyArrow schema to `timestamp('ms')` to align with Data Factory, or is there a setting within the Data Flow that I need to adjust?
This is a critical issue for us as it blocks all our date-based data processing. Any help or guidance would be greatly appreciated.
Thank you.