java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data #11614

wardlican · 2024-11-21T06:30:03Z

Apache Iceberg version

1.4.3

Query engine

Spark

Please describe the bug 🐞

CALL spark_catalog.system.rewrite_data_files(
  table => '${DATABASE_NAME}.${TABLE_NAME}',
  options => map(
    'max-concurrent-file-group-rewrites', 500,
    'target-file-size-bytes','536870912',
    'max-file-group-size-bytes','10737418240',
    'rewrite-all', 'true')
);

After using spark_catalog.system.rewrite_data_files to compact small Iceberg files, the newly generated Parquet files cannot be read when a query is executed. The error message is as follows:

	 client token: N/A
	 diagnostics: User class threw exception: java.lang.RuntimeException: Job aborted due to stage failure: Task 208 in stage 7.0 failed 4 times, most recent failure: Lost task 208.3 in stage 7.0 (TID 272) (10.1.75.103 executor 9): org.apache.iceberg.exceptions.RuntimeIOException: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:165)
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141)
	at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:130)
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:93)
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:130)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.agg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1501)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.read(Util.java:366)
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.readPageHeader(Util.java:133)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1458)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1505)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1478)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1088)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:956)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:909)
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163)
	... 23 more

Willingness to contribute

I can contribute a fix for this bug independently
I would be willing to contribute a fix for this bug with guidance from the Iceberg community
I cannot contribute a fix for this bug at this time


jia-zhengwei · 2024-11-21T08:16:56Z

 Required field num_values was not found in serialized data!

Which column is num_values?

Fokko · 2024-11-21T18:35:33Z

Thanks @wardlican for raising this. Do you happen to know which system produced the Parquet files (Spark, Arrow, etc)?

wardlican · 2024-11-22T02:57:38Z

Thanks @wardlican for raising this. Do you happen to know which system produced the Parquet files (Spark, Arrow, etc)?

We run spark_catalog.system.rewrite_data_files, submitted through hive-beeline (Spark engine).

wardlican · 2024-11-22T02:59:20Z

 Required field num_values was not found in serialized data!
Which column is num_values?

[Screenshot]

num_values is not a data column. It should be a field of the Parquet file format itself: the stack trace shows it is a required field of the page header (DataPageHeader).
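To illustrate the point above, here is a minimal sketch in Python of why a reader reports a missing required field. It is a toy stand-in, not the parquet-mr or Iceberg code: the function `decode_data_page_header` and the exception `MissingRequiredField` are hypothetical names, and the header is modeled as a plain dict. The only part taken from the Parquet format is the list of fields that the Thrift definition of DataPageHeader marks as required.

```python
# Fields that parquet.thrift declares as "required" in DataPageHeader.
# (statistics is optional and therefore not checked here.)
REQUIRED_FIELDS = (
    "num_values",
    "encoding",
    "definition_level_encoding",
    "repetition_level_encoding",
)


class MissingRequiredField(Exception):
    """Hypothetical error type, standing in for the reader's IOException."""


def decode_data_page_header(fields):
    """Toy validation step: reject a decoded header that lacks a required field.

    `fields` is a dict of whatever the decoder managed to read from the bytes
    at the position where a page header was expected.
    """
    for name in REQUIRED_FIELDS:
        if name not in fields:
            raise MissingRequiredField(
                f"Required field '{name}' was not found in serialized data"
            )
    return dict(fields)


# A well-formed header passes validation.
good = decode_data_page_header({
    "num_values": 100,
    "encoding": "PLAIN",
    "definition_level_encoding": "RLE",
    "repetition_level_encoding": "RLE",
})

# If the bytes at that position do not decode to a complete header,
# validation fails on the first missing required field.
try:
    decode_data_page_header({"encoding": "PLAIN"})
except MissingRequiredField as err:
    print(err)
```

In other words, the message concerns the file's internal page metadata, so it presumably points at a page header that was written or read incorrectly rather than at any column in the table schema.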

wardlican added the bug label on Nov 21, 2024.
