java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data #11614

Open
1 of 3 tasks
wardlican opened this issue Nov 21, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@wardlican

wardlican commented Nov 21, 2024

Apache Iceberg version

1.4.3

Query engine

Spark

Please describe the bug 🐞

CALL spark_catalog.system.rewrite_data_files(
  table => '${DATABASE_NAME}.${TABLE_NAME}',
  options => map(
    'max-concurrent-file-group-rewrites', 500,
    'target-file-size-bytes','536870912',
    'max-file-group-size-bytes','10737418240',
    'rewrite-all', 'true')
);

After using spark_catalog.system.rewrite_data_files to compact small Iceberg files, the newly generated Parquet files cannot be read when we run a query. The error message is as follows:

	 client token: N/A
	 diagnostics: User class threw exception: java.lang.RuntimeException: Job aborted due to stage failure: Task 208 in stage 7.0 failed 4 times, most recent failure: Lost task 208.3 in stage 7.0 (TID 272) (10.1.75.103 executor 9): org.apache.iceberg.exceptions.RuntimeIOException: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:165)
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.next(VectorizedParquetReader.java:141)
	at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:130)
	at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:93)
	at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:130)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.columnartorow_nextBatch_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.agg_doAggregateWithKeys_0$(Unknown Source)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage5.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
	at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
	at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1501)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: can not read class org.apache.iceberg.shaded.org.apache.parquet.format.PageHeader: Required field 'num_values' was not found in serialized data! Struct: org.apache.iceberg.shaded.org.apache.parquet.format.DataPageHeader$DataPageHeaderStandardScheme@57eb7595
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.read(Util.java:366)
	at org.apache.iceberg.shaded.org.apache.parquet.format.Util.readPageHeader(Util.java:133)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readPageHeader(ParquetFileReader.java:1458)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1505)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1478)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1088)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.internalReadRowGroup(ParquetFileReader.java:956)
	at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:909)
	at org.apache.iceberg.parquet.VectorizedParquetReader$FileIterator.advance(VectorizedParquetReader.java:163)
	... 23 more

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@wardlican added the bug label on Nov 21, 2024

@jia-zhengwei

Required field num_values was not found in serialized data!

Which column is num_values?

@Fokko
Contributor

Fokko commented Nov 21, 2024

Thanks @wardlican for raising this. Do you happen to know which system produced the Parquet files (Spark, Arrow, etc)?

@wardlican
Author

Thanks @wardlican for raising this. Do you happen to know which system produced the Parquet files (Spark, Arrow, etc)?

The files were produced by Spark: we run spark_catalog.system.rewrite_data_files, submitted through hive-beeline (Spark engine).

@wardlican
Author

Required field num_values was not found in serialized data!

Which column is num_values?

[Image: WeCom screenshot 2f8b95f5-ca6f-47b2-88ee-00574ce105ea]

num_values should be an attribute of the Parquet file format itself, not a data column in the table.
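
To illustrate the point above: in the Parquet format's Thrift definition (parquet.thrift), DataPageHeader declares num_values as a required field, and the reader raises an error when it decodes a page header that lacks it. The sketch below is a toy model only, not the actual parquet-format or Thrift code. The dict-based "wire format", the function read_data_page_header, and the exception MissingRequiredField are hypothetical names made up for illustration. It only shows why the message names num_values even though the table has no such column.

```python
from dataclasses import dataclass
from typing import Dict

# Toy model: field IDs and names mirror the DataPageHeader struct in
# parquet.thrift, but the input here is a plain dict, not Thrift bytes.
REQUIRED_FIELDS = {
    1: "num_values",
    2: "encoding",
    3: "definition_level_encoding",
    4: "repetition_level_encoding",
}


class MissingRequiredField(IOError):
    """Hypothetical error: a required header field is absent from the decoded data."""


@dataclass
class DataPageHeader:
    num_values: int
    encoding: int
    definition_level_encoding: int
    repetition_level_encoding: int


def read_data_page_header(fields: Dict[int, int]) -> DataPageHeader:
    """Build a header from decoded (field id -> value) pairs, checking required fields."""
    for field_id, name in REQUIRED_FIELDS.items():
        if field_id not in fields:
            raise MissingRequiredField(
                f"Required field '{name}' was not found in serialized data"
            )
    # REQUIRED_FIELDS is ordered by field id, matching the dataclass field order.
    return DataPageHeader(*(fields[i] for i in REQUIRED_FIELDS))


# A well-formed header decodes normally (encoding values are illustrative).
header = read_data_page_header({1: 100, 2: 0, 3: 3, 4: 3})
print(header.num_values)

# A header without field 1 fails the same way the stack trace above does.
try:
    read_data_page_header({2: 0, 3: 3, 4: 3})
except MissingRequiredField as exc:
    print(exc)
```

If this reading is right, the failure would point at the page header bytes in the rewritten file (for example, corrupted or misread data) rather than at the table schema. The thread does not confirm the root cause.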
