Replies: 3 comments
-
It looks like you're not creating …
-
Thank you @raboof, as always, for your reply. Yes, you are correct that it is a good idea to stick with multipartUpload for large files (typically >500 MB), but for smaller files, can putObject be considered? The advantage with putObject, I feel, is that it is all-or-nothing, whereas with multipartUpload I need a lifecycle policy for the part files in case the upload gets interrupted partway (considering we are not supporting upload resumability currently).
-
I'm not familiar with those S3 specifics, sorry! If you figure it out, that would be great to add to the S3 connector docs!
-
Hello everyone,
I’m working on a project using Pekko Connectors to read Parquet files from HDFS, process them, and upload them to S3. I’ve implemented this, and I’d like to confirm whether the approach is sound, specifically around the use of ByteString(outputStream.toByteArray) to convert the serialized Parquet data for upload to S3.
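Roughly, the pattern in question looks like the sketch below. This is a simplified illustration rather than the actual project code: processAndSerialize, the bucket, and the key are placeholders; only the ByteString(outputStream.toByteArray) and S3.multipartUpload steps mirror what the question describes.

```scala
import java.io.ByteArrayOutputStream

import scala.concurrent.Future

import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.stream.connectors.s3.MultipartUploadResult
import org.apache.pekko.stream.connectors.s3.scaladsl.S3
import org.apache.pekko.stream.scaladsl.Source
import org.apache.pekko.util.ByteString

implicit val system: ActorSystem = ActorSystem("hdfs-to-s3")

val bucket = "my-bucket"                // placeholder
val key    = "output/part-0000.parquet" // placeholder

// Hypothetical stand-in for the real logic: read a Parquet file from HDFS,
// process it, and write the re-serialized result to the given stream.
def processAndSerialize(out: ByteArrayOutputStream): Unit = ???

val outputStream = new ByteArrayOutputStream()
processAndSerialize(outputStream)

// The step in question: the entire serialized file is held in memory here.
val payload: ByteString = ByteString(outputStream.toByteArray)

val upload: Future[MultipartUploadResult] =
  Source.single(payload).runWith(S3.multipartUpload(bucket, key))
```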
My Concern:
I am using ByteString(outputStream.toByteArray) to convert the serialized Parquet data into a format that can be streamed to S3. I’m concerned that this approach could lead to OutOfMemory (OOM) issues, especially when processing large files, as ByteArrayOutputStream keeps everything in memory.
Is this approach safe for production when dealing with large files?
Should I consider a more memory-efficient way to handle the conversion?
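One possible memory-bounded alternative is sketched below, under the assumption that the serialization step can write to an arbitrary OutputStream (for Parquet specifically this may require an OutputFile adapter, which is not shown). StreamConverters.asOutputStream materializes an OutputStream and emits whatever is written to it downstream in bounded chunks, so the full file never needs to sit in a ByteArrayOutputStream.

```scala
import java.io.OutputStream

import scala.concurrent.Future
import scala.concurrent.duration._

import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.stream.connectors.s3.MultipartUploadResult
import org.apache.pekko.stream.connectors.s3.scaladsl.S3
import org.apache.pekko.stream.scaladsl.StreamConverters

implicit val system: ActorSystem = ActorSystem("hdfs-to-s3-streaming")
import system.dispatcher

val bucket = "my-bucket"                // placeholder
val key    = "output/part-0000.parquet" // placeholder

// Hypothetical stand-in: same processing as before, but targeting a plain
// OutputStream instead of a ByteArrayOutputStream.
def processAndSerialize(out: OutputStream): Unit = ???

val upload: Future[MultipartUploadResult] =
  StreamConverters
    .asOutputStream(writeTimeout = 30.seconds)
    .mapMaterializedValue { out =>
      // Run the blocking write on a separate thread; bytes flow to S3 as they
      // are written. Error propagation from the writer is elided for brevity.
      Future {
        try processAndSerialize(out)
        finally out.close()
      }
    }
    .runWith(S3.multipartUpload(bucket, key))
```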
Additionally, I’m using S3.multipartUpload since it’s typically recommended for files larger than 500 MB. However, I’m curious:
Can I use S3.putObject instead for smaller files?
Is multipartUpload still the preferred approach for all file sizes, given it handles uploads in parts, or should I switch to putObject for smaller files?
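For the smaller-file case, a sketch of what S3.putObject could look like with the Pekko Connectors S3 scaladsl is below. The exact overload and parameter list vary by connector version, so treat them as an assumption; bucket, key, and payload are placeholders. putObject needs the content length up front, which is easy when the payload is already a single in-memory ByteString.

```scala
import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.http.scaladsl.model.ContentTypes
import org.apache.pekko.stream.connectors.s3.S3Headers
import org.apache.pekko.stream.connectors.s3.scaladsl.S3
import org.apache.pekko.stream.scaladsl.{Sink, Source}
import org.apache.pekko.util.ByteString

implicit val system: ActorSystem = ActorSystem("s3-put-object")

val bucket  = "my-bucket"                 // placeholder
val key     = "output/small-file.parquet" // placeholder
val payload = ByteString("...")           // the already-serialized small file

// Single-request upload: either the whole object lands in S3 or nothing does,
// so there are no orphaned parts to clean up afterwards.
val result =
  S3.putObject(
      bucket,
      key,
      Source.single(payload),
      payload.length,
      ContentTypes.`application/octet-stream`,
      S3Headers.empty)
    .runWith(Sink.head)
```

For multipart uploads, the usual way to avoid paying for orphaned parts is a bucket lifecycle rule that aborts incomplete multipart uploads after a set number of days.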
Looking forward to your thoughts and suggestions!
Thank you.