Replies: 3 comments
-
It looks like you're not creating …
-
Thank you @raboof, as always, for your reply. Yes, you are correct that it is a good idea to stick with multipartUpload for large files (typically >500 MB), but for smaller files, can putObject be considered? The advantage with putObject, I feel, is that it is all-or-nothing, whereas with multipartUpload I need a lifecycle policy for the part files in case the upload gets interrupted partway (considering we are not supporting upload resumability currently).
-
I'm not familiar with those S3 specifics, sorry! If you figure it out, that would be great to add to the S3 connector docs!
-
Hello everyone,
I’m working on a project using Pekko Connectors to read Parquet files from HDFS, process them, and upload them to S3. I’ve implemented this, and I’d like to confirm whether the approach is sound, specifically around the use of ByteString(outputStream.toByteArray) to convert the serialized Parquet data for upload to S3.
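Roughly, the pattern in question looks like the sketch below. This is a simplified illustration rather than the actual project code: processAndSerialize, the bucket, and the key are placeholders; only the ByteString(outputStream.toByteArray) and S3.multipartUpload steps mirror what the question describes.

```scala
import java.io.ByteArrayOutputStream

import scala.concurrent.Future

import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.stream.connectors.s3.MultipartUploadResult
import org.apache.pekko.stream.connectors.s3.scaladsl.S3
import org.apache.pekko.stream.scaladsl.Source
import org.apache.pekko.util.ByteString

implicit val system: ActorSystem = ActorSystem("hdfs-to-s3")

val bucket = "my-bucket"                // placeholder
val key    = "output/part-0000.parquet" // placeholder

// Hypothetical stand-in for the real logic: read a Parquet file from HDFS,
// process it, and write the re-serialized result to the given stream.
def processAndSerialize(out: ByteArrayOutputStream): Unit = ???

val outputStream = new ByteArrayOutputStream()
processAndSerialize(outputStream)

// The step in question: the entire serialized file is held in memory here.
val payload: ByteString = ByteString(outputStream.toByteArray)

val upload: Future[MultipartUploadResult] =
  Source.single(payload).runWith(S3.multipartUpload(bucket, key))
```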
My Concern:
I am using ByteString(outputStream.toByteArray) to convert the serialized Parquet data into a format that can be streamed to S3. I’m concerned that this approach could lead to OutOfMemory (OOM) issues, especially when processing large files, as ByteArrayOutputStream keeps everything in memory.
Is this approach safe for production when dealing with large files?
Should I consider a more memory-efficient way to handle the conversion?
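One possible memory-bounded alternative is sketched below, under the assumption that the serialization step can write to an arbitrary OutputStream (for Parquet specifically this may require an OutputFile adapter, which is not shown). StreamConverters.asOutputStream materializes an OutputStream and emits whatever is written to it downstream in bounded chunks, so the full file never needs to sit in a ByteArrayOutputStream.

```scala
import java.io.OutputStream

import scala.concurrent.Future
import scala.concurrent.duration._

import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.stream.connectors.s3.MultipartUploadResult
import org.apache.pekko.stream.connectors.s3.scaladsl.S3
import org.apache.pekko.stream.scaladsl.StreamConverters

implicit val system: ActorSystem = ActorSystem("hdfs-to-s3-streaming")
import system.dispatcher

val bucket = "my-bucket"                // placeholder
val key    = "output/part-0000.parquet" // placeholder

// Hypothetical stand-in: same processing as before, but targeting a plain
// OutputStream instead of a ByteArrayOutputStream.
def processAndSerialize(out: OutputStream): Unit = ???

val upload: Future[MultipartUploadResult] =
  StreamConverters
    .asOutputStream(writeTimeout = 30.seconds)
    .mapMaterializedValue { out =>
      // Run the blocking write on a separate thread; bytes flow to S3 as they
      // are written. Error propagation from the writer is elided for brevity.
      Future {
        try processAndSerialize(out)
        finally out.close()
      }
    }
    .runWith(S3.multipartUpload(bucket, key))
```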
Additionally, I’m using S3.multipartUpload since it’s typically recommended for files larger than 500 MB. However, I’m curious:
Can I use S3.putObject instead for smaller files?
Is multipartUpload still the preferred approach for all file sizes, given it handles uploads in parts, or should I switch to putObject for smaller files?
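For the smaller-file case, a sketch of what S3.putObject could look like with the Pekko Connectors S3 scaladsl is below. The exact overload and parameter list vary by connector version, so treat them as an assumption; bucket, key, and payload are placeholders. putObject needs the content length up front, which is easy when the payload is already a single in-memory ByteString.

```scala
import org.apache.pekko.actor.ActorSystem
import org.apache.pekko.http.scaladsl.model.ContentTypes
import org.apache.pekko.stream.connectors.s3.S3Headers
import org.apache.pekko.stream.connectors.s3.scaladsl.S3
import org.apache.pekko.stream.scaladsl.{Sink, Source}
import org.apache.pekko.util.ByteString

implicit val system: ActorSystem = ActorSystem("s3-put-object")

val bucket  = "my-bucket"                 // placeholder
val key     = "output/small-file.parquet" // placeholder
val payload = ByteString("...")           // the already-serialized small file

// Single-request upload: either the whole object lands in S3 or nothing does,
// so there are no orphaned parts to clean up afterwards.
val result =
  S3.putObject(
      bucket,
      key,
      Source.single(payload),
      payload.length,
      ContentTypes.`application/octet-stream`,
      S3Headers.empty)
    .runWith(Sink.head)
```

For multipart uploads, the usual way to avoid paying for orphaned parts is a bucket lifecycle rule that aborts incomplete multipart uploads after a set number of days.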
Looking forward to your thoughts and suggestions!
Thank you.