Apache Arrow Flight Server (Data as a Service) #6557

Susmit07 · 2024-10-14T18:24:02Z

Susmit07
Oct 14, 2024

Hi Developers & Experts of the community,

We have built short-term storage on S3, and we want to expose client APIs so end users can fetch the datasets.

We are planning to build an Arrow Flight server on top of the data in S3; the S3 storage holds petabyte-scale data.

Is this the right use case for an Arrow Flight server and client?

Hoping to get some useful information that helps us. The server will be running in pods on K8s.

Thanks & Grateful
Susmit

tustvold · 2024-10-14T21:43:36Z

tustvold
Oct 14, 2024
Collaborator

For datasets of that size, and with data stored in S3, you might want to consider something like iceberg and/or deltalake.

Whilst arrow flight is a valid way to expose a queryable API, ultimately you are going to end up building something similar to deltalake/iceberg in order to be able to query and manage your data efficiently, and so you might as well cut out the middleman.

Unless the database is the product and therefore the focus of your R&D, or you have non-standard query/access patterns, using something off the shelf like deltalake / iceberg will likely get you something faster.

0 replies

Susmit07 · 2024-10-15T05:55:12Z

Susmit07
Oct 15, 2024
Author

If I go ahead with Arrow Flight, the internal query engine would be something like Presto, right?

1 reply

tustvold Oct 15, 2024
Collaborator

If you're looking to use Presto/Spark, I'm not sure why you would involve arrow-flight at all. Both engines are capable of querying S3 data directly, and have mature support for iceberg, deltalake and similar technologies.

Perhaps you could articulate what problem you are trying to solve?

Susmit07 · 2024-10-15T07:06:20Z

Susmit07
Oct 15, 2024
Author

Just looping in @zeroshade as well to this thread, to explore different views and have a healthy conversation.

0 replies

Susmit07 · 2024-10-15T11:20:00Z

Susmit07
Oct 15, 2024
Author

The requirement is that we need to sync data from HDFS to short-term storage, S3 in our case.

Basically, a DataSync service between cloud storages.

I have already built the service using the Apache Pekko / Akka HDFS and S3 connectors.

Now comes the data-reading part for end users.

The data is stored in AWS S3 short-term storage as Parquet.

We want to build a Data as a Service layer on top of the data in S3 and expose API endpoints for clients to query.

The data will be short-lived: it may be kept for weeks or months (max 3 months), and use cases vary from team to team.

So we felt an Apache Arrow Flight server would be best suited for our use case, with the client sending a FlightDescriptor object that wraps the SQL query.

We parse the query and then query S3 using the AWS S3 SDKs.
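The client side of the flow described above could look roughly like the sketch below, written against the `pyarrow.flight` client API. It is only an illustration under stated assumptions: the helper names `sql_to_command` and `fetch_table`, the server address in the usage note, and the idea that the server accepts raw UTF-8 SQL as the descriptor command are all hypothetical and not taken from the thread.

```python
def sql_to_command(sql: str) -> bytes:
    """Encode a SQL string as the opaque command bytes of a FlightDescriptor."""
    return sql.strip().encode("utf-8")


def fetch_table(location: str, sql: str):
    """Send the SQL as a command descriptor and read every endpoint into one table."""
    # Imported lazily so the helper above stays usable without pyarrow installed.
    import pyarrow as pa
    import pyarrow.flight as flight

    client = flight.connect(location)
    descriptor = flight.FlightDescriptor.for_command(sql_to_command(sql))

    # The server decides how to execute the query and returns one or more endpoints.
    info = client.get_flight_info(descriptor)

    # Simplification: every ticket is redeemed on the same connection, ignoring
    # any alternative locations an endpoint may advertise.
    tables = [client.do_get(endpoint.ticket).read_all() for endpoint in info.endpoints]
    if not tables:
        raise ValueError("server returned no endpoints for this query")
    return pa.concat_tables(tables)
```

A call such as `fetch_table("grpc://localhost:8815", "SELECT * FROM events")` would then return a `pyarrow.Table`, assuming a Flight server is listening at that (made-up) address and understands the query.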

Apache Iceberg may be overkill for us because we are not maintaining a data lake; this will be short-lived, temporary data.

Again, grateful for your reply and such an engaging conversation, @tustvold.

Waiting to hear back from you

1 reply

tustvold Oct 15, 2024
Collaborator

My recommendation given what you describe would be to try using Athena + Glue's managed Metastore and go from there. Whilst you will give up control, you will be able to find lots of guides and recommendations online on how to adapt these to your needs, and they're based on mature systems - Trino and Hive.

Depending on your query workload, you may find you benefit from the better statistics and layout optimisations of tools like deltalake and iceberg. These technologies are compatible with an approach using Athena, and so could be deployed as/when needed.

Given your line of questioning, I'd strongly advise against rolling your own query solution; there is a lot of complexity to doing it well.

Susmit07 · 2024-10-15T13:43:45Z

Susmit07
Oct 15, 2024
Author

We are not going to use the AWS ecosystem / services; even for S3 we are using Cloudian-managed S3.

0 replies

Susmit07 · 2024-10-15T13:45:08Z

Susmit07
Oct 15, 2024
Author

With a data lake, one of the problems will be that once the actual data files are purged by object lifecycle policies, the transaction files in Delta and Iceberg will be left dangling.

1 reply

tustvold Oct 15, 2024
Collaborator

Both iceberg and deltalake have lifecycle mechanisms that avoid this issue. Ultimately you will need a catalog of some form to be able to perform queries; it just becomes a question of whether you build your own, run something like hive, or use something object store based.
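To make the "build your own" option concrete, here is a toy sketch of a catalog that expires its entries on the same retention window as the object store, so it never points at purged files. It is not how iceberg or deltalake implement their lifecycle mechanisms; the `CatalogEntry` type, the key layout, and the 90-day window (taken from the "max 3 months" mentioned earlier in the thread) are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class CatalogEntry:
    key: str              # object key of a Parquet file (hypothetical layout)
    written_at: datetime  # when the sync service wrote the object


def split_live_and_expired(entries, retention, now):
    """Partition entries using the same retention window the object store enforces."""
    cutoff = now - retention
    live = [e for e in entries if e.written_at >= cutoff]
    expired = [e for e in entries if e.written_at < cutoff]
    return live, expired


now = datetime(2024, 10, 15, tzinfo=timezone.utc)
entries = [
    CatalogEntry("team-a/part-0001.parquet", now - timedelta(days=120)),
    CatalogEntry("team-a/part-0002.parquet", now - timedelta(days=10)),
]

# Assumed 90-day retention, matching the lifecycle policy on the bucket.
live, expired = split_live_and_expired(entries, timedelta(days=90), now)
print([e.key for e in live])     # ['team-a/part-0002.parquet']
print([e.key for e in expired])  # ['team-a/part-0001.parquet']
```

Queries would only ever be planned against `live`, while `expired` entries are dropped from the catalog. Keeping the two retention windows in sync is exactly the kind of bookkeeping an off-the-shelf table format already handles.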
