-
For datasets of that size, and with data stored in S3, you might want to consider something like Iceberg and/or Delta Lake. Whilst Arrow Flight is a valid way to expose a queryable API, ultimately you are going to end up building something similar to Delta Lake or Iceberg in order to query and manage your data efficiently, so you might as well cut out the middleman. Unless the database is the product and therefore the focus of your R&D, or you have non-standard query/access patterns, using something off the shelf like Delta Lake or Iceberg will likely get you to a working system faster.
-
If I go ahead with Arrow Flight, the internal query engine would be something like Presto, right?
-
Just looping in @zeroshade as well to this thread, to explore different views and have a healthy conversation.
-
The requirement is that we need to sync data from HDFS to short-term storage, which is S3 in our case: basically a data sync service between cloud storages. I have already built that service using the Apache Pekko / Akka HDFS and S3 connectors.

Now comes the data-reading part for end users. The data is stored as Parquet in AWS S3 short-term storage. We want to build a Data-as-a-Service layer on top of the data in S3 and expose API endpoints for clients to query. The data will be short-lived, anywhere from weeks to months (3 months at most), and use cases vary from team to team.

So we felt an Apache Arrow Flight server would be best suited for our use case: the client sends a FlightDescriptor object wrapping the SQL query, and we parse the query and query S3 using the AWS S3 SDKs. Apache Iceberg may be overkill for us because we are not maintaining a data lake; this is short-lived, temporary data.

Again, grateful for your reply and such an engaging conversation. @tustvold, waiting to hear back from you.
-
We are not going to use the AWS ecosystem or its services. Even for S3, we are using Cloudian-managed S3.
-
With a data lake, one of the problems is that once the actual data files are purged by object lifecycle policies, the transaction files in Delta and Iceberg will be left lying around, dangling.
-
Hi Developers & Experts of the community,
We have built short-term storage on S3, and we want to expose client APIs for end users to fetch the datasets.
We are planning to build an Arrow Flight server on top of the data in S3; the S3 storage holds petabyte-scale data.
Is this the right use case for an Arrow Flight server and client?
Hoping to get some useful information that helps us. The server will run in pods on Kubernetes.
Thanks & Grateful
Susmit