Use Avro as schema in TypedDataSet #282
Comments
Everything around shapeless and implicit derivation works around There are ways (typically code generators) that take a schema defined externally (say, avro or protocol buffers) and convert it to a This will flow similar to this:
@SemanticBeeng Is this what you had in mind?
Yes, this is central, ofc.
Yes, indeed, main design principle.
Yes. But, on top of this, I am suggesting:
We know Sometimes the integration is at the data level (not api): but same principles, techniques and benefits will apply if we use Think analytics platform interfacing with a There will be components in the analytics area that do not "know" Thoughts?
@SemanticBeeng sorry for taking so long to reply. I am on board with making the proper changes, if they need to happen, so that the code generated by avro4s is compatible with Frameless schemas (which for now are just simple case classes). I think there was a similar ask by @codeexplorer regarding scalaPB.
The gist of my suggestions was a design one and less about implementation.
At this time Did you get a chance to see this? "convert between sparkSQL schemas to avro data schema"
From an implementation point of view alone, Because of that, a few people consider it a poor way to define the type system behind one's domain models. Please review
Been thinking about this more. Any interest to make
https://twitter.com/semanticbeeng/status/1142400720324431873 Or somehow make This would advance the "beyond data pipeline" agenda and get away from current "stringly typed" technologies like
At least bring Related: https://twitter.com/semanticbeeng/status/1139789288856571904
Would it make sense to be able to introduce support for avro schema for TypedDataSet?

The current code defines schema based on the SparkSQL "language":
frameless/dataset/src/main/scala/frameless/TypedDatasetForwarded.scala
Lines 43 to 44 in 576eb67
On the other hand, frameless uses a Scala types based "schema" to define data sets. Using something like avro4s, the avro schema can be derived from types.

It is quite useful to be able to use avro as the schema in parquet files, for example: https://dzone.com/articles/understanding-how-parquet

See also "Write Avro records to a Parquet file.":
https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroParquetWriter.java#L34
In spark-bigquery there is already a schema converter that could be used to map to and from a SparkSql based schema. See "convert between sparkSQL schemas to avro data schema":
https://github.com/spotify/spark-bigquery/blob/master/src/main/scala/com/databricks/spark/avro/SchemaConverters.scala#L114-L131
Somewhat related to #280.
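
To make the idea of mapping a SparkSQL-style schema to an avro schema concrete, here is a minimal, dependency-free sketch. It is not the spark-bigquery converter and not frameless or avro4s code: ColType, Col and AvroSchemaSketch are hypothetical names introduced only for illustration, and only a handful of primitive types are covered. A real converter would presumably work on Spark's StructType and Avro's Schema classes and handle nested records, arrays, maps and logical types.

```scala
// Hypothetical stand-ins for a few SparkSQL-like column types (not Spark's own classes).
sealed trait ColType
case object IntCol extends ColType
case object LongCol extends ColType
case object DoubleCol extends ColType
case object BooleanCol extends ColType
case object StringCol extends ColType

// A column: name, type, and whether nulls are allowed.
final case class Col(name: String, tpe: ColType, nullable: Boolean = false)

object AvroSchemaSketch {
  // Avro primitive type names as used in Avro JSON schemas.
  def avroPrimitive(t: ColType): String = t match {
    case IntCol     => "int"
    case LongCol    => "long"
    case DoubleCol  => "double"
    case BooleanCol => "boolean"
    case StringCol  => "string"
  }

  // A nullable column is rendered as a union with "null",
  // the usual Avro encoding of an optional field.
  def avroField(c: Col): String = {
    val base = "\"" + avroPrimitive(c.tpe) + "\""
    val tpe  = if (c.nullable) s"""["null",$base]""" else base
    s"""{"name":"${c.name}","type":$tpe}"""
  }

  // Render a flat record schema as an Avro JSON schema string.
  def avroRecord(name: String, cols: Seq[Col]): String =
    s"""{"type":"record","name":"$name","fields":[${cols.map(avroField).mkString(",")}]}"""
}
```

For example, `AvroSchemaSketch.avroRecord("Person", Seq(Col("name", StringCol), Col("age", IntCol, nullable = true)))` yields a record schema with a plain string field and an `age` field typed as the union `["null","int"]`. With avro4s the same schema could instead be derived from a case class, which is closer to how frameless already treats types as the schema.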