Initial PR #1
Conversation
Force-pushed from 0ad0ba7 to 3feecfe
LGTM (disclosure: I am one of the authors) Thank you @sunchao
Thank you @sunchao -- I plan to give this a review over the next day or two
Co-authored-by: Liang-Chi Hsieh <[email protected]> Co-authored-by: Kazuyuki Tanimura <[email protected]> Co-authored-by: Steve Vaughan Jr <[email protected]> Co-authored-by: Huaxin Gao <[email protected]> Co-authored-by: Parth Chandra <[email protected]> Co-authored-by: Oleksandr Voievodin <[email protected]>
Force-pushed from 3feecfe to 2b95ac4
I was able to build the project and run some queries successfully. I plan on reviewing this over the weekend.
@@ -0,0 +1 @@
nightly-2023-09-05
I'm curious to know why nightly Rust is required. It would be good to add some docs on this at some point.
Yes, at this time it requires nightly Rust to compile. We started with stable Rust but at some point introduced some nightly-only features like "specialization". I think it is very easy to remove the dependency though - we can switch back to stable Rust later.
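To illustrate why that forces nightly: the `specialization` feature gate is rejected outright by stable rustc. A minimal, self-contained sketch (illustrative names, not Comet's actual code):

```rust
// Requires nightly: stable rustc refuses this feature gate.
#![feature(specialization)]

trait Describe {
    fn describe(&self) -> String;
}

// Blanket implementation covering every type; `default` marks it overridable.
impl<T> Describe for T {
    default fn describe(&self) -> String {
        "some value".to_string()
    }
}

// Specialized implementation that overrides the blanket one for i32.
impl Describe for i32 {
    fn describe(&self) -> String {
        format!("the integer {}", self)
    }
}

fn main() {
    println!("{}", 42_i32.describe()); // "the integer 42"
    println!("{}", "hello".describe()); // "some value" (blanket impl)
}
```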
I took a look at this code (obviously not the whole thing in detail) and I thought it was pretty awesome ❤️
The code I looked at looks clear, well commented and well tested.
I wonder if you have a public roadmap about where you hope to take this project?
As I understand it, the next step is to perform the IP clearance process https://incubator.apache.org/ip-clearance/ (I can help with this if you need, as I did it for the object_store donation).
Once the IP clearance process is complete, I think this would make a great part of the Apache Arrow DataFusion project.
Some notes I found interesting while reviewing:
- There appears to be another implementation of Parquet in Java as well as in Rust.
- There is a set of kernels (e.g. core/src/execution/kernels/strings.rs) that seems somewhat similar to what is in arrow-rs and DataFusion (see the sketch below).
- The docs imply there is codegen for filters, but I didn't find any reference to that in the code.
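For comparison, a hedged example of the kind of string kernel arrow-rs already ships -- the substring kernel; the module path and signature here are from memory and may vary across arrow-rs versions:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::compute::kernels::substring::substring;

fn main() -> Result<(), arrow::error::ArrowError> {
    // A dictionary-free example; the discussion further below notes that
    // Comet's kernels additionally handle dictionary-encoded inputs.
    let input: ArrayRef = Arc::new(StringArray::from(vec!["hello", "arrow"]));
    let sliced = substring(input.as_ref(), 1, Some(3))?;
    println!("{:?}", sliced); // expect ["ell", "rro"]
    Ok(())
}
```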
import org.apache.parquet.hadoop.metadata.ColumnPath;
import org.apache.parquet.io.SeekableInputStream;

public class BloomFilterReader implements FilterPredicate.Visitor<Boolean> {
FWIW DataFusion's parquet reader handles bloom filters natively now thanks to @hengfeiyang https://github.com/apache/arrow-datafusion/blob/5e9c9a1f7cecabe6e6c40c8296adb517fac0da13/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L113
Though I don't think it supports encrypted ciphers
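For reference, a rough sketch of probing a column bloom filter with parquet-rs's sync reader; the builder and property names here are an assumption based on recent parquet-rs versions and may differ:

```rust
use std::fs::File;

use parquet::file::properties::ReaderProperties;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::{ReadOptionsBuilder, SerializedFileReader};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Bloom filter reading is opt-in, so enable it via the reader properties.
    let options = ReadOptionsBuilder::new()
        .with_reader_properties(ReaderProperties::builder().set_read_bloom_filter(true).build())
        .build();
    let reader = SerializedFileReader::new_with_options(File::open("data.parquet")?, options)?;

    // Probe column 0 of row group 0 before deciding whether to scan it.
    let row_group = reader.get_row_group(0)?;
    if let Some(filter) = row_group.get_column_bloom_filter(0) {
        // Bloom filters can return false positives but never false negatives.
        println!("row group may contain 42: {}", filter.check(&42_i64));
    }
    Ok(())
}
```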
That's great to know :)
I need to put together a list of things that are in Parquet Java but not in the Rust implementation yet. I think Parquet encryption is definitely an important one.
@@ -0,0 +1,116 @@
/*
I am fascinated to know (can be later) why comet needs its own parquet reader in Java -- maybe we can add any missing functionality upstream in parquet-rs
Yes, when we started there were several things that were not yet ready in the Rust implementation, so we chose this hybrid approach. The Rust implementation has definitely become much more mature now, and we do want to switch to it at some point.
We need to check what is still missing on the Rust side. Perhaps:
- Parquet encryption support
- Check all the predicates and see if they are supported (e.g., in/notIn?)
- Dictionary pushdown? Maybe it is already supported.
We also needed to do a bunch of Spark-specific things in our native Parquet reader. For instance, Spark has a timestamp/date rebase feature for conversions from the old Julian calendar to the Gregorian calendar, and it also reads small-precision decimals into i32 or i64 on the Java side, which requires special handling (a small sketch of that mapping follows).
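To make the decimal point concrete, here is a small sketch of the mapping being described; the precision thresholds (9 and 18 digits) match Spark's standard, non-legacy Parquet writer, but the enum and function names are illustrative only:

```rust
/// Physical representations a small-precision decimal can take in Parquet.
#[derive(Debug, PartialEq)]
enum DecimalPhysicalType {
    Int32,
    Int64,
    FixedLenByteArray,
}

/// Map a decimal's precision to the physical type carrying its unscaled value.
fn decimal_physical_type(precision: u8) -> DecimalPhysicalType {
    match precision {
        1..=9 => DecimalPhysicalType::Int32,   // unscaled value fits in i32
        10..=18 => DecimalPhysicalType::Int64, // unscaled value fits in i64
        _ => DecimalPhysicalType::FixedLenByteArray,
    }
}

fn main() {
    assert_eq!(decimal_physical_type(5), DecimalPhysicalType::Int32);
    assert_eq!(decimal_physical_type(12), DecimalPhysicalType::Int64);
    assert_eq!(decimal_physical_type(30), DecimalPhysicalType::FixedLenByteArray);
    println!("mapping checks passed");
}
```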
}

#[cfg(test)]
mod tests {
😆
😏
Some(dt).and_then(|d| d.with_nanosecond(1_000 * (d.nanosecond() / 1_000)))
}

pub fn date_trunc_dyn(array: &dyn Array, format: String) -> Result<ArrayRef, ExpressionError> {
FWIW over time I hope we can move most functions like date_trunc out of DataFusion's core, and potentially have versions like this with Spark-compatible behavior available for others to use and help maintain.
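As a small illustration of the scalar logic in the quoted snippet (assuming the chrono crate, which the `d.with_nanosecond(...)` call suggests; this is not Comet's actual kernel):

```rust
use chrono::{NaiveDate, Timelike};

fn main() {
    let dt = NaiveDate::from_ymd_opt(2024, 1, 15)
        .unwrap()
        .and_hms_nano_opt(10, 30, 45, 123_456_789)
        .unwrap();

    // Drop sub-microsecond digits, mirroring the quoted expression above.
    let micros = dt
        .with_nanosecond(1_000 * (dt.nanosecond() / 1_000))
        .unwrap();
    assert_eq!(micros.nanosecond(), 123_456_000);

    // Truncate to the start of the hour, i.e. date_trunc('hour', ts) semantics.
    let hour = dt
        .with_minute(0)
        .and_then(|d| d.with_second(0))
        .and_then(|d| d.with_nanosecond(0))
        .unwrap();
    println!("{hour}"); // 2024-01-15 10:00:00
}
```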
Definitely!
This is what I have in mind, BTW, in case anyone has time to review: apache/datafusion#8705
Thanks @alamb, really appreciated.
We don't have one yet. Internally we do have a roadmap under
That's great! I'll check how it was done for other projects, and let you know if I need any help with it.
Yes, the Comet Parquet reader is a hybrid implementation: the IO part is done in Java, while the decoding (to Arrow) and decompression are done natively. This is based on the assumption that we won't get much performance gain by moving the IO part to native code. By keeping it in Java, we are able to leverage various storage connectors such as S3 and HDFS, which are already pretty mature, as well as Parquet features that are missing on the native side, like encryption support. That said, at some point we do want to switch to a fully native Parquet reader like the one in DF, which could potentially simplify a lot of the logic we currently have.
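A tiny sketch of that split, with invented names purely for illustration (Comet's real JNI boundary is more involved): the JVM side supplies raw page bytes from whatever storage connector it uses, and the native side owns decompression and decoding.

```rust
/// Hypothetical boundary: the JVM side implements this over S3/HDFS streams.
trait PageSource {
    /// Fetch the raw (still compressed) bytes of one Parquet page.
    fn next_page(&mut self) -> Option<Vec<u8>>;
}

/// Stand-in for the JVM side in this sketch: pages already in memory.
struct InMemorySource {
    pages: Vec<Vec<u8>>,
}

impl PageSource for InMemorySource {
    fn next_page(&mut self) -> Option<Vec<u8>> {
        self.pages.pop()
    }
}

/// Native side: pull pages from whatever source the JVM provides and decode.
fn decode_all(source: &mut dyn PageSource) -> usize {
    let mut pages = 0;
    while let Some(bytes) = source.next_page() {
        // Real code would decompress `bytes` and decode values into Arrow buffers.
        let _ = bytes;
        pages += 1;
    }
    pages
}

fn main() {
    let mut src = InMemorySource {
        pages: vec![vec![0u8; 16], vec![1u8; 16]],
    };
    println!("decoded {} pages", decode_all(&mut src));
}
```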
Yes, I think we should be able to switch to the ones in DF now. These were added a long time ago, when some of the string kernels in DF still didn't support dictionaries, which is no longer true.
This is something we want to do in Comet, but we haven't started yet :)
Thanks @sunchao for this contribution, great work! Just curious whether there is any performance report compared with vanilla Spark?
Hey @liurenjie1024, we haven't run TPC-H/TPC-DS benchmarks recently since there are still some important features missing, such as join support (which we are working on at the moment). We plan to run these benchmarks once the coverage is better and publish the results in the repo. For TPC-H q01, where we do support most operators, I think we saw 5x+ improvements (it can definitely go higher with further optimizations).
That's awesome!
I have spent some time looking at the code and found it to be well-written and easy to navigate. As I previously mentioned, I was able to run some queries and see performance improvements over regular Spark, so this LGTM as a donation. I believe that the next step is to have a formal vote on accepting this donation, and we will need to link to that mailing list discussion as part of the IP clearance process. I have created a Google document where the contributors can fill out the information needed to start the IP clearance process: https://docs.google.com/document/d/1azmxE1LERNUdnpzqDO5ortKTsPKrhNgQC4oZSmXa8x4/edit?usp=sharing
Mailing list thread for the vote: https://lists.apache.org/thread/sk70pkhwmt8vgn0thtr04qg4mpqsgfvx
Can we check the RAT (https://creadur.apache.org/rat/) result?
I manually ran the script on this PR.
I think the only missing license header is
Thanks @viirya!
I think this is a sample file. It is mentioned in DEBUGGING.md
Oh, got it. I removed it for now and updated DEBUGGING.md. If we need it, we can add it back later. Thanks.
Thanks!
I think we have a list of files that are excluded from the RAT check -- specifically https://github.com/apache/arrow-datafusion/blob/main/dev/release/rat_exclude_files.txt
This is awesome and exciting. Just curious how many internal workloads (a percentage, maybe) are already on this one, if it's OK to share publicly? I think I can help/contribute a bit to fill the semantic gaps between this and vanilla Spark if needed.
@advancedxy thanks for the interest! It would be great to collaborate with you on this :) All of our Spark 3.4 production workloads are already using Comet, although only the native Parquet scan feature at the moment. We are finishing up some necessary work, including columnar shuffle support and unified memory management, before rolling out more features to our customers.
Thanks for sharing. I think rolling out migrations incrementally like this is a smart strategy.
README.md
# Apache Arrow DataFusion Comet

Comet is an Apache Spark plugin that uses [Apache Arrow DataFusion](https://arrow.apache.org/datafusion/)
as native runtime to achieve dramatic improvement in terms of query efficiency and query runtime.
„dramatic“ seems a bit too dramatic 😉
BTW, if it’s allowed to disclose, which companies are behind the development of comet?
Yes it is. We should remove this word for now.
The initial contributors are from Apple (as can be seen from the PR), but we'd love to collaborate with people from the open source community who want to achieve similar goals.
The vote to accept the donation has passed and the next step is to complete the IP clearance process. I have started filling out the XML IP clearance form in #2
License check for the Rust dependencies:
Could you add https://github.com/apache/arrow-datafusion/blob/main/LICENSE.txt to the root of the repo in the PR?
I manually checked the Maven dependencies and the licenses are all good.
Sure @andygrove, just added the
Update: moved the grant discussion to https://github.com/apache/arrow-datafusion-comet/pull/2/files#r1477313278
I have started the IP clearance vote: https://lists.apache.org/thread/lj3j4q7snpzrfo3mh3cph26mdpr2jrfx
I glanced through the PR and am excited to see it being shared with the open source community. May the project flourish!
  System.setProperty(key, value);
} else {
  LOG.info(
      "Skip setting system property {} to {}, because it is already set to {}",
Skipped
    Util.readColumnIndex(inputStream, columnIndexDecryptor, columnIndexAAD));
}

// Visible for testing
Can this comment be replaced with an annotation?
We need to check. I remember we did this explicitly to avoid some additional dependency.
The IP clearance vote has passed.
Thanks all for the help on this!
Nice work -- so excited!
add partial support for multiple parquet files
This is the initial PR for Comet.
Related mailing list discussion: https://lists.apache.org/thread/0q1rb11jtpopc7vt1ffdzro0omblsh0s