
Initial PR #1

Merged: 4 commits merged into main from comet-upstream on Feb 9, 2024
Conversation

@sunchao (Member) commented Jan 24, 2024

This is the initial PR for Comet.

Related mailing list discussion: https://lists.apache.org/thread/0q1rb11jtpopc7vt1ffdzro0omblsh0s

@sunchao changed the title from "Initial PR for Comet donation" to "Initial PR" on Jan 24, 2024
@kazuyukitanimura (Contributor) left a comment

LGTM (disclosure: I am one of the authors). Thank you @sunchao.

@alamb (Contributor) commented Jan 24, 2024

Thank you @sunchao -- I plan to give this a review over the next day or two

Co-authored-by: Liang-Chi Hsieh <[email protected]>
Co-authored-by: Kazuyuki Tanimura <[email protected]>
Co-authored-by: Steve Vaughan Jr <[email protected]>
Co-authored-by: Huaxin Gao <[email protected]>
Co-authored-by: Parth Chandra <[email protected]>
Co-authored-by: Oleksandr Voievodin <[email protected]>
@andygrove (Member)

I was able to build the project and run some queries successfully. I plan on reviewing this over the weekend.

@@ -0,0 +1 @@
nightly-2023-09-05
Member

I'm curious to know why nightly Rust is required. It would be good to add some docs on this at some point.

Member Author

Yes, at this time it requires nightly Rust to compile. We started with stable Rust but at some point introduced some nightly-only features like "specialization". I think it is very easy to remove the dependency though - we can switch back to stable Rust later.
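
For context, "specialization" is the nightly-only feature that lets a more specific trait impl override a blanket default impl. A minimal sketch of the feature itself (illustrative only, not Comet's actual code):

#![feature(specialization)] // nightly-only: this is why stable Rust won't compile

trait Encode {
    fn encode(&self) -> Vec<u8>;
}

// Blanket default implementation for any Display type.
impl<T: std::fmt::Display> Encode for T {
    default fn encode(&self) -> Vec<u8> {
        self.to_string().into_bytes()
    }
}

// Specialized implementation that overrides the blanket impl for i32.
impl Encode for i32 {
    fn encode(&self) -> Vec<u8> {
        self.to_le_bytes().to_vec()
    }
}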

@alamb (Contributor) left a comment

I took a look at this code (obviously not the whole thing in detail) and I thought it was pretty awesome ❤️

The code I looked at is clear, well commented, and well tested.

I wonder if you have a public roadmap about where you hope to take this project?

As I understand it, the next step is to perform the IP clearance process: https://incubator.apache.org/ip-clearance/ (I can help with this if needed, as I did it for the object_store donation).

Once the IP clearance process is complete, I think this would make a great part of the Apache Arrow DataFusion project.

Some notes I found interesting while reviewing:

  1. There appears to be another implementation of Parquet in Java as well as in Rust.
  2. There is a set of kernels (e.g. core/src/execution/kernels/strings.rs) that seem somewhat similar to what is in arrow-rs and DataFusion.
  3. The docs imply there is codegen for filters, but I didn't find any reference to that in the code.

import org.apache.parquet.hadoop.metadata.ColumnPath;
import org.apache.parquet.io.SeekableInputStream;

public class BloomFilterReader implements FilterPredicate.Visitor<Boolean> {
Contributor

FWIW DataFusion's parquet reader handles bloom filters natively now, thanks to @hengfeiyang: https://github.com/apache/arrow-datafusion/blob/5e9c9a1f7cecabe6e6c40c8296adb517fac0da13/datafusion/core/src/datasource/physical_plan/parquet/row_groups.rs#L113

Though I don't think it supports encrypted ciphers
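
For reference, the pruning in that linked code boils down to a membership probe against each row group's split-block bloom filter. A hedged sketch against the parquet crate's Sbbf type (how the filter is loaded from the file varies by version):

use parquet::bloom_filter::Sbbf;

// A bloom filter can return false positives but never false negatives,
// so `false` proves the row group cannot contain the value and the
// row group can be skipped for an equality predicate.
fn row_group_may_contain(sbbf: &Sbbf, value: i64) -> bool {
    sbbf.check(&value)
}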

Member Author

That's great to know :)

I need to check the list of things that are in Parquet Java but not in the Rust implementation yet. I think Parquet encryption is definitely an important one.

@@ -0,0 +1,116 @@
/*
Contributor

I am fascinated to know (can be later) why Comet needs its own Parquet reader in Java -- maybe we can add any missing functionality upstream in parquet-rs.

@sunchao (Member Author) commented Jan 25, 2024

Yes, when we started, several things were not yet ready in the Rust implementation, so we chose this hybrid approach. The Rust implementation has definitely become much more mature now, and we do want to switch to it at some point.

I need to check what is still missing on the Rust side. Perhaps:

  • Parquet encryption support
  • Check all the predicates and see if they are supported (e.g., in/notIn?)
  • Dictionary pushdown? Maybe it is already supported.

We also needed to do a bunch of Spark-specific things in our native Parquet reader. For instance, Spark has a timestamp/date rebase feature for conversions from the old Julian calendar to the Gregorian calendar, and it also reads small-precision decimals into i32 or i64 on the Java side, which requires special handling.
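
On the decimal point: Spark holds a Decimal with precision <= 9 in an int and precision <= 18 in a long, so a native reader has to hand back those narrow representations. A rough sketch of the mapping (illustrative only, not Comet's actual code):

// Spark's Decimal uses an i32 for precision <= 9, an i64 for
// precision <= 18, and a wider unscaled representation beyond that.
enum SparkDecimalRepr {
    Int32(i32),
    Int64(i64),
    Wide(i128),
}

fn spark_decimal_repr(unscaled: i128, precision: u8) -> SparkDecimalRepr {
    match precision {
        0..=9 => SparkDecimalRepr::Int32(unscaled as i32),
        10..=18 => SparkDecimalRepr::Int64(unscaled as i64),
        _ => SparkDecimalRepr::Wide(unscaled),
    }
}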

}

#[cfg(test)]
mod tests {
Contributor

😆

Member Author

😏

Some(dt).and_then(|d| d.with_nanosecond(1_000 * (d.nanosecond() / 1_000)))
}

pub fn date_trunc_dyn(array: &dyn Array, format: String) -> Result<ArrayRef, ExpressionError> {
Contributor

FWIW, over time I hope we can move most functions like date_trunc out of DataFusion's core, and potentially have versions like this with Spark-compatible behavior available for others to use and help maintain.
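
For readers following along: the with_nanosecond line quoted above truncates to microsecond precision, and a Spark-compatible date_trunc extends the same idea to coarser units. A rough chrono-based sketch (hypothetical helper, not the PR's actual code):

use chrono::{NaiveDateTime, Timelike};

// Truncate a timestamp to the given unit by zeroing the finer fields,
// in the spirit of the `with_nanosecond` line quoted above.
fn trunc(dt: NaiveDateTime, unit: &str) -> Option<NaiveDateTime> {
    match unit {
        "microsecond" => dt.with_nanosecond(1_000 * (dt.nanosecond() / 1_000)),
        "second" => dt.with_nanosecond(0),
        "minute" => dt.with_nanosecond(0)?.with_second(0),
        "hour" => dt.with_nanosecond(0)?.with_second(0)?.with_minute(0),
        _ => None, // coarser units (day, week, ...) omitted in this sketch
    }
}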

Member Author

Definitely!

Contributor

This is what I have in mind, BTW, in case anyone has time to review: apache/datafusion#8705

@sunchao (Member Author) commented Jan 25, 2024

Thanks @alamb, really appreciated!

I wonder if you have a public roadmap about where you hope to take this project?

We don't have one yet. Internally we do have a roadmap under doc, but it was removed in this PR. We can add it back after the initial PR.

As I understand it, the next step is to perform the IP clearance process ...

That's great! I'll check how it was done for other projects, and let you know if I need any help with it.

There appears to be another implementation of Parquet in Java as well as in Rust.

Yes, the Comet Parquet reader is a hybrid implementation: the IO part is done in Java, while the decoding (to Arrow) and decompression are done natively. This is based on the assumption that we won't gain much performance by moving the IO part to native. By keeping it in Java, we are able to leverage various storage connectors, such as S3 and HDFS, that are already pretty mature, as well as Parquet features that are missing on the native side, like encryption support.

With that said, at some point we do want to switch to a fully native Parquet reader like the one in DF. This can potentially help to simplify a lot of the logic we currently have.
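
A simplified sketch of what that Java-IO / native-decode split looks like at the JNI boundary (hypothetical class and method names, using the jni crate; not Comet's actual interface):

use jni::objects::JClass;
use jni::sys::{jbyteArray, jlong};
use jni::JNIEnv;

// Hypothetical JNI entry point: Java does the IO (S3, HDFS, ...) and
// hands compressed page bytes to native code, which decompresses and
// decodes them into Arrow buffers.
#[no_mangle]
pub extern "system" fn Java_org_example_NativeDecoder_decodePage(
    env: JNIEnv,
    _class: JClass,
    page_bytes: jbyteArray,
    decoder_handle: jlong, // pointer to native decoder state owned by Java
) -> jlong {
    // Copy the page buffer from the JVM heap into native memory.
    let bytes = env.convert_byte_array(page_bytes).expect("read page bytes");
    // ... decompress and decode `bytes` into an Arrow batch here ...
    let _ = (bytes, decoder_handle);
    0 // e.g. the address of the decoded batch, shared back via Arrow FFI
}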

There is a set of kernels (e.g. core/src/execution/kernels/strings.rs) that seem somewhat similar to what is in arrow-rs and DataFusion.

Yes, I think we should be able to switch to the ones in DF now. These were added a long time ago, when some of the string kernels in DF still didn't support dictionaries, which is no longer true.
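
For context on what "dictionary support" buys here: a dictionary-aware string kernel runs over the (usually small) dictionary values once and reuses the existing keys, instead of materializing every row. A hedged arrow-rs sketch (constructor APIs vary across versions):

use std::sync::Arc;
use arrow::array::{ArrayRef, DictionaryArray};
use arrow::compute::kernels::substring::substring;
use arrow::datatypes::Int32Type;
use arrow::error::ArrowError;

// Apply `substring` to the dictionary values only, then rebuild the
// dictionary with the original keys; row data is never materialized.
fn substring_dict(
    dict: &DictionaryArray<Int32Type>,
    start: i64,
    length: Option<u64>,
) -> Result<ArrayRef, ArrowError> {
    let new_values = substring(dict.values().as_ref(), start, length)?;
    let rebuilt = DictionaryArray::try_new(dict.keys().clone(), new_values)?;
    Ok(Arc::new(rebuilt))
}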

The docs imply there is codegen for filters, but I didn't find any reference to that in the code.

This is something we want to do in Comet, but we haven't started yet :)

@liurenjie1024

Thanks @sunchao for this contribution, great work! Just curious: is there any performance report comparing with vanilla Spark?

@sunchao (Member Author) commented Jan 26, 2024

Hey @liurenjie1024, we haven't run the TPC-H/TPC-DS benchmarks recently, since some important features are still missing, such as join support (which we are working on at the moment). We plan to run these benchmarks once the coverage is better and publish the results in the repo. For TPC-H q1, where we do support most of the operators, I think we saw a 5x+ improvement (it can definitely go higher with further optimizations).

@liurenjie1024

That's awesome!

@andygrove (Member)

I have spent some time looking at the code and found it to be well-written and easy to navigate. As I previously mentioned, I was able to run some queries and see performance improvements over regular Spark, so this LGTM as a donation.

I believe that the next step is to have a formal vote on accepting this donation, and we will need to link to that mailing list discussion as part of the IP clearance process.

I have created a Google document where the contributors can fill out the information needed to start the IP clearance process.

https://docs.google.com/document/d/1azmxE1LERNUdnpzqDO5ortKTsPKrhNgQC4oZSmXa8x4/edit?usp=sharing

@andygrove (Member)

Mailing list thread for the vote: https://lists.apache.org/thread/sk70pkhwmt8vgn0thtr04qg4mpqsgfvx

@kou (Member) commented Jan 27, 2024

Can we check the RAT (https://creadur.apache.org/rat/) result? For example, apache/arrow-rs uses https://github.com/apache/arrow-rs/blob/master/dev/release/run-rat.sh to run RAT.

@viirya (Member) commented Jan 27, 2024

I manually ran the script on this PR.

NOT APPROVED: .gitignore (./.gitignore): false
NOT APPROVED: Makefile (./Makefile): false
NOT APPROVED: filtered_rat.txt (./filtered_rat.txt): false
NOT APPROVED: rat.txt (./rat.txt): false
NOT APPROVED: rat_exclude_files.txt (./rat_exclude_files.txt): false
NOT APPROVED: rust-toolchain (./rust-toolchain): false
NOT APPROVED: core/.lldbinit (./core/.lldbinit): false
NOT APPROVED: core/Cargo.lock (./core/Cargo.lock): false
NOT APPROVED: spark/src/test/resources/tpch-extended/q1.sql (./spark/src/test/resources/tpch-extended/q1.sql): false
       9 unapproved licences. Check rat report: rat.txt

filtered_rat.txt, rat.txt, and rat_exclude_files.txt are the files related to the RAT check script.
q1.sql is a SQL test query file; in DataFusion, these files don't have license headers either.
Cargo.lock is automatically generated by Cargo.
It seems we don't need to add a license header to rust-toolchain.

I think core/.lldbinit is the debugger's config file and was committed by mistake. I removed it.

The only remaining missing license header was in the Makefile. I just added it.

@sunchao (Member Author) commented Jan 28, 2024

Thanks @viirya !

I think core/.lldbinit is the debugger's config file and was committed by mistake. I removed it.

I think this is a sample file. It is mentioned in DEBUGGING.md

@viirya (Member) commented Jan 28, 2024

I think this is a sample file. It is mentioned in DEBUGGING.md

Oh, got it. I removed it for now and updated DEBUGGING.md. If we need it, we can add it back later. Thanks.

@kou (Member) commented Jan 28, 2024

Thanks!

@alamb (Contributor) commented Jan 28, 2024

In DataFusion, these files don't have license headers either.

I think we have a list of files that are excluded from the RAT check -- specifically https://github.com/apache/arrow-datafusion/blob/main/dev/release/rat_exclude_files.txt

@advancedxy (Contributor)

This is awesome and exciting. Just curious: how many (what percentage of) internal workloads are already running on this, if it's OK to share publicly?

I think I can help contribute a bit to fill the semantic gaps between this and vanilla Spark if needed.

@sunchao (Member Author) commented Jan 29, 2024

@advancedxy thanks for the interest! It will be great to collaborate with you on this :)

All of our Spark 3.4 production workloads are already using Comet, although only the native Parquet scan feature at the moment. We are finishing up some necessary work, including columnar shuffle support and unified memory management, before rolling out more features to our customers.

@advancedxy (Contributor)

All of our Spark 3.4 production workloads are already using Comet, although only the native Parquet scan feature at the moment.

Thanks for sharing. I think it's a smart strategy to roll out the migration incrementally like this.

README.md Outdated
# Apache Arrow DataFusion Comet

Comet is an Apache Spark plugin that uses [Apache Arrow DataFusion](https://arrow.apache.org/datafusion/)
as native runtime to achieve dramatic improvement in terms of query efficiency and query runtime.


"dramatic" seems a bit too dramatic 😉

BTW, if it's allowed to disclose, which companies are behind the development of Comet?

Member Author

Yes it is. We should remove this word for now.

The initial contributors are from Apple (as can be seen from the PR), but we'd love to collaborate with people from the open source community who want to achieve similar goals.

@andygrove (Member)

The vote to accept the donation has passed and the next step is to complete the IP clearance process.

I have started filling out the XML IP clearance form in #2

@andygrove (Member)

License check for the Rust dependencies:

$ cargo license
(MIT OR Apache-2.0) AND Unicode-DFS-2016 (1): unicode-ident
0BSD OR Apache-2.0 OR MIT (1): adler
Apache-2.0 (40): arrow, arrow-arith, arrow-array, arrow-buffer, arrow-cast, arrow-csv, arrow-data, arrow-ipc, arrow-json, arrow-ord, arrow-row, arrow-schema, arrow-select, arrow-string, ciborium, ciborium-io, ciborium-ll, datafusion, datafusion-common, datafusion-execution, datafusion-expr, datafusion-optimizer, datafusion-physical-expr, datafusion-physical-plan, datafusion-sql, debugid, flatbuffers, parquet, parquet-format, pprof, prost, prost, prost-build, prost-derive, prost-derive, prost-types, sqlparser, sqlparser_derive, thrift, thrift
Apache-2.0 OR Apache-2.0 WITH LLVM-exception OR MIT (3): linux-raw-sys, rustix, wasi
Apache-2.0 OR BSD-2-Clause OR MIT (2): zerocopy, zerocopy-derive
Apache-2.0 OR BSL-1.0 (1): ryu
Apache-2.0 OR CC0-1.0 (1): blake3
Apache-2.0 OR CC0-1.0 OR MIT-0 (1): constant_time_eq
Apache-2.0 OR GPL-2.0 OR GPL-3.0 OR MIT (1): assertables
Apache-2.0 OR MIT (200): addr2line, ahash, allocator-api2, android-tzdata, android_system_properties, anes, anstyle, anyhow, arc-swap, arrayvec, async-trait, autocfg, backtrace, base64, bitflags, bitflags, blake2, block-buffer, bumpalo, cast, cc, cesu8, cfg-if, chrono, chrono-tz, chrono-tz-build, clap, clap_builder, clap_lex, const-random, const-random-macro, core-foundation-sys, cpp_demangle, cpufeatures, crc32fast, criterion, criterion-plot, crossbeam-deque, crossbeam-epoch, crossbeam-utils, crypto-common, derivative, destructure_traitobject, digest, either, equivalent, errno, fastrand, findshlibs, fixedbitset, flate2, fnv, form_urlencoded, futures, futures-channel, futures-core, futures-executor, futures-io, futures-macro, futures-sink, futures-task, futures-util, getrandom, gimli, glob, half, half, hashbrown, hashbrown, heck, heck, hermit-abi, hex, home, humantime, iana-time-zone, iana-time-zone-haiku, idna, indexmap, indexmap, itertools, itertools, itertools, itoa, jni, jni-sys, jobserver, js-sys, lazy_static, lexical-core, lexical-parse-float, lexical-parse-integer, lexical-util, lexical-write-float, lexical-write-integer, libc, libm, linked-hash-map, lock_api, log, log-mdc, log4rs, md-5, memmap2, multimap, num, num-bigint, num-complex, num-format, num-integer, num-iter, num-rational, num-traits, num_cpus, object, object_store, once_cell, parking_lot, parking_lot_core, paste, percent-encoding, petgraph, pin-project-lite, pin-utils, pkg-config, ppv-lite86, proc-macro2, quote, rand, rand_chacha, rand_core, rayon, rayon-core, regex, regex-automata, regex-syntax, rustc-demangle, rustc_version, rustversion, scopeguard, semver, seq-macro, serde, serde_derive, serde_json, serde_yaml, sha2, siphasher, smallvec, snafu, snafu-derive, stable_deref_trait, static_assertions, str_stack, syn, syn, tempfile, thiserror, thiserror-impl, thread-id, threadpool, tinytemplate, typenum, unicode-bidi, unicode-normalization, unicode-segmentation, unicode-width, url, uuid, version_check, wasm-bindgen, wasm-bindgen-backend, wasm-bindgen-macro, wasm-bindgen-macro-support, wasm-bindgen-shared, web-sys, winapi, winapi-i686-pc-windows-gnu, winapi-x86_64-pc-windows-gnu, windows-core, windows-sys, windows-targets, windows-targets, windows_aarch64_gnullvm, windows_aarch64_gnullvm, windows_aarch64_msvc, windows_aarch64_msvc, windows_i686_gnu, windows_i686_gnu, windows_i686_msvc, windows_i686_msvc, windows_x86_64_gnu, windows_x86_64_gnu, windows_x86_64_gnullvm, windows_x86_64_gnullvm, windows_x86_64_msvc, windows_x86_64_msvc, yaml-rust, zstd-safe, zstd-sys
Apache-2.0 OR MIT OR Zlib (4): bytemuck, miniz_oxide, tinyvec, tinyvec_macros
BSD-2-Clause (1): arrayref
BSD-3-Clause (4): alloc-no-stdlib, alloc-stdlib, snap, subtle
BSD-3-Clause OR MIT (2): brotli, brotli-decompressor
CC0-1.0 (1): tiny-keccak
CDDL-1.0 (1): inferno
MIT (46): bytes, combine, comfy-table, crunchy, dashmap, doc-comment, generic-array, integer-encoding, integer-encoding, is-terminal, lz4, lz4-sys, nix, oorandom, ordered-float, ordered-float, parse-zoneinfo, phf, phf_codegen, phf_generator, phf_shared, plotters, plotters-backend, plotters-svg, quick-xml, redox_syscall, rgb, serde-value, simd-adler32, slab, strum, strum_macros, symbolic-common, symbolic-demangle, tokio, tokio-macros, tokio-stream, tokio-util, tracing, tracing-attributes, tracing-core, twox-hash, typemap-ors, unsafe-any-ors, which, zstd
MIT OR Unlicense (8): aho-corasick, byteorder, csv, csv-core, memchr, same-file, walkdir, winapi-util
N/A (1): comet

Could you add https://github.com/apache/arrow-datafusion/blob/main/LICENSE.txt to the root of the repo in this PR?

@andygrove (Member)

I manually checked the Maven dependencies, and the licenses are all good.

@sunchao (Member Author) commented Feb 3, 2024

Could you add https://github.com/apache/arrow-datafusion/blob/main/LICENSE.txt to the root of the repo in this PR?

Sure @andygrove, just added the LICENSE.txt

@alamb mentioned this pull request on Feb 6, 2024
@andygrove (Member)

I have started the IP clearance vote: https://lists.apache.org/thread/lj3j4q7snpzrfo3mh3cph26mdpr2jrfx

@EpsilonPrime left a comment

I glanced through the PR and am excited to see it being shared with the open source community. May the project flourish!

  System.setProperty(key, value);
} else {
  LOG.info(
      "Skip setting system property {} to {}, because it is already set to {}",


Skipped

Util.readColumnIndex(inputStream, columnIndexDecryptor, columnIndexAAD));
}

// Visible for testing


Can this comment be replaced with an annotation?

Member Author

We need to check. I remember we did this explicitly to avoid an additional dependency.

@andygrove (Member)

The IP clearance vote has passed.

@andygrove merged commit 383c8fd into main on Feb 9, 2024
@viirya (Member) commented Feb 9, 2024

Thanks all for the help on this!

@alamb (Contributor) commented Feb 11, 2024

Nice work -- so excited!

@alamb deleted the comet-upstream branch on February 11, 2024