[FEA] Move all JSON parsing to the same backend as get_json_object #10804
Labels
epic
Issue that encompasses a significant feature or body of work
feature request
New feature or request
Is your feature request related to a problem? Please describe.
This is an epic intended to get us to a point where all JSON parsing functionality can be enabled by default. This is not intended to be the final long-term solution. We really want to have a common JSON parser/tokenizer that is owned and maintained by CUDF. But to get correctness and at least good-enough performance in the short term, we are going to go with this approach.
The first thing we need is to establish a performance baseline, so we can be sure that we are not regressing get_json_object as we change the tokenization to make it more configurable.
As a part of this we also need to finish writing all of the JSON tests we can come up with.
After this we need to do some refactoring to the JSON tokenizer in https://github.com/NVIDIA/spark-rapids-jni/blob/branch-24.06/src/main/cpp/src/json_parser.cuh. from_json and the JSON input format are configurable in a number of ways that we need to support. get_json_object and json_tuple are not configurable, and the current tokenizer has been hard coded to handle those settings.
Finally, we will need to write some custom implementations of different operators so we can hopefully improve the total performance.
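To make the tokenizer refactoring concrete, here is a minimal host-side sketch of what "configurable" could look like for one token type (string literals). Everything in it is an assumption for illustration: the tokenizer_options struct, its field names, and is_valid_string are hypothetical and are not taken from json_parser.cuh. The two flags are loosely modeled on Spark's allowSingleQuotes and allowUnquotedControlChars JSON options, and escape handling is deliberately simplified.

```cpp
#include <cassert>  // only needed for the usage example below
#include <cstddef>
#include <string>

// Hypothetical option set (illustrative names, not the real tokenizer's API).
struct tokenizer_options {
  bool allow_single_quotes           = false;  // accept 'text' as well as "text"
  bool allow_unescaped_control_chars = false;  // accept raw chars below 0x20 inside strings
};

// Returns true if `s` is one complete string literal acceptable under `opts`.
// Simplification: a backslash skips the next character without validating the escape.
bool is_valid_string(std::string const& s, tokenizer_options const& opts)
{
  if (s.size() < 2) { return false; }
  char const quote = s.front();
  bool const quote_ok = (quote == '"') || (quote == '\'' && opts.allow_single_quotes);
  if (!quote_ok) { return false; }

  for (std::size_t i = 1; i < s.size(); ++i) {
    char const c = s[i];
    if (c == '\\') {
      if (++i >= s.size()) { return false; }  // dangling backslash at end of input
      continue;
    }
    if (c == quote) { return i + 1 == s.size(); }  // closing quote must be the last char
    if (static_cast<unsigned char>(c) < 0x20 && !opts.allow_unescaped_control_chars) {
      return false;  // raw control character rejected under strict settings
    }
  }
  return false;  // never saw a closing quote
}
```

A caller with fixed settings would construct one tokenizer_options value and reuse it, while a configurable caller such as from_json would fill it in from user options. For example, with default options is_valid_string("'abc'", {}) is rejected, and with allow_single_quotes set to true the same input is accepted.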