Apache ORC Support in TensorFlow IO #1372

oliverhu · 2021-04-21T17:02:41Z

(Creating this issue for visibility so people interested can join the discussion... )

Overview

Load Apache ORC formatted data natively into TensorFlow from any file system that TensorFlow supports, e.g. HDFS or local disk.

Motivation

We have traditionally used Avro to store our datasets, but a row-based format is becoming inefficient for big data analytics processing. Historically, we selected ORC as our columnar storage format (not planning to argue Parquet vs ORC here ;)).

Design Discussions

Apache ORC would be brought in via rules_foreign_cc (https://github.com/bazelbuild/rules_foreign_cc).
Feature-wise, I expect the APIs to be similar to the Parquet or Arrow readers.

Milestones

Add the Apache ORC build dependency.
Implement a simple ORC dataset that maps records in ORC files into Tensors.
Add a tutorial for the ORC reader.
Feature schema support: SparseTensor and VarLenFeature.
Feature schema support: dense Tensor with FixedLenFeature only (following parse_example_v2).
Usability improvements.
Performance tuning.
Feature schema support: RaggedTensor.
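
For readers unfamiliar with the dense vs. sparse feature schemas named in the milestones, here is a minimal pure-Python sketch of the distinction. It is not the ORC reader's implementation and uses no TensorFlow API: `to_dense` and `to_sparse` are hypothetical helper names made up for illustration. The sketch assumes each record carries a list of values for one column, and it loosely mirrors how a FixedLenFeature yields a dense tensor of a fixed shape while a VarLenFeature yields the (indices, values, dense_shape) triple of a SparseTensor.

```python
from typing import List, Sequence, Tuple


def to_dense(rows: Sequence[Sequence[float]], length: int) -> List[List[float]]:
    """FixedLenFeature-style (hypothetical helper): every row must have exactly `length` values."""
    dense = []
    for i, row in enumerate(rows):
        if len(row) != length:
            raise ValueError(f"row {i} has {len(row)} values, expected {length}")
        dense.append(list(row))
    return dense


def to_sparse(
    rows: Sequence[Sequence[float]],
) -> Tuple[List[Tuple[int, int]], List[float], Tuple[int, int]]:
    """VarLenFeature-style (hypothetical helper): build a COO triple from ragged rows."""
    indices: List[Tuple[int, int]] = []
    values: List[float] = []
    max_len = 0
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            indices.append((r, c))  # (record index, position within the record)
            values.append(value)
        max_len = max(max_len, len(row))
    # dense_shape is (number of records, longest record in the batch)
    return indices, values, (len(rows), max_len)


if __name__ == "__main__":
    ragged = [[1.0, 2.0], [], [3.0]]
    print(to_sparse(ragged))
    print(to_dense([[1.0, 2.0], [3.0, 4.0]], length=2))
```

A RaggedTensor schema would keep the same values but store row boundaries instead of per-value indices, which is why it is listed as a separate milestone.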


kvignesh1420 · 2021-06-15T17:36:03Z

@oliverhu any update on this?

oliverhu · 2021-06-15T18:12:09Z

no update recently @kvignesh1420

kvignesh1420 · 2021-06-15T18:24:33Z

@oliverhu can we document the current feature in the form of a tutorial?

oliverhu · 2021-06-15T22:58:53Z

Sure, will add that!

kvignesh1420 · 2021-06-16T07:32:28Z

Reference FYKI: https://github.com/tensorflow/io/tree/master/docs/tutorials

372046933 · 2022-03-18T03:58:34Z

Is HDFS supported now? Loading from an HDFS path results in a core dump:

dataset = tfio.IODataset.from_orc("hdfs://xxx/yy/iris.orc", capacity=15).batch(1)

372046933 · 2022-04-28T07:27:10Z

Is HDFS supported now? Loading from HDFS path results in coredump
dataset = tfio.IODataset.from_orc("hdfs://xxx/yy/iris.orc", capacity=15).batch(1)

HDFS is supported (with Kerberos) by #1674.

oliverhu mentioned this issue Apr 21, 2021

Add Apache ORC build sequence with a sample reader file #1373

Merged

oliverhu mentioned this issue May 3, 2021

Implement ORC dataset reader #1383

Merged

oliverhu mentioned this issue Jun 25, 2021

Add ORC reader tutorial #1465

Merged

372046933 mentioned this issue Apr 27, 2022

Orc hdfs #1674

Open
