sorting into states directly is slow #22

djsutherland · 2020-02-17T07:11:05Z

Seems like appending with pandas/PyTables is a lot slower than writing into a new file. (Or else the rewrite-when-strings-are-longer code path is being hit a lot.)

The sort step should probably pre-count lines per PUMA in stats, and maybe the max string lengths for the fields that need them. Then we can preallocate the files and write into them, instead of appending.
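A rough sketch of what that counting pass could look like, using only the standard library. The column names (`PUMA`, `OCCP`), the function name `first_pass_stats`, and the CSV input are assumptions for illustration, not the project's actual schema or code:

```python
import csv
import io
from collections import Counter, defaultdict


def first_pass_stats(lines, group_col="PUMA", string_cols=("OCCP",)):
    """Count rows per group and track the longest string seen per column.

    Hypothetical helper: `group_col` and `string_cols` are placeholder names.
    """
    rows_per_group = Counter()
    max_len = defaultdict(int)
    for row in csv.DictReader(lines):
        rows_per_group[row[group_col]] += 1
        for col in string_cols:
            # Character length; encode first if a byte width is needed.
            max_len[col] = max(max_len[col], len(row[col]))
    return rows_per_group, dict(max_len)


# Tiny in-memory example standing in for the real input file.
sample = io.StringIO(
    "PUMA,OCCP\n"
    "00100,1010\n"
    "00100,220\n"
    "00200,47\n"
)
counts, widths = first_pass_stats(sample)
```

With `counts` and `widths` in hand, the second pass would know how many rows each output needs and how wide each string column has to be before writing anything.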

Probably should (also?) consider using feather or parquet instead of hdf5.

djsutherland · 2020-02-17T22:52:15Z

Seems like in this case:

- feather is way faster to load but also bigger on disk
- parquet is slightly smaller than hdf5 and way faster to load

So parquet seems like the way to go. Unfortunately, it doesn't seem to be very appendable (since it's columnar). Could write it in chunks and then do a (probably quick) rewrite at the end. Or look into dask for everything (#23).

For now, manually converting hdf5 => parquet post-sorting and letting featurize support either.

djsutherland · 2020-02-26T21:25:47Z

With the new two-pass scheme with the merge at the end, the state merger is fast, but the PUMA merger is quite slow. Not sure whether this is due to casting categorical dtypes or just i/o.

Could merge in a separate thread as we go? Or again, maybe dask #23 solves this better.
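A minimal sketch of the merge-as-we-go idea, with a worker thread draining a queue. Plain dicts of lists stand in for the real per-state/per-PUMA chunks, and `merge_worker` is a hypothetical name. A thread like this would mostly help if the merge is i/o-bound; if the time goes to CPU-bound dtype casting, it may not buy much:

```python
import queue
import threading


def merge_worker(inbox, merged):
    """Fold chunks into `merged` as they arrive; a None sentinel stops the loop."""
    while True:
        chunk = inbox.get()
        if chunk is None:
            break
        for key, rows in chunk.items():
            merged.setdefault(key, []).extend(rows)


inbox = queue.Queue()
merged = {}
worker = threading.Thread(target=merge_worker, args=(inbox, merged))
worker.start()

# The sorting loop would put each finished chunk on the queue and keep going.
for chunk in [{"PA": [1, 2], "NY": [3]}, {"PA": [4]}, {"NY": [5, 6]}]:
    inbox.put(chunk)

inbox.put(None)  # signal that no more chunks are coming
worker.join()
```

Since only the worker touches `merged` until `join()` returns, no extra locking is needed in this single-consumer setup.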

djsutherland · 2020-02-26T21:27:19Z

Could also multiprocess the merging.
