Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP use Tables.Columns instead of columntable #247

Draft
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

kleinschmidt
Copy link
Member

This uses Tables.Columns as a (potentially) lightweight wrapper around input tables that does not convert them to the strongly typed NamedTuple of Vectors representation. This might make some things easier on the compiler (e.g. #220 ).

Requires Tables 1.6.0 since that's when Columns stopped being a lie ;)

There's some design issues to work out here still, since a generic NamedTuple could be EITHER a column table (if it contains vectors) or a single row, and there are a handful of methods that specialize on that to provide special handling (most notably modelcols(::InteractionTerm, ...)). What we PROBABLY will need to do is to add parallel methods for Row in a similar fashion, but I'm not sure about that. In the mean time, merging this would be breaking since you lose first class support for named tuples of singletons, which is part of the current public API. There may be a way around that but I haven't dug into it yet...

@kleinschmidt kleinschmidt mentioned this pull request Jan 24, 2023
5 tasks
@nalimilan
Copy link
Member

There's some design issues to work out here still, since a generic NamedTuple could be EITHER a column table (if it contains vectors) or a single row, and there are a handful of methods that specialize on that to provide special handling (most notably modelcols(::InteractionTerm, ...)). What we PROBABLY will need to do is to add parallel methods for Row in a similar fashion, but I'm not sure about that. In the mean time, merging this would be breaking since you lose first class support for named tuples of singletons, which is part of the current public API. There may be a way around that but I haven't dug into it yet...

For which situations do we need modelcols to take a row? For clarity, we could require row objects to be Tables.AbstractRow even if Tables.jl doesn't require that. Otherwise confusing things could happen (weird errors when the Tables.istable check fails for some reason...).

cols = termvars(formula)
materialize = Tables.materializer(data)
data = materialize(TableOperations.select(cols...)(data))
drop = TableOperations.narrowtypes() ∘ TableOperations.dropmissing()
Copy link
Member

@nalimilan nalimilan Jan 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT TableOperations.dropmissing operates row-wise (it calls filter). I'm afraid this is going to kill performance for data frames.

Maybe an optimized method for column tables could be added? (EDIT: That's probably doable, as we can use a faster approach than filter since we know that the condition can be computed separately for each row.) Another solution would be to define dropmissing in DataAPI, say that dropmissing(::Any) is owned by TableOperations, but have dropmissing(::DataFrame) be defined in DataFrames.

Also, narrowtypes is a much more costly operation that just doing nonmissingtype(eltype(col)) as it requires going over all entries. DataFrames's dropmissing does that by default, maybe TableOperations could take a similar argument.

function schema(ts::AbstractVector{<:AbstractTerm},
data,
hints::Dict{Symbol}=Dict{Symbol,Any}())
data = Tables.Columns(Tables.columns(data))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the advantage of wrapping the result of Tables.columns in a Tables.Columns object?

concrete_term(t::Term, d) = concrete_term(t, d, nothing)

# if the "hint" is already an AbstractTerm, use that
# need this specified to avoid ambiguity
concrete_term(t::Term, d::ColumnTable, hint::AbstractTerm) = hint
concrete_term(t::Term, d::Tables.Columns, hint::AbstractTerm) = hint
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just

Suggested change
concrete_term(t::Term, d::Tables.Columns, hint::AbstractTerm) = hint
concrete_term(t::Term, d, hint::AbstractTerm) = hint

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants