WIP use Tables.Columns instead of columntable
#247
base: master
For which situations do we need |
cols = termvars(formula)
materialize = Tables.materializer(data)
data = materialize(TableOperations.select(cols...)(data))
drop = TableOperations.narrowtypes() ∘ TableOperations.dropmissing()
AFAICT TableOperations.dropmissing operates row-wise (it calls filter). I'm afraid this is going to kill performance for data frames. Maybe an optimized method for column tables could be added? (EDIT: That's probably doable, as we can use a faster approach than filter since we know that the condition can be computed separately for each row.) Another solution would be to define dropmissing in DataAPI, say that dropmissing(::Any) is owned by TableOperations, but have dropmissing(::DataFrame) be defined in DataFrames.
Also, narrowtypes is a much more costly operation than just doing nonmissingtype(eltype(col)), as it requires going over all entries. DataFrames's dropmissing does that by default; maybe TableOperations could take a similar argument.
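A minimal sketch of the column-wise idea in this comment, using only Base Julia. The helper name dropmissing_columnwise is hypothetical and is not part of TableOperations, DataAPI, or DataFrames; it assumes the table is a non-empty NamedTuple of equal-length vectors. It builds one Boolean mask by scanning only the columns that can hold missing, then narrows each element type from the declared eltype via nonmissingtype rather than inspecting the values again.

```julia
# Hypothetical helper for illustration; not an API of any package.
# Assumes `cols` is a non-empty NamedTuple of equal-length vectors.
function dropmissing_columnwise(cols::NamedTuple)
    n = length(first(cols))
    keep = trues(n)                       # one Bool per row
    for col in cols
        # Only columns whose eltype allows `missing` need to be scanned.
        if Missing <: eltype(col)
            keep .&= .!ismissing.(col)
        end
    end
    # Index every column once with the shared mask, then narrow the element
    # type from the declared eltype alone.
    return map(cols) do col
        convert(Vector{nonmissingtype(eltype(col))}, col[keep])
    end
end

data = (a = [1, missing, 3], b = ["x", "y", missing], c = [1.0, 2.0, 3.0])
clean = dropmissing_columnwise(data)
```

Here only the first row survives, and clean.a has element type Int rather than Union{Missing, Int}. Whether a real implementation should narrow types this way or behind an optional argument is the open question raised in the comment.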
function schema(ts::AbstractVector{<:AbstractTerm},
                data,
                hints::Dict{Symbol}=Dict{Symbol,Any}())
    data = Tables.Columns(Tables.columns(data))
What is the advantage of wrapping the result of Tables.columns in a Tables.Columns object?
concrete_term(t::Term, d) = concrete_term(t, d, nothing)

# if the "hint" is already an AbstractTerm, use that
# need this specified to avoid ambiguity
concrete_term(t::Term, d::ColumnTable, hint::AbstractTerm) = hint
concrete_term(t::Term, d::Tables.Columns, hint::AbstractTerm) = hint
Why not just
- concrete_term(t::Term, d::Tables.Columns, hint::AbstractTerm) = hint
+ concrete_term(t::Term, d, hint::AbstractTerm) = hint
This uses Tables.Columns as a (potentially) lightweight wrapper around input tables that does not convert them to the strongly typed NamedTuple of Vectors representation. This might make some things easier on the compiler (e.g. #220). Requires Tables 1.6.0 since that's when Columns stopped being a lie ;)

There are some design issues to work out here still, since a generic NamedTuple could be EITHER a column table (if it contains vectors) or a single row, and there are a handful of methods that specialize on that to provide special handling (most notably modelcols(::InteractionTerm, ...)). What we PROBABLY will need to do is to add parallel methods for Row in a similar fashion, but I'm not sure about that. In the meantime, merging this would be breaking since you lose first-class support for named tuples of singletons, which is part of the current public API. There may be a way around that but I haven't dug into it yet...
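To make the NamedTuple ambiguity concrete, here is a small Base-only illustration. The function is_column_table is a hypothetical heuristic written for this example; it is not how StatsModels or Tables decides between the two cases.

```julia
# Hypothetical check, for illustration only: treat a NamedTuple as a column
# table when every field is a vector, and as a single row otherwise.
is_column_table(nt::NamedTuple) = all(v -> v isa AbstractVector, values(nt))

column_table = (a = [1, 2, 3], b = ["x", "y", "z"])   # three rows, two columns
single_row   = (a = 1, b = "x")                       # one row of scalars

is_column_table(column_table)   # true
is_column_table(single_row)     # false

# Both values have the same field names and are both NamedTuples, so a method
# written for NamedTuple alone cannot distinguish them by dispatch.
typeof(column_table) <: NamedTuple && typeof(single_row) <: NamedTuple
```

A value-based check like this cannot separate a column table from a single row whose fields happen to be vectors, which is one reason dedicated wrapper types for columns and rows may be the cleaner route.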