-
-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adds learners table and overloads lrn #142
Conversation
This comment has been minimized.
This comment has been minimized.
I'm not a fan of |
I agree we should discuss this properly during the sprint (and a good solution will help with the discussion about where to store learners/measures/pipelines0 |
I also do not understand why we should overload Could you add a reprex to the initial post showing the output? |
Because then a user doesn't need to do anything more than the usually would to load a function, why make it harder for them? Also if the learner is already attached then there is no extra overhead.
library(mlr3)
library(mlr3learners)
library(mlr3proba)
library(data.table)
extra_learners = rownames(
available.packages(repos = "https://mlr3learners.github.io/mlr3learners.drat")
)
lapply(extra_learners, require, character.only = TRUE, quietly = TRUE)
keys = mlr_learners$keys()
all_lrns = suppressWarnings(mlr3::lrns(keys))
learner_table = data.table(t(rbindlist(list(mlr3misc::map(all_lrns, function(.x) {
idsplt = strsplit(.x$id, ".", TRUE)[[1]]
list(idsplt[[2]], idsplt[[1]], .x$id, strsplit(.x$man, "::", TRUE)[[1]][1],
.x$packages[1], .x$properties, .x$feature_types, .x$predict_types)
})))))
colnames(learner_table) = c("name", "class", "id", "mlr3_package", "required_package",
"properties", "feature_types", "predict_types")
learner_table[, 1:4] = lapply(learner_table[, 1:4], as.character)
rm(all_lrns, extra_learners, keys)
list_mlr3learners = function(hide_cols = NULL, filter = NULL, tibble = FALSE) {
dt = copy(learner_table)
class = mlr3_package = required_package = NULL # hacky fix to prevent NOTE for global binding
if (!is.null(filter)) {
if (!is.null(filter$class)) {
dt = subset(dt, class %in% filter$class)
}
if (!is.null(filter$mlr3_package)) {
dt = subset(dt, mlr3_package %in% filter$mlr3_package)
}
if (!is.null(filter$required_package)) {
dt = subset(dt, required_package %in% filter$required_package)
}
if (!is.null(filter$properties)) {
dt = subset(dt, mlr3misc::map_lgl(dt$properties,
function(.x) any(filter$properties %in% .x)))
}
if (!is.null(filter$feature_types)) {
dt = subset(dt, mlr3misc::map_lgl(dt$feature_types,
function(.x) any(filter$feature_types %in% .x)))
}
if (!is.null(filter$predict_types)) {
dt = subset(dt, mlr3misc::map_lgl(dt$predict_types,
function(.x) any(filter$predict_types %in% .x)))
}
}
if (!is.null(hide_cols)) {
dt = subset(dt, select = !(colnames(dt) %in% hide_cols))
}
if (tibble) {
return(tibble::tibble(dt))
} else {
return(dt)
}
}
lrn = function(.key, ...) {
id = NULL
pkg = unlist(subset(list_mlr3learners(), id == .key)$mlr3_package)
inst = suppressWarnings(require(pkg, quietly = FALSE, character.only = TRUE))
if (!inst) {
ans = usethis::ui_yeah(
sprintf("%s is not installed but is required, do you want to install this now?", pkg),
n_no = 1
)
if (ans) {
install.packages(pkg, repos = "https://mlr3learners.github.io/mlr3learners.drat")
} else {
stop(sprintf("%s is not installed but is required.", pkg))
}
}
mlr3misc::dictionary_sugar_get(mlr_learners, .key, ...)
}
list_mlr3learners(hide_cols = c("properties", "feature_types"),
filter = list(class = "surv", predict_types = "distr"))
#> name class id mlr3_package
#> 1: akritas surv surv.akritas mlr3learners.proba
#> 2: coxph surv surv.coxph mlr3proba
#> 3: flexible surv surv.flexible mlr3learners.flexsurv
#> 4: kaplan surv surv.kaplan mlr3proba
#> 5: nelson surv surv.nelson mlr3learners.survival
#> 6: obliqueRSF surv surv.obliqueRSF mlr3learners.obliquersf
#> 7: parametric surv surv.parametric mlr3learners.survival
#> 8: penalized surv surv.penalized mlr3learners.penalized
#> 9: rfsrc surv surv.rfsrc mlr3learners.randomforestsrc
#> 10: rpart surv surv.rpart mlr3proba
#> required_package predict_types
#> 1: mlr3learners.proba crank,distr
#> 2: survival distr,crank,lp
#> 3: flexsurv distr,crank,lp
#> 4: survival crank,distr
#> 5: survival crank,distr
#> 6: obliqueRSF crank,distr
#> 7: survival distr,lp,crank
#> 8: penalized distr,crank
#> 9: randomForestSRC crank,distr
#> 10: rpart crank,distr
list_mlr3learners(tibble = TRUE)
#> # A tibble: 91 x 8
#> name class id mlr3_package required_package properties feature_types
#> <chr> <chr> <chr> <chr> <list> <list> <list>
#> 1 AdaB… clas… clas… mlr3learner… <chr [1]> <chr [2]> <chr [3]>
#> 2 C5 clas… clas… mlr3learner… <chr [1]> <chr [4]> <chr [3]>
#> 3 cfor… clas… clas… mlr3learner… <chr [1]> <chr [4]> <chr [4]>
#> 4 ctree clas… clas… mlr3learner… <chr [1]> <chr [3]> <chr [4]>
#> 5 cv_g… clas… clas… mlr3learners <chr [1]> <chr [3]> <chr [3]>
#> 6 debug clas… clas… mlr3 <chr [1]> <chr [3]> <chr [6]>
#> 7 extr… clas… clas… mlr3learner… <chr [1]> <chr [3]> <chr [2]>
#> 8 feat… clas… clas… mlr3 <chr [1]> <chr [5]> <chr [6]>
#> 9 fnn clas… clas… mlr3learner… <chr [1]> <chr [2]> <chr [2]>
#> 10 gamb… clas… clas… mlr3learner… <chr [1]> <chr [2]> <chr [4]>
#> # … with 81 more rows, and 1 more variable: predict_types <list>
lrn("regr.featureless")
#> <LearnerRegrFeatureless:regr.featureless>
#> * Model: -
#> * Parameters: robust=FALSE
#> * Packages: stats
#> * Predict Type: response
#> * Feature types: logical, integer, numeric, character, factor, ordered
#> * Properties: importance, missings, selected_features Created on 2020-07-22 by the reprex package (v0.3.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I really like the overall idea and think is an important addition.
We should discuss about the details though but I am sure we'll get there and find a good minimal approach.
See the comments in-line in addition.
Because then a user doesn't need to do anything more than the usually would to load a function, why make it harder for them?
This is good if one knows that {mlr3learners} overloads it. But this is not obvious. I see confusion incoming when people do not have mlr3learners loaded and call lrns()
and get a different behavior for plain {mlr3}.
If we can get a robust implementation, this could then also be an addition to lrns()
in mlr3 directly?
@@ -56,6 +56,7 @@ Suggests: | |||
ranger, | |||
rmarkdown, | |||
testthat, | |||
tibble, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not pull in tibble as a soft dep.
The following overloads the printer function for data.frame (can be extended for data.tables) and can live in everyone's .Rprofile
:
# tibble > data.frame
if (interactive() && "tibble" %in% rownames(utils::installed.packages())) {
print.data.frame = function(x, ...) {
tibble:::print.tbl(tibble::as_tibble(x), ...)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't CRAN give a warning/note when using triple colon in packages?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not for inclusion in a package, this is for a local .Rprofile
.
extra_learners = rownames( | ||
available.packages(repos = "https://mlr3learners.github.io/mlr3learners.drat") | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Wrap in tryCatch()
because it should not block if no internet is available.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is automated and created during builds then surely there is always internet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I did not see that this is not part of the function.
We could possibly store it in the package though the content then depends on the version which users have installed.
I guess querying an online resource (could also be the mlr3 GH repo) while requiring internet access would be better? What about both: Trying to query the online table and fall back to the local one included in the package (with a warning message that this one might not include all learners).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess querying an online resource (could also be the mlr3 GH repo) while requiring internet access would be better?
What's/where's the mlr3 GH repo?
all back to the local one included in the package (with a warning message that this one might not include all learners).
I think this would slightly defeat the point because say a user wants to install xgboost but does not know that it lives in mlr3learners and have therefore only installed mlr3. Then the table will not show xgboost nor will it be able to install it if called.
Unless you just mean only fall back to local when internet is not available? I guess that would make sense and be more intuitive than just erroring. Assuming the local one is identical to the code above except with no call to install.packages
/available.packages
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless you just mean only fall back to local when internet is not available?
I meant this, yes.
What's/where's the mlr3 GH repo?
The mlr3 GitHub repo.
Outlining the process again:
lrns()
queries an online resource which could live in the mlr3data repo (to not bloat mlr3 with too much CI updates) GitHub repo. This resource gets generated once a day.- If no internet access is available, a static version included in the mlr3 package installation is used. This option included a warning that a static version was used.
- In every mlr3 release, mlr3 ships with the most recent version of this static learner table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry completely misread and thought you were saying there's a separate repo just for certain GHaction automations.
I completely agree with the process outlined above. But would this not result in mlr3data
becoming an import not suggest as the lrn/lrns
functions would depend on this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The latest version of the learner table (which would be updated by the CI every day) would live in mlr3data and would be queried via the web
- The static version would live in mlr3 and ship with the package
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds good and just so I understand would the static version just have to be manually typed out and updated. Do you want me to try and set-up the build for mlr3data so we can close the first part (i.e. the online table)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First I would like to ask @mllg if he agrees that mlr3data would be a good place to store a .csv file containing this information?
That sounds good and just so I understand would the static version just have to be manually typed out
Not sure I understand what you mean by this.
- Read the .csv from mlr3data into an .rda file into mlr3
- Ship this .rda file (static snapshot) with an mlr3 release and use it as the static fallback
Apart from both you can continue to write a .csv containing the table that should be read in later.
Oh wait - is a .csv suitable? We have a nested structure here, don't we? We could write a JSON?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, sorry I have no experience with external data structures in R so just trying to visualise it, but understand it now :)
Sorry should have made this clear, I don't actually think |
No worries, I also have a better overview of the whole approach now. Should have suggested this in the first place probably. It is not a problem, we can simply open the final version once we are done iterating here :) |
@RaphaelS1 What is the status here? |
this is now implemented in mlr3extralearners, either it will stay in there if we use that or I can port it into another repo. I'll close this PR |
This is related to #81 and #82. As discussed with @pat-s I am opening this PR here to discuss my current implementation for this functionality, though it need not necessarily live in this repo (and is likely far from an end version).
In this branch are:
mlr3learners.C50
does not currently work because it gets pulled asmlr3learners.C.50
.lrn
function to first search the learner table to detect if the required mlr3 package is installed/attached. If this isn't the case thenusethis
is utilised to ask user if they want package installed. If so then learner is loaded, if not then errors. Probably makes more sense to move this functionality tomlr3misc::dictionary_sugar_get
but this would also required learner table creation moved here, or there is a dependency loop. A vectorised version forlrns/mlr3misc::dictionary_sugar_mget
follows naturally but is not included here.Code is currently commented (not roxygen) and is expected to break builds on this branch (also probably not lintr friendly yet)