#tidymodels


Introducing support for postprocessing in tidymodels!

Postprocessors refine the predictions output by machine learning models to improve predictive performance or to better satisfy distributional limitations.

The tidymodels team has been working on a set of changes across many #tidymodels packages to introduce support for postprocessing. They would love to hear your thoughts on their progress so far!
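For a flavor of what this looks like in code, here is a minimal sketch based on my reading of the announcement, assuming the experimental {tailor} package and the add_tailor() verb in {workflows}; the formula and data are made up, and the interface may still change, so check the blog post for the current API.

```r
library(tidymodels)
library(tailor)   # experimental postprocessing package

# A postprocessor spec: lower the classification threshold for a rare event
post <- tailor() |>
  adjust_probability_threshold(threshold = 0.1)

# Postprocessors slot into a workflow next to the preprocessor and the model
wf <- workflow() |>
  add_formula(class ~ .) |>        # placeholder formula
  add_model(logistic_reg()) |>
  add_tailor(post)
```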

Learn more in the blog post: tidyverse.org/blog/2024/10/pos

We have five posit::conf(2024) workshops for #RStats and #Python modeling and ML enthusiasts!

• Causal Inference in R, led by @malcolmbarrett and @travisgerke
• Introduction to machine learning in Python with Scikit-learn, led by @TiffanyTimbers and Trevor Campbell
• Intro to MLOps with vetiver, led by @isabelizimm
• Introduction to tidymodels, led by @hfrick and @simonpcouch
• Advanced Tidymodels, led by @topepo

reg.conf.posit.co/flow/posit/p


Preprint from Simon Wood on the new neighbourhood cross-validation (NCV) smoothness estimation in #mgcv: arxiv.org/abs/2404.16490. It's a neat, performant, and data-efficient way to estimate GAMs based on complex CV splits (like spatial/temporal/phylo ones).

See ?NCV in the latest {mgcv} for examples (cran.r-universe.dev/mgcv/doc/m)

I might write a helper to convert {rsample}/{spatialsample} objects into mgcv's funny CV indexing structure.
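A rough sketch of what that helper could look like, based on my reading of the nei list format in ?NCV (elements k, m, i, mi); the element semantics here are an assumption to verify against the help page before relying on it.

```r
library(rsample)

# Hypothetical helper: convert an {rsample}/{spatialsample} rset into the
# `nei` list that mgcv::gam(..., method = "NCV", nei = ...) expects.
rset_to_nei <- function(rset) {
  # assessment (held-out) row indices for each resample
  held_out <- lapply(rset$splits, rsample::complement)
  k <- unlist(held_out)           # rows omitted for each neighbourhood
  m <- cumsum(lengths(held_out))  # last position in k belonging to each fold
  # predict the same rows that are dropped, i.e. plain leave-fold-out CV
  list(k = k, m = m, i = k, mi = m)
}

# e.g. nei <- rset_to_nei(vfold_cv(my_data, v = 10))
```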

#rstats #ml #tidymodels #mgcvchat @MikeMahoney218 @gavinsimpson @ericJpedersen @millerdl

tidymodels has long supported parallelizing model fits across CPU cores. A couple of the modeling engines that #rstats #tidymodels supports for gradient boosting—#XGBoost and #LightGBM—have their own tools to parallelize model fits. A new blog post explores whether tidymodels users should use tidymodels' implementation, the engines', or both.
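Roughly, the two knobs being compared look like this (a sketch under assumed settings with placeholder data, not the post's benchmark code):

```r
library(tidymodels)

# Engine-level threading: XGBoost parallelizes each individual model fit
spec <- boost_tree(trees = 500) |>
  set_engine("xgboost", nthread = 4) |>
  set_mode("regression")

# tidymodels-level parallelism: resamples / grid points go to separate workers
library(doParallel)
registerDoParallel(cores = 4)
# fits <- tune_grid(spec, outcome ~ ., resamples = vfold_cv(dat), grid = 20)
```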

simonpcouch.com/blog/2024-05-1

www.simonpcouch.com · How to best parallelize boosted tree model fits with tidymodels | Simon P. Couch

It's a good weekend to learn survival analysis with tidymodels! ⏳

The tidymodels team wrote up a few case studies for you:

• Using survival analysis to see how long it takes the Department of Buildings in NYC to disposition complaints: tidymodels.org/learn/statistic
• Computing time-dependent measures of performance: tidymodels.org/learn/statistic

Read the announcement on survival analysis in tidymodels: tidyverse.org/blog/2024/04/tid
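If you want a quick taste of the API before diving in, here is a minimal sketch, assuming the {censored} extension package and the lung data from {survival}; the eval_time argument name reflects recent parsnip versions.

```r
library(tidymodels)
library(censored)   # survival model engines for parsnip
library(survival)   # Surv() and the lung data

mod <- proportional_hazards() |>
  set_engine("survival") |>
  fit(Surv(time, status) ~ age + ph.ecog, data = lung)

# Predicted survival probabilities at specific evaluation times
predict(mod, new_data = lung[1:5, ], type = "survival", eval_time = c(100, 365))
```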

Happy learning! #RStats #tidymodels

www.tidymodels.org · tidymodels - How long until building complaints are dispositioned? A survival analysis case study · Learn how to use tidymodels for survival analysis.

Do any statistical/ML software tools explicitly incorporate reusable holdout, where one uses thresholding, noise, or bootstrapping in holdout validation to prevent garden-of-forking-paths or overfitting issues?

I feel like this paper describing the method made a splash when it came out in 2015, but I haven't seen many implementations, at least in the R ecosystem: doi.org/10.48550/arXiv.1411.26

Seems like something the #tidymodels team might think about? @topepo @juliasilge #rstats

arXiv.org · Preserving Statistical Validity in Adaptive Data Analysis · A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple hypothesis testing. However, there is a fundamental disconnect between the theoretical results and the practice of data analysis: the theory of statistical inference assumes a fixed collection of hypotheses to be tested, or learning algorithms to be applied, selected non-adaptively before the data are gathered, whereas in practice data is shared and reused with hypotheses and new analyses being generated on the basis of data exploration and the outcomes of previous analyses. In this work we initiate a principled study of how to guarantee the validity of statistical inference in adaptive data analysis. As an instance of this problem, we propose and investigate the question of estimating the expectations of $m$ adaptively chosen functions on an unknown distribution given $n$ random samples. We show that, surprisingly, there is a way to estimate an exponential in $n$ number of expectations accurately even if the functions are chosen adaptively. This gives an exponential improvement over standard empirical estimators that are limited to a linear number of estimates. Our result follows from a general technique that counter-intuitively involves actively perturbing and coordinating the estimates, using techniques developed for privacy preservation. We give additional applications of this technique to our question.
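For context, the core mechanism in that paper (often called Thresholdout) is tiny; here is my rough, unvetted paraphrase in R, with made-up threshold and noise-scale values.

```r
# Only reveal the holdout estimate of a statistic when it disagrees with the
# training estimate by more than a noisy threshold; otherwise echo the
# training estimate, so the holdout isn't "spent" by the query.
thresholdout <- function(train_est, holdout_est, threshold = 0.04, sigma = 0.01) {
  if (abs(train_est - holdout_est) > threshold + rnorm(1, sd = sigma)) {
    holdout_est + rnorm(1, sd = sigma)  # noisy answer from the holdout data
  } else {
    train_est                           # training answer; holdout untouched
  }
}
```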

📷 Let's take a moment to relive our recent in-person gathering through these snapshots!

We had the pleasure of hosting María Paula Caldas, Data Scientist at the OECD, and Julie Aubert, Research Engineer at INRAE, who delivered #inspiring talks on the development of #packages and on statistical modelling with {#Tidymodels} in #R, respectively.

You can find the replay here:
👉 youtu.be/wEVKoPhB25g

👥@chaimaboughanmi @mouna_belaid @RLadiesGlobal @Posit

Q: What's a good prediction performance metric for a binary classification model where I expect all predictions to be well below 0.5? (<< .001)

We have calibration analysis (expected vs. observed number of positives, and CI coverage, for various probability bins on holdout data), but I'm not sure how to summarize that, or what else to use, to get a single metric for tracking model improvement.

Maybe some scoring rule designed for sparse count data?
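One conventional single-number option is a proper scoring rule computed on the holdout predictions; a sketch with {yardstick}, with column names and values made up:

```r
library(yardstick)
library(tibble)

# Toy holdout predictions where every probability is far below 0.5
preds <- tibble(
  truth       = factor(c("event", "none", "none", "none"), levels = c("event", "none")),
  .pred_event = c(0.0009, 0.0002, 0.0004, 0.0001)
)

mn_log_loss(preds, truth, .pred_event)  # mean log loss
brier_class(preds, truth, .pred_event)  # Brier score
```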

@juliasilge @topepo?