Intro to tidymodels with nflfastR

Tom Mock: @thomas_mock

2020-10-24

1 / 90

What have I done?

  • Scored 100% of our game-winning goals on my 2010 College Lacrosse Team (we only won 1 game 😢)
  • Bachelor's in Kinesiology (2011) - Effect of Sugar vs Sugar-Free Mouth Rinse on Performance
  • Master's in Exercise Physiology (2014) - Effect of Exercise on Circulating Brain Biomarkers
  • PhD in Neurobiology (2018) - Effect of Aging + Glutathione Deficiency on Motor and Cognitive Function

What do I do?

  • RStudio, Customer Success (2018 - Current) - I run our High Technology + Sports vertical, helping RStudio's customers use our Pro and open source software to solve problems with data

  • #TidyTuesday - Weekly data analysis community project (~4,000 participants in past 3 years)

  • TheMockUp.blog - Explanatory blogging about How to do Stuff with data + #rstats, mostly with NFL data

  • espnscrapeR - collect data from ESPN's public APIs, mostly for QBR and team standings

2 / 90

The nflscrapR team:

(data from 2009-current)

  • Maksim Horowitz
  • Ron Yurko
  • Sam Ventura
  • Rishav Dutta

The nflfastR team:

(data from 2000-current)

  • Ben Baldwin
  • Sebastian Carl

These datasets set the bar for publicly-available NFL data!

3 / 90

Shoutout to the NFL Big Data Bowl

Kaggle link

This competition uses NFL’s Next Gen Stats data, which includes the position and speed of every player on the field during each play. You’ll employ player tracking data for all drop-back pass plays from the 2018 regular season. The goal of submissions is to identify unique and impactful approaches to measure defensive performance on these plays.

4 / 90

Personas

  • You've done some modeling in R
    💻 + 📈

  • You're interested in fitting models in R with sports data
    💻 + 📈 + 🏈

  • You've used the tidyverse for analyses and purrr specifically for nested data
    ⭐ + 🐱

  • You're new to tidymodels AND you want to fit models to sports data
    💡 + 💻 + 📈 + 🏈

  • You're awesome because you came to CMSAC!
    🙌 🏈

5 / 90

Focus for Today

90 Minutes (with breaks)

Binary classification:

  • Logistic Regression
  • Random Forest

Slides at: cmsac-tidymodels.netlify.app/
Source code at: github.com/jthomasmock/nfl-workshop

To follow along, you can read in the subsetted data with the code below:

raw_plays <- read_rds(
  url("https://github.com/jthomasmock/nfl-workshop/blob/master/raw_plays.rds?raw=true")
)
6 / 90

Level-Setting

As much as I'd love to learn and teach all of Machine Learning/Statistics in 90 min...

It's just not possible!

Goals for today

  • Make you comfortable with the syntax and packages via the tidymodels unified interface
  • So when you're learning or modeling on your own, you get to focus on the stats rather than re-learning different APIs over and over...

Along the way, we'll cover minimal examples and then some more quick best practices where tidymodels makes it easier to do more things!

7 / 90

tidymodels

tidymodels is a collection of packages for modeling and machine learning using tidyverse principles.

Packages

  • rsample: efficient data splitting and resampling
  • parsnip: tidy, unified interface to models
  • recipes: tidy interface to data pre-processing tools for feature engineering
  • workflows: bundles your pre-processing, modeling, and post-processing
  • tune: helps optimize the hyperparameters and pre-processing steps
  • yardstick: measures the performance metrics
  • dials: creates and manages tuning parameters/grids
  • broom: converts common R statistical objects into predictable formats
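If you want to play along, all of these attach with one call; a minimal sketch, assuming the meta-package is installed:

# install.packages("tidymodels") # once, if needed
library(tidymodels) # attaches the core packages above, plus a few tidyverse helpers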
8 / 90

tidymodels vs broom alone

9 / 90

broom

broom summarizes key information about models in tidy tibble()s.

broom tidies 100+ models from popular modelling packages and almost all of the model objects in the stats package that comes with base R. vignette("available-methods") lists method availability.

While broom is useful for summarizing the result of a single analysis in a consistent format, it is really designed for high-throughput applications, where you must combine results from multiple analyses.

I personally use broom for more classical statistics and tidymodels for machine learning. A more detailed summary of what broom is about can be found in the broom docs.

10 / 90

Before we get to sweeping

Why do we care so much about QBs, passing, and EPA?

11 / 90

lm() example

# get all weekly QBR for 2020 season
basic_data <- crossing(
  season = 2020, week = 1:6
) %>%
  pmap_dfr(espnscrapeR::get_nfl_qbr)

basic_plot <- basic_data %>%
  ggplot(
    aes(x = total_epa, y = qbr_total)
  ) +
  geom_point() +
  geom_smooth(method = "lm") +
  theme_minimal() +
  labs(
    x = "EPA", y = "QBR",
    title = "EPA is correlated with QBR"
  )

12 / 90

base example

# fit a basic linear model
basic_lm <- lm(qbr_total~total_epa, data = basic_data)
basic_lm
##
## Call:
## lm(formula = qbr_total ~ total_epa, data = basic_data)
##
## Coefficients:
## (Intercept) total_epa
## 31.696 6.007
summary(basic_lm)
##
## Call:
## lm(formula = qbr_total ~ total_epa, data = basic_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.6789 -7.3896 -0.2737 7.1539 27.7642
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.6959 1.3716 23.11 <2e-16 ***
## total_epa 6.0065 0.2186 27.48 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.78 on 182 degrees of freedom
## Multiple R-squared: 0.8058, Adjusted R-squared: 0.8047
## F-statistic: 755.2 on 1 and 182 DF, p-value: < 2.2e-16
13 / 90

broom example

broom::tidy(basic_lm)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 31.7 1.37 23.1 5.02e-56
## 2 total_epa 6.01 0.219 27.5 1.12e-66
broom::glance(basic_lm)
## # A tibble: 1 x 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.806 0.805 10.8 755. 1.12e-66 1 -698. 1401. 1411.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
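broom's third verb, augment(), adds per-observation output (fitted values, residuals, etc.) back onto the original data; a minimal sketch on the same model:

broom::augment(basic_lm) %>%
  select(qbr_total, total_epa, .fitted, .resid) %>%
  head()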
14 / 90

Want more broom?

There's a lot more to broom for tidy-ier modeling - out of scope for today, but I have a detailed blogpost.

tidy_off_models <- nest_off_data %>%
  mutate(
    fit = map(data, ~ lm(value.y ~ value.x, data = .x)),
    tidy_output = map(fit, glance)
  )

tidy_off_models
# # A tibble: 12 x 4
# metric data fit tidy_output
# <chr> <list> <list> <list>
# 1 pass_att <tibble [593 × 5]> <lm> <tibble [1 × 12]>
# 2 pass_comp_pct <tibble [593 × 5]> <lm> <tibble [1 × 12]>
# 3 yds_att <tibble [593 × 5]> <lm> <tibble [1 × 12]>
# 4 pass_yds <tibble [593 × 5]> <lm> <tibble [1 × 12]>
# 5 pass_td <tibble [593 × 5]> <lm> <tibble [1 × 12]>
# 6 int <tibble [593 × 5]> <lm> <tibble [1 × 12]>
# 7 pass_rating <tibble [593 × 5]> <lm> <tibble [1 × 12]>
# 8 first_downs <tibble [593 × 5]> <lm> <tibble [1 × 12]>
# 9 pass_first_pct <tibble [593 × 5]> <lm> <tibble [1 × 12]>
# 10 pass_20plus <tibble [593 × 5]> <lm> <tibble [1 × 12]>
# 11 pass_40plus <tibble [593 × 5]> <lm> <tibble [1 × 12]>
# 12 sacks <tibble [593 × 5]> <lm> <tibble [1 × 12]>
15 / 90

Want more broom?

There's a lot more to broom for tidy-ier modeling - out of scope for today, but I have a detailed blogpost.

16 / 90

Tidy Machine Learning w/ tidymodels

17 / 90

Core ideas for Today

A workflow for tidy machine learning

  • Split the data
  • Pre-Process and Choose a Model
  • Combine into a Workflow
  • Generate Predictions and Assess Model Metrics
18 / 90

Goal of Machine Learning

🔨 construct models that

🎯 generate accurate predictions

🆕 for future, yet-to-be-seen data

Feature Engineering - Max Kuhn and Kjell Johnson, and Alison Hill

19 / 90

Classification

Showing two examples today, comparing their outcomes, and then giving you the chance to explore on your own!

But how do you assess classifier accuracy?

20 / 90

Assessing Accuracy

Accuracy = How often the classifier is correct out of the total possible predictions.

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

  • True Positive Rate (Sensitivity/Recall):

    • Out of all true positives, how many did you predict right?
    • True Positives / (True Positives + False Negatives)
    • Usually plotted on Y for ROC curve
  • True Negative Rate (Specificity):

    • Out of all true negatives, how many did you predict right?
    • True Negatives / (True Negatives + False Positives)
    • Usually plotted on X as 1 - Specificity for ROC curve

There are several other types of calculations you can generate, but these cover the core ideas
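To make those formulas concrete, here's a minimal, hypothetical sketch with yardstick on some made-up predictions (assuming tidymodels is loaded; toy_preds is not part of today's data):

toy_preds <- tibble(
  truth    = factor(c("pass", "pass", "pass", "run", "run", "run")),
  estimate = factor(c("pass", "pass", "run", "run", "run", "pass"))
)

toy_preds %>% accuracy(truth, estimate) # (TP + TN) / total
toy_preds %>% sens(truth, estimate) # True Positive Rate
toy_preds %>% spec(truth, estimate) # True Negative Rate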

21 / 90

ROC Curve example

  • False Positive Rate == (1 - Specificity) on X
  • True Positive Rate (Sensitivity) on Y
  • AUC = Area Under the Curve (higher = better)

Adapted from Hands on Machine Learning with R

22 / 90

Assessing Accuracy

Confusion Matrix: Predicted vs Actual

  • True Positive vs False Positive
  • False Negative vs True Negative

  • 🎯 Goal: Maximize the True, Minimize the False (or maybe not)
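A minimal sketch of building that table with yardstick's conf_mat(), again on made-up predictions (assuming tidymodels is loaded):

tibble(
  truth    = factor(c("pass", "pass", "pass", "run", "run", "run")),
  estimate = factor(c("pass", "pass", "run", "run", "run", "pass"))
) %>%
  conf_mat(truth = truth, estimate = estimate)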

23 / 90

Confusion Matrix

Adapted from Hands on Machine Learning with R

24 / 90

Confusion Matrix

Adapted from Hands on Machine Learning with R

25 / 90

Confusion Matrix

## Truth
## Prediction Event Non-Event
## Event 10918 3647
## Non-Event 2665 5763
table_conf_mat %>%
  autoplot("heatmap")

26 / 90

Calibration Plot

How well do the predictions stand up against the observed?

27 / 90

Break!

28 / 90

The Dataset

Filtered down from the nflscrapR and nflfastR datasets (~2.25 GB) to only non-penalty run and pass plays on 1st, 2nd, or 3rd down from the 2017-2019 regular seasons. This is about 92,000 plays.

The goal: Predict if an upcoming play will be a run or a pass

glimpse(raw_plays)
## Rows: 91,976
## Columns: 20
## $ game_id <dbl> 2017090700, 2017090700, 2017090700, 201709…
## $ posteam <chr> "NE", "NE", "NE", "NE", "NE", "NE", "NE", …
## $ play_type <chr> "pass", "pass", "run", "run", "pass", "run…
## $ yards_gained <dbl> 0, 8, 8, 3, 19, 5, 16, 0, 2, 7, 0, 3, 10, …
## $ ydstogo <dbl> 10, 10, 2, 10, 7, 10, 5, 2, 2, 10, 10, 10,…
## $ down <dbl> 1, 2, 3, 1, 2, 1, 2, 1, 2, 1, 1, 2, 3, 1, …
## $ game_seconds_remaining <dbl> 3595, 3589, 3554, 3532, 3506, 3482, 3455, …
## $ yardline_100 <dbl> 73, 73, 65, 57, 54, 35, 30, 2, 2, 75, 32, …
## $ qtr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ posteam_score <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 7, 7, 7, …
## $ defteam <chr> "KC", "KC", "KC", "KC", "KC", "KC", "KC", …
## $ defteam_score <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, …
## $ score_differential <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, -7, 7, 7, 7, 7,…
## $ shotgun <dbl> 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, …
## $ no_huddle <dbl> 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
## $ posteam_timeouts_remaining <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ defteam_timeouts_remaining <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ wp <dbl> 0.5060180, 0.4840546, 0.5100098, 0.5529816…
## $ goal_to_go <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
## $ half_seconds_remaining <dbl> 1795, 1789, 1754, 1732, 1706, 1682, 1655, …
29 / 90

Data Prep

I use dbplyr to pull only the specific rows/columns we need into memory from the 2.2 GB of data on disk. Full details about dbplyr + nflfastR data are on my blog, and there's an update_db() function in nflfastR itself!

pbp_db <- tbl(DBI::dbConnect(RSQLite::SQLite(), "pbp_db.sqlite"), "pbp_clean_2000-2019")

raw_plays <- pbp_db %>%
  filter(
    play_type %in% c("run", "pass"),
    penalty == 0,
    qtr <= 4,
    season_type == "REG",
    season >= 2017,
    down %in% c(1:3),
    !is.na(yardline_100)
  ) %>%
  select(
    game_id, posteam,
    play_type, yards_gained, ydstogo, down, game_seconds_remaining,
    yardline_100, qtr, posteam, posteam_score, defteam, defteam_score,
    score_differential, shotgun, no_huddle, posteam_timeouts_remaining,
    defteam_timeouts_remaining, wp, goal_to_go, half_seconds_remaining
  ) %>%
  collect()
30 / 90

Feature Engineering

I added a few features, namely a running total of number of runs/passes pre-snap and what the previous play was.

all_plays <- raw_plays %>%
  group_by(game_id, posteam) %>%
  mutate(
    run = if_else(play_type == "run", 1, 0),
    pass = if_else(play_type == "pass", 1, 0),
    total_runs = if_else(play_type == "run", cumsum(run) - 1, cumsum(run)),
    total_pass = if_else(play_type == "pass", cumsum(pass) - 1, cumsum(pass)),
    previous_play = if_else(posteam == lag(posteam),
      lag(play_type), "First play of Drive"
    ),
    previous_play = if_else(is.na(previous_play),
      replace_na("First play of Drive"), previous_play
    )
  ) %>%
  ungroup() %>%
  mutate_at(vars(
    play_type, shotgun, no_huddle,
    posteam_timeouts_remaining, defteam_timeouts_remaining,
    previous_play, goal_to_go
  ), as.factor) %>%
  mutate(
    down = factor(down, levels = c(1, 2, 3), ordered = TRUE),
    qtr = factor(qtr, levels = c(1, 2, 3, 4), ordered = TRUE),
    in_red_zone = if_else(yardline_100 <= 20, 1, 0),
    in_fg_range = if_else(yardline_100 <= 35, 1, 0),
    two_min_drill = if_else(half_seconds_remaining <= 120, 1, 0)
  ) %>%
  mutate(
    in_red_zone = factor(if_else(yardline_100 <= 20, 1, 0)),
    in_fg_range = factor(if_else(yardline_100 <= 35, 1, 0)),
    two_min_drill = factor(if_else(half_seconds_remaining <= 120, 1, 0))
  ) %>%
  select(-run, -pass)
31 / 90

Core ideas for Today

A workflow for tidy machine learning

  • Split the data
  • Pre-Process and Choose a Model
  • Combine into a Workflow
  • Generate Predictions and Assess Model Metrics
32 / 90

Split

split_data <- initial_split(data, 0.75)
train_data <- training(split_data)
test_data <- testing(split_data)
33 / 90

Pre-Process & choose a model

model_recipe <- recipe(pred ~ predictors, data = train_data)

# Choose a model and an engine
lr_mod <- logistic_reg(mode = "classification") %>%
  set_engine("glm")
34 / 90

Combine into a workflow

# Combine the model and recipe to the workflow
lr_wflow <- workflow() %>%
  add_recipe(model_recipe) %>%
  add_model(lr_mod)

# Fit/train the model
model_fit <- lr_wflow %>%
  fit(data = train_data)
35 / 90

Predict and get metrics

# Get predictions
pred_lr <- predict(model_fit, test_data) %>%
  bind_cols(select(test_data, pred)) %>%
  bind_cols(predict(model_fit, test_data, type = "prob"))

# Check metrics
pred_lr %>%
  metrics(truth = pred, .pred_class)
36 / 90

Split

# Split
split_pbp <- initial_split(all_plays, 0.75, strata = play_type)
# Split into test/train
train_data <- training(split_pbp)
test_data <- testing(split_pbp)

Pre-Process & Choose a model

pbp_rec <- recipe(play_type ~ ., data = train_data) %>%
  step_rm(half_seconds_remaining) %>% # remove variable
  step_string2factor(posteam, defteam) %>% # convert to factors
  update_role(yards_gained, game_id, new_role = "ID") %>% # add as ID
  step_corr(all_numeric(), threshold = 0.7) %>% # remove highly correlated vars
  step_center(all_numeric()) %>% # subtract mean from numeric
  step_zv(all_predictors()) # remove zero-variance predictors

# Choose a model and an engine
lr_mod <- logistic_reg(mode = "classification") %>%
  set_engine("glm")

Combine into a workflow

# Combine the model and recipe to the workflow
lr_wflow <- workflow() %>%
  add_recipe(pbp_rec) %>%
  add_model(lr_mod)

# Fit/train the model
pbp_fit_lr <- lr_wflow %>%
  fit(data = train_data)

Predict and get metrics

# Get predictions
pbp_pred_lr <- predict(pbp_fit_lr, test_data) %>%
  bind_cols(test_data %>% select(play_type)) %>%
  bind_cols(predict(pbp_fit_lr, test_data, type = "prob"))

# Check metrics
pbp_pred_lr %>%
  metrics(truth = play_type, .pred_class)
37 / 90

rsample

38 / 90

rsample

Now that I've created the dataset to use, I'll start with tidymodels proper.

rsample at a minimum does your train/test split, but also takes care of things like bootstrapping, stratification, v-fold cross-validation, validation splits, rolling origin, etc.
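A minimal sketch of a few of those helpers, run on a built-in data frame just to show the calls (all from rsample):

set.seed(37)
initial_split(mtcars, prop = 0.75) # simple train/test split
bootstraps(mtcars, times = 25) # 25 bootstrap resamples
vfold_cv(mtcars, v = 5, repeats = 2) # repeated 5-fold cross-validation
validation_split(mtcars, prop = 0.8) # single validation split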

39 / 90

Data Splitting w/ rsample

Do the initial split and stratify by play type to make sure there are equal ratios of run vs pass in test and train

split_pbp <- initial_split(all_plays, 0.75, strata = play_type)
split_pbp
## <Analysis/Assess/Total>
## <68983/22993/91976>
# separate the training data
train_data <- training(split_pbp)
# separate the testing data
test_data <- testing(split_pbp)
40 / 90

Test vs Train

Split into train_data and test_data and then confirm the ratios.

train_data %>%
  count(play_type) %>%
  mutate(ratio = n / sum(n))
## # A tibble: 2 x 3
## play_type n ratio
## <fct> <int> <dbl>
## 1 pass 40752 0.591
## 2 run 28231 0.409

test_data %>%
  count(play_type) %>%
  mutate(ratio = n / sum(n))
## # A tibble: 2 x 3
## play_type n ratio
## <fct> <int> <dbl>
## 1 pass 13583 0.591
## 2 run 9410 0.409
41 / 90

Model recipes

42 / 90

Add recipe steps with recipes

recipe steps are changes we make to the dataset, including things like centering, dummy encoding, updating columns to ID-only roles, or even custom feature engineering.

pbp_rec <- recipe(play_type ~ ., data = train_data) %>%
  step_rm(half_seconds_remaining) %>% # remove variable
  step_string2factor(posteam, defteam) %>% # convert to factors
  # ignore these vars for train/test, but include in data as ID
  update_role(yards_gained, game_id, new_role = "ID") %>%
  # removes vars that have large absolute correlations w/ other vars
  step_corr(all_numeric(), threshold = 0.7) %>%
  step_center(all_numeric()) %>% # subtract mean from numeric
  step_zv(all_predictors()) # remove zero-variance predictors
43 / 90

In recipes vs dplyr/tidyr

Generally:

  • In tidyverse, do reshaping or basic cleaning
  • In recipes do statistical transformations or other things that are intended for modeling
    • Possible step_??? for many many things!
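For a flavor of what's out there, a minimal sketch chaining a few common steps onto a hypothetical recipe; none of these are used in today's model, they're just illustrations:

rec_sketch <- recipe(play_type ~ ., data = train_data) %>%
  step_other(posteam, threshold = 0.05) %>% # pool rare factor levels into "other"
  step_dummy(all_nominal(), -all_outcomes()) %>% # dummy-encode nominal predictors
  step_normalize(all_numeric()) %>% # center and scale numeric predictors
  step_log(ydstogo, offset = 1) # log-transform a skewed predictor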

usemodels

Relatively early in package life cycle, but helps with boilerplate

usemodels::use_ranger(play_type ~ ., train_data)
## ranger_recipe <-
## recipe(formula = play_type ~ ., data = train_data) %>%
## step_string2factor(one_of(posteam, defteam))
##
## ranger_spec <-
## rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
## set_mode("classification") %>%
## set_engine("ranger")
##
## ranger_workflow <-
## workflow() %>%
## add_recipe(ranger_recipe) %>%
## add_model(ranger_spec)
##
## set.seed(68747)
## ranger_tune <-
## tune_grid(ranger_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points"))
44 / 90

parsnip

45 / 90

Choose a model and start your engines!

parsnip supplies a general modeling interface to the wide world of R models!

# Note that mode = "classification" is the default here anyway!
lr_mod <- logistic_reg(mode = "classification") %>%
  set_engine("glm")
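The payoff of that unified interface: swapping the engine (or the whole model family) only changes the spec, not the rest of your pipeline. A small sketch, assuming the glmnet and ranger packages are installed:

# same logistic regression, different engine (regularized, via glmnet)
lr_glmnet_mod <- logistic_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet")

# or a different model family entirely, same grammar
rf_mod_sketch <- rand_forest(trees = 500) %>%
  set_mode("classification") %>%
  set_engine("ranger")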
46 / 90

workflows

47 / 90

Combine into a workflow

We can now combine the model and recipe into a workflow - this allows us to define exactly what model and data are going into our fit/train call.

lr_wflow <- workflow() %>%
  add_model(lr_mod) %>% # parsnip model
  add_recipe(pbp_rec) # recipe from recipes

What is a workflow?

A workflow is an object that can bundle together your pre-processing, modeling, and post-processing requests. For example, if you have a recipe and parsnip model, these can be combined into a workflow. The advantages are:

  • You don’t have to keep track of separate objects in your workspace.

  • The recipe prepping and model fitting can be executed using a single call to fit().

  • If you have custom tuning parameter settings, these can be defined using a simpler interface when combined with tune.

Steps so far

  • Build a recipe for any pre-processing
  • Choose and build a model
  • Combine them into a workflow
48 / 90

Fit/train the model with parsnip

Now we can move forward with fitting/training the model - this is really a one-liner.

pbp_fit_lr <- lr_wflow %>%
fit(data = train_data) # fit the model against the training data
49 / 90

Predict outcomes with parsnip

After the model has been trained, we can predict against the holdout testing data and compare those predictions to the truth.

pbp_pred_lr <- predict(pbp_fit_lr, test_data) %>%
  # Add back a "truth" column for what the actual play_type was
  bind_cols(test_data %>% select(play_type)) %>%
  # Get probabilities for the class for each observation
  bind_cols(predict(pbp_fit_lr, test_data, type = "prob"))
## # A tibble: 22,993 x 4
## .pred_class play_type .pred_pass .pred_run
## <fct> <fct> <dbl> <dbl>
## 1 pass run 0.789 0.211
## 2 pass pass 0.746 0.254
## 3 run run 0.345 0.655
## 4 pass pass 0.701 0.299
## 5 pass pass 0.676 0.324
## 6 run pass 0.433 0.567
## 7 run run 0.431 0.569
## 8 run run 0.335 0.665
## 9 run run 0.306 0.694
## 10 pass run 0.694 0.306
## # … with 22,983 more rows
50 / 90

Predict outcomes with parsnip

Note that our previous code of predict() %>% bind_cols() %>% bind_cols() is really equivalent to the below:

pbp_last_lr <- last_fit(lr_mod, pbp_rec, split = split_pbp)
pbp_last_lr %>%
pluck(".predictions", 1)
## # A tibble: 22,993 x 5
## .pred_pass .pred_run .row .pred_class play_type
## <dbl> <dbl> <int> <fct> <fct>
## 1 0.789 0.211 4 pass run
## 2 0.746 0.254 7 pass pass
## 3 0.345 0.655 17 run run
## 4 0.701 0.299 19 pass pass
## 5 0.676 0.324 24 pass pass
## 6 0.433 0.567 31 run pass
## 7 0.431 0.569 43 run run
## 8 0.335 0.665 46 run run
## 9 0.306 0.694 48 run run
## 10 0.694 0.306 50 pass run
## # … with 22,983 more rows
51 / 90

Assessing Accuracy with yardstick

52 / 90

Check outcomes with yardstick

For confirming how well the model predicts, we can use yardstick to plot ROC curves, get AUC and collect general metrics.

pbp_pred_lr %>%
  # get Area under Curve
  roc_auc(truth = play_type, .pred_pass)
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.780

pbp_pred_lr %>%
  # collect and report metrics
  metrics(truth = play_type, .pred_class)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.723
## 2 kap binary 0.418

pbp_pred_lr %>%
  # calculate ROC curve
  roc_curve(truth = play_type, .pred_pass) %>%
  autoplot()

53 / 90

Note on Checking Outcomes

You could use last_fit():

This function is intended to be used after fitting a variety of models and the final tuning parameters (if any) have been finalized. The next step would be to fit using the entire training set and verify performance using the test data.

lr_last_fit <- last_fit(lr_mod, pbp_rec, split = split_pbp)
collect_metrics(lr_last_fit)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.723
## 2 roc_auc binary 0.780
54 / 90

Split

# Split
split_pbp <- initial_split(all_plays, 0.75, strata = play_type)
# Split into test/train
train_data <- training(split_pbp)
test_data <- testing(split_pbp)

Pre-Process & Choose a model

pbp_rec <- recipe(play_type ~ ., data = train_data) %>%
  step_rm(half_seconds_remaining) %>% # remove variable
  step_string2factor(posteam, defteam) %>% # convert to factors
  update_role(yards_gained, game_id, new_role = "ID") %>% # add as ID
  step_corr(all_numeric(), threshold = 0.7) %>% # remove highly correlated vars
  step_center(all_numeric()) %>% # subtract mean from numeric
  step_zv(all_predictors()) # remove zero-variance predictors

# Choose a model and an engine
lr_mod <- logistic_reg(mode = "classification") %>%
  set_engine("glm")

Combine into a workflow

# Combine the model and recipe to the workflow
lr_wflow <- workflow() %>%
  add_recipe(pbp_rec) %>%
  add_model(lr_mod)

# Fit/train the model
pbp_fit_lr <- lr_wflow %>%
  fit(data = train_data)

Predict and get metrics

# Get predictions
pbp_pred_lr <- predict(pbp_fit_lr, test_data) %>%
  bind_cols(test_data %>% select(play_type)) %>%
  bind_cols(predict(pbp_fit_lr, test_data, type = "prob"))

# Check metrics
pbp_pred_lr %>%
  metrics(truth = play_type, .pred_class)
55 / 90

Break!

56 / 90

Change the model

How about a Random Forest model? Just change the model and re-run!

rf_mod <- rand_forest(trees = 100) %>%
  set_engine("ranger",
    importance = "impurity", # variable importance
    num.threads = 4 # Parallelize
  ) %>%
  set_mode("classification")

rf_wflow <- workflow() %>%
  add_recipe(pbp_rec) %>% # Same recipe
  add_model(rf_mod) # New model

pbp_fit_rf <- rf_wflow %>% # New workflow
  fit(data = train_data) # Fit the Random Forest

# Get predictions and check metrics
pbp_pred_rf <- predict(pbp_fit_rf, test_data) %>%
  bind_cols(test_data %>% select(play_type)) %>%
  bind_cols(predict(pbp_fit_rf, test_data, type = "prob"))
57 / 90

Feature Importance

pbp_fit_rf %>%
  pull_workflow_fit() %>%
  vip(num_features = 20)

58 / 90

Quick Model Comparison

The random forest model slightly outperforms the logistic regression, although neither is perfect.

pbp_pred_lr %>% # Logistic Regression predictions
metrics(truth = play_type, .pred_class)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.723
## 2 kap binary 0.418
pbp_pred_rf %>% # Random Forest predictions
metrics(truth = play_type, .pred_class)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.728
## 2 kap binary 0.434
59 / 90

Quick Model Comparison

pbp_pred_lr %>% # Logistic Regression predictions
conf_mat(truth = play_type, .pred_class)
## Truth
## Prediction pass run
## pass 10911 3686
## run 2672 5724
pbp_pred_rf %>% # Random Forest predictions
conf_mat(truth = play_type, .pred_class)
## Truth
## Prediction pass run
## pass 10675 3338
## run 2908 6072
60 / 90

Comparing Models Together

roc_rf <- pbp_pred_rf %>%
  roc_curve(truth = play_type, .pred_pass) %>%
  mutate(model = "Ranger")

roc_lr <- pbp_pred_lr %>%
  roc_curve(truth = play_type, .pred_pass) %>%
  mutate(model = "Logistic Regression")

full_plot <- bind_rows(roc_rf, roc_lr) %>%
  # Note that autoplot() works here!
  ggplot(aes(
    x = 1 - specificity,
    y = sensitivity,
    color = model
  )) +
  geom_path(lwd = 1, alpha = 0.5) +
  geom_abline(lty = 3) +
  scale_color_manual(
    values = c("#374785", "#E98074")
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",
    legend.title = element_blank()
  )

full_plot

61 / 90

Calibration Plot

calibration_plot <- pbp_pred_rf %>%
  mutate(
    pass = if_else(play_type == "pass", 1, 0),
    pred_rnd = round(.pred_pass, 2)
  ) %>%
  group_by(pred_rnd) %>%
  summarize(
    mean_pred = mean(.pred_pass),
    mean_obs = mean(pass),
    n = n()
  ) %>%
  ggplot(aes(x = mean_pred, y = mean_obs)) +
  geom_abline(linetype = "dashed") +
  geom_point(aes(size = n), alpha = 0.5) +
  theme_minimal() +
  labs(
    x = "Predicted Pass",
    y = "Observed Pass"
  ) +
  coord_cartesian(
    xlim = c(0, 1), ylim = c(0, 1)
  )

62 / 90

Quick Re-Cap

A workflow for tidy modeling

  • Split the data
  • Pre-Process and Choose a Model
  • Combine into a Workflow
  • Generate Predictions and Assess Model Metrics

So the unified interface hopefully makes the idea of learning and applying many algorithms easier.

tidymodels really shines when you start to go further or apply best practices like:

  • Resampling, Cross Validation, Bootstrapping
  • Model Tuning and Model Optimization
  • Grid Search, Iterative Search
63 / 90

Break!

64 / 90

A Deeper Dive on Best Practices

65 / 90

Comparing Models

Previously we've just compared two models by seeing how accurate they were on our testing data, but....

The test set is the data that should be used to conduct a proper evaluation of model performance on the final model(s). This begs the question of, “How can we tell what is best if we don’t measure performance until the test set?”

However, we usually need to understand the effectiveness of the model before using the test set.

66 / 90

Resampling and Cross Validation

Resampling methods are empirical simulation systems that emulate the process of using some data for modeling and different data for evaluation. Most resampling methods are iterative, meaning that this process is repeated multiple times.

Cross-validation is a well established resampling method

Get Started w/ Resampling and test drive on RStudio Cloud.

67 / 90

Resampling and Cross Validation

Resampling is only conducted on the training set. The test set is not involved. For each iteration of resampling, the data are partitioned into two subsamples:

  • The model is fit with the analysis set.

  • The model is evaluated with the assessment set.

68 / 90

Resampling and Cross-validation

vfold_cv(train_data, v = 10)
## # 10-fold cross-validation
## # A tibble: 10 x 2
## splits id
## <list> <chr>
## 1 <split [62.1K/6.9K]> Fold01
## 2 <split [62.1K/6.9K]> Fold02
## 3 <split [62.1K/6.9K]> Fold03
## 4 <split [62.1K/6.9K]> Fold04
## 5 <split [62.1K/6.9K]> Fold05
## 6 <split [62.1K/6.9K]> Fold06
## 7 <split [62.1K/6.9K]> Fold07
## 8 <split [62.1K/6.9K]> Fold08
## 9 <split [62.1K/6.9K]> Fold09
## 10 <split [62.1K/6.9K]> Fold10
vfold_cv(train_data, v = 10, repeats = 5)
## # 10-fold cross-validation repeated 5 times
## # A tibble: 50 x 3
## splits id id2
## <list> <chr> <chr>
## 1 <split [62.1K/6.9K]> Repeat1 Fold01
## 2 <split [62.1K/6.9K]> Repeat1 Fold02
## 3 <split [62.1K/6.9K]> Repeat1 Fold03
## 4 <split [62.1K/6.9K]> Repeat1 Fold04
## 5 <split [62.1K/6.9K]> Repeat1 Fold05
## 6 <split [62.1K/6.9K]> Repeat1 Fold06
## 7 <split [62.1K/6.9K]> Repeat1 Fold07
## 8 <split [62.1K/6.9K]> Repeat1 Fold08
## 9 <split [62.1K/6.9K]> Repeat1 Fold09
## 10 <split [62.1K/6.9K]> Repeat1 Fold10
## # … with 40 more rows
69 / 90

Estimate Performance w/ Cross Validation

NOTE: Fitting the model multiple times can take a while with larger models or more folds/repeats! I recommend running this as a background job in RStudio, so you don't lock up your session for the duration.

set.seed(20201024)
# Create 10 folds and 5 repeats
pbp_folds <- vfold_cv(train_data, v = 10, repeats = 5)
pbp_folds
## # 10-fold cross-validation repeated 5 times
## # A tibble: 50 x 3
## splits id id2
## <list> <chr> <chr>
## 1 <split [62.1K/6.9K]> Repeat1 Fold01
## 2 <split [62.1K/6.9K]> Repeat1 Fold02
## 3 <split [62.1K/6.9K]> Repeat1 Fold03
## 4 <split [62.1K/6.9K]> Repeat1 Fold04
## 5 <split [62.1K/6.9K]> Repeat1 Fold05
## 6 <split [62.1K/6.9K]> Repeat1 Fold06
## 7 <split [62.1K/6.9K]> Repeat1 Fold07
## 8 <split [62.1K/6.9K]> Repeat1 Fold08
## 9 <split [62.1K/6.9K]> Repeat1 Fold09
## 10 <split [62.1K/6.9K]> Repeat1 Fold10
## # … with 40 more rows
70 / 90

Estimate Performance w/ Cross Validation

keep_pred <- control_resamples(save_pred = TRUE, verbose = TRUE)
set.seed(20201024)
# Fit resamples
rf_res <- fit_resamples(rf_wflow, resamples = pbp_folds, control = keep_pred)
rf_res
## # Resampling results
## # 10-fold cross-validation repeated 5 times
## # A tibble: 50 x 6
## splits id id2 .metrics .notes .predictions
## <list> <chr> <chr> <list> <list> <list>
## 1 <split [62.1K/6… Repeat1 Fold01 <tibble [2 × … <tibble [0 … <tibble [6,899 ×…
## 2 <split [62.1K/6… Repeat1 Fold02 <tibble [2 × … <tibble [0 … <tibble [6,899 ×…
## 3 <split [62.1K/6… Repeat1 Fold03 <tibble [2 × … <tibble [0 … <tibble [6,899 ×…
## 4 <split [62.1K/6… Repeat1 Fold04 <tibble [2 × … <tibble [0 … <tibble [6,898 ×…
## 5 <split [62.1K/6… Repeat1 Fold05 <tibble [2 × … <tibble [0 … <tibble [6,898 ×…
## 6 <split [62.1K/6… Repeat1 Fold06 <tibble [2 × … <tibble [0 … <tibble [6,898 ×…
## 7 <split [62.1K/6… Repeat1 Fold07 <tibble [2 × … <tibble [0 … <tibble [6,898 ×…
## 8 <split [62.1K/6… Repeat1 Fold08 <tibble [2 × … <tibble [0 … <tibble [6,898 ×…
## 9 <split [62.1K/6… Repeat1 Fold09 <tibble [2 × … <tibble [0 … <tibble [6,898 ×…
## 10 <split [62.1K/6… Repeat1 Fold10 <tibble [2 × … <tibble [0 … <tibble [6,898 ×…
## # … with 40 more rows
71 / 90

What just happened???

We just fit a model for each resample, evaluated it against the within-resample assessment set, and stored it all in a single tibble!

rf_res %>%
# grab specific columns and resamples
pluck(".metrics", 1)
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.730
## 2 roc_auc binary 0.794
rf_res %>%
# grab specific columns and resamples
pluck(".predictions", 10)
## # A tibble: 6,898 x 5
## .pred_pass .pred_run .row .pred_class play_type
## <dbl> <dbl> <int> <fct> <fct>
## 1 0.552 0.448 8 pass pass
## 2 0.711 0.289 18 pass pass
## 3 0.697 0.303 26 pass run
## 4 0.977 0.0227 41 pass pass
## 5 0.947 0.0530 48 pass pass
## 6 0.295 0.705 51 run run
## 7 0.437 0.563 75 run run
## 8 0.701 0.299 96 pass pass
## 9 0.610 0.390 111 pass run
## 10 0.810 0.190 117 pass pass
## # … with 6,888 more rows
72 / 90

What else can you do?

# Summarize all metrics
rf_res %>%
collect_metrics(summarize = TRUE)
## # A tibble: 2 x 5
## .metric .estimator mean n std_err
## <chr> <chr> <dbl> <int> <dbl>
## 1 accuracy binary 0.730 50 0.000651
## 2 roc_auc binary 0.797 50 0.000631
rf_res %>%
# combine ALL predictions
collect_predictions()
## # A tibble: 344,915 x 7
## id id2 .pred_pass .pred_run .row .pred_class play_type
## <chr> <chr> <dbl> <dbl> <int> <fct> <fct>
## 1 Repeat1 Fold01 0.663 0.337 3 pass run
## 2 Repeat1 Fold01 0.730 0.270 22 pass pass
## 3 Repeat1 Fold01 0.424 0.576 23 run pass
## 4 Repeat1 Fold01 0.348 0.652 27 run run
## 5 Repeat1 Fold01 0.482 0.518 28 run run
## 6 Repeat1 Fold01 0.695 0.305 35 pass run
## 7 Repeat1 Fold01 0.381 0.619 61 run pass
## 8 Repeat1 Fold01 0.166 0.834 69 run run
## 9 Repeat1 Fold01 0.781 0.219 80 pass pass
## 10 Repeat1 Fold01 0.333 0.667 84 run run
## # … with 344,905 more rows
73 / 90

Collect metrics

First, show the metrics from our naive model as evaluated against our test data.

# Naive Model on Testing Data
rf_compare_df <- bind_rows(
  accuracy(
    pbp_pred_rf,
    truth = play_type, .pred_class
  ),
  roc_auc(
    pbp_pred_rf,
    truth = play_type, .pred_pass
  )
)

And then what our resampled metrics look like, which still leaves our test data unseen.

combo_plot <- rf_res %>%
  collect_metrics(summarize = FALSE) %>%
  ggplot(aes(x = .metric, y = .estimate)) +
  geom_jitter(width = 0.2) +
  geom_boxplot(width = 0.3, alpha = 0.5) +
  geom_point(
    data = rf_compare_df,
    color = "red", size = 5
  )

74 / 90

Estimate Performance w/ Cross Validation

Now, since we aren't supposed to "know" our test results... we can collect our predictions and do another calibration plot. I'm going to round to 2 decimal places and get ~100 data points to plot (instead of our actual ~345,000 points from the combined 50 runs).

assess_res <- collect_predictions(rf_res)
assess_res
## # A tibble: 344,915 x 7
## id id2 .pred_pass .pred_run .row .pred_class play_type
## <chr> <chr> <dbl> <dbl> <int> <fct> <fct>
## 1 Repeat1 Fold01 0.663 0.337 3 pass run
## 2 Repeat1 Fold01 0.730 0.270 22 pass pass
## 3 Repeat1 Fold01 0.424 0.576 23 run pass
## 4 Repeat1 Fold01 0.348 0.652 27 run run
## 5 Repeat1 Fold01 0.482 0.518 28 run run
## 6 Repeat1 Fold01 0.695 0.305 35 pass run
## 7 Repeat1 Fold01 0.381 0.619 61 run pass
## 8 Repeat1 Fold01 0.166 0.834 69 run run
## 9 Repeat1 Fold01 0.781 0.219 80 pass pass
## 10 Repeat1 Fold01 0.333 0.667 84 run run
## # … with 344,905 more rows
75 / 90

Cross Validation Calibration Plot

res_calib_plot <- assess_res %>%
  mutate(
    pass = if_else(play_type == "pass", 1, 0),
    pred_rnd = round(.pred_pass, 2)
  ) %>%
  group_by(pred_rnd) %>%
  summarize(
    mean_pred = mean(.pred_pass),
    mean_obs = mean(pass),
    n = n()
  ) %>%
  ggplot(aes(x = mean_pred, y = mean_obs)) +
  geom_abline(linetype = "dashed") +
  geom_point(aes(size = n), alpha = 0.5) +
  theme_minimal() +
  labs(
    x = "Predicted Pass",
    y = "Observed Pass"
  ) +
  coord_cartesian(
    xlim = c(0, 1), ylim = c(0, 1)
  )

76 / 90

Compare the Calibrations

77 / 90

Model Tuning with tune

78 / 90

tune

We never adjusted our model! We just used naive models and evaluated their performance.

Now, their performance was pretty decent (~68-73% accuracy), but could we get better?

Get Started with Tuning and test drive on RStudio Cloud

79 / 90

Resample + Tune

We're going to use grid-search for our tuning process, and we also need to specify which hyperparameters of our random forest we want to tune.

Note: A hyperparameter is a parameter whose value is used to control the learning process (Wikipedia).

tune_pbp_rf <- rand_forest(
  mtry = tune(), # add placeholder for tune
  trees = 100,
  min_n = tune() # add placeholder for tune
) %>%
  set_mode("classification") %>%
  set_engine("ranger")

tune_rf_wf <- workflow() %>%
  add_recipe(pbp_rec) %>%
  add_model(tune_pbp_rf)

tune_rf_wf
## ══ Workflow ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: rand_forest()
##
## ── Preprocessor ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## 5 Recipe Steps
##
## ● step_rm()
## ● step_string2factor()
## ● step_corr()
## ● step_center()
## ● step_zv()
##
## ── Model ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Random Forest Model Specification (classification)
##
## Main Arguments:
## mtry = tune()
## trees = 100
## min_n = tune()
##
## Computational engine: ranger
80 / 90

We'll create a grid of possible hyperparameters and then estimate how well they fit with our resamples.

Note that this took about 20 min to run!

I'm fitting 15 models across 5 folds, training a model and predicting outcomes each time! The beauty here is that you could run this as a background job.

set.seed(20201024)
pbp_folds <- vfold_cv(train_data, v = 5)

tic()
tune_res <- tune_grid(
  tune_rf_wf,
  resamples = pbp_folds,
  grid = 15, # 15 combos of model parameters
  control = control_grid(verbose = TRUE)
)
toc()
# 1478.385 sec elapsed
81 / 90

Here are the results!

tune_res
## # Tuning results
## # 5-fold cross-validation
## # A tibble: 5 x 4
## splits id .metrics .notes
## <list> <chr> <list> <list>
## 1 <split [55.2K/13.8K]> Fold1 <tibble [30 × 6]> <tibble [0 × 1]>
## 2 <split [55.2K/13.8K]> Fold2 <tibble [30 × 6]> <tibble [0 × 1]>
## 3 <split [55.2K/13.8K]> Fold3 <tibble [30 × 6]> <tibble [0 × 1]>
## 4 <split [55.2K/13.8K]> Fold4 <tibble [30 × 6]> <tibble [0 × 1]>
## 5 <split [55.2K/13.8K]> Fold5 <tibble [30 × 6]> <tibble [0 × 1]>
82 / 90

Check it out

It's nested tibbles for the split data, the fold id, metrics, and any notes.

# Essentially the same as tune_res[[".metrics"]][[1]]
tune_res %>%
pluck(".metrics", 3)
## # A tibble: 30 x 6
## mtry min_n .metric .estimator .estimate .config
## <int> <int> <chr> <chr> <dbl> <chr>
## 1 9 32 accuracy binary 0.735 Model01
## 2 9 32 roc_auc binary 0.797 Model01
## 3 17 22 accuracy binary 0.729 Model02
## 4 17 22 roc_auc binary 0.792 Model02
## 5 13 16 accuracy binary 0.727 Model03
## 6 13 16 roc_auc binary 0.792 Model03
## 7 2 2 accuracy binary 0.735 Model04
## 8 2 2 roc_auc binary 0.796 Model04
## 9 3 25 accuracy binary 0.737 Model05
## 10 3 25 roc_auc binary 0.800 Model05
## # … with 20 more rows
83 / 90

Check it out

plot_tuned <- tune_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  dplyr::select(mean, mtry:min_n) %>%
  pivot_longer(mtry:min_n,
    values_to = "value",
    names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x", ncol = 1) +
  labs(x = NULL, y = "AUC")

84 / 90

Check it out (scaling matters!)

plot_tuned <- tune_res %>%
  collect_metrics() %>%
  filter(.metric == "roc_auc") %>%
  dplyr::select(mean, mtry:min_n) %>%
  pivot_longer(mtry:min_n,
    values_to = "value",
    names_to = "parameter"
  ) %>%
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x", ncol = 1) +
  labs(x = NULL, y = "AUC")

plot_tuned +
  scale_y_continuous(limits = c(0.75, 0.85))

85 / 90

Finalize

Here we are investigating which hyperparameters maximized ROC Area Under the Curve.

# Which 5x were best?
show_best(tune_res, "roc_auc", n = 5)
## # A tibble: 5 x 8
## mtry min_n .metric .estimator mean n std_err .config
## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 5 40 roc_auc binary 0.801 5 0.000944 Model14
## 2 3 25 roc_auc binary 0.800 5 0.000780 Model05
## 3 7 37 roc_auc binary 0.800 5 0.000926 Model07
## 4 9 32 roc_auc binary 0.799 5 0.00106 Model01
## 5 10 34 roc_auc binary 0.798 5 0.000781 Model06
# Select the best
best_fit_auc <- select_best(tune_res, "roc_auc")
# Select wflow for the model with best hyperparams
rf_tuned <- finalize_workflow(
  rf_wflow,
  parameters = best_fit_auc
)
86 / 90

Finalize

Show the outcomes!

set.seed(20201024)
rf_tuned_fit <- last_fit(rf_tuned, split_pbp)
rf_tuned_fit %>% # tuned model metrics
collect_metrics()
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.729
## 2 roc_auc binary 0.796
rf_compare_df # naive model metrics
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.728
## 2 roc_auc binary 0.797
87 / 90

Addendums

  • Model training/fitting (or simulation) is likely to be the most time-intensive computation you do - as such, it's a good idea to run it as a background job in RStudio

  • You can also turn on verbose reporting so you know where you are in the cross-validation or tuning steps

    • control_grid(verbose = TRUE)

Live Demo

  • Start a background job in RStudio
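One way to script that instead of clicking through the Jobs pane is rstudioapi; a minimal sketch, assuming your fitting code lives in a separate fit-models.R file (hypothetical name):

# launch fit-models.R as an RStudio background job, copying the current
# environment in and sending results back to the global environment
rstudioapi::jobRunScript(
  path = "fit-models.R",
  name = "fit resamples",
  importEnv = TRUE,
  exportEnv = "R_GlobalEnv"
)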
88 / 90

Going Deeper

Tidy Modeling with R - get started quickly with tidymodels

Introduction to Statistical Learning - understand the math (new edition out shortly!)

Hands on Machine Learning with R - get started quickly with modeling in R (mix of base R, caret, and tidymodels)

89 / 90

Thank you

  • nflscrapR & nflfastR teams 🏈 💻
  • CMSAC team: 🏆
  • All y'all for listening in 🤠

Learn more

90 / 90
