Formula1 Data Analysis

About this notebook

This is a Quarto website. To learn more about Quarto websites visit https://quarto.org/docs/websites.

Hi, my name is Athena Hawkins.

  • I’m building a website with F1 stats using Formula1 data stored in the Ergast API. I created this notebook so I could easily update my website with driver and constructor standings after each round, or race, of the season.

  • To do this, I’ll pull the data from the API, convert into a format R can understand, and then upload it to a Google sheet, which will then update my website.

  • This notebook shows how I get the data from the API and organize the data for my website. Since I’ve never used an API before, I used AI to help me troubleshoot errors. If I used AI to help me with any code, I’ll explain what each line of code does.

  • Let’s start by installing some libraries we need.

library(tidyverse)
library(lubridate)
library(dplyr) #for getting API data
library(httr) #for converting API data into a format R can understand
library(jsonlite) #for converting API data into a format R can understand
library(googlesheets4) #for writing data to a Google sheet

Create functions for standings data

Now I’m going to create a function that will allow me to get updated driver standings (how many points a driver has so far) after any given race in a season. I’ll get updated data after each race so I can update the standings on my website.

Here’s my function. It looks like a lot, but I’m going to break it down step by step below.

getDriverStandings <- function(year, race) {
  url <- paste0("http://ergast.com/api/f1/", year, "/", race, "/driverStandings.json?limit=100")
  raw_dstandings <- GET(url)
  #get the raw content from the API for driverstandings
  
  dstandings_data <- content(raw_dstandings, as = "text") |> 
    fromJSON()
  #convert raw content into data R can read

  if (
    !is.null(dstandings_data$MRData$StandingsTable$StandingsLists$DriverStandings[[1]])) 
  # condition for when the data does exist
    
    {
    dstandings <- dstandings_data$MRData$StandingsTable$StandingsLists$DriverStandings[[1]]
  # extract the first response from the driver stanidngs list within all the standing tables

    clean_driver_standings <- dstandings |> 
      mutate(
        driverId = Driver$driverId,
        constructorId = Constructors$constructorId
      ) |> 
      select(-positionText, -Driver, -Constructors) |> 
      mutate(across(c("position", "points", "wins"), as.integer))
  # get the driverID from the Driver table and create acolumn for it and mutate numeric columns to read as integers

    return(clean_driver_standings)
  # the return for this condition
  } 
  
  else {
    stop("Driver standings data not found or has unexpected structure.")
  }
  # message to be returned if data can't be found, especially if the race hasn't happened or there is missing data in the API
}

The standings function result

And here’s the result of that function when we feed it a year and a round.

currentdstandings <- getDriverStandings(2023, 22)

currentdstandings |> head(10)
   position points wins       driverId
1         1    575   19 max_verstappen
2         2    285    2          perez
3         3    234    0       hamilton
4         4    206    0         alonso
5         5    206    0        leclerc
6         6    205    0         norris
7         7    200    1          sainz
8         8    175    0        russell
9         9     97    0        piastri
10       10     74    0         stroll

Getting the standings data from an API

The first few lines of code in our function set up the API call. Here’s what that means:

  • When an API is queried (when we ask it for data), it responds with the data in a structured format, commonly JSON. The Ergast API we’re using offers data in several formats, but I wanted to work with JSON, so we added a .json to the url tibble.

  • I wanted to create a function that would let me easily make new API calls based on the year and the race. I left spaces for the variables year and race in the url tibble.

  • I wrote the rest of the call according to the Ergast documentation along with the GET function. Here it is.

  • Then I used the fromJSON() function to put the content into a data frame in an R tibble.

Extracting nested data - The API data often contains nested structures, where certain fields are lists or objects containing multiple elements. You can check out the structure of the Ergast API data here.

  • In R (or many programming languages), the $ operator is used to access components within a structure, like a list or a JSON object. That’s why we used the fromJSON() function, which I found in R documentation.

  • I asked ChatGPT to help me write this line after I gave it specific information about the structure of the data and asked to extract the constructor standings. Here’s what it told me:

  • The line standings_data$MRData$StandingsTable$StandingsLists$ConstructorStandings is a chain of references to access nested components within standings_data which allows us to specify that we want the constructor standings table.

  • Once at a certain level of the nested structure, sometimes you might need to access specific elements within a list or object. That’s where the [[ ]] operator comes from. It’s used in R to extract elements from a list by specifying an index or a key.

  • [[1]] is the key we used to access the first element of the list, constructor standings. In our case. this is the first response for “constructor standings” that comes back when we query the API.

The same function but for Constructors standings

I used the same structure to create a formula for getting standings for the constructors (teams). Here’s what it looks like.

getConstructorStandings <- function(year, race) {
url <- paste0("http://ergast.com/api/f1/", year,"/", race, "/constructorStandings.json?limit=100")
  cstandings_raw <- GET(url)

  cstandings_data <- content(cstandings_raw, as = "text") |> 
    fromJSON()

  if (!is.null(cstandings_data$MRData$StandingsTable$StandingsLists$ConstructorStandings[[1]])) {
    cstandings <- cstandings_data$MRData$StandingsTable$StandingsLists$ConstructorStandings[[1]]

   cstandings_clean <- cstandings |> 
      mutate(
        constructorId = Constructor$constructorId
      ) %>%
      select(-Constructor, -positionText) |> 
      mutate(across(c("position", "points", "wins"), as.integer))

    return(cstandings_clean)
  } else {
    stop("Constructor standings data not found or has unexpected structure.")
  }
}

Constructors standings results

Here are the constructor standings we got. I hate that Ferrari is in second. But all looks correct!

currentcstandings <- getConstructorStandings(2023, 22)

currentcstandings |> head(5)
  position points wins constructorId
1        1    860   21      red_bull
2        2    409    0      mercedes
3        3    406    1       ferrari
4        4    302    0       mclaren
5        5    280    0  aston_martin

Writing results to Google Sheets

We installed the googlesheets4 package in our setup. It’s part of tidyverse and it’s really cool. You can read this documentation here to learn more about how to write data to an existing sheet.

Write drivers standings to google sheet

I’m going to put the url for the Google sheet with my drivers data into a tibble, then use the range_write() funcion to import the current standings to the range I want!

# Add Google sheet URL
driversheet_url <- "https://docs.google.com/spreadsheets/d/1ixos7uWmrzk0IJBtCcbNDnBh5UO6FDklzT7eTd1MXgE/edit#gid=0"

# Authenticate my Google account
gs4_auth()
! Using an auto-discovered, cached token.
  To suppress this message, modify your code or options to clearly consent to
  the use of a cached token.
  See gargle's "Non-interactive auth" vignette for more details:
  <https://gargle.r-lib.org/articles/non-interactive-auth.html>
ℹ The googlesheets4 package is using a cached token for
  'athenahawkins@utexas.edu'.
# Read the existing Google Sheet into R
existing_driversheet <- gs4_get(driversheet_url)
Auto-refreshing stale OAuth token.
# Write the data frame to the Google Sheet
existing_driversheet |>  
  range_write(currentdstandings, range = "A1:D23")
✔ Editing "driverstats".
✔ Writing to sheet 'bios'.

Here’s a link to the Google sheet. The first four rows of data come from this notebook!

Write constructor standings to Google sheets

I’m going to do the same thing here for constructor standings.

# Add Google sheet URL
constructorsheet_url <- "https://docs.google.com/spreadsheets/d/1NjnxaNQst0E1y6l-e4GTIqDsM2sRLll3ekCWpcG-lDo/edit?usp=sharing"

# Authenticate my Google account
gs4_auth()
! Using an auto-discovered, cached token.
  To suppress this message, modify your code or options to clearly consent to
  the use of a cached token.
  See gargle's "Non-interactive auth" vignette for more details:
  <https://gargle.r-lib.org/articles/non-interactive-auth.html>
ℹ The googlesheets4 package is using a cached token for
  'athenahawkins@utexas.edu'.
# Read the existing Google Sheet into R
existing_constructorsheet <- gs4_get(constructorsheet_url)

# Write the data frame to the Google Sheet
existing_constructorsheet |>  
  range_write(currentcstandings, range = "A1:D11")
✔ Editing "constructors".
✔ Writing to sheet 'teams'.

Get circuits data

I’m going to download the circuits data because I’ll need to pull it each race anyways to get the most recent past winner.

circuits_raw <- read_csv("data-raw/circuits.csv")
results_raw <- read_csv("data-raw/lap_times.csv")
races_raw <- read_csv("data-raw/races.csv")
drivers_raw <- read_csv("data-raw/drivers.csv")

Get circuit win data

Now I want to get data for the past winners at each circuit, plus the circuit location. I’ll build these into my baked circuit pages.

race_winners <- results_raw |> 
  group_by(raceId) |> 
  filter(lap == max(lap)) |>  # Filter for the last lap
  filter(position == 1) |>    # Filter for position 1 in the last lap
  ungroup()

# Joining race_winners with results_raw based on raceId to get circuitId
race_winners_with_circuitId <- race_winners |> 
  left_join(races_raw |>  select(raceId, circuitId, date), by = "raceId")

race_winners_with_circuitName <- race_winners_with_circuitId |>
  left_join(circuits_raw |> select(circuitId, name, location,country), by ="circuitId")

race_winners_per_circuit <- race_winners_with_circuitName |> 
  left_join(drivers_raw |> select(driverId, surname, forename), by ='driverId')
  
race_winners_circuit_name_years <- race_winners_per_circuit |> 
  mutate(yr = year(date),
        winner_name = paste(forename, surname, sep = " "))

circuit_race_winners <- race_winners_circuit_name_years |> 
  filter(yr >= 2016) |> 
  select(name, winner_name, yr)

Check the result

Let’s check to make sure we have the name of each race, the year it happened, and the winner.

circuit_race_winners |> head(10)
# A tibble: 10 × 3
   name                           winner_name       yr
   <chr>                          <chr>          <dbl>
 1 Albert Park Grand Prix Circuit Nico Rosberg    2016
 2 Bahrain International Circuit  Nico Rosberg    2016
 3 Shanghai International Circuit Nico Rosberg    2016
 4 Sochi Autodrom                 Nico Rosberg    2016
 5 Circuit de Barcelona-Catalunya Max Verstappen  2016
 6 Circuit de Monaco              Lewis Hamilton  2016
 7 Circuit Gilles Villeneuve      Lewis Hamilton  2016
 8 Baku City Circuit              Nico Rosberg    2016
 9 Red Bull Ring                  Lewis Hamilton  2016
10 Silverstone Circuit            Lewis Hamilton  2016

I also want a tibble for just the most recent race winners for a card on the website.

recent_circuit_winners<- circuit_race_winners |> 
  filter(yr == 2023)

recent_circuit_winners |> head(10)
# A tibble: 10 × 3
   name                           winner_name       yr
   <chr>                          <chr>          <dbl>
 1 Bahrain International Circuit  Max Verstappen  2023
 2 Jeddah Corniche Circuit        Sergio Pérez    2023
 3 Albert Park Grand Prix Circuit Max Verstappen  2023
 4 Baku City Circuit              Sergio Pérez    2023
 5 Miami International Autodrome  Max Verstappen  2023
 6 Circuit de Monaco              Max Verstappen  2023
 7 Circuit de Barcelona-Catalunya Max Verstappen  2023
 8 Circuit Gilles Villeneuve      Max Verstappen  2023
 9 Red Bull Ring                  Max Verstappen  2023
10 Silverstone Circuit            Max Verstappen  2023

Write circuits to Google sheets

recent_circuit_winners
# A tibble: 22 × 3
   name                           winner_name       yr
   <chr>                          <chr>          <dbl>
 1 Bahrain International Circuit  Max Verstappen  2023
 2 Jeddah Corniche Circuit        Sergio Pérez    2023
 3 Albert Park Grand Prix Circuit Max Verstappen  2023
 4 Baku City Circuit              Sergio Pérez    2023
 5 Miami International Autodrome  Max Verstappen  2023
 6 Circuit de Monaco              Max Verstappen  2023
 7 Circuit de Barcelona-Catalunya Max Verstappen  2023
 8 Circuit Gilles Villeneuve      Max Verstappen  2023
 9 Red Bull Ring                  Max Verstappen  2023
10 Silverstone Circuit            Max Verstappen  2023
# ℹ 12 more rows

I’m going to do the same thing I did with Google sheets for my drivers and constructors standings, but this time with circuits!

Let’s start with just the recent winners.

# Add Google sheet URL
recent_circuitsheet_url <- "https://docs.google.com/spreadsheets/d/1VOS_3RxH2i3Y7__A1VfdbmX_RPi7JmTa6WmIzfgne84/edit?usp=sharing"

# Read the existing Google Sheet into R
recent_existing_circuitsheet <- gs4_get(recent_circuitsheet_url)

# Write the data frame to the Google Sheet
recent_existing_circuitsheet |>  
  range_write(recent_circuit_winners, range = "A1:C100")
✔ Editing "circuits".
✔ Writing to sheet 'tracks'.

And now for the all time stats.

# Add Google sheet URL
all_circuitsheet_url <- "https://docs.google.com/spreadsheets/d/1re902kjUgsT_eoqXD5Pk_KFj2uIumfIH_0qk9ax3S_g/edit?usp=sharing"

# Authenticate my Google account
gs4_auth()
! Using an auto-discovered, cached token.
  To suppress this message, modify your code or options to clearly consent to
  the use of a cached token.
  See gargle's "Non-interactive auth" vignette for more details:
  <https://gargle.r-lib.org/articles/non-interactive-auth.html>
ℹ The googlesheets4 package is using a cached token for
  'athenahawkins@utexas.edu'.
# Read the existing Google Sheet into R
existing_circuitsheet <- gs4_get(all_circuitsheet_url)

# Write the data frame to the Google Sheet
existing_circuitsheet |>  
  range_write(circuit_race_winners)
✔ Editing "all-time circuits".
✔ Writing to sheet 'tracks'.