library(tidyverse)
library(lubridate)
library(dplyr) #for getting API data
library(httr) #for converting API data into a format R can understand
library(jsonlite) #for converting API data into a format R can understand
library(googlesheets4) #for writing data to a Google sheet
Formula1 Data Analysis
About this notebook
This is a Quarto website. To learn more about Quarto websites visit https://quarto.org/docs/websites.
Hi, my name is Athena Hawkins.
I’m building a website with F1 stats using Formula1 data stored in the Ergast API. I created this notebook so I could easily update my website with driver and constructor standings after each round, or race, of the season.
To do this, I’ll pull the data from the API, convert into a format R can understand, and then upload it to a Google sheet, which will then update my website.
This notebook shows how I get the data from the API and organize the data for my website. Since I’ve never used an API before, I used AI to help me troubleshoot errors. If I used AI to help me with any code, I’ll explain what each line of code does.
Let’s start by installing some libraries we need.
Create functions for standings data
Now I’m going to create a function that will allow me to get updated driver standings (how many points a driver has so far) after any given race in a season. I’ll get updated data after each race so I can update the standings on my website.
Here’s my function. It looks like a lot, but I’m going to break it down step by step below.
<- function(year, race) {
getDriverStandings <- paste0("http://ergast.com/api/f1/", year, "/", race, "/driverStandings.json?limit=100")
url <- GET(url)
raw_dstandings #get the raw content from the API for driverstandings
<- content(raw_dstandings, as = "text") |>
dstandings_data fromJSON()
#convert raw content into data R can read
if (
!is.null(dstandings_data$MRData$StandingsTable$StandingsLists$DriverStandings[[1]]))
# condition for when the data does exist
{<- dstandings_data$MRData$StandingsTable$StandingsLists$DriverStandings[[1]]
dstandings # extract the first response from the driver stanidngs list within all the standing tables
<- dstandings |>
clean_driver_standings mutate(
driverId = Driver$driverId,
constructorId = Constructors$constructorId
|>
) select(-positionText, -Driver, -Constructors) |>
mutate(across(c("position", "points", "wins"), as.integer))
# get the driverID from the Driver table and create acolumn for it and mutate numeric columns to read as integers
return(clean_driver_standings)
# the return for this condition
}
else {
stop("Driver standings data not found or has unexpected structure.")
}# message to be returned if data can't be found, especially if the race hasn't happened or there is missing data in the API
}
The standings function result
And here’s the result of that function when we feed it a year and a round.
<- getDriverStandings(2023, 22)
currentdstandings
|> head(10) currentdstandings
position points wins driverId
1 1 575 19 max_verstappen
2 2 285 2 perez
3 3 234 0 hamilton
4 4 206 0 alonso
5 5 206 0 leclerc
6 6 205 0 norris
7 7 200 1 sainz
8 8 175 0 russell
9 9 97 0 piastri
10 10 74 0 stroll
Getting the standings data from an API
The first few lines of code in our function set up the API call. Here’s what that means:
When an API is queried (when we ask it for data), it responds with the data in a structured format, commonly JSON. The Ergast API we’re using offers data in several formats, but I wanted to work with JSON, so we added a .json to the url tibble.
I wanted to create a function that would let me easily make new API calls based on the year and the race. I left spaces for the variables year and race in the url tibble.
I wrote the rest of the call according to the Ergast documentation along with the GET function. Here it is.
Then I used the
fromJSON()
function to put the content into a data frame in an R tibble.
Extracting nested data - The API data often contains nested structures, where certain fields are lists or objects containing multiple elements. You can check out the structure of the Ergast API data here.
In R (or many programming languages), the
$
operator is used to access components within a structure, like a list or a JSON object. That’s why we used thefromJSON()
function, which I found in R documentation.I asked ChatGPT to help me write this line after I gave it specific information about the structure of the data and asked to extract the constructor standings. Here’s what it told me:
The line
standings_data$MRData$StandingsTable$StandingsLists$ConstructorStandings
is a chain of references to access nested components within standings_data which allows us to specify that we want the constructor standings table.Once at a certain level of the nested structure, sometimes you might need to access specific elements within a list or object. That’s where the
[[ ]]
operator comes from. It’s used in R to extract elements from a list by specifying an index or a key.[[1]]
is the key we used to access the first element of the list, constructor standings. In our case. this is the first response for “constructor standings” that comes back when we query the API.
The same function but for Constructors standings
I used the same structure to create a formula for getting standings for the constructors (teams). Here’s what it looks like.
<- function(year, race) {
getConstructorStandings <- paste0("http://ergast.com/api/f1/", year,"/", race, "/constructorStandings.json?limit=100")
url <- GET(url)
cstandings_raw
<- content(cstandings_raw, as = "text") |>
cstandings_data fromJSON()
if (!is.null(cstandings_data$MRData$StandingsTable$StandingsLists$ConstructorStandings[[1]])) {
<- cstandings_data$MRData$StandingsTable$StandingsLists$ConstructorStandings[[1]]
cstandings
<- cstandings |>
cstandings_clean mutate(
constructorId = Constructor$constructorId
%>%
) select(-Constructor, -positionText) |>
mutate(across(c("position", "points", "wins"), as.integer))
return(cstandings_clean)
else {
} stop("Constructor standings data not found or has unexpected structure.")
} }
Constructors standings results
Here are the constructor standings we got. I hate that Ferrari is in second. But all looks correct!
<- getConstructorStandings(2023, 22)
currentcstandings
|> head(5) currentcstandings
position points wins constructorId
1 1 860 21 red_bull
2 2 409 0 mercedes
3 3 406 1 ferrari
4 4 302 0 mclaren
5 5 280 0 aston_martin
Writing results to Google Sheets
We installed the googlesheets4 package in our setup. It’s part of tidyverse and it’s really cool. You can read this documentation here to learn more about how to write data to an existing sheet.
Write drivers standings to google sheet
I’m going to put the url for the Google sheet with my drivers data into a tibble, then use the range_write() funcion to import the current standings to the range I want!
# Add Google sheet URL
<- "https://docs.google.com/spreadsheets/d/1ixos7uWmrzk0IJBtCcbNDnBh5UO6FDklzT7eTd1MXgE/edit#gid=0"
driversheet_url
# Authenticate my Google account
gs4_auth()
! Using an auto-discovered, cached token.
To suppress this message, modify your code or options to clearly consent to
the use of a cached token.
See gargle's "Non-interactive auth" vignette for more details:
<https://gargle.r-lib.org/articles/non-interactive-auth.html>
ℹ The googlesheets4 package is using a cached token for
'athenahawkins@utexas.edu'.
# Read the existing Google Sheet into R
<- gs4_get(driversheet_url) existing_driversheet
Auto-refreshing stale OAuth token.
# Write the data frame to the Google Sheet
|>
existing_driversheet range_write(currentdstandings, range = "A1:D23")
✔ Editing "driverstats".
✔ Writing to sheet 'bios'.
Here’s a link to the Google sheet. The first four rows of data come from this notebook!
Write constructor standings to Google sheets
I’m going to do the same thing here for constructor standings.
# Add Google sheet URL
<- "https://docs.google.com/spreadsheets/d/1NjnxaNQst0E1y6l-e4GTIqDsM2sRLll3ekCWpcG-lDo/edit?usp=sharing"
constructorsheet_url
# Authenticate my Google account
gs4_auth()
! Using an auto-discovered, cached token.
To suppress this message, modify your code or options to clearly consent to
the use of a cached token.
See gargle's "Non-interactive auth" vignette for more details:
<https://gargle.r-lib.org/articles/non-interactive-auth.html>
ℹ The googlesheets4 package is using a cached token for
'athenahawkins@utexas.edu'.
# Read the existing Google Sheet into R
<- gs4_get(constructorsheet_url)
existing_constructorsheet
# Write the data frame to the Google Sheet
|>
existing_constructorsheet range_write(currentcstandings, range = "A1:D11")
✔ Editing "constructors".
✔ Writing to sheet 'teams'.
Get circuits data
I’m going to download the circuits data because I’ll need to pull it each race anyways to get the most recent past winner.
<- read_csv("data-raw/circuits.csv")
circuits_raw <- read_csv("data-raw/lap_times.csv")
results_raw <- read_csv("data-raw/races.csv")
races_raw <- read_csv("data-raw/drivers.csv") drivers_raw
Get circuit win data
Now I want to get data for the past winners at each circuit, plus the circuit location. I’ll build these into my baked circuit pages.
<- results_raw |>
race_winners group_by(raceId) |>
filter(lap == max(lap)) |> # Filter for the last lap
filter(position == 1) |> # Filter for position 1 in the last lap
ungroup()
# Joining race_winners with results_raw based on raceId to get circuitId
<- race_winners |>
race_winners_with_circuitId left_join(races_raw |> select(raceId, circuitId, date), by = "raceId")
<- race_winners_with_circuitId |>
race_winners_with_circuitName left_join(circuits_raw |> select(circuitId, name, location,country), by ="circuitId")
<- race_winners_with_circuitName |>
race_winners_per_circuit left_join(drivers_raw |> select(driverId, surname, forename), by ='driverId')
<- race_winners_per_circuit |>
race_winners_circuit_name_years mutate(yr = year(date),
winner_name = paste(forename, surname, sep = " "))
<- race_winners_circuit_name_years |>
circuit_race_winners filter(yr >= 2016) |>
select(name, winner_name, yr)
Check the result
Let’s check to make sure we have the name of each race, the year it happened, and the winner.
|> head(10) circuit_race_winners
# A tibble: 10 × 3
name winner_name yr
<chr> <chr> <dbl>
1 Albert Park Grand Prix Circuit Nico Rosberg 2016
2 Bahrain International Circuit Nico Rosberg 2016
3 Shanghai International Circuit Nico Rosberg 2016
4 Sochi Autodrom Nico Rosberg 2016
5 Circuit de Barcelona-Catalunya Max Verstappen 2016
6 Circuit de Monaco Lewis Hamilton 2016
7 Circuit Gilles Villeneuve Lewis Hamilton 2016
8 Baku City Circuit Nico Rosberg 2016
9 Red Bull Ring Lewis Hamilton 2016
10 Silverstone Circuit Lewis Hamilton 2016
I also want a tibble for just the most recent race winners for a card on the website.
<- circuit_race_winners |>
recent_circuit_winnersfilter(yr == 2023)
|> head(10) recent_circuit_winners
# A tibble: 10 × 3
name winner_name yr
<chr> <chr> <dbl>
1 Bahrain International Circuit Max Verstappen 2023
2 Jeddah Corniche Circuit Sergio Pérez 2023
3 Albert Park Grand Prix Circuit Max Verstappen 2023
4 Baku City Circuit Sergio Pérez 2023
5 Miami International Autodrome Max Verstappen 2023
6 Circuit de Monaco Max Verstappen 2023
7 Circuit de Barcelona-Catalunya Max Verstappen 2023
8 Circuit Gilles Villeneuve Max Verstappen 2023
9 Red Bull Ring Max Verstappen 2023
10 Silverstone Circuit Max Verstappen 2023
Write circuits to Google sheets
recent_circuit_winners
# A tibble: 22 × 3
name winner_name yr
<chr> <chr> <dbl>
1 Bahrain International Circuit Max Verstappen 2023
2 Jeddah Corniche Circuit Sergio Pérez 2023
3 Albert Park Grand Prix Circuit Max Verstappen 2023
4 Baku City Circuit Sergio Pérez 2023
5 Miami International Autodrome Max Verstappen 2023
6 Circuit de Monaco Max Verstappen 2023
7 Circuit de Barcelona-Catalunya Max Verstappen 2023
8 Circuit Gilles Villeneuve Max Verstappen 2023
9 Red Bull Ring Max Verstappen 2023
10 Silverstone Circuit Max Verstappen 2023
# ℹ 12 more rows
I’m going to do the same thing I did with Google sheets for my drivers and constructors standings, but this time with circuits!
Let’s start with just the recent winners.
# Add Google sheet URL
<- "https://docs.google.com/spreadsheets/d/1VOS_3RxH2i3Y7__A1VfdbmX_RPi7JmTa6WmIzfgne84/edit?usp=sharing"
recent_circuitsheet_url
# Read the existing Google Sheet into R
<- gs4_get(recent_circuitsheet_url)
recent_existing_circuitsheet
# Write the data frame to the Google Sheet
|>
recent_existing_circuitsheet range_write(recent_circuit_winners, range = "A1:C100")
✔ Editing "circuits".
✔ Writing to sheet 'tracks'.
And now for the all time stats.
# Add Google sheet URL
<- "https://docs.google.com/spreadsheets/d/1re902kjUgsT_eoqXD5Pk_KFj2uIumfIH_0qk9ax3S_g/edit?usp=sharing"
all_circuitsheet_url
# Authenticate my Google account
gs4_auth()
! Using an auto-discovered, cached token.
To suppress this message, modify your code or options to clearly consent to
the use of a cached token.
See gargle's "Non-interactive auth" vignette for more details:
<https://gargle.r-lib.org/articles/non-interactive-auth.html>
ℹ The googlesheets4 package is using a cached token for
'athenahawkins@utexas.edu'.
# Read the existing Google Sheet into R
<- gs4_get(all_circuitsheet_url)
existing_circuitsheet
# Write the data frame to the Google Sheet
|>
existing_circuitsheet range_write(circuit_race_winners)
✔ Editing "all-time circuits".
✔ Writing to sheet 'tracks'.