stravadata is an R package I use to organize and analyze my Strava activity data. This post offers some example analyses:
- Computing annual totals
- Making activity heat maps
- Counting efforts
- Making training calendars
- Tracking personal records
My examples use data on my running activities from the last five years:
library(dplyr)
library(lubridate)
library(stravadata)
runs = activities %>%
  filter(type == 'Run') %>%
  mutate(year = year(start_time),
         date = date(start_time)) %>%
  filter(year %in% 2018:2022)
Computing annual totals
runs contains activity-level features like distance traveled and time spent moving.
I sum these features by year, then use knitr::kable to display these sums in a table:
library(knitr)
runs %>%
  group_by(Year = year) %>%
  summarise(Runs = n(),
            `Distance (km)` = sum(distance) / 1e3,
            `Time (hours)` = sum(time_moving) / 3600) %>%
  mutate_at(3:4, ~format(round(.), big.mark = ',')) %>%
  kable(align = 'crrr')
| Year | Runs | Distance (km) | Time (hours) |
|---|---|---|---|
| 2018 | 68 | 544 | 52 |
| 2019 | 152 | 1,085 | 92 |
| 2020 | 224 | 2,026 | 172 |
| 2021 | 207 | 2,149 | 173 |
| 2022 | 145 | 1,517 | 120 |
Making activity heat maps
I record my runs with a watch that tracks my GPS coordinates.
stravadata stores these coordinates in streams.
For example, here’s the course for last year’s Moonlight Run in Palo Alto:
library(ggplot2)
p = runs %>%
  filter(name == 'Moonlight Run' & year == 2022) %>%
  select(id) %>%
  left_join(streams, by = 'id') %>%
  ggplot(aes(lon, lat)) +
  geom_path()
plot_nicely(p) # Add text and formatting
Combining the GPS coordinates from many runs yields a local map. For example, suppose I want to map my runs near Stanford. I first make a table of GPS paths near a local landmark:
coords = c(-122.16, 37.44) # Trader Joe's
tol = 0.08
stanford_paths = streams %>%
  semi_join(runs, by = 'id') %>%
  mutate(step = row_number()) %>%
  filter(sqrt((lon - coords[1]) ^ 2 + (lat - coords[2]) ^ 2) < tol) %>%
  filter(lon != lag(lon) | lat != lag(lat)) %>% # Remove pauses
  mutate(new_path = row_number() == 1 | id != lag(id) | step != lag(step) + 1) %>%
  mutate(path = cumsum(new_path)) %>%
  select(path, lat, lon)
I increment path every time I start a new run, unpause a previous run, or re-enter the area defined by coords and tol.
I use path as a grouping variable so that ggplot2::ggplot knows to draw each path separately.
I then use the alpha argument of ggplot2::geom_path to create a “heat map” of paths I run most often:
p = stanford_paths %>%
  ggplot(aes(lon, lat, group = path)) +
  geom_path(alpha = 0.1)
plot_nicely(p)
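To make the path counter concrete, here's a minimal sketch on made-up data. The toy table and its values are hypothetical, not taken from stravadata. It applies the same new_path logic to six points from two runs, where the first run has a gap in its step sequence (as if I'd left the area and come back):

```r
library(dplyr)

# Hypothetical points: run 1 skips from step 2 to step 5, then run 2 begins
toy = tibble(
  id   = c(1, 1, 1, 1, 2, 2),
  step = c(1, 2, 5, 6, 7, 8)
)

toy_paths = toy %>%
  # Flag the first row, any change of run, and any break in consecutive steps
  mutate(new_path = row_number() == 1 | id != lag(id) | step != lag(step) + 1) %>%
  # Running total of flags gives one label per unbroken segment
  mutate(path = cumsum(new_path))

toy_paths$path
## [1] 1 1 2 2 3 3
```

The gap within run 1 and the switch to run 2 each start a new segment, so ggplot2 would draw three separate lines rather than joining them.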

Counting efforts
best_efforts stores my fastest times over a range of distances (which Strava calls “efforts”) within each activity:
head(best_efforts)
## # A tibble: 6 × 4
## id effort start_index end_index
## <dbl> <chr> <int> <int>
## 1 1253004287 1 mile 15 447
## 2 1253004287 1/2 mile 11 232
## 3 1253004287 1k 12 284
## 4 1253004287 2 mile 11 876
## 5 1253004287 400m 11 120
## 6 1253004287 5k 11 1342
The id column stores activity IDs and the effort column stores effort descriptions.
I focus on 5k, 10k, and half marathon efforts:
focal_efforts = c('5k', '10k', 'Half-Marathon')
efforts = runs %>%
  left_join(best_efforts, by = 'id') %>%
  filter(effort %in% focal_efforts) %>%
  mutate(effort = factor(effort, focal_efforts)) %>%
  select(year, date, id, effort, start_index, end_index)
efforts inherits the year variable from runs.
I use this variable to count efforts within each year.
I then use tidyr::spread and knitr::kable to display these counts in a table:
library(tidyr)
efforts %>%
  count(Year = year, effort) %>%
  spread(effort, n, fill = 0) %>%
  kable(align = 'c')
| Year | 5k | 10k | Half-Marathon |
|---|---|---|---|
| 2018 | 64 | 24 | 0 |
| 2019 | 136 | 34 | 2 |
| 2020 | 191 | 88 | 21 |
| 2021 | 200 | 90 | 25 |
| 2022 | 131 | 85 | 9 |
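The count-then-spread pattern is easy to check on a small table. Here's a sketch with hypothetical efforts (the toy_efforts data is made up for illustration), showing how fill = 0 handles a year-effort pair that never occurs:

```r
library(dplyr)
library(tidyr)

# Hypothetical efforts: no 10k in 2019
toy_efforts = tibble(
  year   = c(2018, 2018, 2018, 2019),
  effort = factor(c('5k', '5k', '10k', '5k'), levels = c('5k', '10k'))
)

toy_counts = toy_efforts %>%
  count(Year = year, effort) %>%   # One row per observed year-effort pair
  spread(effort, n, fill = 0)      # One column per effort; missing pairs become 0

toy_counts
```

The resulting table has one row per year and one column per effort level, ordered by the factor levels. The missing 2019 10k cell shows as 0 rather than NA.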
Making training calendars
efforts also inherits the date variable from runs.
I use this variable to create GitHub-esque training calendars.
For example, here’s my running calendar for 2021:
p = efforts %>%
  filter(year == 2021) %>%
  group_by(date) %>%
  slice_max(effort) %>%
  distinct(effort) %>% # I ran twice on some days
  mutate(Week = floor_date(date, 'weeks', week_start = 1),
         Weekday = wday(date, label = TRUE, week_start = 1)) %>%
  ggplot(aes(Week, Weekday)) +
  geom_tile(aes(alpha = effort), col = 'white', linewidth = 0.5)
plot_nicely(p)
I use lubridate::floor_date to identify weeks and lubridate::wday to identify weekdays.
The col and linewidth arguments of ggplot2::geom_tile add space between tiles.
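As a quick check of the week and weekday logic, here's a small sketch using one arbitrary date from 2021 (the date is chosen only for illustration). With week_start = 1, weeks begin on Monday:

```r
library(lubridate)

d = as.Date('2021-01-06')  # A Wednesday

# Round down to the Monday that starts the week
week = floor_date(d, 'weeks', week_start = 1)

# Position within the week: 1 = Monday, ..., 7 = Sunday
weekday = wday(d, week_start = 1)

week
## [1] "2021-01-04"
weekday
## [1] 3
```

Every date in the same Monday-to-Sunday week shares one Week value, so each week becomes a column of the calendar and each weekday a row.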
Tracking personal records
I combine runs, streams, and efforts to track my record running paces over time.
I follow a three-step process:
First, I compute the mean pace for each effort.
I do this using the start_index and end_index columns that efforts inherits from best_efforts.
These columns tell me where each effort occurs in the corresponding activity’s stream:
effort_paces = streams %>%
  filter(id %in% runs$id) %>%
  # Create indices
  group_by(id) %>%
  mutate(index = row_number()) %>%
  ungroup() %>%
  # Extract stream segment for each effort
  inner_join(efforts, by = 'id') %>%
  filter(index >= start_index & index <= end_index) %>%
  # Compute mean paces
  group_by(id, date, effort) %>%
  summarise(distance = max(distance) - min(distance),
            time = max(time) - min(time)) %>%
  ungroup() %>%
  mutate(pace = (time / 60) / (distance / 1e3))
head(effort_paces)
## # A tibble: 6 × 6
## id date effort distance time pace
## <dbl> <date> <fct> <dbl> <dbl> <dbl>
## 1 1335437333 2018-01-01 5k 5002. 1442 4.81
## 2 1338123783 2018-01-03 5k 5000. 1605 5.35
## 3 1344338907 2018-01-07 5k 5000. 1455 4.85
## 4 1347622521 2018-01-09 5k 5000 1493 4.98
## 5 1353889714 2018-01-13 5k 5001. 1622 5.41
## 6 1353889714 2018-01-13 10k 10001. 3380 5.63
The values in the distance column differ slightly from the descriptions in the effort column.
This is because the stream segment doesn’t always cover the described distance exactly.
But the multiplicative errors in distance and time should be equal on average, making pace an unbiased estimate of my true mean pace.
I measure this pace in minutes per kilometer.
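To see the unit conversion at work, here's the arithmetic for the third row above, taking its distance as a round 5,000 meters (the printed value is rounded, so this is an approximation). The minutes:seconds label is an extra convenience I've added for illustration:

```r
distance = 5000  # Meters (assumed exact for this sketch)
time = 1455      # Seconds

# Minutes elapsed divided by kilometers covered
pace = (time / 60) / (distance / 1e3)
pace
## [1] 4.85

# Express the same pace as minutes:seconds per kilometer
secs = round(pace * 60)
label = sprintf('%d:%02d', secs %/% 60, secs %% 60)
label
## [1] "4:51"
```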
Next, I extract my record paces by deleting efforts slower than my previous best:
record_paces = effort_paces %>%
  group_by(effort) %>%
  arrange(date) %>%
  filter(pace == cummin(pace)) %>%
  ungroup()
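The cummin filter keeps a row only when its pace equals the running minimum so far. Here's a minimal sketch on hypothetical paces (toy_paces is made up for illustration):

```r
library(dplyr)

# Hypothetical 5k paces in minutes per kilometer
toy_paces = tibble(
  date   = as.Date(c('2018-01-01', '2018-01-03', '2018-01-07', '2018-01-09')),
  effort = '5k',
  pace   = c(5.0, 5.3, 4.9, 4.95)
)

toy_records = toy_paces %>%
  group_by(effort) %>%
  arrange(date) %>%
  filter(pace == cummin(pace)) %>%  # Running minimum is 5.0, 5.0, 4.9, 4.9
  ungroup()

toy_records$pace
## [1] 5.0 4.9
```

The second and fourth efforts drop out because each is slower than the best pace set before it. Note that an effort that exactly ties the current record would be kept.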
Finally, I “fill in the gaps” by adding days on which I don’t set a new record.
I do this using tidyr::crossing and tidyr::fill:
date_range = seq(date('2018-01-01'), date('2022-12-31'), by = 'day')
record_paces_filled = crossing(date = date_range, effort = focal_efforts) %>%
  left_join(record_paces, by = c('date', 'effort')) %>%
  group_by(effort) %>%
  fill(pace) %>%
  filter(!is.na(pace))
record_paces and record_paces_filled differ in that the latter includes date-effort pairs with no new records.
This makes record_paces_filled produce horizontal lines when I plot its data:
p = record_paces_filled %>%
  ggplot(aes(date, pace, group = effort)) +
  geom_line()
plot_nicely(p)
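The gap-filling step can be checked on a toy table. This sketch uses two hypothetical records over a five-day range (all values are made up for illustration):

```r
library(dplyr)
library(tidyr)

# Hypothetical records: set on Jan 2, then beaten on Jan 4
toy_records = tibble(
  date   = as.Date(c('2018-01-02', '2018-01-04')),
  effort = '5k',
  pace   = c(5.0, 4.9)
)
toy_dates = seq(as.Date('2018-01-01'), as.Date('2018-01-05'), by = 'day')

toy_filled = crossing(date = toy_dates, effort = '5k') %>%
  left_join(toy_records, by = c('date', 'effort')) %>%  # NA where no record was set
  group_by(effort) %>%
  fill(pace) %>%             # Carry the latest record forward
  filter(!is.na(pace)) %>%   # Drop days before the first record
  ungroup()

toy_filled$pace
## [1] 5.0 5.0 4.9 4.9
```

January 1 disappears because no record exists yet, while January 3 and 5 inherit the record from the day before. Those repeated values are what show up as flat stretches in the plot.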