stravadata is an R package I use to organize and analyze my Strava activity data. This post offers some example analyses.

My examples use data on my running activities from 2018 through 2022:

library(dplyr)
library(lubridate)
library(stravadata)

runs = activities %>%
  filter(type == 'Run') %>%
  mutate(year = year(start_time),
         date = date(start_time)) %>%
  filter(year %in% 2018:2022)

Computing annual totals

runs contains activity-level features like distance traveled and time spent moving. I sum these features by year, convert the sums to kilometers and hours, then use knitr::kable to display them in a table:

library(knitr)

runs %>%
  group_by(Year = year) %>%
  summarise(Runs = n(),
            `Distance (km)` = sum(distance) / 1e3,
            `Time (hours)` = sum(time_moving) / 3600) %>%
  mutate_at(3:4, ~format(round(.), big.mark = ',')) %>%
  kable(align = 'crrr')
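
| Year | Runs | Distance (km) | Time (hours) |
|:----:|-----:|--------------:|-------------:|
| 2018 |   68 |           544 |           52 |
| 2019 |  152 |         1,085 |           92 |
| 2020 |  224 |         2,026 |          172 |
| 2021 |  207 |         2,149 |          173 |
| 2022 |  145 |         1,517 |          120 |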

Making activity heat maps

I record my runs with a watch that tracks my GPS coordinates. stravadata stores these coordinates in streams. For example, here’s the course for last year’s Moonlight Run in Palo Alto:

library(ggplot2)

p = runs %>%
  filter(name == 'Moonlight Run' & year == 2022) %>%
  select(id) %>%
  left_join(streams, by = 'id') %>%
  ggplot(aes(lon, lat)) +
  geom_path()

plot_nicely(p)  # Add text and formatting

Combining the GPS coordinates from many runs yields a local map. For example, suppose I want to map my runs near Stanford. I first make a table of GPS paths near a local landmark:

coords = c(-122.16, 37.44)  # Trader Joe's
tol = 0.08

stanford_paths = streams %>%
  semi_join(runs, by = 'id') %>%
  mutate(step = row_number()) %>%
  filter(sqrt((lon - coords[1]) ^ 2 + (lat - coords[2]) ^ 2) < tol) %>%
  filter(lon != lag(lon) | lat != lag(lat)) %>%  # Remove pauses
  mutate(new_path = row_number() == 1 | id != lag(id) | step != lag(step) + 1) %>%
  mutate(path = cumsum(new_path)) %>%
  select(path, lat, lon)

I increment path every time I start a new run, unpause a previous run, or re-enter the area defined by coords and tol. I use path as a grouping variable so that ggplot2::ggplot knows to draw each path separately. I then use the alpha argument of ggplot2::geom_path to create a “heat map” of paths I run most often:

p = stanford_paths %>%
  ggplot(aes(lon, lat, group = path)) +
  geom_path(alpha = 0.1)

plot_nicely(p)
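
To see how the path counter behaves, here is a minimal sketch on a made-up table (toy and its values are hypothetical, not taken from stravadata). The step column has a gap where rows were filtered out, and the id column changes where a new activity starts:

```r
library(dplyr)

# Hypothetical stream: two activities, with a gap in `step` (2 -> 5)
# standing in for points removed by the distance or pause filters
toy = tibble(
  id   = c(1, 1, 1, 1, 2, 2),
  step = c(1, 2, 5, 6, 7, 8)
)

toy_paths = toy %>%
  # A new path starts on the first row, when the activity changes,
  # or when consecutive rows are not consecutive steps
  mutate(new_path = row_number() == 1 | id != lag(id) | step != lag(step) + 1) %>%
  mutate(path = cumsum(new_path))

toy_paths$path  # 1 1 2 2 3 3
```

The gap in step splits activity 1 into two paths, and the change in id starts a third. Without this split, geom_path would draw a straight line across each gap.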

Counting efforts

best_efforts stores my fastest times over a range of distances (which Strava calls “efforts”) within each activity:

head(best_efforts)
## # A tibble: 6 × 4
##           id effort   start_index end_index
##        <dbl> <chr>          <int>     <int>
## 1 1253004287 1 mile            15       447
## 2 1253004287 1/2 mile          11       232
## 3 1253004287 1k                12       284
## 4 1253004287 2 mile            11       876
## 5 1253004287 400m              11       120
## 6 1253004287 5k                11      1342

The id column stores activity IDs and the effort column stores effort descriptions. I focus on 5k, 10k, and half marathon efforts:

focal_efforts = c('5k', '10k', 'Half-Marathon')

efforts = runs %>%
  left_join(best_efforts, by = 'id') %>%
  filter(effort %in% focal_efforts) %>%
  mutate(effort = factor(effort, focal_efforts)) %>%
  select(year, date, id, effort, start_index, end_index)

efforts inherits the year variable from runs. I use this variable to count efforts within each year. I then use tidyr::spread and knitr::kable to display these counts in a table:

library(tidyr)

efforts %>%
  count(Year = year, effort) %>%
  spread(effort, n, fill = 0) %>%
  kable(align = 'c')

| Year | 5k  | 10k | Half-Marathon |
|:----:|:---:|:---:|:-------------:|
| 2018 | 64  | 24  |       0       |
| 2019 | 136 | 34  |       2       |
| 2020 | 191 | 88  |      21       |
| 2021 | 200 | 90  |      25       |
| 2022 | 131 | 85  |       9       |
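
tidyr::spread still works, but tidyr now marks it as superseded by tidyr::pivot_wider. As a sketch of the equivalent call, here is the same count-then-widen pattern on a small made-up table (toy_efforts and its values are hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical efforts: three in 2018, one in 2019
toy_efforts = tibble(
  year   = c(2018, 2018, 2018, 2019),
  effort = factor(c('5k', '5k', '10k', '5k'), levels = c('5k', '10k'))
)

toy_counts = toy_efforts %>%
  count(Year = year, effort) %>%
  # values_fill = 0 plays the role of fill = 0 in spread
  pivot_wider(names_from = effort, values_from = n, values_fill = 0)

toy_counts
```

The year 2019 has no 10k effort in the toy data, so its 10k cell is filled with zero rather than NA.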

Making training calendars

efforts also inherits the date variable from runs. I use this variable to create GitHub-esque training calendars. For example, here’s my running calendar for 2021:

p = efforts %>%
  filter(year == 2021) %>%
  group_by(date) %>%
  slice_max(effort) %>%
  distinct(effort) %>%  # I ran twice on some days
  mutate(Week = floor_date(date, 'weeks', week_start = 1),
         Weekday = wday(date, label = TRUE, week_start = 1)) %>%
  ggplot(aes(Week, Weekday)) +
  geom_tile(aes(alpha = effort), col = 'white', linewidth = 0.5)

plot_nicely(p)

I use lubridate::floor_date to identify weeks and lubridate::wday to identify weekdays. The col and linewidth arguments of ggplot2::geom_tile add space between tiles.
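
As a small illustration of how those two functions place a date on the calendar grid, here is a sketch with three hand-picked dates (toy_dates is a hypothetical name):

```r
library(lubridate)

# A Sunday, the following Monday, and the Sunday after that
toy_dates = as.Date(c('2021-01-03', '2021-01-04', '2021-01-10'))

# With week_start = 1, each date maps to the Monday that begins its week
toy_weeks = floor_date(toy_dates, 'weeks', week_start = 1)
toy_weeks  # "2020-12-28" "2021-01-04" "2021-01-04"

# Weekday position within the week, counting from Monday = 1
toy_weekdays = wday(toy_dates, week_start = 1)
toy_weekdays  # 7 1 7
```

Passing label = TRUE to wday returns an ordered factor of weekday names instead of numbers, which gives the calendar's vertical axis its Monday-to-Sunday order.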

Tracking personal records

I combine runs, streams, and efforts to track my record running paces over time. I follow a three-step process:

First, I compute the mean pace for each effort. I do this using the start_index and end_index columns that efforts inherits from best_efforts. These columns tell me where each effort occurs in the corresponding activity’s stream:

effort_paces = streams %>%
  filter(id %in% runs$id) %>%
  # Create indices
  group_by(id) %>%
  mutate(index = row_number()) %>%
  ungroup() %>%
  # Extract stream segment for each effort
  inner_join(efforts, by = 'id') %>%
  filter(index >= start_index & index <= end_index) %>%
  # Compute mean paces
  group_by(id, date, effort) %>%
  summarise(distance = max(distance) - min(distance),
            time = max(time) - min(time)) %>%
  ungroup() %>%
  mutate(pace = (time / 60) / (distance / 1e3))

head(effort_paces)
## # A tibble: 6 × 6
##           id date       effort distance  time  pace
##        <dbl> <date>     <fct>     <dbl> <dbl> <dbl>
## 1 1335437333 2018-01-01 5k        5002.  1442  4.81
## 2 1338123783 2018-01-03 5k        5000.  1605  5.35
## 3 1344338907 2018-01-07 5k        5000.  1455  4.85
## 4 1347622521 2018-01-09 5k        5000   1493  4.98
## 5 1353889714 2018-01-13 5k        5001.  1622  5.41
## 6 1353889714 2018-01-13 10k      10001.  3380  5.63

The values in the distance column differ slightly from the descriptions in the effort column. This is because the stream segment doesn’t always cover the described distance exactly. But the multiplicative errors in distance and time should be equal on average, making pace an unbiased estimate of my true mean pace. I measure this pace in minutes per kilometer.
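
The unit conversion and the cancellation argument can be checked with a short sketch. The helper pace_min_per_km is a hypothetical name introduced here for illustration; it applies the same formula as the mutate call above to made-up inputs:

```r
# Pace in minutes per kilometer, from time in seconds and distance in meters
pace_min_per_km = function(time, distance) (time / 60) / (distance / 1e3)

# 1,500 seconds (25 minutes) over 5,000 meters (5 km)
exact_pace = pace_min_per_km(1500, 5000)
exact_pace  # 5

# A segment that overshoots by 1% in both time and distance
# gives (up to floating-point error) the same pace
overshoot_pace = pace_min_per_km(1500 * 1.01, 5000 * 1.01)
```

The common factor cancels in the ratio, which is why a segment slightly longer or shorter than the nominal distance still gives a sensible pace.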

Next, I extract my record paces by deleting efforts slower than my previous best:

record_paces = effort_paces %>%
  group_by(effort) %>%
  arrange(date) %>%
  filter(pace == cummin(pace)) %>%
  ungroup()
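
To make the cummin filter concrete, here is a minimal sketch on a made-up series of paces for a single effort (toy_paces and its values are hypothetical):

```r
library(dplyr)

# Hypothetical paces (minutes per kilometer) on five consecutive days
toy_paces = tibble(
  date = as.Date('2018-01-01') + 0:4,
  pace = c(5.2, 5.4, 5.0, 5.1, 4.8)
)

toy_records = toy_paces %>%
  arrange(date) %>%
  # cummin(pace) is the best pace so far: 5.2 5.2 5.0 5.0 4.8
  filter(pace == cummin(pace))

toy_records$pace  # 5.2 5.0 4.8
```

Only the days on which the pace equals the running minimum survive. Note that an effort that exactly ties the current record would also be kept.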

Finally, I “fill in the gaps” by adding days on which I don’t set a new record. I do this using tidyr::crossing and tidyr::fill:

date_range = seq(date('2018-01-01'), date('2022-12-31'), by = 'day')

record_paces_filled = crossing(date = date_range, effort = focal_efforts) %>%
  left_join(record_paces, by = c('date', 'effort')) %>%  # Name the join keys explicitly
  group_by(effort) %>%
  fill(pace) %>%
  filter(!is.na(pace))
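
The gap-filling step can be seen on a toy example. This is a sketch with made-up records (toy_records and its values are hypothetical), following the same crossing, join, and fill pattern:

```r
library(dplyr)
library(tidyr)

# Hypothetical records: one set on January 2, a faster one on January 4
toy_records = tibble(
  date   = as.Date(c('2018-01-02', '2018-01-04')),
  effort = '5k',
  pace   = c(5.0, 4.8)
)

toy_filled = crossing(date = as.Date('2018-01-01') + 0:4, effort = '5k') %>%
  left_join(toy_records, by = c('date', 'effort')) %>%  # pace: NA 5.0 NA 4.8 NA
  group_by(effort) %>%
  fill(pace) %>%                                         # pace: NA 5.0 5.0 4.8 4.8
  filter(!is.na(pace)) %>%                               # Drop days before the first record
  ungroup()

toy_filled$pace  # 5.0 5.0 4.8 4.8
```

fill carries each record forward until the next one, and the final filter drops the days before any record exists.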

record_paces and record_paces_filled differ in that the latter includes date-effort pairs with no new records. This makes record_paces_filled produce horizontal lines when I plot its data:

p = record_paces_filled %>%
  ggplot(aes(date, pace, group = effort)) +
  geom_line()

plot_nicely(p)