stravadata demo

stravadata is an R package I use to organize and analyze my Strava activity data. This post offers some example analyses:

Computing annual totals
Making activity heat maps
Counting efforts
Making training calendars
Tracking personal records

My examples use data on my running activities from the last five years:

library(dplyr)
library(lubridate)
library(stravadata)

runs = activities %>%
  filter(type == 'Run') %>%
  mutate(year = year(start_time),
         date = date(start_time)) %>%
  filter(year %in% 2018:2022)

Computing annual totals

runs contains activity-level features like distance traveled and time spent moving. I sum these features by year, then use knitr::kable to display these sums in a table:

library(knitr)

runs %>%
  group_by(Year = year) %>%
  summarise(Runs = n(),
            `Distance (km)` = sum(distance) / 1e3,
            `Time (hours)` = sum(time_moving) / 3600) %>%
  mutate_at(3:4, ~format(round(.), big.mark = ',')) %>%
  kable(align = 'crrr')

Year	Runs	Distance (km)	Time (hours)
2018	68	544	52
2019	152	1,085	92
2020	224	2,026	172
2021	207	2,149	173
2022	145	1,517	120

Making activity heat maps

I record my runs with a watch that tracks my GPS coordinates. stravadata stores these coordinates in streams. For example, here’s the course for last year’s Moonlight Run in Palo Alto:

library(ggplot2)

p = runs %>%
  filter(name == 'Moonlight Run' & year == 2022) %>%
  select(id) %>%
  left_join(streams, by = 'id') %>%
  ggplot(aes(lon, lat)) +
  geom_path()

plot_nicely(p)  # Add text and formatting

Combining the GPS coordinates from many runs yields a local map. For example, suppose I want to map my runs near Stanford. I first make a table of GPS paths near a local landmark:

coords = c(-122.16, 37.44)  # Trader Joe's
tol = 0.08

stanford_paths = streams %>%
  semi_join(runs, by = 'id') %>%
  mutate(step = row_number()) %>%
  filter(sqrt((lon - coords[1]) ^ 2 + (lat - coords[2]) ^ 2) < tol) %>%
  filter(lon != lag(lon) | lat != lag(lat)) %>%  # Remove pauses
  mutate(new_path = row_number() == 1 | id != lag(id) | step != lag(step) + 1) %>%
  mutate(path = cumsum(new_path)) %>%
  select(path, lat, lon)

I increment path every time I start a new run, unpause a previous run, or re-enter the area defined by coords and tol. I use path as a grouping variable so that ggplot2::ggplot knows to draw each path separately. I then use the alpha argument of ggplot2::geom_path to create a “heat map” of paths I run most often:

p = stanford_paths %>%
  ggplot(aes(lon, lat, group = path)) +
  geom_path(alpha = 0.1)

plot_nicely(p)

Counting efforts

best_efforts stores my fastest times running a range of distances (that Strava calls “efforts”) within each activity:

head(best_efforts)

## # A tibble: 6 × 4
##           id effort   start_index end_index
##        <dbl> <chr>          <int>     <int>
## 1 1253004287 1 mile            15       447
## 2 1253004287 1/2 mile          11       232
## 3 1253004287 1k                12       284
## 4 1253004287 2 mile            11       876
## 5 1253004287 400m              11       120
## 6 1253004287 5k                11      1342

The id column stores activity IDs and the effort column stores effort descriptions. I focus on 5k, 10k, and half marathon efforts:

focal_efforts = c('5k', '10k', 'Half-Marathon')

efforts = runs %>%
  left_join(best_efforts, by = 'id') %>%
  filter(effort %in% focal_efforts) %>%
  mutate(effort = factor(effort, focal_efforts)) %>%
  select(year, date, id, effort, start_index, end_index)

efforts inherits the year variable from runs. I use this variable to count efforts within each year. I then use tidyr::spread and knitr::kable to display these counts in a table:

library(tidyr)

efforts %>%
  count(Year = year, effort) %>%
  spread(effort, n, fill = 0) %>%
  kable(align = 'c')

Year	5k	10k	Half-Marathon
2018	64	24	0
2019	136	34	2
2020	191	88	21
2021	200	90	25
2022	131	85	9

Making training calendars

efforts also inherits the date variable from runs. I use this variable to create GitHub-esque training calendars. For example, here’s my running calendar for 2021:

p = efforts %>%
  filter(year == 2021) %>%
  group_by(date) %>%
  slice_max(effort) %>%
  distinct(effort) %>%  # I ran twice on some days
  mutate(Week = floor_date(date, 'weeks', week_start = 1),
         Weekday = wday(date, label = T, week_start = 1)) %>%
  ggplot(aes(Week, Weekday)) +
  geom_tile(aes(alpha = effort), col = 'white', linewidth = 0.5)

plot_nicely(p)

I use lubridate::floor_date to identify weeks and lubridate::wday to identify weekdays. The col and size arguments of ggplot2::geom_tile add space between tiles.

Tracking personal records

I combine runs, streams, and efforts to track my record running paces over time. I follow a three-step process:

First, I compute the mean pace for each effort. I do this using the start_index and end_index columns that efforts inherits from best_efforts. These columns tell me where each effort occurs in the corresponding activity’s stream:

effort_paces = streams %>%
  filter(id %in% runs$id) %>%
  # Create indices
  group_by(id) %>%
  mutate(index = row_number()) %>%
  ungroup() %>%
  # Extract stream segment for each effort
  inner_join(efforts, by = 'id') %>%
  filter(index >= start_index & index <= end_index) %>%
  # Compute mean paces
  group_by(id, date, effort) %>%
  summarise(distance = max(distance) - min(distance),
            time = max(time) - min(time)) %>%
  ungroup() %>%
  mutate(pace = (time / 60) / (distance / 1e3))

head(effort_paces)

## # A tibble: 6 × 6
##           id date       effort distance  time  pace
##        <dbl> <date>     <fct>     <dbl> <dbl> <dbl>
## 1 1335437333 2018-01-01 5k        5002.  1442  4.81
## 2 1338123783 2018-01-03 5k        5000.  1605  5.35
## 3 1344338907 2018-01-07 5k        5000.  1455  4.85
## 4 1347622521 2018-01-09 5k        5000   1493  4.98
## 5 1353889714 2018-01-13 5k        5001.  1622  5.41
## 6 1353889714 2018-01-13 10k      10001.  3380  5.63

The values in the distance column differ slightly from the descriptions in the effort column. This is because the stream segment doesn’t always cover the described distance exactly. But the multiplicative errors in distance and time should be equal on average, making pace is an unbiased estimate of my true mean pace. I measure this pace in minutes per kilometer.

Next, I extract my record paces by deleting efforts slower than my previous best:

record_paces = effort_paces %>%
  group_by(effort) %>%
  arrange(date) %>%
  filter(pace == cummin(pace)) %>%
  ungroup()

Finally, I “fill in the gaps” by adding days on which I don’t set a new record. I do this using tidyr::crossing and tidyr::fill:

date_range = seq(date('2018-01-01'), date('2022-12-31'), by = 'day')

record_paces_filled = crossing(date = date_range, effort = focal_efforts) %>%
  left_join(record_paces) %>%
  group_by(effort) %>%
  fill(pace) %>%
  filter(!is.na(pace))

record_paces and record_paces_filled differ in that the latter includes date-effort pairs with no new records. This makes record_paces_filled produce horizontal lines when I plot its data:

p = record_paces_filled %>%
  ggplot(aes(date, pace, group = effort)) +
  geom_line()

plot_nicely(p)