stravadata now provides data on activity streams, which are “the raw measurement sequences that define” Strava activities (source). For example, here are some of the streams from when I ran Round the Bays in February:

library(dplyr)
library(stravadata)

rtb_streams <- activities %>%
  filter(name == 'Round the Bays' & grepl('2020', start_time)) %>%
  select(id) %>%
  left_join(streams, by = 'id') %>%
  select(distance, time, speed, hr, cadence)

rtb_streams
## # A tibble: 1,543 x 5
##    distance  time speed    hr cadence
##       <dbl> <dbl> <dbl> <dbl>   <dbl>
##  1      0       0   0     110       0
##  2      1.7     1   1.7   110       0
##  3     12.5     6   2.1   109      76
##  4     22.2    10   2.3   113      75
##  5     27.3    12   2.5   112      77
##  6     30.1    14   2.2   117      76
##  7     33.2    15   2.2   119      76
##  8     39.4    17   2.4   123      76
##  9     42.9    18   2.6   126      76
## 10     53.4    21   3.4   130      79
## # … with 1,533 more rows
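
These five columns are only a subset of what streams offers; it also includes the altitude and moving sequences that I use below. A quick way to see everything available in my copy (output omitted here):

streams %>%
  glimpse()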

The distance and time columns report cumulative distance travelled and time elapsed, while the other three columns provide (unevenly spaced) time series of performance indicators. I smooth these series by computing means over rolling 10-observation windows, and plot the smoothed series together in the chart below. The chart shows the gradual increase in my speed and cadence—and, consequently, heart rate—throughout the race, and my final sprint near the finish line.

library(ggplot2)
library(tidyr)

rtb_streams %>%
  mutate_at(c('speed', 'hr', 'cadence'), zoo::rollmean, 10, fill = NA) %>%
  filter(row_number() %% 10 == 0) %>%  # Prevent over-plotting
  select(distance, `Speed (m/s)` = speed,
         `Heart rate (bpm)` = hr, `Cadence (rpm)` = cadence) %>%
  gather(key, value, -distance) %>%
  ggplot(aes(distance / 1e3, value)) +
  geom_line() +
  facet_wrap(~forcats::fct_rev(key), scales = 'free', ncol = 1) +
  labs(x = 'Cumulative distance (km)',
       y = NULL,
       title = 'Performance indicators from Round the Bays 2020',
       subtitle = 'Means over rolling 10-observation windows')

As another example, on Tuesday I ran from the Wellington CBD to the Brooklyn wind turbine and back. That run included 447 metres of total elevation gain. streams disaggregates this total into a sequence of altitude values, allowing me to plot the activity’s elevation profile:

activities %>%
  filter(grepl('2020-04-07', start_time)) %>%
  select(id) %>%
  left_join(streams, by = 'id') %>%
  mutate(altitude = zoo::rollmean(altitude, 10, fill = NA)) %>%
  filter(row_number() %% 10 == 0) %>%
  ggplot(aes(distance / 1e3, altitude)) +
  geom_line() +
  labs(x = 'Cumulative distance (km)',
       y = 'Altitude (m)',
       title = "Tuesday's climb to the Brooklyn wind turbine",
       subtitle = 'Mean altitudes over rolling 10-observation windows')
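
In principle, the activity-level total should be recoverable from the same stream. Here's a rough check (a sketch only: Strava's own calculation presumably smooths and thresholds the altitude series, so summing raw positive altitude changes will only approximate the reported 447 metres):

activities %>%
  filter(grepl('2020-04-07', start_time)) %>%
  select(id) %>%
  left_join(streams, by = 'id') %>%
  mutate(d_alt = altitude - lag(altitude)) %>%  # Altitude change between observations
  summarise(total_gain = sum(pmax(d_alt, 0), na.rm = TRUE))  # Count ascents only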

Having disaggregated stream data makes it possible to determine how Strava computes aggregate activity-level features. For example, suppose I want to reconstruct the mean_hr column of activities using the hr column of streams. The naive approach is to group streams by id and compute within-activity means. However, this approach may generate biased estimates of mean_hr because the observations in streams are unevenly spaced with respect to time. If, say, I stop to recover at the end of a short sprint or uphill climb, then my measured heart rate spike (which, in my experience, tends to lag the corresponding effort spike) will be concentrated within a single observation, and subsequent observations will have relatively low heart rates. This will bias naive estimates downwards.
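
To see the mechanism, here is a toy example with invented numbers: a 175 bpm spike reading covers a 30-second recording gap (say, while I stand still to recover), while the low readings around it arrive once per second. Weighting all observations equally understates the time-weighted mean:

toy <- tibble(
  time = c(0:4, 34, 35:44),  # Cumulative seconds; note the 30-second gap
  hr   = c(150, 155, 160, 165, 170, 175, rep(120, 10))
)

toy %>%
  mutate(dt = time - lag(time)) %>%  # Seconds covered by each observation
  filter(!is.na(dt)) %>%
  summarise(naive = mean(hr),  # 135 bpm
            time_weighted = sum(hr * dt) / sum(dt))  # ~161 bpm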

I can correct for this potential bias in the naive estimator by weighting observations by how much they increase my total time. Similarly, Strava may estimate mean heart rates based on moving time only, which I can replicate by weighting observations in streams by the boolean values in its moving column when computing within-activity means.

I compute the naive mean_hr estimates, and the estimates based on total and moving times, as follows:

hr_estimates <- streams %>%
  group_by(id) %>%
  mutate(dt = time - lag(time)) %>%
  filter(cumsum(moving) > 0) %>%  # Ignore idle starts
  summarise(est_naive = mean(hr),
            est_total = sum(hr * dt) / sum(dt),
            est_moving = sum(hr * moving * dt) / sum(moving * dt)) %>%
  ungroup() %>%
  mutate_at(vars(starts_with('est')), round, 1)  # Strava reports mean HRs to 1dp

Finally, I compute the frequency with which each type of estimate equals, almost equals, and falls below the corresponding activity’s mean_hr value in activities:

hr_estimates %>%
  drop_na() %>%  # Ignore activities with no HR data
  left_join(activities, by = 'id') %>%
  select(mean_hr, starts_with('est')) %>%
  gather(type, value, -mean_hr) %>%
  group_by(type) %>%
  summarise(equal = mean(value == mean_hr),
            almost = mean(abs(value - mean_hr) < 1),
            below = mean(value < mean_hr)) %>%
  mutate_if(is.double, round, 3)
## # A tibble: 3 x 4
##   type       equal almost below
##   <chr>      <dbl>  <dbl> <dbl>
## 1 est_moving 1      1     0    
## 2 est_naive  0.023  0.425 0.924
## 3 est_total  0.548  0.856 0.384

Computing mean heart rates based on moving time recovers all of the 341 non-missing mean_hr values in my copy of activities. In contrast, computing mean heart rates based on total time recovers only 55% of these values. Still, the total-time estimates get within one beat per minute of the “true” value about twice as often as the naive estimates (86% versus 43% of activities). The naive estimates fall below the “true” values 92% of the time, which suggests they are biased downwards, possibly due to the phenomenon described above.
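
If I wanted to chase down the remaining misses, a sketch along these lines would list the activities where the total-time estimates disagree with Strava's values:

hr_estimates %>%
  drop_na() %>%
  left_join(activities, by = 'id') %>%
  filter(est_total != mean_hr) %>%  # Activities the total-time estimate misses
  select(id, name, start_time, mean_hr, est_total)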