`stravadata` now provides data on activity streams, which are “the raw measurement sequences that define” Strava activities (source). For example, here are some of the streams from when I ran Round the Bays in February:
```r
library(dplyr)
library(stravadata)

rtb_streams <- activities %>%
  filter(name == 'Round the Bays' & grepl('2020', start_time)) %>%
  select(id) %>%
  left_join(streams) %>%
  select(distance, time, speed, hr, cadence)

rtb_streams
```
```
## # A tibble: 1,543 x 5
##    distance  time speed    hr cadence
##       <dbl> <dbl> <dbl> <dbl>   <dbl>
##  1      0       0   0     110       0
##  2      1.7     1   1.7   110       0
##  3     12.5     6   2.1   109      76
##  4     22.2    10   2.3   113      75
##  5     27.3    12   2.5   112      77
##  6     30.1    14   2.2   117      76
##  7     33.2    15   2.2   119      76
##  8     39.4    17   2.4   123      76
##  9     42.9    18   2.6   126      76
## 10     53.4    21   3.4   130      79
## # … with 1,533 more rows
```
The `distance` and `time` columns report cumulative distance travelled and time elapsed, while the other three columns provide (unevenly spaced) time series of performance indicators.
I smooth these series by computing means over rolling 10-observation windows, and plot the smoothed series together in the chart below.
The chart shows the gradual increase in my speed and cadence—and, consequently, heart rate—throughout the race, and my final sprint near the finish line.
```r
library(ggplot2)
library(tidyr)

rtb_streams %>%
  mutate_at(c('speed', 'hr', 'cadence'), zoo::rollmean, 10, na.pad = T) %>%
  filter(row_number() %% 10 == 0) %>%  # Prevent over-plotting
  select(distance,
         `Speed (m/s)` = speed,
         `Heart rate (bpm)` = hr,
         `Cadence (rpm)` = cadence) %>%
  gather(key, value, -distance) %>%
  ggplot(aes(distance / 1e3, value)) +
  geom_line() +
  facet_wrap(~forcats::fct_rev(key), scales = 'free', ncol = 1) +
  labs(x = 'Cumulative distance (km)',
       y = NULL,
       title = 'Performance indicators from Round the Bays 2020',
       subtitle = 'Means over rolling 10-observation windows')
```
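The smoothing step relies on `zoo::rollmean` with `na.pad = T`, which pads the ends of each series with `NA`s so the smoothed series keeps its original length. For odd window widths, the same idea can be sketched in base R (`roll_mean` is a hypothetical helper, not part of the pipeline above):

```r
# A base-R sketch of a centred rolling mean that, like
# zoo::rollmean(x, k, na.pad = TRUE), returns a series of the same
# length as its input, padded with NAs at the ends.
roll_mean <- function(x, k) {
  as.numeric(stats::filter(x, rep(1 / k, k), sides = 2))
}

roll_mean(c(1, 2, 3, 4, 5), 3)  # NA 2 3 4 NA
```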
As another example, on Tuesday I ran from the Wellington CBD to the Brooklyn wind turbine and back.
That run included 447 metres of total elevation gain.
`streams` disaggregates this total into a sequence of `altitude` values, allowing me to plot the activity’s elevation profile:
```r
activities %>%
  filter(grepl('2020-04-07', start_time)) %>%
  select(id) %>%
  left_join(streams) %>%
  mutate(altitude = zoo::rollmean(altitude, 10, na.pad = T)) %>%
  filter(row_number() %% 10 == 0) %>%
  ggplot(aes(distance / 1e3, altitude)) +
  geom_line() +
  labs(x = 'Cumulative distance (km)',
       y = 'Altitude (m)',
       title = 'Tuesday\'s climb to the Brooklyn wind turbine',
       subtitle = 'Mean altitudes over rolling 10-observation windows')
```
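As a sketch of how an activity-level total like elevation gain relates to the underlying stream, summing the positive increments of an altitude sequence gives a simple estimate of total gain. (`elevation_gain` is a hypothetical helper applied to toy values; Strava presumably smooths and thresholds the altitude stream, so this would not exactly reproduce the 447 metres above.)

```r
# Sketch: estimate total elevation gain by summing the positive
# altitude increments. Toy altitude values, not the real stream.
elevation_gain <- function(altitude) {
  d <- diff(altitude)
  sum(d[d > 0])
}

elevation_gain(c(10, 15, 12, 20))  # 13: climbs of 5 m and 8 m
```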
Having disaggregated stream data makes it possible to determine how Strava computes aggregate activity-level features.
For example, suppose I want to reconstruct the `mean_hr` column of `activities` using the `hr` column of `streams`. The naive approach is to group `streams` by `id` and compute within-activity means. However, this approach may generate biased estimates of `mean_hr` because the observations in `streams` are unevenly spaced with respect to time.
For example, if I stop to recover at the end of a short sprint or uphill climb then my measured heart rate spike (which, in my experience, tends to lag the corresponding effort spike) will be concentrated within a single observation, and subsequent observations will have relatively low heart rates.
This will bias naive estimates downwards.
I can correct for this potential bias in the naive estimator by weighting observations by how much they increase my total time. Similarly, Strava may estimate mean heart rates based on moving time only, which I can replicate by weighting observations in `streams` by the boolean values in its `moving` column when computing within-activity means.
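To see why the weighting matters, consider a hypothetical five-observation stream (invented values, not drawn from `streams`) in which observations arrive every second while I'm moving but a single elevated-heart-rate observation spans a 30-second recovery stop:

```r
# Hypothetical heart-rate stream: one-second gaps while moving, then a
# single observation covering a 30-second stop with still-elevated HR.
hr   <- c(130, 132, 135, 170, 150)
time <- c(0, 1, 2, 3, 33)   # cumulative seconds
dt   <- c(NA, diff(time))   # seconds each observation adds

est_naive <- mean(hr)  # 143.4
est_total <- sum(hr * dt, na.rm = TRUE) / sum(dt, na.rm = TRUE)  # about 149.6

est_naive < est_total  # TRUE: the naive estimate is biased downwards
```

Weighting by `dt` gives the long, elevated-heart-rate observation the weight its duration deserves; the moving-time variant additionally multiplies `dt` by the `moving` indicator.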
I compute the naive `mean_hr` estimates, and the estimates based on total and moving times, as follows:
```r
hr_estimates <- streams %>%
  group_by(id) %>%
  mutate(dt = time - lag(time)) %>%
  filter(cumsum(moving) > 0) %>%  # Ignore idle starts
  summarise(est_naive = mean(hr),
            est_total = sum(hr * dt) / sum(dt),
            est_moving = sum(hr * moving * dt) / sum(moving * dt)) %>%
  ungroup() %>%
  mutate_all(round, 1)  # Strava reports mean HRs to 1dp
```
Finally, I compute the frequency with which each type of estimate equals, almost equals, and falls below the corresponding activity’s `mean_hr` value in `activities`:
```r
hr_estimates %>%
  drop_na() %>%  # Ignore activities with no HR data
  left_join(activities) %>%
  select(mean_hr, starts_with('est')) %>%
  gather(type, value, -mean_hr) %>%
  group_by(type) %>%
  summarise(equal = mean(value == mean_hr),
            almost = mean(abs(value - mean_hr) < 1),
            below = mean(value < mean_hr)) %>%
  mutate_if(is.double, round, 3)
```
```
## # A tibble: 3 x 4
##   type       equal almost below
##   <chr>      <dbl>  <dbl> <dbl>
## 1 est_moving 1      1     0    
## 2 est_naive  0.023  0.425 0.924
## 3 est_total  0.548  0.856 0.384
```
Computing mean heart rates based on moving time recovers all 341 non-missing `mean_hr` values in my copy of `activities`.
In contrast, computing mean heart rates based on total time recovers only 55% of these values.
The total-time estimates get within one beat per minute of the “true” value about twice as often as the naive estimates, which appear to be biased downwards (possibly due to the phenomenon described above).