vignettes/report3.Rmd
This document reports on work done during the third quarter of the SaferActive project. To put this work in context, the report comes around halfway through the overall project timeline, which can be defined in terms of the following quarterly milestones:
We have adapted the project since its inception in the pre-pandemic world, broadening the definition of ‘traffic calming measures’ to include ‘Low Traffic Neighbourhood’ (LTN) interventions. This, together with unexpected issues around the analysis of cycle counter datasets from DfT and TfL, has led to more time spent on the analysis phase and less work on the web app. In terms of timelines, the project has been extended by 4 months, meaning that milestones 4 and 5 above will be delivered by the end of July and October 2021, respectively.
It is now almost a year since the coronavirus pandemic caused major changes to social, economic and transport systems. Although the future course of the pandemic, let alone its long-term impacts, is still uncertain (despite the planned roll-out of vaccination programmes during 2021), several clear trends with consequences for road safety in relation to walking and cycling look highly likely to continue for at least the next 12 months:
In the policy realm, the £250m Active Travel Fund has now been fully allocated and is being spent in local authorities nationwide, with particular emphasis on LTNs and new walking and cycling improvements. Academic research on historic LTNs suggests that they can lead to substantial increases in walking and cycling, with reported walking and cycling levels increasing by 29% and 51% in Waltham Forest between 2016 and 2019, and overall risk levels for these modes falling three- to four-fold (Goodman and Aldred, n.d.). Such improvements in road safety for active modes could, if repeated across multiple local authorities nationwide, herald the start of a step change in road safety for walking and cycling. Increases in active travel levels according to localised count data should not be over-interpreted, however: according to the DfT’s All Change? report tracking behaviour changes in response to the pandemic, 73% and 28% of respondents to a major survey reported no walking or cycling within the past week in June/July 2020, respectively. Furthermore, data published by Sport England in February 2021 suggests that, despite the perception of a boom in physical activity fuelled by sights of parks full of people exercising, the physical inactivity epidemic actually got worse during the pandemic. These findings emphasise the need for policies to enable substantial uptake of walking and cycling as physical distancing measures gradually ease.
This report sets out to explore the evidence for changes at the city level across London and nationwide.
The aim of this section is to provide estimates of casualty risk for walking and cycling at high geographic resolution and over time. This is not an easy task: evidence on cycling and particularly walking levels is patchy, while crash datasets are intermittent and sporadic. This section outlines the approach we took to help answer the question, which consists of the following main stages, each of which supports the aim of quantifying risk over space and time in established measures, primarily the number of people killed and seriously injured per billion km (KSI/bkm):
Grouping crashes together at a small spatial scale helps to identify crash hot spots, areas where crashes happen unusually often. To this end, crash data for the last 10 years were ‘rasterised’ (aggregated to rectangular chunks, with crash counts per pixel) to a 500m grid. In each raster, a single pixel covers a 500m x 500m area and contains a single value, for example the number of people killed in that area in 2018. It is also possible to create a stack of multiple raster layers to represent a changing variable, such as time.
We created twelve raster stacks, each covering the ten years of available STATS19 data (2010 to 2019), a dataset with known under-reporting of slight injuries. Each raster stack was created for the whole of Great Britain (results shown below for London for all modes).
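The rasterisation step can be illustrated with a minimal base R sketch. The crash coordinates, the column names and the use of `table()` in place of a dedicated raster package are all assumptions made for illustration, not the project's actual implementation.

```r
# Hypothetical crash records: projected coordinates in metres, plus year
crashes <- data.frame(
  easting  = c(530120, 530480, 530510, 531020, 530130),
  northing = c(180250, 180490, 180020, 180760, 180300),
  year     = c(2018, 2018, 2018, 2019, 2019)
)

cell_size <- 500  # grid resolution in metres

# Snap each crash to the south-west corner of its 500 m x 500 m cell
crashes$cell_x <- floor(crashes$easting / cell_size) * cell_size
crashes$cell_y <- floor(crashes$northing / cell_size) * cell_size

# Count crashes per cell and year: each year is one 'layer' of the stack
crash_stack <- table(
  cell_x = crashes$cell_x,
  cell_y = crashes$cell_y,
  year   = crashes$year
)

# Crashes in the cell with south-west corner (530000, 180000) in 2018
crash_stack["530000", "180000", "2018"]
```

Holding the counts in a three-dimensional array keeps one layer per year, which mirrors the idea of a raster stack with time as the changing variable.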
Cyclists killed or seriously injured during commuting hours
Cyclists killed or seriously injured during any time of day
Cyclists with any injury during commuting hours
Cyclists with any injury during any time of day
Pedestrians killed or seriously injured during commuting hours
Pedestrians killed or seriously injured during any time of day
Pedestrians with any injury during commuting hours
Pedestrians with any injury during any time of day
All people killed or seriously injured during commuting hours
All people killed or seriously injured during any time of day
All people with any injury during commuting hours
All people with any injury during any time of day
Commuting hours were defined as 7-10 AM and 4-7 PM, Monday to Friday. Crashes during these hours can be compared with estimates of cycling levels from the PCT’s travel to work layer.
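A minimal sketch of this commuting-hours filter is shown below. The timestamps are made up, and treating the upper bounds (10:00 and 19:00) as exclusive is an assumption, since the text does not say how boundary times are handled.

```r
# Hypothetical crash timestamps (UTC used only to keep the example portable)
crash_time <- as.POSIXct(
  c("2019-03-04 08:15",   # Monday morning peak
    "2019-03-04 12:30",   # Monday midday
    "2019-03-09 08:15",   # Saturday morning
    "2019-03-06 18:45"),  # Wednesday evening peak
  tz = "UTC"
)

lt   <- as.POSIXlt(crash_time)
hour <- lt$hour
wday <- lt$wday  # 0 = Sunday, 1 = Monday, ..., 6 = Saturday

is_weekday <- wday %in% 1:5
# Assumption: 07:00-09:59 and 16:00-18:59 count as commuting hours
is_peak    <- (hour >= 7 & hour < 10) | (hour >= 16 & hour < 19)
is_commute <- is_weekday & is_peak
```

Only the first and last records are flagged: the midday crash falls outside the peak windows and the Saturday crash is excluded by the weekday condition.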
To estimate the spatial distribution of cycling activity, measured in km per year, we started with data from the Propensity to Cycle Tool (PCT). The Route Network layer in the PCT provides an indication of the number of commuter cyclists on different parts of the network, based on 2011 Census data reporting the number of people travelling between small (LSOA, average population ~2000) zones by mode of travel. The route network in the PCT represents the ‘fastest’ routes for cycling between these LSOA zones and is therefore not designed to represent where cycling actually takes place (there may be some links on the route network where people prefer a slower but safer route).
Converting the spatial network representation of cycling to a spatial grid has the advantage of ‘smoothing’ out cycling activity across cities and allowing cycling levels to be presented in a form that is directly comparable with the crash data, which is also assigned to the spatial grid, as described in Section 2.1. The three main steps were:
line_breakup() in the stplanr package for the purpose.
After realising that this implementation would take several hours to complete on the national route network, we opted for a more computationally efficient approach that used R’s interface to GIS algorithms in the qgisprocess package.
nts-national-distance-cycled-year.R for details).
The results in rural (Hereford) and urban (London) areas are shown in Figure 2.6.
In the first SaferActive report (Section 8) we reported estimated KSI/bkm at the local authority level across London, highlighting the fact that although more crashes happen in the city centre, the risk per km cycled is higher in the outskirts of the city.
The additional datasets generated using methods outlined in the previous sections allow us to generate these estimates at higher levels of geographic resolution, as illustrated in Figure 2.7.
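The rate calculation behind such estimates can be sketched as follows. The cell identifiers and all numbers are hypothetical; the only fixed element is the unit, KSI per billion km cycled.

```r
# Hypothetical grid cells with casualty counts and estimated exposure
cells <- data.frame(
  cell_id   = c("A", "B", "C"),
  ksi       = c(12, 3, 0),          # cyclist KSI casualties over the period
  km_cycled = c(4.0e6, 2.5e5, 0)    # estimated km cycled over the same period
)

# KSI per billion km; cells with no estimated cycling get NA, not Inf
cells$ksi_per_bkm <- ifelse(
  cells$km_cycled > 0,
  cells$ksi / (cells$km_cycled / 1e9),
  NA_real_
)
```

Cell B has a quarter of the casualties of cell A but a rate four times higher, because its estimated exposure is sixteen times lower. This is the pattern that makes low-exposure cells both interesting and statistically fragile.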
It is clear from the results that outer London is a more dangerous place to cycle than inner London. However, places where cycling levels are low will tend to have more variable, and therefore less certain, collision rates, which means that an apparently high rate in a given year could be an artefact of the high variability in collision rates per km cycled. There are at least three ways of dealing with this issue:
A second potential source of error is that, even in peak hours, many cycle journeys will be for non-commute purposes. The proportion of journeys that are for the purpose of travel to work will vary from one area to another. Since we are only counting journeys to work, our estimated collision rates per km cycled will be higher than they should be in areas where fewer cycle journeys are for commuting purposes.
As described in Section 2.3, our primary data source for estimating levels of cycling activity across space is the Propensity to Cycle Tool. This takes 2011 Census data on method of travel to work, and routes these journeys using a CycleStreets routing algorithm, allowing us to estimate cycle potential on each link of the road network. However, this does not give us any information on changes in cycling uptake through time. To estimate changes through time, we must investigate data from manual cycle counters.
We have used two key data sources to model this spatio-temporal change in cycling uptake. The first is DfT count data. This is open data available from the Department for Transport website https://roadtraffic.dft.gov.uk/downloads. While the raw data is available for the years 2000-2019, we have focused on the decade 2010-2019, since there is more consistency in the type of roads being surveyed over this time period.
The second data source is Transport for London cycle counts, collected as part of three data collection programmes, for Central London, Inner London and Outer London https://cycling.data.tfl.gov.uk/. There are other TfL count programmes focusing on specific interventions such as the cycle superhighways and Mini-Holland schemes, but we have not used these to keep our results representative of London as a whole. The Central London counts are available from 2014, but the other counts are only available from 2015 to 2019, so we have used this time period.
To make the best possible use of the available data, we have developed two models. For change in cycling uptake over the years 2010-2015, we have a model based solely on the DfT counts, as these are the only ones available during this time period. For the years 2015-2019, our model combines both the DfT and TfL counts. To assess change across the full decade, we combine these two models, computing changes since 2015 on top of the changes seen up to this point.
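One way to combine the two periods is to rebase each model's output to the shared year 2015 and multiply the changes together. The sketch below assumes this multiplicative chain-linking and uses invented index values; the text does not specify the exact arithmetic used.

```r
# Hypothetical relative cycling levels from each model (arbitrary base)
idx_a <- c(`2010` = 0.80, `2011` = 0.90, `2012` = 0.95,
           `2013` = 1.00, `2014` = 1.10, `2015` = 1.25)   # 2010-2015 model
idx_b <- c(`2015` = 0.90, `2016` = 0.95, `2017` = 1.00,
           `2018` = 1.05, `2019` = 1.10)                   # 2015-2019 model

# Express each series relative to its first year
rel_2010 <- idx_a / unname(idx_a["2010"])   # change since 2010, up to 2015
rel_2015 <- idx_b / unname(idx_b["2015"])   # change since 2015

# Chain: apply post-2015 change on top of the 2010-2015 change
chained <- c(rel_2010, unname(rel_2010["2015"]) * rel_2015[-1])
```

The result is a single series for 2010-2019 expressed relative to 2010, with 2015 acting as the link year that appears in both models.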
Rather than using the raw cycle counts, i.e. number of cyclists passing a given point per hour, we assess relative change in these cycle counts. The number of cyclists varies greatly from one road to another, and the same count points are not always used each year. Therefore the variable of interest is the change in cycle volumes at a given count point: \[ Ch_y = C_y / C_m \] where \(Ch_y\) is the change in cycle count for a given year, \(C_y\) is the cycle count for the year of interest, and \(C_m\) is the mean cycle count at that count point across all years within the model period that the count point is in use.
To avoid the impacts of high relative change in cycle flows at count points with very low absolute numbers of cyclists, we excluded all data from locations with a mean cycle flow < 5, or with zero cyclists in any single year. We also excluded locations that did not have at least two years of data within the relevant period.
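The relative change measure and the exclusion rules can be sketched in base R as follows. The site labels and counts are hypothetical, and the column names are assumptions.

```r
# Hypothetical annual peak-hour counts per count point
counts <- data.frame(
  site  = c("a", "a", "a", "b", "b", "c", "c", "d"),
  year  = c(2015, 2016, 2017, 2015, 2016, 2015, 2016, 2015),
  count = c(40, 50, 60, 2, 4, 10, 0, 30)
)

# Per-site summaries, repeated on every row of that site
site_mean  <- ave(counts$count, counts$site, FUN = mean)    # C_m
site_min   <- ave(counts$count, counts$site, FUN = min)
site_years <- ave(counts$count, counts$site, FUN = length)

# Exclusions: mean flow below 5, any zero count, fewer than two years of data
keep  <- site_mean >= 5 & site_min > 0 & site_years >= 2
clean <- counts[keep, ]

# Ch_y = C_y / C_m
clean$change <- clean$count / site_mean[keep]
```

In this toy dataset only site `a` survives: `b` has a mean flow below 5, `c` has a zero count, and `d` has a single year of data.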
Since our primary spatial model of cycling potential is derived from Census 2011 data accessed via the Propensity to Cycle Tool, we had to adapt the count data accordingly. As part of the data cleaning process, we aimed to follow Propensity to Cycle Tool methodologies closely. All of these adjustments were made on the raw count data, prior to calculating change in cycle volumes. Firstly, we combined bi-directional data to get a single measure of cycle volume per site, similar to the PCT, which combines journeys in both directions between a common origin and destination. Secondly, we used peak hour flows only. Our spatial model is based solely on travel to work, so temporal changes to this model should reflect changes in commuter volumes, not changes in whole-day cycle volumes. The TfL counts run from 06:00 to 22:00, while the DfT counts run from 07:00 to 19:00, but we selected only the counts from peak commuter hours, defined by TfL as being 07:00 - 10:00 and 16:00 - 19:00.
Further data cleaning was conducted for the seasonal adjustment of cycle flows and the generation of single annual estimates for each count point. In the TfL Central London dataset, four survey waves are conducted per year, corresponding with the standard quarterly periods. The final wave for which data is available is 2019 Q3. By contrast, in the TfL Inner and Outer London counts and the DfT counts, each location is surveyed at most once per year.
Cycling uptake varies across the year due to factors such as weather conditions. TfL has produced seasonal adjustment factors for each of the four quarterly periods, to account for this. We have used these adjustment factors to control for seasonality, and to calculate a mean annual count for each Central London site. We wanted a single annual figure to avoid giving these Central London locations four times the weighting of the other counts when generating the GAMs, as described below. Meanwhile, the Inner and Outer London TfL counts are designed to represent the second quarterly period (April to June), so we used the Q2 adjustment factor to normalise these counts. For the DfT counts, we normalised the raw values based on the adjustment factor for whichever quarterly period the count date fell within.
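The seasonal adjustment logic can be sketched as below. The adjustment factors here are invented placeholders, not TfL's published values, and applying them as multipliers is an assumption.

```r
# Placeholder quarterly adjustment factors (NOT the real TfL values)
adj <- c(Q1 = 1.15, Q2 = 0.95, Q3 = 0.90, Q4 = 1.05)

# Hypothetical counts: site "x" surveyed in four waves, site "y" once
counts <- data.frame(
  site  = c("x", "x", "x", "x", "y"),
  date  = as.Date(c("2018-02-10", "2018-05-12", "2018-08-14",
                    "2018-11-16", "2018-06-05")),
  count = c(100, 130, 140, 110, 80)
)

# Look up the factor for the quarter each count date falls within
counts$quarter  <- quarters(counts$date)              # "Q1" ... "Q4"
counts$adjusted <- counts$count * unname(adj[counts$quarter])

# One seasonally adjusted annual figure per site
annual <- aggregate(adjusted ~ site, data = counts, FUN = mean)
```

Averaging the four adjusted waves gives site `x` a single annual value, so it carries the same weight in later modelling as site `y`, which was surveyed once.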
Mean change in cycle flows over the years 2010-2015 is shown for all DfT count points in London in Figure 2.8. This shows a steady increase in cycling uptake over the length of the period.
We found a strong correspondence between cycling levels inferred from the cycle count data and from the National Travel Survey at the national level, as illustrated in Figure 2.9.
In Figure 2.10, we show mean change in cycle counts over the years 2015-2019, for both DfT and TfL count points. We can see that the trends are quite different in TfL and DfT count points. The TfL counts rise throughout this period, while the DfT counts peak in 2016 and fall thereafter.
To estimate spatio-temporal changes in cycling uptake over the decade 2010-2019, we have created two generalised additive models (GAMs) using the mgcv R package.
These use the count data to generate smoothed estimates of change in cycling levels in London across this period.
The models use splines to represent the partial effects of time and space on cycling uptake.
The two models follow the same structure, but the first is for the years 2010-2015, using DfT data only, while the second is for the years 2015-2019 and uses both DfT and TfL data.
The response variable in these models is \(Ch_y\) (change in cycle flows). However, total number of cycles (across all years within the model period) per count point is used as a weighting factor. This avoids giving undue influence to locations where the relative change in cycle flows may be high but the absolute number of cycles is low. It also reduces the influence of count locations that are sampled in some years only, compared to those sampled every year.
The error structure in both models follows a Scaled-t distribution.
This is appropriate for continuous response variables which are heavy-tailed.
Temporal change is modelled as a cubic regression spline for the term year.
We used a low number of knots (4 knots for 2010-2015 and 3 knots for 2015-2019) to prevent overfitting.
Cubic regression splines are appropriate for variables with relatively few knots, spread evenly across the extent of the parameter values.
Space (eastings and northings) is modelled using a Duchon spline, which works well with two-dimensional parameters.
We used 100 knots, to allow more complex spatial patterns to be represented.
An interaction term (using the same spline types and numbers of knots) is used for the interaction between time and space.
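The model structure described above could be written with `mgcv` roughly as follows. This is a sketch on synthetic data, not the project's model code: the use of `ti()` for the interaction, the REML fitting method, the reduced basis sizes (chosen to suit the small synthetic dataset rather than the 100 spatial knots reported) and the coordinates in km are all assumptions.

```r
library(mgcv)

set.seed(1)
# Synthetic stand-in for cleaned count data: 120 hypothetical sites
sites <- data.frame(
  site     = 1:120,
  easting  = runif(120, 505, 555),   # km, to keep the toy example well scaled
  northing = runif(120, 160, 200),   # km
  total    = rpois(120, 500) + 1     # total cycles counted per site
)
dat <- merge(sites, data.frame(year = 2010:2015))  # every site in every year

# Heavy-tailed noise around a gentle upward trend in relative change
dat$change <- 1 + 0.04 * (dat$year - 2012.5) + 0.05 * rt(nrow(dat), df = 4)
dat$w <- dat$total / mean(dat$total)   # weights scaled to mean 1

fit <- gam(
  change ~ s(year, bs = "cr", k = 4) +                 # cubic regression spline
    s(easting, northing, bs = "ds", k = 30) +          # Duchon spline in space
    ti(year, easting, northing, d = c(1, 2),
       bs = c("cr", "ds"), k = c(4, 10)),              # space-time interaction
  family = scat(),                                     # scaled t errors
  weights = w, data = dat, method = "REML"
)
```

Calling `summary(fit)` or `plot(fit, pages = 1)` would then show the estimated smooth terms; the partial effect of `year` corresponds to the first term in the formula.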
The partial effects of year in these models are shown in Figure 2.11. These closely resemble the trends observed in the raw data, in Figures 2.8 and 2.10.
These graphs show the mean effect of year, but this will vary spatially due to the presence of the interaction terms in the GAMs.
We use the models to predict annual changes in cycling uptake over a 500m grid covering the whole of London. This creates a smoothed surface that matches the resolution of the spatial grids we have already generated for collision and census-derived cycling data. We can therefore obtain estimates of change in cycling uptake at the same spatial resolution as the collision data.
By assigning each grid point to a London Borough, we can then investigate changes in cycling uptake at the Borough level. These can be seen in Figure 2.12 (figure shows unweighted values).
We have verified the GAM predictions against the TfL count data, to ensure we are representing the true variability of the data. This was done using the coefficient of determination (R squared). The resulting R squared of 0.194 shows the extent to which the model is able to predict counts across London in the years 2015-2019 (see Figure 2.13).
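For reference, one common definition of the coefficient of determination is shown in the sketch below, using invented observed and predicted values. Whether the reported figure was computed this way or as a squared correlation is not stated, so this definition is an assumption.

```r
# Hypothetical observed and model-predicted relative changes
observed  <- c(0.8, 1.1, 0.9, 1.3, 1.0)
predicted <- c(0.9, 1.0, 1.0, 1.1, 1.0)

ss_res <- sum((observed - predicted)^2)        # residual sum of squares
ss_tot <- sum((observed - mean(observed))^2)   # total sum of squares

r_squared <- 1 - ss_res / ss_tot
```

Under this definition, R squared is the share of the variance in the observed values that the predictions account for.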
Further validation work should include sensitivity analysis involving the exclusion of portions of the dataset, and the prediction of these excluded portions. We also need to obtain confidence intervals to constrain the cycling uptake predictions.
We can also investigate how correlations change with greater temporal and spatial aggregation of data points. This may include aggregation to the Borough level, and the grouping of neighbouring Boroughs.
There is OpenStreetMap (OSM) data for the years 2014-2019, providing insight on the date of installation of traffic calming measures. Using two time points (e.g. 2016 and 2018), we have found ways to investigate differences before and after interventions are installed.
Counts from cycle superhighways and quietways are available from TfL, typically including 1 year of baseline count data from before installation, and around 4 years of post-installation count data. We have found a reliable way to access data on change in cycle provision, based on a broad definition of cycle infrastructure from the ohsome project.
Counts from Mini-Holland schemes are available from TfL, typically including 1 year of baseline count data from before installation, and around 4 years of post-installation count data. We can speak to TfL to get better insight on the exact timings of implementation of these schemes.
We continue to develop approaches to visually analysing road crash data and have divided our efforts here into two areas: exploring and quantifying risk.
Road crash data are spatially and temporally precise but also attribute-rich. The STATS19 dataset, for example, has many variables describing numerous aspects associated with road conditions and vehicle types. Exploratory visual analysis interfaces provide a mechanism for quickly investigating these.
As a quick way of illustrating this, below are bar charts displaying crash frequencies for all and active casualties by road speed classification. Crashes resulting in a KSI (fatal or serious injury) are dark red, those resulting in a slight injury light red. It is not surprising that 30mph roads are the modal category of road for all crashes, but especially those involving active travel modes. This is likely to be due to the relatively high proportion of active travel (and hence exposure) on residential and quiet roads that tend to have 30mph limits.
A more interesting question is how relative injury severity varies by road type and vehicle mode. To quickly investigate this we generate mosaic plots, a sort of visual contingency table where, in this case, plot width varies according to the absolute number of crashes and height according to the relative number of crashes resulting in a KSI (i.e. the relative height of the dark red bars). Whilst the pattern implied by this layout could again be anticipated, it does demonstrate the obvious effect of increasing road speed limits on injury severity (quantified as KSI rates), for crashes involving pedestrians and cyclists.
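The two quantities encoded in such a mosaic plot can be computed from a contingency table, as in this sketch. The crash records are invented and the column names are assumptions.

```r
# Hypothetical crash records with speed limit and severity
crashes <- data.frame(
  speed_limit = c("20", "30", "30", "30", "30", "60", "60", "20", "30", "60"),
  severity    = c("Slight", "Slight", "KSI", "Slight", "Slight",
                  "KSI", "KSI", "Slight", "KSI", "Slight")
)

tab <- table(crashes$speed_limit, crashes$severity)

# Column width: each speed limit's share of all crashes
col_width <- prop.table(rowSums(tab))

# Dark red height: share of crashes within each speed limit that are KSI
ksi_share <- prop.table(tab, margin = 1)[, "KSI"]
```

Base R's `mosaicplot(tab)` draws a plot from the same table, so other variables such as lighting or surface condition can be substituted by changing only the columns passed to `table()`.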
Once the code template for these plots is created, it is very easy to substitute in other variables that we think may be discriminating – lighting and surface condition (wetness). Here again we see the strong effect on relative injury severity.
Here the same plots are shown by road class and by whether or not the crash involved some form of junction. Note that the plot size is kept consistent here by road class – casualties involving active modes occurring on motorways will have a very low base.
We do not advocate using these plots for decision-making, for example prioritising lighting and surface drainage over other candidate interventions, but instead as a low-cost mechanism for initially proposing factors that might then be incorporated into a more formal data analysis.
A benefit of using mosaic plots for presenting categorical data such as this is that they are space-filling. For example, we might extend this analysis to profile crashes by travel mode. In the plots below we focus on London. That cars are the modal vehicle category is to be expected, but it is again useful to emphasise the fact that relative injury severity increases with motorbikes and bikes and also that the vehicle-crash mix varies according to geographic context.
We could further update these plots to generate a ‘who-hit-who’ type representation. Here, crashes are organised according to the largest vehicle that was involved – we call this the dominant or aggressor vehicle. We then identify the casualty types by travel mode attached to each of these crashes. So in the plot immediately below, if a crash involved an HGV and a car, that crash would appear in the fourth row of the plot (HGV is the aggressor) and we colour according to the travel mode of the casualties involved. Mosaic plots are necessary here as it would otherwise be very difficult to see the vehicle-casualty mix for dominant/aggressor vehicles that appear less frequently. Again we present these by borough and with an approximate spatial arrangement.
Inevitably, when analysing road crash data, we wish to make inferences about casualty rates through comparison over time and between geographic areas. Comparing rates across areas (and time) is difficult, partly for idiosyncratic reasons relating to the way in which road crash data are collected and the choice of denominator, and partly because of established problems in statistics, for example uncertainty due to sample size and multiple comparisons.
Key to this work is incorporating appropriate denominators for estimating exposure, discussed above. However, we are also implementing visual data analysis approaches to uncertainty representation. For illustrative purposes, below are maps of relative risk comparing the injury severity ratio for crashes in each Police Force area against that which would be expected given the national average. Where a Police Force has an injury severity ratio greater than the national average it is red; where it is less than the national average it is blue. The left-most map contains risk ratios derived from crash data between 2017 and 2019. The middle map demonstrates how these have shifted year-on-year, with no adjustment made for CRASH and COPA re-coding. In the right-most map we represent model uncertainty by drawing year-on-year lines that result from a bootstrap resample. Whilst representing a full empirical bootstrap distribution may be one means of representing uncertainty, we are currently investigating others, for example hypothetical outcome plots, which are increasingly used in journalism and public-facing domains.
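The idea of a bootstrap distribution for a risk ratio can be sketched as follows. The data are simulated, the injury severity ratio is taken here to be the share of crashes that are KSI, and the national rate is held fixed during resampling; all three are simplifying assumptions rather than a description of the maps' actual method.

```r
set.seed(42)

# Simulated crash-level severity flags (TRUE = KSI)
national_ksi <- rbinom(5000, 1, 0.20) == 1   # all crashes nationally
area_ksi     <- rbinom(200, 1, 0.26) == 1    # crashes in one Police Force area

national_rate <- mean(national_ksi)
risk_ratio    <- mean(area_ksi) / national_rate   # > 1 means more severe

# Bootstrap: resample the area's crashes with replacement, recompute the ratio
boot_ratio <- replicate(1000, {
  resample <- sample(area_ksi, replace = TRUE)
  mean(resample) / national_rate
})

# Percentile interval summarising the bootstrap distribution
interval <- quantile(boot_ratio, c(0.025, 0.975))
```

Each element of `boot_ratio` corresponds to one plausible alternative estimate, which is the quantity that drawing one line per resample makes visible on a map.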
The next steps on the project are to:
Barrero, Jose Maria, Nicholas Bloom, and Steven J. Davis. 2020. “Why Working from Home Will Stick.” SSRN Scholarly Paper ID 3741644. Rochester, NY: Social Science Research Network. https://doi.org/10.2139/ssrn.3741644.
Department for Transport. 2020. “Active Mode Appraisal Toolkit User Guide.” Department for Transport.
Goodman, Anna, and Rachel Aldred. n.d. “The Impact of Introducing a Low Traffic Neighbourhood on Road Traffic Injuries.”
We have a prototype of an interactive web application that is ready to add any of the outputs above. For instance, it currently shows a cycling level route network overlaid with points from STATS19 (the latter only for London at the moment). We are looking at various visualisation methods to present the outputs and allow users to use the tool to query the data. 5.1 shows what the “test deployment” looks like right now. We are also testing real-time querying of the entire STATS19 dataset back to 1979.