


Not surprising. Consider Cape May (or any smallish beach town with a bunch of bike traffic) in the summertime, and compare cycling there with the hellish conditions in the towns all the vacationers come from.

Jacobsen presented evidence that is consistent with the safety in numbers hypothesis. Given that you can generate the figures that Jacobsen uses through a pure random number generator, simply showing these graphs without a more sophisticated analysis is (unfortunately) poor proof of the hypothesis.


Simply based on a casual reading of the literature, I get conflicting views regarding safety in numbers. That is, there are several papers suggesting that safety in numbers for cyclists and pedestrians is still hotly debated. But I have had a few conversations with traffic engineers who suggested safety in numbers has been historically observed with automobiles; so why wouldn't it apply to peds/bikes? And some European papers suggest that safety in numbers is rather accepted.

Anyway, what is important is that we try to identify the true causal factors here. While I happen to believe that there is a safety in numbers effect, overlooking other explanations (or components of the observed effect) can lead to negative consequences.

Could it be that more cyclists lead to better bike facilities which are safer to use?

Personally I do believe in the safety in numbers hypothesis. When automobilists begin to commonly encounter other road users they begin to modify their driving patterns to be more accommodative.

Question: The last time you drove were you ready for a polar bear encounter? Didn't think so.

No one expects a polar bear encounter. That's what makes them so dangerous.

Whatever problems there may be with what Jacobsen did, Forester's alleged statistical debunking is pure bunk. His point that two ratios of random numbers show a declining relationship when one variable appears first in the denominator and then in the numerator, and that this somehow means one can never learn anything from the relationship between two ratios, is pseudo-statistics at its worst. His analysis would make the Tobacco Institute proud.

The actual variables Jacobsen is looking at are not random. There is a high correlation between the number of accidents in a city and the population of the city, as well as between the number of cyclists in a city and the population. Contrary to what Forester implies, looking at the accident rate (accidents/cyclist) and cycling rate (cyclists/population) is not some sort of statistical hocus pocus guaranteed to produce a pseudo result when there is just a random relationship. These are the variables about which one would logically be curious. One could just as easily show that accidents/population rise with about the 0.4 power of cyclists/population--random numbers will not get you that result!

Just to be a bit more concise: Jacobsen did a series of regressions and found the coefficients to be different from 1 (the value implying no safety in numbers) at a high level of significance. You don't get that with random numbers. The regressions are the basis of his conclusions. But then, just to help the reader understand the analysis, he divides both sides of the equation by the amount of cycling, and graphs that for your reading pleasure. Forester, who admits he does not understand statistics, then bases his entire critique on the graphics, as if they were the analysis.

I'd like to believe in the safety in numbers theory. I think it holds true for pedestrians, at the very least. Maybe I'm being obtuse here, but the statement "As ridership goes up, crash rates stay flat" implies that there is NO improved safety in numbers. If the conclusion were that the number of crashes were flat, as ridership increased, then that would help validate the safety-in-numbers theory. What am I missing?

Bryon, by crash rates I think they mean the raw number of crashes.

Hi Jim,

I never read Forester's webpage on the article. But elsewhere, Forester has stated that the A/B vs. B/C issue simply means that Jacobsen failed to prove safety in numbers. That is, the existence of the problem does not preclude the phenomenon from existing. But with regards to the second statement, yes, you can get a coefficient <1 with random numbers. It is straightforward to demonstrate.

Mind you, if you find the spreadsheet on John S Allen's website unconvincing, suppose that in truth there is zero effect; i.e., the coefficient is 1. But there is measurement error in the "B" statistic: a realistic assumption in my opinion. Now you have an endogeneity issue with the regression, which results in biased coefficients that will produce the observed results.

For what it is worth, I suspect that there is a safety in numbers effect. But that it is much smaller -- although probably still meaningful -- than the Portland, NYC, and Jacobsen statistics suggest.

Robert Hurst always suggests that the safety statistics improve because the number of adults increases disproportionately with respect to children. Since adults are commonly found to have fewer crashes/collisions than children, this results in better safety statistics.

I think that there is a potentially large selection effect happening. That is, as one provides a reportedly safer cycling environment, risk-averse people who would normally avoid road riding are willing to try cycling. Given that most reported crashes are solo events -- they do not involve automobiles -- if you simply put more risk-averse people on bikes (and one would suspect that they ride slower), safety statistics would improve.

The link to JS Allen's blog includes a paper on safety in numbers with respect to pedestrians. That paper discusses the statistical issue of measuring safety in numbers as well as their attempt to address the issue.

Geof, those are very interesting alternative theories. Food for thought.

Hi Geof,
My impression is that Jacobsen's basic model is

Accidents = a Cycling^b

where a and b are parameters that need to be estimated. In a regression one would estimate this equation by taking the logarithm:

log(Accidents) = A + b log(Cycling)

Since he has cross-sectional data for cities with varying sizes and populations, we would expect that the value of A varies by city even if the sensitivity parameter b is meaningful across cities, so his actual regression model was

log (Accidents/Population) = A + b log (Cycling/Population)

He gets 95% confidence interval estimates for b that almost always are entirely below 1.0. His typical value is 0.4 but there is some variation.

So his final result would be

(Accidents/Population) = a (Cycling/Population)^0.4

So I am wondering how one would get that result with a bunch of random numbers.
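To make that model concrete, here is a minimal sketch in Python (pure standard library; every constant is a made-up assumption of mine, not Jacobsen's data): it simulates cities with a genuine built-in b = 0.4 and checks that the per-capita log-log regression recovers it.

```python
import math
import random

random.seed(1)

def slope(xs, ys):
    """Closed-form OLS slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

b_true = 0.4
x, y = [], []
for _ in range(5000):
    pop = random.uniform(1e4, 1e6)
    cyc_share = random.uniform(0.01, 0.3)      # hypothetical cyclists per capita
    cycling = cyc_share * pop
    noise = math.exp(random.gauss(0, 0.2))     # multiplicative noise
    # True model: Accidents/Population = 0.05 * (Cycling/Population)^0.4
    accidents = 0.05 * pop * cyc_share ** b_true * noise
    x.append(math.log(cycling / pop))
    y.append(math.log(accidents / pop))

b_hat = slope(x, y)
print(round(b_hat, 2))   # ≈ 0.4
```

With a true effect in the data, the fitted slope lands on 0.4; the open question in this thread is whether data with no effect at all can land there too.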

After reporting those results, they multiply both sides by Population/Cycling to get

Accident/Cycling = a * (Cycling/Population)^-0.6, which they proceed to graph. That's so one can understand the relevance of their result for an individual cyclist, which is that the risk goes down even though--in that study--the total number of accidents goes up.

I certainly can imagine that bad data can give spurious results, but I just don't see the Forester logic, or in general the complaint about their use of A/C against C/P, given that this was just a final display after doing the analysis. Moreover, I am unclear how a large measurement error in the data would be expected to cause their result, unless you were to hypothesize that the measurement error for accidents is negatively correlated with the measurement error for cycling. Yes, of course one might get the spurious result had the actual regression model been log(Accidents/Cycling) = A + b log(Cycling/Population), but according to their paper, it wasn't.

Perhaps JS Allen explained it all. My comments only address what I saw on Forester's web site.


PS: Typo in one equation which should say

Accident/Cycling = a * (Cycling/Population)^-0.6

Who would have thought that the software's inability to take a "^" would matter.

OK. I looked at the Allen site and it seemed to just be summarizing the Forester critique. I left essentially the same comment there in case anyone is still tracking it, but that is 3 months old so maybe not.

Getting back to the actual causes. The shifting composition toward adults seems plausible--and estimable. Then again, when there are large upticks one might expect less informed cyclists as well, e.g. the sidewalk riders and the curb huggers who intend to be safe.

Fatalities tend to involve motor vehicles while accidents not so much. We are assuming here that safety in numbers means reported accidents. I am guessing that they are reported when a car is involved, but that a cyclist falling by himself may not report the accident to the sources for these studies.

I should think that as far as fatalities and other serious accidents with cars go, we would also expect a harvesting effect. A given bad driver typically is only going to kill one cyclist before learning to be more careful. While we tend to think that the supply of aggressive drivers is unlimited, maybe the ones that are so bad as to kill someone are more limited than we think (though they are minted every day).

Like you, I certainly doubt the idea that a 1% increase in cycling causes a drop in fatalities, regardless of what the time series show for specific cities. Motor vehicle deaths have also declined as driving increased--do we think that more driving really makes us safer?

But I can believe that over a range, the Jacobsen estimate that a 1% increase in cycling causes a 0.5% increase in reported accidents is about right.

FWIW, I'm with Jim on this. I think the Random Numbers argument is totally bogus. Here's why.

- The Safety in Numbers (SIN) concept is based on the expectation that the number of collisions will be proportional to the number of cyclists. That is, doubling the number of cyclists will, roughly, double the number of collisions. I will refer to this as the Proportionality Expectation.

- SIN proponents note that doubling the number of cyclists results in less than a doubling in the number of collisions. This result contradicts the Proportionality Expectation. SIN is put forth as the explanation.

- SIN opponents cook up what they say are reasonable random data for numbers of cyclists and numbers of collisions. These are values selected randomly within reasonable upper and lower limits, with the limits being based on actual data. They also find that doubling the number of cyclists results in less than a doubling in the number of collisions, reproducing the SIN effect. This, they say, debunks the analysis upon which the SIN effect is based.

Take note, however, that the method of producing so-called reasonable random data does not account for the Proportionality Expectation. Had they shared this expectation with the SIN proponents, they would have built it into their method of generating so-called reasonable random data and would not have reproduced the SIN effect. In other words, the random data people got out exactly what they put in--they didn't put in Proportionality and (surprise!) didn't find Proportionality. This proves nothing (OK, maybe it proves that Forester deserves another medal from the Society for Impeding Bicycle Infrastructure Through Obtuse and Bogus Arguments).

What I find interesting about all of this is that the Proportionality Expectation has not been examined or debated. It sounds reasonable, but is it? Maybe the real effect is that increased traffic slows everyone down so that the number (or severity) of collisions is decreased without anyone, people in cars or on bicycles, changing their degree of carefulness or consideration. Maybe the effect comes from other changes in the population, such as Geof suggests.

For advocacy purposes (is there another one that matters?), here is my conclusion: While many people expect to see the number of collisions increase in proportion to the number of cyclists, the data (even the fake data!) shows that they do not. As is so often the case, cycling is once again shown to be safer than people are led to believe.

Unfortunately, I am unable to determine what a reported crash is. For instance, does it include hospitalizations in some manner that would predominately be solo events? Or would it only include auto collisions that require an accident report from a LEO? Moreover, since a lot of Jacobsen's data is from overseas, I am unsure whether my concepts of when a LEO reports a collision -- based on what I observe here -- would be consistent with the data. The Portland and NYC paper/graphs linked on Grist, unfortunately, do not include a clear definition of crashes (Portland) or injuries/fatalities (NYC). I agree that the interpretation of the findings could change with the safety metric.

Regarding the statistical result, we agree that Jacobsen regressed something that normalized for population as opposed to the raw counts (Table 1). But you have correlated variables on the left and right hand sides of the regression, which results in the error term being correlated with the predictor. This is the only factor needed to produce a biased result. A regression of ln(accidents) = a + b ln(cyclists) will result in a consistent estimate of b. A regression of ln(accidents/pop) = a + b ln(cyclists/pop) will result in a biased estimate of b. You can think of it a few ways, but I just took Tom Revay's spreadsheet, added a tab, and did the regressions to demonstrate. Making it more sophisticated will make no meaningful difference in the underlying result.** Going through a few iterations, the effect is obvious and quite large. I'll pass a copy along to you.

Mind you, people make this error all the time. I first ran across it as an RA for Phil Cook, who found a related problem with John Lott's work (the more guns = less crime guy). There is an extended literature of work trying to get around the issue. Moreover, the effect on parameter estimates can be quite large.

** There is proportionality happening here since the draws from the uniform distribution have the same mu and sigma. The whole point of the exercise is that the expected accidents divided by cyclists is constant across the population. Yet nonetheless, the parameter estimate is consistently biased downward from 1. So if you do the regression in the manner I claim is unbiased, the coefficient is b = 0. When done in the manner I claim is biased, b ~ 0.5.
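Geof's spreadsheet experiment can be re-created in a few lines of Python (a hypothetical re-creation with standard-library code and my own uniform ranges, not Tom Revay's actual spreadsheet): draw accidents, cyclists, and population independently, then run the two regressions.

```python
import math
import random

random.seed(42)

def slope(xs, ys):
    """Closed-form OLS slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

n = 10_000
# Independent "cities": accidents, cyclists, population drawn from the
# same uniform range, so the log of each has roughly equal variance.
A = [random.uniform(1, 100) for _ in range(n)]
C = [random.uniform(1, 100) for _ in range(n)]
P = [random.uniform(1, 100) for _ in range(n)]

# Raw regression: ln(accidents) on ln(cyclists) -- no relationship.
b_raw = slope([math.log(c) for c in C], [math.log(a) for a in A])

# Per-capita regression: ln(A/P) on ln(C/P) -- the shared denominator
# induces a spurious positive slope even though A and C are independent.
b_ratio = slope([math.log(c / p) for c, p in zip(C, P)],
                [math.log(a / p) for a, p in zip(A, P)])

print(round(b_raw, 2), round(b_ratio, 2))   # b_raw ≈ 0, b_ratio ≈ 0.5
```

With equal variances in the logs, the spurious slope works out to sigma_p^2 / (sigma_c^2 + sigma_p^2) = 0.5, matching the ~0.5 Geof reports.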

BTW, I got the Jacobsen paper and the spreadsheet mixed up in my head at the end.

The spreadsheet clearly has no relationship (b = 0) between accidents and cyclists. But the mean accidents-to-cyclists ratio has the same expectation across all simulated observations (a = some positive number). Now if I normalize both sides by population, one might expect that the coefficient remains zero since, by construction, we know that there is no relationship between the two. But instead, I roughly get a coefficient of 0.5.

My suggestion was that Jacobsen ran a regression of the form:

ln(accidents) = a + b ln(cyclists)

which is a functional specification you approve of.

That is what the paper says. And if there was any doubt, how else would the regression coefficients generally be about 0.5 or so? Had he done the regression to which you object, the coefficients would have been negative, right?

OK. I re-read what you said. In a cross-sectional regression, why do you think that the error term is correlated with the independent variable, and why would that result from dividing both sides of the equation by population?

And can you confirm that you do not agree with Forester's complaint that the analysis itself was similar to his parody (a.k.a. A/C = a + b C/P)?

To be more concise, I doubt one could test safety in numbers with the model

log(accidents) = a + b log (cyclists),

using raw data, since this specification does not account for cycling density.

Across cities, the parameter a would be highly correlated with the population. That is, for a given amount of cycling, the higher the population, the more accidents we would expect. But since a is a fixed parameter in the regression, the error term instead is highly correlated with population.

And of course, since cycling is also highly correlated with population, that almost surely means that the error term would be highly correlated with the independent variable cycling.

Divide everything by population to make the model per capita accidents as a function of per capita cycling, and we don't have that problem anymore. Yes of course in reality every city would have a different "a", which represents all accidents caused by anything other than per-capita cycling. But I don't see any obvious reason why that "a" would be correlated with per capita cycling enough to alter the results. At least no more so than any other regression anyone ever does.
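The scenario Jim describes can be simulated too (hypothetical numbers, mine): draw per-capita accident and cycling rates independently, scale both by population, and compare the raw-count regression with the per-capita one.

```python
import math
import random

random.seed(7)

def slope(xs, ys):
    """Closed-form OLS slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

n = 10_000
ln_A, ln_C, ln_P = [], [], []
for _ in range(n):
    pop = random.uniform(1e4, 1e6)
    rate_c = random.uniform(0.01, 0.3)    # hypothetical cyclists per capita
    rate_a = random.uniform(0.001, 0.03)  # accidents per capita, independent of rate_c
    ln_A.append(math.log(rate_a * pop))
    ln_C.append(math.log(rate_c * pop))
    ln_P.append(math.log(pop))

# Raw counts: both sides scale with population, so ln(A) on ln(C) shows
# a strong "relationship" even though the underlying rates are independent.
b_raw = slope(ln_C, ln_A)

# Per capita: dividing both sides by population removes the shared scale.
b_pc = slope([c - p for c, p in zip(ln_C, ln_P)],
             [a - p for a, p in zip(ln_A, ln_P)])

print(round(b_raw, 2), round(b_pc, 2))   # b_raw well above 0, b_pc ≈ 0
```

Here the shared population scale makes the raw regression look like a strong relationship, while the per-capita regression correctly finds none.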

One further point about Forester's critique. Correct me if I am wrong, but with random numbers, a regression of

log (A/C) = a + B log(C/P) should get a parameter of B = -1. (And that would mean b = 0 for the equation that they estimated. But they got 0.4 instead.)

In the spreadsheet example, accidents and cyclists are uncorrelated ... b = 0. But when I scale both sides by a perfectly correlated variable, the randomly drawn population figure from the simulated observation, the coefficient changes to ~0.5. I need to sit down to work through the math to be more precise; but in my head, I think it might work out naturally.

In regards to the Jacobsen paper, you mentioned why this is precisely such a problem. You need a measure of density, but it often creates an endogeneity problem that biases the coefficient.

... I have to run Jim. Sorry. I'll sit down soon to write something that should clear up the issue. It might be Monday with the two kiddies. But I will e-mail you when finished.

Yes we might as well close out this thread.

I follow what you are saying about how one might get a similar coefficient from a completely random set of data. Though I do not know how you can scale both sides with a variable that is perfectly correlated with each side. How could A and C be uncorrelated, if both are perfectly correlated with P?

I still don't really see what that has to do with the Jacobsen data set, which didn't use three random data sets. Clearly, the whole reason for dividing by population was that the variables were highly correlated with population.

I'd be more persuaded by a data set generated with a model that shows no safety in numbers (accidents are proportional to cycling with a random component), but then when you do the Jacobsen test you get Jacobsen's results. If the assumptions used to generate that data set were plausible, it would be persuasive that there is less than meets the eye to Jacobsen's study.

But I simply do not believe--nor do I think Forester believes--that Jacobsen went out and just got some random numbers (or data that is so bad it might as well be random numbers). So the fact that random numbers could do that leaves me wondering about the point.

If a reputable scientist did a careful lab experiment and found 20 ug lead per liter in a child's blood, someone else might say: "I find 20 ug Pb/l in the water from a drainage ditch leading from a gas station across the street from the lab." OK. So what am I to make of that? Someone mailed a sample meant for EPA to the hospital, and they got switched? All labs default to that concentration? Or is it just an irrelevant coincidence?

Of course, with the Jacobsen paper, one wonders why they didn't simply regress

log(accidents) = a + b log(cycling) + c log (population)

Yes cycling and population are correlated, but not perfectly correlated. My hunch would be that they got nonsense parameters too much of the time when they did this. So they chose between dividing both sides by population, and not doing so, and got a better fit and an interesting result when they did the former. An F test comparing how much better the model is when dividing by population would have been nice.
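For what it's worth, that alternative specification is easy to try on simulated data (a sketch with made-up constants and a built-in b = 0.4, not Jacobsen's data): a two-regressor fit of log(accidents) on log(cycling) and log(population), solved via the 2x2 normal equations.

```python
import math
import random

random.seed(3)

n = 5_000
x1, x2, y = [], [], []   # ln(cycling), ln(population), ln(accidents)
for _ in range(n):
    pop = random.uniform(1e4, 1e6)
    share = random.uniform(0.01, 0.3)   # hypothetical cyclists per capita
    cycling = share * pop
    # True model: accidents = 0.05 * pop * (cycling/pop)^0.4, with noise,
    # i.e. ln(accidents) = const + 0.4 ln(cycling) + 0.6 ln(pop) + error.
    accidents = 0.05 * pop * share ** 0.4 * math.exp(random.gauss(0, 0.2))
    x1.append(math.log(cycling))
    x2.append(math.log(pop))
    y.append(math.log(accidents))

def centered(v):
    m = sum(v) / len(v)
    return [u - m for u in v]

c1, c2, cy = centered(x1), centered(x2), centered(y)
s11 = sum(a * a for a in c1)
s22 = sum(a * a for a in c2)
s12 = sum(a * b for a, b in zip(c1, c2))
s1y = sum(a * b for a, b in zip(c1, cy))
s2y = sum(a * b for a, b in zip(c2, cy))

# Solve the 2x2 normal equations for the two slopes.
det = s11 * s22 - s12 ** 2
b = (s22 * s1y - s12 * s2y) / det   # coefficient on ln(cycling)
c = (s11 * s2y - s12 * s1y) / det   # coefficient on ln(population)

print(round(b, 2), round(c, 2))   # ≈ 0.4 and ≈ 0.6
```

With well-behaved simulated data, the collinearity is not fatal and both coefficients come back cleanly; noisier real data could be another story.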

Hi Jim,

Sorry for the delay but here is my post ...


