Replicability and Pitfalls in the Interpretation of Resampled Data: A Correction and a Randomization Test for Anwar and Fang




In their article “An Alternative Test of Racial Prejudice in Motor Vehicle Searches: Theory and Evidence,” published in the American Economic Review in 2006, Shamena Anwar and Hanming Fang study racial prejudice in motor vehicle searches by Florida Highway Patrol officers (“troopers”). Their data include the race and ethnicity of the trooper and of the motorist stopped and possibly searched. A search is deemed successful if the trooper finds contraband in the vehicle. Using data on troopers and motorists of three race-ethnicity groups (white non-Hispanic, black, and white Hispanic, with others being dropped), Anwar and Fang compute nine trooper-on-motorist search rates and nine search-success rates. They present a model that exploits this information to test whether troopers go beyond statistical discrimination to racial prejudice. Irrespective of whether troopers exhibit racial prejudice, the model has a crucial testable implication concerning the rank order of the search rates and of the search-success rates. Anwar and Fang report that their data fit this predicted rank order with high statistical significance across the board, which they take as strong support for the soundness of the model. The model is then applied to the question of racial prejudice. They do not find evidence of racial prejudice, and neither do I, so the present critique does not contradict their results about prejudice.
The present critique starts by reporting on my effort to replicate Anwar and Fang’s preliminary rank-order findings. I am unable to replicate two of their nine reported search-success rates, nor can I replicate the reported statistical significance of four of the six Z-statistics and one of the three χ² test statistics for the rankings of the search-success rates. My new results imply that the empirical support for the model’s soundness is weaker than Anwar and Fang claim.
This problem of irreplicability is my primary point, but I then turn to a second matter: my replications draw attention to a neglected statistical caveat in Anwar and Fang’s implementation of the empirical tests of racial prejudice. The novel resampling procedure they employ does not provide robust results. I pinpoint the empirical source of this issue and, in an appendix, show how a simple extension to their method improves robustness. In another appendix I put forth an alternative randomization test that seems more appropriate when testing such resampled data.
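For readers unfamiliar with randomization tests, the general idea can be sketched as follows. This is a generic one-sided permutation test for a difference between two success rates. It is not the specific test developed in the appendix, nor Anwar and Fang's resampling procedure. The function name, its parameters, and the counts in the usage note are hypothetical and chosen only for illustration.

```python
import random


def permutation_test_rate_diff(successes_a, n_a, successes_b, n_b,
                               n_perm=10000, seed=0):
    """Generic one-sided permutation test of H1: rate_a > rate_b.

    Illustrative sketch only (hypothetical helper, not the paper's test).
    Returns the observed rate difference and a permutation p-value.
    """
    rng = random.Random(seed)
    total_successes = successes_a + successes_b
    total_n = n_a + n_b

    # Pool all outcomes: 1 = successful search, 0 = unsuccessful search.
    outcomes = [1] * total_successes + [0] * (total_n - total_successes)
    observed = successes_a / n_a - successes_b / n_b

    at_least_as_extreme = 0
    for _ in range(n_perm):
        # Reassign outcomes to the two groups at random, keeping group sizes.
        rng.shuffle(outcomes)
        perm_a = sum(outcomes[:n_a])
        perm_b = total_successes - perm_a
        diff = perm_a / n_a - perm_b / n_b
        # Small tolerance guards against floating-point ties.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_extreme + 1) / (n_perm + 1)
    return observed, p_value
```

Calling `permutation_test_rate_diff(40, 50, 10, 50)` with these made-up counts compares a success rate of 0.8 against 0.2. The p-value is the share of random relabelings that produce a difference at least as large as the observed one.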