Police violence is an increasingly controversial topic within America’s civic discourse. Every year, police violence affects countless Americans and regularly incites debate about the presence of police and the tactics that they employ. When incidents of police violence occur, details about the victim’s race, age, and income are regularly discussed. Consequently, these small details coalesce into a larger discussion about the socioeconomic characteristics that influence disproportionate rates of police violence. Utilizing data from FiveThirtyEight and The Guardian, this paper contributes to a larger discussion of police violence by analyzing the attributes of victims and the places they live.

According to FiveThirtyEight, victims of police shootings tend to live in poorer areas of the country. Given this relationship, we sought to investigate the factors behind it. Using various economic characteristics at the census-tract level, we aimed to predict the median household income of census tracts where fatal police confrontations occur. Income prediction is a useful method for investigating the lives of low-income victims of police violence. By identifying significant predictors of median income, policymakers can better interpret systemic differences, or barriers, in our country. For vulnerable groups, such as victims of police shootings, policymakers can then address systemic inequality and work towards actionable change. Overall, analyzing variables that influence income elucidates the socioeconomic realities that affect police shooting victims and Americans in similar economic circumstances.

Many of the victims within this dataset were armed; therefore, access to firearms may affect the likelihood of becoming a victim of police violence. We investigated the relationship between state gun laws and instances of fatal police killings. Gun laws vary from state to state, and these differences in regulation have long been a subject of dispute amongst policymakers. In what seems to be a gun violence epidemic, some lawmakers argue that all states should have universal firearms regulations. At the individual level, information about the “armed” status of victims may influence the decision to carry a weapon. More broadly, if any relationship exists between state-level firearm regulations and the rate of police killings, there would be major policy implications for firearm ownership that could help minimize police-related fatalities. In order to address the disparate levels of police violence amongst specific communities, we must understand the broader instrumentalities that directly affect firearm accessibility.

The widespread reach of police violence has become a household topic in American politics. The systemic nature of its reach has allowed a dichotomy in power to emerge between victim and assailant. In order to better understand the nature of police violence, it is essential to analyze the data that underpins its occurrence. This project, while a small piece of a larger picture, seeks to illuminate the societal factors that lead to higher rates of fatal police confrontations and to understand the relationship between victims and gun control laws.


Our initial data, acquired from FiveThirtyEight, detailed 467 Americans who were killed by police from January through June of 2015. The dataset we used was partially collected by The Guardian through its ongoing project “The Counted”. Unlike police records, this data incorporated journalist-verified information about the locations where incidents of police violence occurred. The Guardian collected a comprehensive list of characteristics of each police killing, including, but not limited to, the victim’s name, age, gender, and race, whether the victim was armed, and the cause, location, and time of death. The Guardian utilized its resources to build, in its own words, a “more comprehensive” record of police fatalities in order to boost public accountability. FiveThirtyEight then appended socioeconomic information at the census-tract level for each victim. The appended variables include the census tract in which the shooting occurred, the racial proportions of the census tract, the median personal and household income of the census tract, the unemployment rate of the census tract, the poverty rate of the census tract, and the percentage of citizens over the age of 25 with a B.A. degree or more. Beyond FiveThirtyEight’s data, our group added data from the Giffords Law Center, an authority on gun regulations and firearm safety throughout the country, and factored it into our model.

At the beginning of our analysis, 467 observations were contained in our dataset. After deliberation, we decided to remove observations that were missing crucial socioeconomic information. For example, two men were shot in airports and had multiple N/A values for their census data. Because we were unable to estimate and introduce new information from the missing values, we decided to remove data with missing variables. With such a small dataset for such a large area, we maintained that every observation’s data must be complete in order to conduct a definitive analysis. In line with this logic, we removed the variable county_bucket due to many N/A values for observations. While the county bucket served as a geographic grouping variable for the US census, it was extraneous information for our analysis. By the end of our data cleaning, we had 465 observations and a variety of variables left in our dataset.
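The cleaning step described above can be illustrated with a minimal Python sketch. It is not the code used for the analysis, which was carried out in R. The records, the field names `h_income` and `pov`, and the order of the two steps are assumptions made for illustration; only `county_bucket` is a name taken from the text.

```python
# Hypothetical records: the values are invented, and the field names
# "h_income" and "pov" are assumed stand-ins for the census variables.
records = [
    {"name": "A", "h_income": 42759.0, "pov": 18.2, "county_bucket": 2},
    {"name": "B", "h_income": None, "pov": None, "county_bucket": None},
    {"name": "C", "h_income": 51000.0, "pov": 9.5, "county_bucket": None},
]


def drop_column(rows, column):
    """Return copies of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != column} for row in rows]


def drop_incomplete(rows):
    """Keep only rows in which every remaining field is present."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Assumed order: drop the sparse column first, so that its gaps do not
# discard rows that are otherwise complete.
cleaned = drop_incomplete(drop_column(records, "county_bucket"))
print(len(records), "->", len(cleaned))
```

Dropping the sparse column before filtering matters in this sketch: record "C" would otherwise be discarded solely because of its missing `county_bucket` value.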

Table One: Summary Statistics for Critical Values Utilized in Income Prediction Model

| Statistic | Population | Proportion White | Median Tract Household Income | Poverty Rate | Unemployment Rate | Proportion Attended College |
| --- | --- | --- | --- | --- | --- | --- |
| Mean | 4804.295 | 51.91742 | 46627.18 | 21.11161 | 0.11739939 | 0.22021668 |
| Median | 4465.000 | 56.50000 | 42759.00 | 18.20000 | 0.10518053 | 0.16954430 |
| Std. Dev. | 2358.772 | 30.00139 | 20511.19 | 13.21596 | 0.06917513 | 0.15834723 |
| IQR | 2463.500 | 51.45000 | 23704.50 | 17.85000 | 0.07232124 | 0.17877701 |
| Min | 403.000 | 0.00000 | 10290.00 | 1.10000 | 0.01133501 | 0.01354724 |

Figure One: Proportion of Deaths per Region by Median Household Income

The census tract demographic characteristics appended by FiveThirtyEight served as a basis for our income prediction model. Utilizing the 465 observations, we regressed the numeric variable, which represented the median household income of the census tract, on a variety of characteristics. The base regression model was a simple linear model utilizing the numerical variable population as its sole regressor. Over time, we added additional numerical and categorical regressors into the model utilizing statistical significance and the model’s overall predictive accuracy as benchmarks for inclusion. Utilizing the stepwise regression technique, we incorporated several statistically significant numerical variables into the model. A census tract’s poverty rate, unemployment rate, population, proportion of White residents, and the proportion of people who attended college were significant within the final regression of the model. A quadratic of the proportion who attended college was also taken into consideration. The region categorical variable remained significant during the stepwise regression fitting. In order to control for broader regional characteristics, we merged a dataset that indicated the geographic region for each state according to the census. Utilizing a combination of cross validation and stepwise regression techniques, we were able to whittle down the model’s bias.

Figure Two: Regional Proportion of Police Killings by Race

Table Two: Proportions of Population by Race and Region

| Region | White | Black | Native American | Asian/Pacific Islander | Hispanic | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Northeast | 0.4667101 | 0.37396877 | 0.059521605 | 0.03100421 | 0.06879530 | 0.2907287 |
| Midwest | 0.7926081 | 0.10411630 | 0.005904141 | 0.02644776 | 0.07092366 | 0.1880175 |
| South | 0.6108384 | 0.19184457 | 0.006475967 | 0.02878843 | 0.16205262 | 0.3217498 |
| West | 0.5449439 | 0.04609114 | 0.014380769 | 0.09926779 | 0.29531643 | 0.1995040 |

To analyze the relationship between gun laws and instances of fatal police confrontations, we required a method of ranking states based on the safety of their firearm regulations. We accomplished this by utilizing data from the Giffords Law Center, which grades each state on a scale of “A” to “F” based on its relative level of gun safety, as determined by the strictness of its firearm regulations. To rank states, Giffords created a points system in which each state received or lost points for meeting specific criteria. For example, risky firearm practices, such as allowing guns in schools or having “Stand your Ground” laws, resulted in point deductions, while safer firearm regulations, such as limits on bulk firearm purchases, earned points. Once the points for each state were summed, Giffords divided the point ranges to create hierarchical safety grades. The majority of states are classified in the “F” rank, and all other ranks contain approximately the same number of states. In order to perform the analysis, we merged the gun safety rankings with the police killings dataset. Some grades carried a +/- modifier, but too few to justify treating them as separate ranks, so we removed the modifiers and kept whole letter grades.
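The grade cleaning and merge can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the analysis; the state codes, grades, and the helper name `whole_letter_grade` are hypothetical.

```python
def whole_letter_grade(grade):
    """Strip a trailing '+' or '-' so that 'B+' and 'B-' both become 'B'."""
    return grade.strip().rstrip("+-")


# Hypothetical state codes and grades, invented for illustration only.
state_grades = {"AA": "A-", "BB": "C+", "CC": "F"}

# Hypothetical incident rows; each carries the state where it occurred.
incidents = [{"state": "AA"}, {"state": "CC"}, {"state": "CC"}]

# Merge: every incident picks up the whole-letter grade of its state.
merged = [
    {**row, "grade": whole_letter_grade(state_grades[row["state"]])}
    for row in incidents
]
print(merged)
```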

When looking at occurrences of fatal police confrontations, we found it necessary to standardize the instances of police killings across states. To do this, we created a metric for the number of police killings per 100,000 people in each state. This metric required each state’s population in 2015, which we obtained from a dataset of state populations from 2010-2016, as taken from the US Census Bureau. We calculated police killings per 100,000 people by dividing the number of fatal police confrontations in each state by that state’s 2015 population and then multiplying by 100,000. This measure accounts for the population differences between states, allowing us to analyze police killings through a “per-capita” metric.
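As a worked illustration of the per-capita metric, the following Python sketch computes the rate for two hypothetical states. The state codes, counts, and populations are invented, and the function name is an assumption.

```python
def killings_per_100k(killings, population):
    """Fatal police confrontations per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiplying before dividing is algebraically the same as dividing
    # by the population and then multiplying by 100,000.
    return killings * 100_000 / population


# Hypothetical counts and 2015 populations, for illustration only.
counts = {"AA": 12, "BB": 3}
populations = {"AA": 6_000_000, "BB": 600_000}

rates = {state: killings_per_100k(counts[state], populations[state])
         for state in counts}
print(rates)
```

In this invented example the smaller state has the higher rate despite having fewer incidents, which is the kind of difference the raw counts would hide.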


Question One:

Utilizing census-level variables, we created a simple base linear model with population as the sole predictor of median household income. Models two, three, and four each added a regressor to the previous model: respectively, the tract’s poverty rate, the age of the victim, and the tract-level proportion of citizens over the age of 25 who attended college or more. The fifth model incorporated several other variables: the proportion of White citizens, the race of each citizen, and the region of the country. Additionally, an interaction between race and region was included within the fifth model to control for broader socioeconomic characteristics. Given that a majority of the variables in the first five models were not strongly significant, we utilized a stepwise regression technique to retain statistically significant variables in a sixth iteration of the model. This technique required two models: one full model with all the variables we deemed suitable for regression, and one empty model with no predictors. For the full model, we excluded variables that had superfluous or redundant data, such as county ID numbers and tract ID numbers. To stay consistent with census-level data, we also removed categorical variables with too many unique entries, such as the Law Enforcement Agency involved in the police killing, the City, and the State. After this data cleaning, we passed our full and empty models to R’s “step” function and used an equation for the Mean Squared Error (MSE) as our scale. From there, R added and removed variables until no further change improved the model. After the stepwise regression, we added a quadratic term to improve the predictive power of the final model. The final model is summarized in the following equation:

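\(\widehat{\text{MedianIncome}}_i = \beta_0 + \beta_1 \text{PovertyRate}_i + \beta_2 \text{Population}_i + \beta_3 \text{ProportionWhite}_i + \beta_4 \text{ProportionAttendedCollege}_i + \beta_5 \text{ProportionAttendedCollege}_i^2 + \sum_j \gamma_j \text{Region}_{ij}\)

Here \(i\) indexes census tracts, \(\text{Region}_{ij}\) indicates whether tract \(i\) lies in region \(j\), and the \(\beta\) and \(\gamma\) terms are the fitted coefficients.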
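The add-and-compare loop at the heart of stepwise selection can be sketched as follows. This is a simplified Python stand-in, not the procedure used in the analysis: it only adds variables (it never removes them), it scores models by AIC, and the data, variable names, and helper functions are all hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(r[j] * r[k] for r in design) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * y[i] for i, r in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def aic(columns, y):
    """AIC up to an additive constant: n * log(RSS / n) + 2 * parameters."""
    n = len(y)
    return n * math.log(rss(columns, y) / n) + 2 * (len(columns) + 1)


def forward_select(candidates, y):
    """Greedily add the predictor that lowers AIC most; stop when none does."""
    chosen, best = [], aic([], y)
    while True:
        trials = {name: aic([candidates[c] for c in chosen + [name]], y)
                  for name in candidates if name not in chosen}
        if not trials:
            break
        name = min(trials, key=trials.get)
        if trials[name] >= best:
            break
        chosen.append(name)
        best = trials[name]
    return chosen


# Synthetic data: the outcome depends on two predictors and not on "noise".
rng = random.Random(0)
n = 200
candidates = {
    "poverty_rate": [rng.random() for _ in range(n)],
    "college": [rng.random() for _ in range(n)],
    "noise": [rng.random() for _ in range(n)],
}
y = [5 - 3 * p + 4 * c + rng.gauss(0, 0.1)
     for p, c in zip(candidates["poverty_rate"], candidates["college"])]

chosen = forward_select(candidates, y)
print(chosen)
```

A bidirectional procedure would also test, at each step, whether dropping an already selected variable improves the criterion; that half is omitted here to keep the sketch short.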

Table Three: Measures of Error within each Model

| Model | Bias | MAE | RMSE |
| --- | --- | --- | --- |
| Base Model | -1.351333e+01 | 14623.944 | 19594.470 |
| Model 2 | -4.630080e-11 | 9078.931 | 13327.418 |
| Model 3 | 5.018324e-11 | 9133.069 | 13382.642 |
| Model 4 | 6.585511e-11 | 8118.101 | 11018.829 |
| Model 5 | 6.419875e-11 | 7739.584 | 10522.944 |
| Stepwise Model | -5.697986e-11 | 7277.157 | 9975.437 |

Utilizing K-fold cross validation with 100 folds, we tested each model. Table Three is the result of this testing, displaying a series of statistics detailing the overall error within each model. There are several trends to observe. Between the base model and model five, bias tends to increase. Between models four and five, however, bias decreases slightly. The mean absolute error (MAE) and the root mean squared error (RMSE) also decrease between the base model and model five. The RMSE in particular shows that the variance in residuals decreases as more variables are factored into the model. Referencing Figure Three, the distribution of residuals increases across each model iteration as the variance between residuals decreases. The base model has the lowest spread and highest variance in residuals, while model five’s residuals have the largest spread and smallest variance. Figure Three corroborates the decreasing magnitude of residuals with each new model. With five regressors, the mean error decreases significantly from the base model’s average error. Even with insignificant regressors, the decrease in the frequency distribution of error magnitudes indicates higher overall predictive accuracy of the model. The stepwise model incorporated six statistically significant variables. As Table Three shows, this change made the bias of the model negative overall. Unlike the three previous model iterations, which had positive biases, the inclusion of statistically significant terms switched the sign of the stepwise model’s error. The MAE and RMSE also decreased, albeit slightly, in the stepwise model. Returning to Figure Three, there appears to be little difference in the magnitude of change between errors. It should be noted, however, that there is greater variance in the residuals at the tail end of the stepwise model.
To better visualize these residuals, Figure Four shows that the spread and variance of residuals in model five differ greatly from those of the stepwise model. Overall, with the greatest number of significant regressors, the stepwise model contains the lowest levels of error in predicting median household income. Logically, the errors in our model can be attributed to the natural distribution of income throughout the United States. Income is not normally distributed; as shown in Figure One, its distribution is roughly bell-shaped but skewed to the right by several significant outliers.
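The three error measures discussed above (bias, MAE, and RMSE) can be computed from held-out residuals, as in the following Python sketch. It is a minimal sketch under stated assumptions rather than the procedure used for Table Three: it fits a one-regressor line, uses synthetic data, defines a residual as actual minus predicted, and uses 10 folds instead of 100. All function and variable names are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def k_fold_errors(xs, ys, k, seed=0):
    """Return (bias, MAE, RMSE) computed from out-of-fold residuals."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    residuals = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        # Assumed convention: residual = actual - predicted.
        residuals += [ys[i] - (intercept + slope * xs[i]) for i in fold]
    n = len(residuals)
    bias = sum(residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    return bias, mae, rmse


# Synthetic tracts: income falls with the poverty rate, plus random noise.
rng = random.Random(1)
poverty = [rng.uniform(1, 60) for _ in range(200)]
income = [70000 - 800 * p + rng.gauss(0, 5000) for p in poverty]

bias, mae, rmse = k_fold_errors(poverty, income, k=10)
print(f"bias={bias:.2f}  MAE={mae:.2f}  RMSE={rmse:.2f}")
```

Because squaring weights large residuals more heavily, the RMSE is never smaller than the MAE, which matches the pattern in every row of Table Three.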

Figure Three: Residuals Across Models