Police Killings Analysis

Introduction

Police violence is an increasingly controversial topic within America’s civic discourse. Every year, police violence affects countless Americans and regularly incites debate about the presence of police and the tactics that they employ. When incidents of police violence occur, details about the victim’s race, age, and income are regularly discussed. Consequently, these small details coalesce into a larger discussion about the socioeconomic characteristics that influence disproportionate rates of police violence. Utilizing data from FiveThirtyEight and The Guardian, this paper contributes to a larger discussion of police violence by analyzing the attributes of victims and the places they live.

According to FiveThirtyEight, victims of police shootings tend to live in poorer areas of the country. Given this relationship, we sought to investigate the factors behind it. Using economic characteristics measured at the census-tract level, we aimed to predict the median household income of census tracts where fatal police confrontations occur. Income prediction is a useful way to investigate the circumstances of low-income victims of police violence. By identifying significant predictors of median income, policymakers can better interpret systemic differences, or barriers, in our country. For vulnerable groups, such as victims of police shootings, policymakers can then address systemic inequality and work towards actionable change. Overall, analyzing the variables that influence income elucidates the socioeconomic realities that affect police shooting victims and Americans in similar economic circumstances.

Many of the victims in this dataset were armed; access to firearms may therefore affect the likelihood of being killed by police. We investigated the relationship between state gun laws and instances of fatal police killings. Gun laws vary from state to state, and these differences in regulation have long been a subject of dispute amongst policymakers. In what seems to be a gun violence epidemic, some lawmakers argue that all states should have universal firearms regulations. At the individual level, information about the “armed” status of victims may influence the decision to carry a weapon. More broadly, if any relationship exists between state-level firearm regulations and the rate of police killings, there would be major policy implications for firearm ownership that could help minimize police-related fatalities. In order to address the disparate levels of police violence amongst specific communities, we must understand the broader factors that directly affect firearm accessibility.

Police violence has become a household topic in American politics. Its systemic nature has allowed an imbalance of power to emerge between victim and assailant. In order to better understand the nature of police violence, it is essential to analyze the data that underpins its occurrence. This project, while a small piece of a larger picture, seeks to illuminate the societal factors associated with higher rates of fatal police confrontations, and to examine the relationship between police killings and gun control laws.

Data

Our initial data, acquired from FiveThirtyEight, detailed 467 Americans who were killed by police from January through June of 2015. The dataset we used was partially collected by The Guardian through its ongoing project “The Counted”. Unlike police records, this data incorporated journalist-verified information about the locations where incidents of police violence occurred. The Guardian collected a comprehensive list of characteristics of each police killing including, but not limited to: the victim’s name, age, gender, and race, whether the victim was armed, and the cause, location, and time of death. The Guardian used its resources to build, in its own words, a “more comprehensive” record of police fatalities in order to boost public accountability. FiveThirtyEight then appended socioeconomic information at the census-tract level for each victim. These variables include the census tract in which the shooting occurred, the racial proportions of the tract, its median personal and household income, its unemployment rate, its poverty rate, and the percentage of its residents over the age of 25 with a B.A. degree or more. Beyond FiveThirtyEight’s data, our group added data from the Giffords Law Center, an authority on gun regulations and firearm safety throughout the country, and factored it into our analysis.

At the beginning of our analysis, our dataset contained 467 observations. After deliberation, we decided to remove observations that were missing crucial socioeconomic information. For example, two men were shot in airports and had multiple N/A values for their census data. Because we could not reliably estimate the missing values, we removed these two observations. With such a small dataset for such a large area, we maintained that every observation’s data must be complete in order to support a sound analysis. In line with this logic, we removed the variable county_bucket because it had N/A values for many observations. While the county bucket served as a geographic grouping variable for the US census, it was extraneous information for our analysis. By the end of our data cleaning, we had 465 observations and a variety of variables left in our dataset.

Table One: Summary Statistics for Critical Values Utilized in Income Prediction Model

Statistic	Population	Proportion White (%)	Median Tract Household Income ($)	Poverty Rate (%)	Unemployment Rate (fraction)	Proportion Attended College (fraction)
Mean	4804.295	51.91742	46627.18	21.11161	0.11739939	0.22021668
Median	4465.000	56.50000	42759.00	18.20000	0.10518053	0.16954430
Std. Dev.	2358.772	30.00139	20511.19	13.21596	0.06917513	0.15834723
IQR	2463.500	51.45000	23704.50	17.85000	0.07232124	0.17877701
Min	403.000	0.00000	10290.00	1.10000	0.01133501	0.01354724

Figure One: Proportion of Deaths per Region by Median Household Income

The census tract demographic characteristics appended by FiveThirtyEight served as the basis for our income prediction model. Using the 465 observations, we regressed the numeric variable med.tract.household.inc, which represents the median household income of the census tract, on a variety of characteristics. The base regression model was a simple linear model with the numerical variable population as its sole regressor. We then added further numerical and categorical regressors, using statistical significance and the model’s overall predictive accuracy as benchmarks for inclusion. In order to control for broader regional characteristics, we merged in a dataset that indicates the census region of each state. Using stepwise regression, we incorporated several statistically significant variables into the model. A census tract’s poverty rate, population, proportion of White residents, and proportion of residents who attended college were significant in the final model, as reported in Table Four. A quadratic term for the proportion who attended college was also included, and the categorical region variable remained significant during the stepwise fitting. Using a combination of cross-validation and stepwise regression, we were able to reduce the model’s prediction error.

Figure Two: Regional Proportion of Police Killings by Race

Table Two: Proportions of Population by Race and Region

Region	White	Black	Native American	Asian/Pacific Islander	Hispanic	Total
Northeast	0.4667101	0.37396877	0.059521605	0.03100421	0.06879530	0.2907287
Midwest	0.7926081	0.10411630	0.005904141	0.02644776	0.07092366	0.1880175
South	0.6108384	0.19184457	0.006475967	0.02878843	0.16205262	0.3217498
West	0.5449439	0.04609114	0.014380769	0.09926779	0.29531643	0.1995040

To analyze the relationship between gun laws and instances of fatal police confrontations, we required a method of ranking states by the safety of their firearm regulations. We used data from the Giffords Law Center, which grades each state on a scale of “A” to “F” according to the strictness of its firearm regulations. To rank states, Giffords created a points system in which each state gains or loses points for meeting specific criteria. For example, risky firearm practices, such as allowing guns in schools or having “Stand Your Ground” laws, result in point deductions, while safer firearm regulations, such as limiting bulk firearm purchases, earn points. Once the points for each state are summed, Giffords divides the point ranges into hierarchical safety grades. The “F” rank contains by far the most states, while each of the other ranks contains a roughly similar number. To perform the analysis, we merged the gun safety rankings with the police killings dataset. Some grades carried a plus or minus modifier, but too few to justify treating them as separate categories, so we removed the modifiers and kept only whole letter grades.
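
The grade-cleaning step described above can be sketched in a few lines. This is a minimal illustration, not the code used in the analysis: the function name `whole_letter_grade` and the example states and grades are hypothetical placeholders, not values taken from the Giffords scorecard.

```python
VALID_GRADES = {"A", "B", "C", "D", "F"}


def whole_letter_grade(grade: str) -> str:
    """Collapse a grade such as 'A-' or 'C+' to its whole letter."""
    cleaned = grade.strip().upper().rstrip("+-")
    if cleaned not in VALID_GRADES:
        raise ValueError(f"unexpected grade: {grade!r}")
    return cleaned


# Hypothetical raw grades, for illustration only.
raw_grades = {"State 1": "A-", "State 2": "B+", "State 3": "C", "State 4": "F"}

# Every state ends up in one of the five whole-letter categories.
clean_grades = {state: whole_letter_grade(g) for state, g in raw_grades.items()}
```

Raising an error on an unrecognized value, rather than passing it through, makes a malformed grade visible before the rankings are merged with the police killings data.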

When looking at occurrences of fatal police confrontations we found it necessary to standardize the instances of police killings across states. To do this, we created a metric for the number of police killings per 100,000 people for each state. In order to create this metric we required a dataset that contained each state’s population in 2015. We were able to procure this data from the website data.world. This dataset encompasses each state’s population from 2010-2016 as taken from the US Census Bureau. We created the metric for police killings per 100,000 people by dividing the number of fatal police confrontations in each state by the relevant population in 2015 for that state and then multiplying by 100,000. By creating this measure, we were able to account for the population differences between states, allowing us to analyze police killings through a “per-capita” metric.
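
As a sketch of the per-capita metric, the same formula can be applied to the grade-level kill counts and 2015 populations reported in Table Five. The names `grade_totals` and `rate_per_100k` are our own for this illustration; the analysis itself applied the formula state by state.

```python
# Kill counts and 2015 populations by gun law grade, copied from Table Five.
grade_totals = {
    "A": (115, 84_040_497),
    "B": (28, 22_368_570),
    "C": (44, 40_787_504),
    "D": (40, 37_418_265),
    "F": (237, 132_314_956),
}


def rate_per_100k(kills: int, population: int) -> float:
    """Police killings per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return kills / population * 100_000


rates = {grade: rate_per_100k(k, pop) for grade, (k, pop) in grade_totals.items()}

for grade, rate in rates.items():
    print(f"{grade}: {rate:.4f} killings per 100,000 people")
```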

Results

Question One:

Using census-tract-level variables, we first fit a simple base linear model with population as the sole predictor of median household income. Models two, three, and four each added a regressor: respectively, the tract’s poverty rate, the age of the victim, and the proportion of tract residents over the age of 25 who attended college. The fifth model added the proportion of White residents in the tract, the race of the victim, and the region of the country, along with an interaction between race and region to control for broader socioeconomic characteristics. Because most of the variables in the first five models were not statistically significant, we used stepwise regression to select significant variables for a sixth model. The procedure requires two models: a full model containing every variable we deemed suitable for regression, and an empty model with no predictors. For the full model, we excluded variables that carried superfluous or redundant information, such as county and tract ID numbers. To stay consistent with census-level data, we also removed categorical variables with too many unique values, such as the law enforcement agency involved in the killing, the city, and the state. After this cleaning, we passed the full and empty models to R’s “step” function and supplied an estimate of the mean squared error (MSE) as its scale. From there, R added and removed variables until no further change improved its selection criterion. To this stepwise model we added a quadratic term for the proportion who attended college to improve the predictive power of the final model. The final model is summarized in the following equation:

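$\widehat{MedianIncome}_i = \beta_0 + \beta_1 PovertyRate_i + \beta_2 Population_i + \beta_3 ProportionWhite_i + \beta_4 ProportionAttendedCollege_i + \beta_5 ProportionAttendedCollege_i^2 + \sum_j \gamma_j Region_{ij}$

Here $i$ indexes census tracts and $Region_{ij}$ is an indicator that tract $i$ lies in region $j$, with the Midwest as the reference category.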
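
To make the equation concrete, the sketch below evaluates it with the coefficient estimates reported in Table Four. The function and variable names are hypothetical. We assume, based on the scales in Table One, that the poverty rate and proportion White are entered in percentage points (0 to 100) and the college proportion as a fraction (0 to 1).

```python
# Coefficient estimates copied from Table Four.
COEF = {
    "intercept": 6.184182e+04,
    "poverty_rate": -9.815092e+02,
    "population": 8.658478e-01,
    "proportion_white": -6.523843e+01,
    "college": -2.505136e+04,
    "college_sq": 1.152724e+05,
}

# Region effects relative to the Midwest reference category.
REGION_EFFECT = {
    "Midwest": 0.0,
    "Northeast": 5.953780e+03,
    "South": -4.132176e+00,
    "West": 3.773242e+03,
}


def predict_median_income(poverty_rate_pct, population, white_pct,
                          college_share, region):
    """Predicted median household income (dollars) for one census tract.

    Assumed units: poverty_rate_pct and white_pct in percentage points,
    college_share as a fraction between 0 and 1.
    """
    if region not in REGION_EFFECT:
        raise ValueError(f"unknown region: {region!r}")
    return (
        COEF["intercept"]
        + COEF["poverty_rate"] * poverty_rate_pct
        + COEF["population"] * population
        + COEF["proportion_white"] * white_pct
        + COEF["college"] * college_share
        + COEF["college_sq"] * college_share ** 2
        + REGION_EFFECT[region]
    )


# A tract at the Table One means, placed in the Midwest and in the Northeast.
midwest = predict_median_income(21.11161, 4804.295, 51.91742, 0.22021668, "Midwest")
northeast = predict_median_income(21.11161, 4804.295, 51.91742, 0.22021668, "Northeast")
print(round(midwest), round(northeast))
```

Holding every other input fixed, the two predictions differ by exactly the Northeast coefficient, which is how the region terms should be read.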

Table Three: Measures of Error within each Model

	Bias	MAE	RMSE
Base Model	-1.351333e+01	14623.944	19594.470
Model 2	-4.630080e-11	9078.931	13327.418
Model 3	5.018324e-11	9133.069	13382.642
Model 4	6.585511e-11	8118.101	11018.829
Model 5	6.419875e-11	7739.584	10522.944
Stepwise Model	-5.697986e-11	7277.157	9975.437

Using k-fold cross-validation with 100 folds, we evaluated the error of each model. Table Three reports the results as a set of error statistics for each model. There are several trends to observe. The base model has a bias of about -13.5, while the bias of every later model is on the order of 1e-11, which is effectively zero; differences in sign or size among these later values are too small to be meaningful. The mean absolute error (MAE) and the root mean squared error (RMSE) both decrease between the base model and model five, apart from a slight increase between models two and three. Paying special attention to the RMSE, the typical size of the residuals shrinks as more variables are added to the model. Figure Three shows the same pattern: the distribution of residuals becomes more concentrated around zero with each model iteration. The base model’s residuals are the most widely dispersed, while model five’s are the most tightly clustered. By model five, the MAE has fallen from about $14,624 in the base model to about $7,740. Even with insignificant regressors, the shift towards smaller error magnitudes indicates higher overall predictive accuracy. The stepwise model incorporated six statistically significant terms. In Table Three, its bias is negative, unlike the three preceding models, but it remains negligible in magnitude. The MAE and RMSE also decreased, albeit modestly, to about $7,277 and $9,975 respectively. Returning to Figure Three, there appears to be little visible difference between the error distributions of model five and the stepwise model. It should be noted, however, that there is greater variance in the residuals at the tail end of the stepwise model.
To better visualize these residuals, Figure Four compares model five with the stepwise model and shows that the spread and variance of their residuals differ greatly. Overall, with the greatest number of significant regressors, the stepwise model has the lowest MAE and RMSE, and negligible bias, in predicting median household income. The remaining error can plausibly be attributed to the distribution of income throughout the United States. Income is not normally distributed; as shown in Figure One, its distribution is skewed to the right by several significant outliers.
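
For readers unfamiliar with the statistics in Table Three, the sketch below shows one way to compute bias, MAE, and RMSE, and one way to assign 465 observations to 100 folds. It is an illustration under stated assumptions, not the code behind the table: the function names are hypothetical, and we assume bias is defined as the mean of predicted minus actual values.

```python
import math
import random


def error_summary(actual, predicted):
    """Return (bias, MAE, RMSE) for paired actual and predicted values.

    Assumption: the residual is predicted minus actual, so a negative bias
    means the model under-predicts on average.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    residuals = [p - a for a, p in zip(actual, predicted)]
    n = len(residuals)
    bias = sum(residuals) / n
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    return bias, mae, rmse


def k_fold_indices(n_obs, k, seed=0):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


# With 465 observations and 100 folds, each held-out fold has 4 or 5 rows.
folds = k_fold_indices(465, 100)

# Toy numbers, for illustration only.
bias, mae, rmse = error_summary([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
```

Because RMSE squares each residual before averaging, it penalizes large misses more heavily than MAE, which is why the two columns in Table Three differ by several thousand dollars for every model.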

Figure Three: Residuals Across Models

Figure Four: Residuals in Model 5 and Stepwise

Table Four: Table of Coefficients for Stepwise Model

term	estimate	std.error	statistic	p.value
(Intercept)	6.184182e+04	2.957732e+03	20.9085265	0.0000000
Poverty_Rate	-9.815092e+02	4.772353e+01	-20.5665656	0.0000000
Population	8.658478e-01	2.099317e-01	4.1244268	0.0000442
Proportion_White	-6.523843e+01	1.860826e+01	-3.5058858	0.0005001
Proportion_Attended_College	-2.505136e+04	1.079690e+04	-2.3202358	0.0207686
I(Proportion_Attended_College^2)	1.152724e+05	1.476009e+04	7.8097401	0.0000000
RegionNortheast	5.953780e+03	2.021266e+03	2.9455704	0.0033886
RegionSouth	-4.132176e+00	1.410114e+03	-0.0029304	0.9976632
RegionWest	3.773242e+03	1.481926e+03	2.5461746	0.0112189

Table Four contains the coefficients of our linear income prediction model. At a glance, the most influential predictors of income are a census tract’s poverty rate, the proportion of residents who attended college, and region, particularly the Northeast and West. Holding the other variables constant, each one-percentage-point increase in the poverty rate is associated with a decrease of about $981.51 in the tract’s median household income. This is in line with expectations, since higher poverty rates logically indicate lower incomes. Victims of police killings in areas with higher poverty rates, it would appear, are more likely to have lower incomes and to be exposed to other, unmeasured effects of poverty. Census tracts with higher proportions of college attendance tend to be richer once the quadratic term is taken into account. Combining the linear and squared terms, the estimated contribution of college attendance reaches its minimum at a proportion of about 10.9% and is negative below about 21.7%. Tracts where more than about 21.7% of residents attended college therefore have higher predicted median incomes than tracts with no college attendance, all else equal. Beyond the minimum, predicted median income grows at an increasing rate as the proportion rises. Overall, areas with substantial investments in human capital and skills are more likely to have higher incomes, which is in line with the break-even point implied by the quadratic term of the stepwise model. Interestingly, some regions of the country are richer than others. Using the Midwest as the reference category, being located in the Northeast is associated with a median household income about $5,953.78 higher, and being located in the West with one about $3,773.24 higher; the coefficient for the South is not statistically distinguishable from zero. While the variables denoting geographic region are statistically significant, we recognize that regions are large; varying taxes, population levels, and industrial capabilities may disproportionately affect these median income differences. Given statistical significance, however, we maintain that some areas of the country tend to be poorer than others. When predicting the median income of the tracts where police killings occur, systemic differences in access to resources and social mobility are captured by these three variables.
Education, poverty, and geography are all intertwined, and the data on victims of police killings show that their influence should not be underestimated when predicting income. With income serving as a standardized metric, the data reveal distinct demographic differences that point to internal divisions within America and the way that it is policed.
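
The behavior of the quadratic college term can be checked directly from the two college coefficients in Table Four. The sketch below is our own worked arithmetic, with hypothetical names, and assumes the college proportion is measured as a fraction between 0 and 1.

```python
# College coefficients copied from Table Four.
B_LINEAR = -2.505136e+04   # Proportion_Attended_College
B_SQUARED = 1.152724e+05   # I(Proportion_Attended_College^2)


def college_contribution(share: float) -> float:
    """Dollar contribution of the college terms to predicted median income."""
    return B_LINEAR * share + B_SQUARED * share ** 2


# Vertex of the parabola: the share at which the contribution is lowest.
turning_point = -B_LINEAR / (2 * B_SQUARED)

# Non-zero root: the share at which the contribution returns to zero.
break_even = -B_LINEAR / B_SQUARED

print(f"turning point: {turning_point:.4f}")
print(f"break-even:    {break_even:.4f}")
print(f"contribution at the turning point: {college_contribution(turning_point):.2f}")
```

The break-even share is exactly twice the turning point, which follows from the symmetry of a parabola that passes through the origin.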

Question Two:

During our initial analyses, we investigated the potential relationship between categorical variables, such as race, and the frequency of fatal police confrontations across the country. However, these variables did not prove to be significant predictors of police killings, leading us to consider other factors that differ between states. One such factor is the strength of firearm regulations, which led us to look for an effective way to rank states by firearm safety. To investigate the relationship between gun laws and police killings, we used the rankings from Giffords together with the metric for police killings per 100,000 people. When visualizing the rate of police killings by gun law rank, we found that states with the worst firearm safety rank, “F”, had the highest rate of fatal police confrontations; as seen in Table Five, there were about 0.18 deaths per 100,000 people. Surprisingly, states with the best firearm safety rank, “A”, had the second-highest rate, with about 0.14 deaths per 100,000 people. However, as seen in Table Five, this may be related to the difference in populations between the ranks, as the states ranked “A” and “F” have considerably larger combined populations than any other rank. Additionally, the rate of police killings for “F” states may be skewed because 23 states fall in this rank, while the rank with the next-highest number of states has only 7.

Figure Five: Police Killings per 100,000 People by Grade of Gun Laws

Table Five: Police Killing Rates by Gun Law Grades

2015 Grade	Number of States	Kill Count	Population	Kill Rate per 100,000 People
A	6	115	84040497	0.1368388
B	4	28	22368570	0.1251756
C	6	44	40787504	0.1078762
D	7	40	37418265	0.1068997
F	23	237	132314956	0.1791181

When looking at the gun safety ranks, it is important to note the population discrepancies between them and how these may play a role in enacting more robust firearm regulations. While the “A” rank has only six states, their aggregate population is about 84 million; in comparison, the “D” rank has seven states but less than half that population, at about 37 million. This suggests that firearm regulation is more of an issue in states with larger populations, as these states may experience more incidents of violence simply because they have more people. The Giffords rankings are consistent with this theory: California has the largest state population in the country and is ranked number 1 for the strength and safety of its gun laws. New York is also among the “A” rank states and has the fourth-largest population in the country. The presence of very high-population states such as California and New York in the “A” rank may help explain why the rate of fatal police confrontations for this category is so high. In these high-population states, there is more of an incentive to regulate firearms in an attempt to deal with the greater volume of violence that comes with more people. As such, firearm safety is a topic of much discussion among policymakers and citizens alike in these areas. Conversely, firearm regulation may not be an issue of much focus amongst policymakers in states with smaller populations. This can be observed in the Giffords rankings: the five states with the lowest populations in 2015, namely Wyoming, Vermont, Alaska, North Dakota, and South Dakota, are all categorized as having the worst firearm regulations, with a grade of “F”.

Figure Six: Police Killings in States with Ranked Gun Laws

The effect of this positive relationship between population and firearm regulation is that firearms are more easily acquired in states with low populations than in states with larger populations. This may be one factor in explaining why police killings occur at higher rates in lower-population states, as someone in a police confrontation is more likely to be armed in these areas than in others. This is pertinent because we found in our initial exploratory data analysis (EDA) that the majority of victims of police killings were armed. An armed individual is more likely to be seen as a threat by police officers, which may help explain why police killings occur at a higher rate in the “F” rank states. There are exceptions to this relationship, such as Texas, which has the second-highest population but is ranked in the “F” category. This could be partly attributed to differences in political ideology across the United States, as large states such as New York and California are considered more progressive on social reform than other states. To visualize the relationship between the police killing data and gun law ranks, we created a heat map (Figure Six) that displays both concurrently. The shade of each state represents its gun safety rank, while each point represents a fatal police confrontation. The majority of police killings occur in higher-population states such as California, Texas, and Florida, while sparsely populated states such as Idaho, Montana, and Wyoming have the fewest.

Conclusion

Using data from FiveThirtyEight and The Guardian, our group set out to find a model that accurately predicts the median household income of the census tracts in which police killings occurred, and to assess the relationship between gun laws and the number of police killings by state. Through cross-validation and stepwise regression, we found that education, the poverty rate, and region significantly influence median household income. A high proportion of college-educated residents is an indicator of a wealthy census tract, while a higher poverty rate indicates a low-income census tract. The Northeast and West regions of the country tend to have larger median incomes. Additionally, gun control laws display a relationship with police violence, but differences in population limit how much we can conclude about that relationship.

For all intents and purposes, income is used as a universal metric for valuing human utility. In the cultural zeitgeist, it is common to measure an individual’s contributions to society, i.e. to judge whether a person is “lazy” or “productive”, from their income level. Attaching this prediction to a census tract, therefore, characterizes an area with numerous socioeconomic connotations. For a vulnerable group like police shooting victims, being able to efficiently track and predict the median income of a census tract allows us to both recognize and ameliorate the characteristics that contribute to police violence. Although victims of police killings are a subset of America’s general population, these victims represent a very tangible failure on the part of the government to regulate or provide for the general welfare. While income cannot be predicted accurately in every instance, the ability to model its distribution provides insight into the shared realities of vulnerable populations across America. Future studies could improve this income prediction model in several ways. First, additional observations can be added; this is a very small dataset, and incorporating several years of data could help refine the model’s predictive accuracy. Second, further testing of the geographic factors that affect income would provide greater insight into the true impact of census region on median income. Finally, incorporating different census data, such as data on infrastructure, transportation, and employment, would provide additional regressors for the model. Datasets such as the Census Bureau’s American Community Survey would provide easily accessible data to model income over time and region.

Combining data from the Giffords Law Center and the US Census, we aimed to analyze the relationship between police killing rates and gun laws. After visualizing the data through various techniques, we found that states with the weakest gun laws (“F”) had the highest rate of fatal police confrontations, although the relationship is not strictly monotonic, since states with the strongest gun laws (“A”) had the second-highest rate. This information could be pertinent for policymakers seeking to increase the overall safety of our country and lower the number of shooting victims. Our methods could be improved by performing significance tests, such as testing whether the higher rate of police killings in the states with the worst gun safety rankings is statistically significant or merely an artifact of our specific dataset. Additionally, our analyses could be improved by finding a more effective way to rank states based on the robustness of their firearm regulations.
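
One possible form of such a significance test is sketched below. It was not part of our analysis. It compares the “F” states against all other graded states using the totals in Table Five. The sketch assumes that killings are independent events occurring at a constant rate within each group, and it uses a normal approximation to the binomial; both are simplifications, and the function name is hypothetical.

```python
import math


def rate_comparison_z(kills_a, pop_a, kills_b, pop_b):
    """Two-sided z-test that groups A and B share the same per-capita rate.

    Under the null hypothesis, and conditional on the total number of
    killings, the count in group A is binomial with success probability
    equal to group A's share of the combined population.
    """
    total_kills = kills_a + kills_b
    pop_share = pop_a / (pop_a + pop_b)
    expected = total_kills * pop_share
    std_dev = math.sqrt(total_kills * pop_share * (1 - pop_share))
    z = (kills_a - expected) / std_dev
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Totals from Table Five: "F" states versus grades A through D combined.
f_kills, f_pop = 237, 132_314_956
other_kills = 115 + 28 + 44 + 40
other_pop = 84_040_497 + 22_368_570 + 40_787_504 + 37_418_265

z_stat, p_val = rate_comparison_z(f_kills, f_pop, other_kills, other_pop)
print(f"z = {z_stat:.2f}, two-sided p = {p_val:.2g}")
```

A test of this kind ignores state-level clustering and the population effects discussed above, so its result should be read as a first check rather than a conclusion.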

More recent data from a nonprofit called Mapping Police Violence is consistent with our findings and details the severity of the police violence epidemic in our country. In addition to urging policy reforms, this nonprofit addresses potential solutions that legislators can pursue on the policy side of police violence. Along with more recent statistics, further studies based on this data could prove to be a significant part of the national discussion about gun violence prevention, as well as the larger discourse about structural barriers to socioeconomic equality.

Chris Jakuc, Harrison Cho, Avinash Gandhi, Mohammad Aamin, and KC Kurz

August, 2019
