NYC Crime Data Analyses

Perpetrator Demographics & Crime by Borough

Authors

Devesh Shah, Sierra Stubs, Libby Gough, William Creamer

Introduction

Our project analyzes potential trends among boroughs, types of violent crimes committed, and perpetrator demographics in the New York City Metro Area. In this report, we’ve narrowed our focus to the relationship between perpetrator sex and types of violent crimes, to understand how gender-based crime disparities vary among the boroughs. We believe modeling these relationships could help characterize crime within the boroughs and potentially help prevent future crime. After initially exploring the relationships between these variables, we created a predictive model, which we intend for law enforcement to use in allocating resources and combating crime more efficiently.

Research Question: What is the relationship between borough and perpetrator sex for violent crimes committed throughout all 5 boroughs?

Data Source: https://data.cityofnewyork.us/Public-Safety/NYPD-Arrest-Data-Year-to-Date-/uip8-fykc/about_data

This data is collected by police officers and is then uploaded into the NYPD database. The NYPD Office of Management Analysis and Planning extracts and reviews the data each quarter. This reviewed data is then scraped and uploaded to NYC OpenData.

Installing libraries

The tidyverse, tidymodels, and knitr packages were installed.

Reading & Cleaning Dataset

The data set initially included 64 unique types of crime and ~227,000 observations. We were most interested in trends across violent crimes, because this type of crime is heavily reported in the media and widely discussed among the general population.

Thus, we filtered for the 7 different types of violent crime. Given the similar nature of “sex crimes” and “felony sex crimes”, we combined those two categories into one, giving a final 6 unique crime categories. We then mutated the data set so that each borough code was replaced by the full borough name, which makes the data more accessible to readers unfamiliar with the codes. It also simplifies later steps in the project, since we do not have to transform the variable for each visualization. Next, we selected the variables most relevant to our research question: arrest date, violent crime, borough, perpetrator demographics (age, sex, and race), coordinate location, and level of offense. Finally, we wrote this new, clean data set to the data folder as nypd_clean.
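The cleaning steps above can be sketched as follows. This is a minimal illustration in Python rather than the R code the report was built with. The raw column name `ofns_desc`, the single-letter borough codes, the sample rows, and the offense labels are all assumptions made for the example; the report does not list the seven categories it kept, so `VIOLENT` is only a placeholder subset.

```python
# Hypothetical raw rows. The real export has ~227,000 rows and more columns.
raw_rows = [
    {"ofns_desc": "SEX CRIMES", "arrest_boro": "K", "perp_sex": "M"},
    {"ofns_desc": "FELONY SEX CRIMES", "arrest_boro": "B", "perp_sex": "M"},
    {"ofns_desc": "ROBBERY", "arrest_boro": "Q", "perp_sex": "F"},
    {"ofns_desc": "PETIT LARCENY", "arrest_boro": "M", "perp_sex": "F"},
]

# Assumed borough codes, mapped to the full names used in the report.
BOROUGH_NAMES = {
    "B": "Bronx",
    "K": "Brooklyn",
    "M": "Manhattan",
    "Q": "Queens",
    "S": "Staten Island",
}

# Placeholder subset of the violent-crime categories (assumption).
VIOLENT = {"SEX CRIMES", "FELONY SEX CRIMES", "ROBBERY"}

# The two sex-crime categories are merged into one, as described above.
MERGE = {"FELONY SEX CRIMES": "SEX CRIMES"}


def clean(rows):
    """Keep violent crimes, merge categories, and expand borough codes."""
    cleaned = []
    for row in rows:
        offense = row["ofns_desc"]
        if offense not in VIOLENT:
            continue  # drop non-violent offenses
        cleaned.append(
            {
                "viol_crime": MERGE.get(offense, offense),
                "arrest_boro": BOROUGH_NAMES[row["arrest_boro"]],
                "perp_sex": row["perp_sex"],
            }
        )
    return cleaned


nypd_clean = clean(raw_rows)
for row in nypd_clean:
    print(row)
```

Recoding the borough once at this stage, rather than inside each plot, is the design choice the paragraph above describes.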

Variable Key

Variable	Definition
arrest_date	Exact date of arrest for the reported event
viol_crime	Type of violent crime associated with the arrest
arrest_boro	Borough of arrest
age_group	Perpetrator’s age within a category
perp_sex	Perpetrator’s sex description
perp_race	Perpetrator’s race description
latitude	Latitude coordinate for Global Coordinate System for arrest
longitude	Longitude coordinate for Global Coordinate System for arrest
law_cat_cd	Level of offense: felony (F), misdemeanor (M), violation (V)

Exploratory Data Analysis

We completed an initial exploratory data analysis to see how the proportions of the six violent crimes vary by borough and by sex of perpetrator. We created one stacked bar graph showing the proportions of violent crime by borough. We created a second stacked bar graph showing the proportions of violent crime by perpetrator sex (male or female), faceted by borough, to see how the pattern in the first visualization differs by sex.

After this, we looked at violent crime by perpetrator age group to understand how perpetrator demographics other than sex relate to the type of violent crime committed. We created a stacked bar graph of the proportion of violent crime by age group, faceted by borough, to visualize how this relationship varies by location. Finally, we repeated this process for a third perpetrator demographic, race, looking at violent crime proportions by race both for the entire NYC metro area and by borough.

This analysis gave us a better understanding of how the violent crimes committed vary with both location and perpetrator demographics, which ultimately aided in the formulation of our research question and models.

Methodology

Our EDA led us to focus our inquiry on gender-based crime disparities, as we felt this topic receives less media attention than age and racial demographics. To investigate the relationship between perpetrator sex and the borough of arrest for violent crimes, we fit a logistic regression model. We filtered the dataset to exclude cases where the perpetrator sex was undefined, which ensures a binary outcome. Our predictor variable is the borough of arrest, arrest_boro, and our outcome is the perpetrator sex, perp_sex. We hypothesize that the borough of arrest is associated with whether the perpetrator is male or female, which we believe could provide insight into gender-based disparities in crime.

To arrive at the final model, we first calculated the proportion of violent crimes committed by each sex in each borough and plotted these proportions, which show the pattern we expected the logistic model to reflect. After converting the boroughs into factors, we fit the logistic model. The intercept is the log odds of the outcome for the baseline borough, the Bronx, and each remaining coefficient is the difference in log odds between that borough and the Bronx. In summary, the model estimates describe how the likelihood of perpetrator sex varies across boroughs of arrest. Because arrest_boro is the only predictor, the estimates are not adjusted for any other variable. We then used the augment function to obtain the model’s predictions and drew a jitter plot of arrest borough against predicted perpetrator sex to aid interpretation of the results.
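As a sketch, the model described above can be written in the following form, assuming (as the Results section states) that the modeled outcome is the perpetrator being male. The symbols \(p\) and \(\beta_0, \dots, \beta_4\) are notation introduced here for illustration.

\[ \log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 \, \text{Brooklyn} + \beta_2 \, \text{Manhattan} + \beta_3 \, \text{Queens} + \beta_4 \, \text{Staten Island} \]

Here \(p\) is the probability that the perpetrator is male, and each borough name stands for an indicator equal to 1 for arrests in that borough and 0 otherwise. For an arrest in the Bronx all indicators are 0, so the log odds reduce to \(\beta_0\).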

[1] "Bronx"         "Brooklyn"      "Manhattan"     "Queens"       
[5] "Staten Island"

term	estimate	std.error	statistic	p.value
(Intercept)	1.204	0.023	52.006	0.000
arrest_boroBrooklyn	0.037	0.033	1.121	0.262
arrest_boroManhattan	0.254	0.037	6.843	0.000
arrest_boroQueens	0.183	0.035	5.259	0.000
arrest_boroStaten Island	-0.057	0.065	-0.867	0.386
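Because the estimates above are on the log-odds scale, it can help to convert them to odds ratios and probabilities. The following is a minimal sketch in Python (not the report's R code), assuming the modeled outcome is the perpetrator being male, as stated in the Results section. The helper name `inverse_logit` is introduced here for illustration.

```python
import math

# Coefficients copied from the model output table (log-odds scale).
intercept = 1.204  # baseline borough: Bronx
offsets = {
    "Bronx": 0.0,
    "Brooklyn": 0.037,
    "Manhattan": 0.254,
    "Queens": 0.183,
    "Staten Island": -0.057,
}


def inverse_logit(log_odds):
    """Convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Odds ratio versus the Bronx, and predicted probability, per borough.
odds_ratio = {b: math.exp(d) for b, d in offsets.items()}
probability = {b: inverse_logit(intercept + d) for b, d in offsets.items()}

for borough in offsets:
    print(f"{borough:<14} OR={odds_ratio[borough]:.3f}  p={probability[borough]:.3f}")
```

Under that assumption, the intercept corresponds to a predicted probability of roughly 0.77 in the Bronx, and the Manhattan coefficient to an odds ratio of roughly 1.29 relative to the Bronx.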

Inferential Model:

We are also interested in whether the proportion of felonies differs significantly between female and male perpetrators. The setup is straightforward: we compare proportions across two categorical variables, perp_sex and law_cat_cd. This analysis lets us formally assess whether there is a statistical difference in these proportions, which we expect to be male-leaning. To test our hypotheses, we used the arrest dataset, focusing on the level of offense (Felony or Misdemeanor) and the perpetrator’s sex, and calculated the proportion of each offense level among female and among male perpetrators. We then conducted a permutation test to assess the statistical significance of the observed difference. This involved generating a null distribution of the difference in proportions under the assumption that perpetrator sex and level of offense are independent. By comparing the observed statistic to the null distribution, we obtained a p-value.

Hypotheses:

Null Hypothesis: There is no difference in the proportion of felonies committed by female and male perpetrators across the different NYC boroughs.

\[ H_0 : P_{\text{Male}} = P_{\text{Female}} \]

Alternative Hypothesis: There is a difference in the proportion of felonies committed by female and male perpetrators across the different NYC boroughs.

\[ H_A : P_{\text{Male}} \neq P_{\text{Female}} \]

# A tibble: 4 × 4
# Groups:   perp_sex [2]
  perp_sex law_cat_cd      n p_hat
  <chr>    <chr>       <int> <dbl>
1 F        Felony       7271 0.863
2 F        Misdemeanor  1159 0.137
3 M        Felony      27729 0.895
4 M        Misdemeanor  3236 0.105

Response: law_cat_cd (factor)
Explanatory: perp_sex (factor)
# A tibble: 1 × 1
    stat
   <dbl>
1 0.0330
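The proportions and the observed statistic above can be reproduced directly from the counts in the summary table. This is a small Python sketch, not the report's R code. It assumes the statistic is the male felony proportion minus the female felony proportion, which is the ordering that yields a positive 0.0330.

```python
# Counts copied from the summary table above.
counts = {
    "F": {"Felony": 7271, "Misdemeanor": 1159},
    "M": {"Felony": 27729, "Misdemeanor": 3236},
}


def felony_share(sex):
    """Proportion of arrests for the given sex that are felonies."""
    total = sum(counts[sex].values())
    return counts[sex]["Felony"] / total


p_hat_female = felony_share("F")
p_hat_male = felony_share("M")

# Assumed ordering: male minus female.
observed_diff = p_hat_male - p_hat_female

print(f"female: {p_hat_female:.3f}")
print(f"male:   {p_hat_male:.3f}")
print(f"diff:   {observed_diff:.4f}")
```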

Warning: Please be cautious in reporting a p-value of 0. This result is an
approximation based on the number of `reps` chosen in the `generate()` step.
See `?get_p_value()` for more information.

# A tibble: 1 × 1
  p_value
    <dbl>
1       0
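To illustrate why the reported p-value is 0, the sketch below runs a small permutation test in Python using only the counts from the summary table. It is not the report's R code: the number of repetitions (`reps=200`), the seed, and all function names are assumptions chosen for the example. Drawing the female group at random without replacement is equivalent to shuffling the sex labels.

```python
import random

N_FEMALE, N_MALE = 8430, 30965  # group sizes from the summary table
FELONIES = 7271 + 27729         # total felony arrests
TOTAL = N_FEMALE + N_MALE
OBSERVED = 27729 / N_MALE - 7271 / N_FEMALE  # male minus female

# 1 marks a felony arrest, 0 a misdemeanor arrest.
labels = [1] * FELONIES + [0] * (TOTAL - FELONIES)


def permuted_diff(rng):
    """Difference in felony share after randomly reassigning sex labels."""
    female_felonies = sum(rng.sample(labels, N_FEMALE))
    male_felonies = FELONIES - female_felonies
    return male_felonies / N_MALE - female_felonies / N_FEMALE


def permutation_p_value(reps=200, seed=1):
    """Two-sided p-value: share of permuted differences at least as extreme."""
    rng = random.Random(seed)
    null = [permuted_diff(rng) for _ in range(reps)]
    extreme = sum(abs(d) >= abs(OBSERVED) for d in null)
    return extreme / reps, null


p_value, null_diffs = permutation_p_value()

# A p-value of 0 only means "smaller than 1 / reps", as the warning notes.
if p_value == 0:
    print(f"p-value < {1 / len(null_diffs)}")
else:
    print(f"p-value = {p_value}")
```

When no permuted difference is as large as the observed one, the estimate is 0, which is better reported as "less than 1 / reps" than as exactly zero.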

Results

Overall, the combination of EDA, logistic modeling, and inferential statistics revealed that male perpetrators were more prevalent across the different types of violent crimes, except for sex crimes.

In the logistic analysis, we investigated the relationship between perpetrator sex and the borough of arrest for violent crimes in New York City. The model revealed significant associations between arrest borough and the likelihood of the perpetrator being male. Because borough is the only predictor, these associations are not adjusted for other variables. The intercept (1.204) is the baseline log odds of the perpetrator being male in the Bronx. The coefficient for each other borough is the change in log odds of the perpetrator being male relative to the Bronx. Manhattan (0.254) and Queens (0.183) had statistically significant positive coefficients, suggesting higher odds of a male perpetrator than in the Bronx. Brooklyn (p = 0.262) and Staten Island (p = 0.386) did not differ significantly from the Bronx.

Regarding our hypothesis test, our observed statistic was 0.033, and the p-value was reported as 0, which indicates that the actual p-value is extremely small: smaller than the permutation test can resolve with the number of repetitions used. Since this is below the discernibility level of 0.05, we reject the null hypothesis. The data provide sufficient evidence that the proportion of arrests that are felonies differs between male and female perpetrators (0.895 versus 0.863), which is also supported by our earlier visualizations. This affirms a statistically supported gender disparity in the level of offense for violent crimes, which we sought to explore from our research question.

Discussion

We examined the relationships between boroughs, types of violent crimes, and perpetrator demographics in New York City, focusing specifically on perpetrator sex. Our analysis provided insights into the dynamics of violent crimes across the boroughs and the overarching disparities in perpetrator demographics.

The logistic regression model revealed significant associations between the borough of arrest and the likelihood of the perpetrator being male, without adjusting for other variables. Moreover, our hypothesis testing provided additional evidence of gender disparities in violent crimes, particularly felonies. The permutation test showed a significant difference in the proportions of felonies committed by female and male perpetrators, complementing our findings from the logistic regression analysis. While this answered our research question, our logistic model was limited to a single predictor and does not establish causality, as other demographic factors and unobserved components influence both the borough of arrest and the demographics of perpetrators.

Turning to limitations in the data, our dataset reflects arrests made by the NYPD in 2023 and does not account for several potential biases, including underreporting and discrepancies in arrests across the different boroughs. Firstly, there may be a reporting bias: not all crimes are reported to the police, and those that are might not be treated with the same sense of urgency across all locations and groups. Minority communities or low-income neighborhoods may experience higher levels of crime and more interactions with law enforcement, leading to potential biases in the data. Among the different perpetrator demographics, prejudice can lead to different arrest rates, such as through racial profiling. Any model built from this data would inherit these biases, and combating them requires systemic changes to law enforcement practices.

We recognize that, in future applications, predicting violent crime from demographic information can be a significant source of continued bias, especially since the data can be manipulated to support different inquiries about violent crime outcomes. Avenues for future work include multinomial logistic regression models, which we could use to predict the type of violent crime committed based on any demographic characteristic, borough location, or other independent variable within this dataset or another. With the logistic regression model employed, we were limited to looking at binary variables due to the unconditional nature of our dataset for the 2023 year, so the scope of our research question was limited. However, using multinomial logistic regression would expand our prediction capabilities and allow us to create models that the NYPD and other governmental offices could employ to improve law enforcement practices and bias training when considering perpetrator data.

Sources:

Police Department (NYPD). (2018, June 5). NYPD Arrest Data (Year to Date). Cityofnewyork.us. https://data.cityofnewyork.us/Public-Safety/NYPD-Arrest-Data-Year-to-Date-/uip8-fykc/about_data