Exploring Project Datasets

Project Proposal

Author

inside_investigators

library(tidyverse)

Data 1: NYC PD Arrest Data

Introduction and data

  • Source: New York City Police Department (NYC Open Data)

  • This is a breakdown of every arrest made by the NYPD in New York City during the current year.

  • The data has been manually extracted every quarter and reviewed by the Office of Management Analysis and Planning since June 5th, 2018.

  • Each row represents one arrest and includes the type of crime and the approximate location (precinct and coordinates) where the arrest occurred. Additionally, each perpetrator's race, sex, and age group are recorded. There are 226,872 rows and 19 columns.

  • This data is collected by police officers and then uploaded into the NYPD database, which is then scraped and uploaded to NYC OpenData. The officers reporting an arrest could be biased and could make assumptions about potential perpetrators, so some observations may contain information that is inaccurate or biased.

Research question

  • Research question: What demographics (race, age, sex) are associated with different types of crimes within the New York City Police Department’s jurisdiction? Where in the jurisdiction are crimes most often committed, and does location or date play a role in what type of crime is committed?

  • This question is important because answering it shows how crime is distributed across New York City, and may provide insights into why crimes are committed and how best to implement policy to prevent them.

  • The research topic focuses on the demographics of perpetrators in the NYC area since 2018 and on whether the location and/or time of an arrest is related to the type of crime committed. The findings could inform how law enforcement officials work to prevent these crimes and how citizens avoid them.

  • Hypotheses: Arrests by the NYPD are more prevalent in certain locations and during certain periods of the year. Additionally, we believe that certain demographics are associated with a higher likelihood of arrest for certain crimes, but that this association likely stems from heavier policing of neighborhoods where those demographics are more prevalent.

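  • The data set contains both categorical variables (race, type of crime, age group, etc.) and quantitative variables (latitude, longitude, etc.). The arrest date is read in as text and needs to be parsed as a date before any analysis over time.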

Glimpse of data

# Inputting arrests in NYC dataset
arrests_nyc <- read_csv("data/NYPD_Arrest_Data__Year_to_Date_.csv")
Rows: 226872 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (10): ARREST_DATE, PD_DESC, OFNS_DESC, LAW_CODE, LAW_CAT_CD, ARREST_BORO...
dbl  (9): ARREST_KEY, PD_CD, KY_CD, ARREST_PRECINCT, JURISDICTION_CODE, X_CO...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(arrests_nyc)
Rows: 226,872
Columns: 19
$ ARREST_KEY                 <dbl> 261265483, 261271301, 261336449, 261328047,…
$ ARREST_DATE                <chr> "01/03/2023", "01/03/2023", "01/04/2023", "…
$ PD_CD                      <dbl> 397, 105, 397, 105, 244, 109, 263, 109, 263…
$ PD_DESC                    <chr> "ROBBERY,OPEN AREA UNCLASSIFIED", "STRANGUL…
$ KY_CD                      <dbl> 105, 106, 105, 106, 107, 106, 114, 106, 114…
$ OFNS_DESC                  <chr> "ROBBERY", "FELONY ASSAULT", "ROBBERY", "FE…
$ LAW_CODE                   <chr> "PL 1600500", "PL 1211200", "PL 1601001", "…
$ LAW_CAT_CD                 <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F"…
$ ARREST_BORO                <chr> "B", "S", "K", "Q", "B", "K", "B", "K", "K"…
$ ARREST_PRECINCT            <dbl> 49, 120, 61, 114, 44, 76, 49, 90, 63, 34, 4…
$ JURISDICTION_CODE          <dbl> 0, 0, 0, 0, 0, 0, 71, 0, 0, 0, 0, 0, 0, 0, …
$ AGE_GROUP                  <chr> "18-24", "25-44", "<18", "18-24", "25-44", …
$ PERP_SEX                   <chr> "M", "M", "M", "M", "F", "M", "M", "M", "M"…
$ PERP_RACE                  <chr> "BLACK", "WHITE", "BLACK", "BLACK", "BLACK"…
$ X_COORD_CD                 <dbl> 1027430, 962808, 995118, 1007694, 1007174, …
$ Y_COORD_CD                 <dbl> 251104, 174275, 155708, 219656, 239542, 188…
$ Latitude                   <dbl> 40.85579, 40.64500, 40.59405, 40.76955, 40.…
$ Longitude                  <dbl> -73.84391, -74.07726, -73.96087, -73.91536,…
$ `New Georeferenced Column` <chr> "POINT (-73.843908 40.855793)", "POINT (-74…
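
A minimal sketch of a first pass at the research question, assuming the column names shown in the glimpse above. The rows in arrests_toy are made up for illustration and are not drawn from the real file; the same steps should carry over to arrests_nyc. The helper names (arrests_clean, arrests_by_month, offense_by_race) are hypothetical.

```r
library(dplyr)

# Toy rows that mimic four columns of arrests_nyc.
# The values are illustrative only and are NOT real arrest records.
arrests_toy <- tibble(
  ARREST_DATE = c("01/03/2023", "01/03/2023", "01/04/2023", "02/10/2023"),
  OFNS_DESC   = c("ROBBERY", "FELONY ASSAULT", "ROBBERY", "ROBBERY"),
  PERP_RACE   = c("BLACK", "WHITE", "BLACK", "WHITE"),
  AGE_GROUP   = c("18-24", "25-44", "<18", "25-44")
)

# ARREST_DATE is stored as text (month/day/year), so parse it first
arrests_clean <- arrests_toy |>
  mutate(
    arrest_date  = as.Date(ARREST_DATE, format = "%m/%d/%Y"),
    arrest_month = as.integer(format(arrest_date, "%m"))
  )

# Number of arrests per calendar month
arrests_by_month <- arrests_clean |>
  count(arrest_month, name = "n_arrests")

# Within each offense type, the share of arrests by race
offense_by_race <- arrests_clean |>
  count(OFNS_DESC, PERP_RACE, name = "n_arrests") |>
  group_by(OFNS_DESC) |>
  mutate(share = n_arrests / sum(n_arrests)) |>
  ungroup()
```

Computing shares within each offense type, rather than raw counts, keeps the most common offenses from dominating the comparison across demographic groups.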

Data 2: LAPD Prostitution Data

Introduction and data

  • Source: Los Angeles Open Data, Los Angeles Police Department

  • The data was collected by the Los Angeles Police Department; the dataset owner is listed as “kyledrives”. The dataset is transcribed from original arrest reports that are typed on paper.

  • About: This dataset reports prostitution arrest incidents in the city of Los Angeles from 2010 forward. The dataset includes columns for report ID, arrest date, time, area ID, area name, reporting district, age, sex code, descent code, charge group code, charge group description, arrest type code, charge, charge description, address, and cross street. There are 1,749 rows in the dataset, each indicating a prostitution arrest in the city of LA. 

  • Ethical considerations: The data is transcribed from original paper arrest reports, so transcription errors are possible. Officers reporting a prostitution arrest could also hold biases and report details of the incident incorrectly.

Research question

  • How do prostitution arrest rates vary across age and race/ethnicity in the city of Los Angeles? Are prostitution arrests more likely to occur at certain cross streets at certain times? 

  • The research question is important because it provides insight into prostitution arrests in the city of Los Angeles and into who is most likely to be arrested, which in turn allows resources to be better allocated to these groups. Further, by looking at arrests by demographic, one can better understand potential patterns and disparities in law enforcement practices. Researching prostitution arrests by time and location can also help law enforcement agencies identify hotspots, allocate resources more effectively, and devise strategies to enhance public safety and improve the well-being of residents and businesses in those areas.

  • This research topic aims to look at prostitution arrest rates in the city of Los Angeles from 2010 onward by demographic, specifically by age and race. We hypothesize that certain demographics are more likely to be arrested for prostitution. We also aim to look at how prostitution arrests vary by location and time. We hypothesize that prostitution arrests are more likely at certain cross streets after dark and late into the night.

  • The variables we are researching include both categorical and quantitative variables. Categorical variables include the race of those arrested for prostitution and the cross street location, while quantitative variables include age and time of arrest. 

Glimpse of data

# Inputting prostitution in LA dataset
prostitution <- read_csv("data/Prostitution_along_N_Western_Ave_20240318.csv")
Rows: 1749 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): Arrest Date, Time, Area ID, Area Name, Reporting District, Sex Cod...
dbl  (3): Report ID, Age, Charge Group Code

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(prostitution)
Rows: 1,749
Columns: 16
$ `Report ID`                <dbl> 120621756, 120627186, 120627728, 122020103,…
$ `Arrest Date`              <chr> "08/24/2012", "11/08/2012", "11/14/2012", "…
$ Time                       <chr> "0045", "0450", "0500", "2250", "0635", "06…
$ `Area ID`                  <chr> "06", "06", "06", "20", "06", "06", "06", "…
$ `Area Name`                <chr> "Hollywood", "Hollywood", "Hollywood", "Oly…
$ `Reporting District`       <chr> "0678", "0678", "0657", "2063", "0657", "06…
$ Age                        <dbl> 20, 26, 32, 18, 29, 43, 40, 24, 27, 62, 52,…
$ `Sex Code`                 <chr> "F", "M", "M", "F", "M", "M", "F", "F", "F"…
$ `Descent Code`             <chr> "H", "B", "H", "O", "H", "H", "B", "B", "B"…
$ `Charge Group Code`        <dbl> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,…
$ `Charge Group Description` <chr> "Prostitution/Allied", "Prostitution/Allied…
$ `Arrest Type Code`         <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M"…
$ Charge                     <chr> "647(B)PC", "653.22(A)PC", "653.22(A)PC", "…
$ `Charge Description`       <chr> "PROSTITUTION", "LOITER:INTENT:PROSTITUTION…
$ Address                    <chr> "MARATHON", "SIERRA VISTA", "SIERRA VISTA",…
$ `Cross Street`             <chr> "WESTERN", "WESTERN                      AV…
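
A minimal sketch of how the time and cross street hypothesis could be explored, assuming the Time and Cross Street columns shown in the glimpse above. The rows in prostitution_toy are made up for illustration. The definition of "night" (8 p.m. to 6 a.m.) is our assumption, not something stated in the data, and it treats Time as a four-character 24-hour string, which is how the glimpsed values appear.

```r
library(dplyr)

# Toy rows that mimic two columns of the prostitution data.
# The values are illustrative only and are NOT real arrest records.
prostitution_toy <- tibble(
  Time           = c("0045", "0450", "2250", "1330"),
  `Cross Street` = c("WESTERN", "WESTERN     AV", "SUNSET", "WESTERN  AV")
)

prostitution_clean <- prostitution_toy |>
  mutate(
    # Collapse the padded whitespace seen in the Cross Street column
    cross_street = gsub("\\s+", " ", trimws(`Cross Street`)),
    # Time is text, so the hour is the first two characters
    arrest_hour  = as.integer(substr(Time, 1, 2)),
    # ASSUMPTION: "night" means 20:00 through 05:59
    at_night     = arrest_hour >= 20 | arrest_hour < 6
  )

# Arrest count and share of night arrests for each cross street
night_by_street <- prostitution_clean |>
  group_by(cross_street) |>
  summarise(
    n_arrests      = n(),
    share_at_night = mean(at_night),
    .groups        = "drop"
  )
```

Note that whitespace cleanup alone leaves "WESTERN" and "WESTERN AV" as separate categories, so street names may need further harmonizing before hotspots are compared.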

Data 3: Country-Based Emissions Data

Introduction and data

  • Source: CORGIS Dataset project, authored by Dr. Dennis Kafura (Virginia Polytechnic Institute and State University)

  • About: This dataset was originally sourced from the Emissions Database for Global Atmospheric Research (EDGAR) and compiled into its current form by the following authors, as noted in the CORGIS database: Austin Cory Bart, Dennis Kafura, Clifford A. Shaffer, Javier Tibau, Luke Gusukuma, and Eli Tilevich. The last update was 6/24/2019.

  • Observations: Each observation represents one country's emissions data for one year, with coverage running from the 1970s to at most 2015 depending on the country. This is broken down into the following columns: Country, Year, emissions by type (CO2, N2O, CH4), emissions by sector (Power Industry, Buildings, Transport, Other Industry, Other sectors), Ratio per GDP, and Ratio per Capita. There are 12 columns and 8,385 observations.

  • Ethical Concerns: One concern is which countries may be excluded from the data, since the dataset includes European Member States or parties under the United Nations Framework Convention on Climate Change (UNFCCC). Emissions sources outside the listed sectors are also not specified, so the data may under-represent countries whose largest industries are not captured by the columns.

Research question

  • Research Question: Which emissions sectors produce the most emissions per capita based on geographical region (NE, NW, SE, SW)? Which countries produce the most emissions from the emissions sector related to infrastructure development, and how does this relate to GDP?

  • Rationale: This question is important to consider because many countries are expanding development-focused projects to revitalize their infrastructure. This dataset can help visualize how specific sectors may contribute to increased greenhouse gas emissions and to GDP. This information can help the target audience (policy makers, architects, engineers, etc.) make more emissions-conscious decisions when planning future infrastructure. If the research question changes, the data can also be used to examine how anthropogenic behaviors influenced emissions during specific years.

  • Hypothesis: There may be a relationship between countries with higher GDPs and emissions release (kilotons of CO2) in specific industries like infrastructure and transportation. 

  • Variable Types: The dataset has one categorical variable (country); the rest are quantitative (year, quantities for the tracked gases, emissions by sector, and ratios per GDP and per capita).

Glimpse of data

# Inputting emissions by country dataset
emissions <- read_csv("data/emissions.csv")
Rows: 8385 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Country
dbl (11): Year, Emissions.Type.CO2, Emissions.Type.N2O, Emissions.Type.CH4, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(emissions)
Rows: 8,385
Columns: 12
$ Country                           <chr> "Afghanistan", "Afghanistan", "Afgha…
$ Year                              <dbl> 1970, 1971, 1972, 1973, 1974, 1975, …
$ Emissions.Type.CO2                <dbl> 2670, 2630, 2180, 2310, 2520, 2720, …
$ Emissions.Type.N2O                <dbl> 1820, 1850, 1810, 1830, 2190, 1930, …
$ Emissions.Type.CH4                <dbl> 12800, 12900, 11900, 11600, 12800, 1…
$ `Emissions.Sector.Power Industry` <dbl> 0.06, 0.06, 0.12, 0.17, 0.21, 0.21, …
$ Emissions.Sector.Buildings        <dbl> 0.58, 0.58, 0.46, 0.57, 0.77, 0.59, …
$ Emissions.Sector.Transport        <dbl> 0.23, 0.23, 0.27, 0.24, 0.24, 0.29, …
$ `Emissions.Sector.Other Industry` <dbl> 0.07, 0.07, 0.05, 0.02, 0.03, 0.02, …
$ `Emissions.Sector.Other sectors`  <dbl> 0.53, 0.53, 0.61, 0.47, 0.65, 0.58, …
$ `Ratio.Per GDP`                   <dbl> 1.557705, 1.517670, 1.357590, 1.3079…
$ `Ratio.Per Capita`                <dbl> 0.000000, 0.000000, 0.000000, 0.0000…
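
A minimal sketch of how the sector comparison could be set up, assuming the Emissions.Sector.* column names shown in the glimpse above. The rows in emissions_toy, including the country names, are made up for illustration. Reshaping the sector columns into one long column is our suggested approach rather than a fixed plan, and the helper names (sector_long, top_sector) are hypothetical.

```r
library(dplyr)
library(tidyr)

# Toy rows that mimic part of the emissions data.
# Countries and values are illustrative only, NOT real measurements.
emissions_toy <- tibble(
  Country = c("Country A", "Country A", "Country B", "Country B"),
  Year    = c(1970, 1971, 1970, 1971),
  `Emissions.Sector.Power Industry` = c(0.1, 0.2, 2.0, 2.2),
  Emissions.Sector.Buildings        = c(0.6, 0.6, 1.0, 1.1),
  Emissions.Sector.Transport        = c(0.2, 0.3, 3.0, 3.1)
)

# One row per country, year, and sector
sector_long <- emissions_toy |>
  pivot_longer(
    cols         = starts_with("Emissions.Sector."),
    names_to     = "sector",
    names_prefix = "Emissions\\.Sector\\.",
    values_to    = "emissions"
  )

# Sector with the highest mean emissions in each country
top_sector <- sector_long |>
  group_by(Country, sector) |>
  summarise(mean_emissions = mean(emissions), .groups = "drop") |>
  group_by(Country) |>
  slice_max(mean_emissions, n = 1) |>
  ungroup()
```

The long format also makes it straightforward to join in a region label per country later, since the regions (NE, NW, SE, SW) are not columns in the dataset and would need to be assigned separately.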