Potential Data Sets for Analysis

Proposal

Author

Spelling Bees

library(tidyverse)

Birds

Introduction and data

  • Source: This data comes from eBird, a birding website created by the Cornell Lab of Ornithology.

  • Method of Collection: The data was downloaded from the eBird website and then modified. The sightings were recorded from 2010 to 2015 by users of the eBird app who submitted bird sightings within Washington, DC.

  • Description of Observations: There are a total of 12,027 observations of birds on specific dates from 2010 to 2015. There are 10 columns: 3 quantitative variables (observedCount, latitude, longitude), 6 character variables, and 1 date variable. The most notable variables are the longitude and latitude of each sighting, the bird species, and the locality where the bird was seen (park, reserve, parking lot, etc.).

  • Ethical Concerns: There are no major ethical concerns. Although the terms of service of the eBird website prohibit reproduction of the original data, they allow distribution of the data if it is significantly modified and eBird is credited, both of which the curator did.

Research question

  • Question: Is there an association between the species of a bird and the location in which it was spotted (longitude, latitude, and location type)? If so, are certain species more concentrated in some areas than in others?
  • Importance: In an age where climate change is impacting ecosystems and the environment more than ever, understanding how exactly these changes impact organisms is important. By analyzing the distributions of birds among different locations, we can observe whether certain species are clustered in urban or suburban centers, as well as any abnormal behaviors concerning the location of birds that are normally found in other areas. 
  • Description of Research Topic + Hypothesis: This research topic focuses on the distribution of bird sightings across dates and locations within Washington, DC. I hypothesize that individual species will be associated with particular locations, and that more birds overall will be spotted in parks than in other location types.
  • Categorical variables: Species (species), Location Type (derived from locality)
  • Quantitative variables: Longitude (longitude), Latitude (latitude), Count (observedCount)

Glimpse of data

ebird <- read_csv("data/eBird Data Washington DC 2010-2015.csv")
Rows: 12027 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (6): species, state, sn1, county, sn2, locality
dbl  (3): observedCount, latitude, longitude
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(ebird)
Rows: 12,027
Columns: 10
$ species       <chr> "American Robin", "Bonaparte's Gull", "Cackling Goose", …
$ observedCount <dbl> 970, 34, 1, 1, 1, 1, 1, 1, 6, 15, 20, 15, 20, 6, 11, 25,…
$ state         <chr> "District of Columbia", "District of Columbia", "Distric…
$ sn1           <chr> "US-DC", "US-DC", "US-DC", "US-DC", "US-DC", "US-DC", "U…
$ county        <chr> "District of Columbia", "District of Columbia", "Distric…
$ sn2           <chr> "US-DC-001", "US-DC-001", "US-DC-001", "US-DC-001", "US-…
$ locality      <chr> "Guy Mason - dog run", "East Potomac Park--Hains Pt.", "…
$ latitude      <dbl> 38.92213, 38.86088, 38.86088, 38.87179, 38.87179, 38.871…
$ longitude     <dbl> -77.07147, -77.02324, -77.02324, -76.98485, -76.98485, -…
$ date          <date> 2010-01-26, 2010-01-30, 2010-01-09, 2010-01-15, 2010-01…
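
One way the species-by-location question could be explored is to total `observedCount` for each species and locality, then compute what share of a species' sightings falls in each locality. The sketch below is only an illustration under stated assumptions, not the final analysis: `toy_ebird` is a made-up stand-in that borrows the column names from the glimpse above (its values and the locality names "Park A" and "Park B" are invented), and `species_share()` is a hypothetical helper name.

```r
library(tidyverse)

# Hypothetical stand-in for `ebird`; the values are invented for illustration only.
toy_ebird <- tibble(
  species       = c("American Robin", "American Robin", "Cackling Goose", "American Robin"),
  locality      = c("Park A", "Park B", "Park A", "Park A"),
  observedCount = c(6, 2, 1, 2)
)

# Share of each species' total observed count that falls in each locality.
species_share <- function(data) {
  data |>
    group_by(species, locality) |>
    summarise(n_birds = sum(observedCount), .groups = "drop") |>
    group_by(species) |>
    mutate(share = n_birds / sum(n_birds)) |>
    ungroup() |>
    arrange(species, desc(share))
}

shares <- species_share(toy_ebird)
shares
```

Calling `species_share(ebird)` on the real data would give one row per species and locality; a species with a share close to 1 in a single locality would be a candidate for being concentrated there.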

Astronauts

Introduction and data

  • Source: This data comes from the tidytuesday GitHub repository, which has multiple contributors.

  • Method of Collection: Initial collection of the data began in 2020 by Tatsuya Corlett, Mariya Stavnichuk, and Svetlana Komarova, who are all associated with NASA and Roscosmos. Georgios Karamanis helped prepare the data set for addition to GitHub.

  • Description of Observations: There are 24 columns (variables) and 1,277 observations on astronauts. The variables are a mix of 11 character columns and 13 numeric columns, although several of the numeric columns (such as id and number) are identifiers rather than measurements.

  • Ethical Concerns: There aren’t necessarily any ethical concerns as this information was likely required to be provided by the astronauts and intended to be publicly available.

Research question

  • Question: Is there a relationship between the ages of astronauts and the number of trips they’ve been on? Do certain occupations tend to be more frequent fliers?
  • Importance: This question aims to show what kinds of occupations tend to gain more experience in space. This could be useful for aspiring students who are considering an aerospace career.
  • Description of Research Topic + Hypothesis: This research topic focuses on specific traits of astronauts that might relate to how many missions they’ve been on. We hypothesize that older astronauts with past piloting experience tend to have been on more missions than other astronauts.
  • Categorical variables: name, military_civilian, occupation
  • Quantitative variables: total_number_of_missions

Glimpse of data

astronauts <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-14/astronauts.csv')
Rows: 1277 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): name, original_name, sex, nationality, military_civilian, selectio...
dbl (13): id, number, nationwide_number, year_of_birth, year_of_selection, m...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
write_csv(astronauts, "data/astronauts.csv")

glimpse(astronauts)
Rows: 1,277
Columns: 24
$ id                       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
$ number                   <dbl> 1, 2, 3, 3, 4, 5, 5, 6, 6, 7, 7, 7, 8, 8, 9, …
$ nationwide_number        <dbl> 1, 2, 1, 1, 2, 2, 2, 4, 4, 3, 3, 3, 4, 4, 5, …
$ name                     <chr> "Gagarin, Yuri", "Titov, Gherman", "Glenn, Jo…
$ original_name            <chr> "ГАГАРИН Юрий Алексеевич", "ТИТОВ Герман Степ…
$ sex                      <chr> "male", "male", "male", "male", "male", "male…
$ year_of_birth            <dbl> 1934, 1935, 1921, 1921, 1925, 1929, 1929, 193…
$ nationality              <chr> "U.S.S.R/Russia", "U.S.S.R/Russia", "U.S.", "…
$ military_civilian        <chr> "military", "military", "military", "military…
$ selection                <chr> "TsPK-1", "TsPK-1", "NASA Astronaut Group 1",…
$ year_of_selection        <dbl> 1960, 1960, 1959, 1959, 1959, 1960, 1960, 196…
$ mission_number           <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 2, 3, 1, 2, 1, …
$ total_number_of_missions <dbl> 1, 1, 2, 2, 1, 2, 2, 2, 2, 3, 3, 3, 2, 2, 3, …
$ occupation               <chr> "pilot", "pilot", "pilot", "PSP", "Pilot", "p…
$ year_of_mission          <dbl> 1961, 1961, 1962, 1998, 1962, 1962, 1970, 196…
$ mission_title            <chr> "Vostok 1", "Vostok 2", "MA-6", "STS-95", "Me…
$ ascend_shuttle           <chr> "Vostok 1", "Vostok 2", "MA-6", "STS-95", "Me…
$ in_orbit                 <chr> "Vostok 2", "Vostok 2", "MA-6", "STS-95", "Me…
$ descend_shuttle          <chr> "Vostok 3", "Vostok 2", "MA-6", "STS-95", "Me…
$ hours_mission            <dbl> 1.77, 25.00, 5.00, 213.00, 5.00, 94.00, 424.0…
$ total_hrs_sum            <dbl> 1.77, 25.30, 218.00, 218.00, 5.00, 519.33, 51…
$ field21                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ eva_hrs_mission          <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
$ total_eva_hrs            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.0…
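
The data has no age column, so the age part of the research question would need a derived variable. A minimal sketch, assuming age can be approximated as `year_of_mission - year_of_birth` (only years are recorded, so this may be off by one), is shown below. It also lowercases `occupation`, since the glimpse shows both "pilot" and "Pilot". The tibble `astro_head` holds the first five rows of four columns as they appear in the glimpse output; `astro_head`, `age_at_mission`, and `astro_summary` are names introduced here for illustration.

```r
library(tidyverse)

# First five rows of the relevant columns, copied from the glimpse output above.
astro_head <- tibble(
  year_of_birth            = c(1934, 1935, 1921, 1921, 1925),
  year_of_mission          = c(1961, 1961, 1962, 1998, 1962),
  occupation               = c("pilot", "pilot", "pilot", "PSP", "Pilot"),
  total_number_of_missions = c(1, 1, 2, 2, 1)
)

astro_summary <- astro_head |>
  mutate(
    # Merge spellings such as "pilot" and "Pilot" into one category.
    occupation = str_to_lower(occupation),
    # Approximate age at the time of the mission (year resolution only).
    age_at_mission = year_of_mission - year_of_birth
  ) |>
  group_by(occupation) |>
  summarise(
    n_rows              = n(),
    mean_age            = mean(age_at_mission),
    mean_total_missions = mean(total_number_of_missions)
  )

astro_summary
```

In the glimpse, `mission_number` restarts for repeated values of `number`, which suggests each row is one astronaut on one mission. If that holds, a per-astronaut analysis would first need to keep a single row per astronaut so that frequent fliers are not counted several times.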

Crimes in LA

Introduction and data

  • Source: This data was found through Google Dataset Search and is provided by the LAPD (Los Angeles Police Department).

  • Method of Collection: The data was collected by the Los Angeles Records Management System. The system complies with all of the guidelines outlined by the FBI (NIBRS-only data).

  • Description of Observations: There are a total of 1,000 rows and 28 columns/variables. There is a mixture of categorical and numerical variables, including crime type, area code, victim age, victim sex, and time. Crimes appear to range from theft (identity/money/property) to vandalism to various assaults and more. Most victims are young adults. Where a weapon is recorded, the majority are "strong-arm" (bodily force) rather than actual weaponry. The latitudes and longitudes are very close to one another because the data covers a single city.

  • Ethical Concerns: This dataset contains information regarding the victims of the crimes which could raise concerns regarding privacy. In addition, certain area codes may receive higher surveillance and reporting which could lead to bias in our analysis of differing crime prevalence.

Research question

  • Question: Is there a relationship between the demographics of crime victims within Los Angeles and the area codes in which the crimes occurred? Are certain crimes more prevalent in certain area codes than in others?
  • Importance: As crime is always a major issue for any society, gathering more data on crime can help law enforcement better predict patterns and locations of crime. By gaining a deeper understanding of crime data for LA, which has high levels of crime, law enforcement can better formulate responses to deal with crime in the future.  
  • Description of Research Topic + Hypothesis: This research topic aims to investigate whether certain crimes are more common, or certain demographics are more likely to be targeted, in some places than in others, which may help direct resources toward the areas where they are most needed across Los Angeles. I hypothesize that both the types of crime and the demographics of victims will differ noticeably between areas of LA.
  • Categorical variables: Area Code (AREA), Crime (Crm Cd Desc), Location (LOCATION)
  • Quantitative variables: Victim Age (Vict Age), Latitude (LAT), Longitude (LON)

Glimpse of data

LAcrime <- read_csv("data/Crime_Data_2020_to_Present.csv")
Rows: 1000 Columns: 28
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Date Rptd, DATE OCC, TIME OCC, AREA, AREA NAME, Rpt Dist No, Crm C...
dbl (11): DR_NO, Part 1-2, Crm Cd, Vict Age, Premis Cd, Weapon Used Cd, Crm ...
lgl  (1): Crm Cd 4

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(LAcrime)
Rows: 1,000
Columns: 28
$ DR_NO            <dbl> 190326475, 200106753, 200320258, 200907217, 220614831…
$ `Date Rptd`      <chr> "03/01/2020 12:00:00 AM", "02/09/2020 12:00:00 AM", "…
$ `DATE OCC`       <chr> "03/01/2020 12:00:00 AM", "02/08/2020 12:00:00 AM", "…
$ `TIME OCC`       <chr> "2130", "1800", "1700", "2037", "1200", "2300", "0900…
$ AREA             <chr> "07", "01", "03", "09", "06", "18", "01", "03", "13",…
$ `AREA NAME`      <chr> "Wilshire", "Central", "Southwest", "Van Nuys", "Holl…
$ `Rpt Dist No`    <chr> "0784", "0182", "0356", "0964", "0666", "1826", "0182…
$ `Part 1-2`       <dbl> 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2,…
$ `Crm Cd`         <dbl> 510, 330, 480, 343, 354, 354, 354, 354, 354, 624, 354…
$ `Crm Cd Desc`    <chr> "VEHICLE - STOLEN", "BURGLARY FROM VEHICLE", "BIKE - …
$ Mocodes          <chr> NA, "1822 1402 0344", "0344 1251", "0325 1501", "1822…
$ `Vict Age`       <dbl> 0, 47, 19, 19, 28, 41, 25, 27, 24, 26, 26, 8, 7, 0, 1…
$ `Vict Sex`       <chr> "M", "M", "X", "M", "M", "M", "M", "F", "F", "M", "M"…
$ `Vict Descent`   <chr> "O", "O", "X", "O", "H", "H", "H", "B", "B", "H", "B"…
$ `Premis Cd`      <dbl> 101, 128, 502, 405, 102, 501, 502, 248, 750, 502, 501…
$ `Premis Desc`    <chr> "STREET", "BUS STOP/LAYOVER (ALSO QUERY 124)", "MULTI…
$ `Weapon Used Cd` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 400, NA, 400, 400…
$ `Weapon Desc`    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "STRONG-ARM (HAND…
$ Status           <chr> "AA", "IC", "IC", "IC", "IC", "IC", "IC", "IC", "IC",…
$ `Status Desc`    <chr> "Adult Arrest", "Invest Cont", "Invest Cont", "Invest…
$ `Crm Cd 1`       <dbl> 510, 330, 480, 343, 354, 354, 354, 354, 354, 624, 354…
$ `Crm Cd 2`       <dbl> 998, 998, NA, NA, NA, NA, NA, NA, NA, NA, NA, 821, 86…
$ `Crm Cd 3`       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ `Crm Cd 4`       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ LOCATION         <chr> "1900 S  LONGWOOD                     AV", "1000 S  F…
$ `Cross Street`   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "…
$ LAT              <dbl> 34.0375, 34.0444, 34.0210, 34.1576, 34.0944, 33.9467,…
$ LON              <dbl> -118.3506, -118.2628, -118.3002, -118.4387, -118.3277…
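
A first pass at the area question could summarise the number of crimes and the typical victim age per area, and pick out the most frequent crime description in each area. The sketch below is illustrative only: `toy_crime` contains made-up rows that reuse the column names from the glimpse above and are not real LAPD records. It also assumes that a `Vict Age` of 0 means the age was not recorded, which would need to be checked against the data documentation before use.

```r
library(tidyverse)

# Made-up rows that reuse the column names from the glimpse; not real records.
toy_crime <- tibble(
  `AREA NAME`   = c("Central", "Central", "Central", "Wilshire", "Wilshire"),
  `Crm Cd Desc` = c("BURGLARY FROM VEHICLE", "BURGLARY FROM VEHICLE",
                    "VEHICLE - STOLEN", "VEHICLE - STOLEN", "VEHICLE - STOLEN"),
  `Vict Age`    = c(47, 25, 0, 0, 30)
)

# Crime count and median victim age per area.
area_summary <- toy_crime |>
  # Assumption: an age of 0 means "not recorded", so treat it as missing.
  mutate(vict_age = na_if(`Vict Age`, 0)) |>
  group_by(`AREA NAME`) |>
  summarise(
    n_crimes        = n(),
    median_vict_age = median(vict_age, na.rm = TRUE),
    .groups = "drop"
  )

# Most frequent crime description within each area.
top_crime <- toy_crime |>
  count(`AREA NAME`, `Crm Cd Desc`, name = "n_crimes") |>
  group_by(`AREA NAME`) |>
  slice_max(n_crimes, n = 1, with_ties = FALSE) |>
  ungroup()

area_summary
top_crime
```

The same two pipelines could be run on `LAcrime` directly, since the column names match. Backticks are needed because names such as `AREA NAME` contain spaces.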