Render lab-4.qmd and commit and push your changes.
From last time – Application exercise
Goal
Scrape data and organize it in a tidy format in R
Perform light text parsing to clean data
Summarize and visualize the data
ae-09
Go to the project navigator in RStudio (top right corner of your RStudio window) and open the project called ae.
If there are any uncommitted files, commit them, and then click Pull.
Open the file called chronicle-scrape.R and follow along.
Recap
Use the SelectorGadget to identify tags for elements you want to grab
Use rvest to first read the whole page (into R) and then parse the object you’ve read in to the elements you’re interested in
Put the components together in a data frame (a tibble) and analyze it like you analyze any other data
A new R workflow
When working in a Quarto document, your analysis is re-run each time you render
If web scraping in a Quarto document, you’d be re-scraping the data each time you render, which is undesirable (and not nice)!
An alternative workflow:
Use an R script to save your code
Save interim data scraped using the code in the script as CSV or RDS files
Use the saved data in your analysis in your Quarto document
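The alternative workflow above can be sketched as follows. This is a minimal illustration under stated assumptions, not the course's script: the small data frame stands in for scraped data, and the file path is a hypothetical temporary file (in a real project you would save into the project itself, for example a data/ folder). Base R's saveRDS() and readRDS() are used to keep the sketch self-contained; readr's write_csv()/read_csv() or write_rds()/read_rds() follow the same pattern.

```r
# --- In the R script (run once, or whenever the data need refreshing) ---
# Stand-in for scraped data; in practice this would come from your rvest code
scraped <- data.frame(
  title = c("Opinion one", "Opinion two"),
  author = c("A. Writer", "B. Writer")
)

# Hypothetical path: a temporary file keeps the sketch self-contained
rds_path <- tempfile(fileext = ".rds")
saveRDS(scraped, rds_path)

# --- In the Quarto document (runs on every render) ---
# Read the saved data instead of re-scraping the website
articles <- readRDS(rds_path)
nrow(articles)
```

With this split, rendering the document only reads a local file, so the website is contacted only when you deliberately re-run the script.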
Web scraping considerations
Ethics: “Can you?” vs “Should you?”
Challenges: Unreliable formatting
Challenges: Data broken into many pages
Workflow: Screen scraping vs. APIs
Two different scenarios for web scraping:
Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy)
Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files
Workflow: Scraping from many pages
So far you’ve learned to scrape data from a single page
If you wanted to scrape data from multiple, similarly structured web pages (e.g., scrape the text and other metadata for each opinion article on The Chronicle):
Write the code to scrape one page
Turn it into a function that takes the webpage URL as an argument and returns the scraped, structured data
Map the function over the list of URLs of interest
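The three steps above might look like the sketch below. Everything specific in it is an assumption made for illustration: scrape_page() and make_page() are hypothetical helpers, the CSS selectors "h1" and "p" are placeholders rather than the selectors used on The Chronicle's site, and two local HTML files stand in for real URLs so that the example runs offline. It also assumes purrr 1.0.0 or later for list_rbind().

```r
library(rvest)
library(purrr)
library(tibble)

# Hypothetical scraper: takes one URL (or file path) and returns one tidy row.
# The selectors "h1" and "p" are placeholders for illustration only.
scrape_page <- function(url) {
  page <- read_html(url)
  tibble(
    url = url,
    title = page |> html_element("h1") |> html_text2(),
    text = page |> html_elements("p") |> html_text2() |> paste(collapse = " ")
  )
}

# Stand-ins for real URLs: small local HTML files so the sketch runs offline
make_page <- function(title, body) {
  path <- tempfile(fileext = ".html")
  writeLines(
    paste0("<html><body><h1>", title, "</h1><p>", body, "</p></body></html>"),
    path
  )
  path
}
urls <- c(
  make_page("First opinion", "Some text."),
  make_page("Second opinion", "More text.")
)

# Map the function over the URLs and row-bind the results into one tibble
articles <- map(urls, scrape_page) |> list_rbind()
articles
```

When pointing a function like this at a live site, it is considerate to pause between requests (for example with Sys.sleep()) and to save the result to disk rather than re-scraping.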
The 🐘 in the room: Chat GPT
Using Chat GPT
How are you using Chat GPT for this class?
Example 1 - From Lab 4
Another variable that needs some cleaning up is membership_status. Currently it should either be NA or contain some text that says "Select ... to register for this group". Recode this variable to say "Closed" if the current value is NA or "Open" otherwise. Save the resulting dataset with these two new variables as clubs, i.e., overwrite the data frame.
Then, display the first 10 rows of the dataset, relocate()ing membership_status to the beginning of the dataset to make sure it appears in the output in your rendered document.
Sample data
df <- tribble(
  ~club, ~membership_status,
  1, "Select AI Competition Club's group. Select the group and click on the Join button at the bottom of the page to register for this group",
  2, NA,
  3, "Select Black Pre-Law Society at Duke's group. Select the group and click on the Join button at the bottom of the page to register for this group",
  4, NA,
  5, NA,
  6, "Select Duke Women's Flag Football Team's group. Select the group and click on the Join button at the bottom of the page to register for this group",
  7, NA,
  8, NA,
  9, NA,
  10, "Select Latino Medical Student Association's group. Select the group and click on the Join button at the bottom of the page to register for this group",
  11, NA,
  12, "Select Native American/Indigenous Student Alliance's group. Select the group and click on the Join button at the bottom of the page to register for this group",
  13, NA,
  14, NA,
  15, NA,
  16, NA,
  17, NA,
  18, NA,
  19, "Select Society of Women Engineers's group. Select the group and click on the Join button at the bottom of the page to register for this group",
  20, NA,
  21, NA
)
df
Sample data
# A tibble: 21 × 2
club membership_status
<dbl> <chr>
1 1 Select AI Competition Club's group. Select the group and click on the Join butto…
2 2 <NA>
3 3 Select Black Pre-Law Society at Duke's group. Select the group and click on the …
4 4 <NA>
5 5 <NA>
6 6 Select Duke Women's Flag Football Team's group. Select the group and click on th…
7 7 <NA>
8 8 <NA>
9 9 <NA>
10 10 Select Latino Medical Student Association's group. Select the group and click on…
# ℹ 11 more rows
Chat GPT - Take 1 prompt
What is wrong with this prompt?
Another variable that needs some cleaning up is membership_status. Currently it should either be NA or contain some text that says "Select ... to register for this group". Recode this variable to say "Closed" if the current value is NA or "Open" otherwise. Save the resulting dataset with these two new variables as clubs, i.e., overwrite the data frame.
Then, display the first 10 rows of the dataset, relocate()ing membership_status to the beginning of the dataset to make sure it appears in the output in your rendered document.
Chat GPT - Take 1 response
Explain what the following code does.
Certainly! To achieve this in R, you can follow these steps:
# Assuming your dataset is named 'clubs'
# Recode membership_status variable
clubs$membership_status <- ifelse(is.na(clubs$membership_status), "Closed", "Open")
# Relocate membership_status to the beginning of the dataset
clubs <- clubs[c("membership_status", setdiff(names(clubs), "membership_status"))]
# Display the first 10 rows of the dataset
head(clubs, 10)
Make sure to replace ‘clubs’ with the actual name of your dataset. This code will recode the membership_status variable as per your instructions, then move it to the beginning of the dataset and display the first 10 rows.
Chat GPT - Take 1 code
# Assuming your dataset is named 'clubs'
# Recode membership_status variable
clubs$membership_status <- ifelse(is.na(clubs$membership_status), "Closed", "Open")
Error in eval(expr, envir, enclos): object 'clubs' not found
# Relocate membership_status to the beginning of the dataset
clubs <- clubs[c("membership_status", setdiff(names(clubs), "membership_status"))]
Error in eval(expr, envir, enclos): object 'clubs' not found
# Display the first 10 rows of the dataset
head(clubs, 10)
Error in eval(expr, envir, enclos): object 'clubs' not found
Chat GPT - Take 1 code + edit
clubs <- df
# Assuming your dataset is named 'clubs'
# Recode membership_status variable
clubs$membership_status <- ifelse(is.na(clubs$membership_status), "Closed", "Open")
# Relocate membership_status to the beginning of the dataset
clubs <- clubs[c("membership_status", setdiff(names(clubs), "membership_status"))]
# Display the first 10 rows of the dataset
head(clubs, 10)
# A tibble: 10 × 2
membership_status club
<chr> <dbl>
1 Open 1
2 Closed 2
3 Open 3
4 Closed 4
5 Closed 5
6 Open 6
7 Closed 7
8 Closed 8
9 Closed 9
10 Open 10
Chat GPT - Take 2 prompt
Can you suggest an alternate solution in R?
Chat GPT - Take 2 response
Explain what the following code does.
Certainly! An alternate solution can use the mutate and relocate functions from the dplyr package. Here’s an alternative code:
library(dplyr)
# Assuming your dataset is named 'clubs'
clubs <- clubs %>%
  mutate(membership_status = ifelse(is.na(membership_status), "Closed", "Open")) %>%
  relocate(membership_status, .before = 1)
# Display the first 10 rows of the dataset
head(clubs, 10)
This code achieves the same result but uses the mutate function from the dplyr package for variable recoding and the relocate function to move the membership_status variable to the beginning of the dataset. Make sure to replace ‘clubs’ with the actual name of your dataset.
Chat GPT - Take 2 code
library(dplyr)
# Assuming your dataset is named 'clubs'
clubs <- clubs %>%
  mutate(membership_status = ifelse(is.na(membership_status), "Closed", "Open")) %>%
  relocate(membership_status, .before = 1)
# Display the first 10 rows of the dataset
head(clubs, 10)
# A tibble: 10 × 2
membership_status club
<chr> <dbl>
1 Open 1
2 Closed 2
3 Open 3
4 Closed 4
5 Closed 5
6 Open 6
7 Closed 7
8 Closed 8
9 Closed 9
10 Open 10
Chat GPT - Take 3 prompt
Can you suggest a tidyverse-based solution?
Chat GPT - Take 3 response
Certainly! If you want to stick to the tidyverse, you can use the mutate and slice_head functions from the dplyr package along with the forcats package for factor manipulation. Here’s a tidyverse-based solution:
library(dplyr)
library(forcats)
# Assuming your dataset is named 'clubs'
clubs <- clubs %>%
  mutate(membership_status = fct_recode(membership_status, Closed = "NA", Open = "Select ... to register for this group")) %>%
  slice_head(n = 10) %>%
  relocate(membership_status, .before = 1)
# Display the first 10 rows of the dataset
clubs
This code uses fct_recode from the forcats package to recode the levels of the membership_status variable. The slice_head function is then used to select the first 10 rows, and relocate is used to move the membership_status variable to the beginning of the dataset. Make sure to replace ‘clubs’ with the actual name of your dataset.
Chat GPT - Take 3 code
library(dplyr)
library(forcats)
# Assuming your dataset is named 'clubs'
clubs <- clubs %>%
  mutate(membership_status = fct_recode(membership_status, Closed = "NA", Open = "Select ... to register for this group")) %>%
  slice_head(n = 10) %>%
  relocate(membership_status, .before = 1)
# Display the first 10 rows of the dataset
clubs
# A tibble: 10 × 2
membership_status club
<fct> <dbl>
1 Select AI Competition Club's group. Select the group and click on the Join butto… 1
2 <NA> 2
3 Select Black Pre-Law Society at Duke's group. Select the group and click on the … 3
4 <NA> 4
5 <NA> 5
6 Select Duke Women's Flag Football Team's group. Select the group and click on th… 6
7 <NA> 7
8 <NA> 8
9 <NA> 9
10 Select Latino Medical Student Association's group. Select the group and click on… 10
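One way to see why the Take 3 output is unchanged: fct_recode() renames existing factor levels using new = "old" pairs, and neither "NA" (a string, not a missing value) nor the literal text "Select ... to register for this group" is an actual level of the variable. The sketch below uses a made-up two-element vector, not the clubs data, and suppresses the warning about unknown levels that forcats raises in this situation.

```r
library(forcats)

# Hypothetical values for illustration; not taken from the clubs data
status <- c("Select Hypothetical Club's group.", NA)

# Neither "old" level named here exists in `status`, so nothing is renamed.
# forcats warns about the unknown levels; the warning is suppressed here.
recoded <- suppressWarnings(
  fct_recode(status, Closed = "NA", Open = "Select ... to register for this group")
)

# The original text is still the only level, and the missing value is still missing
levels(recoded)
is.na(recoded)
```

Because level names must match exactly, fct_recode() is a poor fit when the text differs from row to row or when the value to recode is a true NA.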
Chat GPT - Take 4 prompt
The text “Select … to register for this group” can vary for each group. For example, it can be “Select blah blah to register for this group” or it can be “Select la di da to register for this group”. Can you suggest a tidyverse-based solution that would capture all of the possible options for the … in this text?
Chat GPT - Take 4 response
Explain what the following code does.
Certainly! If the text “Select … to register for this group” varies for each group, you can use the str_detect function from the stringr package to identify those cases. Here’s a tidyverse-based solution:
library(dplyr)
library(stringr)
# Assuming your dataset is named 'clubs'
clubs <- clubs %>%
  mutate(membership_status = ifelse(str_detect(membership_status, "Select .* to register for this group"), "Open", "Closed")) %>%
  relocate(membership_status, .before = 1)
# Display the first 10 rows of the dataset
head(clubs, 10)
This code uses str_detect to check if the membership_status contains the pattern “Select … to register for this group”. If it does, it sets the status to “Open”; otherwise, it sets it to “Closed”. Make sure to replace ‘clubs’ with the actual name of your dataset.
Chat GPT - Take 4 code
library(dplyr)
library(stringr)
# Assuming your dataset is named 'clubs'
clubs <- clubs %>%
  mutate(membership_status = ifelse(str_detect(membership_status, "Select .* to register for this group"), "Open", "Closed")) %>%
  relocate(membership_status, .before = 1)
# Display the first 10 rows of the dataset
head(clubs, 10)
# A tibble: 10 × 2
membership_status club
<chr> <dbl>
1 Open 1
2 <NA> 2
3 Open 3
4 <NA> 4
5 <NA> 5
6 Open 6
7 <NA> 7
8 <NA> 8
9 <NA> 9
10 Open 10
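The missing values in the Take 4 output can be traced with a small example. The two-element vector below is hypothetical and only mimics the shape of membership_status; the point it illustrates is that str_detect() returns NA for a missing input, and ifelse() passes that NA through instead of choosing either branch.

```r
library(stringr)

# Hypothetical values mimicking membership_status: one text entry, one NA
status <- c(
  "Select Some Club's group. Select the group and click on the Join button at the bottom of the page to register for this group",
  NA
)

# str_detect() returns NA for a missing input, not FALSE
detected <- str_detect(status, "Select .* to register for this group")
detected

# ifelse() passes that NA through, so the missing value is never recoded
recoded_take4 <- ifelse(detected, "Open", "Closed")
recoded_take4

# Testing for missingness directly gives TRUE or FALSE for every element
recoded_by_na <- ifelse(is.na(status), "Closed", "Open")
recoded_by_na
```

This is why a condition built on is.na() suits the task better than pattern matching: the rows that need to become "Closed" are exactly the ones where a pattern match is undefined.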
Chat GPT - Take 5 prompt
What is different about this prompt?
Let’s try again. Can you write a tidyverse-based solution where membership status is recoded as Closed if the current value is NA and Open if the current value is anything else. Then, using the relocate() function, the membership status is moved to be the first column in the output.
Chat GPT - Take 5 response
Explain what the following code does.
Certainly! My apologies for the confusion. Here’s a tidyverse-based solution that recodes the membership status and moves it to be the first column using the dplyr package:
library(dplyr)
# Assuming your dataset is named 'clubs'
clubs <- clubs %>%
  mutate(membership_status = if_else(is.na(membership_status), "Closed", "Open")) %>%
  relocate(membership_status, .before = 1)
# Display the first 10 rows of the dataset
head(clubs, 10)
This code uses the mutate function to recode membership_status as “Closed” if it’s NA and “Open” otherwise. Then, the relocate function is used to move membership_status to be the first column in the dataset. Make sure to replace ‘clubs’ with the actual name of your dataset.
Chat GPT - Take 5 code
library(dplyr)
# Assuming your dataset is named 'clubs'
clubs <- clubs %>%
  mutate(membership_status = if_else(is.na(membership_status), "Closed", "Open")) %>%
  relocate(membership_status, .before = 1)
# Display the first 10 rows of the dataset
head(clubs, 10)
# A tibble: 10 × 2
membership_status club
<chr> <dbl>
1 Open 1
2 Closed 2
3 Open 3
4 Closed 4
5 Closed 5
6 Open 6
7 Closed 7
8 Closed 8
9 Closed 9
10 Open 10
# A tibble: 21 × 2
membership_status club
<chr> <dbl>
1 Open 1
2 Closed 2
3 Open 3
4 Closed 4
5 Closed 5
6 Open 6
7 Closed 7
8 Closed 8
9 Closed 9
10 Open 10
# ℹ 11 more rows
Example 2
Each CSI franchise includes a number of series, each with the character string "CSI" (in capital letters) in their title. Using this information, create a new data frame called csi that only has titles from the CSI-verse and all of the columns from tv for these titles. Then, display the first 10 rows of csi. Note: The titles and seasons included in the dataset may not be all CSI titles and titles that have ever aired.
Chat GPT - Take 1 prompt
Each CSI franchise includes a number of series, each with the character string "CSI" (in capital letters) in their title. Using this information, create a new data frame called csi that only has titles from the CSI-verse and all of the columns from tv for these titles. Then, display the first 10 rows of csi. Note: The titles and seasons included in the dataset may not be all CSI titles and titles that have ever aired.
Chat GPT - Take 1 response
Certainly! To filter the tv data frame to include only titles from the CSI-verse, you can use the filter function from the dplyr package. Here’s the code:
library(dplyr)
# Assuming your dataset is named 'tv'
csi <- tv %>%
  filter(str_detect(title, "CSI", case_insensitive = TRUE))
# Display the first 10 rows of the 'csi' data frame
head(csi, 10)
This code uses the filter function to select rows where the title column contains the string “CSI” (case-insensitive). The resulting data frame, csi, will include only titles from the CSI-verse along with all the columns from the original tv data frame. Make sure to replace ‘tv’ with the actual name of your dataset.
Chat GPT - Take 1 code
What is wrong with the following code?
tv <- read_csv("data/tv.csv")
library(dplyr)
# Assuming your dataset is named 'tv'
csi <- tv %>%
  filter(str_detect(title, "CSI", case_insensitive = TRUE))
Error in `filter()`:
ℹ In argument: `str_detect(title, "CSI", case_insensitive = TRUE)`.
Caused by error in `str_detect()`:
! unused argument (case_insensitive = TRUE)
# Display the first 10 rows of the 'csi' data frame
head(csi, 10)
Error in eval(expr, envir, enclos): object 'csi' not found
Chat GPT - Take 2 prompt
str_detect doesn’t have a case_insensitive argument, can you try again?
Chat GPT - Take 2 response
You are correct, and I appreciate your understanding. The str_detect function itself doesn’t have a case_insensitive argument. Instead, you can use the str_which function along with the case_insensitive argument in the filter function. Here’s the corrected code:
library(dplyr)
# Assuming your dataset is named 'tv'
csi <- tv %>%
  filter(str_detect(title, "CSI", ignore_case = TRUE))
# Display the first 10 rows of the 'csi' data frame
head(csi, 10)
This code uses str_detect within the filter function and includes the ignore_case = TRUE argument to perform a case-insensitive search. The resulting csi data frame will contain only titles from the CSI-verse along with all the columns from the original tv data frame. Make sure to replace ‘tv’ with the actual name of your dataset.
Chat GPT - Take 2 code
library(dplyr)
# Assuming your dataset is named 'tv'
csi <- tv %>%
  filter(str_detect(title, "CSI", ignore_case = TRUE))
Error in `filter()`:
ℹ In argument: `str_detect(title, "CSI", ignore_case = TRUE)`.
Caused by error in `str_detect()`:
! unused argument (ignore_case = TRUE)
# Display the first 10 rows of the 'csi' data frame
head(csi, 10)
Error in eval(expr, envir, enclos): object 'csi' not found
Chat GPT - Take 3 prompt
str_detect doesn’t have an ignore_case argument either, can you try again?
Chat GPT - Take 3 response
I apologize for the oversight. It seems there was a mistake in my responses. To perform a case-insensitive search, you can use the stringi package along with stri_detect_regex. Here’s the corrected code:
library(dplyr)
library(stringi)
# Assuming your dataset is named 'tv'
csi <- tv %>%
  filter(stri_detect_regex(title, "CSI", case_insensitive = TRUE))
# Display the first 10 rows of the 'csi' data frame
head(csi, 10)
This code uses stri_detect_regex from the stringi package within the filter function, and it includes the case_insensitive = TRUE argument for a case-insensitive search. The resulting csi data frame will contain only titles from the CSI-verse along with all the columns from the original tv data frame. Make sure to replace ‘tv’ with the actual name of your dataset.
Let’s go back to Take 1
and edit it a bit:
tv |>
  filter(str_detect(title, "CSI"))
# A tibble: 36 × 9
season title year month day av_rating genre_1 genre_2 genre_3
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 1 CSI: Crime Scene Investigat… 2001 1 20 8.32 Crime Drama Mystery
2 2 CSI: Crime Scene Investigat… 2002 1 10 8.26 Crime Drama Mystery
3 3 CSI: Crime Scene Investigat… 2003 1 15 8.30 Crime Drama Mystery
4 4 CSI: Crime Scene Investigat… 2004 1 18 8.33 Crime Drama Mystery
5 5 CSI: Crime Scene Investigat… 2005 1 24 8.38 Crime Drama Mystery
6 6 CSI: Crime Scene Investigat… 2006 1 16 8.21 Crime Drama Mystery
7 7 CSI: Crime Scene Investigat… 2007 1 14 8.43 Crime Drama Mystery
8 8 CSI: Crime Scene Investigat… 2008 1 7 8.07 Crime Drama Mystery
9 9 CSI: Crime Scene Investigat… 2009 1 27 7.80 Crime Drama Mystery
10 10 CSI: Crime Scene Investigat… 2010 1 23 7.69 Crime Drama Mystery
# ℹ 26 more rows
Let’s go back to Take 1
csi <- tv |>
  filter(str_detect(title, "CSI"))
csi
# A tibble: 36 × 9
season title year month day av_rating genre_1 genre_2 genre_3
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
1 1 CSI: Crime Scene Investigat… 2001 1 20 8.32 Crime Drama Mystery
2 2 CSI: Crime Scene Investigat… 2002 1 10 8.26 Crime Drama Mystery
3 3 CSI: Crime Scene Investigat… 2003 1 15 8.30 Crime Drama Mystery
4 4 CSI: Crime Scene Investigat… 2004 1 18 8.33 Crime Drama Mystery
5 5 CSI: Crime Scene Investigat… 2005 1 24 8.38 Crime Drama Mystery
6 6 CSI: Crime Scene Investigat… 2006 1 16 8.21 Crime Drama Mystery
7 7 CSI: Crime Scene Investigat… 2007 1 14 8.43 Crime Drama Mystery
8 8 CSI: Crime Scene Investigat… 2008 1 7 8.07 Crime Drama Mystery
9 9 CSI: Crime Scene Investigat… 2009 1 27 7.80 Crime Drama Mystery
10 10 CSI: Crime Scene Investigat… 2010 1 23 7.69 Crime Drama Mystery
# ℹ 26 more rows
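For reference, the sketch below contrasts the default behavior of str_detect() with a case-insensitive match. The titles are hypothetical and not taken from the tv dataset. In stringr, case sensitivity is a property of the pattern, set by wrapping it in regex(), rather than an argument of str_detect() itself. Since the task asks for "CSI" in capital letters, the default case-sensitive match is what is needed here.

```r
library(stringr)

# Hypothetical titles for illustration; not taken from the tv dataset
titles <- c("CSI: Miami", "csi fan special", "Law & Order")

# Default: case-sensitive match on the capital letters "CSI"
sensitive <- str_detect(titles, "CSI")
sensitive

# Case-insensitive matching is requested through the pattern, via regex()
insensitive <- str_detect(titles, regex("CSI", ignore_case = TRUE))
insensitive
```

Only the first title matches by default, while the first two match once case is ignored.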
Guidelines and best practices for using Chat GPT
Do not just copy-paste the prompt – for appropriate academic conduct, for your own learning, and for getting better results faster
Engineer the prompt until the response starts to look like code you’re learning in the course
If the response is not correct, ask for a correction
If the response doesn’t follow the guidelines, ask for a correction
Do not just copy-paste code from Chat GPT responses; run it line by line and edit as needed
Watch out for clear mistakes in the response and for code that departs from course conventions: do not keep loading packages that are already loaded, use the base pipe |>, use tidyverse-based code, etc.