Go to your ae repo, commit any remaining changes, push, and then pull for today’s application exercise.
Announcements
Lab 4 on Monday
Challenge: Resist the urge to ask a GPT before spending some time thinking!
Reading The Chronicle
How often do you read The Chronicle?
Every day
3-5 times a week
Once a week
Rarely
Reading The Chronicle
What do you think is the most common word in the titles of The Chronicle opinion pieces?
Analyzing The Chronicle
Reading The Chronicle
How do you think the sentiments in opinion pieces in The Chronicle compare across authors? Roughly the same? Wildly different? Somewhere in between?
Analyzing The Chronicle
All of this analysis is done in R!
(mostly) with tools you already know!
Common words in The Chronicle titles
Code for the earlier plot:
stop_words <- read_csv("data/stop-words.csv")

chronicle |>
  tidytext::unnest_tokens(word, title) |>
  anti_join(stop_words) |>
  count(word, sort = TRUE) |>
  slice_head(n = 20) |>
  mutate(word = fct_reorder(word, n)) |>
  ggplot(aes(y = word, x = n, fill = log(n))) +
  geom_col(show.legend = FALSE) +
  theme_minimal(base_size = 16) +
  labs(
    x = "Number of mentions",
    y = "Word",
    title = "The Chronicle - Opinion pieces",
    subtitle = "Common words in the 500 most recent opinion pieces",
    caption = "Source: Data scraped from The Chronicle on Feb 21, 2024"
  ) +
  theme(
    plot.title.position = "plot",
    plot.caption = element_text(color = "gray30")
  )
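To see what the tokenizing and counting steps in that pipeline do, here is a minimal sketch on a toy input. The three titles are taken from the data shown in these slides, but `toy_titles` and `toy_stop_words` are hypothetical stand-ins for `chronicle` and the contents of `data/stop-words.csv`, which are not reproduced here.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the `chronicle` data frame (assumption: only `title` is needed)
toy_titles <- tibble(
  title = c("Save the Duke Herbarium", "The fear of being single", "Love, love")
)

# Toy stand-in for the stop words file (assumption: a single `word` column)
toy_stop_words <- tibble(word = c("the", "of", "being"))

word_counts <- toy_titles |>
  unnest_tokens(word, title) |>             # one lowercase word per row, punctuation dropped
  anti_join(toy_stop_words, by = "word") |> # remove rows whose word is a stop word
  count(word, sort = TRUE)                  # tally each word, most frequent first

word_counts
```

Here "love" should come out on top with a count of 2, since `unnest_tokens()` lowercases words by default, and the five remaining words appear once each.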
# A tibble: 500 × 6
title author date abstract column url
<chr> <chr> <date> <chr> <chr> <chr>
1 All the world’s a stage Anna … 2024-02-22 If we a… STUDE… http…
2 Words that matter: For Alexei Navalny Carol… 2024-02-22 In some… STUDE… http…
3 Which would you save: Friend or romantic partn… Jess … 2024-02-22 Love sh… STUDE… http…
4 Happiness is not what you’re looking for Paul … 2024-02-21 We hing… STUDE… http…
5 Closing Duke's Herbarium: A fear of long-term … Matth… 2024-02-21 Without… LETTE… http…
6 CS Majors launch 'ambiguous and labelless rela… Monda… 2024-02-20 Unlike … STUDE… http…
7 The fear of being single Heidi… 2024-02-20 But it … STUDE… http…
8 Save the Duke Herbarium Henry… 2024-02-17 The Duk… LETTE… http…
9 What Duke can learn from retiring ex-president… Rober… 2024-02-17 In Duke… GUEST… http…
10 Love, love Gabri… 2024-02-16 Somehow… STUDE… http…
# ℹ 490 more rows
Web scraping
Scraping the web: what? why?
An increasing amount of data is available on the web
These data are provided in an unstructured format: you can always copy and paste, but it’s time-consuming and prone to errors
Web scraping is the process of extracting this information automatically and transforming it into a structured dataset
Two different scenarios:
Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
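The two scenarios can be sketched side by side. This is only an illustration under stated assumptions: `page_source` and `api_response` are hypothetical inline strings standing in for what a website and a web API might return, so nothing here touches the network.

```r
library(rvest)
library(jsonlite)

# Scenario 1: screen scraping. Hypothetical page source, inlined as a string.
page_source <- '<html><body><div class="name" id="first">John</div><div class="name" id="last">Doe</div></body></html>'

# With an HTML parser (easy): parse, select elements, extract their text
parsed_names <- read_html(page_source) |>
  html_elements("div.name") |>
  html_text2()

# With regular expression matching (less easy): match the raw text, then strip the tags
matches <- regmatches(
  page_source,
  gregexpr('<div class="name"[^>]*>[^<]+', page_source)
)[[1]]
regex_names <- sub(".*>", "", matches)

# Scenario 2: web API. Hypothetical JSON response body, inlined as a string.
api_response <- '{"first": "John", "last": "Doe"}'
api_names <- fromJSON(api_response)

parsed_names
regex_names
api_names$first
```

The regular expression version works here, but it depends on the exact way the tags are written, which is why the parser route is usually the easier one.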
Hypertext Markup Language
Most of the data on the web is still available as HTML. While it is structured (hierarchical), it is often not in a form useful for analysis (flat / tidy).
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
    <br/>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>
rvest
The rvest package makes basic processing and manipulation of HTML data straightforward
It’s designed to work with pipelines built with |>
We will use a tool called SelectorGadget to help us identify the HTML elements of interest by constructing a CSS selector which can be used to subset the HTML document.
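As a minimal sketch of what subsetting with a CSS selector looks like, the example below applies three selectors to a small HTML document like the one on the earlier slide. The selectors are written by hand here rather than found with SelectorGadget, and the object names (`page`, `all_names`, and so on) are illustrative assumptions.

```r
library(rvest)

# A small HTML document, inlined as a string
page <- read_html('
<html>
  <head><title>This is a title</title></head>
  <body>
    <p align="center">Hello world!</p>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>')

# ".name" selects every element with class "name"
all_names <- page |> html_elements(".name") |> html_text2()

# "#last" selects the element with id "last"
last_name <- page |> html_element("#last") |> html_text2()

# ".contact .home" selects elements with class "home" nested inside class "contact"
home_phones <- page |> html_elements(".contact .home") |> html_text2()

all_names
last_name
home_phones
```

`html_elements()` returns every match, while `html_element()` returns only the first one, so the singular form suits a selector that should identify a single element.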
html <- read_html(
  "<p>
  This is the first sentence in the paragraph.
  This is the second sentence that should be on the same line as the first sentence.<br>This third sentence should start on a new line.
  </p>"
)
html |> html_text()
[1] " \n This is the first sentence in the paragraph.\n This is the second sentence that should be on the same line as the first sentence.This third sentence should start on a new line.\n "
html |> html_text2()
[1] "This is the first sentence in the paragraph. This is the second sentence that should be on the same line as the first sentence.\nThis third sentence should start on a new line."
[[1]]
# A tibble: 3 × 3
a b c
<int> <int> <int>
1 1 2 3
2 2 3 4
3 3 4 5
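The code that produced the table above is not shown. One way to get output of this shape is with `html_table()`, sketched below under the assumption that the input was a small HTML table with columns `a`, `b`, and `c`. The `table_html` string is a hypothetical reconstruction, not the original input.

```r
library(rvest)

# Hypothetical input: an HTML table with a header row and three data rows
table_html <- read_html(
  "<table>
    <tr> <th>a</th> <th>b</th> <th>c</th> </tr>
    <tr> <td>1</td> <td>2</td> <td>3</td> </tr>
    <tr> <td>2</td> <td>3</td> <td>4</td> </tr>
    <tr> <td>3</td> <td>4</td> <td>5</td> </tr>
  </table>"
)

# html_elements() may match several tables, so html_table() returns a list of tibbles
tables <- table_html |>
  html_elements("table") |>
  html_table()

tables
```

Because the result is a list, an individual table is pulled out with `tables[[1]]`.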
SelectorGadget
SelectorGadget (selectorgadget.com) is a JavaScript-based tool that helps you interactively build an appropriate CSS selector for the content you are interested in.