Data visualization overview

Lecture 4

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Spring 2024

2024-01-25

Warm up

While you wait…

Questions from the prepare materials?

Announcements

  • Lab 1 due Monday morning at 8 am.

    • My office hours today after class + 2-3 pm in Old Chem 213

    • Lots of TA office hours, including over the weekend

    • Submitting late and want to use your one-time waiver? Email our course coordinator Dr. Mary Knox.

  • AEs this week should be submitted by midnight on Sunday. To “submit”, commit and push at least once to your ae repo for each application exercise this week.

  • Pilot: Ed Discussion threads for lecture, linked at the bottom of each slide.

Questions from last time

Many of the questions in Lab 1 are subjective. How does that work?

identify at least one outlier

identify at least one outlier ✅

identify at least one outlier ❌

Questions from last time

What are some situations where waffle plots are better than pie charts?

Let’s take a look at an example…

🥧 or 🧇?

Which of the following is a better representation for the number of counties in each midwestern state?

🥧 or 🧇 or ?

Which of the following is a better representation for the number of counties in each midwestern state?

midwest |> 
  count(state, sort = TRUE)
# A tibble: 5 × 2
  state     n
  <chr> <int>
1 IL      102
2 IN       92
3 OH       88
4 MI       83
5 WI       72
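
One candidate for the unnamed third option is a bar chart. A minimal sketch, assuming the midwest data frame that ships with ggplot2 (the source of the counts above). Sorting the bars by count and the axis labels are illustrative choices, not part of the original slide:

```r
library(dplyr)
library(ggplot2)

# Count counties per state, as in the output above
county_counts <- midwest |>
  count(state, sort = TRUE)

# Bar length encodes the number of counties;
# reorder() sorts the bars by count so states are easy to compare
p_bar <- ggplot(county_counts, aes(x = n, y = reorder(state, n))) +
  geom_col() +
  labs(x = "Number of counties", y = "State")

p_bar
```

With only five categories and counts this close together, bar lengths on a common axis are usually easier to compare than pie slices or waffle squares.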

From last time

Packages

library(palmerpenguins)
library(tidyverse)
library(ggthemes)

Bivariate analysis

# Side-by-side box plots
ggplot(penguins, aes(x = body_mass_g, y = species, fill = species)) +
  geom_boxplot(alpha = 0.5, show.legend = FALSE) +
  scale_fill_colorblind() +
  labs(
    x = "Body mass (grams)", y = "Species",
    title = "Side-by-side box plots"
  )

# Density plots
ggplot(penguins, aes(x = body_mass_g, fill = species)) +
  geom_density(alpha = 0.5) +
  theme(legend.position = "bottom") +
  scale_fill_colorblind() +
  labs(
    x = "Body mass (grams)", y = "Density",
    fill = "Species", title = "Density plots"
  )

Violin plots

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin()

Multiple geoms

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin() +
  geom_point()

Multiple geoms

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin() +
  geom_jitter()
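
geom_jitter() adds a small amount of random noise to each point so that overlapping observations become visible, which also means the plot changes slightly every time it is drawn. A sketch of one way to control this, assuming the palmerpenguins data loaded earlier. The width of 0.2, the seed value, and the name penguins_complete are arbitrary choices for illustration:

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Drop penguins with missing body mass so no rows are removed silently
penguins_complete <- penguins |>
  filter(!is.na(body_mass_g))

p_jitter <- ggplot(penguins_complete, aes(x = species, y = body_mass_g)) +
  geom_violin() +
  geom_jitter(
    # width: horizontal spread; height = 0 leaves body mass values untouched;
    # seed: makes the random noise reproducible
    position = position_jitter(width = 0.2, height = 0, seed = 199)
  )

p_jitter
```

Setting height = 0 matters here because body mass is the variable being measured: jittering it vertically would move points away from their true values.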

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter()

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter() +
  theme(
    legend.position = "none"
  )

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter() +
  theme(
    legend.position = "none"
  ) +
  scale_color_colorblind()

Multivariate analysis

Bechdel

Load the Bechdel test data with read_csv():

bechdel <- read_csv("https://sta199-s24.github.io/data/bechdel.csv")


View the column names() of the bechdel data:

names(bechdel)
[1] "title"       "year"        "gross_2013"  "budget_2013" "roi"         "binary"     
[7] "clean_test" 

ROI by test result

What about this plot makes it difficult to evaluate how ROI varies by Bechdel test result?

ggplot(bechdel, aes(x = roi, y = clean_test, color = binary)) +
  geom_boxplot()

Movies with high ROI

Which movies have the highest ROI?

bechdel |>
  filter(roi > 400) |>
  select(title, roi, budget_2013, gross_2013, year, clean_test)
# A tibble: 3 × 6
  title                     roi budget_2013 gross_2013  year clean_test
  <chr>                   <dbl>       <dbl>      <dbl> <dbl> <chr>     
1 Paranormal Activity      671.      505595  339424558  2007 dubious   
2 The Blair Witch Project  648.      839077  543776715  1999 ok        
3 El Mariachi              583.       11622    6778946  1992 nowomen   

ROI by test result

Zoom in: What about this plot makes it difficult to evaluate how ROI varies by Bechdel test result?

ggplot(bechdel, aes(x = roi, y = clean_test, color = binary)) +
  geom_boxplot() +
  coord_cartesian(xlim = c(0, 15))
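
coord_cartesian() zooms the view without discarding data, so the boxes are still computed from all movies. Setting limits with xlim() or ylim() instead drops out-of-range values before the statistics are computed, which can change the boxes. A small sketch on made-up numbers (not the Bechdel data), drawn vertically with ylim for simplicity:

```r
library(ggplot2)

# Made-up values with one extreme observation
toy <- data.frame(group = "a", value = c(1, 2, 3, 4, 100))

base <- ggplot(toy, aes(x = group, y = value)) +
  geom_boxplot()

# Zoom only: the extreme value still counts toward the box plot statistics
p_zoom <- base + coord_cartesian(ylim = c(0, 15))

# Scale limits: the extreme value is removed before the statistics are computed
p_clip <- base + ylim(0, 15)

# layer_data() returns the computed box plot statistics; `middle` is the median
median_zoom <- layer_data(p_zoom)$middle
median_clip <- suppressWarnings(layer_data(p_clip)$middle)
```

Comparing median_zoom and median_clip shows that the two approaches are not interchangeable: only coord_cartesian() keeps the median of the full data.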

Sneak preview…

to next week’s data wrangling pipelines…

Median ROI

bechdel |>
  summarize(median_roi = median(roi, na.rm = TRUE))
# A tibble: 1 × 1
  median_roi
       <dbl>
1       3.91

Median ROI by test result

bechdel |>
  group_by(clean_test) |>
  summarize(median_roi = median(roi, na.rm = TRUE))
# A tibble: 5 × 2
  clean_test median_roi
  <chr>           <dbl>
1 dubious          3.80
2 men              3.96
3 notalk           3.69
4 nowomen          3.27
5 ok               4.21

ROI by test result – zoom in

What does this plot say about return on investment for movies that pass the Bechdel test?

ggplot(bechdel, aes(x = roi, y = clean_test, color = binary)) +
  geom_boxplot() +
  coord_cartesian(xlim = c(0, 15)) +
  geom_vline(xintercept = 4.21, linetype = "dashed")
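
The 4.21 in the reference line is the median ROI of the ok group from the summary table. Rather than typing the number by hand, it could be computed and passed to geom_vline(). A sketch under stated assumptions: toy is a made-up stand-in for the bechdel data (not the real values) so the example runs without a download, and median_ok is a hypothetical name:

```r
library(dplyr)
library(ggplot2)

# Made-up stand-in for the bechdel data (NOT the real values)
toy <- tibble(
  clean_test = c("ok", "ok", "ok", "men", "men", "nowomen"),
  binary     = c("PASS", "PASS", "PASS", "FAIL", "FAIL", "FAIL"),
  roi        = c(2, 4, 9, 1, 5, NA)
)

# Median ROI among movies that pass the test, extracted as a single number
median_ok <- toy |>
  filter(clean_test == "ok") |>
  summarize(median_roi = median(roi, na.rm = TRUE)) |>
  pull(median_roi)

# Use the computed value for the dashed reference line
p_ref <- ggplot(toy, aes(x = roi, y = clean_test, color = binary)) +
  geom_boxplot() +
  geom_vline(xintercept = median_ok, linetype = "dashed")
```

If the data were updated, the reference line would move with it, which a hard-coded value would not.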

Application exercise

ae-03-duke-forest

If you’ve been here for a while:

and following along with the application exercises…

Go to the project navigator in RStudio (top right corner of your RStudio window) and open the project called ae. If there are any uncommitted files, commit them, and then click Pull.

If you’re new:

or haven’t been following along with the application exercises…

Go to the course GitHub org and find your ae repo. Clone the repo in your container, then open the Quarto document called ae-03-duke-forest.