Visualizing various types of data

Lecture 3

Dr. Mine Çetinkaya-Rundel

Duke University
STA 199 - Spring 2024

2024-01-23

Warm up

While you wait…

Questions from the prepare materials?

Questions from last time

Announcements

From last time

library(tidyverse)
── Attaching core tidyverse packages ────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(palmerpenguins)
library(ggthemes)

Violin plots

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin()
Warning: Removed 2 rows containing non-finite values (`stat_ydensity()`).

Multiple geoms

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin() +
  geom_point()
Warning: Removed 2 rows containing non-finite values (`stat_ydensity()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

Multiple geoms

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin() +
  geom_jitter()
Warning: Removed 2 rows containing non-finite values (`stat_ydensity()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).
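
geom_jitter() adds random noise to the point positions, so the points land in slightly different places each time the plot is drawn. A minimal sketch of how the jitter could be made reproducible and restricted to the horizontal direction; the width, height, and seed values are illustrative assumptions, not values from the lecture.

```r
library(ggplot2)
library(palmerpenguins)

# Jitter only horizontally (height = 0) so body mass values are not altered,
# and fix the seed so the points land in the same place on every render.
# width = 0.2 and seed = 199 are arbitrary choices for illustration.
p_jitter <- ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g
    )
  ) +
  geom_violin() +
  geom_jitter(
    position = position_jitter(width = 0.2, height = 0, seed = 199)
  )

p_jitter
```

Keeping height = 0 matters here because the y axis shows a measured value; vertical jitter would misplace the points relative to the body mass scale.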

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter()
Warning: Removed 2 rows containing non-finite values (`stat_ydensity()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

Multiple geoms + aesthetics

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter() +
  theme(
    legend.position = "none"
  )
Warning: Removed 2 rows containing non-finite values (`stat_ydensity()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

Multiple geoms + aesthetics

::: columns
::: {.column width="50%"}

ggplot(
  penguins,
  aes(
    x = species,
    y = body_mass_g,
    color = species
    )
  ) +
  geom_violin() +
  geom_jitter() +
  theme(
    legend.position = "none"
  ) +
  scale_color_colorblind()
Warning: Removed 2 rows containing non-finite values (`stat_ydensity()`).
Warning: Removed 2 rows containing missing values (`geom_point()`).

:::

:::

---
title: "Visualizing various types of data"
subtitle: "Lecture 3"
date: "January 23, 2024"
format: revealjs
---

Warm up

While you wait…

Questions from the prepare materials?

Questions from last time

  • Is there any code in the videos that is not in the readings? Yes and no. No substantial functionality is introduced in the videos that is not also in the readings; however, the examples in the videos are different from the ones in the readings.

  • What are all of the geoms we need to know? You don’t need to “memorize” or even “know” all of the geoms available in the ggplot2 package, but you can find a list of them on the ggplot2 cheat sheet or on the reference page.

  • Could you please clarify what situations it would be appropriate to use each geom function? Today’s topic! And think about it as “what plot should I make for which type of variable”.

Announcements

  • AEs this week should be submitted by midnight on Sunday. To “submit”, commit and push at least once to your ae repo for each application exercise this week.
  • If you email me, please put STA 199 in the subject.

From last time

ae-02-bechdel-dataviz

If you were in class last Thursday:

and followed along with the application exercise…

Go to the project navigator in RStudio (top right corner of your RStudio window) and open the project called ae. If there are any uncommitted files, commit them so you can start with a clean slate.

If you missed class last Thursday:

or didn’t follow along with the application exercise…

Go to the course GitHub org and find your ae repo. Clone the repo in your container, then open the Quarto document called ae-02-bechdel.

Recap of AE

  • Construct plots with ggplot().
  • Layers of ggplots are separated by +s.
  • The formula is (almost) always as follows:
ggplot(DATA, aes(x = X-VAR, y = Y-VAR, ...)) +
  geom_XXX()
  • Aesthetic attributes of geometries (color, size, transparency, etc.) can be mapped to variables in the data or set by the user, e.g. color = binary vs. color = "pink".
  • Use facet_wrap() when faceting (creating small multiples) by one variable and facet_grid() when faceting by two variables.
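
To make the recap concrete, here is a minimal sketch of the same ideas. It uses the penguins data from the palmerpenguins package instead of the data from the application exercise; that substitution is an assumption made so the sketch is self-contained. It contrasts mapping color to a variable with setting it to a fixed value, then facets by one and by two variables.

```r
library(ggplot2)
library(palmerpenguins)

# Mapped: color varies with a variable, so it goes inside aes()
p_mapped <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g, color = species)
  ) +
  geom_point()

# Set: every point gets the same color, so it goes outside aes()
p_set <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g)
  ) +
  geom_point(color = "pink")

# Facet by one variable with facet_wrap()
p_wrap <- p_set + facet_wrap(~species)

# Facet by two variables with facet_grid() (rows ~ columns)
p_grid <- p_set + facet_grid(island ~ species)
```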

Visualizing various types of data

Identifying variable types

Identify the type of each of the following variables.

  • Favorite food
  • Number of classes you’re taking this semester
  • Zip code
  • Age

The way data is displayed matters

What do these three plots show?

Visualizing penguins

library(tidyverse)
library(palmerpenguins)
library(ggthemes)

penguins
# A tibble: 344 × 8
   species island   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex    year
   <fct>   <fct>             <dbl>         <dbl>             <int>       <int> <fct> <int>
 1 Adelie  Torgers…           39.1          18.7               181        3750 male   2007
 2 Adelie  Torgers…           39.5          17.4               186        3800 fema…  2007
 3 Adelie  Torgers…           40.3          18                 195        3250 fema…  2007
 4 Adelie  Torgers…           NA            NA                  NA          NA <NA>   2007
 5 Adelie  Torgers…           36.7          19.3               193        3450 fema…  2007
 6 Adelie  Torgers…           39.3          20.6               190        3650 male   2007
 7 Adelie  Torgers…           38.9          17.8               181        3625 fema…  2007
 8 Adelie  Torgers…           39.2          19.6               195        4675 male   2007
 9 Adelie  Torgers…           34.1          18.1               193        3475 <NA>   2007
10 Adelie  Torgers…           42            20.2               190        4250 <NA>   2007
# ℹ 334 more rows
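
Row 4 of the printout has NA for every measurement. A quick sketch of how such missing values could be counted, which is one way to see where the "Removed 2 rows" warnings in the plotting output come from:

```r
library(palmerpenguins)

# Number of penguins with a missing body mass
n_missing_mass <- sum(is.na(penguins$body_mass_g))
n_missing_mass

# Number of missing values in every column
colSums(is.na(penguins))
```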

Univariate analysis

Univariate analysis

Analyzing a single variable:

  • Numerical: histogram, box plot, density plot, etc.

  • Categorical: bar plot, pie chart, etc.
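
The step-by-step slides that follow cover the numerical case. For the categorical case, a minimal sketch of a bar plot of species is shown here; this example is an addition for illustration and not part of the original slides.

```r
library(ggplot2)
library(palmerpenguins)

# geom_bar() counts the rows in each category, so only x is mapped
p_bar <- ggplot(
  penguins,
  aes(x = species)
  ) +
  geom_bar() +
  labs(
    x = "Species",
    y = "Count"
  )

p_bar
```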

Histogram - Step 1

ggplot(
  penguins
  )

Histogram - Step 2

ggplot(
  penguins,
  aes(x = body_mass_g)
  )

Histogram - Step 3

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

Histogram - Step 4

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_histogram(
    binwidth = 250
  )
Warning: Removed 2 rows containing non-finite values (`stat_bin()`).

Histogram - Step 5

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_histogram(
    binwidth = 250
  ) +
  labs(
    title = "Weights of penguins",
    x = "Weight (grams)",
    y = "Count"
  )
Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
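
The message in Step 3 suggests picking a bin width rather than relying on the default of 30 bins. As a rough sketch of the trade-off, the two plots below use a narrow and a wide bin width; the values 100 and 1000 are arbitrary choices for illustration, not values from the lecture.

```r
library(ggplot2)
library(palmerpenguins)

# Shared base plot: data and aesthetic mapping only
base <- ggplot(penguins, aes(x = body_mass_g))

# Narrow bins: more detail, but a noisier shape
p_narrow <- base + geom_histogram(binwidth = 100)

# Wide bins: smoother, but features of the distribution may be hidden
p_wide <- base + geom_histogram(binwidth = 1000)

p_narrow
p_wide
```

Comparing a few bin widths like this is a common way to check that a feature of the histogram is not just a product of where the bin edges fall.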

Boxplot - Step 1

ggplot(
  penguins
  )

Boxplot - Step 2

ggplot(
  penguins,
  aes(x = body_mass_g)
  )

Boxplot - Step 3

ggplot(
  penguins,
  aes(y = body_mass_g)
  ) +
  geom_boxplot()
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

Boxplot - Step 4

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_boxplot()
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

Boxplot - Step 5

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_boxplot() +
  labs(
    x = "Weight (grams)",
    y = NULL
  )
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

Density plot - Step 1

ggplot(
  penguins
  )

Density plot - Step 2

ggplot(
  penguins,
  aes(x = body_mass_g)
  )

Density plot - Step 3

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density()
Warning: Removed 2 rows containing non-finite values (`stat_density()`).

Density plot - Step 4

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1"
  )
Warning: Removed 2 rows containing non-finite values (`stat_density()`).

Density plot - Step 5

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1",
    linewidth = 2
  )
Warning: Removed 2 rows containing non-finite values (`stat_density()`).

Density plot - Step 6

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1",
    linewidth = 2,
    color = "darkorchid3"
  )
Warning: Removed 2 rows containing non-finite values (`stat_density()`).

Density plot - Step 7

ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density(
    fill = "darkslategray1",
    linewidth = 2,
    color = "darkorchid3",
    alpha = 0.5
  )
Warning: Removed 2 rows containing non-finite values (`stat_density()`).

Weights of penguins

[Plots: histogram, box plot, and density plot of penguin body mass]

TRUE / FALSE

  • The distribution of penguin weights in this sample is left skewed.
  • The distribution of penguin weights in this sample is unimodal.

Bivariate analysis

Bivariate analysis

Analyzing the relationship between two variables:

  • Numerical + numerical: scatterplot

  • Numerical + categorical: side-by-side box plots, violin plots, etc.

  • Categorical + categorical: stacked bar plots

  • Using an aesthetic (e.g., fill, color, shape, etc.) or facets to represent the second variable in any plot
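
The slides that follow focus on the numerical + categorical case. As a minimal sketch of the other two combinations in the list, the code below draws a scatterplot and a stacked bar plot from the penguins data; the choice of variables is an assumption made for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Numerical + numerical: scatterplot
p_scatter <- ggplot(
  penguins,
  aes(
    x = flipper_length_mm,
    y = body_mass_g
    )
  ) +
  geom_point()

# Categorical + categorical: stacked bar plot,
# with the second variable mapped to the fill aesthetic
p_stacked <- ggplot(
  penguins,
  aes(
    x = island,
    fill = species
    )
  ) +
  geom_bar()

p_scatter
p_stacked
```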

Side-by-side box plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    y = species
    )
  ) +
  geom_boxplot()
Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species
    )
  ) +
  geom_density()
Warning: Removed 2 rows containing non-finite values (`stat_density()`).

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species,
    fill = species
    )
  ) +
  geom_density()
Warning: Removed 2 rows containing non-finite values (`stat_density()`).

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species,
    fill = species
    )
  ) +
  geom_density(
    alpha = 0.5
  )
Warning: Removed 2 rows containing non-finite values (`stat_density()`).

Density plots

ggplot(
  penguins,
  aes(
    x = body_mass_g,
    color = species,
    fill = species
    )
  ) +
  geom_density(
    alpha = 0.5
  ) +
  theme(
    legend.position = "bottom"
  )
Warning: Removed 2 rows containing non-finite values (`stat_density()`).
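
The density plots above represent species with the color and fill aesthetics. Facets are the other option named in the bivariate analysis list; a minimal sketch, assuming a single column of panels is wanted so the three distributions share one x axis:

```r
library(ggplot2)
library(palmerpenguins)

# One panel per species instead of overlapping curves;
# ncol = 1 stacks the panels so they line up on a common x axis
p_facet <- ggplot(
  penguins,
  aes(x = body_mass_g)
  ) +
  geom_density() +
  facet_wrap(~species, ncol = 1)

p_facet
```

Overlaid curves make direct comparison easy when there are few groups, while facets tend to stay readable as the number of groups grows.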