Data visualization with ggplot2

Let’s load the gapminder dataset.

# load the tidyverse, and gapminder dataset using read_csv()
library(tidyverse)
gapminder <- read_csv("data/gapminder.csv")
gapminder
# A tibble: 1,704 × 6
   country     continent  year lifeExp      pop gdpPercap
   <chr>       <chr>     <dbl>   <dbl>    <dbl>     <dbl>
 1 Afghanistan Asia       1952    28.8  8425333      779.
 2 Afghanistan Asia       1957    30.3  9240934      821.
 3 Afghanistan Asia       1962    32.0 10267083      853.
 4 Afghanistan Asia       1967    34.0 11537966      836.
 5 Afghanistan Asia       1972    36.1 13079460      740.
 6 Afghanistan Asia       1977    38.4 14880372      786.
 7 Afghanistan Asia       1982    39.9 12881816      978.
 8 Afghanistan Asia       1987    40.8 13867957      852.
 9 Afghanistan Asia       1992    41.7 16317921      649.
10 Afghanistan Asia       1997    41.8 22227415      635.
# ℹ 1,694 more rows

The ggplot2 package is loaded as part of the tidyverse.

Using ggplot2 to visualize data

Let’s create our first visualization using ggplot2’s “layered grammar of graphics”.

To create a ggplot figure, you start by creating an empty ggplot2 canvas and providing your dataset to it.

ggplot(gapminder)

Then you add (with +) a “geom_” layer. For a scatterplot, this is geom_point().

Inside your geom layer, you need to specify the aesthetics using aes(), such as the x- and y-coordinates of the points.

# create a scatterplot of gdpPercap (x) against lifeExp (y)
ggplot(gapminder) + 
  geom_point(aes(x = gdpPercap, y = lifeExp))

Exercise

Create a ggplot scatterplot of population (x) against life expectancy (y).

ggplot(gapminder) + 
  geom_point(aes(x = pop, y = lifeExp))

Exercise

Recreate the previous plot using only the data from the year 2007.

Hint: you can pipe the gapminder object into the ggplot function.

gapminder |> 
  filter(year == 2007) |> 
  ggplot() + 
  geom_point(aes(x = pop, y = lifeExp))

# alternatively, save the filtered data as gapminder_2007 (reused in later sections)
gapminder_2007 <- gapminder |> filter(year == 2007)
ggplot(gapminder_2007) + 
  geom_point(aes(x = pop, y = lifeExp))

Defining ggplot2 aesthetics

We’ve seen the x and y point aesthetics, but there are many others too.

For example, you can specify the color of the points using the color aesthetic:

# use gapminder_2007 to create a scatterplot of gdpPercap (x) against lifeExp (y)
# where color is based on continent
ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp, color = continent))

To set an aesthetic to a fixed value that does not depend on a column in your data, specify it outside the aes() function.

# use gapminder_2007 to create a scatterplot of gdpPercap (x) against lifeExp (y)
# where all points are colored blue
ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp), color = "blue")
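A common slip is to put a fixed color inside aes(). The sketch below contrasts the two placements. It uses a small made-up data frame (toy, with invented values) in place of gapminder_2007, and layer_data() to inspect the colors ggplot2 actually assigns to the points.

```r
library(ggplot2)

# Made-up stand-in for gapminder_2007 (values are illustrative only)
toy <- data.frame(
  gdpPercap = c(1000, 5000, 20000),
  lifeExp   = c(50, 65, 80)
)

# Fixed color, set outside aes(): every point is drawn in blue
p_fixed <- ggplot(toy) +
  geom_point(aes(x = gdpPercap, y = lifeExp), color = "blue")

# "blue" inside aes() is treated as data (a one-level category),
# so ggplot2 picks a color from its default palette and adds a legend
p_mapped <- ggplot(toy) +
  geom_point(aes(x = gdpPercap, y = lifeExp, color = "blue"))

# layer_data() returns the values ggplot2 computed for the first layer
colors_fixed  <- unique(layer_data(p_fixed)$colour)
colors_mapped <- unique(layer_data(p_mapped)$colour)
```

Here colors_fixed is "blue", while colors_mapped is whatever the default palette assigns to a single category. If your points come out in an unexpected color with a legend entry called "blue", this is usually the cause.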

Exercise

Specify the shape aesthetic of each point in two ways:

  1. Provide a different shape for each continent

  2. Make all points “square”

# 1
ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, 
                 y = lifeExp, 
                 shape = continent))

# 2 
ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, 
                 y = lifeExp), 
             shape = "square")

Specifying transparency

Sometimes when you have a lot of data points, you might want to add some transparency. You can do this using the alpha argument. alpha takes values between 0 and 1. alpha = 1 is not transparent at all, and alpha = 0 is completely transparent.

# add transparency to the 2007 scatterplot of gdpPercap (x) against lifeExp (y)
ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp), 
             alpha = 0.5)

Exercise

Recreate the 2007 gdpPercap vs lifeExp plot so that color is determined by continent, size is determined by population, and the points have a transparency of 0.5.

ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, y = lifeExp, 
                 color = continent, size = pop), 
             alpha = 0.5)

Line plots

Let’s create a line plot of lifeExp by year for each country in the Americas.

# create a line plot for each country in the Americas
gapminder |> 
  filter(continent == "Americas") |> 
  ggplot() + 
  geom_line(aes(x = year, 
                y = lifeExp, 
                # if you want separate lines, you need to provide a group variable
                group = country))
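To see what the group aesthetic changes, the sketch below draws the same data with and without it. The data frame toy and its two countries are invented for illustration; layer_data() is used only to count how many separate lines ggplot2 will draw.

```r
library(ggplot2)

# Made-up data: two hypothetical countries observed in three years
toy <- data.frame(
  country = rep(c("Country A", "Country B"), each = 3),
  year    = rep(c(1997, 2002, 2007), times = 2),
  lifeExp = c(60, 62, 64, 70, 71, 73)
)

# Without group: all six points are joined into a single zig-zag line
p_single <- ggplot(toy) +
  geom_line(aes(x = year, y = lifeExp))

# With group = country: one line per country
p_grouped <- ggplot(toy) +
  geom_line(aes(x = year, y = lifeExp, group = country))

# Count the distinct groups (that is, lines) in each plot
n_lines_single  <- length(unique(layer_data(p_single)$group))
n_lines_grouped <- length(unique(layer_data(p_grouped)$group))
```

Mapping a categorical variable to another aesthetic such as color also splits the data into groups, so an explicit group is only needed when you want separate lines that otherwise look the same.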

Exercise

Compute the average life expectancy for each continent for each year, and then create a line plot of the average life expectancy for each continent over time.

gapminder |> 
  group_by(continent, year) |> 
  summarize(mean_life_exp = mean(lifeExp)) |> 
  ggplot() +
  geom_line(aes(x = year, 
                y = mean_life_exp, 
                color = continent))
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
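The message printed above is informational rather than an error: after summarizing, the result is still grouped by continent. A minimal sketch of one way to quiet it is to pass .groups = "drop" to summarize(), which returns an ungrouped tibble. The toy data below is invented for illustration.

```r
library(dplyr)

# Made-up data: two hypothetical continents, two years, two countries each
toy <- tibble(
  continent = c("A", "A", "A", "A", "B", "B", "B", "B"),
  year      = c(2002, 2002, 2007, 2007, 2002, 2002, 2007, 2007),
  lifeExp   = c(50, 60, 52, 62, 70, 80, 72, 82)
)

# .groups = "drop" removes all grouping from the result,
# so summarize() no longer prints the grouping message
mean_by_year <- toy |>
  group_by(continent, year) |>
  summarize(mean_life_exp = mean(lifeExp), .groups = "drop")
```

The resulting mean_by_year has one row per continent-year combination and can be piped into ggplot() exactly as before.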

Boxplots

Let’s create some boxplots of lifeExp for each continent.

# create boxplots of the lifeExp for each continent
ggplot(gapminder) + 
  geom_boxplot(aes(x = continent, y = lifeExp))

Histograms

Let’s create a histogram of lifeExp.

# create a histogram of lifeExp
ggplot(gapminder) + 
  geom_histogram(aes(x = lifeExp))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
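The message above is a reminder that 30 bins is only a default. As a sketch, you can choose the bin width yourself with the binwidth argument (or the number of bins with bins). The width of 5 years and the toy data below are arbitrary choices for illustration.

```r
library(ggplot2)

# Made-up life expectancy values spread evenly between 30 and 80
toy <- data.frame(lifeExp = seq(30, 80, length.out = 200))

# Each bar now covers a 5-year range of life expectancy
p_hist <- ggplot(toy) +
  geom_histogram(aes(x = lifeExp), binwidth = 5)

# Inspect the computed bins: xmin/xmax are the bin edges, count is the bar height
bins <- layer_data(p_hist)
bin_widths <- bins$xmax - bins$xmin
```

Try a few values: bins that are too wide hide the shape of the distribution, while bins that are too narrow make it look noisy.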

Bar charts

You can create a bar chart of counts by providing a categorical (character/factor) variable as your x-aesthetic to geom_bar().

# create a bar chart of the continent *counts*
ggplot(gapminder) +
  geom_bar(aes(x = continent))

If you want to create bar charts where you specify the height of each bar based on a variable in your data, you need to use geom_col() instead of geom_bar().

# create a bar chart of the average lifeExp for each continent using geom_col()
gapminder |>
  group_by(continent) |>
  summarize(mean_life_exp = mean(lifeExp)) |>
  ggplot() +
  geom_col(aes(x = continent, y = mean_life_exp))
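To make the difference concrete, the sketch below builds the same count chart both ways: geom_bar() counts the rows for you, while geom_col() plots heights you have already computed (here with count() from dplyr). The toy data is invented for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up data: six rows across three continents
toy <- tibble(
  continent = c("Asia", "Asia", "Europe", "Africa", "Africa", "Africa")
)

# geom_bar() counts the rows in each continent itself
p_bar <- ggplot(toy) +
  geom_bar(aes(x = continent))

# geom_col() needs the heights supplied as a column (n, computed by count())
p_col <- toy |>
  count(continent) |>
  ggplot() +
  geom_col(aes(x = continent, y = n))

# Bar heights computed by each approach, in x-axis (alphabetical) order
heights_bar <- layer_data(p_bar)$count
heights_col <- layer_data(p_col)$y
```

Both plots show bars of height 3, 2, and 1 for Africa, Asia, and Europe.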

Layering geom_layers

You can add multiple layers of geoms in the same plot.

# (from the exercise above) compute the average lifeExp for each continent-year 
# combination, then create a line plot of mean_life_exp over time for each 
# continent, and then add points on top of the lines
gapminder |> 
  group_by(continent, year) |> 
  summarize(mean_life_exp = mean(lifeExp)) |> 
  ggplot(aes(x = year, 
             y = mean_life_exp, 
             color = continent)) +
  geom_line() + 
  geom_point()
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.

Getting fancy with ggplot2

Transformations

You can apply a log-scale transformation to an axis by adding a scale layer.

# for the 2007 gdpPercap-lifeExp scatterplot colored by continent
# add a log10 scale layer to the x-axis 
ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, 
                 y = lifeExp, 
                 color = continent)) + 
  scale_x_log10()
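As a rough illustration of what the scale layer does, the sketch below uses made-up values that are each ten times larger than the last. On a log10 scale they are positioned at equal steps, which layer_data() lets you confirm.

```r
library(ggplot2)

# Made-up data: gdpPercap grows by a factor of 10 each row
toy <- data.frame(
  gdpPercap = c(100, 1000, 10000),
  lifeExp   = c(45, 60, 75)
)

p_log <- ggplot(toy) +
  geom_point(aes(x = gdpPercap, y = lifeExp)) +
  scale_x_log10()

# Positions used for plotting are log10(gdpPercap): 2, 3 and 4
x_plotted <- layer_data(p_log)$x
```

The axis labels still show the original units, so readers see 100, 1000, and 10000 rather than 2, 3, and 4. There is a matching scale_y_log10() for the y-axis.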

Labels

You can clean up the labels of your figure using the labs() function.

# take your previous plot, add nice labels using `labs()`
# save the ggplot2 object as my_scatter
my_scatter <- ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, 
                 y = lifeExp, 
                 color = continent)) + 
  scale_x_log10() + 
  labs(x = "GDP per capita", y = "Life expectancy", title = "GDP per cap vs life expectancy")
my_scatter

Themes

You can change the theme of your figure by adding a theme layer.

# try out a few theme layers: theme_classic(), theme_bw(), theme_dark()
my_scatter + theme_classic()

my_scatter + theme_bw()

my_scatter + theme_dark()

ggplot(gapminder_2007) + 
  geom_point(aes(x = gdpPercap, 
                 y = lifeExp, 
                 color = continent)) + 
  scale_x_log10() + 
  labs(x = "GDP per capita", y = "Life expectancy", title = "GDP per cap vs life expectancy") + 
  theme_dark()

Faceted grids

You can create a grid of plots using facet_wrap().

# create a line plot of lifeExp over time for each country, separately for each continent
ggplot(gapminder) + 
  geom_line(aes(x = year, y = lifeExp, group = country),
            alpha = 0.2) + 
  facet_wrap(~continent, ncol = 2)

Project exercise: world happiness

Load in the world happiness dataset (whr_2023.csv). Look at the data dictionary provided. Identify which variable indicates the country’s happiness score.

Note that there are many missing values (NA) in this data. If you want to compute the mean of a variable with missing values, you need to specify na.rm = TRUE. If you need to, you can also use the drop_na() function (from tidyr, which is loaded with the tidyverse) to remove all rows with missing values, but this is not necessarily recommended because it discards every row that has an NA in any column.

mean(c(1, 4, NA, 2))
[1] NA
mean(c(1, 4, NA, 2), na.rm = TRUE)
[1] 2.333333
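The same idea carries over to grouped summaries. The sketch below uses an invented toy tibble whose column names mimic the happiness data; it computes a per-country mean with na.rm = TRUE and shows how much data drop_na() removes with and without naming a column.

```r
library(dplyr)
library(tidyr)

# Made-up data with missing values in two different columns
toy <- tibble(
  country_name   = c("A", "A", "A", "B", "B"),
  life_ladder    = c(NA, 5, 7, 6, NA),
  social_support = c(0.5, NA, 0.7, 0.8, 0.9)
)

# na.rm = TRUE ignores the NAs within each country when averaging
mean_scores <- toy |>
  group_by(country_name) |>
  summarize(mean_ladder = mean(life_ladder, na.rm = TRUE))

# drop_na() with no arguments removes rows with an NA in ANY column
complete_rows <- toy |> drop_na()

# Naming a column only removes rows where that column is NA
ladder_rows <- toy |> drop_na(life_ladder)
```

Here complete_rows keeps 2 of the 5 rows, while ladder_rows keeps 3, which is why restricting drop_na() to the columns you actually use is usually the safer choice.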

Conduct some explorations of the data using your dplyr and ggplot2 skills. Create at least one interesting polished plot. You are welcome to look at just one year, or even just one country!

Make sure that your plot has a clear takeaway message. Remember that less is sometimes more: just because you can add a billion things to your plot, doesn’t mean that you should!

One idea: Look at Australia’s happiness score (life_ladder) over time.

happiness <- read_csv("data/whr_2023.csv")
Rows: 2970 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): country_name
dbl (10): year, life_ladder, log_GDP_per_capita, social_support, healthy_lif...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
happiness
# A tibble: 2,970 × 11
   country_name  year life_ladder log_GDP_per_capita social_support
   <chr>        <dbl>       <dbl>              <dbl>          <dbl>
 1 Afghanistan   2005       NA                 NA            NA    
 2 Afghanistan   2006       NA                 NA            NA    
 3 Afghanistan   2007       NA                 NA            NA    
 4 Afghanistan   2008        3.72               7.35          0.451
 5 Afghanistan   2009        4.40               7.51          0.552
 6 Afghanistan   2010        4.76               7.61          0.539
 7 Afghanistan   2011        3.83               7.58          0.521
 8 Afghanistan   2012        3.78               7.66          0.521
 9 Afghanistan   2013        3.57               7.68          0.484
10 Afghanistan   2014        3.13               7.67          0.526
# ℹ 2,960 more rows
# ℹ 6 more variables: healthy_life_expectancy_at_birth <dbl>,
#   freedom_to_make_life_choices <dbl>, generosity <dbl>,
#   perceptions_of_corruption <dbl>, positive_affect <dbl>,
#   negative_affect <dbl>
happiness |>
  filter(country_name == "Australia", year >= 2010) |>
  ggplot() +
  geom_line(aes(x = year, y = life_ladder),
            color = "firebrick", linewidth = 1.1) +
  theme_classic() +
  labs(x = "Year", y = "Happiness score", title = "Australia's decreasing happiness trend")