Data visualization with ggplot2

Let’s load in the gapminder dataset.

# load the tidyverse, and gapminder dataset using read_csv()
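A minimal sketch of this step. The workshop's CSV path is not given here, so as an assumption the copy of the data shipped with the gapminder R package is written to a temporary file (`csv_path` is a hypothetical stand-in) and read back with read_csv().

```r
library(tidyverse)

# Assumption: the gapminder package copy stands in for the workshop CSV.
csv_path <- tempfile(fileext = ".csv")  # hypothetical stand-in path
write_csv(gapminder::gapminder, csv_path)

# Read the CSV into a tibble, suppressing the column-type message
gapminder <- read_csv(csv_path, show_col_types = FALSE)
glimpse(gapminder)
```

With your own copy of the file, you would pass its path to read_csv() directly.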

The ggplot2 library is loaded as a part of the tidyverse.

Using ggplot2 to visualize data

Let’s create our first visualization using ggplot2’s “layered grammar of graphics”.

To create a ggplot figure, you start by creating an empty ggplot2 canvas, to which you provide your dataset.

# apply ggplot() to gapminder to create an empty ggplot2 canvas
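One way this could look, assuming the gapminder package copy of the data stands in for the CSV loaded earlier:

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data

# ggplot() with only a dataset draws a blank canvas: no layers yet
p <- ggplot(gapminder)
p
```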

Then you add (with +) a “geom_” layer. For a scatterplot, this is geom_point().

Inside your geom layer, you need to specify the aesthetics using aes(), such as the x- and y-coordinates of the points.

# Add a scatterplot layer of gdpPercap (x) against lifeExp (y)
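A sketch of the scatterplot layer, again assuming the gapminder package copy of the data:

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data

# Map gdpPercap to x and lifeExp to y inside aes(), within the geom layer
p <- ggplot(gapminder) +
  geom_point(aes(x = gdpPercap, y = lifeExp))
p
```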

Exercise

Create a ggplot scatterplot of population against life expectancy.

Recreate the previous plot using only the data from the year 2007.

Hint: you can pipe the gapminder object into the ggplot function.

Defining ggplot2 aesthetics

We’ve seen the x and y point aesthetics, but there are many others too.

For example, you can specify the color of the points using the color aesthetic:

# define gapminder_2007 and create a scatterplot of gdpPercap (x) against lifeExp (y)
# where color is based on continent
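A possible version of this step, assuming the gapminder package copy of the data. Because color is mapped to a column, it goes inside aes().

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data

# Keep only the rows for 2007
gapminder_2007 <- gapminder |>
  filter(year == 2007)

# color = continent inside aes() gives each continent its own color
p <- ggplot(gapminder_2007) +
  geom_point(aes(x = gdpPercap, y = lifeExp, color = continent))
p
```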

To set an aesthetic to a single fixed value for every point, rather than mapping it to a column in your data, you need to specify it outside the aes() function.

# use gapminder_2007 to create a scatterplot of gdpPercap (x) against lifeExp (y)
# where ALL points are colored blue
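A sketch of a fixed color, assuming the gapminder package copy of the data. Here color is an argument of geom_point() itself, not of aes().

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data
gapminder_2007 <- gapminder |>
  filter(year == 2007)

# color = "blue" sits outside aes(), so it applies to every point
p <- ggplot(gapminder_2007) +
  geom_point(aes(x = gdpPercap, y = lifeExp), color = "blue")
p
```

If `color = "blue"` were placed inside aes(), ggplot2 would treat "blue" as a data value to map, not as the color to draw.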

Exercise

Copy your code above to create a scatterplot of gdpPercap (x) against lifeExp (y).

Specify the shape aesthetic of each point in two ways:

  1. Provide a different shape for each continent
  2. Make all points “square”

Specifying transparency

Sometimes when you have a lot of data points, you might want to add some transparency. You can do this using the alpha argument. alpha takes values between 0 and 1. alpha = 1 is not transparent at all, and alpha = 0 is completely transparent.

# add transparency to the 2007 scatterplot of gdpPercap (x) against lifeExp (y)
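A minimal sketch, assuming the gapminder package copy of the data. The value 0.4 is an arbitrary choice for illustration.

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data
gapminder_2007 <- gapminder |>
  filter(year == 2007)

# alpha is a fixed value for all points, so it goes outside aes()
p <- ggplot(gapminder_2007) +
  geom_point(aes(x = gdpPercap, y = lifeExp), alpha = 0.4)
p
```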

Exercise

Recreate the 2007 gdpPercap vs lifeExp plot in which you color by continent, size is determined by population, and the points have a transparency of 0.5.

Line plots

Let’s create a line plot of lifeExp by year for each country in the Americas continent.

# create a line plot of lifeExp for each country in the Americas
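One way to draw this, assuming the gapminder package copy of the data. The group aesthetic tells geom_line() to draw a separate line per country; without it, all points would be joined into a single zigzag line.

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data

# Keep only countries in the Americas
americas <- gapminder |>
  filter(continent == "Americas")

# One line per country, life expectancy over time
p <- ggplot(americas) +
  geom_line(aes(x = year, y = lifeExp, group = country))
p
```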

Exercise

Compute the average life expectancy for each continent and year, and then create a line plot of the average life expectancy for each continent over time.

Boxplots

Let’s create some boxplots of lifeExp for each continent.

# create boxplots of the lifeExp for each continent
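A sketch of the boxplots, assuming the gapminder package copy of the data:

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data

# A categorical x and a numeric y give one box per continent
p <- ggplot(gapminder) +
  geom_boxplot(aes(x = continent, y = lifeExp))
p
```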

Histograms

Let’s create a histogram of lifeExp.

# create a histogram of lifeExp
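A sketch of the histogram, assuming the gapminder package copy of the data. A histogram needs only an x aesthetic; the choice of 30 bins here is an assumption for illustration.

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data

# geom_histogram() counts how many lifeExp values fall into each bin
p <- ggplot(gapminder) +
  geom_histogram(aes(x = lifeExp), bins = 30)
p
```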

Bar charts

You can create a bar chart of counts by providing a categorical (character/factor) variable as your x-aesthetic to geom_bar().

# create a bar chart of the continent *counts*
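A sketch of the count bar chart, assuming the gapminder package copy of the data. Each bar's height is the number of rows for that continent.

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data

# geom_bar() counts the rows in each continent for you
p <- ggplot(gapminder) +
  geom_bar(aes(x = continent))
p
```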

If you want to create bar charts where you specify the height of each bar based on a variable in your data, you need to use geom_col() instead of geom_bar().

# create a bar chart of the average lifeExp for each continent using geom_col()
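A possible version, assuming the gapminder package copy of the data. The column name `mean_life_exp` follows the naming used later in these notes.

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data

# Compute one average per continent first
continent_means <- gapminder |>
  group_by(continent) |>
  summarise(mean_life_exp = mean(lifeExp))

# geom_col() uses the y values as bar heights directly
p <- ggplot(continent_means) +
  geom_col(aes(x = continent, y = mean_life_exp))
p
```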

Layering geom_layers

You can add multiple layers of geoms in the same plot.

# (from the exercise above) compute the average lifeExp for each continent-year
# then create a line plot of the mean_life_exp over time for each 
# continent, and then add the points on top of the line
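A sketch of two geoms on one plot, assuming the gapminder package copy of the data. The aesthetics are placed in ggplot() here so that both layers share them, which is one design choice among several.

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data

# Average life expectancy for each continent-year combination
mean_life <- gapminder |>
  group_by(continent, year) |>
  summarise(mean_life_exp = mean(lifeExp), .groups = "drop")

# Aesthetics set in ggplot() are inherited by every layer added after it
p <- ggplot(mean_life, aes(x = year, y = mean_life_exp, color = continent)) +
  geom_line() +
  geom_point()
p
```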

Getting fancy with ggplot2

Transformations

You can apply a log-scale transformation to an axis by adding a scale layer.

# for the 2007 gdpPercap-lifeExp scatterplot colored by continent
# add a log10 scale layer to the x-axis 
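A sketch using scale_x_log10(), assuming the gapminder package copy of the data:

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data
gapminder_2007 <- gapminder |>
  filter(year == 2007)

# scale_x_log10() spaces the x-axis on a log10 scale; the data are unchanged
p <- ggplot(gapminder_2007) +
  geom_point(aes(x = gdpPercap, y = lifeExp, color = continent)) +
  scale_x_log10()
p
```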

Labels

You can clean up the labels of your figure using the labs() function.

# take your previous plot, add nice labels using `labs()`
# save the ggplot2 object as my_scatter
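One way to label the plot, assuming the gapminder package copy of the data. The label and title wording below is illustrative only.

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data
gapminder_2007 <- gapminder |>
  filter(year == 2007)

# Each argument of labs() is named after the aesthetic it relabels
my_scatter <- ggplot(gapminder_2007) +
  geom_point(aes(x = gdpPercap, y = lifeExp, color = continent)) +
  scale_x_log10() +
  labs(
    x = "GDP per capita",
    y = "Life expectancy (years)",
    color = "Continent",
    title = "Life expectancy and GDP per capita, 2007"  # illustrative title
  )
my_scatter
```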

Themes

You can change the theme of your figure by adding a theme layer.

# try out a few themes layers: theme_classic(), theme_bw(), theme_dark()
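A sketch of trying several themes on the same plot, assuming the gapminder package copy of the data. The plot is rebuilt here as `base_plot` (a hypothetical name) so the example runs on its own.

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data

base_plot <- gapminder |>
  filter(year == 2007) |>
  ggplot() +
  geom_point(aes(x = gdpPercap, y = lifeExp, color = continent)) +
  scale_x_log10()

# Each theme layer restyles the non-data parts of the plot (background, grid, fonts)
base_plot + theme_classic()
base_plot + theme_bw()
p_dark <- base_plot + theme_dark()
p_dark
```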

Faceted grids

You can create a grid of plots using facet_wrap().

# create a grid of line plots of lifeExp over time for each country for each continent 
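A sketch of a faceted plot, assuming the gapminder package copy of the data. facet_wrap() takes a one-sided formula naming the variable that defines the panels.

```r
library(tidyverse)
gapminder <- gapminder::gapminder  # assumption: package copy of the data

# One panel per continent, one line per country within each panel
p <- ggplot(gapminder) +
  geom_line(aes(x = year, y = lifeExp, group = country)) +
  facet_wrap(~continent)
p
```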

Project exercise: world happiness

Load in the world happiness dataset (whr_2023.csv). Look at the data dictionary provided. Identify which variable indicates the country’s happiness score.

Note that there are many missing values (NA) in this data. If you want to compute the mean of a variable with missing values, you need to specify na.rm = TRUE. If you need to, you can also use the drop_na() function from tidyr (loaded as part of the tidyverse) to remove all rows with missing values, but this is not necessarily recommended.

mean(c(1, 4, NA, 2))
[1] NA
mean(c(1, 4, NA, 2), na.rm = TRUE)
[1] 2.333333
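The same idea inside a grouped summary can be sketched as follows. The tibble `scores` and its values are made up for illustration and are not taken from whr_2023.csv; only the column name life_ladder comes from these notes.

```r
library(tidyverse)

# Hypothetical toy data with one missing score
scores <- tibble(
  country = c("A", "A", "B", "B"),
  life_ladder = c(7.1, NA, 5.0, 5.4)
)

# na.rm = TRUE skips the NA within each group instead of returning NA
country_means <- scores |>
  group_by(country) |>
  summarise(mean_ladder = mean(life_ladder, na.rm = TRUE))
country_means

# drop_na() removes the whole row that contains the NA
complete_scores <- drop_na(scores)
complete_scores
```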

Conduct some explorations of the data using your dplyr and ggplot2 skills. Create at least one interesting polished plot. You are welcome to look at just one year, or even just one country!

Make sure that your plot has a clear takeaway message. Remember that less is sometimes more: just because you can add a billion things to your plot doesn’t mean that you should!

One idea: Look at Australia’s happiness score (life_ladder) over time.