# define a function called `add_one()` that adds 1 to its argument, x
<- function(x) {
add_one + 1
x }
Custom functions
A simple custom function
In this notebook, you will learn how to define your own functions.
The code that is run when you “call” your function is inside the curly parentheses.
# apply function add_one() to 5
add_one(5)
[1] 6
# apply function add_one() to 5 using a *named argument*
add_one(x = 5)
[1] 6
What happens when the body of your function contains two pieces of code?
# re-define add_one() with two lines of code: x - 1 and x + 1
<- function(x) {
add_one - 1
x + 1
x }
# apply add_one to 5 again
add_one(5)
[1] 6
By default, any custom R function you create will automatically return the final result that is computed.
For this reason, it is common to explicitly provide a “return statement”.
# redefine add_one() but apply the return statement to x - 1 only
<- function(x) {
add_one return(x - 1)
+ 1
x }
# apply add_one to 5 again
add_one(5)
[1] 4
The general syntax of a custom function
The general syntax of a custom function is shown below:
<- function(arg) {
fn_name # code involving arg
return(object)
}
Exercise
Write a function called “cube” that returns the cubic (^3
) of the argument
# define function cube()
<- function(value) {
cube = value^3
cubed_value return(cubed_value)
}
# apply cube() to 3
cube(value = 3)
[1] 27
Multiple arguments
To create a function with multiple arguments, you need to provide two arguments inside the function definition, separated by a comma. The function below has two arguments: x
and y
:
# define a function called add_x2y that computes x + 2y
<- function(x, y) {
add_x2y + 2*y
x }
When you call this add_x2y()
, you can then specify a value for each argument, separated by a comma:
# apply add_x2y() to x = 2 and y = 5
add_x2y(2, 5)
[1] 12
When you have multiple arguments, funny things can start to happen when you don’t name your arguments.
What do you think will happen if you first provide a named argument for the second argument y
by specifying y = 2
, and then provide an unnamed argument afterwards?
add_x2y(y = 2, 5)
[1] 9
R will be clever and it will first apply the named arguments by assigning y
to the value 2, and then fill in the remaining argument, so x
will be assigned to the value 5
.
It is good practice to name your arguments if there is any chance for ambiguity.
# run add_x2y() with named arguments x = 2 and y = 5
add_x2y(x = 2, y = 5)
[1] 12
Note that when you provide named arguments, it technically doesn’t matter which order you provide them in (but this is not true when you don’t provide named arguments!)
# run add_x2y() with named arguments y = 5 and x = 2
add_x2y(y = 5, x = 2)
[1] 12
Again, note that even when you provide named arguments in a function call (e.g., in add_x2y(x = 2, y = 5)
), you are not defining a global version of the arguments, i.e., there is no y
defined:
y
Error in eval(expr, envir, enclos): object 'y' not found
What do you think will happen when you call add_x2y
without any parentheses?
# run add_x2y
add_x2y
function(x, y) {
x + 2*y
}
<bytecode: 0x12902b270>
It will print out the definition of the function!
What do you think will happen when you call add_x2y
with parentheses but no arguments?
add_x2y()
Error in add_x2y(): argument "x" is missing, with no default
You get an error! We can fix this by providing “default” values for the arguments.
Default values
To provide default values for your arguments, you can assign the argument inside the function definition. In the example below, both x
and y
are given the default values of 1.
# redefine add_x2y() with default values for x and y (both 1)
<- function(x = 1, y = 1) {
add_x2y + 2*y
x }
# call add_x2y() without any arguments
add_x2y()
[1] 3
What happens if we only provide some of the arguments?
# run add_x2y() with a single unnamed argument, 4
add_x2y(4)
[1] 6
# run add_x2y() with a single argument: y = 3
add_x2y(y = 3)
[1] 7
Exercise
Without using the mean()
function, write a function called my_mean()
that takes four values as its arguments and computes their mean. Ensure that each argument has a default of 0. Use your function to compute the mean of the values 4, 5, 2, 1.
How would you modify your function to take a single vector of length 4 as its argument instead of four separate arguments?
Solution
<- function(a = 0, b = 0, c = 0, d = 0) {
my_mean return((a + b + c + d) / 4)
}
Note that when we run this function it is taking four separate arguments (rather than a vector, as the original mean function does):
my_mean(4, 5, 2, 1)
[1] 3
You can define a function that instead takes a single vector as its argument:
<- function(vec) {
my_mean_vec return(sum(vec) / length(vec))
}
and then to run this function, you provide a vector by wrapping the four values inside c()
:
my_mean_vec(c(4, 5, 2, 1))
[1] 3
What happens if you apply your original my_mean()
function to this vector?
my_mean(c(4, 5, 2, 1))
[1] 1.00 1.25 0.50 0.25
This is equivalent to assigning the argument a = c(4, 5, 2, 1)
, and then not providing any values for the b
, c
, and d
arguments (so their default values are used). So the above code is equivalent to:
# (a + b + c + d) / 4
c(4, 5, 2, 1) + 0 + 0 + 0) / 4 (
[1] 1.00 1.25 0.50 0.25
The fact that we haven’t told any of our functions what type of object the arguments should be means that the function will simply take whatever object it is given and try and run the code.
If/else statements
If statements
The following code defines a variable a
, and then runs some conditional code that is only run if the condition a == 1
is TRUE
.
# define a variable a containing 5
<- 5
a # if a is equal to 1, print out some text
if (a == 1) {
"a is 1"
}
# Then:
# 1. try the code again, with a equal to 1
# 2. replace the code that is run in the if statement with a mathematical computation
# 3. replace the code that is run in the if statement with a variable definition (e.g., define y <- 3)
<- 1
a # if a is equal to 1, print out some text
if (a == 1) {
<- 3
y }
Unlike for a function, if you define a variable inside an if statement, then this variable will have been defined in our “global” environment too.
# Does y exist globally?
y
Else statemenets
The “else” part of the “if/else” statement provides some code that will be run if the “if” condition is not TRUE
.
# define a variable "a" containing 3
<- 3
a # if "a" equals 1, define y <- 3, else define y <- 9
if (a == 1) {
<- 3
y else {
} <- 9
y }
# check what y was assigned to
y
[1] 9
Using “if” statements to provide custom errors for a function
The following uses an “if” statement to add a “stop condition” that throws an error when innapropriate arguments are provided
# add an "if" condition to throw an error if `a` is either non-numeric OR the length of `a` is greater than 1
<- function(a = 0, b = 0, c = 0, d = 0) {
my_mean # write your if statement here containing a stop() function
if (!is.numeric(a) | (length(a) != 1)) {
stop("'a' must be a numeric value of length 1")
}return((a + b + c + d) / 4)
}
Exercise
- Write a standalone if/else statement that checks whether the value in a variable called
age
is at least 18. If the age variable is 18 or older, your statement should return the character value of “You are eligible to vote” and if not, your statement should return “You are not eligible to vote”
<- 21
age if (age >= 18) {
"You are eligible to vote"
else {
} "You are not eligible to vote"
}
[1] "You are eligible to vote"
- Write a function that computes the area of a circle (
area = pi * radius^2
) with the radius as its argument, and throws an error if a non-numeric or a negative radius value is given.
<- function(radius) {
circle_area if (!is.numeric(radius) | radius < 0) {
stop("'radius' must be a non-negative numeric value")
}<- pi * radius^2
area return(area)
}
Test your function on the following code
# the following should throw an error:
circle_area("4")
Error in circle_area("4"): 'radius' must be a non-negative numeric value
circle_area(-4)
Error in circle_area(-4): 'radius' must be a non-negative numeric value
# the following should NOT throw an error:
circle_area(4)
[1] 50.26548
Default argument options with match.arg()
Below, we give an example of providing multiple options to an argument.
# define a function add_x2y with three arguments:
# x: a numeric value with default value 1
# y: a numeric value with default value 1
# output_type: a character value with default value "numeric" and alternative option "character"
# in the body of the function:
# use match.arg() to ensure that output_type is one of the options provided
# use an if statement to modify the output based on the value of output_type
<- function(x = 1, y = 1, output_type = c("numeric", "character")) {
add_x2y # this line will set the default value of output_type to be "numeric"
# and will only allow options provided in the default vector
<- match.arg(output_type)
output_type
# stop condition if x or y are not numeric
if (!is.numeric(x) | !is.numeric(y)) {
stop("'x' and 'y' must be numeric")
}
# computing my result
<- x + 2 * y
result
# returning result in the format specified by output_type
if (output_type == "numeric") {
return(result)
else if (output_type == "character") {
} return(as.character(result))
}
}
# apply add_x2y() to x = 2 and y = 3 and output_type = "character"
add_x2y(x = 2, y = 3, output_type = "character")
[1] "8"
# apply add_x2y() to x = 2 and y = 3 with no output_type specified
add_x2y(2, 3)
[1] 8
# apply add_x2y() to x = 2 and y = 3 and output_type = "logical"
add_x2y(2, 3, output_type = "logical")
Error in match.arg(output_type): 'arg' should be one of "numeric", "character"
Exercise
Write a function called calculate_area()
that will calculate the area of the shape specified in the argument shape
, which has options “circle”, “square”, and “triangle”. Your function will need to have the following additional arguments:
radius
for computing the area of a circle (pi * radius^2
),side
for computing the area of a square (side^2
),base
andheight
for computing the area of a triangle (height * base / 2
).
Arguments that are not always required should have a default value of NULL
. Your function should throw an error when you fail any value other than “circle”, “square” or “triangle” for the shape
argument (you can use match.arg()
to do this!).
Your function should throw an error when:
the ‘radius’ argument is not provided when
shape == "circle"
the ‘side’ argument is not provided when
shape == "square"
the ‘base’ and ‘height’ arguments are not provided when
shape == "triangle"
<- function(shape = c("circle", "square", "triangle"),
calculate_area radius = NULL,
side = NULL,
base = NULL,
height = NULL) {
= match.arg(shape)
shape # error statements
if (shape == "circle" & is.na(radius)) {
stop("'radius' required for 'circle' shape")
} if (shape == "square" & is.na(side)) {
stop("'side' required for 'square' shape")
}if (shape == "triangle" & (is.na(base) | is.na(height))) {
stop("'base' and 'height' required for 'triangle' shape")
}
# compute the area
if (shape == "circle") {
<- pi * radius^2
area else if (shape == "square") {
} <- side^2
area else if (shape == "triangle") {
} <- base * height / 2
area
}
return(area)
}
Tidy evaluation for writing tidyverse-style functions
Sometimes you want to write a function whose argument is a column of a data frame without quotes (i.e., in the tidyverse style) so that you can use it in a tidyverse function.
# load in the tidyverse and demographics NHANES data
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
<- read_csv("data/demographics.csv") demographics
Rows: 10175 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): interview_examination, gender, race, marital_status, pregnant
dbl (9): respondent_id, age_years, age_months_sc_0_2yr, six_month_period, ag...
lgl (5): served_active_duty_us, served_active_duty_foreign, born_usa, citize...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# take a look at the demographics data
head(demographics)
# A tibble: 6 × 19
respondent_id interview_examination gender age_years age_months_sc_0_2yr race
<dbl> <chr> <chr> <dbl> <dbl> <chr>
1 73557 both interview and e… male 69 NA black
2 73558 both interview and e… male 54 NA white
3 73559 both interview and e… male 72 NA white
4 73560 both interview and e… male 9 NA white
5 73561 both interview and e… female 73 NA white
6 73562 both interview and e… male 56 NA mexi…
# ℹ 13 more variables: six_month_period <dbl>, age_months_ex_0_19yr <dbl>,
# served_active_duty_us <lgl>, served_active_duty_foreign <lgl>,
# born_usa <lgl>, citizen_usa <lgl>, time_in_us <dbl>, education_youth <dbl>,
# education <dbl>, marital_status <chr>, pregnant <chr>,
# language_english <lgl>, household_income <dbl>
Notice that when we refer to the variables from our data frame in the context of the tidyverse (including ggplot2), we do not use quotes around the column names:
# compute boxplots for the age_years distribution across different levels of the language_english variable
|> ggplot() +
demographics geom_boxplot(aes(x = language_english,
y = age_years))
Let’s try and turn this into a function so we can create boxplots comparing the age distribution across all of the levels of other categorical variables without having to copy-and-paste the code over and over again.
In this first attempt, we use variable_name
as the argument:
# turn the above boxplot code into a function called createBoxplots()
<- function(variable_name) {
createBoxplots
|> ggplot() +
demographics geom_boxplot(aes(x = variable_name,
y = age_years))
}
# try to apply createBoxplots() to the language_english column
createBoxplots(language_english)
Error in `geom_boxplot()`:
! Problem while computing aesthetics.
ℹ Error occurred in the 1st layer.
Caused by error:
! object 'language_english' not found
createBoxplots("language_english")
The issue is that ggplot wants us to refer to the column names without strings, but our custom function doesn’t know how to find our column name variable that we provide.
We can use tidy evaluation ({ var_name }
) to solve this problem.
# update our createBoxplots() function so that it uses tidy_evaluation
<- function(variable_name) {
createBoxplots |> ggplot() +
demographics geom_boxplot(aes(x = {{ variable_name }},
y = age_years))
}
And now it works:
# apply createBoxplots() to the language_english column
createBoxplots(language_english)
Let’s use our function on a few different columns to make sure:
# apply createBoxplots() to the marital_status column
createBoxplots(marital_status)
# apply createBoxplots() to the pregnant column
createBoxplots(pregnant)
Tip: Using patchwork to patch together plots
The patchwork library can be used to create custom grids of ggplot objects (be sure to expand your plot window if the plots aren’t showing up properly):
# load the patchwork library
library(patchwork)
# create a grid of the language_english and marital_status boxplots
createBoxplots(language_english) + createBoxplots(marital_status)
# create a grid of the language_english and pregnant boxplots
# on top with the marital_status boxplots underneath
createBoxplots(language_english) + createBoxplots(pregnant)) / createBoxplots(marital_status) (
Exercise
Create a function called createOrderedBars() that takes a categorical column from demographics
as its argument and creates a bar chart for the average age for each level of the variable.
Hint: Some example code for computing the average age for each level of the marital_status
variable and creating a bar chart is shown below:
|>
demographics group_by(marital_status) |>
summarize(mean_age = mean(age_years)) |>
ggplot() +
geom_col(aes(x = marital_status,
y = mean_age))
Challenge activity: Arrange the bars in ascending order, and provide an argument called ascending
that allows the user to specify whether or not the bars will be arranged in ascending order.
Solution
Note that you’ll want to use geom_col()
instead of geom_bar()
to create bar charts where the bars have the specific height provided in the y-variable.
<- function(variable_name) {
createOrderedBars |>
demographics # group by the column provided
group_by({{ variable_name }}) |>
# compute the mean age
summarize(mean_age = mean(age_years)) |>
# create the bar plot
ggplot() +
geom_col(aes(x = {{ variable_name }},
y = mean_age))
}
createOrderedBars(marital_status)
Challenge solution
To arrange the bars in ascending order of mean_age
, you first need to convert the categorical variable itself to a factor whose levels are in the order that you want the bars to appear. You can do this by arranging the rows of your data frame in order of mean_age
and then using fct_inorder()
to convert your variable to a factor with levels match the increasing order of mean_age
.
<- function(variable_name, ascending = TRUE) {
createOrderedBars <- demographics |>
mean_age # group by the column provided
group_by({{ variable_name }}) |>
# compute the mean age
summarize(mean_age = mean(age_years))
if (ascending) {
<- mean_age |>
mean_age # arrange in increasing order of mean_age
arrange(mean_age) |>
# modify selected_variable so that it is a factor whose levels are in
# increasing order of mean_age
mutate(selected_variable = fct_inorder({{ variable_name }}))
else {
} <- mean_age |>
mean_age mutate(selected_variable = {{ variable_name }})
}
# create the bar plot
|>
mean_age ggplot() +
geom_col(aes(x = selected_variable,
y = mean_age))
}
createOrderedBars(marital_status, ascending = TRUE)
createOrderedBars(marital_status, ascending = FALSE)
createOrderedBars(marital_status)