Loading data
Creating a data set
Let’s create a dataset with three variables.
If we only had a little bit of data, we could define a vector for each variable in our data:
# create three vectors containing: age, state, and diabetes status
age_vec <- c(29, 35, 36, 21, 42, 39, 52, 35, 30, 44)
state_vec <- c("CA", "FL", "PA", "NY", "UT", "UT", "MT", "CO", "NV", "WY")
diabetes_vec <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
And we could place these three vectors into a single object called a “data frame”:
# create a data frame called patient_data with data.frame
# with three columns: age, state, and diabetes
# print out patient_data
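One possible way to fill in this step is sketched below. It assumes the three vectors from this section (repeated here so the block runs on its own) and takes the column names from the comment above.

```r
# vectors from this section, repeated so the block is self-contained
age_vec <- c(29, 35, 36, 21, 42, 39, 52, 35, 30, 44)
state_vec <- c("CA", "FL", "PA", "NY", "UT", "UT", "MT", "CO", "NV", "WY")
diabetes_vec <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)

# combine the vectors into a data frame, one column per vector
patient_data <- data.frame(age = age_vec,
                           state = state_vec,
                           diabetes = diabetes_vec)

# print out patient_data
patient_data
```

All three vectors must have the same length, because each one becomes a column and each position becomes a row.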
You can look at a summary of the data frame using str():
# use str() to look at the data frame
# what is the "class" of the data frame?
Each column in a data frame can have a different type, but each entry within a single column must be a single type (because each column corresponds to a vector).
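As a rough illustration of the point about column types, the sketch below builds a three-row version of the patient data (a shortened stand-in, so the block runs on its own) and inspects it. It assumes R 4.0 or later, where text columns stay as character by default.

```r
# a small data frame with one numeric, one character, and one logical column
patient_data <- data.frame(age = c(29, 35, 36),
                           state = c("CA", "FL", "PA"),
                           diabetes = c(TRUE, FALSE, FALSE))

# compact summary: dimensions, plus the type and first values of each column
str(patient_data)

# the class of the whole object
class(patient_data)

# the class of each individual column
sapply(patient_data, class)

# mixing types inside one vector forces everything to a single common type
mixed_vec <- c(29, "CA", TRUE)
class(mixed_vec)
```

The last two lines show why a column cannot mix types: a vector that combines a number, a string, and a logical value is stored entirely as character.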
CSV data files
CSV files are one of the simplest data formats.
CSV stands for “comma-separated values”. In a CSV file:
Every entry in a row is separated by a comma
New rows are created by starting a new line
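To make the two rules concrete, here is a small sketch that prints the raw text of a tiny CSV and splits one row on its commas. The contents are a made-up stand-in based on the patient data, not the gapminder file.

```r
# each element is one line of the file; commas separate the entries in a row
csv_lines <- c("age,state,diabetes",
               "29,CA,TRUE",
               "35,FL,FALSE")

# print the lines as they would appear inside the file
writeLines(csv_lines)

# split the first data row on commas to recover its individual entries
first_row <- strsplit(csv_lines[2], ",")[[1]]
first_row
```

The first line holds the column names, and every following line is one row of data.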
Take a look at the data/gapminder.csv file.
Loading CSV files
To load a dataset (as a data frame) from a CSV file, we can use the read.csv() function:
# load the data/gapminder.csv file without saving it
# load in the file and save it as gapminder
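The sketch below shows the difference between the two steps. So that it runs anywhere, it writes a small stand-in CSV to a temporary file instead of relying on data/gapminder.csv; the file contents and the name patient_csv are illustrative assumptions.

```r
# write a small stand-in CSV to a temporary file (hypothetical data, not gapminder)
tmp_path <- tempfile(fileext = ".csv")
writeLines(c("age,state,diabetes",
             "29,CA,TRUE",
             "35,FL,FALSE",
             "36,PA,FALSE"),
           tmp_path)

# load the file without saving it: the data frame is printed, then discarded
read.csv(tmp_path)

# load the file and save it as an object so that it can be reused later
patient_csv <- read.csv(tmp_path)
patient_csv
```

With the workshop file, the same pattern would presumably be gapminder <- read.csv("data/gapminder.csv").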
The working directory
If R cannot find your file, you may be in the wrong working directory (the folder on your computer from which relative file paths are resolved).
If you opened an R project, then your working directory will be the location of the project folder.
To change the working directory, use the “Session > Set Working Directory” menu.
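A few base R functions can help diagnose a "file not found" problem from the console. This is only a sketch: the path data/gapminder.csv comes from this section and may not exist on your machine, in which case file.exists() simply returns FALSE.

```r
# print the current working directory
current_dir <- getwd()
current_dir

# relative paths start from the working directory, so this checks
# whether R can see the file from where it currently is
file_found <- file.exists("data/gapminder.csv")
file_found

# list the files and folders inside the working directory
list.files()
```

In RStudio, the menu item corresponds to calling setwd() with the path of the chosen folder.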
Let’s take a look at the gapminder data object:
# print the gapminder object
Note that it prints out A LOT of data! Try to avoid printing entire datasets in your Quarto document. Render your document to see why.
Summarizing a data frame
Instead of looking at the entire data frame, it is often easier to look at just the first few rows using the head() function:
# use the head() function to look at gapminder
# look at the first 20 rows
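As a sketch of how head() behaves, the block below uses mtcars, a data frame built into R, as a stand-in for gapminder so that it runs without any data files.

```r
# mtcars is a data frame that ships with R; it stands in for gapminder here

# by default, head() returns the first six rows
first_six <- head(mtcars)
first_six

# the n argument controls how many rows are returned
first_twenty <- head(mtcars, n = 20)
first_twenty
```

The same calls should work on any data frame, so swapping mtcars for gapminder gives the workshop version.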
We can print out the column names:
# use colnames() to print out the column names
We can ask how many rows and columns our data frame has:
# compute the number of rows (nrow)
# compute the number of columns (ncol)
# look at the dimension of gapminder (dim)
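The column-name and size functions can be sketched as follows, again using the built-in mtcars data frame as a stand-in for gapminder.

```r
# print out the column names
colnames(mtcars)

# number of rows
n_rows <- nrow(mtcars)
n_rows

# number of columns
n_cols <- ncol(mtcars)
n_cols

# dim() returns both at once: rows first, then columns
data_dim <- dim(mtcars)
data_dim
```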
We can also look at a summary of the data frame:
# use str() to look at a summary of gapminder
# use summary() to look at a summary of gapminder
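The two summary functions answer slightly different questions, which the sketch below illustrates on the built-in mtcars data frame (a stand-in for gapminder).

```r
# str() describes the structure: dimensions, column types, and the first few values
str(mtcars)

# summary() describes the contents: for numeric columns, the minimum,
# quartiles, median, mean, and maximum
data_summary <- summary(mtcars)
data_summary
```

str() is usually the quicker check that columns were read in with the types you expect, while summary() is more useful for spotting odd values.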
Loading Excel data files into R
To load Excel files, we need to install the readxl R package.
R packages provide you with additional R functions.
You only ever need to install an R package ONCE. This is like installing an application on your computer.
# run in the console: install.packages("readxl")
But every time you want to use an R package in a new R session, you need to load it using the library() function.
# load the readxl R package
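The install-once, load-every-session pattern might look like the sketch below. The requireNamespace() check is an addition for illustration, so that the block prints a hint instead of failing when readxl has not been installed yet.

```r
# check whether readxl is installed, without loading it
readxl_installed <- requireNamespace("readxl", quietly = TRUE)

if (readxl_installed) {
  # loading is needed once in every new R session
  library(readxl)
} else {
  # installing is a one-off step, run in the console rather than in the document
  message("readxl is not installed; run install.packages(\"readxl\") in the console")
}
```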
Let’s load the gapminder Excel dataset using a function from readxl.
# use read_excel() from readxl to load the data/gapminder.xls file
Note this will only load the first sheet. You can use the sheet argument to load other sheets.
# use the "sheet" argument to load in just the second sheet containing Australia's data
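The sheet argument can be sketched with one of the example workbooks that ship with readxl, used here as a stand-in for data/gapminder.xls. The block assumes readxl is installed and does nothing otherwise.

```r
if (requireNamespace("readxl", quietly = TRUE)) {
  library(readxl)

  # path to an example workbook bundled with readxl (not the gapminder file)
  example_path <- readxl_example("datasets.xlsx")

  # list the names of the sheets in the workbook
  print(excel_sheets(example_path))

  # by default, only the first sheet is loaded
  first_sheet <- read_excel(example_path)

  # the sheet argument selects another sheet, by position or by name
  second_sheet <- read_excel(example_path, sheet = 2)
  print(head(second_sheet))
}
```

Passing the sheet name as a string instead of its position tends to be more robust if the sheets in a workbook are later reordered.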
Exercise
Load the world happiness dataset from the whr_2023.csv file. Save it as a variable called world_happiness. Then print out the first 10 rows, the column names, create a summary of the data, report its dimension,