1. Intro to Basics

Take your first steps with R. In this chapter, you will learn how to use the console as a calculator and how to assign variables. You will also get to know the basic data types in R. Let’s get started.

How it works

In the editor on the right there is already some sample code. Can you see which lines are actual R code and which are comments?

# Calculate 3 + 4
3+4
[1] 7
# Calculate 6 + 12
6+12
[1] 18

🌻 Type in an expression and R will calculate it and print the result.

Arithmetic with R
# An addition
5 + 5 
[1] 10
# A subtraction
5 - 5 
[1] 0
# A multiplication
3 * 5
[1] 15
# A division
(5 + 5) / 2 
[1] 5

Additional arithmetic operators

# Exponentiation: Type 2^5 in the editor to calculate 2 to the power 5.
2^5
[1] 32
# Modulo: Type 28 %% 6 to calculate 28 modulo 6.
28%%6
[1] 4

🌻 For more operators, refer to Quick-R
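
Integer division pairs naturally with modulo. The short sketch below is an addition to these notes: it uses base R's %/% operator (not covered above) to show that the quotient and the remainder together reconstruct the original number. The variable names are illustrative only.

```r
# Integer division (%/%) and modulo (%%) split a number into
# a whole-number quotient and a remainder
quotient  <- 28 %/% 6   # how many whole times 6 fits into 28
remainder <- 28 %% 6    # what is left over

quotient
remainder

# Quotient and remainder reconstruct the original value
6 * quotient + remainder == 28
```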

Variable assignment

Use <- or = to store the evaluated result in a variable (a.k.a. data object).

# Assign the value 42 to x
x <- 42
# Print out the value of the variable x
x
[1] 42
Variable assignment (2)
# Assign the value 5 to the variable my_apples
my_apples=5
# Print out the value of the variable my_apples
my_apples
[1] 5
Variable assignment (3)
# Assign a value to the variables my_apples and my_oranges
my_apples <- 5

# Assign to my_oranges the value 6.
my_oranges <- 6

# Add these two variables together
my_apples + my_oranges
[1] 11
# Create the variable my_fruit
my_fruit = my_apples + my_oranges

We can use variables in expressions and store the evaluated results in other variables.

Apples and oranges

Arithmetic operators apply to objects of numeric types, not to character values.

# Assign a value to the variable my_apples
my_apples <- 5 

# Fix the assignment of my_oranges, so that it can be added to `my_apples`
# my_oranges <- "six"
my_oranges <- 6

# Create the variable my_fruit and print it out
my_fruit <- my_apples + my_oranges 
my_fruit
[1] 11
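
To see what happens without the fix, the sketch below (an addition to these notes) adds a number to a character value and captures the resulting error with tryCatch(), so the script keeps running. In an English locale R reports a non-numeric argument to a binary operator. The name msg is illustrative only.

```r
my_apples  <- 5
my_oranges <- "six"   # a character value, not a number

# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(
  my_apples + my_oranges,
  error = function(e) conditionMessage(e)
)
msg

# A string that holds digits can be converted explicitly before adding
my_fruit <- my_apples + as.numeric("6")
my_fruit
```
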
Basic datatypes in R
# Change my_numeric to be 42
my_numeric <- 42

# Change my_character to be "universe"
my_character <- "universe"

# Change my_logical to be FALSE
my_logical <- FALSE

🌻 Basic R Data Types

  • integer, numeric
  • character
  • factor (category)
  • logical (boolean)
  • Date and Time (Date, POSIXct, …)


What’s that data type?
# Declare variables of different types
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE 

# Check class of my_numeric
class(my_numeric)
[1] "numeric"
# Check class of my_character
class(my_character)
[1] "character"
# Check class of my_logical
class(my_logical)
[1] "logical"
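
class() also identifies the other basic types listed earlier. The sketch below is an addition to these notes; the values and variable names are made up for illustration. It checks an integer, a factor and a Date.

```r
my_integer <- 42L                    # the L suffix creates an integer
my_factor  <- factor("red")          # a categorical value
my_date    <- as.Date("2024-01-31")  # a calendar date

class(my_integer)   # "integer"
class(my_factor)    # "factor"
class(my_date)      # "Date"

# Integers still count as numeric when used in arithmetic
is.numeric(my_integer)
```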


2. Vectors

We take you on a trip to Vegas, where you will learn how to analyze your gambling results using vectors in R. After completing this chapter, you will be able to create vectors in R, name them, select elements from them, and compare different vectors.

Create a vector
# Assign the value "Go!" to the variable `vegas`. Remember: R is case sensitive!
vegas <- "Go!"
Create a vector (2)

A vector is one of the basic data structures. A vector is a series of values of the same type.

In R, you create a vector with the combine function c(). You place the vector elements separated by a comma between the parentheses. For example:

numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")

🌻 Use the c() function to create a vector.

🌻 Use the <- or = operator to assign a name to the vector.

🌻 All elements of a vector have the same data type; c() converts mixed inputs to a common type.

vector1 <- c(1.2, 3.5, 15.2)            # `vector1` is a numeric vector
vector2 = c("Alice", "Bob", "Cindy")    # `vector2` is a character vector
# Complete the code to create a `boolean_vector` that contains the 
# three elements: `TRUE`, `FALSE` and `TRUE` (in that order).
boolean_vector <- c(TRUE, FALSE, TRUE)
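
Because a vector holds a single data type, c() converts mixed inputs to a common type rather than raising an error. The sketch below is an addition to these notes, with illustrative variable names, and shows two such conversions.

```r
# Numeric, character and logical inputs: everything becomes character
mixed_vector <- c(1, "a", TRUE)
mixed_vector
class(mixed_vector)   # "character"

# Numeric and logical inputs: TRUE/FALSE become 1/0
num_vector <- c(1, TRUE, FALSE)
num_vector
class(num_vector)     # "numeric"
```
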
Create a vector (3)
# Poker winnings from Monday to Friday
# Monday won $140, Tuesday lost $50, and so on.
poker_vector <- c(140, -50, 20, -120, 240)
# Roulette winnings: Monday lost $24, Tuesday lost $50, 
# Wednesday won $100, Thursday lost $350, and Friday won $10.
roulette_vector <- c(-24,-50,100,-350,10)
Naming vector elements

Not only does the vector object have a name; every element in the vector can also have its own name.

poker_vector # a vector without element names
[1]  140  -50   20 -120  240

The names() function can assign names to sub-elements within collective objects.

# Assign days as names of the elements of poker_vector
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
poker_vector
   Monday   Tuesday Wednesday  Thursday    Friday 
      140       -50        20      -120       240 

Now every element in the vector has its own name.

🌻 Names can make the data object and its sub elements easier to read.

# Assign days as names of roulette_vector
names(roulette_vector) = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
roulette_vector
   Monday   Tuesday Wednesday  Thursday    Friday 
      -24       -50       100      -350        10 
Naming a vector (2)
# Poker winnings from Monday to Friday
poker_vector <- c(140, -50, 20, -120, 240)

# Roulette winnings from Monday to Friday
roulette_vector <- c(-24, -50, 100, -350, 10)

# Save the element names in the variable `days_vector`, ...
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
# then we can assign the names to `poker_vector` and `roulette_vector`
names(poker_vector) <- days_vector  
names(roulette_vector) <- days_vector
# Re-using an existing object makes the program more concise. 
Calculating total winnings
A_vector <- c(1, 2, 3)
B_vector <- c(4, 5, 6)

# Take the sum of A_vector and B_vector
total_vector <- A_vector + B_vector
  
# Print out total_vector
total_vector
[1] 5 7 9
Calculating total winnings (2)
# Assign to total_daily how much you won/lost on each day
total_daily <- poker_vector + roulette_vector
total_daily
   Monday   Tuesday Wednesday  Thursday    Friday 
      116      -100       120      -470       250 
Calculating total winnings (3)

R has many built-in functions; see Built-in Functions.

Every function has a name. To call a function, write its name followed by parentheses: function_name()

# Total winnings with poker 
total_poker <- sum(poker_vector)

# Total winnings with roulette 
total_roulette <-  sum(roulette_vector)

# Total winnings overall 
total_week <- total_poker + total_roulette

# Print out total_week
total_week
[1] -84
Comparing total winnings
# Calculate total gains for poker and roulette
total_poker <- sum(poker_vector)
total_roulette <-  sum(roulette_vector)

# Check if you realized higher total gains in poker than in roulette
total_poker > total_roulette
[1] TRUE

🌻 Arithmetic Comparison Operators

  • < for less than
  • > for greater than
  • <= for less than or equal to
  • >= for greater than or equal to
  • == for equal to each other
  • != not equal to each other

🌻 Comparison expressions evaluate to logical (boolean) values: TRUE or FALSE
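
Comparison operators also work element by element on whole vectors. The sketch below is an addition to these notes: it reuses the poker and roulette figures from this chapter, and the name poker_better is illustrative only.

```r
poker_vector    <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)

# Element-wise comparison: one logical value per day
poker_better <- poker_vector > roulette_vector
poker_better

# TRUE counts as 1, so sum() counts the days poker beat roulette
sum(poker_better)

# Days with identical results in both games
poker_vector == roulette_vector
```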


Indexing

🌷 To retrieve sub-elements of an object, we use an index.

🌷 In R, indexing is written with square brackets [].

🌷 Within the square brackets, we can use

  • position index: integer vectors such as [c(3,5,10)]
  • name index: character vectors such as [c("Monday","Friday")]
  • conditional index: logical expressions such as [poker_vector > 0]
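
The three kinds of index can select the same elements. The sketch below is an addition to these notes, using the named poker_vector from this chapter; the result names (by_position and so on) are illustrative only.

```r
poker_vector <- c(140, -50, 20, -120, 240)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

by_position  <- poker_vector[c(1, 5)]                # position index
by_name      <- poker_vector[c("Monday", "Friday")]  # name index
by_condition <- poker_vector[poker_vector > 100]     # conditional index

by_position

# All three selections return the same named elements
identical(by_position, by_name)
identical(by_position, by_condition)
```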


Vector selection: the good times - Position/Integer Indexing
# Assign the poker results of Wednesday to the variable poker_wednesday. 
# Using index notation `[]`
poker_wednesday <- poker_vector[3]
Vector selection: the good times (2)
# Assign the poker results of Tuesday, Wednesday and Thursday to 
# the variable `poker_midweek`, using `[c(2,3,4)]`
poker_midweek <- poker_vector[c(2,3,4)]
Vector selection: the good times (3)

Selecting a range of positions: 2:6 represents c(2,3,4,5,6)

# Poker and roulette winnings from Monday to Friday:
# Assign to roulette_selection_vector the roulette results 
# from Tuesday up to Friday; make use of `[2:5]`
roulette_selection_vector <- roulette_vector[2:5]
Vector selection: the good times (4) - Name(character) Indexing
# Poker and roulette winnings from Monday to Friday:
# Select poker results by names `[c("Monday", "Tuesday", "Wednesday")]`
poker_start <- poker_vector[c("Monday", "Tuesday", "Wednesday")]
poker_start
   Monday   Tuesday Wednesday 
      140       -50        20 
# Calculate the average of the elements in poker_start by `mean()`
mean(poker_start)
[1] 36.67
Selection by comparison - Step 1

Conditional indexing is often the most useful indexing method.

# Poker and roulette winnings from Monday to Friday:
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Which days did you make money on poker? 
selection_vector <- poker_vector > 0
  
# Print out selection_vector
selection_vector  
   Monday   Tuesday Wednesday  Thursday    Friday 
     TRUE     FALSE      TRUE     FALSE      TRUE 
Selection by comparison - Step 2

Then we can use the logical vector selection_vector as an index.

# Select from poker_vector these days using the indexing vector `[selection_vector]`
poker_winning_days <- poker_vector[selection_vector]
poker_winning_days 
   Monday Wednesday    Friday 
      140        20       240 
Advanced selection
# Poker and roulette winnings from Monday to Friday:
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Which days did you make money on roulette?
selection_vector <- roulette_vector > 0

# Select from roulette_vector these days
roulette_winning_days <- roulette_vector[selection_vector]
roulette_winning_days
Wednesday    Friday 
      100        10 
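
Conditions can be combined before they are used as an index. The sketch below is an addition to these notes: it uses the & operator and which(), neither of which is covered above, to find the days on which both games made money. The name both_won is illustrative only.

```r
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Combine two conditions element by element with &
both_won <- poker_vector > 0 & roulette_vector > 0

# Use the logical vector to pick the matching day names
names(poker_vector)[both_won]

# which() converts a logical vector into positions
which(both_won)
```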


3. Matrices

A matrix is a two-dimensional data object: a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.

You can construct a matrix in R with the matrix() function. Consider the following example: matrix(1:9, byrow = TRUE, nrow = 3)

What’s a matrix?

The matrix() function transforms a 1-d vector into a 2-d matrix.

# Construct a matrix with 3 rows containing the numbers 1 up to 9, 
# filled row-wise.
matrix(1:9, byrow = TRUE, nrow = 3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
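
When byrow is left out, matrix() fills column by column. The sketch below is an addition to these notes; it contrasts the two fill orders and uses t() and dim(), which are not covered above. The names by_row and by_col are illustrative only.

```r
by_row <- matrix(1:9, byrow = TRUE, nrow = 3)  # fill row by row
by_col <- matrix(1:9, nrow = 3)                # default: fill column by column

by_col

# The two layouts are transposes of each other
all(by_row == t(by_col))

# dim() reports the number of rows and columns
dim(by_row)
```
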
Analyze matrices, you shall
# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

# Create box_office
# Concatenate the 3 vectors `c(new_hope, empire_strikes, return_jedi)` 
# Then create a matrix by `matrix()`. Remember to specify `byrow` and `nrow`
box_office <- c(new_hope,empire_strikes,return_jedi)
box_office
[1] 461.0 314.4 290.5 247.9 309.3 165.8
# Construct star_wars_matrix
star_wars_matrix <- matrix(box_office, byrow = TRUE, nrow = 3)
  
# print out the matrix
star_wars_matrix
      [,1]  [,2]
[1,] 461.0 314.4
[2,] 290.5 247.9
[3,] 309.3 165.8
Naming a matrix

Every row and column of a matrix can have its own name.

# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

# Construct matrix
star_wars_matrix <- matrix(
  c(new_hope, empire_strikes, return_jedi), 
  nrow = 3, byrow = TRUE)

# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
# Name the columns with region with `colnames()`
colnames(star_wars_matrix) = region

# Name the rows with titles  with `rownames()`
rownames(star_wars_matrix) = titles

# Print out star_wars_matrix
star_wars_matrix
                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8

🌻 As shown above, the names make the matrix easier to understand.

Calculating the worldwide box office
# Calculate worldwide box office figures for each movie with `rowSums()`
worldwide_vector <- rowSums(star_wars_matrix) 

worldwide_vector
             A New Hope The Empire Strikes Back      Return of the Jedi 
                  775.4                   538.4                   475.1 
Adding a column for the Worldwide box office

cbind() helps to combine columns

# Construct worldwide box office vector
# Bind the new variable worldwide_vector as a column to star_wars_matrix with `cbind()`
all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)

all_wars_matrix
                           US non-US worldwide_vector
A New Hope              461.0  314.4            775.4
The Empire Strikes Back 290.5  247.9            538.4
Return of the Jedi      309.3  165.8            475.1
Adding rows

We build another matrix of box office figures as star_wars_matrix2.

star_wars_matrix2 = matrix(
  c(474.5,  552.5, 310.7,  338.7, 380.3,  468.5),
  byrow = TRUE, nrow = 3)
rownames(star_wars_matrix2) = c(
  "The Phantom Menace","Attack of the Clones",
  "Revenge of the Sith") 
colnames(star_wars_matrix2)=c("US", "non-US")

star_wars_matrix2
                        US non-US
The Phantom Menace   474.5  552.5
Attack of the Clones 310.7  338.7
Revenge of the Sith  380.3  468.5

rbind() helps to combine rows. We use rbind() to combine the two matrices.

# Combine both Star Wars trilogies in one matrix with `rbind()`
all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2) 
  
all_wars_matrix  
                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8
The Phantom Menace      474.5  552.5
Attack of the Clones    310.7  338.7
Revenge of the Sith     380.3  468.5
The total box office revenue for the entire saga

colSums() computes the sum of each column.

# Total revenue for US and non-US with `colSums()`
total_revenue_vector <- colSums(all_wars_matrix)
  
# Print out total_revenue_vector
total_revenue_vector  
    US non-US 
  2226   2088 
Selection of matrix elements - Indexing a matrix

Indexing a 2D object takes two indexes: [row_index, column_index]

# Select the non-US revenue for all movies with index notation `[,2]`
non_us_all <- all_wars_matrix[,2]
  
# Average non-US revenue with `mean()` 
mean(non_us_all)
[1] 348
# Select the non-US revenue for first two movies with index notation `[1:2,2]`
non_us_some <- all_wars_matrix[1:2,2]
  
# Average non-US revenue for first two movies with `mean()`
mean(non_us_some)
[1] 281.1
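
Rows and columns of a named matrix can also be selected by name. The sketch below is an addition to these notes: it builds a small stand-in matrix m (an illustrative name) from the first two movies and shows the drop = FALSE argument, which keeps a single column as a matrix.

```r
# A small stand-in for all_wars_matrix (first two movies only)
m <- matrix(
  c(460.998, 314.4, 290.475, 247.900), byrow = TRUE, nrow = 2,
  dimnames = list(c("A New Hope", "The Empire Strikes Back"),
                  c("US", "non-US")))

m["A New Hope", "non-US"]   # one cell, selected by name
m[, "US"]                   # a whole column, returned as a named vector

# drop = FALSE keeps the result as a one-column matrix
us_column <- m[, "US", drop = FALSE]
dim(us_column)
```
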
A little arithmetic with matrices

Assume that the ticket price is always $5. Because the box office figures are in millions of dollars, the result is in millions of visitors.

# Estimate the visitors, assuming ticket price is $5
visitors <- all_wars_matrix/5
  
# Print the estimate to the console
visitors
                           US non-US
A New Hope              92.20  62.88
The Empire Strikes Back 58.10  49.58
Return of the Jedi      61.86  33.16
The Phantom Menace      94.90 110.50
Attack of the Clones    62.14  67.74
Revenge of the Sith     76.06  93.70
A little arithmetic with matrices (2)

If the ticket prices differ among movies and regions, we need to build a ticket_prices_matrix first.

ticket_prices_matrix <- matrix(
  c(5, 5, 6, 6, 7, 7, 4, 4, 4.5, 4.5, 4.9, 4.9),
  byrow = TRUE, nrow = 6,
  dimnames = list(
    rownames(all_wars_matrix),
    colnames(all_wars_matrix)))
ticket_prices_matrix
                         US non-US
A New Hope              5.0    5.0
The Empire Strikes Back 6.0    6.0
Return of the Jedi      7.0    7.0
The Phantom Menace      4.0    4.0
Attack of the Clones    4.5    4.5
Revenge of the Sith     4.9    4.9
# Then we can estimate the number of visitors 
visitors <- all_wars_matrix / ticket_prices_matrix
visitors
                            US non-US
A New Hope               92.20  62.88
The Empire Strikes Back  48.41  41.32
Return of the Jedi       44.19  23.69
The Phantom Menace      118.62 138.12
Attack of the Clones     69.04  75.27
Revenge of the Sith      77.61  95.61
# US visitors 
us_visitors <- visitors[,1] 

# Average number of US visitors
mean(us_visitors)
[1] 75.01
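
The operators / and * work cell by cell on matrices, which is different from the matrix product of linear algebra. The sketch below is an addition to these notes: it uses a tiny illustrative matrix m and base R's %*% operator (not covered above) to show the difference.

```r
m <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)

elementwise <- m * m     # multiplies matching cells
matrix_prod <- m %*% m   # linear-algebra matrix product

elementwise
matrix_prod
```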


4. Factors (Categories)

Data often falls into a limited number of categories. For example, human hair color can be categorized as black, brown, blond, red, grey, or white—and perhaps a few more options for people who color their hair. In R, categorical data is stored in factors. Factors are very important in data analysis, so start learning how to create, subset, and compare them now.

What’s a factor and why would you use it?
# Assign to variable theory the value "factors"
theory = "factors"
What’s a factor and why would you use it? (2)
# Create `sex_vector`
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
sex_vector
[1] "Male"   "Female" "Female" "Male"   "Male"  

sex_vector is a character vector. The factor() function transforms character vectors into factor (categorical) vectors.

# Convert `sex_vector` to a factor
factor_sex_vector <- factor(sex_vector)

# Print out factor_sex_vector
factor_sex_vector
[1] Male   Female Female Male   Male  
Levels: Female Male

After calling factor(), factor_sex_vector is a factor vector.

When printing a factor object, R prints the values and, below them, the Levels: line.


What’s a factor and why would you use it? (3)

There are two types of categorical variables: a nominal categorical variable and an ordinal categorical variable.

A nominal variable is a categorical variable without an implied order. This means that it is impossible to say that ‘one is worth more than the other’. For example, think of the categorical variable animals_vector with the categories “Elephant”, “Giraffe”, “Donkey” and “Horse”. Here, it is impossible to say that one stands above or below the other. (Note that some of you might disagree ;-) ).

# Animals
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector
[1] Elephant Giraffe  Donkey   Horse   
Levels: Donkey Elephant Giraffe Horse

In contrast, ordinal variables do have a natural ordering. Consider for example the categorical variable temperature_vector with the categories: “Low”, “Medium” and “High”. Here it is obvious that “Medium” stands above “Low”, and “High” stands above “Medium”.

🌻 To construct ordinal variables, we specify ordered=TRUE while calling factor()

# Temperature
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(
  temperature_vector, ordered = TRUE, 
  levels = c("Low", "Medium", "High"))
factor_temperature_vector
[1] High   Low    High   Low    Medium
Levels: Low < Medium < High

🌻 The line Levels: Low < Medium < High implies that the elements can be compared with one another.


Factor levels
# Code to build factor_survey_vector
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
factor_survey_vector
[1] M F F M M
Levels: F M

levels() can rename the levels. The new names must be given in the order of the existing levels (here F, then M).

# Specify the levels of factor_survey_vector
levels(factor_survey_vector) <- c("Female","Male")
factor_survey_vector
[1] Male   Female Female Male   Male  
Levels: Female Male
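
Since renaming through levels() depends on the order of the existing levels, an alternative is to state the mapping when the factor is created. The sketch below is an addition to these notes: it uses the levels and labels arguments of factor(), plus table() to count the values.

```r
survey_vector <- c("M", "F", "F", "M", "M")

# Map each original value to its label explicitly:
# "F" becomes "Female" and "M" becomes "Male"
factor_survey_vector <- factor(
  survey_vector,
  levels = c("F", "M"),
  labels = c("Female", "Male"))
factor_survey_vector

# table() counts the observations per level
table(factor_survey_vector)
```
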
Summarizing a factor

Take a summary() of the survey_vector and factor_survey_vector. Interpret the results of both vectors. Are they both equally useful in this case?

# Generate summary for survey_vector
summary(survey_vector)
   Length     Class      Mode 
        5 character character 
# Generate summary for factor_survey_vector
summary(factor_survey_vector)
Female   Male 
     2      3 

🌻 summary() counts the observations per level for a factor vector, but for a character vector it only reports length, class and mode.


Battle of the sexes
# Build factor_survey_vector with clean levels
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# Male
male <- factor_survey_vector[1]

# Female
female <- factor_survey_vector[2]

# Battle of the sexes: Male 'larger' than female?
male > female
Warning in Ops.factor(male, female): '>' not meaningful for factors
[1] NA

🌷 Elements of a nominal (unordered) factor cannot be ordered with < or >!


Ordered factors

Define speed_vector as a character vector with 5 entries, one for each analyst. Each entry should be either “slow”, “medium”, or “fast”. Use the list below:

  • Analyst 1 is medium,
  • Analyst 2 is slow,
  • Analyst 3 is slow,
  • Analyst 4 is medium and
  • Analyst 5 is fast.
# Create speed_vector
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
Ordered factors (2)
# Convert speed_vector to ordered factor vector 
# by specifying ordered=TRUE
factor_speed_vector <- factor(
  speed_vector,levels=c("slow","medium","fast"),ordered=TRUE)

# Print factor_speed_vector
factor_speed_vector
[1] medium slow   slow   medium fast  
Levels: slow < medium < fast
Comparing ordered factors
# Factor value for second data analyst
da2 <- factor_speed_vector[2]
# Factor value for fifth data analyst
da5 <- factor_speed_vector[5]

# Is data analyst 2 faster than data analyst 5?
da2 > da5
[1] FALSE

🌷 Elements of an ordinal (ordered) factor are comparable!
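
Comparisons on an ordered factor also work on the whole vector at once. The sketch below is an addition to these notes: it shows the integer codes behind the levels with as.integer(), compares every analyst with the first one, and uses max() to find the highest level present. The name slower_than_first is illustrative only.

```r
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(
  speed_vector, levels = c("slow", "medium", "fast"), ordered = TRUE)

# Each value is stored as the position of its level (slow = 1, fast = 3)
as.integer(factor_speed_vector)

# Compare every analyst with analyst 1 (medium) in one step
slower_than_first <- factor_speed_vector < factor_speed_vector[1]
slower_than_first

# The highest level present in the data
as.character(max(factor_speed_vector))
```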



5. Data Frame

Most datasets you will be working with will be stored as data frames. By the end of this chapter, you will be able to create a data frame, select interesting parts of a data frame, and order a data frame according to certain variables.

What’s a data frame?

R has a built-in data frame, mtcars.

data(mtcars)
mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Type ?mtcars in the console to see the definitions of the columns.

  • mpg: Miles/(US) gallon
  • cyl: Number of cylinders
  • disp: Displacement (cu.in.)
  • hp: Gross horsepower
  • drat: Rear axle ratio
  • wt: Weight (1000 lbs)
  • qsec: 1/4 mile time
  • vs: Engine (0 = V-shaped, 1 = straight)
  • am: Transmission (0 = automatic, 1 = manual)
  • gear: Number of forward gears
  • carb: Number of carburetors


Quick, have a look at your dataset
# Call head() on mtcars to see the first 6 rows of the data frame
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

🌻 Usually, within a data frame …

  • each row represents a subject under study, and
  • each column represents an attribute of interest.


Have a look at the structure
# Investigate the structure of mtcars with `str()`
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
Creating a data frame

Create a data frame planets_df to store the data for the 8 planets in our solar system.

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", 
          "Mars", "Jupiter", "Saturn", 
          "Uranus", "Neptune")
type <- c("Terrestrial planet", 
          "Terrestrial planet", 
          "Terrestrial planet", 
          "Terrestrial planet", "Gas giant", 
          "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 
              11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 
              0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Create a data frame from the vectors
planets_df <- data.frame(name,type,diameter,rotation,rings)
  
planets_df
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
3   Earth Terrestrial planet    1.000     1.00 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
5 Jupiter          Gas giant   11.209     0.41  TRUE
6  Saturn          Gas giant    9.449     0.43  TRUE
7  Uranus          Gas giant    4.007    -0.72  TRUE
8 Neptune          Gas giant    3.883     0.67  TRUE
Creating a data frame (2)
# Check the structure of planets_df with `str()`
str(planets_df)
'data.frame':   8 obs. of  5 variables:
 $ name    : chr  "Mercury" "Venus" "Earth" "Mars" ...
 $ type    : chr  "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" ...
 $ diameter: num  0.382 0.949 1 0.532 11.209 ...
 $ rotation: num  58.64 -243.02 1 1.03 0.41 ...
 $ rings   : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
Selection of data frame elements

A data frame is a 2D object, so it also takes two indexes ([,])

# Print out diameter of Mercury (row 1, column 3) with `[1,3]`
planets_df[1,3]
[1] 0.382
# Print out data for Mars (entire fourth row) with `[4,]`
planets_df[4,]
  name               type diameter rotation rings
4 Mars Terrestrial planet    0.532     1.03 FALSE
Selection of data frame elements (2)
# Select first 5 values of diameter column
planets_df[1:5,"diameter"]
[1]  0.382  0.949  1.000  0.532 11.209
Only planets with rings

The $ operator selects a column from a data frame. A column in a data frame is a vector by itself.

# Select the rings variable from planets_df
rings_vector <- planets_df$rings
  
# Print out rings_vector
rings_vector
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
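
A single column can be extracted in several equivalent ways. The sketch below is an addition to these notes: it uses a shortened, illustrative version of planets_df and also shows [[ ]] and drop = FALSE, which are not covered above.

```r
# Shortened stand-in for planets_df
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth"),
  diameter = c(0.382, 0.949, 1))

by_dollar  <- planets_df$diameter        # column as a vector
by_bracket <- planets_df[["diameter"]]   # same vector, name given as a string
by_index   <- planets_df[, "diameter"]   # same vector via [row, column]

identical(by_dollar, by_bracket)
identical(by_dollar, by_index)

# drop = FALSE keeps a one-column data frame instead of a vector
class(planets_df[, "diameter", drop = FALSE])
```
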
Only planets with rings (2)

rings_vector is a logical vector, so we can use it directly as a logical index.

# Adapt the code to select all columns for planets with rings
planets_df[rings_vector, ]
     name      type diameter rotation rings
5 Jupiter Gas giant   11.209     0.41  TRUE
6  Saturn Gas giant    9.449     0.43  TRUE
7  Uranus Gas giant    4.007    -0.72  TRUE
8 Neptune Gas giant    3.883     0.67  TRUE
Only planets with rings but shorter

subset() is used to select rows by condition.

# Select planets with diameter < 1 with `subset(df, condition)`
subset(planets_df, subset = diameter < 1)
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE

Of course, we can also put the selection condition as a logical expression within the square brackets [ , ]. For example, to select the planets with a diameter greater than 1:

planets_df[planets_df$diameter>1, ]
     name      type diameter rotation rings
5 Jupiter Gas giant   11.209     0.41  TRUE
6  Saturn Gas giant    9.449     0.43  TRUE
7  Uranus Gas giant    4.007    -0.72  TRUE
8 Neptune Gas giant    3.883     0.67  TRUE
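
Both styles accept combined conditions. The sketch below is an addition to these notes: it rebuilds a reduced planets_df (three of the original columns) and selects ringed planets with a diameter below 5, once with subset() and once with square brackets. The threshold 5 and the result names are illustrative only.

```r
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE))

# Inside subset(), column names can be used directly
with_subset <- subset(planets_df, rings & diameter < 5)

# With square brackets, columns must be referenced through planets_df$
with_bracket <- planets_df[planets_df$rings & planets_df$diameter < 5, ]

with_subset$name
identical(with_subset$name, with_bracket$name)
```
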
Sorting (Re-ordering)

order() returns the positions of the elements, arranged so that the elements come out in ascending order.

# Play around with the `order()` function in the console
order(c(50,10,40,20,30))
[1] 2 4 5 3 1
# the smallest value (`10`) is in the 2nd place
# the second smallest value (`20`) is in the 4th place
# ...
# the largest value (`50`) is in the 1st place
Sorting your data frame

So if we store the position of the planets by their diameter in the position vector positions, …

# Use order() to create order index by diameter
positions <- order(planets_df$diameter)

we can use the position vector to re-order the planets by their size in ascending order.

# Use positions to sort planets_df with `df[order_index,]`
planets_df[positions,]
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
3   Earth Terrestrial planet    1.000     1.00 FALSE
8 Neptune          Gas giant    3.883     0.67  TRUE
7  Uranus          Gas giant    4.007    -0.72  TRUE
6  Saturn          Gas giant    9.449     0.43  TRUE
5 Jupiter          Gas giant   11.209     0.41  TRUE
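
Sorting from largest to smallest works the same way. The sketch below is an addition to these notes: it uses the decreasing argument of order() on the example vector from above and on a shortened, illustrative planets_df.

```r
x <- c(50, 10, 40, 20, 30)

order(x)                      # positions for ascending order
order(x, decreasing = TRUE)   # positions for descending order

# Indexing with the positions sorts the vector
x[order(x, decreasing = TRUE)]

# Applied to a data frame: largest diameter first
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars"),
  diameter = c(0.382, 0.949, 1, 0.532))
largest_first <- planets_df[order(planets_df$diameter, decreasing = TRUE), ]
largest_first$name
```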


6. Lists

As opposed to vectors, lists can hold components of different types, just as your to-do lists can contain different categories of tasks. This chapter will teach you how to create, name, and subset these lists.

Lists, why would you need them?

Congratulations! At this point in the course you are already familiar with:

  • Vectors (one dimensional array): can hold numeric, character or logical values. The elements in a vector all have the same data type.
  • Matrices (two dimensional array): can hold numeric, character or logical values. The elements in a matrix all have the same data type.
  • Data frames (two-dimensional objects): can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data types.

Pretty sweet for an R newbie, right?
Lists, why would you need them? (2)

A list in R is similar to your to-do list at work or school: the different items on that list most likely differ in length, characteristic, and type of activity that has to be done.

A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.

You could say that a list is some kind of super data type: you can store practically any piece of information in it!
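
As a quick illustration of that flexibility, the sketch below (an addition to these notes) stores components of different types and lengths in one list. The name todo_list and its contents are made up; the double square brackets [[ ]] pull out a single component.

```r
# Components of different types and lengths in one list
todo_list <- list(
  c(1.5, 2, 3),         # numeric vector
  "call the dentist",   # single character string
  c(TRUE, FALSE),       # logical vector
  list("nested", 42))   # another list

length(todo_list)       # number of components

# Each component keeps its own type
class(todo_list[[1]])
class(todo_list[[2]])
class(todo_list[[4]])
```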

Creating a list

Put a vector (my_vector), a matrix (my_matrix) and a data frame (my_df) into a list.

# Vector with numerics from 1 up to 10
my_vector <- 1:10 

# Matrix with numerics from 1 up to 9
my_matrix <- matrix(1:9, ncol = 3)

# First 10 rows of the built-in data frame mtcars
my_df <- mtcars[1:10, ]

# Construct list with these different elements:
my_list <- list(my_vector, my_matrix, my_df)

my_list
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

[[3]]
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
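
The `[[1]]`, `[[2]]` and `[[3]]` labels in the printout are also how components of an unnamed list are retrieved. The sketch below rebuilds the same list so that it runs on its own, and contrasts double brackets (which return the component itself) with single brackets (which return a shorter list). The variable names `first_component` and `first_sublist` are illustrative only.

```r
# Rebuild the list from above so this snippet is self-contained
my_vector <- 1:10
my_matrix <- matrix(1:9, ncol = 3)
my_df <- mtcars[1:10, ]
my_list <- list(my_vector, my_matrix, my_df)

# Double brackets return the component itself (here an integer vector)
first_component <- my_list[[1]]
class(first_component)   # "integer"

# Single brackets return a list of length 1 that still wraps the component
first_sublist <- my_list[1]
class(first_sublist)     # "list"

# Indexing can be chained: row 2, column 3 of the matrix stored in position 2
my_list[[2]][2, 3]       # 8
```

In most cases `[[ ]]` is what you want when you intend to work with the stored object; `[ ]` is useful when you need a sub-list.
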
Creating a named list

We can give every component a name when creating a list:

# Adapt the list() call to change the component names to `vec`, `mat` and `df`
my_list <- list(vec=my_vector, mat=my_matrix, df=my_df)

# Print out my_list
my_list
$vec
 [1]  1  2  3  4  5  6  7  8  9 10

$mat
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

$df
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

We can then retrieve a list component by its name:

my_list$mat
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
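
Besides `$`, a named component can also be reached with `[[ ]]` and the name as a string. The following sketch, which recreates a list with the same component names, shows that both forms return the same object. The variable `wanted` is a hypothetical helper introduced only for this example.

```r
# A list with the same component names as above
my_list <- list(vec = 1:10, mat = matrix(1:9, ncol = 3), df = mtcars[1:10, ])

# names() lists the component names
names(my_list)                              # "vec" "mat" "df"

# $ and [[ ]] with a quoted name return the same component
identical(my_list$mat, my_list[["mat"]])    # TRUE

# [[ ]] also works when the name is stored in a variable; $ does not
wanted <- "vec"
my_list[[wanted]]
```
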
Creating a named list (2)

Put the information of a movie (“The Shining”) into a list (shining_list).

# The variables mov, act and rev are available
mov = "The Shining"
act = c("Jack Nicholson","Shelley Duvall","Danny Lloyd",
        "Scatman Crothers","Barry Nelson")
rev = data.frame(
  scores = c(4.5,4.0,5.0),
  sources = c("IMDb1","IMDb2","IMDb3"),
  comments = c(
    "Best Horror Film I Have Ever Seen",
    "A truly brilliant and scary film from Stanley Kubrick",
    "A masterpiece of psychological horror"))

# Finish the code to build shining_list
shining_list <- list(
  moviename = mov, actors = act, reviews = rev)

shining_list
$moviename
[1] "The Shining"

$actors
[1] "Jack Nicholson"   "Shelley Duvall"   "Danny Lloyd"      "Scatman Crothers"
[5] "Barry Nelson"    

$reviews
  scores sources                                              comments
1    4.5   IMDb1                     Best Horror Film I Have Ever Seen
2    4.0   IMDb2 A truly brilliant and scary film from Stanley Kubrick
3    5.0   IMDb3                 A masterpiece of psychological horror
Selecting elements from a list
# Print out the vector representing the actors
shining_list$actors
[1] "Jack Nicholson"   "Shelley Duvall"   "Danny Lloyd"      "Scatman Crothers"
[5] "Barry Nelson"    
# Print the second element of the vector representing the actors
shining_list$actors[2]
[1] "Shelley Duvall"
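
The same chaining works for the data frame stored in the list. As a rough sketch, the code below rebuilds a reduced version of `shining_list` (the `comments` column is left out for brevity) and drills into the `reviews` component. The variable `best` is illustrative, and `stringsAsFactors = FALSE` is set so that the result is a character string on older R versions too.

```r
# Reduced, self-contained copy of shining_list
shining_list <- list(
  moviename = "The Shining",
  actors = c("Jack Nicholson", "Shelley Duvall", "Danny Lloyd",
             "Scatman Crothers", "Barry Nelson"),
  reviews = data.frame(
    scores = c(4.5, 4.0, 5.0),
    sources = c("IMDb1", "IMDb2", "IMDb3"),
    stringsAsFactors = FALSE
  )
)

# Equivalent to shining_list$actors[2]
shining_list[["actors"]][2]

# A column of the nested data frame, and its mean
shining_list$reviews$scores
mean(shining_list$reviews$scores)    # 4.5

# Source of the highest score: find the row first, then select it
best <- which.max(shining_list$reviews$scores)
shining_list$reviews$sources[best]   # "IMDb3"
```
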
Creating a new list for another movie

Put the information of another movie (“The Departed”) into a list (departed_list).

# Define the comments and scores vectors
scores <- c(4.6, 5, 4.8, 5, 4.2)
comments <- c("I would watch it again", "Amazing!", "I liked it", 
              "One of the best movies","Fascinating plot")
movie_title = "The Departed"
movie_actors = c( "Leonardo DiCaprio","Matt Damon","Jack Nicholson",
                  "Mark Wahlberg","Vera Farmiga","Martin Sheen")

# Save the average of the scores vector as avg_review
avg_review = mean(scores)

# Combine scores and comments into the reviews_df data frame
reviews_df = data.frame(scores, comments)

# Create a list called `departed_list` that contains
# `movie_title`, `movie_actors`, the reviews data frame
# `reviews_df`, and the average review score `avg_review`,
# then print it out.
departed_list = list( 
  movie_title, movie_actors, 
  reviews_df, avg_review)

departed_list
[[1]]
[1] "The Departed"

[[2]]
[1] "Leonardo DiCaprio" "Matt Damon"        "Jack Nicholson"   
[4] "Mark Wahlberg"     "Vera Farmiga"      "Martin Sheen"     

[[3]]
  scores               comments
1    4.6 I would watch it again
2    5.0               Amazing!
3    4.8             I liked it
4    5.0 One of the best movies
5    4.2       Fascinating plot

[[4]]
[1] 4.72
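
Because the components were passed to list() without names, the printout uses positions (`[[1]]` to `[[4]]`). If names are wanted, they can be added afterwards with names() or supplied in the list() call. The sketch below uses a shortened two-component version of the list; the names `title` and `avg_review`, and the variables `unnamed` and `named`, are assumptions chosen for illustration.

```r
# Shortened, self-contained version of the data above
scores <- c(4.6, 5, 4.8, 5, 4.2)
movie_title <- "The Departed"
avg_review <- mean(scores)

# Without names, names() returns NULL
unnamed <- list(movie_title, avg_review)
names(unnamed)

# Option 1: assign names after the list has been created
names(unnamed) <- c("title", "avg_review")

# Option 2: supply the names when creating the list
named <- list(title = movie_title, avg_review = avg_review)

# Both routes give the same list
identical(unnamed, named)   # TRUE
named$avg_review            # 4.72
```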