1. Intro to Basics

Take your first steps with R. In this chapter, you will learn how to use the console as a calculator and how to assign variables. You will also get to know the basic data types in R. Let’s get started.

How it works

In the editor on the right there is already some sample code. Can you see which lines are actual R code and which are comments?

# Calculate 3 + 4
3+4
[1] 7
# Calculate 6 + 12
6+12
[1] 18

🌻 Enter an expression, and R prints the result of evaluating it.

Arithmetic with R
# An addition
5 + 5 
[1] 10
# A subtraction
5 - 5 
[1] 0
# A multiplication
3 * 5
[1] 15
# A division
(5 + 5) / 2 
[1] 5

Besides the four basic arithmetic operations and parentheses, R has a few other commonly used operators.

# Exponentiation: Type 2^5 in the editor to calculate 2 to the power 5.
2^5
[1] 32
# Modulo: Type 28 %% 6 to calculate 28 modulo 6.
28%%6
[1] 4
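
As a small aside, modulo pairs naturally with integer division, which R writes as %/%. The sketch below is only an illustration (the names `quotient` and `remainder` are made up here); it checks that the two pieces together rebuild the original number.

```r
# Integer division and modulo split 28 into a quotient and a remainder
quotient  <- 28 %/% 6   # how many whole times 6 fits into 28
remainder <- 28 %% 6    # what is left over

# The two pieces rebuild the original value
6 * quotient + remainder == 28
```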

🌻 Quick-R has a lot of useful information for beginners.

Variable assignment

We can use <- or = to store the result of an expression in a data object (variable).

# Assign the value 42 to x
x <- 42
# Print out the value of the variable x
x
[1] 42
Variable assignment (2)
# Assign the value 5 to the variable my_apples
my_apples=5
# Print out the value of the variable my_apples
my_apples
[1] 5
Variable assignment (3)
# Assign a value to the variables my_apples and my_oranges
my_apples <- 5

# Assign to my_oranges the value 6.
my_oranges <- 6

# Add these two variables together
my_apples + my_oranges
[1] 11
# Create the variable my_fruit
my_fruit = my_apples + my_oranges

We can use data objects inside an expression and store the result in another data object.

Apples and oranges

Arithmetic operators work on objects of numeric data types, but not on character objects.

# Assign a value to the variable my_apples
my_apples <- 5 

# Fix the assignment of my_oranges, so that it can be added with `my_apples`
# my_oranges <- "six"
my_oranges <- 6

# Create the variable my_fruit and print it out
my_fruit <- my_apples + my_oranges 
my_fruit
[1] 11
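
To see what happens with the unfixed assignment, the sketch below adds a number to the string "six". It wraps the addition in base R's tryCatch() so the error message is captured in a variable (the name `result` is chosen here for illustration) instead of stopping the script.

```r
my_apples <- 5
my_oranges <- "six"

# Adding a number and a character string raises an error;
# tryCatch() captures the error message instead of halting
result <- tryCatch(
  my_apples + my_oranges,
  error = function(e) conditionMessage(e)
)

# result now holds the text of the error message
result
```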
Basic datatypes in R
# Change my_numeric to be 42
my_numeric <- 42

# Change my_character to be "universe"
my_character <- "universe"

# Change my_logical to be FALSE
my_logical <- FALSE

🌻 Commonly used R data types

  • Integers and real numbers (integer, numeric)
  • Text and strings (character)
  • Categories (factor)
  • Logical values (logical/boolean)
  • Dates (Date) and times (POSIXct, …)


What’s that data type?
# Declare variables of different types
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE 

# Check class of my_numeric
class(my_numeric)
[1] "numeric"
# Check class of my_character
class(my_character)
[1] "character"
# Check class of my_logical
class(my_logical)
[1] "logical"
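
The difference between integer and numeric can be checked the same way. This is a minimal sketch with illustrative variable names: a plain number is stored as numeric, while the L suffix asks R for an integer.

```r
my_double  <- 42    # no suffix: stored as "numeric" (double precision)
my_integer <- 42L   # L suffix: stored as "integer"

class(my_double)
class(my_integer)

# is.numeric() is TRUE for both, because integers are numbers too
is.numeric(my_integer)
```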


2. Vectors

The vector is the most basic data structure in R. A vector object is a sequence of values that all share the same data type.

We take you on a trip to Vegas, where you will learn how to analyze your gambling results using vectors in R. After completing this chapter, you will be able to create vectors in R, name them, select elements from them, and compare different vectors.

Create a vector
# Assign the value "Go!" to the variable `vegas`. Remember: R is case sensitive!
vegas <- "Go!"
Create a vector (2)

In R, you create a vector with the combine function c(). You place the vector elements separated by a comma between the parentheses. For example:

numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")

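🌻 Use the c() function to create a vector.

🌻 All elements of a vector must have the same data type; if the values passed to c() are of mixed types, R converts them to a common type.

We can use the c() function to combine values of the same type into a vector,
and then use <- or = to give the vector object a name.

vector1 <- c(1.2, 3.5, 15.2)            # `vector1` is a numeric vector
vector2 = c("Alice", "Bob", "Cindy")    # `vector2` is a character vector
# Complete the code for `boolean_vector` so that it contains the three elements `TRUE`, `FALSE` and `TRUE` (in that order).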
boolean_vector <- c(TRUE, FALSE, TRUE)
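
The single-type rule can be seen in action with a short experiment. In this sketch (the name `mixed_vector` is illustrative), a number, a string and a logical value go into one c() call, and R converts all of them to character.

```r
# Mixing types in c() does not fail; R converts everything to one common type
mixed_vector <- c(1, "a", TRUE)

mixed_vector          # every element is now a character string
class(mixed_vector)   # "character"
```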
Create a vector (3)
# Poker winnings from Monday to Friday (negative values are losses)
poker_vector <- c(140, -50, 20, -120, 240)
# Roulette winnings: Monday lost $24, Tuesday lost $50, 
# Wednesday won $100, Thursday lost $350, and Friday won $10.
# Money won or lost at roulette from Monday to Friday
roulette_vector <- c(-24,-50,100,-350,10)
Naming vector elements

Not only can an object have a name; each element (sub-object) inside it can have a name too.

poker_vector
[1]  140  -50   20 -120  240

names() can be used to assign a name to each element of an object (here, poker_vector).

# Assign days as names of poker_vector
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
poker_vector
   Monday   Tuesday Wednesday  Thursday    Friday 
      140       -50        20      -120       240 

🌻 As shown above, names make objects (and their elements) easier to interpret.

# Assign days as names of roulette_vector
names(roulette_vector) = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
roulette_vector
   Monday   Tuesday Wednesday  Thursday    Friday 
      -24       -50       100      -350        10 
Naming a vector (2)
# Poker winnings from Monday to Friday
poker_vector <- c(140, -50, 20, -120, 240)

# Roulette winnings from Monday to Friday
roulette_vector <- c(-24, -50, 100, -350, 10)

# The variable days_vector
# Storing the weekday names in `days_vector` ahead of time
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
# Assign the names of the day to roulette_vector and poker_vector
# With days_vector in place, the naming code is more concise
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector
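
As an alternative sketch, names can also be supplied at creation time as name = value pairs inside c(), and names() without an assignment reads them back. The values are the poker results used above.

```r
# Names can be given directly inside c() as name = value pairs
poker_vector <- c(Monday = 140, Tuesday = -50, Wednesday = 20,
                  Thursday = -120, Friday = 240)

# names() without an assignment returns the names as a character vector
names(poker_vector)
```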
Calculating total winnings - vector arithmetic
A_vector <- c(1, 2, 3)
B_vector <- c(4, 5, 6)

# Take the sum of A_vector and B_vector
total_vector <- A_vector + B_vector
  
# Print out total_vector
total_vector
[1] 5 7 9
Calculating total winnings (2)
# Assign to total_daily how much you won/lost on each day
total_daily <- poker_vector + roulette_vector
total_daily
   Monday   Tuesday Wednesday  Thursday    Friday 
      116      -100       120      -470       250 
Calculating total winnings (3)

R writes a function call as function_name(). For example, sum(vector1) returns the sum of all values in vector1. For commonly used built-in R functions, see: Built-in Functions

# Total winnings with poker: how much did you win at poker in total?
total_poker <- sum(poker_vector)

# Total winnings with roulette: how much did you win at roulette in total?
total_roulette <- sum(roulette_vector)

# Total winnings overall: what is the net result for the week?
total_week <- total_poker + total_roulette

# Print out total_week
total_week
[1] -84
Comparing total winnings
# Calculate total gains for poker and roulette
total_poker <- sum(poker_vector)
total_roulette <-  sum(roulette_vector)

# Check if you realized higher total gains in poker than in roulette
total_poker > total_roulette
[1] TRUE

🌻 Comparison Operators

  • < for less than
  • > for greater than
  • <= for less than or equal to
  • >= for greater than or equal to
  • == for equal to each other
  • != for not equal to each other

🌻 The result of a comparison expression is a logical value: TRUE or FALSE.
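
Comparisons also work on whole vectors, one element at a time. The sketch below uses the poker results from earlier (the name `won_money` is illustrative); because TRUE counts as 1 and FALSE as 0, sum() turns the logical vector into a count.

```r
poker_vector <- c(140, -50, 20, -120, 240)

# Comparing a whole vector with 0 gives one TRUE/FALSE per element
won_money <- poker_vector > 0
won_money

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of winning days
sum(won_money)
```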


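🌷 Indexing extracts some of the elements of an object; R uses [] for indexing.

🌷 Indexing in R is very flexible. There are three ways to index:

  • Positional index: [integer vector], e.g. [c(3,5,10)], [2]
  • Name index: [character vector], e.g. [c("Monday","Friday")]
  • Conditional index: [logical vector], e.g. [poker_vector > 0]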
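
The three indexing styles listed above can be compared side by side. This is a minimal sketch on the named poker results; the variable names and the threshold of 100 are chosen here for illustration so that all three forms pick the same two days.

```r
poker_vector <- c(Monday = 140, Tuesday = -50, Wednesday = 20,
                  Thursday = -120, Friday = 240)

by_position  <- poker_vector[c(1, 5)]                 # integer vector
by_name      <- poker_vector[c("Monday", "Friday")]   # character vector
by_condition <- poker_vector[poker_vector > 100]      # logical vector

# All three select Monday (140) and Friday (240)
identical(by_position, by_name)
identical(by_name, by_condition)
```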


Vector selection: the good times - positional (integer) indexing
# Assign the poker results of Wednesday to the variable poker_wednesday. Using index notation `[]`
poker_wednesday <- poker_vector[3]
Vector selection: the good times (2) - positional (integer) indexing
# Assign the poker results of Tuesday, Wednesday and Thursday to the variable poker_midweek, using `[c(2,3,4)]`
poker_midweek <- poker_vector[c(2,3,4)]
Vector selection: the good times (3) - positional (integer) indexing

A vector of consecutive integers can be written as a:b; for example, 2:6 stands for c(2,3,4,5,6).

# Poker and roulette winnings from Monday to Friday:
# Assign to roulette_selection_vector the roulette results from Tuesday up to Friday; make use of `[2:5]`
roulette_selection_vector <- roulette_vector[2:5]
Vector selection: the good times (4) - name (character) indexing
# Poker and roulette winnings from Monday to Friday:
# Select poker results by names `[c("Monday", "Tuesday", "Wednesday")]`
poker_start <- poker_vector[c("Monday", "Tuesday", "Wednesday")]
poker_start
   Monday   Tuesday Wednesday 
      140       -50        20 

mean() computes the average of all values in a numeric vector.

# Calculate the average of the elements in poker_start by `mean()`
mean(poker_start)
[1] 36.67
Selection by comparison - Step 1

🌷 Conditional indexing is the most commonly used kind of indexing.

# Poker and roulette winnings from Monday to Friday:
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Which days did you make money on poker?
selection_vector <- poker_vector > 0
  
# Print out selection_vector
selection_vector  
   Monday   Tuesday Wednesday  Thursday    Friday 
     TRUE     FALSE      TRUE     FALSE      TRUE 

🌻 Comparison Operators

  • < for less than
  • > for greater than
  • <= for less than or equal to
  • >= for greater than or equal to
  • == for equal to each other
  • != not equal to each other
Selection by comparison - Step 2
# Select from poker_vector these days using the indexing vector `[selection_vector]`
poker_winning_days <- poker_vector[selection_vector]
poker_winning_days 
   Monday Wednesday    Friday 
      140        20       240 
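
The two steps can also be folded into one, with the comparison written directly inside the brackets. A short sketch under the same data, followed by the total won on those days:

```r
poker_vector <- c(Monday = 140, Tuesday = -50, Wednesday = 20,
                  Thursday = -120, Friday = 240)

# The comparison can be written directly inside the brackets
poker_winning_days <- poker_vector[poker_vector > 0]
poker_winning_days

# Total won on the winning days only
sum(poker_winning_days)
```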
Advanced selection
# Poker and roulette winnings from Monday to Friday:
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Which days did you make money on roulette?
selection_vector <- roulette_vector > 0

# Select from roulette_vector these days
roulette_winning_days <- roulette_vector[selection_vector]
roulette_winning_days
Wednesday    Friday 
      100        10 


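3. Matrices

Like a vector, a matrix requires all of its elements to have the same data type. A matrix, however, is a two-dimensional data structure: the elements of a vector are arranged along one dimension, while the elements of a matrix are arranged along two. In this chapter we first practice how to define and use matrices in R.

A matrix is a two-dimensional data object: a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.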

You can construct a matrix in R with the matrix() function. Consider the following example: matrix(1:9, byrow = TRUE, nrow = 3)

What’s a matrix?

matrix() turns a one-dimensional vector into a two-dimensional matrix.

# Construct a matrix with 3 rows containing the numbers 1 up to 9, filled row-wise.
matrix(1:9, byrow = TRUE, nrow = 3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
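
The effect of byrow is easiest to see by building the same data both ways. In this sketch (names `by_row` and `by_col` are illustrative), leaving byrow at its default of FALSE fills the matrix one column at a time instead.

```r
# With byrow = TRUE the numbers fill the matrix one row at a time
by_row <- matrix(1:6, nrow = 2, byrow = TRUE)
# With the default (byrow = FALSE) they fill one column at a time
by_col <- matrix(1:6, nrow = 2)

by_row[1, ]   # first row: 1 2 3
by_col[1, ]   # first row: 1 3 5
dim(by_row)   # 2 rows, 3 columns
```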
Analyze matrices, you shall

Build a box office matrix for the first three Star Wars films.

# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

# Create box_office
# Concatenate the 3 vectors `c(new_hope, empire_strikes, return_jedi)` 
# Then create a matrix by `matrix()`. Remember to specify `byrow` and `nrow`
box_office <- c(new_hope,empire_strikes,return_jedi)
box_office
[1] 461.0 314.4 290.5 247.9 309.3 165.8
# Construct star_wars_matrix
star_wars_matrix <- matrix(box_office, byrow = TRUE, nrow = 3)
  
# print out the matrix
star_wars_matrix
      [,1]  [,2]
[1,] 461.0 314.4
[2,] 290.5 247.9
[3,] 309.3 165.8
Naming a matrix

Every column and every row of a matrix can have a name.

# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

# Construct matrix
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)

# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
# Name the columns with region with `colnames()`
colnames(star_wars_matrix) = region

# Name the rows with titles  with `rownames()`
rownames(star_wars_matrix) = titles

# Print out star_wars_matrix
star_wars_matrix
                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8
Calculating the worldwide box office

Use the rowSums() function to compute the worldwide box office of the first three Star Wars films.

# Calculate worldwide box office figures for each movie with `rowSums()`
worldwide_vector <- rowSums(star_wars_matrix) 

worldwide_vector
             A New Hope The Empire Strikes Back      Return of the Jedi 
                  775.4                   538.4                   475.1 
Adding a column for the Worldwide box office

cbind() combines matrices and vectors column-wise. Use it to attach the worldwide box office vector worldwide_vector to the box office matrix star_wars_matrix, producing all_wars_matrix.

# Construct worldwide box office vector
# Bind the new variable worldwide_vector as a column to star_wars_matrix with `cbind()`
all_wars_matrix <- cbind(star_wars_matrix, worldwide_vector)

all_wars_matrix
                           US non-US worldwide_vector
A New Hope              461.0  314.4            775.4
The Empire Strikes Back 290.5  247.9            538.4
Return of the Jedi      309.3  165.8            475.1
Adding rows

rbind() combines matrices row-wise. We first build the box office matrix for the later three films in the series (star_wars_matrix2).

star_wars_matrix2 = matrix(
  c(474.5,  552.5, 310.7,  338.7, 380.3,  468.5),
  byrow = TRUE, nrow = 3)
rownames(star_wars_matrix2) = c(
  "The Phantom Menace","Attack of the Clones",
  "Revenge of the Sith") 
colnames(star_wars_matrix2) = c("US", "non-US")

star_wars_matrix2
                        US non-US
The Phantom Menace   474.5  552.5
Attack of the Clones 310.7  338.7
Revenge of the Sith  380.3  468.5

Then use rbind() to combine it with star_wars_matrix, storing the result in all_wars_matrix.

# Combine both Star Wars trilogies in one matrix with `rbind()`
all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2) 
  
all_wars_matrix  
                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8
The Phantom Menace      474.5  552.5
Attack of the Clones    310.7  338.7
Revenge of the Sith     380.3  468.5
The total box office revenue for the entire saga

colSums() computes the total US and non-US box office for the Star Wars series.

# Total revenue for US and non-US with `colSums()`
total_revenue_vector <- colSums(all_wars_matrix)
  
# Print out total_revenue_vector
total_revenue_vector  
    US non-US 
  2226   2088 
Selection of matrix elements 矩陣的索引

Indexing a two-dimensional object takes the form [row_index, column_index], which specifies the rows and columns to extract.

# Select the non-US revenue for all movies with index notation `[,2]`
non_us_all <- all_wars_matrix[,2]
  
# Average non-US revenue with `mean()` 
mean(non_us_all)
[1] 348

The average non-US box office of the Star Wars films.

# Select the non-US revenue for first two movies with index notation `[1:2,2]`
non_us_some <- all_wars_matrix[1:2,2]
  
# Average non-US revenue for first two movies with `mean()`
mean(non_us_some)
[1] 281.1

The average non-US box office of the first two Star Wars films.
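
Because the rows and columns carry names, the same [row_index, column_index] form also accepts names instead of positions. A minimal sketch, rebuilding the first trilogy's matrix so the example is self-contained (dimnames is the standard matrix() argument for setting both sets of names at once):

```r
star_wars_matrix <- matrix(
  c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("A New Hope", "The Empire Strikes Back", "Return of the Jedi"),
    c("US", "non-US")
  )
)

# A single cell, selected by row name and column name
star_wars_matrix["A New Hope", "US"]

# A whole row: leave the column index empty
star_wars_matrix["Return of the Jedi", ]
```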

A little arithmetic with matrices

Assuming a ticket costs $5 for every film in every region, we can estimate the number of visitors for each film from the box office matrix.

# Estimate the visitors, assuming ticket price is $5
visitors <- all_wars_matrix/5
  
# Print the estimate to the console
visitors
                           US non-US
A New Hope              92.20  62.88
The Empire Strikes Back 58.10  49.58
Return of the Jedi      61.86  33.16
The Phantom Menace      94.90 110.50
Attack of the Clones    62.14  67.74
Revenge of the Sith     76.06  93.70
A little arithmetic with matrices (2)

Suppose instead that ticket prices are not the same for every film and region. We then first need to build a ticket price matrix (ticket_prices_matrix).

ticket_prices_matrix = matrix(
  c(5,5,6,6,7,7,4,4,4.5,4.5,4.9,4.9), byrow = TRUE, nrow = 6,
  dimnames=list(
    rownames(all_wars_matrix),
    colnames(all_wars_matrix))
  ); ticket_prices_matrix
                         US non-US
A New Hope              5.0    5.0
The Empire Strikes Back 6.0    6.0
Return of the Jedi      7.0    7.0
The Phantom Menace      4.0    4.0
Attack of the Clones    4.5    4.5
Revenge of the Sith     4.9    4.9

Then estimate the number of visitors for each film in each region.

# Estimated number of visitors 
visitors <- all_wars_matrix / ticket_prices_matrix
visitors
                            US non-US
A New Hope               92.20  62.88
The Empire Strikes Back  48.41  41.32
Return of the Jedi       44.19  23.69
The Phantom Menace      118.62 138.12
Attack of the Clones     69.04  75.27
Revenge of the Sith      77.61  95.61

What is the average number of US visitors across the Star Wars films?

# US visitors 
us_visitors <- visitors[,1] 

# Average number of US visitors
mean(us_visitors)
[1] 75.01
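
The matrix arithmetic in this chapter is element-wise: / and * pair up cells in the same position. For contrast, the sketch below uses a small illustrative matrix `a` to show that matrix multiplication in the linear-algebra sense is a separate operator, %*%.

```r
a <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)

# * works element by element, like the visitor estimate with /
elementwise <- a * a
elementwise

# %*% is matrix multiplication from linear algebra
matrix_product <- a %*% a
matrix_product
```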


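4. Factors

Some data, such as a customer's sex ("male", "female") or a means of transport ("car", "train", "plane"), is presented as text, but its content is limited to a fixed set of categories rather than arbitrary strings. This kind of data is called categorical data, and R stores it in factors. Here we introduce how to define and use factor objects in R. In later units, when we compute grouped statistics, compare groups, or cross-tabulate, factors will be the main basis for grouping.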

Data often falls into a limited number of categories. For example, human hair color can be categorized as black, brown, blond, red, grey, or white—and perhaps a few more options for people who color their hair. In R, categorical data is stored in factors. Factors are very important in data analysis, so start learning how to create, subset, and compare them now.

What’s a factor and why would you use it?
# Assign to variable theory the value "factors"
theory = "factors"
What’s a factor and why would you use it? (2)
# Create `sex_vector`
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
sex_vector
[1] "Male"   "Female" "Female" "Male"   "Male"  

sex_vector is a character vector.

The factor() function converts a character or numeric object into a factor object.

# Convert `sex_vector` to a factor
factor_sex_vector <- factor(sex_vector)

# Print out factor_sex_vector
factor_sex_vector
[1] Male   Female Female Male   Male  
Levels: Female Male

After the conversion, factor_sex_vector is a factor vector.

🌻 When a factor is printed, the categories it contains are listed below the values (Levels:).
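
The levels can also be retrieved programmatically rather than read off the printout. This short sketch uses two base R functions: levels() returns the categories, and table() counts how many values fall into each one.

```r
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex_vector <- factor(sex_vector)

# levels() returns the categories as a character vector
levels(factor_sex_vector)

# table() counts how many values fall into each category
table(factor_sex_vector)
```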


What’s a factor and why would you use it? (3)

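In an ordinary factor, the categories have no ordering. Although R lists the levels alphabetically when printing, that print order says nothing about which category is larger or smaller.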

# Animals
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector
[1] Elephant Giraffe  Donkey   Horse   
Levels: Donkey Elephant Giraffe Horse

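If we want the categories to be ordered, we need to add the order = TRUE argument when calling factor() (R matches order to the full argument name, ordered).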

# Temperature
temperature_vector <- c("High", "Low", "High","Low", "Medium")
factor_temperature_vector <- factor(
  temperature_vector, order = TRUE, 
  levels = c("Low", "Medium", "High"))
factor_temperature_vector
[1] High   Low    High   Low    Medium
Levels: Low < Medium < High

🌻 Levels: Low < Medium < High means that High is greater than Medium, and Medium is greater than Low.
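
Once a factor is ordered, its elements can be compared with < and >. The sketch below rebuilds the temperature factor (using the full argument name ordered) and compares the first element with the second; the name `first_is_warmer` is illustrative.

```r
factor_temperature_vector <- factor(
  c("High", "Low", "High", "Low", "Medium"),
  ordered = TRUE,
  levels = c("Low", "Medium", "High")
)

# Compare the first element (High) with the second (Low)
first_is_warmer <- factor_temperature_vector[1] > factor_temperature_vector[2]
first_is_warmer
```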


Factor levels
# Code to build factor_survey_vector