1. Intro to Basics

Take your first steps with R. In this chapter, you will learn how to use the console as a calculator and how to assign variables. You will also get to know the basic data types in R. Let’s get started.

How it works

In the editor on the right there is already some sample code. Can you see which lines are actual R code and which are comments?

# Calculate 3 + 4
3+4
[1] 7
# Calculate 6 + 12
6+12
[1] 18

🌻 Type an expression and R prints the result of evaluating it.

Arithmetic with R
# An addition
5 + 5 
[1] 10
# A subtraction
5 - 5 
[1] 0
# A multiplication
3 * 5
[1] 15
# A division
(5 + 5) / 2 
[1] 5

Besides the four basic arithmetic operations and parentheses, R has a few other commonly used operators.

# Exponentiation: Type 2^5 in the editor to calculate 2 to the power 5.
2^5
[1] 32
# Modulo: Type 28 %% 6 to calculate 28 modulo 6.
28%%6
[1] 4

🌻 Quick-R has a lot of useful information for beginners.

Variable assignment

We can use <- or = to store the result of an expression in a data object (also called a variable).

# Assign the value 42 to x
x <- 42
# Print out the value of the variable x
x
[1] 42
Variable assignment (2)
# Assign the value 5 to the variable my_apples
my_apples=5
# Print out the value of the variable my_apples
my_apples
[1] 5
Variable assignment (3)
# Assign a value to the variables my_apples and my_oranges
my_apples <- 5

# Assign to my_oranges the value 6.
my_oranges <- 6

# Add these two variables together
my_apples + my_oranges
[1] 11
# Create the variable my_fruit
my_fruit = my_apples + my_oranges

We can use data objects inside an expression and store the result in another data object.

Apples and oranges

Arithmetic operators work on numeric values, not on character strings.

# Assign a value to the variable my_apples
my_apples <- 5 

# Fix the assignment of my_oranges, so that it can be added with `my_apples`
# my_oranges <- "six"
my_oranges <- 6

# Create the variable my_fruit and print it out
my_fruit <- my_apples + my_oranges 
my_fruit
[1] 11
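
To see what goes wrong with the commented-out "six" assignment, here is a minimal sketch. It wraps the addition in tryCatch(), which is not part of the exercise and is used here only so that the script keeps running after the error.

```r
# Adding a character string to a number is an error in R.
# tryCatch() catches the error and returns its message instead of stopping.
result <- tryCatch(
  5 + "six",
  error = function(e) conditionMessage(e)
)
result  # the error message, as a character string
```
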
Basic datatypes in R
# Change my_numeric to be 42
my_numeric <- 42

# Change my_character to be "universe"
my_character <- "universe"

# Change my_logical to be FALSE
my_logical <- FALSE

🌻 Commonly used R data types

  • Integer (integer) and real number (numeric)
  • Text, string (character)
  • Category (factor)
  • Logical (logical/boolean)
  • Date (Date) and time (POSIXct, …)


What’s that data type?
# Declare variables of different types
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE 

# Check class of my_numeric
class(my_numeric)
[1] "numeric"
# Check class of my_character
class(my_character)
[1] "character"
# Check class of my_logical
class(my_logical)
[1] "logical"
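
The checks above cover numeric, character and logical values. As a small sketch, the same class() call can be tried on the other common types (integer, factor, Date, POSIXct); the example values and dates are made up for illustration.

```r
# class() on the other common data types
class(42L)                    # "integer": the L suffix marks an integer literal
class(factor("Male"))         # "factor"
class(as.Date("2024-01-31"))  # "Date"

# A date-time value carries two classes: "POSIXct" and "POSIXt"
class(as.POSIXct("2024-01-31 12:00:00", tz = "UTC"))

# is.numeric() is TRUE for both integer and numeric (double) values
is.numeric(42L)
is.numeric(42)
```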


2. Vectors

A vector is the most basic data structure in R. A vector object is a sequence of values that all have the same data type.

We take you on a trip to Vegas, where you will learn how to analyze your gambling results using vectors in R. After completing this chapter, you will be able to create vectors in R, name them, select elements from them, and compare different vectors.

Create a vector
# Assign the value "Go!" to the variable `vegas`. Remember: R is case sensitive!
vegas <- "Go!"
Create a vector (2)

In R, you create a vector with the combine function c(). You place the vector elements separated by a comma between the parentheses. For example:

numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")

🌻 Use the c() function to create a vector.

🌻 All elements of a vector have the same data type; if you mix types, c() converts them to a common type.

We can use the c() function to combine values of the same type into a vector,
and then use <- or = to give the vector object a name.

vector1 <- c(1.2, 3.5, 15.2)            # `vector1` is a numeric vector
vector2 = c("Alice", "Bob", "Cindy")    # `vector2` is a character vector
# Complete the code so that `boolean_vector` contains the three elements `TRUE`, `FALSE` and `TRUE` (in that order).
boolean_vector <- c(TRUE, FALSE, TRUE)
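
As a quick sketch of what "same data type" means in practice: mixing types inside c() does not fail; R silently converts everything to the most general type present. The names mixed_1 and mixed_2 are made up for this illustration.

```r
# Logical values mixed with numbers become numbers (TRUE -> 1, FALSE -> 0)
mixed_1 <- c(1, TRUE, FALSE)
mixed_1
class(mixed_1)

# As soon as one element is text, every element becomes text
mixed_2 <- c(1, "a", TRUE)
mixed_2
class(mixed_2)
```
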
Create a vector (3)
# Poker winnings (or losses) from Monday to Friday
poker_vector <- c(140, -50, 20, -120, 240)
# Roulette winnings: Monday lost $24, Tuesday lost $50,
# Wednesday won $100, Thursday lost $350, and Friday won $10.
roulette_vector <- c(-24,-50,100,-350,10)
Naming vector elements

Not only can an object have a name; each element (sub-object) inside an object can have a name too.

poker_vector
[1]  140  -50   20 -120  240

names() can be used to assign a name to each element of an object (here poker_vector).

# Assign days as names of poker_vector
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
poker_vector
   Monday   Tuesday Wednesday  Thursday    Friday 
      140       -50        20      -120       240 

🌻 As shown above, names make objects (and their elements) easier to read.

# Assign days as names of roulette_vector
names(roulette_vector) = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
roulette_vector
   Monday   Tuesday Wednesday  Thursday    Friday 
      -24       -50       100      -350        10 
Naming a vector (2)
# Poker winnings from Monday to Friday
poker_vector <- c(140, -50, 20, -120, 240)

# Roulette winnings from Monday to Friday
roulette_vector <- c(-24, -50, 100, -350, 10)

# The variable days_vector
# Storing the names of the days, Monday to Friday, in `days_vector` up front
# makes the code below more concise
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
# Assign the names of the days to roulette_vector and poker_vector
names(poker_vector) <- days_vector  
names(roulette_vector) <- days_vector
Calculating total winnings: vector arithmetic
A_vector <- c(1, 2, 3)
B_vector <- c(4, 5, 6)

# Take the sum of A_vector and B_vector
total_vector <- A_vector + B_vector
  
# Print out total_vector
total_vector
[1] 5 7 9
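
The addition above is carried out element by element. A short sketch of the same rule with other operators, plus what happens when one operand is a single number (the name doubled is made up here):

```r
A_vector <- c(1, 2, 3)
B_vector <- c(4, 5, 6)

# Operators work element by element
A_vector * B_vector   # 1*4, 2*5, 3*6
A_vector - B_vector   # 1-4, 2-5, 3-6

# A single number is reused for every element of the vector
doubled <- A_vector * 2
doubled
```
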
Calculating total winnings (2)
# Assign to total_daily how much you won/lost on each day
total_daily <- poker_vector + roulette_vector
total_daily
   Monday   Tuesday Wednesday  Thursday    Friday 
      116      -100       120      -470       250 
Calculating total winnings (3)

R writes a function call as function_name(); for example, sum(vector1) returns the sum of all the values in vector1. For commonly used built-in R functions, see: Built-in Functions

# Total winnings with poker
total_poker <- sum(poker_vector)

# Total winnings with roulette
total_roulette <- sum(roulette_vector)

# Total winnings overall for the week
total_week <- total_poker + total_roulette

# Print out total_week
total_week
[1] -84
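
sum() is only one of R's built-in summary functions. As a sketch, a few others applied to the same poker results:

```r
poker_vector <- c(140, -50, 20, -120, 240)

length(poker_vector)  # number of elements
max(poker_vector)     # best day
min(poker_vector)     # worst day
mean(poker_vector)    # average daily result
range(poker_vector)   # smallest and largest value together
```
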
Comparing total winnings
# Calculate total gains for poker and roulette
total_poker <- sum(poker_vector)
total_roulette <-  sum(roulette_vector)

# Check if you realized higher total gains in poker than in roulette
total_poker > total_roulette
[1] TRUE

🌻 Comparison Operators

  • < for less than
  • > for greater than
  • <= for less than or equal to
  • >= for greater than or equal to
  • == for equal to each other
  • != not equal to each other

🌻 A comparison expression evaluates to a logical value: TRUE or FALSE.
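
Comparison operators also work element by element on whole vectors. A small sketch using the two winnings vectors (the name poker_better is made up for this illustration):

```r
poker_vector    <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)

# Each day's poker result is compared with the same day's roulette result
poker_better <- poker_vector > roulette_vector
poker_better

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(poker_better)

# Days on which both games gave exactly the same result
poker_vector == roulette_vector
```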


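🌷 An index selects some of the elements inside an object; R uses [] for indexing.

🌷 Indexing in R is very flexible. There are three ways to index:

  • Positional index: [integer vector], e.g. [c(3,5,10)], [2]
  • Name index: [character vector], e.g. [c("Monday","Friday")]
  • Conditional index: [logical vector], e.g. [poker_vector > 0]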


Vector selection: the good times - positional (integer) index
# Assign the poker results of Wednesday to the variable poker_wednesday. Using index notation `[]`
poker_wednesday <- poker_vector[3]
Vector selection: the good times (2) - positional (integer) index
# Assign the poker results of Tuesday, Wednesday and Thursday to the variable poker_midweek, using `[c(2,3,4)]`
poker_midweek <- poker_vector[c(2,3,4)]
Vector selection: the good times (3) - positional (integer) index

A vector of consecutive integers can be written as a:b; for example, 2:6 stands for c(2,3,4,5,6).

# Poker and roulette winnings from Monday to Friday:
# Assign to roulette_selection_vector the roulette results from Tuesday up to Friday; make use of `[2:5]`
roulette_selection_vector <- roulette_vector[2:5]
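
A brief sketch of the a:b shorthand on its own, checking that it selects the same elements as the written-out index:

```r
# a:b builds the consecutive integers from a to b
2:6
all(2:6 == c(2, 3, 4, 5, 6))   # same values
class(2:6)                     # stored as integers

# The sequence can also run downwards
5:1

roulette_vector <- c(-24, -50, 100, -350, 10)
# Tuesday to Friday, written two equivalent ways
identical(roulette_vector[2:5], roulette_vector[c(2, 3, 4, 5)])
```
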
Vector selection: the good times (4) - name (character) index
# Poker and roulette winnings from Monday to Friday:
# Select poker results by names `[c("Monday", "Tuesday", "Wednesday")]`
poker_start <- poker_vector[c("Monday", "Tuesday", "Wednesday")]
poker_start
   Monday   Tuesday Wednesday 
      140       -50        20 

mean() computes the average of all the values in a numeric vector.

# Calculate the average of the elements in poker_start by `mean()`
mean(poker_start)
[1] 36.67
Selection by comparison - Step 1

🌷 Conditional indexing is the most commonly used kind of indexing.

# Poker and roulette winnings from Monday to Friday:
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Which days did you make money on poker?
selection_vector <- poker_vector > 0
  
# Print out selection_vector
selection_vector  
   Monday   Tuesday Wednesday  Thursday    Friday 
     TRUE     FALSE      TRUE     FALSE      TRUE 

Selection by comparison - Step 2
# Select from poker_vector these days using the indexing vector `[selection_vector]`
poker_winning_days <- poker_vector[selection_vector]
poker_winning_days 
   Monday Wednesday    Friday 
      140        20       240 
Advanced selection
# Poker and roulette winnings from Monday to Friday:
poker_vector <- c(140, -50, 20, -120, 240)
roulette_vector <- c(-24, -50, 100, -350, 10)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Which days did you make money on roulette?
selection_vector <- roulette_vector > 0

# Select from roulette_vector these days
roulette_winning_days <- roulette_vector[selection_vector]
roulette_winning_days
Wednesday    Friday 
      100        10 
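
The two steps (build a logical vector, then index with it) can be combined. A minimal sketch using the poker results; which() is an extra base R function not used in the exercises above:

```r
poker_vector <- c(140, -50, 20, -120, 240)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# The comparison can go straight inside the brackets
poker_vector[poker_vector > 0]

# which() turns the logical vector into positions
which(poker_vector > 0)

# Total won on winning days only
sum(poker_vector[poker_vector > 0])
```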


3. Matrices

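Like a vector, a matrix requires all of its elements to have the same data type. A matrix, however, is a two-dimensional data structure: the elements of a vector are arranged in one dimension, while the elements of a matrix are arranged in two. In this chapter we first practice how to define and use matrices in R.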

A matrix is a two-dimensional data object: a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.

You can construct a matrix in R with the matrix() function. Consider the following example: matrix(1:9, byrow = TRUE, nrow = 3)

What’s a matrix?

matrix() turns a one-dimensional vector into a two-dimensional matrix.

# Construct a matrix with 3 rows containing the numbers 1 up to 9, filled row-wise.
matrix(1:9, byrow=T, nrow=3)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
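
To make the effect of byrow visible, here is a small sketch that builds the same numbers both ways (the names by_col and by_row are made up for this illustration):

```r
# Default (byrow = FALSE): fill column by column
by_col <- matrix(1:9, nrow = 3)
by_col

# byrow = TRUE: fill row by row
by_row <- matrix(1:9, nrow = 3, byrow = TRUE)
by_row

# One is the transpose of the other
identical(by_row, t(by_col))
dim(by_row)   # number of rows, number of columns
```
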
Analyze matrices, you shall

Build a box office matrix for the first three Star Wars movies.

# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

# Create box_office
# Concatenate the 3 vectors `c(new_hope, empire_strikes, return_jedi)` 
# Then create a matrix by `matrix()`. Remember to specify `byrow` and `nrow`
box_office <- c(new_hope,empire_strikes,return_jedi)
box_office
[1] 461.0 314.4 290.5 247.9 309.3 165.8
# Construct star_wars_matrix
star_wars_matrix <- matrix(box_office, byrow=T, nrow=3)
  
# print out the matrix
star_wars_matrix
      [,1]  [,2]
[1,] 461.0 314.4
[2,] 290.5 247.9
[3,] 309.3 165.8
Naming a matrix

Each column and each row of a matrix can have a name.

# Box office Star Wars (in millions!)
new_hope <- c(460.998, 314.4)
empire_strikes <- c(290.475, 247.900)
return_jedi <- c(309.306, 165.8)

# Construct matrix
star_wars_matrix <- matrix(c(new_hope, empire_strikes, return_jedi), nrow = 3, byrow = TRUE)

# Vectors region and titles, used for naming
region <- c("US", "non-US")
titles <- c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
# Name the columns with region with `colnames()`
colnames(star_wars_matrix) = region

# Name the rows with titles  with `rownames()`
rownames(star_wars_matrix) = titles

# Print out star_wars_matrix
star_wars_matrix
                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8
Calculating the worldwide box office

Use the rowSums() function to compute the worldwide box office of the first three Star Wars movies.

# Calculate worldwide box office figures for each movies with `rowSums()`
worldwide_vector <- rowSums(star_wars_matrix) 

worldwide_vector
             A New Hope The Empire Strikes Back      Return of the Jedi 
                  775.4                   538.4                   475.1 
Adding a column for the Worldwide box office

cbind() combines matrices column-wise. Use it to bind the worldwide box office vector worldwide_vector to the box office matrix, giving all_wars_matrix.

# Construct worldwide box office vector
# Bind the new variable worldwide_vector as a column to star_wars_matrix with `cbind()`
all_wars_matrix <-cbind(star_wars_matrix,worldwide_vector)  

all_wars_matrix
                           US non-US worldwide_vector
A New Hope              461.0  314.4            775.4
The Empire Strikes Back 290.5  247.9            538.4
Return of the Jedi      309.3  165.8            475.1
Adding rows

rbind() combines matrices row-wise. We first build the box office matrix for the other three Star Wars movies (star_wars_matrix2).

star_wars_matrix2 = matrix(
  c(474.5,  552.5, 310.7,  338.7, 380.3,  468.5),
  byrow=T, nrow=3)
rownames(star_wars_matrix2) = c(
  "The Phantom Menace","Attack of the Clones",
  "Revenge of the Sith") 
colnames(star_wars_matrix2)=c("US", "non-US")

star_wars_matrix2
                        US non-US
The Phantom Menace   474.5  552.5
Attack of the Clones 310.7  338.7
Revenge of the Sith  380.3  468.5

Then use rbind() to combine it with star_wars_matrix into all_wars_matrix.

# Combine both Star Wars trilogies in one matrix with `rbind()`
all_wars_matrix <- rbind(star_wars_matrix, star_wars_matrix2) 
  
all_wars_matrix  
                           US non-US
A New Hope              461.0  314.4
The Empire Strikes Back 290.5  247.9
Return of the Jedi      309.3  165.8
The Phantom Menace      474.5  552.5
Attack of the Clones    310.7  338.7
Revenge of the Sith     380.3  468.5
The total box office revenue for the entire saga

colSums() computes the total US and non-US box office for the Star Wars saga.

# Total revenue for US and non-US with `colSums()`
total_revenue_vector <- colSums(all_wars_matrix)
  
# Print out total_revenue_vector
total_revenue_vector  
    US non-US 
  2226   2088 
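
A compact sketch of the row and column summaries on a tiny matrix with made-up numbers. rowMeans() and colMeans() are the matching base R functions for averages; m and with_total are hypothetical names.

```r
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, byrow = TRUE,
            dimnames = list(c("a", "b", "c"), c("US", "non-US")))

rowSums(m)    # one total per row
colSums(m)    # one total per column
rowMeans(m)   # the same idea for averages
colMeans(m)

# colSums() and rbind() together append a totals row
with_total <- rbind(m, Total = colSums(m))
with_total
```
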
Selection of matrix elements: matrix indexing

Indexing a two-dimensional object takes the form [row_index,column_index], which specifies the rows and columns to extract.

# Select the non-US revenue for all movies with index notation `[,2]`
non_us_all <- all_wars_matrix[,2]
  
# Average non-US revenue with `mean()` 
mean(non_us_all)
[1] 348

This is the average non-US box office of the Star Wars movies.

# Select the non-US revenue for first two movies with index notation `[1:2,2]`
non_us_some <- all_wars_matrix[1:2,2]
  
# Average non-US revenue for first two movies with `mean()`
mean(non_us_some)
[1] 281.1

This is the average non-US box office of the first two Star Wars movies.
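
Rows and columns of a named matrix can also be selected by name. The sketch below uses a two-movie excerpt of the box office figures; drop = FALSE is a standard argument of [ that keeps the result a matrix.

```r
m <- matrix(c(461.0, 314.4, 290.5, 247.9), nrow = 2, byrow = TRUE,
            dimnames = list(c("A New Hope", "The Empire Strikes Back"),
                            c("US", "non-US")))

m["A New Hope", "non-US"]   # one cell, by row name and column name
m[, "US"]                    # a whole column comes back as a named vector
m[1, ]                       # a whole row, by position
m[, "US", drop = FALSE]      # drop = FALSE keeps a one-column matrix
```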

A little arithmetic with matrices

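Suppose the ticket price for every movie in every region is $5. We can then estimate the audience size for each movie from the box office matrix.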

# Estimate the visitors, assuming ticket price is $5
visitors <- all_wars_matrix/5
  
# Print the estimate to the console
visitors
                           US non-US
A New Hope              92.20  62.88
The Empire Strikes Back 58.10  49.58
Return of the Jedi      61.86  33.16
The Phantom Menace      94.90 110.50
Attack of the Clones    62.14  67.74
Revenge of the Sith     76.06  93.70
A little arithmetic with matrices (2)

Suppose ticket prices are not the same for every movie and region. We then first need to build a ticket price matrix (ticket_prices_matrix).

ticket_prices_matrix = matrix(
  c(5,5,6,6,7,7,4,4,4.5,4.5,4.9,4.9), byrow=T, nrow=6,
  dimnames=list(
    rownames(all_wars_matrix),
    colnames(all_wars_matrix))
  ); ticket_prices_matrix
                         US non-US
A New Hope              5.0    5.0
The Empire Strikes Back 6.0    6.0
Return of the Jedi      7.0    7.0
The Phantom Menace      4.0    4.0
Attack of the Clones    4.5    4.5
Revenge of the Sith     4.9    4.9

Then estimate the audience size for each movie in each region.

# Estimated number of visitors 
visitors <- all_wars_matrix / ticket_prices_matrix
visitors
                            US non-US
A New Hope               92.20  62.88
The Empire Strikes Back  48.41  41.32
Return of the Jedi       44.19  23.69
The Phantom Menace      118.62 138.12
Attack of the Clones     69.04  75.27
Revenge of the Sith      77.61  95.61

What is the average US audience size across the Star Wars movies?

# US visitors 
us_visitors <- visitors[,1] 

# Average number of US visitors
mean(us_visitors)
[1] 75.01
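
Dividing one matrix by another, as done above, works cell by cell. A small sketch with made-up numbers (revenue and prices are hypothetical names):

```r
revenue <- matrix(c(100, 200, 300, 400), nrow = 2)
prices  <- matrix(c(5, 10, 5, 10), nrow = 2)

revenue / prices    # each cell divided by the cell in the same position
revenue / 5         # a single number is applied to every cell
revenue * prices    # * is also cell by cell; matrix multiplication is %*%
```
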


4. Factors

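Some data, such as a customer's sex ("male", "female") or a means of transport ("car", "train", "plane"), is presented as text, but its content is limited to a fixed set of categories rather than arbitrary strings. This kind of data is called categorical data, and R stores it in a factor. Here we first introduce how to define and use factor objects in R. In later units, factors will be the main basis for grouping when we compute grouped statistics, compare groups, or build cross-tabulations.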

Data often falls into a limited number of categories. For example, human hair color can be categorized as black, brown, blond, red, grey, or white—and perhaps a few more options for people who color their hair. In R, categorical data is stored in factors. Factors are very important in data analysis, so start learning how to create, subset, and compare them now.

What’s a factor and why would you use it?
# Assign to variable theory the value "factors"
theory = "factors"
What’s a factor and why would you use it? (2)
# Create `sex_vector`
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
sex_vector
[1] "Male"   "Female" "Female" "Male"   "Male"  

sex_vector is a character vector.

The factor() function converts a character or numeric object into a factor object.

# Convert `sex_vector` to a factor
factor_sex_vector <- factor(sex_vector)

# Print out factor_sex_vector
factor_sex_vector
[1] Male   Female Female Male   Male  
Levels: Female Male

After the conversion, factor_sex_vector is a factor vector.

🌻 When a factor object is printed, the categories it contains are listed below the values (Levels:).
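
As a sketch of what a factor holds internally, the base R functions levels(), nlevels(), as.integer() and as.character() can be applied to the factor built above:

```r
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex_vector <- factor(sex_vector)

levels(factor_sex_vector)        # the distinct categories, sorted alphabetically
nlevels(factor_sex_vector)       # how many categories there are
as.integer(factor_sex_vector)    # each value stored as a position in levels()
as.character(factor_sex_vector)  # back to plain text
```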


What’s a factor and why would you use it? (3)

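In an ordinary factor, the categories have no ordering. R lists the levels in alphabetical order when printing, but that printing order does not mean that one level is greater or smaller than another.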

# Animals
animals_vector <- c("Elephant", "Giraffe", "Donkey", "Horse")
factor_animals_vector <- factor(animals_vector)
factor_animals_vector
[1] Elephant Giraffe  Donkey   Horse   
Levels: Donkey Elephant Giraffe Horse

If we want the categories to have an order, we need to pass the ordered = TRUE argument when calling factor().

# Temperature
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature_vector <- factor(
  temperature_vector, ordered = TRUE, 
  levels = c("Low", "Medium", "High"))
factor_temperature_vector
[1] High   Low    High   Low    Medium
Levels: Low < Medium < High

🌻 Levels: Low < Medium < High means that High is greater than Medium, and Medium is greater than Low.
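
Once the levels are ordered, comparisons and max() become meaningful. A minimal sketch on the temperature factor; is.ordered() is a base R helper not used elsewhere in these notes.

```r
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature_vector <- factor(
  temperature_vector, ordered = TRUE,
  levels = c("Low", "Medium", "High"))

is.ordered(factor_temperature_vector)

# Compare every value with one of the levels
factor_temperature_vector > "Low"

# Keep only the readings that are Medium or higher
factor_temperature_vector[factor_temperature_vector >= "Medium"]

# The highest level that occurs in the data
max(factor_temperature_vector)
```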


Factor levels
# Code to build factor_survey_vector
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
factor_survey_vector
[1] M F F M M
Levels: F M

levels() can be used to change the names of the levels.

# Specify the levels of factor_survey_vector
levels(factor_survey_vector) <- c("Female","Male")
factor_survey_vector
[1] Male   Female Female Male   Male  
Levels: Female Male
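
Assigning to levels() replaces the names in the existing level order ("F" before "M" here), so the new names must be given in that same order. As an alternative sketch, the levels and labels arguments of factor() state the mapping explicitly:

```r
survey_vector <- c("M", "F", "F", "M", "M")

# Check the current order before renaming: "F" comes before "M"
levels(factor(survey_vector))

# Stating levels and labels side by side avoids relying on that order
factor_survey_vector <- factor(
  survey_vector,
  levels = c("F", "M"),
  labels = c("Female", "Male"))
factor_survey_vector
```
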
Summarizing a factor

Take a summary() of the survey_vector and factor_survey_vector. Interpret the results of both vectors. Are they both equally useful in this case?

For the factor vector factor_survey_vector, summary() counts how many times each category occurs, but for the character vector (survey_vector) it is not useful.

# Generate summary for survey_vector
summary(survey_vector)
   Length     Class      Mode 
        5 character character 
# Generate summary for factor_survey_vector
summary(factor_survey_vector)
Female   Male 
     2      3 
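
Counting categories is also what the base R function table() does, and it works on the character vector too. A brief sketch (the name counts is made up here):

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector, levels = c("F", "M"),
                               labels = c("Female", "Male"))

counts <- table(factor_survey_vector)
counts
counts[["Male"]]      # a single count, by name
prop.table(counts)    # shares instead of counts

# table() also counts a plain character vector
table(survey_vector)
```
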
Battle of the sexes
# Build factor_survey_vector with clean levels
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)
levels(factor_survey_vector) <- c("Female", "Male")

# Male
male <- factor_survey_vector[1]

# Female
female <- factor_survey_vector[2]

# Battle of the sexes: Male 'larger' than female?
male > female
Warning in Ops.factor(male, female): '>' not meaningful for factors
[1] NA

🌷 Because we did not specify ordered=TRUE when creating factor_survey_vector, its elements cannot be compared by size!

Ordered factors

Define speed_vector as a character vector with 5 entries, one for each analyst. Each entry should be either “slow”, “medium”, or “fast”. Use the list below:

  • Analyst 1 is medium,
  • Analyst 2 is slow,
  • Analyst 3 is slow,
  • Analyst 4 is medium and
  • Analyst 5 is fast.

We use speed_vector to record the working speed of five analysts.

# Create speed_vector
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
Ordered factors (2)

Then use factor(…, ordered=TRUE) to convert it into an ordered factor vector (ordinal factor vector).

# Convert speed_vector to ordered factor vector
factor_speed_vector <- factor(
  speed_vector,levels=c("slow","medium","fast"),ordered=TRUE)

# Print factor_speed_vector
factor_speed_vector
[1] medium slow   slow   medium fast  
Levels: slow < medium < fast
Comparing ordered factors
# Factor value for second data analyst
da2 <- factor_speed_vector[2]
# Factor value for fifth data analyst
da5 <- factor_speed_vector[5]

# Is data analyst 2 faster than data analyst 5?
da2 > da5
[1] FALSE

Now we can use comparison operators (such as >) to compare its elements by size!



5. Data Frame

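Like a matrix, a data frame is a two-dimensional data structure. All elements of a matrix must have the same data type, but a data frame has no such restriction: its columns can have different data types. In later units we will see that the data frame is the most common and most important data structure in business data analysis. In this chapter we first introduce how to define and use data frames in R.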


What’s a data frame?

R has a built-in data frame, mtcars. As shown below, it records various attributes of a number of car models.

data(mtcars)
mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Definitions of the attributes:

# ?mtcars
Quick, have a look at your dataset

Look at the first few records of this data frame.

# Call head() on mtcars
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

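🌻 Business data usually comes in tabular form. Typically each row represents one object of analysis (a car model), and each column (mpg for fuel economy, cyl for number of cylinders, …) represents one attribute of that object.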


Have a look at the structure

str() shows the type of a data object (data frame) and its internal structure.

# Investigate the structure of mtcars with `str()`
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
Creating a data frame

Record the attributes of the eight planets in the data frame planets_df.

# Definition of vectors
name <- c("Mercury", "Venus", "Earth", 
          "Mars", "Jupiter", "Saturn", 
          "Uranus", "Neptune")
type <- c("Terrestrial planet", 
          "Terrestrial planet", 
          "Terrestrial planet", 
          "Terrestrial planet", "Gas giant", 
          "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 
              11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 
              0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Create a data frame from the vectors
planets_df <- data.frame(name,type,diameter,rotation,rings)
  
planets_df
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
3   Earth Terrestrial planet    1.000     1.00 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
5 Jupiter          Gas giant   11.209     0.41  TRUE
6  Saturn          Gas giant    9.449     0.43  TRUE
7  Uranus          Gas giant    4.007    -0.72  TRUE
8 Neptune          Gas giant    3.883     0.67  TRUE
Creating a data frame (2)
# Check the structure of planets_df with `str()`
str(planets_df)
'data.frame':   8 obs. of  5 variables:
 $ name    : chr  "Mercury" "Venus" "Earth" "Mars" ...
 $ type    : chr  "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" "Terrestrial planet" ...
 $ diameter: num  0.382 0.949 1 0.532 11.209 ...
 $ rotation: num  58.64 -243.02 1 1.03 0.41 ...
 $ rings   : logi  FALSE FALSE FALSE FALSE TRUE TRUE ...
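
The chr columns in the output above reflect the default of data.frame() since R 4.0.0, which keeps text as character. As a sketch, the stringsAsFactors argument controls this explicitly; df_chr and df_fct are made-up names and only three planets are used to keep it short.

```r
name <- c("Mercury", "Venus", "Earth")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet")

# Keep text as character (the default since R 4.0.0)
df_chr <- data.frame(name, type, stringsAsFactors = FALSE)
class(df_chr$type)

# Convert text columns to factors while building the data frame
df_fct <- data.frame(name, type, stringsAsFactors = TRUE)
class(df_fct$type)
levels(df_fct$type)
```
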
Selection of data frame elements

Because a data frame is a two-dimensional structure, extracting part of it also takes two indices ([,]).

# Positional index: print out diameter of Mercury (row 1, column 3) with `[1,3]`
planets_df[1,3]
[1] 0.382
# Positional index: print out data for Mars (entire fourth row) with `[4,]`
planets_df[4,]
  name               type diameter rotation rings
4 Mars Terrestrial planet    0.532     1.03 FALSE
Selection of data frame elements (2)

Positional, name, and conditional indices can be mixed.

# Select first 5 values of diameter column
planets_df[1:5,"diameter"]
[1]  0.382  0.949  1.000  0.532 11.209
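
A few more combinations, sketched on a four-planet excerpt of the data (planets_small is a made-up name for this illustration):

```r
planets_small <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars"),
  diameter = c(0.382, 0.949, 1, 0.532),
  rings    = c(FALSE, FALSE, FALSE, FALSE)
)

# Positions for the rows, names for the columns
planets_small[2:3, c("name", "diameter")]

# A condition for the rows, a name for the column
planets_small[planets_small$diameter < 1, "name"]

# Selecting one column by position gives a vector
planets_small[, 2]
```
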
Only planets with rings

The $ symbol extracts a single column from a data frame. Note that a column of a data frame is actually a vector.

# Select the rings variable from planets_df
rings_vector <- planets_df$rings
  
# Print out rings_vector
rings_vector
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
Only planets with rings (2)

Because rings_vector is a logical vector, we can use it as a logical index.

# Adapt the code to select all columns for planets with rings
planets_df[rings_vector, ]
     name      type diameter rotation rings
5 Jupiter Gas giant   11.209     0.41  TRUE
6  Saturn Gas giant    9.449     0.43  TRUE
7  Uranus Gas giant    4.007    -0.72  TRUE
8 Neptune Gas giant    3.883     0.67  TRUE
Only planets with rings but shorter

subset() filters a data frame by a condition.

# Select planets with diameter < 1 with `subset(df, condition)`
subset(planets_df, subset = diameter < 1)
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
# The same kind of filter written as a logical index: planets with diameter > 1
planets_df[planets_df$diameter > 1, ]
     name      type diameter rotation rings
5 Jupiter Gas giant   11.209     0.41  TRUE
6  Saturn Gas giant    9.449     0.43  TRUE
7  Uranus Gas giant    4.007    -0.72  TRUE
8 Neptune Gas giant    3.883     0.67  TRUE
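
subset() and a logical index select the same rows. The sketch below compares the two on a five-planet excerpt and also shows the select argument of subset(); planets_small, by_subset and by_index are made-up names.

```r
planets_small <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars", "Jupiter"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209),
  rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE)
)

by_subset <- subset(planets_small, diameter < 1)
by_index  <- planets_small[planets_small$diameter < 1, ]
all.equal(by_subset, by_index)

# Conditions can be combined with & (and) and | (or);
# select = keeps only the named columns
subset(planets_small, diameter < 1 & !rings, select = c(name, diameter))
```
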
Sorting

order() returns an integer vector giving the positions of the values in sorted (ascending) order.

# Play around with the `order()` function in the console
order(c(50,10,40,20,30))
[1] 2 4 5 3 1
# The smallest value is `10`, at position 2
# The second smallest is `20`, at position 4
# ...
# The largest is `50`, at position 1
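
A short sketch relating order() to sort() on the same numbers, plus the decreasing argument of order():

```r
x <- c(50, 10, 40, 20, 30)

order(x)                      # positions that put x in ascending order
x[order(x)]                   # using them as an index sorts x
sort(x)                       # sort() returns the sorted values directly
order(x, decreasing = TRUE)   # positions for descending order
```
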
Sorting your data frame

We can order the eight planets by diameter (smallest to largest) and store the ordering vector in positions.

# Use order() to create order index by diameter
positions <- order(planets_df$diameter)

Using the ordering vector (positions) as an index amounts to sorting the data frame (by diameter, ascending).

# Use positions to sort planets_df with `df[order_index,]`
planets_df[positions,]
     name               type diameter rotation rings
1 Mercury Terrestrial planet    0.382    58.64 FALSE
4    Mars Terrestrial planet    0.532     1.03 FALSE
2   Venus Terrestrial planet    0.949  -243.02 FALSE
3   Earth Terrestrial planet    1.000     1.00 FALSE
8 Neptune          Gas giant    3.883     0.67  TRUE
7  Uranus          Gas giant    4.007    -0.72  TRUE
6  Saturn          Gas giant    9.449     0.43  TRUE
5 Jupiter          Gas giant   11.209     0.41  TRUE
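
The same pattern extends to descending order and to sorting by more than one column. A sketch on a four-planet excerpt; the group column and its values are hypothetical and exist only to show a two-key sort.

```r
planets_small <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars"),
  group    = c("B", "A", "B", "A"),   # hypothetical grouping column
  diameter = c(0.382, 0.949, 1, 0.532)
)

# Largest diameter first
largest_first <- planets_small[order(planets_small$diameter, decreasing = TRUE), ]
largest_first

# Sort by group first, then by diameter within each group
by_group <- planets_small[order(planets_small$group, planets_small$diameter), ]
by_group
```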


6. Lists

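A list is a more complex data structure: objects of essentially any structure can be placed, in order, in the same list. In this chapter we give a brief demonstration of how to create and work with lists; later course units explain list operations with worked examples.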

As opposed to vectors, lists can hold components of different types, just as your to-do lists can contain different categories of tasks. This chapter will teach you how to create, name, and subset these lists.

Lists, why would you need them?

Congratulations! At this point in the course you are already familiar with:

  • Vectors (one dimensional array): can hold numeric, character or logical values. The elements in a vector all have the same data type.
  • Matrices (two dimensional array): can hold numeric, character or logical values. The elements in a matrix all have the same data type.
  • Data frames (two-dimensional objects): can hold numeric, character or logical values. Within a column all elements have the same data type, but different columns can be of different data type.

Pretty sweet for an R newbie, right?
Lists, why would you need them? (2)

A list in R is similar to your to-do list at work or school: the different items on that list most likely differ in length, characteristic, and type of activity that has to be done.

A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.

You could say that a list is some kind of super data type: you can store practically any piece of information in it!

Creating a list

Put a vector (my_vector), a matrix (my_matrix), and a data frame (my_df) in the same list (my_list).

# Vector with numerics from 1 up to 10
my_vector <- 1:10 

# Matrix with numerics from 1 up to 9
my_matrix <- matrix(1:9, ncol = 3)

# First 10 elements of the built-in data frame mtcars
my_df <- mtcars[1:10,]

# Construct list with these different elements:
my_list <- list(my_vector,my_matrix,my_df)

my_list
[[1]]
 [1]  1  2  3  4  5  6  7  8  9 10

[[2]]
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

[[3]]
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

my_list contains three components, but these components have no names.

Creating a named list

We can give each component a name when creating my_list.

# Adapt list() call to change the components names to `vec`, `mat` and `df` 
my_list <- list(vec=my_vector, mat=my_matrix, df=my_df)

# Print out my_list
my_list
$vec
 [1]  1  2  3  4  5  6  7  8  9 10

$mat
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

$df
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Then we can extract a component of the list by name.

my_list$mat
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
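
Besides $, components can be extracted with double brackets, by name or by position. The sketch below uses a shortened two-component version of my_list and shows how single brackets differ:

```r
my_list <- list(vec = 1:10, mat = matrix(1:9, ncol = 3))

my_list$vec          # by name with $
my_list[["vec"]]     # by name with [[ ]]
my_list[[1]]         # by position with [[ ]]

# Single brackets return a shorter list, not the component itself
class(my_list["vec"])
class(my_list[["vec"]])

names(my_list)
length(my_list)
```
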
Creating a named list (2)

Put the data for a movie ("The Shining") in a list (shining_list).

# The variables mov, act and rev are available
mov="The Shining"
act = c("Jack Nicholson","Shelley Duvall","Danny Lloyd",
        "Scatman Crothers","Barry Nelson")
rev = data.frame(
  scores = c(4.5,4.0,5.0),
  sources = c("IMDb1","IMDb2","IMDb3"),
  comments = c(
    "Best Horror Film I Have Ever Seen",
    "A truly brilliant and scary film from Stanley Kubrick",
    "A masterpiece of psychological horror"))

# Finish the code to build shining_list
shining_list <- list(
  moviename = mov,actors=act,reviews=rev)

shining_list
$moviename
[1] "The Shining"

$actors
[1] "Jack Nicholson"   "Shelley Duvall"   "Danny Lloyd"      "Scatman Crothers"
[5] "Barry Nelson"    

$reviews
  scores sources                                              comments
1    4.5   IMDb1                     Best Horror Film I Have Ever Seen
2    4.0   IMDb2 A truly brilliant and scary film from Stanley Kubrick
3    5.0   IMDb3                 A masterpiece of psychological horror
Selecting elements from a list
# Print out the vector representing the actors
shining_list$actors
[1] "Jack Nicholson"   "Shelley Duvall"   "Danny Lloyd"      "Scatman Crothers"
[5] "Barry Nelson"    
# Print the second element of the vector representing the actors
shining_list$actors[2]
[1] "Shelley Duvall"
Creating a new list for another movie

Put the data for another movie ("The Departed") in another list (departed_list).

# define the comments and scores vectors
scores <- c(4.6, 5, 4.8, 5, 4.2)
comments <- c("I would watch it again", "Amazing!", "I liked it", 
              "One of the best movies","Fascinating plot")
movie_title = "The Departed"
movie_actors = c( "Leonardo DiCaprio","Matt Damon","Jack Nicholson",
                  "Mark Wahlberg","Vera Farmiga","Martin Sheen")

# Save the average of the scores vector as avg_review
avg_review = mean(scores)

# Combine scores and comments into the reviews_df data frame
reviews_df = data.frame(scores, comments)

# Create a list, called `departed_list`, 
# that contains the `movie_title`, `movie_actors`, 
# reviews data frame as `reviews_df`, 
# and the average review score as `avg_review`, and print it out.
departed_list = list( 
  movie_title, movie_actors, 
  reviews_df, avg_review)

departed_list
[[1]]
[1] "The Departed"

[[2]]
[1] "Leonardo DiCaprio" "Matt Damon"        "Jack Nicholson"   
[4] "Mark Wahlberg"     "Vera Farmiga"      "Martin Sheen"     

[[3]]
  scores               comments
1    4.6 I would watch it again
2    5.0               Amazing!
3    4.8             I liked it
4    5.0 One of the best movies
5    4.2       Fascinating plot

[[4]]
[1] 4.72
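
The list above prints with [[1]], [[2]], … because its components were not given names. As a sketch, names can be added afterwards with names(), and new components can be appended with $. The component names title, reviews, average and n_reviews are made up for this illustration, and the actors vector is left out to keep it short.

```r
scores   <- c(4.6, 5, 4.8, 5, 4.2)
comments <- c("I would watch it again", "Amazing!", "I liked it",
              "One of the best movies", "Fascinating plot")

departed_list <- list("The Departed", data.frame(scores, comments), mean(scores))

# Names can be added after the list has been created
names(departed_list) <- c("title", "reviews", "average")

departed_list$average

# Comments that came with a score of 5
departed_list$reviews$comments[departed_list$reviews$scores == 5]

# A new component is added by assigning to a new name
departed_list$n_reviews <- length(scores)
names(departed_list)
```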