1. Vector Indexing

First, let’s define some vectors.

freq = c(3L, 5L, 1L, 1L, 3L)                # integer vector
amount = c(100, 168, 180, 280, 199)         # numeric
member = c(FALSE, TRUE, FALSE, TRUE, TRUE)  # logical
name = c("Amy", "Bob", "Cindy", "Danny", "Edward")  # character
gender = as.factor( c("F", "M", "F", "M", "M") )    # factor
class = as.factor(c('A','B','A','B','A'))           # factor


1.1 Position Index

🌻 A position index is a vector of integer positions

freq[c(1,5)]
[1] 3 3
gender[2:4]
[1] M F M
Levels: F M
gender[5]      
[1] M
Levels: F M

A single value is treated as a vector of length 1

i = c(1:3, 5)   # we can store positions in a vector object 
amount[i]       # and use it as a position index 
[1] 100 168 180 199

An index can be used to select

amount[ c(1,2) ]
[1] 100 168

repeat

amount[ c(1,2,2,3,3,3,4,4,4,4) ] 
 [1] 100 168 168 180 180 180 280 280 280 280
# a position index may be longer than the target vector

or reorder elements in the targeted object

amount[ c(5,4,3,2,1)  ]
[1] 199 280 180 168 100
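
The three uses above (select, repeat, reorder) can be checked side by side. This is a minimal sketch that redefines amount so it runs on its own; the variable names picked, repeated and reversed, and the comparison with rev(), are added here for illustration only.

```r
amount = c(100, 168, 180, 280, 199)

picked   = amount[c(1, 2)]       # select the first two elements
repeated = amount[c(1, 2, 2)]    # a position may appear more than once
reversed = amount[5:1]           # reorder: last element first

length(repeated)                 # 3, one element per index entry
identical(reversed, rev(amount)) # TRUE, rev() gives the same reordering
```

In every case the result has as many elements as the index, not as many as the target vector.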


1.2 Name Index

amount is an unnamed numeric vector

amount
[1] 100 168 180 280 199
names(amount)
NULL

We can assign a name to each element of a vector

names(amount) =  c("Amy", "Bob", "Cindy", "Danny", "Edward")
amount
   Amy    Bob  Cindy  Danny Edward 
   100    168    180    280    199 

Now amount is a named numeric vector, and we can access its elements by name.

🌻 A name index is a character vector

i = c("Bob", "Cindy")
amount[ i ]
  Bob Cindy 
  168   180 
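
A name index behaves like a position index in other respects too. The sketch below rebuilds the named vector so it is self-contained; the reordered selection, the assignment and the new value 170 are illustrative assumptions, not part of the original examples.

```r
amount = c(100, 168, 180, 280, 199)
names(amount) = c("Amy", "Bob", "Cindy", "Danny", "Edward")

picked = amount[c("Edward", "Amy")]  # elements come back in the order requested
amount["Bob"] = 170                  # a name index also works on the left side of an assignment
bob = amount[["Bob"]]                # double brackets return the bare value, without the name
```

Single brackets keep the names on the result, which is why the output above prints a header row of names.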


1.3 Conditional (Logical) Index

🌻 Conditional indexes are logical vectors (whose length equals that of the target vector)

freq[ c(TRUE,TRUE,FALSE,FALSE,TRUE) ]
[1] 3 5 3

🌻 Logical indexes let us select elements by conditions

amount[ freq <= 2 ]
Cindy Danny 
  180   280 
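
It can help to look at the logical vector that the condition produces before it is used as an index. A minimal sketch, using an unnamed copy of amount so it runs on its own; the name cond and the second condition (amount > 200) are assumptions made for illustration.

```r
freq   = c(3L, 5L, 1L, 1L, 3L)
amount = c(100, 168, 180, 280, 199)

cond = freq <= 2     # FALSE FALSE TRUE TRUE FALSE
amount[cond]         # 180 280
which(cond)          # 3 4, the positions where the condition holds
sum(cond)            # 2, because TRUE counts as 1

# conditions can be combined with & (and) and | (or)
both = amount[freq <= 2 & amount > 200]   # 280
```

Counting with sum() and selecting with brackets both rely on the same logical vector.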

🗿 QUIZ:
With the vectors defined below …

noBuy = c(3L, 5L, 1L, 1L, 3L)                       # Integer
height = c(175, 168, 180, 181, 169)                 # numeric
isMale = c(FALSE, TRUE, FALSE, TRUE, TRUE)          # logical
name = c("Amy", "Bob", "Cindy", "Danny", "Edward")  # character
gender = factor( c("F", "M", "F", "M", "M") )       # factor
skin_color = factor( c("black", "black", "white", "yellow", "white") )  # factor


Use indexes and math functions to answer the following questions …

🗿: list the names of the males

name[isMale]
[1] "Bob"    "Danny"  "Edward"

🗿: list the names of those who are taller than 180

#

🗿: list the names of those who are taller than 180 and whose skin color is “yellow”

#

🗿: calculate the average height of males

mean( height[gender == "M"] )
[1] 172.7

🗿: calculate the total number of buys (noBuy) by females

#

🗿: count the number of white females

#


2. Indexing Data Frames

2.1 The Benefits of Data Frames

The data frame is the most common and useful data structure in R for data analysis. Usually

  • each row of a data frame represents a subject (unit of analysis) and
  • each column represents an attribute or measure of interest.

df = data.frame(
  noBuy = c(3L, 5L, 1L, 1L, 3L),
  height = c(175, 168, 180, 181, 169),
  isMale = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  name = c("Amy", "Bob", "Cindy", "Danny", "Edward"),
  gender = factor( c("F", "M", "F", "M", "M") ),
  skin_color = factor( c("black", "black", "white", "yellow", "white")),
  stringsAsFactors=FALSE
  )

A data frame is easier to examine

df
  noBuy height isMale   name gender skin_color
1     3    175  FALSE    Amy      F      black
2     5    168   TRUE    Bob      M      black
3     1    180  FALSE  Cindy      F      white
4     1    181   TRUE  Danny      M     yellow
5     3    169   TRUE Edward      M      white

to select

subset(df, isMale & skin_color == "black")
  noBuy height isMale name gender skin_color
2     5    168   TRUE  Bob      M      black

to count

table(df$gender)

F M 
2 3 

to summarize

mean(df$height)
[1] 174.6

to summarize or count by group

tapply(df$height, df$gender, mean)
    F     M 
177.5 172.7 
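
The same grouping pattern works with other summary functions. This is a minimal sketch on a cut-down copy of the data frame (only the columns it needs); the choice of sum and max and the names buys and tallest are illustrative.

```r
df = data.frame(
  noBuy  = c(3L, 5L, 1L, 1L, 3L),
  height = c(175, 168, 180, 181, 169),
  gender = factor(c("F", "M", "F", "M", "M"))
)

buys    = tapply(df$noBuy, df$gender, sum)    # total purchases per gender: F 4, M 9
tallest = tapply(df$height, df$gender, max)   # tallest person per gender: F 180, M 181
table(df$gender, df$noBuy >= 3)               # counts by gender and by a condition
```

tapply() takes the values, the grouping factor, and the function to apply within each group, in that order.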


2.2 Indexing Data Frames

Data frames are two-dimensional objects, so they take two indexes:

  • the indexes go between square brackets, separated by a comma - df[ row_idx, col_idx ]
  • usually rows are selected by condition (logical index)
  • and columns are selected by name (name index)

We can index a data frame with all three forms of index:

  • Positional/Integer Index: df[c(1,2), c(2,3)]
  • Name/Character Index: df[c(1,2), c("noBuy","height")]
  • Condition/Logical Index: df[df$gender=="M", c("noBuy","height")]

and some extra indexing forms:

  • Empty Index: df[c(1,2), ] selects all columns; df[ , c(2,3)] selects all rows
  • column name ($): df$name selects a specific column
  • subset() & filter()
    • subset(df, height<175 & isMale)
    • subset(df, height<175 & isMale, name)
    • subset(df, height<175 & isMale)$name
    • subset(df, height<175 & isMale, c(name, noBuy))

Below are some examples …

df[c(1,2), c(2,3)]
  height isMale
1    175  FALSE
2    168   TRUE
df[c(1,2), ]
  noBuy height isMale name gender skin_color
1     3    175  FALSE  Amy      F      black
2     5    168   TRUE  Bob      M      black
df[df$height < 175 & df$isMale, ]
  noBuy height isMale   name gender skin_color
2     5    168   TRUE    Bob      M      black
5     3    169   TRUE Edward      M      white
df[df$height < 175 & df$isMale, "name"]
[1] "Bob"    "Edward"
df$name[df$height < 175 & df$isMale]
[1] "Bob"    "Edward"
subset(df, height<175 & isMale)
  noBuy height isMale   name gender skin_color
2     5    168   TRUE    Bob      M      black
5     3    169   TRUE Edward      M      white
subset(df, height<175 & isMale, name)
    name
2    Bob
5 Edward
subset(df, height<175 & isMale)$name
[1] "Bob"    "Edward"
subset(df, height<175 & isMale, c(name, noBuy))
    name noBuy
2    Bob     5
5 Edward     3
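
Note that some of the examples above return a vector and others a data frame. The sketch below makes the difference explicit on a cut-down copy of df; the drop = FALSE argument is standard bracket behaviour for data frames but is not used in the original examples, and the names short_male, v and d are illustrative.

```r
df = data.frame(
  height = c(175, 168, 180, 181, 169),
  isMale = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  name   = c("Amy", "Bob", "Cindy", "Danny", "Edward"),
  stringsAsFactors = FALSE
)
short_male = df$height < 175 & df$isMale

v = df[short_male, "name"]                # a single column is simplified to a character vector
d = df[short_male, "name", drop = FALSE]  # drop = FALSE keeps a one-column data frame

is.data.frame(v)   # FALSE
is.data.frame(d)   # TRUE
```

subset(df, cond, name) always returns a data frame, which is why its output above keeps the row numbers.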


🗿 QUIZ:
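Annotate what each of the code chunks below does, as a comment. Note that the %>% pipe used in some of them is not part of base R; load it with library(magrittr) or library(dplyr).

For example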

df$name[df$isMale] # names of all males  
[1] "Bob"    "Danny"  "Edward"
df[df$height > 180 , "name"] # 
[1] "Danny"
subset(df, height > 170 & !isMale)$name # 
[1] "Amy"   "Cindy"
mean(df$height[df$isMale]) # 
[1] 172.7
df$height[!df$isMale] %>% mean # 
[1] 177.5
sum( subset(df, !isMale)$noBuy ) # 
[1] 4
subset(df, skin_color == "white" & !isMale ) %>% nrow # 
[1] 1
sum(df$skin_color == "white" & !df$isMale ) # 
[1] 1