UNIT2B: Using Indexes

1. Vector Indexing

We need an index to access specific elements within a collection.
In R, indexes are specified in square brackets. For example:
- x[5] takes the 5th element of x
- x[c(2,5)] takes the 2nd and the 5th elements of x
- x[2:5] takes the 2nd, 3rd, 4th and 5th elements of x
A vector is one-dimensional, so it needs only one index.
Matrices and data frames are two-dimensional, so they need two indexes.
There are three types of index, as elaborated below.
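
To make the one-index versus two-index distinction concrete, here is a minimal sketch. The objects `v` and `m` are made up for this illustration only and are not used elsewhere in this unit.

```r
# v and m are hypothetical objects used only in this illustration
v = c(10, 20, 30, 40, 50, 60)      # one-dimensional: takes one index
m = matrix(v, nrow = 2, ncol = 3)  # two-dimensional: filled column by column

v[5]      # the 5th element of v: 50
m[2, 3]   # the element in row 2, column 3 of m: 60
m[ , 3]   # an empty row index takes all rows of column 3: 50 60
```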

First, let’s define some vectors.

freq = c(3L, 5L, 1L, 1L, 3L)                # integer vector
amount = c(100, 168, 180, 280, 199)         # numeric
member = c(FALSE, TRUE, FALSE, TRUE, TRUE)  # logical
name = c("Amy", "Bob", "Cindy", "Danny", "Edward")  # character
gender = as.factor( c("F", "M", "F", "M", "M") )    # factor
class = as.factor(c('A','B','A','B','A'))           # factor

1.1 Position Index

🌻 A position index is an integer vector

freq[c(1,5)]

[1] 3 3

gender[2:4]

[1] M F M
Levels: F M

gender[5]

[1] M
Levels: F M

A single value is treated as a vector of length 1, so a single position such as 5 is itself a valid index vector.

i = c(1:3, 5)   # we can store whole-number positions in a vector object
amount[i]       # and use it as a position index

[1] 100 168 180 199

An index can be used to select …

amount[ c(1,2) ]

[1] 100 168

repeat …

amount[ c(1,2,2,3,3,3,4,4,4,4) ]

 [1] 100 168 168 180 180 180 280 280 280 280

# a position index may be longer than the target vector

or reorder elements in the target object

amount[ c(5,4,3,2,1)  ]

[1] 199 280 180 168 100
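
The reordering above uses a hand-written position index. As a hedged sketch, base R's `order()`, `rev()` and `seq_along()` can generate such indexes for us. The name `amt` is a copy of `amount` introduced here only so the example is self-contained.

```r
amt = c(100, 168, 180, 280, 199)    # same values as amount (illustrative copy)

i_asc = order(amt)                  # positions that put amt in ascending order
i_asc                               # 1 2 3 5 4
amt[i_asc]                          # 100 168 180 199 280

amt[rev(seq_along(amt))]            # reversed, same as amt[c(5,4,3,2,1)]
amt[order(amt, decreasing = TRUE)]  # largest first: 280 199 180 168 100
```

In every case the index is still just a vector of positions; only the way it is produced differs.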

1.2 Name Index

amount is an unnamed numeric vector

amount

[1] 100 168 180 280 199

names(amount)

NULL

We can assign a name to each element of a vector

names(amount) =  c("Amy", "Bob", "Cindy", "Danny", "Edward")

amount

   Amy    Bob  Cindy  Danny Edward 
   100    168    180    280    199

Now amount becomes a named numeric vector, and we can access its elements by name.

🌻 A name index is a character vector

i = c("Bob", "Cindy")
amount[ i ]

  Bob Cindy 
  168   180
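
Like a position index, a name index can also reorder or repeat elements. The sketch below illustrates this on a copy of the named vector (called `amt` here only to keep the example self-contained); "Frank" is a made-up name that is deliberately absent from the vector.

```r
amt = c(Amy = 100, Bob = 168, Cindy = 180, Danny = 280, Edward = 199)

amt[c("Edward", "Amy")]   # reorder by name: 199 100
amt[c("Bob", "Bob")]      # repeat an element: 168 168
amt["Frank"]              # no element has this name, so the result is NA
unname(amt["Cindy"])      # drop the name and keep only the value: 180
```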

1.3 Conditional (Logical) Index

🌻 A conditional index is a logical vector whose length equals that of the target vector

freq[ c(T,T,F,F,T) ]

[1] 3 5 3

🌻 Logical indexes let us select elements by conditions

amount[ freq <= 2 ]

Cindy Danny 
  180   280
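
It can help to see the intermediate step: the comparison itself produces the logical vector, which is then used as the index. The following is a minimal sketch that redefines the vectors from above so it runs on its own; the name `cond` is introduced here for illustration.

```r
freq = c(3L, 5L, 1L, 1L, 3L)
amount = c(Amy = 100, Bob = 168, Cindy = 180, Danny = 280, Edward = 199)
member = c(FALSE, TRUE, FALSE, TRUE, TRUE)

cond = freq <= 2             # the comparison yields a logical vector
cond                         # FALSE FALSE  TRUE  TRUE FALSE
which(cond)                  # the matching positions: 3 4
amount[cond]                 # same result as amount[ freq <= 2 ]

amount[member & freq >= 3]   # both conditions hold: Bob, Edward
amount[!member | freq > 3]   # either condition holds: Amy, Bob, Cindy
sum(cond)                    # TRUE counts as 1, so this counts the matches: 2
```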

🗿 QUIZ：
With the vectors defined below …

noBuy = c(3L, 5L, 1L, 1L, 3L)                       # Integer
height = c(175, 168, 180, 181, 169)                 # numeric
isMale = c(FALSE, TRUE, FALSE, TRUE, TRUE)          # logical
name = c("Amy", "Bob", "Cindy", "Danny", "Edward")  # character
gender = factor( c("F", "M", "F", "M", "M") )       # factor
skin_color = factor( c("black", "black", "white", "yellow", "white") )  # factor

Use indexes and math functions to answer the following questions …

🗿: list the names of males

name[isMale]

[1] "Bob"    "Danny"  "Edward"

🗿: list the names of those who are taller than 180

🗿: list the names of those who are taller than 180 and whose skin color is “yellow”

🗿: calculate the average height of males

mean( height[gender == "M"] )

[1] 172.7

🗿: calculate the total number of buys (noBuy) by females

🗿: count the number of white females

2. Indexing Data Frames

2.1 The Benefit of Data Frame

The data frame is the most common and useful data structure. Usually

each row of a data frame represents a subject (unit of analysis), and
each column represents an attribute or measure of interest.

df = data.frame(
  noBuy = c(3L, 5L, 1L, 1L, 3L),
  height = c(175, 168, 180, 181, 169),
  isMale = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  name = c("Amy", "Bob", "Cindy", "Danny", "Edward"),
  gender = factor( c("F", "M", "F", "M", "M") ),
  skin_color = factor( c("black", "black", "white", "yellow", "white")),
  stringsAsFactors=FALSE
  )

A data frame makes it easier to examine …

df

  noBuy height isMale   name gender skin_color
1     3    175  FALSE    Amy      F      black
2     5    168   TRUE    Bob      M      black
3     1    180  FALSE  Cindy      F      white
4     1    181   TRUE  Danny      M     yellow
5     3    169   TRUE Edward      M      white

to select …

subset(df, isMale & skin_color == "black")

  noBuy height isMale name gender skin_color
2     5    168   TRUE  Bob      M      black

to count

table(df$gender)


F M 
2 3

to summarize

mean(df$height)

[1] 174.6

to summarize/count by group

tapply(df$height, df$gender, mean)

    F     M 
177.5 172.7
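
The same pattern extends to other grouped summaries. The sketch below is an illustration rather than part of the original notebook: it rebuilds a reduced copy of df so that it runs on its own, and the result names (`buys_by_gender`, `counts`, `height_by_gender`) are introduced here only for readability.

```r
df = data.frame(
  noBuy = c(3L, 5L, 1L, 1L, 3L),
  height = c(175, 168, 180, 181, 169),
  gender = factor( c("F", "M", "F", "M", "M") ),
  skin_color = factor( c("black", "black", "white", "yellow", "white"))
  )

buys_by_gender = tapply(df$noBuy, df$gender, sum)  # total buys per gender
buys_by_gender                                     # F = 4, M = 9

counts = table(df$gender, df$skin_color)           # counts per gender/skin_color pair
counts

# group means again, this time returned as a data frame
height_by_gender = aggregate(height ~ gender, data = df, FUN = mean)
height_by_gender
```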

2.2 Indexing Data Frames

Data frames are two-dimensional objects, so they take two indexes:

between the square brackets and separated by a comma - df[ row_idx, col_idx ]
usually rows are selected by condition (logical index)
and columns are selected by name (name index)

We can index a data frame with all three forms of index:

Positional/Integer Index: df[c(1,2), c(2,3)]
Name/Character Index: df[c(1,2), c("noBuy","height")]
Condition/Logical Index: df[df$gender=="M", c("noBuy","height")]

and some extra indexing forms:

Empty Index: df[c(1,2), ] selects all columns; df[ , c(1,2)] selects all rows
column name ($): df$name selects a specific column
subset() & filter():
- subset(df, height<175 & isMale)
- subset(df, height<175 & isMale, name)
- subset(df, height<175 & isMale)$name
- subset(df, height<175 & isMale, c(name, noBuy))
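
As a quick check that the three index forms are interchangeable, the sketch below selects the same rows and columns three ways and compares the results. It rebuilds df so that it runs on its own; the names `by_pos`, `by_name` and `by_cond` are introduced here for illustration only.

```r
df = data.frame(
  noBuy = c(3L, 5L, 1L, 1L, 3L),
  height = c(175, 168, 180, 181, 169),
  isMale = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  name = c("Amy", "Bob", "Cindy", "Danny", "Edward"),
  gender = factor( c("F", "M", "F", "M", "M") ),
  skin_color = factor( c("black", "black", "white", "yellow", "white")),
  stringsAsFactors=FALSE
  )

by_pos  = df[c(2,4,5), c(1,2)]                       # rows and columns by position
by_name = df[c(2,4,5), c("noBuy","height")]          # columns by name
by_cond = df[df$gender == "M", c("noBuy","height")]  # rows by condition

all.equal(by_pos, by_name)    # TRUE
all.equal(by_name, by_cond)   # TRUE
```

The conditional form is usually preferable in practice, because it keeps working when rows are added or reordered.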

Below are some examples …

df[c(1,2), c(2,3)]

  height isMale
1    175  FALSE
2    168   TRUE

df[c(1,2), ]

  noBuy height isMale name gender skin_color
1     3    175  FALSE  Amy      F      black
2     5    168   TRUE  Bob      M      black

df[df$height < 175 & df$isMale, ]

  noBuy height isMale   name gender skin_color
2     5    168   TRUE    Bob      M      black
5     3    169   TRUE Edward      M      white

df[df$height < 175 & df$isMale, "name"]

[1] "Bob"    "Edward"

df$name[df$height < 175 & df$isMale]

[1] "Bob"    "Edward"

subset(df, height<175 & isMale)

  noBuy height isMale   name gender skin_color
2     5    168   TRUE    Bob      M      black
5     3    169   TRUE Edward      M      white

subset(df, height<175 & isMale, name)

    name
2    Bob
5 Edward

subset(df, height<175 & isMale)$name

[1] "Bob"    "Edward"

subset(df, height<175 & isMale, c(name, noBuy))

    name noBuy
2    Bob     5
5 Edward     3
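
Notice that df[cond, "name"] returned a plain character vector, while subset(df, cond, name) returned a one-column data frame. As a hedged sketch, the `drop = FALSE` argument of `[` keeps the data frame form when only one column is selected. The example rebuilds a reduced copy of df so that it runs on its own; `cond`, `v` and `d` are names introduced here for illustration.

```r
df = data.frame(
  height = c(175, 168, 180, 181, 169),
  isMale = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  name = c("Amy", "Bob", "Cindy", "Danny", "Edward"),
  stringsAsFactors=FALSE
  )
cond = df$height < 175 & df$isMale

v = df[cond, "name"]                # one column is simplified to a vector
d = df[cond, "name", drop = FALSE]  # drop = FALSE keeps a one-column data frame

class(v)   # "character"
class(d)   # "data.frame"
d$name     # the same values as v: "Bob" "Edward"
```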

🗿 QUIZ：
Annotate what each of the following code chunks does, as a comment …

For example

df$name[df$isMale] # names of all males

[1] "Bob"    "Danny"  "Edward"

df[df$height > 180 , "name"] #

[1] "Danny"

subset(df, height > 170 & !isMale)$name #

[1] "Amy"   "Cindy"

mean(df$height[df$isMale]) #

[1] 172.7

library(magrittr)  # provides the %>% pipe (dplyr also provides it)
df$height[!df$isMale] %>% mean #

[1] 177.5

sum( subset(df, !isMale)$noBuy ) #

[1] 4

subset(df, skin_color == "white" & !isMale ) %>% nrow #

[1] 1

sum(df$skin_color == "white" & !df$isMale ) #

[1] 1