1. The Attributes of Data Objects


Data Object Attributes


💡 Major Types and Structures
Major Data Types/Classes
  ■ Integer(int) : c(10L, 55L, 99L), 21:29
  ■ Numeric : c(10.5, 22.3, 22)
  ■ Logical(logi) : c(TRUE,FALSE,FALSE,TRUE), c(T,F,T,T)
  ■ Character(chr): c("Amy","Bob","Cindy")
  ■ Factor: as.factor( c("IBMBA","MIS","BA") )
  ■ Date: as.Date( c("2020-10-01","2020-11-01","2020-12-01") )

Major Data Structures
  ■ Atomic: a single value of a certain type (in R, a vector of length one)
  ■ Vector: a one-dimensional array of values of the same type
  ■ Matrix: a two-dimensional array of values of the same type
  ■ Data Frame: the most common structure, composed of equal-length columns of various types
  ■ List: the most flexible structure, a sequence of objects of various types
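
As a rough illustration of the five structures listed above, the sketch below builds one small object of each kind and checks it with the base R `is.*()` functions. The object names (`a`, `v`, `m`, `d`, `l`) and their values are made up for this example.

```r
# One small object per structure; names and values are illustrative only.
a = 22                                 # atomic: a single value
v = c(10.5, 22.3, 22)                  # vector: values of one type
m = matrix(1:6, nrow = 2)              # matrix: 2 rows x 3 columns of integers
d = data.frame(id = 1:3, score = v)    # data frame: equal-length columns
l = list(a, v, m, d)                   # list: objects of various types

length(a)          # 1, a single value is a vector of length one
dim(m)             # 2 3
is.matrix(m)       # TRUE
is.data.frame(d)   # TRUE
is.list(l)         # TRUE
length(l)          # 4, one element per object stored
```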



2. Atomic Objects in Various Data Types

🌻 class(x) returns the class/type of x as a character string

x = 22
class(x)
[1] "numeric"

atomic/singular objects of various types/classes

c( class(22), class(22L), class(FALSE), 
   class("Amy"), 
   class( as.factor("Amy") ), 
   class( as.Date("2021-09-23") ) )
[1] "numeric"   "integer"   "logical"   "character" "factor"    "Date"     

the same check, with the objects in a list and sapply() applying class() to each element

L = list(22, 22L, FALSE, "Amy", as.factor("Amy"), as.Date("2021-09-23"))
sapply(L, class)
[1] "numeric"   "integer"   "logical"   "character" "factor"    "Date"     

💡 Iteratives and Iteration
  ■ c() constructs a vector
  ■ list() constructs a list
  ■ Iterative objects are convenient for iterative operations
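
A minimal sketch of the difference between the two constructors, using made-up objects `vals` and `mixed`: `c()` keeps all values in one type and coerces them when the types differ, while `list()` lets each element keep its own type. Either kind of object can be passed to `sapply()` to apply a function to every element.

```r
vals  = c(1, 4, 9)             # a vector: all values share one type
mixed = list(22, "Amy", TRUE)  # a list: each element keeps its own type

sapply(vals, sqrt)             # 1 2 3
sapply(mixed, class)           # "numeric" "character" "logical"

# Mixing types inside c() coerces everything to a common type:
coerced = c(22, "Amy", TRUE)
class(coerced)                 # "character"
coerced                        # "22" "Amy" "TRUE"
```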



3. Collective(Iterative) Objects

3.1 Vectors

define vectors

freq = c(3L, 5L, 1L, 1L, 3L)                # integer vector
amount = c(100, 168, 180, 280, 199)         # numeric vector
member = c(FALSE, TRUE, FALSE, TRUE, TRUE)  # logical vector
name = c("Amy", "Bob", "Cindy", "Danny", "Edward")   # character vector

define a factor/categorical vector

# factor/categorical vectors
skin = as.factor( c("black", "black", "white", "yellow", "white") ) # 3 levels
gender = as.factor( c("F", "M", "F", "M", "M") )                    # 2 levels

define a Date vector

last.buy = as.Date(
  c("2021-08-02","2021-03-02","2021-05-20","2021-07-12","2021-06-15") )

examine the data structure with str() and the data type with class()

str(freq)
 int [1:5] 3 5 1 1 3
class(freq)
[1] "integer"
str(amount)
 num [1:5] 100 168 180 280 199
class(amount)
[1] "numeric"


3.2 Lists

put the vectors in a list

L = list(name,freq,amount,member,gender,skin,last.buy)

check the data classes/types

sapply(L, class)
[1] "character" "integer"   "numeric"   "logical"   "factor"    "factor"    "Date"     

check the data structures

x = lapply(L, str)
 chr [1:5] "Amy" "Bob" "Cindy" "Danny" "Edward"
 int [1:5] 3 5 1 1 3
 num [1:5] 100 168 180 280 199
 logi [1:5] FALSE TRUE FALSE TRUE TRUE
 Factor w/ 2 levels "F","M": 1 2 1 2 2
 Factor w/ 3 levels "black","white",..: 1 1 2 3 2
 Date[1:5], format: "2021-08-02" "2021-03-02" "2021-05-20" "2021-07-12" "2021-06-15"
  • str(obj) prints the structure of obj and returns NULL
  • Watch the format of the printouts
  • lapply() collects those NULLs into a list; assigning that list to a temporary object (x) keeps it from being printed after the str() output
  • Try removing the assignment (x =) and see what happens
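
To see what the temporary object actually holds, the following sketch uses a shortened, made-up list `L2` instead of the full `L`. Wrapping the call in `invisible()` is one possible alternative to the assignment; it is a suggestion here, not part of the original notes.

```r
L2 = list( c("Amy", "Bob"), c(3L, 5L) )   # a shortened, illustrative list

x = lapply(L2, str)   # str() prints each structure; the return values go into x
length(x)             # 2, one return value per element of L2
sapply(x, is.null)    # TRUE TRUE, every return value is NULL

# An alternative that avoids the temporary object:
invisible( lapply(L2, str) )
```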

🌻 The list is the most flexible data structure. It will be elaborated later.


3.3 Data Frame

Since the vectors are all of the same length, we can put them in a data frame

D = data.frame(name,freq,amount,member,gender,skin,last.buy)

Data frames are easier to observe …

D
    name freq amount member gender   skin   last.buy
1    Amy    3    100  FALSE      F  black 2021-08-02
2    Bob    5    168   TRUE      M  black 2021-03-02
3  Cindy    1    180  FALSE      F  white 2021-05-20
4  Danny    1    280   TRUE      M yellow 2021-07-12
5 Edward    3    199   TRUE      M  white 2021-06-15
  • every row is a subject/record/observation
  • every column is an attribute/variable/measure
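
A data frame can be inspected with the same tools used for vectors and lists. The sketch below rebuilds a three-column version of the table so that it runs on its own; the name `D3` is made up, and `stringsAsFactors = FALSE` is set explicitly as an assumption so that the character column behaves the same way in R versions before and after 4.0.

```r
name   = c("Amy", "Bob", "Cindy", "Danny", "Edward")
freq   = c(3L, 5L, 1L, 1L, 3L)
amount = c(100, 168, 180, 280, 199)

# stringsAsFactors = FALSE keeps `name` as character in any R version
D3 = data.frame(name, freq, amount, stringsAsFactors = FALSE)

dim(D3)             # 5 3, five records and three variables
names(D3)           # "name" "freq" "amount"
sapply(D3, class)   # the class of each column
str(D3)             # compact overview of the whole data frame
```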

Easier to manipulate …

subset(D, member == TRUE & freq >= 3)
    name freq amount member gender  skin   last.buy
2    Bob    5    168   TRUE      M black 2021-03-02
5 Edward    3    199   TRUE      M white 2021-06-15
  • select the members who bought at least 3 times
mean(D$amount)
[1] 185.4
  • the mean of order amounts for all customers
tapply(D$amount, D$gender, mean)
    F     M 
140.0 215.7 
  • the means of order amounts by gender
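
The same selections and summaries can also be written with bracket indexing and `aggregate()`, both from base R. This is a sketch of equivalent alternatives rather than the method used above; the data frame `D5` is a made-up, five-column copy of the table, built so that the block runs on its own.

```r
D5 = data.frame(
  name   = c("Amy", "Bob", "Cindy", "Danny", "Edward"),
  freq   = c(3L, 5L, 1L, 1L, 3L),
  amount = c(100, 168, 180, 280, 199),
  member = c(FALSE, TRUE, FALSE, TRUE, TRUE),
  gender = as.factor( c("F", "M", "F", "M", "M") ),
  stringsAsFactors = FALSE )   # assumption: keep `name` as character

# rows: a logical condition; columns: left empty to keep them all
loyal = D5[D5$member & D5$freq >= 3, ]
loyal$name                     # "Bob" "Edward"

# keep only selected columns for the same rows
D5[D5$member & D5$freq >= 3, c("name", "amount")]

# group means returned as a data frame instead of a named vector
by.gender = aggregate(amount ~ gender, data = D5, FUN = mean)
by.gender
```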

🌻 Operations on data frames will be covered further in the DataCamp assignment.