UNIT2C：Comics Dataset (1)

DC and Marvel

Before going further, let’s have some fun. In this notebook we will use an abridged dataset of comic characters to review, introduce and practice R’s built-in functions. (Data source: Kaggle.com)

1. Read in and Examine the Structure of Data Frame

🌻 read.csv() - read CSV (comma separated values) file

D = read.csv("data/comics1.csv")

🌻 nrow() - the number of rows (records, subjects, units of analysis)

nrow(D)

[1] 7250

🌻 ncol() - the number of columns (variables, attributes, measures of interest)

ncol(D)

[1] 9

🌻 str() - the structure of a data frame, including the class/type of each column

str(D)

'data.frame':   7250 obs. of  9 variables:
 $ publisher  : chr  "dc" "dc" "dc" "dc" ...
 $ name       : chr  "Batman (Bruce Wayne)" "Superman (Clark Kent)" "Green Lantern (Hal Jordan)" "James Gordon (New Earth)" ...
 $ align      : chr  "Good" "Good" "Good" "Good" ...
 $ eye        : chr  "Blue" "Blue" "Brown" "Brown" ...
 $ hair       : chr  "Black" "Black" "Brown" "White" ...
 $ sex        : chr  "Male" "Male" "Male" "Male" ...
 $ alive      : chr  "Living" "Living" "Living" "Living" ...
 $ appearances: int  3093 2496 1565 1316 1237 1231 1121 1095 1075 1028 ...
 $ year       : int  1939 1986 1959 1987 1940 1941 1941 1989 1969 1956 ...
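The same read-and-inspect steps can be tried without the course file. The sketch below is only an illustration: the data frame `toy` and its values are hypothetical, and it is written to a temporary CSV so that `read.csv()` has something to read.

```r
# Hypothetical toy data (not the comics file), written to a temporary CSV
toy = data.frame(
  name        = c("Hero A", "Villain B", "Hero C"),
  align       = c("Good", "Bad", "Good"),
  appearances = c(10L, 250L, 3L),
  stringsAsFactors = FALSE
)
path = tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# stringsAsFactors = FALSE keeps text columns as character in every R version
d = read.csv(path, stringsAsFactors = FALSE)
nrow(d)   # number of rows
ncol(d)   # number of columns
str(d)    # class of each column
```

Text columns come back as `chr` and whole numbers as `int`, which matches the pattern in the `str(D)` output above.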

🗿 QUIZ：
■ What is the name of the data frame?
■ What is the targeted subject of analysis?
■ What are the attributes of interest?
■ How many numeric (or integer) columns do we have?
■ How many character columns do we have?
■ …
■ Which columns are CATEGORICAL ?
■ What do we mean by categorical?
■ Should we convert the categorical columns into factors?

🌷 We should check the data columns before converting them

2. Examine Variables

🌻 summary() - The easiest way to examine every column in a data frame

summary(D)

  publisher             name              align               eye           
 Length:7250        Length:7250        Length:7250        Length:7250       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
     hair               sex               alive            appearances  
 Length:7250        Length:7250        Length:7250        Min.   :   1  
 Class :character   Class :character   Class :character   1st Qu.:   3  
 Mode  :character   Mode  :character   Mode  :character   Median :   8  
                                                          Mean   :  42  
                                                          3rd Qu.:  26  
                                                          Max.   :4043  
      year     
 Min.   :1936  
 1st Qu.:1978  
 Median :1992  
 Mean   :1989  
 3rd Qu.:2004  
 Max.   :2013

🌷 Note that summary() treats columns of different classes differently

For numeric columns, it shows the distribution as summary statistics
For character columns, it shows nothing but their length and class
For factor columns, it shows the count of each level (to be elaborated later)
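The three cases can be seen side by side on small vectors. This is a minimal sketch with made-up values; the names `x_num`, `x_chr` and `x_fac` are hypothetical.

```r
# Hypothetical values, one vector per class
x_num = c(1, 3, 8, 26, 4043)
x_chr = c("Good", "Bad", "Good", "Neutral", "Good")
x_fac = factor(x_chr)

summary(x_num)   # min, quartiles, median, mean, max
summary(x_chr)   # only length, class and mode
summary(x_fac)   # count of each level
```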

💡 Distribution & Statistics
■ Distribution : Description of a variable. It describes …
◇ The way a variable varies
◇ I.e., how the values of a variable are distributed?
■ Statistics : Specific characteristics of a numeric variable
◇ For example, mean, median, min, max
■ The distribution of a variable can be expressed in two forms :
◇ Numerically, in statistics or
◇ Graphically, in plots

2.1 Examine Numeric (Continuous) Variables

🌻 mean(), median(), max(), min() - obtain specific statistics of numeric variables

c( mean=mean(D$year), median=median(D$year), 
   max=max(D$year), min=min(D$year)  )

  mean median    max    min 
  1989   1992   2013   1936

🌻 summary() obtains the major statistics at once

summary(D$year)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1936    1978    1992    1989    2004    2013

🌻 hist() - visualize the distribution of numeric (continuous) variables
🌻 log() helps to suppress the visual impact of extreme values

par(mfrow=c(1,3), mar=c(2,2,3,2), cex=0.6)  # 3 plots in a row, smaller font
hist(D$year, main="year")
hist(D$appearances, main="appearances")

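# %>% is the pipe operator from the magrittr package (also provided by dplyr); load one of them first
log(D$appearances,10) %>% hist(main="log(appear.)")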
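To see why the log scale tames extreme values, compare the spread of a skewed vector before and after the transform. The values below are hypothetical, chosen only to resemble the shape of appearances.

```r
# Hypothetical skewed values: mostly small, one very large
x  = c(1, 3, 8, 26, 42, 4043)
lx = log10(x)        # same as log(x, 10)

range(x)             # raw values span several thousand units
range(lx)            # on the log scale the span is under 4 units
```

On the raw scale a single large value stretches the histogram axis; on the log10 scale each unit is a factor of ten, so the small values are no longer squeezed into one bar.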

2.2 Examine Categorical (Discrete) Variables

🌻 table() lists the distinct values of a categorical (discrete) variable and counts each of them

table(D$align)


    Bad    Good Neutral 
   2915    3178    1157

Note that character strings are listed in alphabetical order

🌻 barplot() visualizes the distribution of discrete variables

par(mfrow=c(1,1), cex=0.6)  # 1 plot per row
table(D$align) %>% barplot(main="align")

🗿 QUIZ
■ What is distribution? Why do we need it?
■ How can we examine the distribution of a numeric variable?
■ How can we examine the distribution of a categorical variable?
■ Variables can be examined statistically or graphically. Which way is better?
■ …
■ Examine the variable appearances statistically
■ Examine the variable year graphically
■ Examine the variable sex statistically
■ Examine the variable eye graphically

3. Character versus Factor

Categorical values can be represented as either factors or characters.
The align column is read in as a character variable.
We can convert it into a factor f.align with 3 distinct values (levels)

🌻 factor() - convert a character vector into a factor vector

D$f.align = factor(D$align, levels=c("Good","Neutral","Bad"))

Usually we overwrite the align column with the factor vector. Here we create a new column f.align for comparison.
When needed, the levels argument sets the order of the levels

par(mfrow=c(1,2), cex=0.7)  
table(D$align) %>% barplot(main="align")
table(D$f.align) %>% barplot(main="f.align")
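The effect of the levels argument is easy to check on a short vector. The values in `align` below are hypothetical; only the ordering behavior matters.

```r
# Hypothetical alignment values
align = c("Bad", "Good", "Neutral", "Good", "Bad", "Good")

table(align)     # character vector: categories appear in alphabetical order

f = factor(align, levels = c("Good", "Neutral", "Bad"))
table(f)         # factor: categories follow the order given in levels
```

The counts are identical in both tables; only the order of the categories changes, which is what makes the two bar plots above look different.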

Let’s convert eye into a factor

D$eye = factor(D$eye)

Now it is a factor variable with 8 levels

str(D$eye)

 Factor w/ 8 levels "Black","Blue",..: 2 2 3 3 2 2 2 2 2 2 ...

🌻 levels(x) lists the levels of x

levels(D$eye)

[1] "Black"  "Blue"   "Brown"  "Green"  "others" "Red"    "White"  "Yellow"

When we do not specify levels in factor(), it defaults to alphabetical order

table(D$eye)


 Black   Blue  Brown  Green others    Red  White Yellow 
   619   2477   2210    664    359    423    291    207

🌻 sort() the vector for better comparison

par(mfrow=c(1,1), cex=0.7)  
table(D$eye) %>% sort(dec=T) %>% barplot

summary(D)

  publisher             name              align                eye      
 Length:7250        Length:7250        Length:7250        Blue   :2477  
 Class :character   Class :character   Class :character   Brown  :2210  
 Mode  :character   Mode  :character   Mode  :character   Green  : 664  
                                                          Black  : 619  
                                                          Red    : 423  
                                                          others : 359  
                                                          (Other): 498  
     hair               sex               alive            appearances  
 Length:7250        Length:7250        Length:7250        Min.   :   1  
 Class :character   Class :character   Class :character   1st Qu.:   3  
 Mode  :character   Mode  :character   Mode  :character   Median :   8  
                                                          Mean   :  42  
                                                          3rd Qu.:  26  
                                                          Max.   :4043  
                                                                        
      year         f.align    
 Min.   :1936   Good   :3178  
 1st Qu.:1978   Neutral:1157  
 Median :1992   Bad    :2915  
 Mean   :1989                 
 3rd Qu.:2004                 
 Max.   :2013

🌷 Observe how the factor variables are summarized: up to 6 levels are listed, and any remaining levels are lumped into (Other).

💡 Factors vs. Characters
■ Character and factor both can be used to represent categorical variables.
■ Although stored and displayed differently, they can be used interchangeably most of the time.
■ Some categorical columns might be read in as character.
■ So, we should check the data type of each column.
■ In practice we keep them as they are and convert them only when it becomes necessary.
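A small check of "stored differently, used interchangeably", on hypothetical eye-color values. The last line shows the integer codes behind the factor, which is one place where the two representations do differ.

```r
# Hypothetical eye colors, once as character and once as factor
chr = c("Blue", "Brown", "Blue", "Green")
fac = factor(chr)

class(chr)                    # "character"
class(fac)                    # "factor"

chr == "Blue"                 # logical test on the character vector
fac == "Blue"                 # the same test works on the factor
sum(chr == "Blue") == sum(fac == "Blue")

as.character(fac)             # back to the original strings
as.integer(fac)               # the level codes stored inside the factor
```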

4. Ordering and Filtering

If we want to list the top 10 characters by their appearances …

4.1 The difference between Sorting and Ordering

We can sort() D$appearances in descending order (dec=T) and take the first 10 elements with head()

sort(D$appearances, dec=T) %>% head(10)

 [1] 4043 3360 3093 3061 2961 2496 2258 2255 2072 2017

We see the 10 largest appearances, but … Who are they?

🌻 sort(x) sorts and returns the sorted contents of x
🌻 order(x) returns the index vector that would sort x (useful for reordering other objects in the same way)

D[order(D$appearances,decreasing=T), c("year","appearances","name")] %>% head(10)

     year appearances                                  name
2424 1962        4043             Spider-Man (Peter Parker)
2425 1941        3360       Captain America (Steven Rogers)
1    1939        3093                  Batman (Bruce Wayne)
2426 1974        3061 Wolverine (James \\"Logan\\" Howlett)
2427 1963        2961   Iron Man (Anthony \\"Tony\\" Stark)
2    1986        2496                 Superman (Clark Kent)
2428 1950        2258                   Thor (Thor Odinson)
2429 1961        2255            Benjamin Grimm (Earth-616)
2430 1961        2072             Reed Richards (Earth-616)
2431 1962        2017            Hulk (Robert Bruce Banner)

So we can see the year, appearances and name of the 10 most frequently appearing characters, in order.
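The difference between the two functions shows up clearly on a four-element example. The vectors `appearances` and `name` below are hypothetical stand-ins for two columns of a data frame.

```r
# Hypothetical counts and labels, aligned by position
appearances = c(12, 4043, 7, 3093)
name        = c("C", "A", "D", "B")

sort(appearances, decreasing = TRUE)       # the values themselves, sorted

i = order(appearances, decreasing = TRUE)  # positions of the values, largest first
i
name[i]                                    # reorder another vector with the same index
appearances[i]                             # same result as sort()
```

sort() loses the link to the other columns, while the index from order() can be applied to any vector or to the rows of a data frame.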

4.2 Ordering with Conditions

🌻 subset() picks out rows by condition and columns by name

d = subset(D, alive=="Deceased" & sex=="Female", select=c("year","appearances","name"))
d[order(d$appearances, decreasing=T), ] %>% head(10)

     year appearances                          name
2450 1963        1107         Jean Grey (Earth-616)
2510 1965         384  Gwendolyne Stacy (Earth-616)
50   1999         301   Kendra Saunders (New Earth)
2569 1975         259     Moira Kinross (Earth-616)
2571 1972         257 Namorita Prentiss (Earth-616)
2597 1964         226        Karen Page (Earth-616)
85   1971         216         Big Barda (New Earth)
103  1971         177     Talia al Ghul (New Earth)
2650 1964         170              Hela (Earth-616)
2651 1976         170 Lilandra Neramani (Earth-616)

So we can see the 10 most frequently appearing characters among those that are deceased and female.
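For comparison, the same filtering can be written with plain bracket indexing. This is a sketch on a hypothetical four-row data frame `toy`, not the comics data.

```r
# Hypothetical toy data frame
toy = data.frame(
  name        = c("A", "B", "C", "D"),
  sex         = c("Female", "Male", "Female", "Female"),
  alive       = c("Deceased", "Deceased", "Living", "Deceased"),
  appearances = c(5, 40, 12, 30),
  stringsAsFactors = FALSE
)

# subset(): column names can be used directly in the condition
a = subset(toy, alive == "Deceased" & sex == "Female",
           select = c("appearances", "name"))

# bracket indexing: rows by logical vector, columns by name
b = toy[toy$alive == "Deceased" & toy$sex == "Female", c("appearances", "name")]

all.equal(a, b)                               # both approaches select the same rows
a[order(a$appearances, decreasing = TRUE), ]  # then order the result
```

subset() saves typing `toy$` in the condition; the bracket form is more explicit about where each column comes from.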

🗿 QUIZ
■ List the top-10 appearing female characters with blond hair and green eyes
■ List the year, publisher, align, and name of the 5 least-appearing male characters with red hair and red eyes

5. Counts and Fractions

📋 Annotate each code chunk and make this notebook your own

5.1 Obtain Counts and Fractions via Logical Test

🌻 sum() of a logical vector gives the number of TRUEs

sum(D$align == "Good")

[1] 3178

🌻 mean() of a logical vector gives the fraction of TRUEs

mean(D$align == "Good")

[1] 0.4383
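Both results follow from the fact that R treats TRUE as 1 and FALSE as 0 in arithmetic. A minimal sketch on hypothetical values:

```r
# Hypothetical alignment values
align   = c("Good", "Bad", "Good", "Neutral")
is_good = align == "Good"         # logical vector: TRUE FALSE TRUE FALSE

sum(is_good)                      # number of TRUEs
mean(is_good)                     # fraction of TRUEs
sum(is_good) / length(is_good)    # the same fraction, computed by hand
```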

5.2 Tables and Proportionate Tables

🌻 table() lists and counts each distinct value of a categorical vector (factor or chr)

table(D$align)


    Bad    Good Neutral 
   2915    3178    1157

🌻 prop.table() converts counts into fractions

table(D$align) %>% prop.table


    Bad    Good Neutral 
 0.4021  0.4383  0.1596

What happens if we put two variables in table()?

table(D$f.align, D$sex)

         
          Female Male
  Good      1264 1914
  Neutral    432  725
  Bad        676 2239

❓ Check the online help of prop.table. What does the margin argument do?

# The expression on the left of `%>%` becomes the first argument of `prop.table`
# When `margin` is not specified, it defaults to NULL
table(D$f.align, D$sex) %>% prop.table     # margin = NULL, default

         
           Female    Male
  Good    0.17434 0.26400
  Neutral 0.05959 0.10000
  Bad     0.09324 0.30883

table(D$f.align, D$sex) %>% prop.table(1)  # margin = 1

         
          Female   Male
  Good    0.3977 0.6023
  Neutral 0.3734 0.6266
  Bad     0.2319 0.7681

table(D$f.align, D$sex) %>% prop.table(2)  # margin = 2

         
          Female   Male
  Good    0.5329 0.3924
  Neutral 0.1821 0.1486
  Bad     0.2850 0.4590
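One way to see what margin controls is to check which sums come out as 1. The sketch below rebuilds the count table printed above by hand as a matrix named `tab` (a hypothetical name), so it runs without the data file.

```r
# Counts copied from the table(D$f.align, D$sex) output above
tab = matrix(c(1264, 432, 676,      # Female column
               1914, 725, 2239),    # Male column
             nrow = 3,
             dimnames = list(c("Good", "Neutral", "Bad"), c("Female", "Male")))

sum(prop.table(tab))          # margin = NULL: all cells together sum to 1
rowSums(prop.table(tab, 1))   # margin = 1: each row sums to 1
colSums(prop.table(tab, 2))   # margin = 2: each column sums to 1
```

So margin = 1 answers "within this row, how are the columns split?", and margin = 2 answers "within this column, how are the rows split?".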

Let’s practice:

How many bad males do we have ❓
What fraction of bad characters are female ❓
What fraction of female characters are bad ❓

🏆 Group Competition Round 1

6. Category and Group Operations

6.1 Statistics by Categories

Actually there is a better way to answer the last two questions above.

🌻 tapply(value, group, fun) applies fun to value by each distinct group

tapply(D$align == "Neutral", D$sex, sum)

Female   Male 
   432    725

Count the number of neutral characters by sex

tapply(D$align == "Neutral", D$sex, mean)

Female   Male 
0.1821 0.1486

Calculate the fraction of neutral characters by sex
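What tapply() does can be verified by computing one group by hand. The vectors below are hypothetical and aligned by position, like two columns of a data frame.

```r
# Hypothetical data: one logical value and one group label per character
is_neutral = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
sex        = c("Female", "Female", "Male", "Male", "Male", "Male")

tapply(is_neutral, sex, sum)        # count of TRUEs within each sex
tapply(is_neutral, sex, mean)       # fraction of TRUEs within each sex

# The same fraction for one group, computed by filtering first
mean(is_neutral[sex == "Female"])
```

tapply() simply repeats the filter-then-summarize step once for every distinct value of the grouping vector.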

Let’s practice:

What is the fraction of females for each hair color ❓
What is the number of females for each eye color ❓

tapply(D$sex=="Female", D$hair, mean)

   Bald   Black   Blond   Brown      No  others     Red   White 
0.05921 0.34754 0.46642 0.26921 0.09186 0.35211 0.54839 0.22350

tapply(D$sex=="Female", D$eye, sum)

 Black   Blue  Brown  Green others    Red  White Yellow 
   153    881    669    340    118     81     79     51

🏆 Group Competition Round 2

6.2 Statistics in Sequence

Some discrete variables are sequential by nature. For example …

D$decade = (D$year - 1900) %/% 10
table(D$decade)


   3    4    5    6    7    8    9   10   11 
  28  271  106  678  823 1304 1581 1803  656
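The decade code comes from integer division with `%/%`, which drops the remainder. A short sketch on hypothetical years shows how the codes map back to calendar decades:

```r
# Hypothetical years
year   = c(1936, 1962, 1999, 2000, 2013)
decade = (year - 1900) %/% 10    # integer division: 36 %/% 10 is 3

decade                           # 3 6 9 10 11
1900 + 10 * decade               # first year of each decade: 1930 1960 1990 2000 2010
```

So code 3 stands for the 1930s, 10 for the 2000s and 11 for the 2010s.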

Calculate the fraction of neutral characters by decade

v = tapply(D$align == "Neutral", D$decade, mean) %>% round(3); v

    3     4     5     6     7     8     9    10    11 
0.071 0.077 0.094 0.137 0.180 0.137 0.134 0.182 0.250

Plotting these figures makes the trend easier to see.

🌻 plot(x, y, type) plots the data in x and y in different ways

type="p" scatter plot (the default)
type="l" line plot
type="b" line plot with markers
See online help (F1) for more line types and arguments

# Plot the sequence of figures to see the trend 
par(mfcol=c(2,2), mar=c(3,3,2,1), cex=0.7)
v %>% barplot()
plot(names(v), v)             # scatter plot
plot(names(v), v, type='l')   # line plot 
plot(names(v), v, type='b', ylim=c(0,0.25)) # line plot with markers, y-axis starting at zero

Which of the above charts shows the trend best?

6.3 Sequences by Categories

tapply(value, group, fun) can take more than one grouping variable when they are passed as a list

v = tapply(D$align=="Bad", list(D$sex, D$decade), mean) %>% round(3); v

           3     4     5     6     7     8     9    10    11
Female 0.125 0.156 0.312 0.198 0.255 0.266 0.352 0.288 0.250
Male   0.050 0.274 0.289 0.559 0.480 0.457 0.513 0.414 0.439
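With two grouping variables the result is a matrix, with one row per value of the first group and one column per value of the second. A minimal sketch on hypothetical data (the vectors `bad`, `sex` and `decade` are made up):

```r
# Hypothetical data, aligned by position
bad    = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
sex    = c("Female", "Female", "Male", "Male", "Female", "Male")
decade = c(6, 6, 6, 6, 7, 7)

m = tapply(bad, list(sex, decade), mean)
m
dim(m)              # 2 rows (sex) by 2 columns (decade)
m["Female", "6"]    # one cell, addressed by row and column name
m["Female", ]       # one row: the sequence over decades for one sex
```

Picking out a row like this is what `v[1,]` and `v[2,]` do in the plotting code that follows.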

🌻 lines(x, y) adds a line to an existing plot. For example, we can compare how the fraction of bad characters varies over time by sex, as below.

par(mfrow=c(1,1), mar=c(3,3,2,1), cex=0.7)
plot(colnames(v),  v[1,], type="l", ylim=c(0,0.6), col="red", lwd=2,
     main = "Fraction of Bad Characters by Sex")          # row 1 = Female (red), with title
lines(colnames(v), v[2,], col="blue", lwd=2)              # row 2 = Male (blue)
abline(h=seq(0,0.6,0.1), v=seq(3,11,1), col='lightgray')  # grid lines for decades 3 to 11

Identify the top-3 hair colors …

h3 = table(D$hair) %>% sort %>% tail(3) %>% names; h3

[1] "Blond" "Brown" "Black"

and see how the fractions of the top-3 hair colors vary over time.

par(mfrow=c(1,1), mar=c(3,3,2,1), cex=0.7)
v = tapply(D$hair=="Black", D$decade, mean)
plot(names(v),v,type='l',lwd=2,col="black",ylim=c(0,0.45),
     main="Fractions of the Top3 Hair Colors by Decades")
abline(h=seq(0,0.5,0.1), v=seq(3,11,1), col='lightgray')  # grid lines for decades 3 to 11
v = tapply(D$hair=="Brown", D$decade, mean)
lines(names(v),v,type='l',lwd=2,col="brown")
v = tapply(D$hair=="Blond", D$decade, mean)
lines(names(v),v,type='l',lwd=2,col="gold")
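The three tapply() calls above differ only in the hair color, so they could also be collected in one step. The sketch below is one possible way, using sapply() on hypothetical data; the names `top` and `frac` are made up for this illustration.

```r
# Hypothetical data, aligned by position
hair   = c("Black", "Brown", "Black", "Blond", "Brown", "Black", "Red", "Blond")
decade = c(6, 6, 6, 6, 7, 7, 7, 7)
top    = c("Black", "Brown", "Blond")

# One column per color, one row per decade
frac = sapply(top, function(h) tapply(hair == h, decade, mean))
frac
```

Each column of `frac` is one of the sequences plotted above, so a column such as `frac[, "Brown"]` could be passed to lines(), or the whole matrix to matplot().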

🗿 QUIZ
Now it’s your turn. Please make charts to …
■ examine how the fractions of the top-3 eye colors vary over time
■ compare how the fractions of living characters vary over time by publisher and sex

   dc.Female dc.Male marvel.Female marvel.Male
3     0.7143  0.6923        0.0000      0.8571
4     0.7273  0.5810        0.8261      0.7438
5     0.7273  0.7547        0.8000      0.7297
6     0.8485  0.7465        0.7564      0.7082
7     0.5909  0.6964        0.6923      0.7182
8     0.7782  0.6518        0.8034      0.7293
9     0.8219  0.6911        0.8037      0.7639
10    0.8382  0.7395        0.6951      0.6447
11    0.9062  0.8525        0.8100      0.7160