DC and Marvel
Before going further, let’s have some fun. In this notebook we will use an abridged dataset of comic characters to review, introduce and practice R’s built-in functions. (Data source: Kaggle.com)
🌻 read.csv() - read a CSV (comma-separated values) file
D = read.csv("data/comics1.csv")
🌻 nrow() - the number of rows (records, subjects, units of analysis)
nrow(D)
[1] 7250
🌻 ncol() - the number of columns (variables, attributes, measures of interest)
ncol(D)
[1] 9
🌻 str() - the structure of a data frame: the class/type of each column
str(D)
'data.frame': 7250 obs. of 9 variables:
$ publisher : chr "dc" "dc" "dc" "dc" ...
$ name : chr "Batman (Bruce Wayne)" "Superman (Clark Kent)" "Green Lantern (Hal Jordan)" "James Gordon (New Earth)" ...
$ align : chr "Good" "Good" "Good" "Good" ...
$ eye : chr "Blue" "Blue" "Brown" "Brown" ...
$ hair : chr "Black" "Black" "Brown" "White" ...
$ sex : chr "Male" "Male" "Male" "Male" ...
$ alive : chr "Living" "Living" "Living" "Living" ...
$ appearances: int 3093 2496 1565 1316 1237 1231 1121 1095 1075 1028 ...
$ year : int 1939 1986 1959 1987 1940 1941 1941 1989 1969 1956 ...
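The same inspection steps can be tried on a small hand-made data frame. The sketch below is only illustrative: the data frame `toy` and all of its values are made up and are not part of the comics dataset.

```r
# Hypothetical toy data frame (made-up values, not from comics1.csv)
toy = data.frame(
  name = c("A", "B", "C"),
  align = c("Good", "Bad", "Good"),
  appearances = c(10L, 3L, 25L),
  stringsAsFactors = FALSE   # keep text columns as character in any R version
)

nrow(toy)            # number of rows (records)
ncol(toy)            # number of columns (variables)
sapply(toy, class)   # class of each column, as a named character vector
```

sapply(toy, class) gives a more compact view than str() when only the column classes are of interest.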
🗿 QUIZ:
■ What is the name of the data frame?
■ What is the targeted subject of analysis?
■ What are the attributes of interest?
■ How many numeric (or integer) columns do we have?
■ How many character columns do we have?
■ …
■ Which columns are CATEGORICAL?
■ What do we mean by categorical?
■ Should we convert the categorical columns into factors?
🌷 We should check the data columns before converting them
🌻 summary() - the easiest way to examine every column in a data frame
summary(D)
  publisher             name              align               eye
 Length:7250        Length:7250        Length:7250        Length:7250
 Class :character   Class :character   Class :character   Class :character
 Mode  :character   Mode  :character   Mode  :character   Mode  :character
     hair               sex               alive            appearances
 Length:7250        Length:7250        Length:7250        Min.   :   1
 Class :character   Class :character   Class :character   1st Qu.:   3
 Mode  :character   Mode  :character   Mode  :character   Median :   8
                                                          Mean   :  42
                                                          3rd Qu.:  26
                                                          Max.   :4043
      year
 Min.   :1936
 1st Qu.:1978
 Median :1992
 Mean   :1989
 3rd Qu.:2004
 Max.   :2013
🌷 Note that columns of different data classes are summary()’ed differently
💡 Distribution & Statistics
■ Distribution : Description of a variable. It describes …
◇ The way a variable varies
◇ I.e., how are the values of a variable distributed?
■ Statistics : Specific characteristics of a numeric variable
◇ For example, mean, median, min, max
■ The distribution of a variable can be expressed in two forms :
◇ Numerically, in statistics or
◇ Graphically, in plots
🌻 mean(), median(), max(), min() - obtain specific statistics of numeric vectors
c( mean=mean(D$year), median=median(D$year),
max=max(D$year), min=min(D$year) )
  mean median    max    min
  1989   1992   2013   1936
🌻 summary() obtains the major statistics at once
summary(D$year)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1936    1978    1992    1989    2004    2013
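summary() of a numeric vector bundles the extremes, the quartiles and the mean. A minimal sketch on a made-up vector `x` (hypothetical values, chosen to contain one extreme value) shows how the same figures can be obtained one by one:

```r
# Hypothetical vector with one extreme value
x = c(1, 3, 8, 26, 4043)

quantile(x, c(0.25, 0.5, 0.75))   # 1st quartile, median, 3rd quartile: 3, 8, 26
mean(x)                           # pulled upward by the extreme value
median(x)                         # barely affected by it: 8
summary(x)                        # all of the above at once
```

A mean far above the median, as here, is a quick numerical hint that the distribution is skewed to the right.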
🌻 hist() - visualize the distribution of numeric (continuous) variables
🌻 log() helps to suppress the visual impact of extreme values
library(magrittr) # provides %>% (assumption: the notebook loads magrittr or dplyr elsewhere)
par(mfrow=c(1,3), mar=c(2,2,3,2), cex=0.6) # 3 plots in a row, smaller font
hist(D$year, main="year")
hist(D$appearances, main="appearances")
log(D$appearances,10) %>% hist(main="log(appear.)")
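Why the logarithm helps can be seen on a few made-up numbers (hypothetical values spanning several orders of magnitude): after log(x, 10) they become evenly spaced, so a single extreme value no longer dominates the axis.

```r
x = c(1, 10, 100, 1000)   # hypothetical, heavily skewed values

log(x, 10)                # base-10 logarithm: 0 1 2 3
range(x)                  # original spread: 1 to 1000
range(log(x, 10))         # compressed spread: 0 to 3
```

Note that log() is only defined for positive values; appearances has a minimum of 1, so it can be transformed directly.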
🌻 table() lists and counts the distinct values of a categorical (discrete) variable
table(D$align)
Bad Good Neutral
2915 3178 1157
Note that character strings are listed in alphabetical order
🌻 barplot() visualizes the distribution of discrete variables
par(mfrow=c(1,1), cex=0.6) # 1 plot per row
table(D$align) %>% barplot(main="align")
🗿 QUIZ
■ What is distribution? Why do we need it?
■ How can we examine the distribution of a numeric variable?
■ How can we examine the distribution of a categorical variable?
■ Variables can be examined statistically or graphically, which way is better?
■ …
■ Examine the variable appearances statistically
■ Examine the variable year graphically
■ Examine the variable sex statistically
■ Examine the variable eye graphically
Categorical values can be represented as either factor or character.
The align column is read in as a character variable.
We can convert it into a categorical variable f.align.
🌻 factor() - convert a character vector into a factor vector
D$f.align = factor(D$align, levels=c("Good","Neutral","Bad"))
■ We could replace the align column with the factor vector. Here we create a new column f.align for comparison.
■ The levels argument specifies the order of levels
par(mfrow=c(1,2), cex=0.7)
table(D$align) %>% barplot(main="align")
table(D$f.align) %>% barplot(main="f.align")
Let’s convert eye into a factor
D$eye = factor(D$eye)
Now it becomes a factor variable of 8 levels
str(D$eye)
 Factor w/ 8 levels "Black","Blue",..: 2 2 3 3 2 2 2 2 2 2 ...
🌻 levels(x) lists the levels of x
levels(D$eye)
[1] "Black" "Blue" "Brown" "Green" "others" "Red" "White" "Yellow"
When we do not specify levels in factor(), it defaults to alphabetical order
table(D$eye)
Black Blue Brown Green others Red White Yellow
619 2477 2210 664 359 423 291 207
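The effect of the levels argument can be checked on a short made-up vector (hypothetical values): table() follows the level order of a factor, but the alphabetical order of a character vector.

```r
a = c("Good", "Bad", "Neutral", "Good")   # hypothetical character vector
f = factor(a, levels = c("Good", "Neutral", "Bad"))

table(a)    # columns in alphabetical order: Bad, Good, Neutral
table(f)    # columns in the specified order: Good, Neutral, Bad
levels(f)   # "Good" "Neutral" "Bad"
```

The counts are identical in both tables; only the order of presentation changes, which is what the side-by-side bar plots of align and f.align show.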
🌻 sort() the vector for better comparison
par(mfrow=c(1,1), cex=0.7)
table(D$eye) %>% sort(dec=T) %>% barplot
summary(D)
  publisher             name              align                eye
 Length:7250        Length:7250        Length:7250        Blue   :2477
 Class :character   Class :character   Class :character   Brown  :2210
 Mode  :character   Mode  :character   Mode  :character   Green  : 664
                                                          Black  : 619
                                                          Red    : 423
                                                          others : 359
                                                          (Other): 498
     hair               sex               alive            appearances
 Length:7250        Length:7250        Length:7250        Min.   :   1
 Class :character   Class :character   Class :character   1st Qu.:   3
 Mode  :character   Mode  :character   Mode  :character   Median :   8
                                                          Mean   :  42
                                                          3rd Qu.:  26
                                                          Max.   :4043
      year         f.align
 Min.   :1936   Good   :3178
 1st Qu.:1978   Neutral:1157
 Median :1992   Bad    :2915
 Mean   :1989
 3rd Qu.:2004
 Max.   :2013
🌷 Observe how the factor variables are summarized: the 6 most frequent levels are listed and the rest are pooled into (Other).
💡 Factors vs. Characters
■ Character and factor can both be used to represent categorical variables.
■ Although stored and displayed differently, they can be used interchangeably most of the time.
■ Some categorical columns might be read in as character.
■ So, we should check the data type of each column.
■ In practice we just keep them that way and convert them when it becomes necessary
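One way to see how the two are stored differently, sketched on a made-up vector (hypothetical values): a factor keeps integer codes plus a table of levels, while a character vector keeps the strings themselves.

```r
ch = c("Blue", "Brown", "Blue")   # hypothetical character vector
fa = factor(ch)

as.integer(fa)      # integer codes: 1 2 1
levels(fa)          # the lookup table: "Blue" "Brown"
as.character(fa)    # back to the original strings
ch == "Blue"        # comparisons work the same way ...
fa == "Blue"        # ... on both representations
```

This is why logical tests such as D$align == "Good" behave the same whether the column is a character or a factor.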
If we want to list the top 10 characters by their appearances …
We can sort() D$appearances in descending order (dec=T) and pick out the first 10 elements with head()
sort(D$appearances, dec=T) %>% head(10)
 [1] 4043 3360 3093 3061 2961 2496 2258 2255 2072 2017
We see the 10 largest appearances, but … Who are they?
🌻 sort(x) sorts and returns the sorted contents of x
🌻 order(x) produces an index vector ordered by the values of x (for reordering some other objects)
D[order(D$appearances,decreasing=T), c("year","appearances","name")] %>% head(10)
     year appearances name
2424 1962 4043 Spider-Man (Peter Parker)
2425 1941 3360 Captain America (Steven Rogers)
1 1939 3093 Batman (Bruce Wayne)
2426 1974 3061 Wolverine (James \\"Logan\\" Howlett)
2427 1963 2961 Iron Man (Anthony \\"Tony\\" Stark)
2 1986 2496 Superman (Clark Kent)
2428 1950 2258 Thor (Thor Odinson)
2429 1961 2255 Benjamin Grimm (Earth-616)
2430 1961 2072 Reed Richards (Earth-616)
2431 1962 2017 Hulk (Robert Bruce Banner)
So we can see the year, appearances and name of the top-10 characters by appearances, in order.
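The difference between sort() and order() is easiest to see on a tiny example. The vectors `x` and `nm` below are made up for illustration (hypothetical values):

```r
x = c(30, 10, 20)                 # hypothetical values
nm = c("a", "b", "c")             # hypothetical labels, aligned with x

sort(x, decreasing = TRUE)        # the sorted values: 30 20 10
i = order(x, decreasing = TRUE)   # positions of those values in x: 1 3 2
x[i]                              # same result as sort()
nm[i]                             # labels reordered by x: "a" "c" "b"
```

Because order() returns positions rather than values, the same index can reorder any object aligned with x, including the rows of a data frame.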
🌻 subset(x) picks out rows by conditions and columns by names
d = subset(D, alive=="Deceased" & sex=="Female", select=c("year","appearances","name"))
d[order(d$appearances, decreasing=T), ] %>% head(10)
     year appearances name
2450 1963 1107 Jean Grey (Earth-616)
2510 1965 384 Gwendolyne Stacy (Earth-616)
50 1999 301 Kendra Saunders (New Earth)
2569 1975 259 Moira Kinross (Earth-616)
2571 1972 257 Namorita Prentiss (Earth-616)
2597 1964 226 Karen Page (Earth-616)
85 1971 216 Big Barda (New Earth)
103 1971 177 Talia al Ghul (New Earth)
2650 1964 170 Hela (Earth-616)
2651 1976 170 Lilandra Neramani (Earth-616)
So we can see the top-10 deceased female characters by appearances.
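subset() is a convenience wrapper around bracket indexing. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical) compares the two spellings:

```r
# Hypothetical toy data frame (made-up values, not from comics1.csv)
toy = data.frame(
  name = c("A", "B", "C", "D"),
  sex = c("Female", "Male", "Female", "Female"),
  alive = c("Deceased", "Deceased", "Living", "Deceased"),
  appearances = c(5L, 9L, 7L, 12L),
  stringsAsFactors = FALSE
)

# With subset(): column names can be used directly in the condition
d1 = subset(toy, alive == "Deceased" & sex == "Female",
            select = c("appearances", "name"))

# With brackets: every column must be prefixed with the data frame
d2 = toy[toy$alive == "Deceased" & toy$sex == "Female",
         c("appearances", "name")]

all.equal(d1, d2)   # TRUE when both approaches pick the same rows and columns
```

Both forms keep the original row names, which is why the row numbers in the outputs above are not consecutive.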
🗿 QUIZ
■ List the top-10 appearing female characters with blond hair and green eyes
■ List year, publisher, align, and name of the 5 least appearing male characters with red hair and red eyes
📋 Annotate each code chunk and make it your own notebook
🌻 sum() of a logical vector produces the number of TRUE’s
sum(D$align == "Good")
[1] 3178
🌻 mean() of a logical vector produces the fraction of TRUE’s
mean(D$align == "Good")
[1] 0.4383
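Both tricks rely on R treating TRUE as 1 and FALSE as 0 in arithmetic. A small sketch with a made-up vector `v` (hypothetical values):

```r
v = c("Good", "Bad", "Good", "Neutral")   # hypothetical values
is.good = v == "Good"                     # TRUE FALSE TRUE FALSE

as.integer(is.good)   # TRUE counts as 1, FALSE as 0
sum(is.good)          # 2 of the 4 values are "Good"
mean(is.good)         # 2 / 4 = 0.5
```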
🌻 table() lists and counts each distinct value in a categorical vector (factor or chr)
table(D$align)
Bad Good Neutral
2915 3178 1157
🌻 prop.table() converts counts into fractions
table(D$align) %>% prop.table
Bad Good Neutral
0.4021 0.4383 0.1596
What happens if we put two variables in table()?
table(D$f.align, D$sex)
Female Male
Good 1264 1914
Neutral 432 725
Bad 676 2239
❓ Check the online help of prop.table. How does the argument margin work?
# The expression on the left of `%>%` is the first argument of `prop.table`
# when `margin` is not specified, it defaults to NULL
table(D$f.align, D$sex) %>% prop.table # margin = NULL, default
Female Male
Good 0.17434 0.26400
Neutral 0.05959 0.10000
Bad 0.09324 0.30883
table(D$f.align, D$sex) %>% prop.table(1) # margin = 1
Female Male
Good 0.3977 0.6023
Neutral 0.3734 0.6266
Bad 0.2319 0.7681
table(D$f.align, D$sex) %>% prop.table(2) # margin = 2
Female Male
Good 0.5329 0.3924
Neutral 0.1821 0.1486
Bad 0.2850 0.4590
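The three settings of margin differ only in what each cell is divided by. The sketch below rebuilds the align-by-sex counts shown above as a plain matrix `m` (the name is ours) and checks which totals come out as 1:

```r
# Counts copied from the table(D$f.align, D$sex) output above
m = matrix(c(1264, 432, 676, 1914, 725, 2239), nrow = 3,
           dimnames = list(c("Good", "Neutral", "Bad"), c("Female", "Male")))

sum(prop.table(m))          # margin = NULL: all cells together sum to 1
rowSums(prop.table(m, 1))   # margin = 1: each row sums to 1
colSums(prop.table(m, 2))   # margin = 2: each column sums to 1

# By hand, the Good/Female cell for margin = 1: cell count / row total
m["Good", "Female"] / sum(m["Good", ])
```

So margin = 1 answers "what share of Good characters are female?", while margin = 2 answers "what share of female characters are Good?".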
Let’s do some practice.
🏆 Group Competition Round 1
Actually there is a better way to answer the last two questions above.
🌻 tapply(value, group, fun) applies fun to value for each distinct group
tapply(D$align == "Neutral", D$sex, sum)
Female   Male
   432    725
Count the number of neutral characters by sex
tapply(D$align == "Neutral", D$sex, mean)
Female   Male
0.1821 0.1486
Calculate the fraction of neutral characters by sex
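What tapply() does can be spelled out as "split the values by group, then apply the function to each piece". A minimal sketch with made-up vectors (`value` and `group` are hypothetical):

```r
value = c(TRUE, FALSE, TRUE, TRUE, FALSE)   # hypothetical logical values
group = c("F", "M", "M", "F", "M")          # hypothetical grouping variable

tapply(value, group, sum)          # TRUE's per group: F = 2, M = 1
tapply(value, group, mean)         # fraction per group: F = 1, M = 1/3
sapply(split(value, group), sum)   # the same split-then-apply, spelled out
```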
Let’s do some practice.
tapply(D$sex=="Female", D$hair, mean)
   Bald   Black   Blond   Brown      No  others     Red   White
0.05921 0.34754 0.46642 0.26921 0.09186 0.35211 0.54839 0.22350
tapply(D$sex=="Female", D$eye, sum)
 Black   Blue  Brown  Green others    Red  White Yellow
   153    881    669    340    118     81     79     51
🏆 Group Competition Round 2
Some discrete variables are sequential in nature. For example …
D$decade = (D$year - 1900) %/% 10
table(D$decade)
3 4 5 6 7 8 9 10 11
28 271 106 678 823 1304 1581 1803 656
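The decade codes come from integer division (%/%), which drops the remainder. A short sketch on a few hypothetical years within the dataset's 1936 to 2013 range shows the mapping, and how a code can be turned back into a calendar decade:

```r
year = c(1936, 1939, 1986, 2000, 2013)   # hypothetical years within the data's range
decade = (year - 1900) %/% 10            # integer division drops the remainder

decade                # 3 3 8 10 11
1900 + 10 * decade    # first year of each decade: 1930 1930 1980 2000 2010
```

So code 3 stands for the 1930s and code 11 for the 2010s, which is only partly covered because the data ends in 2013.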
Calculate the fraction of neutral characters by decade
v = tapply(D$align == "Neutral", D$decade, mean) %>% round(3); v
    3     4     5     6     7     8     9    10    11
0.071 0.077 0.094 0.137 0.180 0.137 0.134 0.182 0.250
Plotting these figures makes the trend easier to see.
🌻 plot(x, y, type) plots the data in x and y in different ways
■ type="p" scatter plot (the default)
■ type="l" line plot
■ type="b" line plot with markers
■ see the online help (F1) for more line types and arguments
# Plot the sequence of figures to see the trend
par(mfcol=c(2,2), mar=c(3,3,2,1), cex=0.7)
v %>% barplot()
plot(names(v), v) # scatter plot
plot(names(v), v, type='l') # line plot
plot(names(v), v, type='b', ylim=c(0,0.25)) # zero-oriented line plot with markers
Which of the above charts shows the trend best?
tapply(value, group, function) can take more than one grouping variable
v = tapply(D$align=="Bad", list(D$sex, D$decade), mean) %>% round(3); v
           3     4     5     6     7     8     9    10    11
Female 0.125 0.156 0.312 0.198 0.255 0.266 0.352 0.288 0.250
Male   0.050 0.274 0.289 0.559 0.480 0.457 0.513 0.414 0.439
🌻 lines(x, y) adds a line to an existing plot. For example, we can compare how the fraction of bad characters varies over time by sex as below.
par(mfrow=c(1,1), mar=c(3,3,2,1), cex=0.7)
plot(colnames(v), v[1,], type="l", ylim=c(0,0.6), col="red", lwd=2,
main = "Fraction of Bad Characters by Sex") # add title
lines(colnames(v), v[2,], col="blue", lwd=2) # add 2nd line
abline(h=seq(0,0.6,0.1), v=seq(3,10,1), col='lightgray') # add grid lines
Identify the top-3 hair colors …
h3 = table(D$hair) %>% sort %>% tail(3) %>% names; h3
[1] "Blond" "Brown" "Black"
and see how the fractions of the top-3 hair colors vary in time.
par(mfrow=c(1,1), mar=c(3,3,2,1), cex=0.7)
v = tapply(D$hair=="Black", D$decade, mean)
plot(names(v),v,type='l',lwd=2,col="black",ylim=c(0,0.45),
main="Fractions of the Top3 Hair Colors by Decades")
abline(h=seq(0,0.5,0.1), v=seq(3,10,1), col='lightgray')
v = tapply(D$hair=="Brown", D$decade, mean)
lines(names(v),v,type='l',lwd=2,col="brown")
v = tapply(D$hair=="Blond", D$decade, mean)
lines(names(v),v,type='l',lwd=2,col="gold")
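The chunk above calls tapply() once per hair color. As an alternative sketch, the fractions for all colors can be computed in one step with a two-way table and prop.table(). The vectors `decade` and `hair` below are made-up toy data (hypothetical values), not the comics dataset:

```r
# Hypothetical toy data (made-up values, not from comics1.csv)
decade = c(3, 3, 3, 4, 4, 4, 4)
hair = c("Black", "Black", "Brown", "Black", "Brown", "Brown", "Blond")

# Row-wise fractions: one row per decade, one column per hair color
frac = prop.table(table(decade, hair), 1)
frac

frac["4", "Brown"]                      # 2 of the 4 characters in decade 4: 0.5
tapply(hair == "Brown", decade, mean)   # same numbers as the Brown column
```

Each column of frac could then be drawn with plot() and lines(), exactly as in the chunk above.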
🗿 QUIZ
Now it’s your turn.
Please make charts to …
■ examine how the fractions of the top-3 eye colors vary in time
■ compare how the ratios of Living vary in time by publisher and sex
   dc.Female dc.Male marvel.Female marvel.Male
3     0.7143  0.6923        0.0000      0.8571
4     0.7273  0.5810        0.8261      0.7438
5     0.7273  0.7547        0.8000      0.7297
6     0.8485  0.7465        0.7564      0.7082
7     0.5909  0.6964        0.6923      0.7182
8     0.7782  0.6518        0.8034      0.7293
9     0.8219  0.6911        0.8037      0.7639
10    0.8382  0.7395        0.6951      0.6447
11    0.9062  0.8525        0.8100      0.7160