Before going further, let’s have some fun. In this notebook we will use an abridged dataset of comic characters to review, introduce and practice R’s built-in functions. (Data source: Kaggle.com)
🌻 read.csv() - reads a CSV (comma-separated values) file
D = read.csv("data/comics1.csv")
🌻 nrow() - the number of rows (records, subjects, units of analysis)
nrow(D)
[1] 7250
🌻 ncol() - the number of columns (variables, attributes, measures of interest)
ncol(D)
[1] 9
🌻 str() - the structure of a data frame, including the class/type of each column
str(D)
'data.frame': 7250 obs. of 9 variables:
$ publisher : chr "dc" "dc" "dc" "dc" ...
$ name : chr "Batman (Bruce Wayne)" "Superman (Clark Kent)" "Green Lantern (Hal Jordan)" "James Gordon (New Earth)" ...
$ align : chr "Good" "Good" "Good" "Good" ...
$ eye : chr "Blue" "Blue" "Brown" "Brown" ...
$ hair : chr "Black" "Black" "Brown" "White" ...
$ sex : chr "Male" "Male" "Male" "Male" ...
$ alive : chr "Living" "Living" "Living" "Living" ...
$ appearances: int 3093 2496 1565 1316 1237 1231 1121 1095 1075 1028 ...
$ year : int 1939 1986 1959 1987 1940 1941 1941 1989 1969 1956 ...
🗿 QUIZ:
■ What is the name of the data frame?
■ What is the targeted subject of analysis?
■ What are the attributes of interest?
■ How many numeric (or integer) columns do we have?
■ How many character columns do we have?
■ …
■ Which columns are CATEGORICAL?
■ What do we mean by categorical?
■ Should we convert the categorical columns into factors?
🌷 We should check the data columns before converting them.
🌻 summary() - the easiest way to examine every column in a data frame
summary(D)
publisher name align eye
Length:7250 Length:7250 Length:7250 Length:7250
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
hair sex alive appearances
Length:7250 Length:7250 Length:7250 Min. : 1
Class :character Class :character Class :character 1st Qu.: 3
Mode :character Mode :character Mode :character Median : 8
Mean : 42
3rd Qu.: 26
Max. :4043
year
Min. :1936
1st Qu.:1978
Median :1992
Mean :1989
3rd Qu.:2004
Max. :2013
🌷 Note that columns of different data classes are summarized differently by summary().
💡 Distribution & Statistics
■ Distribution: a description of a variable. It describes …
◇ the way a variable varies
◇ i.e., how the values of a variable are distributed
■ Statistics: specific characteristics of a numeric variable
◇ for example, mean, median, min, max
■ The distribution of a variable can be expressed in two forms:
◇ numerically, in statistics, or
◇ graphically, in plots
🌻 mean(), median(), max(), min() - obtain specific statistics of numeric vectors
c( mean=mean(D$year), median=median(D$year),
max=max(D$year), min=min(D$year) )
mean median max min
1989 1992 2013 1936
🌻 summary() obtains the major statistics at once
summary(D$year)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1936 1978 1992 1989 2004 2013
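As a small sketch of how the figures reported by summary() relate to the individual statistics functions, the chunk below uses a made-up vector x (an assumption for illustration, not a column of D).

```r
# Hypothetical numeric vector, not taken from the comics data
x <- c(1, 3, 8, 26, 42)

s <- summary(x)                  # Min, 1st Qu., Median, Mean, 3rd Qu., Max at once
s

# The same figures, one function at a time
c(min = min(x), median = median(x), mean = mean(x), max = max(x))
q <- quantile(x, c(0.25, 0.75))  # the 1st and 3rd quartiles
q
```

The quartiles are the values below which 25% and 75% of the data fall, so together with the median they describe where the bulk of the values sits.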
🌻 hist() - visualizes the distribution of numeric (continuous) variables
🌻 log() helps to suppress the visual impact of extreme values
par(mfrow=c(1,3), mar=c(2,2,3,2), cex=0.6) # 3 plots in a row, smaller font
hist(D$year, main="year")
hist(D$appearances, main="appearances")
log(D$appearances,10) %>% hist(main="log(appear.)")
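To see why the log transform helps, here is a minimal sketch on a made-up, heavily right-skewed vector (the values are assumptions for illustration, not the real appearances column).

```r
# Hypothetical, heavily right-skewed counts (not taken from the comics data)
x <- c(1, 2, 3, 5, 8, 20, 50, 400, 4000)

range(x)          # raw scale: 1 to 4000
lx <- log10(x)    # same as log(x, 10)
range(lx)         # log scale: 0 to about 3.6

# Side-by-side histograms of the raw and log-transformed values
par(mfrow = c(1, 2))
hist(x,  main = "raw")
hist(lx, main = "log10")
```

On the raw scale a single extreme value stretches the axis and squeezes almost everything else into the first bar. On the log scale each unit is a factor of 10, so the small values are spread out and the large ones are pulled in.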
🌻 table() lists the distinct values of a categorical (discrete) variable and counts each of them
table(D$align)
Bad Good Neutral
2915 3178 1157
Note that character strings are listed in alphabetical order.
🌻 barplot() visualizes the distribution of discrete variables
par(mfrow=c(1,1), cex=0.6) # 1 plot in a row
table(D$align) %>% barplot(main="align")
🗿 QUIZ
■ What is a distribution? Why do we need it?
■ How can we examine the distribution of a numeric variable?
■ How can we examine the distribution of a categorical variable?
■ Variables can be examined statistically or graphically. Which way is better?
■ …
■ Examine the variable appearances statistically
■ Examine the variable year graphically
■ Examine the variable sex statistically
■ Examine the variable eye graphically
Categorical values can be represented as either factor or character. The align column is read in as a character variable. We can convert it into a factor variable, f.align.
🌻 factor() - converts a character vector into a factor vector
D$f.align = factor(D$align, levels=c("Good","Neutral","Bad"))
■ We could replace the align column with the factor vector. Here we create a new column f.align for comparison.
■ The levels argument specifies the order of levels
par(mfrow=c(1,2), cex=0.7)
table(D$align) %>% barplot(main="align")
table(D$f.align) %>% barplot(main="f.align")
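The effect of the levels argument can be checked without plotting. The sketch below uses a short made-up vector (an assumption standing in for D$align) and compares the character version with the factor version.

```r
# Hypothetical alignment labels, standing in for D$align
align <- c("Good", "Bad", "Neutral", "Good", "Bad", "Good")

table(align)      # character vector: Bad, Good, Neutral (sorted)

f <- factor(align, levels = c("Good", "Neutral", "Bad"))
levels(f)         # "Good" "Neutral" "Bad"
table(f)          # counts follow the level order: 3, 1, 2
as.integer(f)     # underlying integer codes: 1 3 2 1 3 1
```

A factor stores each value as an integer code that points into its levels, which is why table() and barplot() follow the level order rather than the alphabet.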
Let’s convert eye into a factor
D$eye = factor(D$eye)
Now it becomes a factor variable with 8 levels
str(D$eye)
Factor w/ 8 levels "Black","Blue",..: 2 2 3 3 2 2 2 2 2 2 ...
🌻 levels(x) lists the levels of x
levels(D$eye)
[1] "Black" "Blue" "Brown" "Green" "others" "Red" "White" "Yellow"
When we do not specify levels in factor(), the levels default to alphabetical order
table(D$eye)
Black Blue Brown Green others Red White Yellow
619 2477 2210 664 359 423 291 207
🌻 sort() orders the vector for better comparison
par(mfrow=c(1,1), cex=0.7)
table(D$eye) %>% sort(dec=T) %>% barplot
summary(D)
publisher name align eye
Length:7250 Length:7250 Length:7250 Blue :2477
Class :character Class :character Class :character Brown :2210
Mode :character Mode :character Mode :character Green : 664
Black : 619
Red : 423
others : 359
(Other): 498
hair sex alive appearances
Length:7250 Length:7250 Length:7250 Min. : 1
Class :character Class :character Class :character 1st Qu.: 3
Mode :character Mode :character Mode :character Median : 8
Mean : 42
3rd Qu.: 26
Max. :4043
year f.align
Min. :1936 Good :3178
1st Qu.:1978 Neutral:1157
Median :1992 Bad :2915
Mean :1989
3rd Qu.:2004
Max. :2013
🌷 Observe how the factor variables are summarized: only the 6 most frequent levels are listed, and the rest are pooled into (Other).
💡 Factors vs. Characters
■ Both character and factor can be used to represent categorical variables.
■ Although stored and displayed differently, they can be used interchangeably most of the time.
■ Some categorical columns might be read in as character.
■ So, we should check the data type of each column.
■ In practice, we just keep them that way and convert them when it becomes necessary.
If we want to list the top 10 characters by their appearances …
We can sort() D$appearances in descending order (dec=T) and pick out the first 10 elements with head()
sort(D$appearances, dec=T) %>% head(10)
[1] 4043 3360 3093 3061 2961 2496 2258 2255 2072 2017
We see the 10 largest appearances, but … Who are they?
🌻 sort(x) sorts x and returns its sorted contents
🌻 order(x) produces an index vector that sorts x (for reordering some other object)
D[order(D$appearances,decreasing=T), c("year","appearances","name")] %>% head(10)
year appearances name
2424 1962 4043 Spider-Man (Peter Parker)
2425 1941 3360 Captain America (Steven Rogers)
1 1939 3093 Batman (Bruce Wayne)
2426 1974 3061 Wolverine (James \\"Logan\\" Howlett)
2427 1963 2961 Iron Man (Anthony \\"Tony\\" Stark)
2 1986 2496 Superman (Clark Kent)
2428 1950 2258 Thor (Thor Odinson)
2429 1961 2255 Benjamin Grimm (Earth-616)
2430 1961 2072 Reed Richards (Earth-616)
2431 1962 2017 Hulk (Robert Bruce Banner)
So we can see the year, appearances and name of the top 10 characters by appearances, in order.
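The difference between sort() and order() is easiest to see on a tiny example. The data frame toy below is made up for illustration (its names and counts are assumptions, not rows of D).

```r
# Hypothetical mini data frame (names and counts are made up)
toy <- data.frame(
  name        = c("A", "B", "C", "D"),
  appearances = c(30, 400, 7, 120),
  stringsAsFactors = FALSE
)

sort(toy$appearances, decreasing = TRUE)          # values only: 400 120 30 7
idx <- order(toy$appearances, decreasing = TRUE)
idx                                               # row positions: 2 4 1 3
toy[idx, ]                                        # whole rows, largest first
```

sort() returns the values themselves, so the link to the other columns is lost. order() returns positions, which can be used as a row index to rearrange the entire data frame.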
🌻 subset(x) picks out rows by conditions and columns by names
d = subset(D, alive=="Deceased" & sex=="Female", select=c("year","appearances","name"))
d[order(d$appearances, decreasing=T), ] %>% head(10)
year appearances name
2450 1963 1107 Jean Grey (Earth-616)
2510 1965 384 Gwendolyne Stacy (Earth-616)
50 1999 301 Kendra Saunders (New Earth)
2569 1975 259 Moira Kinross (Earth-616)
2571 1972 257 Namorita Prentiss (Earth-616)
2597 1964 226 Karen Page (Earth-616)
85 1971 216 Big Barda (New Earth)
103 1971 177 Talia al Ghul (New Earth)
2650 1964 170 Hela (Earth-616)
2651 1976 170 Lilandra Neramani (Earth-616)
So we can see the top 10 deceased female characters by appearances.
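For comparison, the same kind of selection can be written with plain bracket indexing. This is only a sketch on a made-up data frame toy (an assumption for illustration), showing that subset() and a logical row index pick the same rows.

```r
# Hypothetical mini data frame (values are made up)
toy <- data.frame(
  name  = c("A", "B", "C", "D"),
  sex   = c("Female", "Male", "Female", "Female"),
  alive = c("Deceased", "Deceased", "Living", "Deceased"),
  stringsAsFactors = FALSE
)

# subset(): the condition and the column names refer to columns of toy directly
s1 <- subset(toy, alive == "Deceased" & sex == "Female", select = c("name", "sex"))

# bracket indexing: the same condition, spelled out with toy$...
s2 <- toy[toy$alive == "Deceased" & toy$sex == "Female", c("name", "sex")]

s1   # rows 1 and 4
s2   # the same rows and columns
```

subset() is shorter to type in interactive work, while bracket indexing makes it explicit where each column comes from.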
🗿 QUIZ
■ List the top-10 appearing female characters with blond hair and green eyes
■ List year, publisher, align, and name of the 5 least appearing male characters with red hair and red eyes
📋 Annotate each code chunk and make it your own notebook
🌻 sum() of a logical vector produces the number of TRUEs
sum(D$align == "Good")
[1] 3178
🌻 mean() of a logical vector produces the fraction of TRUEs
mean(D$align == "Good")
[1] 0.4383
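Both results follow from the fact that R treats TRUE as 1 and FALSE as 0 in arithmetic. A minimal sketch with a made-up vector (an assumption, not D$align):

```r
# Hypothetical alignment labels
align <- c("Good", "Bad", "Good", "Neutral")

is_good <- align == "Good"   # TRUE FALSE TRUE FALSE
as.integer(is_good)          # 1 0 1 0
sum(is_good)                 # 2   (number of TRUEs)
mean(is_good)                # 0.5 (2 TRUEs out of 4 values)
```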
🌻 table() lists and counts each distinct value in a categorical (factor or chr) vector
table(D$align)
Bad Good Neutral
2915 3178 1157
🌻 prop.table() converts counts into fractions
table(D$align) %>% prop.table
Bad Good Neutral
0.4021 0.4383 0.1596
What happens if we put two variables in table()?
table(D$f.align, D$sex)
Female Male
Good 1264 1914
Neutral 432 725
Bad 676 2239
❓ Check the online help of prop.table. How does the argument margin work?
# The expression on the left of `%>%` is the first argument of `prop.table`
# When `margin` is not specified, it defaults to NULL
table(D$f.align, D$sex) %>% prop.table # margin = NULL, default
Female Male
Good 0.17434 0.26400
Neutral 0.05959 0.10000
Bad 0.09324 0.30883
table(D$f.align, D$sex) %>% prop.table(1) # margin = 1
Female Male
Good 0.3977 0.6023
Neutral 0.3734 0.6266
Bad 0.2319 0.7681
table(D$f.align, D$sex) %>% prop.table(2) # margin = 2
Female Male
Good 0.5329 0.3924
Neutral 0.1821 0.1486
Bad 0.2850 0.4590
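One way to confirm what margin does is to check which sums equal 1. The sketch below uses a small made-up table of counts (an assumption for illustration, not the real cross-tabulation).

```r
# Hypothetical 2 x 2 table of counts (filled column by column)
m <- matrix(c(10, 30, 20, 40), nrow = 2,
            dimnames = list(align = c("Good", "Bad"),
                            sex   = c("Female", "Male")))
tab <- as.table(m)

p0 <- prop.table(tab)      # margin = NULL: each cell divided by the grand total
p1 <- prop.table(tab, 1)   # margin = 1: each cell divided by its row total
p2 <- prop.table(tab, 2)   # margin = 2: each cell divided by its column total

sum(p0)        # 1: the whole table sums to 1
rowSums(p1)    # 1 1: each row sums to 1
colSums(p2)    # 1 1: each column sums to 1
```

So margin = 1 answers "within each alignment, what share is female or male?", while margin = 2 answers "within each sex, what share has each alignment?".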
Let’s do some practice.
🏆 Group Competition Round 1
Actually there is a better way to answer the last two questions above.
🌻 tapply(value, group, fun) applies fun to value for each distinct group
tapply(D$align == "Neutral", D$sex, sum)
Female Male
432 725
Count the number of neutral characters by sex
tapply(D$align == "Neutral", D$sex, mean)
Female Male
0.1821 0.1486
Calculate the fraction of neutral characters by sex
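What tapply() does can be reproduced by hand on a small example: split the values by group, then apply the function to each piece. The vectors below are made up for illustration (assumptions, not columns of D).

```r
# Hypothetical data: one sex and one alignment per character
sex   <- c("Female", "Male", "Male", "Female", "Male")
align <- c("Neutral", "Good", "Neutral", "Bad", "Bad")

n <- tapply(align == "Neutral", sex, sum)    # Female 1, Male 1
p <- tapply(align == "Neutral", sex, mean)   # Female 0.5, Male 0.333...
n
p

# The same fractions, computed group by group with logical indexing
mean(align[sex == "Female"] == "Neutral")
mean(align[sex == "Male"] == "Neutral")
```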
Let’s do some practice.
tapply(D$sex=="Female", D$hair, mean)
Bald Black Blond Brown No others Red White
0.05921 0.34754 0.46642 0.26921 0.09186 0.35211 0.54839 0.22350
tapply(D$sex=="Female", D$eye, sum)
Black Blue Brown Green others Red White Yellow
153 881 669 340 118 81 79 51
🏆 Group Competition Round 2
Some discrete variables are sequential in nature. For example …
D$decade = (D$year - 1900) %/% 10
table(D$decade)
3 4 5 6 7 8 9 10 11
28 271 106 678 823 1304 1581 1803 656
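The decade codes come from integer division (%/%), which drops the remainder. A minimal sketch on a few made-up years (assumptions for illustration), with a hypothetical label vector added only to make the codes readable:

```r
# Hypothetical years
year <- c(1936, 1939, 1940, 1999, 2000, 2013)

decade <- (year - 1900) %/% 10   # integer division drops the remainder
decade                           # 3 3 4 9 10 11

# Hypothetical readable labels, rebuilt from the decade codes
label <- paste0(1900 + decade * 10, "s")
label                            # "1930s" "1930s" "1940s" "1990s" "2000s" "2010s"
```

So decade 3 stands for the 1930s, and decades 10 and 11 stand for the 2000s and 2010s.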
Calculate the fraction of neutral characters by decade
v = tapply(D$align == "Neutral", D$decade, mean) %>% round(3); v
3 4 5 6 7 8 9 10 11
0.071 0.077 0.094 0.137 0.180 0.137 0.134 0.182 0.250
Plotting these figures makes the trend easier to see.
🌻 plot(x, y, type) plots the data in x and y in different ways
■ type="p" scatter plot (the default)
■ type="l" line plot
■ type="b" line plot with markers
■ check the online help (F1) for more line types and arguments
# Plot the sequence of figures to see the trend
par(mfcol=c(2,2), mar=c(3,3,2,1), cex=0.7)
v %>% barplot()
plot(names(v), v) # scatter plot
plot(names(v), v, type='l') # line plot
plot(names(v), v, type='b', ylim=c(0,0.25)) # zero-based line plot with markers
Which of the above charts shows the trend best?
tapply(value, group, function) can take more than one grouping variable
v = tapply(D$align=="Bad", list(D$sex, D$decade), mean) %>% round(3); v
3 4 5 6 7 8 9 10 11
Female 0.125 0.156 0.312 0.198 0.255 0.266 0.352 0.288 0.250
Male 0.050 0.274 0.289 0.559 0.480 0.457 0.513 0.414 0.439
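With two grouping variables the result is a matrix, with one row per level of the first group and one column per level of the second. A minimal sketch on made-up vectors (assumptions for illustration, not columns of D):

```r
# Hypothetical data: sex, decade and a logical "is bad" flag per character
sex    <- c("F", "F", "M", "M", "M", "F")
decade <- c(3, 4, 3, 3, 4, 4)
bad    <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE)

m <- tapply(bad, list(sex, decade), mean)
m               # rows: F, M; columns: 3, 4
dim(m)          # 2 2
m["F", "4"]     # fraction of bad among F characters in decade 4: 0.5
m["M", ]        # one row, ready to be drawn as a line over decades
```

Each row of the matrix is a series over decades, which is what makes it convenient to draw one line per group.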
🌻 lines(x, y) adds a line to an existing plot. For example, we can compare how the fraction of bad characters varies over time by sex, as below.
par(mfrow=c(1,1), mar=c(3,3,2,1), cex=0.7)
plot(colnames(v), v[1,], type="l", ylim=c(0,0.6), col="red", lwd=2,
main = "Fraction of Bad Characters by Sex") # add title
lines(colnames(v), v[2,], col="blue", lwd=2) # add 2nd line
abline(h=seq(0,0.6,0.1), v=seq(3,10,1), col='lightgray') # add grid lines
Identify the top-3 hair colors …
h3 = table(D$hair) %>% sort %>% tail(3) %>% names; h3
[1] "Blond" "Brown" "Black"
and see how the fractions of the top-3 hair colors vary in time.
par(mfrow=c(1,1), mar=c(3,3,2,1), cex=0.7)
v = tapply(D$hair=="Black", D$decade, mean)
plot(names(v),v,type='l',lwd=2,col="black",ylim=c(0,0.45),
main="Fractions of the Top3 Hair Colors by Decades")
abline(h=seq(0,0.5,0.1), v=seq(3,10,1), col='lightgray')
v = tapply(D$hair=="Brown", D$decade, mean)
lines(names(v),v,type='l',lwd=2,col="brown")
v = tapply(D$hair=="Blond", D$decade, mean)
lines(names(v),v,type='l',lwd=2,col="gold")
🗿 QUIZ
Now it’s your turn. Please make charts to …
■ examine how the fractions of the top-3 eye colors vary in time
■ compare how the ratios of Living vary in time by publisher and sex
dc.Female dc.Male marvel.Female marvel.Male
3 0.7143 0.6923 0.0000 0.8571
4 0.7273 0.5810 0.8261 0.7438
5 0.7273 0.7547 0.8000 0.7297
6 0.8485 0.7465 0.7564 0.7082
7 0.5909 0.6964 0.6923 0.7182
8 0.7782 0.6518 0.8034 0.7293
9 0.8219 0.6911 0.8037 0.7639
10 0.8382 0.7395 0.6951 0.6447
11 0.9062 0.8525 0.8100 0.7160