Before going further, let’s have some fun. In this notebook we will use an abridged dataset of comic characters to review, introduce, and practice R’s built-in functions. (Data source: Kaggle.com)
🌻 read.csv() - reads a CSV (comma-separated values) file
🌻 nrow() - the number of rows (records, subjects, units of analysis)
[1] 7250
🌻 ncol() - the number of columns (variables, attributes, measures of interest)
[1] 9
🌻 str() - the structure of a data frame, including the class/type of each column
'data.frame': 7250 obs. of 9 variables:
$ publisher : chr "dc" "dc" "dc" "dc" ...
$ name : chr "Batman (Bruce Wayne)" "Superman (Clark Kent)" "Green Lantern (Hal Jordan)" "James Gordon (New Earth)" ...
$ align : chr "Good" "Good" "Good" "Good" ...
$ eye : chr "Blue" "Blue" "Brown" "Brown" ...
$ hair : chr "Black" "Black" "Brown" "White" ...
$ sex : chr "Male" "Male" "Male" "Male" ...
$ alive : chr "Living" "Living" "Living" "Living" ...
$ appearances: int 3093 2496 1565 1316 1237 1231 1121 1095 1075 1028 ...
$ year : int 1939 1986 1959 1987 1940 1941 1941 1989 1969 1956 ...
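The code chunks that produced the output above are not shown, so here is a minimal sketch of the same inspection steps. It assumes a tiny, made-up CSV written to a temporary file; the file contents and the name toy are illustrative only and are not the course dataset.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "publisher,name,appearances,year",
  "dc,Hero A,120,1940",
  "marvel,Hero B,45,1963",
  "dc,Hero C,7,1999"
), csv_path)

# Read the file into a data frame; keep text columns as character
toy <- read.csv(csv_path, stringsAsFactors = FALSE)

nrow(toy)  # number of rows (records)
ncol(toy)  # number of columns (variables)
str(toy)   # column names, types, and the first few values
```

With the real data, the same three calls would be applied to the data frame returned by read.csv().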
🗿 QUIZ:
■ What is the name of the data frame?
■ What is the targeted subject of analysis?
■ What are the attributes of interest?
■ How many numeric (or integer) columns do we have?
■ How many character columns do we have?
■ …
■ Which columns are CATEGORICAL?
■ What do we mean by categorical?
■ Should we convert the categorical columns into factors?
🌷 We should check the data columns before converting them
🌻 summary() - the easiest way to examine every column in a data frame
publisher name align eye
Length:7250 Length:7250 Length:7250 Length:7250
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
hair sex alive appearances
Length:7250 Length:7250 Length:7250 Min. : 1
Class :character Class :character Class :character 1st Qu.: 3
Mode :character Mode :character Mode :character Median : 8
Mean : 42
3rd Qu.: 26
Max. :4043
year
Min. :1936
1st Qu.:1978
Median :1992
Mean :1989
3rd Qu.:2004
Max. :2013
🌷 Note that columns of different data classes are summarized differently by summary()
💡 Distribution & Statistics
■ Distribution: a description of a variable. It describes …
◇ The way a variable varies
◇ I.e., how are the values of a variable distributed?
■ Statistics: specific characteristics of a numeric variable
◇ For example, mean, median, min, max
■ The distribution of a variable can be expressed in two forms:
◇ Numerically, in statistics, or
◇ Graphically, in plots
🌻 mean(), median(), max(), min() - obtain specific statistics of numeric variables
mean median max min
1989 1992 2013 1936
🌻 summary() obtains the major statistics at once
Min. 1st Qu. Median Mean 3rd Qu. Max.
1936 1978 1992 1989 2004 2013
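As a small sketch of these calls, the example below uses only the first five year values printed in the str() output earlier, so its results differ from the full-column statistics shown above. The variable names year and stats are illustrative.

```r
# First five values of the year column, copied from the str() output
year <- c(1939, 1986, 1959, 1987, 1940)

# Individual statistics, collected into one named vector
stats <- c(mean = mean(year), median = median(year),
           max = max(year), min = min(year))
stats

# The major statistics at once
summary(year)
```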
🌻 hist() - visualizes the distribution of numeric (continuous) variables
🌻 log() helps to suppress the visual impact of extreme values
par(mfrow=c(1,2), mar=c(2,2,3,2), cex=0.6) # 2 plots in a row, smaller font
hist(D$year, main="year")
hist(D$appearances, main="appearances")
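The chunk above does not show log() in use, so the following is a hedged sketch of how a log transform can tame a right-skewed variable before plotting. The vector x is simulated, made-up data standing in for something like appearances; it is not the course dataset.

```r
set.seed(1)
# Made-up, right-skewed counts; adding 1 keeps every value >= 1 so log() is finite
x <- round(rexp(1000, rate = 1 / 40)) + 1

par(mfrow = c(1, 2), mar = c(2, 2, 3, 2), cex = 0.6)  # 2 plots in a row
hist(x, main = "raw values")          # most bars crowd near zero
hist(log(x), main = "log(values)")    # extreme values are pulled in
```

On the raw scale a few large values stretch the axis; on the log scale the bulk of the data becomes easier to see.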
🌻 table() lists the distinct values of a categorical (discrete) variable and counts each of them
Bad Good Neutral
2915 3178 1157
Note that character strings are listed in alphabetical order
🌻 barplot() visualizes the distribution of discrete variables
🗿 QUIZ
■ What is a distribution? Why do we need it?
■ How can we examine the distribution of a numeric variable?
■ How can we examine the distribution of a categorical variable?
■ Variables can be examined statistically or graphically; which way is better?
■ …
■ Examine the variable appearances statistically
■ Examine the variable year graphically
■ Examine the variable sex statistically
■ Examine the variable eye graphically
Categorical values can be represented as either factor or character. The align column is read in as a character variable. We can convert it into a factor variable, f.align.
🌻 factor() - converts a character vector into a factor vector
■ We could overwrite the align column with the factor vector. Here we create a new column f.align for comparison.
■ The levels argument specifies the order of levels
par(mfrow=c(1,2), cex=0.7)
table(D$align) %>% barplot(main="align")
table(D$f.align) %>% barplot(main="f.align")
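The conversion step itself is not shown above, so here is a minimal sketch of it on a short, made-up vector. The level order Good, Neutral, Bad is taken from the summary output further below; the names align and f_align are illustrative.

```r
# Made-up alignment values stored as character
align <- c("Good", "Bad", "Neutral", "Good", "Bad", "Good")

# Convert to a factor and fix the order of the levels explicitly
f_align <- factor(align, levels = c("Good", "Neutral", "Bad"))

table(align)    # character: counted in alphabetical order
table(f_align)  # factor: counted in the order given by levels
```

The counts are identical in both tables; only the order of the categories changes, which is what the two bar plots compare.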
Let’s convert eye into a factor.
Now it is a factor variable with 8 levels:
Factor w/ 8 levels "Black","Blue",..: 2 2 3 3 2 2 2 2 2 2 ...
🌻 levels(x) lists the levels of x
[1] "Black" "Blue" "Brown" "Green" "others" "Red" "White" "Yellow"
When we do not specify levels in factor(), it defaults to alphabetical order
Black Blue Brown Green others Red White Yellow
619 2477 2210 664 359 423 291 207
🌻 sort() sorts the vector for better comparison
publisher name align eye
Length:7250 Length:7250 Length:7250 Blue :2477
Class :character Class :character Class :character Brown :2210
Mode :character Mode :character Mode :character Green : 664
Black : 619
Red : 423
others : 359
(Other): 498
hair sex alive appearances
Length:7250 Length:7250 Length:7250 Min. : 1
Class :character Class :character Class :character 1st Qu.: 3
Mode :character Mode :character Mode :character Median : 8
Mean : 42
3rd Qu.: 26
Max. :4043
year f.align
Min. :1936 Good :3178
1st Qu.:1978 Neutral:1157
Median :1992 Bad :2915
Mean :1989
3rd Qu.:2004
Max. :2013
🌷 Observe how the factor variables are summarized, up to 6 levels.
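As a small sketch of the default level order and of sorting a table of counts, the example below uses a short, made-up vector of eye colors; the names eye, f_eye, counts and sorted_counts are illustrative.

```r
# Made-up eye colors stored as character
eye <- c("Green", "Brown", "Blue", "Brown", "Brown", "Blue")

f_eye <- factor(eye)   # no levels given, so they default to alphabetical order
levels(f_eye)

counts <- table(f_eye)                         # counts in level order
sorted_counts <- sort(counts, decreasing = TRUE)  # largest count first
sorted_counts
```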
💡 Factors vs. Characters
■ Both character and factor can be used to represent categorical variables.
■ Although stored and displayed differently, they can be used interchangeably most of the time.
■ Some categorical columns might be read in as character.
■ So, we should check the data type of each column.
■ In practice, we just keep them that way and convert them when it becomes necessary.
If we want to list the top 10 characters by their appearances …
We can sort() D$appearances in descending order (dec=T, short for decreasing=TRUE) and pick out 10 elements with head()
[1] 4043 3360 3093 3061 2961 2496 2258 2255 2072 2017
We see the 10 largest appearances, but … Who are they?
🌻 sort(x) sorts x and returns its sorted contents
🌻 order(x) produces an index vector ordered by the values of x (for reordering other objects)
year appearances name
2424 1962 4043 Spider-Man (Peter Parker)
2425 1941 3360 Captain America (Steven Rogers)
1 1939 3093 Batman (Bruce Wayne)
2426 1974 3061 Wolverine (James \\"Logan\\" Howlett)
2427 1963 2961 Iron Man (Anthony \\"Tony\\" Stark)
2 1986 2496 Superman (Clark Kent)
2428 1950 2258 Thor (Thor Odinson)
2429 1961 2255 Benjamin Grimm (Earth-616)
2430 1961 2072 Reed Richards (Earth-616)
2431 1962 2017 Hulk (Robert Bruce Banner)
So we can see the year, appearances, and name of the top-10 characters by appearances, in order.
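The difference between sort() and order() can be sketched on a small, made-up data frame. The data frame toy and the names idx and top2 are illustrative; the same pattern would apply to D.

```r
# Made-up characters and appearance counts
toy <- data.frame(
  name = c("A", "B", "C", "D"),
  appearances = c(12, 870, 45, 300),
  stringsAsFactors = FALSE
)

# sort() returns the sorted values only; the names are lost
sort(toy$appearances, decreasing = TRUE)

# order() returns row positions, which can reorder the whole data frame
idx <- order(toy$appearances, decreasing = TRUE)
top2 <- head(toy[idx, ], 2)   # the two most frequently appearing rows
top2
```

Because order() returns positions rather than values, every column of the row travels along with the sorted column.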
🌻 subset(x) picks out rows by conditions and columns by names
d = subset(D, alive=="Deceased" & sex=="Female", select=c("year","appearances","name"))
d[order(d$appearances, decreasing=T), ] %>% head(10)
year appearances name
2450 1963 1107 Jean Grey (Earth-616)
2510 1965 384 Gwendolyne Stacy (Earth-616)
50 1999 301 Kendra Saunders (New Earth)
2569 1975 259 Moira Kinross (Earth-616)
2571 1972 257 Namorita Prentiss (Earth-616)
2597 1964 226 Karen Page (Earth-616)
85 1971 216 Big Barda (New Earth)
103 1971 177 Talia al Ghul (New Earth)
2650 1964 170 Hela (Earth-616)
2651 1976 170 Lilandra Neramani (Earth-616)
So we can see the top-10 deceased female characters by appearances.
🗿 QUIZ
■ List the top-10 appearing female characters with blond hair and green eyes
■ List the year, publisher, align, and name of the 5 least appearing male characters with red hair and red eyes
📋 Annotate each code chunk and make it your own notebook
🌻 sum() of a logical vector produces the number of TRUE’s
[1] 3178
🌻 mean() of a logical vector produces the fraction of TRUE’s
[1] 0.4383
🌻 table() lists and counts each distinct value in a categorical (factor or chr) variable
Bad Good Neutral
2915 3178 1157
🌻 prop.table() converts counts into fractions
Bad Good Neutral
0.4021 0.4383 0.1596
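The four outputs above can be reproduced with a short sketch. Since the data file is not available here, the vector align is rebuilt from the counts printed above (2915 Bad, 3178 Good, 1157 Neutral); that reconstruction is an assumption for illustration, not the original code.

```r
# Rebuild a character vector with the same counts as the table above
align <- rep(c("Bad", "Good", "Neutral"), times = c(2915, 3178, 1157))

n_good <- sum(align == "Good")    # TRUE counts as 1, so this counts the matches
p_good <- mean(align == "Good")   # the mean of TRUE/FALSE is the fraction of TRUE

counts <- table(align)            # count each distinct value
fractions <- prop.table(counts)   # divide each count by the total

n_good
round(p_good, 4)
round(fractions, 4)
```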
What happens if we put two variables in table()?
Female Male
Good 1264 1914
Neutral 432 725
Bad 676 2239
❓ Check the online help of prop.table. How does the argument margin work?
# The expression on the left of `%>%` becomes the first argument of `prop.table`
# When `margin` is not specified, it defaults to NULL
table(D$f.align, D$sex) %>% prop.table # margin = NULL, default
Female Male
Good 0.17434 0.26400
Neutral 0.05959 0.10000
Bad 0.09324 0.30883
Female Male
Good 0.3977 0.6023
Neutral 0.3734 0.6266
Bad 0.2319 0.7681
Female Male
Good 0.5329 0.3924
Neutral 0.1821 0.1486
Bad 0.2850 0.4590
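The second and third tables above appear without their code. The sketch below rebuilds the two-way counts from the table() output shown earlier and applies the three margin settings; the matrix tab and the result names are illustrative, and building the table by hand is an assumption made to keep the example self-contained.

```r
# Two-way counts copied from the table() output above (rows: align, columns: sex)
tab <- matrix(
  c(1264, 432, 676,    # Female: Good, Neutral, Bad
    1914, 725, 2239),  # Male:   Good, Neutral, Bad
  nrow = 3,
  dimnames = list(c("Good", "Neutral", "Bad"), c("Female", "Male"))
)

p_all <- prop.table(tab)              # margin = NULL: all cells sum to 1
p_row <- prop.table(tab, margin = 1)  # each row sums to 1
p_col <- prop.table(tab, margin = 2)  # each column sums to 1

round(p_all, 5)
round(p_row, 4)
round(p_col, 4)
```

So margin = 1 answers "within each alignment, what share is female or male?", while margin = 2 answers "within each sex, what share is good, neutral or bad?".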
Let’s do some practice.
🏆 Group Competition Round 1
Actually there is a better way to answer the last two questions above.
🌻 tapply(value, group, fun) applies fun to value for each distinct group
Female Male
432 725
Count the number of neutral characters by sex
Female Male
0.1821 0.1486
Calculate the fraction of neutral characters by sex
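A hedged sketch of the two tapply() calls behind these outputs follows. The vectors sex and align are rebuilt from the two-way counts shown earlier so the example runs without the data file; that reconstruction is an assumption for illustration.

```r
# Rebuild sex and align with the same two-way counts as the table above
sex <- rep(c("Female", "Male"), times = c(2372, 4878))
align <- c(
  rep(c("Good", "Neutral", "Bad"), times = c(1264, 432, 676)),   # Female rows
  rep(c("Good", "Neutral", "Bad"), times = c(1914, 725, 2239))   # Male rows
)

# Apply sum (a count) and mean (a fraction) to the logical vector, by sex
n_neutral <- tapply(align == "Neutral", sex, sum)
p_neutral <- tapply(align == "Neutral", sex, mean)

n_neutral
round(p_neutral, 4)
```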
Let’s do some practice.
Bald Black Blond Brown No others Red White
0.05921 0.34754 0.46642 0.26921 0.09186 0.35211 0.54839 0.22350
Black Blue Brown Green others Red White Yellow
153 881 669 340 118 81 79 51
🏆 Group Competition Round 2
Some discrete variables are sequential in nature. For example …
3 4 5 6 7 8 9 10 11
28 271 106 678 823 1304 1581 1803 656
Calculate the fraction of neutral characters by decade
3 4 5 6 7 8 9 10 11
0.071 0.077 0.094 0.137 0.180 0.137 0.134 0.182 0.250
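How the decade variable was built is not shown. One plausible construction, which maps the observed year range 1936 to 2013 onto the indices 3 to 11 seen above, is integer division by 10 after subtracting 1900. The sketch below uses that assumption on made-up years and alignments; it is not necessarily the original definition.

```r
# Made-up years and alignments
year  <- c(1936, 1941, 1959, 1962, 1987, 1992, 2004, 2013)
align <- c("Good", "Bad", "Neutral", "Good", "Neutral", "Bad", "Neutral", "Good")

# Assumption: 1930s -> 3, 1940s -> 4, ..., 2010s -> 11
decade <- (year - 1900) %/% 10

table(decade)                                          # records per decade
p_neutral <- tapply(align == "Neutral", decade, mean)  # fraction neutral per decade
p_neutral
```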
Plotting these figures makes the trend easier to see.
🌻 plot(x, y, type) plots the data in x and y in different ways
■ type="p": scatter plot (the default)
■ type="l": line plot
■ type="b": line plot with markers
■ See the online help (F1) for more line types and arguments
# Plot the sequence of figures to see the trend
par(mfcol=c(2,2), mar=c(3,3,2,1), cex=0.7)
v %>% barplot()
plot(names(v), v) # scatter plot
plot(names(v), v, type='l') # line plot
plot(names(v), v, type='b', ylim=c(0,0.25)) # zero-oriented line plot with markers
Which of the above charts shows the trend best?
tapply(value, group, function) can take more than one grouping variable
3 4 5 6 7 8 9 10 11
Female 0.125 0.156 0.312 0.198 0.255 0.266 0.352 0.288 0.250
Male 0.050 0.274 0.289 0.559 0.480 0.457 0.513 0.414 0.439
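Passing a list of grouping vectors to tapply() yields a matrix with one dimension per grouping variable. The sketch below shows this on a small, made-up data frame; toy and v2 are illustrative names, and the values are not from the course dataset.

```r
# Made-up records with two grouping variables
toy <- data.frame(
  sex    = c("Female", "Female", "Male", "Male", "Female", "Male"),
  decade = c(9, 9, 9, 10, 10, 10),
  align  = c("Bad", "Good", "Bad", "Bad", "Good", "Good"),
  stringsAsFactors = FALSE
)

# Fraction of bad characters for every sex-by-decade combination
v2 <- tapply(toy$align == "Bad", list(toy$sex, toy$decade), mean)
v2   # rows are sexes, columns are decades
```

Each row of the result can then be drawn as its own line, for example v2["Female", ] against colnames(v2).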
🌻 lines(x, y) adds a line to an existing plot. For example, we can compare how the fraction of bad characters varies over time by sex, as below.
par(mfrow=c(1,1), mar=c(3,3,2,1), cex=0.7)
plot(colnames(v), v[1,], type="l", ylim=c(0,0.6), col="red", lwd=2,
main = "Fraction of Bad Characters by Sex") # add title
lines(colnames(v), v[2,], col="blue", lwd=2) # add 2nd line
abline(h=seq(0,0.6,0.1), v=seq(3,10,1), col='lightgray') # add grid lines
Identify the top-3 hair colors …
[1] "Blond" "Brown" "Black"
and see how the fractions of the top-3 hair colors vary over time.
par(mfrow=c(1,1), mar=c(3,3,2,1), cex=0.7)
v = tapply(D$hair=="Black", D$decade, mean)
plot(names(v),v,type='l',lwd=2,col="black",ylim=c(0,0.45),
main="Fractions of the Top3 Hair Colors by Decades")
abline(h=seq(0,0.5,0.1), v=seq(3,10,1), col='lightgray')
v = tapply(D$hair=="Brown", D$decade, mean)
lines(names(v),v,type='l',lwd=2,col="brown")
v = tapply(D$hair=="Blond", D$decade, mean)
lines(names(v),v,type='l',lwd=2,col="gold")
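The three near-identical tapply() and lines() calls above can also be written as a loop over the top-3 categories. This is an alternative sketch on made-up data, not the original code; toy, top3 and the color choices are illustrative.

```r
# Made-up data: five hair colors spread over the decades 3..11
toy <- data.frame(
  hair   = rep(c("Black", "Brown", "Blond", "Red", "White"),
               times = c(40, 30, 20, 7, 3)),
  decade = rep_len(3:11, 100),
  stringsAsFactors = FALSE
)

# The three most frequent categories, most frequent first
top3 <- names(sort(table(toy$hair), decreasing = TRUE))[1:3]

# Empty plot frame, then one line per category
plot(c(3, 11), c(0, 1), type = "n", xlab = "decade", ylab = "fraction",
     main = "Fractions of the Top-3 Hair Colors by Decade")
for (i in seq_along(top3)) {
  v <- tapply(toy$hair == top3[i], toy$decade, mean)
  lines(as.numeric(names(v)), v, lwd = 2, col = i)
}
legend("topright", legend = top3, col = seq_along(top3), lwd = 2)
```

The same loop would work for any categorical column, which may help with the eye-color exercise below.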
🗿 QUIZ
Now it’s your turn. Please make charts to …
■ examine how the fractions of the top-3 eye colors vary over time
■ compare how the ratios of Living vary over time by publisher and sex
dc.Female dc.Male marvel.Female marvel.Male
3 0.7143 0.6923 0.0000 0.8571
4 0.7273 0.5810 0.8261 0.7438
5 0.7273 0.7547 0.8000 0.7297
6 0.8485 0.7465 0.7564 0.7082
7 0.5909 0.6964 0.6923 0.7182
8 0.7782 0.6518 0.8034 0.7293
9 0.8219 0.6911 0.8037 0.7639
10 0.8382 0.7395 0.6951 0.6447
11 0.9062 0.8525 0.8100 0.7160