🏠 Key Points :
■ 3 different types of bar plots in ggplot
■ Bar plots for group comparison
■ Choose the basis of comparison by setting position
■ Ill-designed bar plots could be misleading in an intuitive way! ⏰
Load libraries and set global options
pacman::p_load(dplyr,ggplot2,plotly,gridExtra)
theme_set(theme_get() + theme(
text=element_text(size=8), legend.key.size=unit(10,"points")
))
Load the comics dataset
Rows: 7,250
Columns: 9
$ publisher <chr> "dc", "dc", "dc", "dc", "dc", "dc", "dc", "dc", "dc", "dc"~
$ name <chr> "Batman (Bruce Wayne)", "Superman (Clark Kent)", "Green La~
$ align <chr> "Good", "Good", "Good", "Good", "Good", "Good", "Good", "G~
$ eye <chr> "Blue", "Blue", "Brown", "Brown", "Blue", "Blue", "Blue", ~
$ hair <chr> "Black", "Black", "Brown", "White", "Black", "Black", "Blo~
$ sex <chr> "Male", "Male", "Male", "Male", "Male", "Female", "Male", ~
$ alive <chr> "Living", "Living", "Living", "Living", "Living", "Living"~
$ appearances <int> 3093, 2496, 1565, 1316, 1237, 1231, 1121, 1095, 1075, 1028~
$ year <int> 1939, 1986, 1959, 1987, 1940, 1941, 1941, 1989, 1969, 1956~
A tibble is an enhanced data frame from the tibble package, which dplyr re-exports. We convert D into a tibble for better display.
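The loading code is not shown here, so the following is only a minimal sketch of the conversion step, assuming D starts out as a plain data frame. The two-row data frame is a hypothetical stand-in that borrows a few values from the listing above.

```r
library(dplyr)

# hypothetical stand-in for the comics data (values taken from the listing above)
D = data.frame(
  publisher   = c("dc", "dc"),
  name        = c("Batman (Bruce Wayne)", "Superman (Clark Kent)"),
  align       = c("Good", "Good"),
  appearances = c(3093L, 2496L),
  year        = c(1939L, 1986L)
)

D = as_tibble(D)   # as_tibble() is re-exported by dplyr from the tibble package
glimpse(D)         # one line per column: name, type and the first few values
```

glimpse() produces the kind of column overview shown above, which is easier to read than a wide print-out when the data has many columns.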
🌻 There are 3 types of bar plots in ggplot
g1 = ggplot(D, aes(x=year)) + geom_histogram(binwidth=10)
g2 = ggplot(D, aes(x=align)) + geom_bar()
g3 = group_by(D, sex) %>%
summarise( casualty.ratio = mean(alive=="Deceased") ) %>%
ggplot(aes(x=sex, y=casualty.ratio)) + geom_col()
grid.arrange(g1,g2,g3,nrow=1) # align three plots in a row
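The three geoms differ in who computes the bar heights: geom_histogram() bins a continuous variable, geom_bar() counts the rows in each category, and geom_col() draws the heights it is given. A small sketch on hypothetical toy data (not the comics dataset) shows that geom_bar() and count() followed by geom_col() draw the same bars.

```r
library(dplyr)
library(ggplot2)

# hypothetical toy data, not the comics dataset
toy = tibble(align = c("Good", "Good", "Bad", "Neutral", "Good", "Bad"))

# geom_bar() counts the rows in each category for us ...
p_bar = ggplot(toy, aes(x = align)) + geom_bar()

# ... while geom_col() expects the heights to be computed beforehand
toy_n = count(toy, align)
p_col = ggplot(toy_n, aes(x = align, y = n)) + geom_col()

# layer_data() returns the data actually drawn; the bar heights agree
h_bar = as.numeric(layer_data(p_bar)$y)
h_col = as.numeric(layer_data(p_col)$y)
```

This is why g3 above needs summarise() first: a ratio is not a row count, so the height must be computed before plotting.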
🌻 These bar plots help us
The complexity of bar plots kicks in when the comparisons involve sub-groups. To distinguish sub-groups within a bar, we map the fill attribute to the sub-grouping variable.
g1 = ggplot(D, aes(x=year, fill=publisher)) + geom_histogram(binwidth=10)
g2 = ggplot(D, aes(x=align, fill=sex)) + geom_bar()
grid.arrange(g1,g2,nrow=1) # align two plots in a row
🚴 EXERCISE :
Try to make the following chart by modifying the code snippet below.
# group_by(D, sex) %>%
# summarise( casualty.ratio = mean(alive=="Deceased") ) %>%
# ggplot(aes(x=sex, y=casualty.ratio)) + geom_col()
❓ DISCUSSION:
Based on the chart above …
We can cope with this problem by using the position argument within geom_col().
See? Now we can compare the casualty ratio of all of the subgroups easily.
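The code behind this chart is not reproduced here. As a minimal sketch, assuming the sub-grouping variable is align (the text does not say which one was used) and using hypothetical toy data, a dodged version of the casualty-ratio plot might look like this.

```r
library(dplyr)
library(ggplot2)

# hypothetical toy data with the same column names as the comics dataset
toy = tibble(
  sex   = c("Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female"),
  align = c("Good", "Good", "Bad", "Bad", "Good", "Good", "Bad", "Bad"),
  alive = c("Living", "Deceased", "Deceased", "Deceased",
            "Living", "Living", "Living", "Deceased")
)

# one casualty ratio per (sex, align) sub-group
ratios = group_by(toy, sex, align) %>%
  summarise(casualty.ratio = mean(alive == "Deceased"), .groups = "drop")

# position="dodge" puts the sub-group bars side by side on a common baseline
p = ggplot(ratios, aes(x = sex, y = casualty.ratio, fill = align)) +
  geom_col(position = "dodge")
p
```

With the default position="stack" the ratios would be piled on top of each other, and a sum of ratios has no meaningful interpretation; dodging keeps each ratio readable on its own.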
Sub-group comparison is based on a data structure called a contingency table.
        Female Male
Bad        676 2239
Good      1264 1914
Neutral    432  725
However, ggplot cannot take the table format. To be compatible with the aes() mapping mechanism, we need to prepare the data in the long format. count() is a handy way to make a long table, compared with group_by() %>% summarise().
# A tibble: 6 x 3
align sex n
<chr> <chr> <int>
1 Bad Female 676
2 Bad Male 2239
3 Good Female 1264
4 Good Male 1914
5 Neutral Female 432
6 Neutral Male 725
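To make the difference between the two layouts concrete, here is a minimal sketch on hypothetical toy data. It builds the wide contingency table with table(), the long table with count(), and the same long table again with group_by() %>% summarise().

```r
library(dplyr)

# hypothetical toy data, not the comics dataset
toy = tibble(
  align = c("Bad", "Bad", "Good", "Good", "Good", "Neutral"),
  sex   = c("Female", "Male", "Female", "Male", "Male", "Male")
)

# wide: one cell per (align, sex) combination, including empty ones
wide = table(toy$align, toy$sex)

# long: one row per combination that actually occurs
long1 = count(toy, align, sex)

# the same long table, spelled out step by step
long2 = group_by(toy, align, sex) %>%
  summarise(n = n(), .groups = "drop")
```

Note that count() leaves out combinations with no rows (Neutral/Female in this toy data), whereas table() reports them as 0.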
position
By setting the position argument in geom_col(), we can align and compare the numbers in different ways.
dx = count(D, align, sex)
gg = lapply(c("stack","dodge","fill"), function(pos) {
ggplot(dx, aes(sex,n,fill=align)) +
geom_col(position=pos, alpha=0.6) + labs(title=pos,y="")
})
grid.arrange(grobs=gg, nrow=1)
🌻 Different plots serve different purposes …
■ stack : stacking the numbers emphasizes the sums by sex
■ dodge : to compare all of the numbers in the table
■ fill : converts numbers into fractions for relative comparison
❓ QUIZ :
Which of the 3 above charts is easier to …
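It can help to see what position="fill" does in terms of the table itself. The sketch below re-enters the counts from the count(D, align, sex) output above and computes, for each sex, the fraction that each align contributes. The column name fraction is introduced here for illustration.

```r
library(dplyr)

# counts copied from the count(D, align, sex) output above
dx = tibble(
  align = rep(c("Bad", "Good", "Neutral"), each = 2),
  sex   = rep(c("Female", "Male"), times = 3),
  n     = c(676, 2239, 1264, 1914, 432, 725)
)

# position="fill" rescales every bar to height 1;
# the equivalent computation is n divided by the total of its bar (here: its sex)
fx = group_by(dx, sex) %>%
  mutate(fraction = n / sum(n)) %>%
  ungroup()
fx
```

Because each bar is rescaled separately, the "fill" chart hides the fact that there are far more male than female characters; that information is only visible in the "stack" and "dodge" charts.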
🚴 EXERCISE :
Actually we can make three more plots out of exactly the same data. Try to make the following chart by modifying the code snippet below.
# gg = lapply(c("stack","dodge","fill"), function(pos) {
# ggplot(dx, aes(sex,n,fill=align)) +
# geom_col(position=pos, alpha=0.6) + labs(title=pos,y="")
# })
# grid.arrange(grobs=gg, nrow=1)
Let’s use mutate() to add a decade column to D.
3 4 5 6 7 8 9 10 11
28 271 106 678 823 1304 1581 1803 656
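The formula used for decade is not shown. Judging from the values 3 to 11 and from years such as 1939, one plausible reading is "decades since 1900". The sketch below is written under that assumption, on a few hypothetical years.

```r
library(dplyr)

# hypothetical years; the real column is D$year
toy = tibble(year = c(1939, 1940, 1986, 2013))

# assumption: decade = number of whole decades since 1900
toy = mutate(toy, decade = floor((year - 1900) / 10))

table(toy$decade)   # frequency of each decade, as in the output above
```

Under this definition 1939 falls in decade 3 and 2013 in decade 11, which matches the range of the table above.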
❓ How does the number of each align vary over time, by sex and by publisher?
🌷 See how easily we can answer this seemingly complicated query in two lines of simple code. If I’d answered this query with a table full of numbers, would it be helpful at all?
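The two lines in question are not reproduced here. A minimal sketch of what they might look like, assuming the same faceting as the ratio plot later in this section and using hypothetical toy data (the publisher value "marvel" is an assumption):

```r
library(dplyr)
library(ggplot2)

# hypothetical toy data with the same column names as D
toy = tibble(
  decade    = c(6, 6, 7, 7, 8, 8, 8, 9),
  align     = c("Good", "Bad", "Good", "Good", "Bad", "Neutral", "Good", "Bad"),
  sex       = c("Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"),
  publisher = c("dc", "dc", "marvel", "marvel", "dc", "dc", "marvel", "marvel")
)

# counts per decade, coloured by align, one panel per sex x publisher
p = ggplot(toy, aes(x = decade, fill = align)) +
  geom_bar() + facet_grid(sex ~ publisher)
p
```

facet_grid(sex ~ publisher) puts sex on the rows and publisher on the columns, so each panel answers the question for one sub-population.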
Compared with the raw counts, the variations in ratios might better reflect the trend. To convert counts into ratios, we simply set position='fill' in geom_bar().
# here we convert `align` into a factor so we can
# re-order the align levels in a desirable way
D2 = D %>% mutate(
align=factor(align,levels=c("Bad","Neutral","Good"))
) %>%
filter(decade >= 6)
ggplot(D2, aes(x=decade, fill=align)) +
geom_bar(position="fill") +
facet_grid(sex~publisher)
Below is a bar plot that shows the ratios of good and bad aligns over the decades for different hair and eye colors.
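# rank hair and eye values by frequency, most frequent first
hx = count(D2, hair, sort=TRUE)
ex = count(D2, eye, sort=TRUE)
# keep the 3 most frequent hair and eye values and drop the Neutral characters
D2 %>% filter(
hair%in%hx$hair[1:3], eye%in%ex$eye[1:3],
align!="Neutral") %>%
ggplot(aes(decade,fill=align)) +
geom_bar(position="fill",alpha=0.7) +
# the axis titles are used to label the facet columns (eye) and rows (hair)
labs(x="eye",y="hair") +
facet_grid(hair~eye)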
🚴 EXERCISE :
Can you modify the above code chunk