🏠 Key Points :
There are two learning objectives in this notebook:
■ Introduction to the ggplot2 package
■ Experience the power of data visualization
Simply put, ggplot is a tool that visualizes a data frame by mapping
■ each row to a geom - a plotting element such as a point, a bar, a line, etc., and
■ some selected variables to the geom's attributes such as coordinates, color, shape, etc.
🌻 The basic code elements in ggplot are …
■ ggplot() maps a data frame into a plot
■ aes() specifies the mapping between variables and geom attributes
■ + adds elements (…) to a plot
Loading the libraries
pacman::p_load(dplyr,tidyr,ggplot2,plotly,gridExtra)
theme_set(theme_get() + theme( # set common plotting formats
text=element_text(size=8), legend.key.size=unit(10,"points")
))
Whilst making nice and informative plots takes skill, talent and experience, plotting itself is easy to do in R. Let’s use R’s built-in data set iris for a quick starting example.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
In the iris data frame there are 150 iris flowers of 3 different species.
ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length, color=Species)) +
geom_point(size=2, shape=18) + theme_bw()
With geom_point(), every flower in iris is plotted as a point, with its Sepal.Width, Sepal.Length and Species mapped to the point’s x coordinate, y coordinate and color respectively. The static attributes of the points, such as size and shape, are specified directly inside the geom function. Quite intuitive, isn’t it?
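To make the difference between mapped and static attributes concrete, here is a minimal sketch using the same iris plot. It stores the plot in a variable (the name p is only for illustration) and inspects two parts of the plot object. Looking inside the object like this is just a way to see where each setting ends up; it is not something you need to do in everyday plotting.

```r
library(ggplot2)

# Mapped attributes go inside aes(): they vary with the data.
# Static attributes go outside aes(): they are the same for every point.
p = ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
  geom_point(size = 2, shape = 18)

names(p$mapping)           # the aesthetics that are mapped to variables
p$layers[[1]]$aes_params   # the static settings of the point layer
```

If color=Species were moved out of aes(), ggplot would no longer treat Species as a variable to map, which is why the mapping has to live inside aes().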
Rows: 7,250
Columns: 9
$ publisher <chr> "dc", "dc", "dc", "dc", "dc", "dc", "dc", "dc", "dc", "dc"~
$ name <chr> "Batman (Bruce Wayne)", "Superman (Clark Kent)", "Green La~
$ align <chr> "Good", "Good", "Good", "Good", "Good", "Good", "Good", "G~
$ eye <chr> "Blue", "Blue", "Brown", "Brown", "Blue", "Blue", "Blue", ~
$ hair <chr> "Black", "Black", "Brown", "White", "Black", "Black", "Blo~
$ sex <chr> "Male", "Male", "Male", "Male", "Male", "Female", "Male", ~
$ alive <chr> "Living", "Living", "Living", "Living", "Living", "Living"~
$ appearances <int> 3093, 2496, 1565, 1316, 1237, 1231, 1121, 1095, 1075, 1028~
$ year <int> 1939, 1986, 1959, 1987, 1940, 1941, 1941, 1989, 1969, 1956~
We build plots by step-wise refinement. Making a simple scatter plot is easy.
The plot looks pretty, but is it useful? Can you tell which character appears most often? Let’s try some enhancements.
🚴 Exercise
Add the following code step by step and observe the effects …
■ Put color=sex, shape=align into aes()
■ Add + scale_y_log10() after …
■ Add + facet_wrap(~publisher) after …
It becomes very messy, doesn’t it?
🌻 You’ve just experienced the key trade-off in plotting - detail vs. conciseness.
🌻 The resolution is interactivity. Let’s call our Ninja …
# take 100 most appearing characters from each publisher
gg = D %>% group_by(publisher) %>% top_n(n=100, wt=appearances) %>%
ggplot(aes(year,appearances,color=sex,shape=align, label=name)) +
scale_y_log10() +
facet_wrap(~publisher) +
geom_point(alpha=0.8) + # set transparency & size
theme_bw() + # choose a theme for clarity
theme(text=element_text(size=9)) + # use smaller font
labs(title="The Most Appearings",
x="", y="", color="", shape="") # set plot and the axis titles
We save the plot in an object named gg.
Then we render the plot interactively with plotly::ggplotly().
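The rendering step can be sketched as follows. Since the comics data frame D is not loaded here, the sketch uses a small made-up data frame (toy, with invented names and numbers) that has the same kind of columns. The tooltip argument of ggplotly() is used to choose what appears on hover; whether the original notebook sets it is not known.

```r
library(ggplot2)
library(plotly)

# Hypothetical stand-in data; the real plot uses the comics data frame D.
toy = data.frame(
  name        = c("A", "B", "C", "D"),
  year        = c(1940, 1960, 1980, 2000),
  appearances = c(3000, 1200, 400, 50),
  sex         = c("Male", "Female", "Male", "Female")
)

gg_toy = ggplot(toy, aes(year, appearances, color = sex, label = name)) +
  geom_point() +
  scale_y_log10()

# ggplotly() converts the ggplot object into an interactive htmlwidget;
# tooltip selects which aesthetics are shown when hovering over a point.
widget = ggplotly(gg_toy, tooltip = c("label", "x", "y"))
widget
```

For the plot saved above, the corresponding call would simply be ggplotly(gg).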
🚴 Let’s play with the plot interactively …
Before learning the detailed grammar of plotting, you should now have experienced the power of data visualization with the comics dataset.
🌻 Interactivity is a key improvement in modern data visualization.
■ year and appearances are mapped to the x and y coordinates
■ sex, align and publisher are mapped to color, shape and facet panels
🌻 Directly from the chart, you can answer some complicated questions, such as …
Answering some of these questions used to take serious statistical training. Now, with a good interactive chart, you can see the answers even if you have not learned statistics at all.
🌻 Exploratory Data Analysis (EDA) is basically comparison. By mapping to the x, y coordinates, size, color, shape and using facet panels, we can use …
Let’s demonstrate the analytic power of interactive charts in another example.
👨🏫 If we wanted to investigate
■ how the characters’ outlooks (hair and eye colors specifically) may relate to their role play (align), gender orientation (sex) and casualty (alive), and
■ …
Traditionally, it would take some very complicated models to deal with these research questions. Now let’s see how well we can answer them within a chart.
First we need to prepare the data. Let’s set the time frame to 1980-2010 and divide it into 5-year periods.
breaks = seq(1980, 2010, 5)
D2 = filter(D, year>=1980, year<=2010) %>%   # set the time period
  mutate(
    period = cut(year, breaks, labels=breaks[-1], include.lowest=TRUE) %>% # cut into 5-year periods
      as.character %>%   # by default cut() returns a factor, but
      as.integer         # we'd like to have an integer here
  )
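To see what the period column looks like, here is a small base-R sketch of the same cut() call on a few made-up years. Each interval is labelled with its upper bound, so 1980 to 1985 become 1985, 1986 to 1990 become 1990, and so on.

```r
breaks = seq(1980, 2010, 5)          # 1980, 1985, ..., 2010
years  = c(1980, 1985, 1986, 2010)   # made-up example years

# Label each interval with its upper bound; include.lowest = TRUE keeps
# 1980 in the first interval instead of turning it into NA.
f = cut(years, breaks = breaks, labels = breaks[-1], include.lowest = TRUE)

# Convert via character: as.integer() applied directly to a factor
# returns the internal level codes, not the labels.
period = as.integer(as.character(f))
period
```

This is why the pipeline goes through as.character before as.integer.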
Then we count the number of characters in each distinct hair-eye combination and calculate their (cumulative) shares.
outlooks = count(D2, hair, eye, sort=T) %>% # count and sort
mutate(share=100*n/sum(n), cum=cumsum(share)) # shares and accumulation
head(outlooks, 20)
hair eye n share cum
1 Black Brown 770 15.625 15.62
2 Brown Brown 563 11.425 27.05
3 Blond Blue 548 11.120 38.17
4 Black Blue 351 7.123 45.29
5 Brown Blue 238 4.830 50.12
6 Black Black 226 4.586 54.71
7 others Blue 130 2.638 57.35
8 Red Green 122 2.476 59.82
9 Red Blue 119 2.415 62.24
10 White Blue 112 2.273 64.51
11 Black Green 77 1.562 66.07
12 No Green 75 1.522 67.59
13 Bald Brown 72 1.461 69.05
14 Blond Green 70 1.420 70.47
15 Brown Green 68 1.380 71.85
16 Blond Brown 65 1.319 73.17
17 Black Red 63 1.278 74.45
18 Black others 61 1.238 75.69
19 others Green 59 1.197 76.89
20 No Red 57 1.157 78.04
The top 10 outlooks cover about two thirds of the population.
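The cumulative share can be checked by hand from the table. The sketch below copies the ten largest counts from the n column and recovers the total number of characters from the first row (770 characters being 15.625% of the total); that total is derived here, not stated in the notebook.

```r
# Counts of the ten most common hair-eye combinations, copied from the table.
n_top10 = c(770, 563, 548, 351, 238, 226, 130, 122, 119, 112)

# Assumption: the total is recovered from the first row's share (15.625%).
total = 770 / 0.15625

share = 100 * n_top10 / total   # same formula as in the mutate() call
cum   = cumsum(share)           # running total of the shares
round(cum, 2)
```

The last value corresponds to the cum entry of row 10 in the table, roughly 64.5%.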
In the following code chunk we demonstrate how to express a seemingly complicated query as a data pipeline. The pipeline might be intimidating at first glance. However, it was not developed in one shot. Let me show you how to build this pipeline step by step from scratch.
inner_join(D2, outlooks[1:10,]) %>% # filter for the top 10 outlooks
group_by(hair, eye) %>% summarise( # group by hair and eye
n = n(), # count the no. characters
female=mean(sex=="Female"), # the share of female
bad=mean(align=="Bad"), # the share of bad guys
dead=mean(alive=="Deceased"), # the casualty ratio
.groups='drop') %>% # drop the remaining group
ggplot(aes(bad, dead)) + # map x and y coordinates
geom_point(aes(col=female, size=n), alpha=0.8) + # map size and color
scale_color_gradientn(colors=c("seagreen","gold","red")) + # set color scale
scale_size_continuous(range=c(3,12)) + # set size scale
geom_text(aes(label=paste(hair,eye,sep="\n")), size=3) # put on a text label
Joining, by = c("eye", "hair")
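One way to build such a pipeline is to start with the grouping and counting, check the result, and only then add the share columns. Below is a minimal sketch on a made-up data frame (toy, with invented rows) that uses the same column names as D2. It relies on the fact that the mean of a logical vector is the fraction of TRUE values.

```r
library(dplyr)

# Hypothetical toy data with the same column names as D2.
toy = data.frame(
  hair  = c("Black", "Black", "Black", "Black", "Red", "Red"),
  eye   = c("Brown", "Brown", "Brown", "Brown", "Green", "Green"),
  sex   = c("Female", "Male", "Male", "Male", "Female", "Female"),
  align = c("Bad", "Bad", "Good", "Good", "Good", "Bad")
)

# Step 1: group by outlook and count the characters.
step1 = toy %>% group_by(hair, eye) %>% summarise(n = n(), .groups = "drop")

# Step 2: add the shares; mean() of a logical vector is the share of TRUEs.
step2 = toy %>% group_by(hair, eye) %>% summarise(
  n      = n(),
  female = mean(sex == "Female"),
  bad    = mean(align == "Bad"),
  .groups = "drop")
step2
```

Once the summary table looks right, the ggplot() layers can be added to the end of the pipeline one at a time in the same way.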
The chart above is quite informative. Yet it is not good enough …
A dynamic and interactive chart can fix these problems. In the following code chunk, we made a few amendments …
■ In addition to hair and eye, we also group by period.
■ mutate() creates a hair.eye column for labeling.
■ In aes() we add frame=period for dynamic display.
■ The plot is saved in gg and rendered with ggplotly.
gg = inner_join(D2, outlooks[1:18,-3]) %>%
group_by(period, hair, eye) %>% summarise(
n = n(), bad=mean(align=="Bad"), female=mean(sex=="Female"),
dead=mean(alive=="Deceased"), .groups='drop') %>%
mutate(hair.eye = paste(hair,eye,sep=".")) %>%
ggplot(aes(bad, dead, label=hair.eye)) +
scale_color_gradientn(colors=c("seagreen","gold","red")) +
scale_size_continuous(range=c(2,12)) +
geom_point(aes(col=female, size=n, frame=period), alpha=0.8)
Joining, by = c("eye", "hair")
Warning: Ignoring unknown aesthetics: frame
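The animated rendering can be sketched on made-up data as follows. The data frame toy and its values are invented for illustration; only the column names follow the code above. The warning about the unknown frame aesthetic is expected, because frame is not a ggplot2 aesthetic and is only picked up by ggplotly().

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data: two outlooks observed in three periods.
toy = data.frame(
  period   = rep(c(1985, 1990, 1995), each = 2),
  hair.eye = rep(c("Black.Brown", "Red.Green"), times = 3),
  bad      = c(0.30, 0.50, 0.35, 0.45, 0.40, 0.55),
  dead     = c(0.20, 0.10, 0.25, 0.15, 0.22, 0.30),
  n        = c(40, 10, 50, 12, 45, 20)
)

# ggplot2 warns that it ignores frame; ggplotly() uses it to build
# one animation frame per period, with a play button and a slider.
gg_toy = ggplot(toy, aes(bad, dead, label = hair.eye)) +
  geom_point(aes(size = n, frame = period), alpha = 0.8)

anim = ggplotly(gg_toy)
anim
```

For the plot saved above, the corresponding call would be ggplotly(gg).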
Press the Play button in the lower left corner and see what happens. If you move along the slider bar, you will find that most of the bubbles gather in one area most of the time. But the location, shape and color pattern of these areas vary over time. For example, in the 2000 period, the red bubbles are clearly lower than the green ones, indicating that the male characters suffered a higher casualty rate. In 1990, dead seems to correlate negatively with bad, implying that bad characters were more likely to stay alive than the good ones.
Besides the group phenomena, we can track each individual bubble and see how its role play, gender and casualty changed over time. For example, the characters with no hair and green eyes tend to be male and bad most of the time, but their count (bubble size) and death rate changed a lot over three decades.
From a sociological perspective, the aforementioned global phenomena and individual trends are not trivial at all. But it is very difficult to see them in traditional mathematical models. Such exploratory power well justifies the value of dynamic and interactive data visualization.