Scatter Plots for Comparing Attributes

🏠 Key Points:
There are two learning objectives in this notebook:
■ Introduction to the ggplot2 package
■ Experience the power of data visualization
Simply put, ggplot is a tool that visualizes a data frame by mapping
■ each row to a geom - a plotting element such as a point, a bar, a line, etc., and
■ some selected variables to the geom attributes such as coordinates, color, shape, etc.

🌻 The basic code elements in ggplot2 are …

ggplot maps a data frame into a plot
aes specifies the mapping between variables and geom attributes
static geom attributes are specified in the geom directly
more than one geom can be added (+) to a plot
theme attributes, such as font size, axes ticks, legend locations etc., can also be adjusted.
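
A minimal sketch that puts these five elements together. It uses R's built-in mtcars data as a stand-in; the dataset and the choice of variables are assumptions for illustration, not part of this notebook's data.

```r
library(ggplot2)

# ggplot() takes the data frame; aes() maps variables to geom attributes
p = ggplot(mtcars, aes(x=wt, y=mpg, color=factor(cyl))) +
  geom_point(size=2, alpha=0.8) +         # static attributes go directly in the geom
  geom_smooth(method="lm", se=FALSE) +    # a second geom is added with +
  theme(legend.position="bottom",         # theme attributes: legend location ...
        text=element_text(size=10))       # ... and font size
p
```

Each line corresponds to one item in the list above, so removing or changing a single line shows what that element contributes.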

Loading the libraries

pacman::p_load(dplyr,tidyr,ggplot2,plotly,gridExtra)
theme_set(theme_get() + theme(    # set common plotting formats
  text=element_text(size=8), legend.key.size=unit(10,"points")
  ))

1. A Simple Example

Whilst making nice and informative plots takes skill, talent and experience, plotting is easy to do in R. Let’s use R’s built-in iris data for a quick starting example.

head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

The iris data frame contains 150 iris flowers from 3 different species.

ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length, color=Species)) + 
  geom_point(size=2, shape=18) + theme_bw()

With geom_point(), every flower in iris is plotted as a point, with its Sepal.Width, Sepal.Length and Species mapped to the point’s x coordinate, y coordinate and color respectively. The static attributes of the points, such as size and shape, are specified directly inside geom_point(). Quite intuitive, isn’t it?
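
To check this mapping rather than take it on faith, one option is to inspect the data that ggplot2 actually draws. The sketch below uses ggplot2's layer_data() helper on the same plot (without the theme); the object names p and pts are ours.

```r
library(ggplot2)

p = ggplot(iris, aes(x=Sepal.Width, y=Sepal.Length, color=Species)) +
  geom_point(size=2, shape=18)

pts = layer_data(p)     # the data drawn by the first (and only) layer
nrow(pts)               # one row per flower
unique(pts$colour)      # mapped attribute: one colour per species
unique(pts$size)        # static attribute: the same value for every point
unique(pts$shape)       # static attribute: the same value for every point
```

Mapped attributes vary with the data, while static attributes are constant across all rows.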

2. Scatter Plots for Data Exploration

D = read.csv("data/comics1.csv")
glimpse(D)

Rows: 7,250
Columns: 9
$ publisher   <chr> "dc", "dc", "dc", "dc", "dc", "dc", "dc", "dc", "dc", "dc"~
$ name        <chr> "Batman (Bruce Wayne)", "Superman (Clark Kent)", "Green La~
$ align       <chr> "Good", "Good", "Good", "Good", "Good", "Good", "Good", "G~
$ eye         <chr> "Blue", "Blue", "Brown", "Brown", "Blue", "Blue", "Blue", ~
$ hair        <chr> "Black", "Black", "Brown", "White", "Black", "Black", "Blo~
$ sex         <chr> "Male", "Male", "Male", "Male", "Male", "Female", "Male", ~
$ alive       <chr> "Living", "Living", "Living", "Living", "Living", "Living"~
$ appearances <int> 3093, 2496, 1565, 1316, 1237, 1231, 1121, 1095, 1075, 1028~
$ year        <int> 1939, 1986, 1959, 1987, 1940, 1941, 1941, 1989, 1969, 1956~

We build plots by step-wise refinement. Making a simple scatter plot is easy.

ggplot(D, aes(year,appearances)) + geom_point()

The plot looks pretty, but is it useful? Can you tell which character appears most often? Let’s try some enhancements.

🚴 Exercise
Add the following code step by step and observe the effects …
■ Put color=sex, shape=align into aes()
■ Add + scale_y_log10() after
■ Add + facet_wrap(~publisher) after
It becomes very messy, doesn’t it?

🌻 You’ve just experienced the key trade-off in plotting - details vs. conciseness.

🌻 The resolution is interactivity. Let’s call our Ninja …

2.1 Ninja’s Dojo - Interactivity

# take 100 most appearing characters from each publisher
gg = D %>% group_by(publisher) %>% top_n(n=100, wt=appearances) %>%    
  ggplot(aes(year,appearances,color=sex,shape=align, label=name)) +    
  scale_y_log10() + 
  facet_wrap(~publisher) +
  geom_point(alpha=0.8) +              # set transparency
  theme_bw() +                         # choose a theme for clarity
  theme(text=element_text(size=9)) +   # use smaller font
  labs(title="The Most Appearing Characters",
       x="", y="", color="", shape="") # set the plot title; blank the axis and legend titles

We save the plot in an object named gg.
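
The first step of the pipeline, group_by() followed by top_n(), keeps the most appearing characters within each publisher rather than overall. Here is a small sketch of that behaviour on a hypothetical toy data frame (the publishers, names and counts are made up for illustration). It also shows slice_max(), which supersedes top_n() in dplyr 1.0.0 and later.

```r
library(dplyr)

# hypothetical toy data, for illustration only
toy = data.frame(
  publisher   = c("A", "A", "A", "B", "B", "B"),
  name        = c("a1", "a2", "a3", "b1", "b2", "b3"),
  appearances = c(10, 50, 30, 7, 2, 9)
)

# keep the 2 most appearing characters within each publisher
top2 = toy %>% group_by(publisher) %>% top_n(n=2, wt=appearances) %>% ungroup()
top2

# the same selection with slice_max(); rows come back sorted within each group
top2b = toy %>% group_by(publisher) %>% slice_max(appearances, n=2) %>% ungroup()
top2b
```

Both calls keep ties by default, so a group can return more than n rows when several characters share the cut-off value.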

Then render the plot interactively with plotly::ggplotly().

ggplotly(gg)

🚴 Let’s play with the plot interactively …

Hover over the markers to see the tooltips
Click (or double-click) the legend items to select specific types of character(s)
Drag within the plotting area to zoom in
Click the 🏠 icon in the menu bar to zoom out
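
The tooltips show every mapped aesthetic by default. If that is too much, ggplotly() has a tooltip argument that selects which aesthetics appear. Below is a minimal sketch on the built-in iris data; the id column is a hypothetical label added for illustration.

```r
library(ggplot2)
library(plotly)

# add a hypothetical id column to serve as the point label
iris2 = transform(iris, id=paste("flower", seq_len(nrow(iris))))

p = ggplot(iris2, aes(Sepal.Width, Sepal.Length, color=Species, label=id)) +
  geom_point()

# show only the label and the coordinates in the tooltip
w = ggplotly(p, tooltip=c("label", "x", "y"))
w
```

The label aesthetic is not drawn by geom_point(); it is carried along only so that the tooltip can display it, which is the same trick used with label=name above.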

2.2 The Power of Interactive Visualization

Even before learning the detailed grammar of plotting, you have now experienced the power of data visualization with the comics dataset.

🌻 Interactivity is a key improvement in modern data visualization.

When you zoom out, you can observe the global phenomena, such as
- how each type of character is distributed over time
- the relationship between year and appearances
- how the relationship might vary across sex, align and publisher
When you zoom in, you can see the details, such as
- the outlying data points and
- the name and characteristics of every single character

🌻 Directly from the chart, you can answer some complicated questions, such as …

Who are the most appearing characters in DC and Marvel?
Who are the most appearing female good characters?
How does the number of high-appearing neutral characters vary over time?

Answering some of these questions used to take serious statistical training. Now, with a good interactive chart, you can see the answers even if you have not learned statistics at all.

🌻 Exploratory Data Analysis (EDA) is basically comparison. By mapping variables to the x and y coordinates, size, color and shape, and by using facet panels, we can use scatter plots to compare as many as 7 attributes at a time.
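
One way to reach seven attributes in a single scatter plot is sketched below on R's built-in mtcars data. Counting the rows and the columns of a facet grid as two separate attributes is our reading of the count above, and the dataset and variable choices are assumptions for illustration only.

```r
library(ggplot2)

p7 = ggplot(mtcars, aes(x=wt, y=mpg,               # 1-2: x and y coordinates
                        size=hp, color=qsec,       # 3-4: size and color
                        shape=factor(am))) +       # 5: shape
  geom_point(alpha=0.8) +
  facet_grid(cyl ~ gear, labeller=label_both)      # 6-7: panel rows and columns
p7
```

With only 32 cars the panels are sparse, which again illustrates the trade-off between detail and readability.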

3. Dynamic and Interactive Charts

Let’s demonstrate the analytic power of interactive charts in another example.

👨‍🏫 Suppose we want to investigate

how the characters’ outlook (specifically hair and eye colors) may relate to their role (align), gender (sex) and casualty (alive), and
how these relations vary over time.

Traditionally, it would take some very complicated models to deal with these research questions. Now let’s see how well we can address them within a chart.

First we need to prepare the data. Let’s set the time frame to 1980 ~ 2010 and divide it into 5-year periods, each labeled by its ending year (the first period, labeled 1985, also includes 1980).

breaks = seq(1980, 2010, 5)
D2 = filter(D, year>=1980, year<=2010) %>%       # set the time period
  mutate(
    # cut years into 5-year periods, each labeled by its ending year
    period = cut(year, breaks, labels=breaks[-1], include.lowest=TRUE) %>%
      as.character %>%  # by default cut() returns a factor, but
      as.integer        # we'd like to have an integer here
  )
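
To see what cut() does with these breaks, and why the factor goes through as.character before as.integer, here is a small sketch on a few hand-picked sample years (the vector yrs is made up for illustration).

```r
breaks = seq(1980, 2010, 5)
yrs = c(1980, 1985, 1986, 2000, 2001, 2010)    # a few sample years

# intervals are (1980,1985], (1985,1990], ...; include.lowest adds 1980 to the first
f = cut(yrs, breaks, labels=breaks[-1], include.lowest=TRUE)

as.integer(as.character(f))   # 1985 1985 1990 2000 2005 2010 : the period labels
as.integer(f)                 # 1 1 2 4 5 6 : only the internal level codes
```

Converting the factor straight to integer would silently return the level codes instead of the years, which is why the detour through as.character is needed.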

Then we count the number of characters in each distinct hair-eye combination and calculate their shares and cumulative shares.

outlooks = count(D2, hair, eye, sort=T) %>%      # count and sort 
  mutate(share=100*n/sum(n), cum=cumsum(share))  # shares and accumulation
head(outlooks, 20)

     hair    eye   n  share   cum
1   Black  Brown 770 15.625 15.62
2   Brown  Brown 563 11.425 27.05
3   Blond   Blue 548 11.120 38.17
4   Black   Blue 351  7.123 45.29
5   Brown   Blue 238  4.830 50.12
6   Black  Black 226  4.586 54.71
7  others   Blue 130  2.638 57.35
8     Red  Green 122  2.476 59.82
9     Red   Blue 119  2.415 62.24
10  White   Blue 112  2.273 64.51
11  Black  Green  77  1.562 66.07
12     No  Green  75  1.522 67.59
13   Bald  Brown  72  1.461 69.05
14  Blond  Green  70  1.420 70.47
15  Brown  Green  68  1.380 71.85
16  Blond  Brown  65  1.319 73.17
17  Black    Red  63  1.278 74.45
18  Black others  61  1.238 75.69
19 others  Green  59  1.197 76.89
20     No    Red  57  1.157 78.04

The top 10 outlooks cover about two thirds (64.5%) of the population.
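
A cut-off like this can also be read off the cumulative column programmatically. The sketch below applies the same count-and-share pattern to R's built-in mtcars data; the dataset, the grouping variables and the two-thirds threshold are stand-ins for illustration.

```r
library(dplyr)

shares = count(mtcars, cyl, gear, sort=TRUE) %>%    # count and sort
  mutate(share=100*n/sum(n), cum=cumsum(share))     # shares and accumulation
shares

# the first row at which the cumulative share reaches two thirds
k = which(shares$cum >= 200/3)[1]
shares[1:k, ]
```

Because the counts are sorted in descending order, the first k rows are the smallest set of combinations that reaches the threshold.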

The following code chunk demonstrates how a seemingly complicated query can be expressed in a single data pipeline. The pipeline might look intimidating at first glance. However, it was not developed in one shot. Let me show you how to build this pipeline step by step from scratch.

inner_join(D2, outlooks[1:10,]) %>%    # filter for the top 10 outlooks 
  group_by(hair, eye) %>% summarise(   # group by hair and eye
    n = n(),                           # count the number of characters
    female=mean(sex=="Female"),        # the share of female
    bad=mean(align=="Bad"),            # the share of bad guys    
    dead=mean(alive=="Deceased"),      # the casualty ratio
    .groups='drop') %>%                # drop the remaining group
  ggplot(aes(bad, dead)) +             # map x and y coordinates
  geom_point(aes(col=female, size=n), alpha=0.8) +   # map size and color
  scale_color_gradientn(colors=c("seagreen","gold","red")) +  # set color scale
  scale_size_continuous(range=c(3,12)) +             # set size scale          
  geom_text(aes(label=paste(hair,eye,sep="\n")), size=3) # put on a text label

Joining, by = c("eye", "hair")
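
The shares in summarise() rely on the fact that the mean of a logical vector is the proportion of TRUE values, as in mean(sex=="Female"). Here is a minimal sketch of that idiom, first on hand-written vectors and then inside a grouped summary on the built-in mtcars data (a stand-in for illustration).

```r
library(dplyr)

mean(c(TRUE, FALSE, TRUE, TRUE))         # 0.75 : three of four are TRUE
mean(c(TRUE, NA, FALSE))                 # NA   : a missing value propagates
mean(c(TRUE, NA, FALSE), na.rm=TRUE)     # 0.5  : share among the non-missing

# share of manual-transmission cars (am == 1) within each cylinder group
by_cyl = mtcars %>% group_by(cyl) %>%
  summarise(n=n(), manual=mean(am==1), .groups="drop")
by_cyl
```

If a column such as sex or align contains missing values, the corresponding share becomes NA unless na.rm=TRUE is passed to mean(). Whether that applies to the comics data is not shown here.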

The chart above is quite informative. Yet it is not good enough …

First, it is static. So we cannot see how the relationship varies in time.
If we want to put more outlooks into the chart, it’d become too crowded to read.

A dynamic and interactive chart can fix these problems. In the following code chunk, we make a few amendments …

We put in the top 18 outlooks, which cover about 75% of the population
Besides hair and eye, we also group by period. This creates panel data, with one data frame per period.
We mutate a hair.eye column for labeling
The ggplot portion basically remains the same. Yet
- In aes() we add frame=period for dynamic display
- We do not plot it but save it as an object gg
Finally, we render the dynamic and interactive chart with ggplotly()

gg = inner_join(D2, outlooks[1:18,-3]) %>% 
  group_by(period, hair, eye) %>% summarise(
    n = n(), bad=mean(align=="Bad"), female=mean(sex=="Female"),
    dead=mean(alive=="Deceased"), .groups='drop') %>% 
  mutate(hair.eye = paste(hair,eye,sep=".")) %>% 
  ggplot(aes(bad, dead, label=hair.eye)) + 
  scale_color_gradientn(colors=c("seagreen","gold","red")) +
  scale_size_continuous(range=c(2,12)) +
  geom_point(aes(col=female, size=n, frame=period), alpha=0.8)

Joining, by = c("eye", "hair")

Warning: Ignoring unknown aesthetics: frame

ggplotly(gg) %>% animation_opts(600)
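
The frame aesthetic is not known to ggplot2 itself (hence the warning); it is picked up by ggplotly() to build the animation. The sketch below shows the same mechanism on a hypothetical toy panel with random values. It also uses plotly's ids aesthetic, which links the same bubble across frames, and animation_slider() to label the slider. The data and object names are assumptions for illustration.

```r
library(ggplot2)
library(plotly)

# hypothetical toy panel data: 3 groups observed in 3 periods
set.seed(1)
toy = expand.grid(group=c("a", "b", "c"), period=c(1990, 1995, 2000))
toy$x = runif(nrow(toy))
toy$y = runif(nrow(toy))
toy$n = sample(10:100, nrow(toy))

p = ggplot(toy, aes(x, y, label=group)) +
  geom_point(aes(size=n, color=group, frame=period, ids=group), alpha=0.8)

anim = ggplotly(p) %>%
  animation_opts(frame=600, transition=300) %>%           # durations in milliseconds
  animation_slider(currentvalue=list(prefix="period: "))  # label on the slider
anim
```

The first argument of animation_opts() is the frame duration, so animation_opts(600) above gives each period 600 milliseconds.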

Press the Play button in the lower left corner and see what happens. If you move along the slider bar, you will find that most of the bubbles gather in one area most of the time, but the location, shape and color pattern of that area vary over time. For example, in the 2000 period the red bubbles (high female share) are clearly lower than the green ones (low female share), indicating that male characters suffered a higher casualty rate. In the 1990 period, dead seems to correlate negatively with bad, implying that bad characters were more likely to stay alive than good ones.
Besides the group phenomena, we can track each individual bubble and see how its role, gender and casualty changed over time. For example, the no-hair, green-eye characters (No.Green) tend to be male and bad most of the time, but their count (bubble size) and death rate changed a lot over the three decades.

From a sociological perspective, the global phenomena and individual trends mentioned above are not trivial at all, yet they are very difficult to see in traditional mathematical models. Such exploratory power well justifies the value of dynamic and interactive data visualization.