Unit05B - Dplyr’s Group Operation

🏠 Key Points :
Corresponding to Data Manipulation with dplyr, Chapter 4
■ The effect of group’s in dplyr
■ group_by() … summarise()
■ group_by() … mutate()
■ The three types of R functions …
。 summary functions (eg., sum, mean) returns an single value
。 vector functions (eg., sqrt, log) returns an vector
。 window functions (eg., order, lag) returns an vector, but …
■ Introduce ggplot2 : the sister package of dplyr

Load packages & Read data

pacman::p_load(dplyr,tidyr,ggplot2,babynames)
B = babynames
B

# A tibble: 1,924,665 x 5
   year sex   name          n   prop
  <dbl> <chr> <chr>     <int>  <dbl>
1  1880 F     Mary       7065 0.0724
2  1880 F     Anna       2604 0.0267
3  1880 F     Emma       2003 0.0205
4  1880 F     Elizabeth  1939 0.0199
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows

This is a very large dataset. There is a record for every distinctive names every year since 1880. There are 1,924,665 rows and 5 columns …

str(B)

tibble [1,924,665 x 5] (S3: tbl_df/tbl/data.frame)
 $ year: num [1:1924665] 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
 $ sex : chr [1:1924665] "F" "F" "F" "F" ...
 $ name: chr [1:1924665] "Mary" "Anna" "Emma" "Elizabeth" ...
 $ n   : int [1:1924665] 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
 $ prop: num [1:1924665] 0.0724 0.0267 0.0205 0.0199 0.0179 ...

1. Group_By

🌻 group_by() creates Groups structures inside tibbles (data frames). The Groups by itself does not change the data. It takes effect in the subsequent function calls along the pipeline.

# A tibble: 1,924,665 x 5
   year sex   name          n   prop
  <dbl> <chr> <chr>     <int>  <dbl>
1  1880 F     Mary       7065 0.0724
2  1880 F     Anna       2604 0.0267
3  1880 F     Emma       2003 0.0205
4  1880 F     Elizabeth  1939 0.0199
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows

group_by(B, year)

# A tibble: 1,924,665 x 5
# Groups:   year [138]
   year sex   name          n   prop
  <dbl> <chr> <chr>     <int>  <dbl>
1  1880 F     Mary       7065 0.0724
2  1880 F     Anna       2604 0.0267
3  1880 F     Emma       2003 0.0205
4  1880 F     Elizabeth  1939 0.0199
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows

👓 Compare the outputs of the above two code chucks, what is the difference between them?

2. Group Summaries

Following group_by() the most common function is summarise().

group_by(B, year) %>% summarise(
  no_records = n(),               # the number of rows 
  no.names = n_distinct(name),    # the number of distinct name
  no.baby = sum(n)                # total number of baby's 
  )

# A tibble: 138 x 4
   year no_records no.names no.baby
  <dbl>      <int>    <int>   <int>
1  1880       2000     1889  201484
2  1881       1935     1830  192696
3  1882       2127     2012  221533
4  1883       2084     1962  216946
# ... with 134 more rows
# i Use `print(n = ...)` to see more rows

🌻 group_by() %>% summarise()

Only summary functions are allowed in summarise()
Every summary function return an atomic value thereof every Group becomes a row
The last group structure is removed after a summarise() call
Nested group structures are removed by subsequent summarise()’s one at a time
If no longer needed, we’d better remove the remaining group structures by ungroup()

3. Group Mutate

Group mutate is much more complicated than group summarise.

3.1 Pure vector operations

🌻 Only vector functions (such as log, sqrt and the math operators) are allowed in mutate(). However, these pure vector operations were better executed in the entire table. Doing them by groups just produce the same result with unnecessary iteration and concatenation.

B %>% group_by(name) %>% mutate(logN=log(n), sqrtN=sqrt(n))

# A tibble: 1,924,665 x 7
# Groups:   name [97,310]
   year sex   name          n   prop  logN sqrtN
  <dbl> <chr> <chr>     <int>  <dbl> <dbl> <dbl>
1  1880 F     Mary       7065 0.0724  8.86  84.1
2  1880 F     Anna       2604 0.0267  7.86  51.0
3  1880 F     Emma       2003 0.0205  7.60  44.8
4  1880 F     Elizabeth  1939 0.0199  7.57  44.0
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows

would repeat the vector operation 97,310 times and then concatenate the outputs.

Yet the result is exactly the same as …

B %>% mutate(logN=log(n), sqrtN=sqrt(n))

# A tibble: 1,924,665 x 7
   year sex   name          n   prop  logN sqrtN
  <dbl> <chr> <chr>     <int>  <dbl> <dbl> <dbl>
1  1880 F     Mary       7065 0.0724  8.86  84.1
2  1880 F     Anna       2604 0.0267  7.86  51.0
3  1880 F     Emma       2003 0.0205  7.60  44.8
4  1880 F     Elizabeth  1939 0.0199  7.57  44.0
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows

which do the vector operations on the entire tibble at once.

🌷 Therefore …

On the one hand, only vector functions/expressions are allowed in mutate().
On the other hand, it doesn’t make sense to do group mutate for pure vector operations,
Yet group mutate is very useful, as elaborated below.

3.2 Vector Expressions with Summary Functions

Actually, vector expressions that produce vectors of the same lengths of the groups can also be used in mutate().

Form time to times we’d like to covert numbers to fractions. For example, if we want to investigate how the relative popularity of a name varies in time, we should convert the numbers into percentages on a per year basis. This task can be achieved byy a vector expression such as …

group_by(B, year) %>% mutate(frac = 100*n/sum(n))

# A tibble: 1,924,665 x 6
# Groups:   year [138]
   year sex   name          n   prop  frac
  <dbl> <chr> <chr>     <int>  <dbl> <dbl>
1  1880 F     Mary       7065 0.0724 3.51 
2  1880 F     Anna       2604 0.0267 1.29 
3  1880 F     Emma       2003 0.0205 0.994
4  1880 F     Elizabeth  1939 0.0199 0.962
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows

Within each Groups: year [138] structure, the number of each name n is divided by the total number sum(n) in the same year. Thereof the vector expression is evaluated once per year and produces the percentages for every names in every years.

Note that the original dataset already has a prop column. But it was calculated on a per year and per sex basis. Therefore prop is about twice as large as frac.

3.3 Window Functions

Windows functions is a special kind of vector function. Like common vector functions, it takes a vector as input and returns a vector of the same length. However, whilst most of the vector functions are not affected by the Groups structure, the operations of windows functions depend on Groups.

Most of the window functions are related to the order of the vector elements, for examples

rank(v) generates a rank vector based on the value in v,
lag(v, n) shifts the vector forward by n positions
lead(v, n) shifts the vector backward by n positions

As an practical example, if we want to investigate how fast the popularity of the names varies in time, we can …

B %>% group_by(sex, name) %>%              
  arrange(year) %>%                # arrange by ascending year
  mutate(                          # so we can use lag to calculate   
    delta = 100 * (prop-lag(prop)) # the increment of prop in percentage  
    ) %>%                          # per sex-name per year
  arrange(desc(delta))    # arrange by delta to see the largest increments

# A tibble: 1,924,665 x 6
# Groups:   sex, name [107,973]
   year sex   name        n   prop delta
  <dbl> <chr> <chr>   <int>  <dbl> <dbl>
1  1947 F     Linda   99686 0.0548 2.22 
2  1935 F     Shirley 42355 0.0390 1.79 
3  1983 F     Ashley  33293 0.0186 1.04 
4  1934 F     Shirley 22841 0.0211 0.742
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows

The largest increment of prop occurred on the female name Linda in 1947.

NINJA’s DOJO . . . . .

Now let’s play with the baby names database with what we’d learned.

Find the the top 20 names and calculate their cumulative percentages.

count(B, sex, name, wt=n, sort=T) %>% 
  mutate(pc=n/sum(n), cumpc=cumsum(pc)) %>% 
  head(20)

# A tibble: 20 x 5
   sex   name              n      pc  cumpc
   <chr> <chr>         <int>   <dbl>  <dbl>
 1 M     James       5150472 0.0148  0.0148
 2 M     John        5115466 0.0147  0.0295
 3 M     Robert      4814815 0.0138  0.0433
 4 M     Michael     4350824 0.0125  0.0558
 5 F     Mary        4123200 0.0118  0.0677
 6 M     William     4102604 0.0118  0.0794
 7 M     David       3611329 0.0104  0.0898
 8 M     Joseph      2603445 0.00748 0.0973
 9 M     Richard     2563082 0.00736 0.105 
10 M     Charles     2386048 0.00685 0.112 
11 M     Thomas      2304948 0.00662 0.118 
12 M     Christopher 2022164 0.00581 0.124 
13 M     Daniel      1907357 0.00548 0.129 
14 F     Elizabeth   1629679 0.00468 0.134 
15 M     Matthew     1590440 0.00457 0.139 
16 F     Patricia    1571692 0.00451 0.143 
17 F     Jennifer    1466281 0.00421 0.147 
18 M     George      1464186 0.00421 0.152 
19 F     Linda       1452249 0.00417 0.156 
20 F     Barbara     1434060 0.00412 0.160

Find the female names ever being the most popular in the year

filter(B, sex=="F")  %>% 
  group_by(year) %>% top_n(1, n) %>% ungroup %>%  
  count(name, sort=T) -> fChamp

and plot them in an interactive chart.

gg = filter(B, sex=="F", name%in%fChamp$name) %>% 
  ggplot(aes(x=year, y=prop, color=name)) + geom_line() +
  labs(color="",title="The Most Popular Female Names")
ggplotly(gg) %>% layout(legend=list(tracegroupgap=4))

Do the same thing for male.

filter(B, sex=="M")  %>% 
  group_by(year) %>% top_n(1, n) %>% ungroup %>%  
  count(name, sort=T) -> mChamp

gg = filter(B, sex=="M", name%in%mChamp$name) %>% 
  ggplot(aes(x=year, y=prop, color=name)) + geom_line()+
  labs(color="",title="The Most Popular Male Names")
ggplotly(gg) %>% layout(legend=list(tracegroupgap=4))

🏄 In this Dojo session, we have learned two things …

dplyr is handy when you get used to it.
Plus ggplot, it’d make you an data Ninja.