🏠 Key
Points :
Corresponding to Data
Manipulation with dplyr, Chapter 4
■ The effect of
group
’s in dplyr
■ group_by()
…
summarise()
■ group_by()
…
mutate()
■ The three types of R functions …
。 summary functions (eg., sum
,
mean
) returns an single value
。
vector functions (eg., sqrt
,
log
) returns an vector
。 window
functions (eg., order
, lag
) returns an vector,
■ Introduce ggplot2
: the sister
package of dplyr
Load packages & Read data
::p_load(dplyr,tidyr,ggplot2,babynames)
pacman= babynames
B B
# A tibble: 1,924,665 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
2 1880 F Anna 2604 0.0267
3 1880 F Emma 2003 0.0205
4 1880 F Elizabeth 1939 0.0199
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows
This is a very large dataset. There is a record for every distinctive
names every year since 1880. There are 1,924,665
rows and 5
columns …
str(B)
tibble [1,924,665 x 5] (S3: tbl_df/tbl/data.frame)
$ year: num [1:1924665] 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ...
$ sex : chr [1:1924665] "F" "F" "F" "F" ...
$ name: chr [1:1924665] "Mary" "Anna" "Emma" "Elizabeth" ...
$ n : int [1:1924665] 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ...
$ prop: num [1:1924665] 0.0724 0.0267 0.0205 0.0199 0.0179 ...
🌻 group_by()
creates Groups
structures
inside tibbles (data frames). The Groups
by itself does not
change the data. It takes effect in the subsequent function calls along
the pipeline.
B
# A tibble: 1,924,665 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
2 1880 F Anna 2604 0.0267
3 1880 F Emma 2003 0.0205
4 1880 F Elizabeth 1939 0.0199
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows
group_by(B, year)
# A tibble: 1,924,665 x 5
# Groups: year [138]
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
2 1880 F Anna 2604 0.0267
3 1880 F Emma 2003 0.0205
4 1880 F Elizabeth 1939 0.0199
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows
👓 Compare the outputs of the above two code chucks, what is the
difference between them?
Following group_by()
the most common function is
summarise()
.
group_by(B, year) %>% summarise(
no_records = n(), # the number of rows
no.names = n_distinct(name), # the number of distinct name
no.baby = sum(n) # total number of baby's
)
# A tibble: 138 x 4
year no_records no.names no.baby
<dbl> <int> <int> <int>
1 1880 2000 1889 201484
2 1881 1935 1830 192696
3 1882 2127 2012 221533
4 1883 2084 1962 216946
# ... with 134 more rows
# i Use `print(n = ...)` to see more rows
🌻 group_by() %>% summarise()
summarise()
summarise()
callsummarise()
’s one at a timeungroup()
Group mutate is much more complicated than group summarise.
🌻 Only vector functions (such as log
,
sqrt
and the math operators) are allowed in
mutate()
. However, these pure vector operations were better
executed in the entire table. Doing them by groups just produce the same
result with unnecessary iteration and concatenation.
%>% group_by(name) %>% mutate(logN=log(n), sqrtN=sqrt(n)) B
# A tibble: 1,924,665 x 7
# Groups: name [97,310]
year sex name n prop logN sqrtN
<dbl> <chr> <chr> <int> <dbl> <dbl> <dbl>
1 1880 F Mary 7065 0.0724 8.86 84.1
2 1880 F Anna 2604 0.0267 7.86 51.0
3 1880 F Emma 2003 0.0205 7.60 44.8
4 1880 F Elizabeth 1939 0.0199 7.57 44.0
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows
would repeat the vector operation 97,310 times and then concatenate the outputs.
Yet the result is exactly the same as …
%>% mutate(logN=log(n), sqrtN=sqrt(n)) B
# A tibble: 1,924,665 x 7
year sex name n prop logN sqrtN
<dbl> <chr> <chr> <int> <dbl> <dbl> <dbl>
1 1880 F Mary 7065 0.0724 8.86 84.1
2 1880 F Anna 2604 0.0267 7.86 51.0
3 1880 F Emma 2003 0.0205 7.60 44.8
4 1880 F Elizabeth 1939 0.0199 7.57 44.0
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows
which do the vector operations on the entire tibble at once.
🌷 Therefore …
mutate()
.Actually, vector expressions that produce vectors of the same lengths
of the groups can also be used in mutate()
.
Form time to times we’d like to covert numbers to fractions. For example, if we want to investigate how the relative popularity of a name varies in time, we should convert the numbers into percentages on a per year basis. This task can be achieved byy a vector expression such as …
group_by(B, year) %>% mutate(frac = 100*n/sum(n))
# A tibble: 1,924,665 x 6
# Groups: year [138]
year sex name n prop frac
<dbl> <chr> <chr> <int> <dbl> <dbl>
1 1880 F Mary 7065 0.0724 3.51
2 1880 F Anna 2604 0.0267 1.29
3 1880 F Emma 2003 0.0205 0.994
4 1880 F Elizabeth 1939 0.0199 0.962
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows
Within each Groups: year [138]
structure, the number of
each name n
is divided by the total number
sum(n)
in the same year. Thereof the vector expression is
evaluated once per year and produces the percentages for every names in
every years.
Note that the original dataset already has a prop
column. But it was calculated on a per year and per sex basis. Therefore
prop
is about twice as large as frac
.
Windows functions is a special kind of vector function. Like common
vector functions, it takes a vector as input and returns a vector of the
same length. However, whilst most of the vector functions are not
affected by the Groups
structure, the operations of windows
functions depend on Groups
.
Most of the window functions are related to the order of the vector elements, for examples
rank(v)
generates a rank vector based on the value in
v
,lag(v, n)
shifts the vector forward by n
positionslead(v, n)
shifts the vector backward by n
positionsAs an practical example, if we want to investigate how fast the popularity of the names varies in time, we can …
%>% group_by(sex, name) %>%
B arrange(year) %>% # arrange by ascending year
mutate( # so we can use lag to calculate
delta = 100 * (prop-lag(prop)) # the increment of prop in percentage
%>% # per sex-name per year
) arrange(desc(delta)) # arrange by delta to see the largest increments
# A tibble: 1,924,665 x 6
# Groups: sex, name [107,973]
year sex name n prop delta
<dbl> <chr> <chr> <int> <dbl> <dbl>
1 1947 F Linda 99686 0.0548 2.22
2 1935 F Shirley 42355 0.0390 1.79
3 1983 F Ashley 33293 0.0186 1.04
4 1934 F Shirley 22841 0.0211 0.742
# ... with 1,924,661 more rows
# i Use `print(n = ...)` to see more rows
The largest increment of prop
occurred on the female
name Linda in 1947.
Now let’s play with the baby names database with what we’d learned.
Find the the top 20 names and calculate their cumulative percentages.
count(B, sex, name, wt=n, sort=T) %>%
mutate(pc=n/sum(n), cumpc=cumsum(pc)) %>%
head(20)
# A tibble: 20 x 5
sex name n pc cumpc
<chr> <chr> <int> <dbl> <dbl>
1 M James 5150472 0.0148 0.0148
2 M John 5115466 0.0147 0.0295
3 M Robert 4814815 0.0138 0.0433
4 M Michael 4350824 0.0125 0.0558
5 F Mary 4123200 0.0118 0.0677
6 M William 4102604 0.0118 0.0794
7 M David 3611329 0.0104 0.0898
8 M Joseph 2603445 0.00748 0.0973
9 M Richard 2563082 0.00736 0.105
10 M Charles 2386048 0.00685 0.112
11 M Thomas 2304948 0.00662 0.118
12 M Christopher 2022164 0.00581 0.124
13 M Daniel 1907357 0.00548 0.129
14 F Elizabeth 1629679 0.00468 0.134
15 M Matthew 1590440 0.00457 0.139
16 F Patricia 1571692 0.00451 0.143
17 F Jennifer 1466281 0.00421 0.147
18 M George 1464186 0.00421 0.152
19 F Linda 1452249 0.00417 0.156
20 F Barbara 1434060 0.00412 0.160
Find the female names ever being the most popular in the year
filter(B, sex=="F") %>%
group_by(year) %>% top_n(1, n) %>% ungroup %>%
count(name, sort=T) -> fChamp
and plot them in an interactive chart.
= filter(B, sex=="F", name%in%fChamp$name) %>%
gg ggplot(aes(x=year, y=prop, color=name)) + geom_line() +
labs(color="",title="The Most Popular Female Names")
ggplotly(gg) %>% layout(legend=list(tracegroupgap=4))
Do the same thing for male.
filter(B, sex=="M") %>%
group_by(year) %>% top_n(1, n) %>% ungroup %>%
count(name, sort=T) -> mChamp
= filter(B, sex=="M", name%in%mChamp$name) %>%
gg ggplot(aes(x=year, y=prop, color=name)) + geom_line()+
labs(color="",title="The Most Popular Male Names")
ggplotly(gg) %>% layout(legend=list(tracegroupgap=4))
🏄 In this Dojo session, we have learned two things …
dplyr
is handy when you get used to it.ggplot
, it’d make you an data Ninja.