👨🏫 KEY POINTS :
🔑 A data frame column is an attribute of interest. In terms of statistics, we call it a variable.
🔑 We have a few ways to describe the varying values of a variable.
🔑 In R we can describe numeric variables with summary statistics (mean, sd) and categorical variables with counts (table), as in the sketch below.
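A minimal sketch, assuming a hypothetical data frame D with a numeric column age and a categorical column gender (these names are chosen only for illustration):
mean(D$age); sd(D$age)   # summary statistics for a numeric variable
table(D$gender)          # counts for a categorical variable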
Most statistical theories are built upon two fundamental concepts: the random experiment and the random variable.
Mathematically, these hypothetical concepts can be difficult to grasp. Fortunately, we can now illustrate these conceptual ideas with data in R.
The most common random draw function in R is sample()
sample(x, n) randomly draws n elements from x without replacement.
sample(x, n, replace=T) randomly draws n elements from x with replacement.
Randomly draw a number from 1 to 6 …
sample(1:6, 1)
[1] 3
Run it 10 times, then you’d know what it means by ‘random’.
In R, we can define any random experiment as a function that returns a random outcome.
Run it 10 times.
[1] 8
It looks ‘random’, doesn’t it?
🌻 A random variable is defined on a random experiment; its value is determined by the experiment’s outcome. Because random experiments return random values, their corresponding random variables vary randomly.
Let’s define a random experiment that returns the maximum value of three fair dice.
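A minimal sketch of such an experiment, assuming we roll the dice with sample() (the function name reMax3 follows the text; the body is an assumption):
# roll three fair dice and return the maximum face value
reMax3 = function() max(sample(1:6, 3, replace=TRUE))
reMax3()   # each call returns one random outcome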
When we repeat a random experiment N times, we will have a resultant vector of length N. This resultant vector is a random sample of the random variable defined on that experiment.
If we define a random variable rvMax3 on the random experiment reMax3, we can take a 20-point random sample of rvMax3 by …
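One plausible way, assuming reMax3() is defined as in the sketch above (the name rvMax3.sample is chosen only for illustration):
# repeat the random experiment 20 times to obtain a 20-point sample of rvMax3
rvMax3.sample = replicate(20, reMax3())
rvMax3.sample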
🚴 EXERCISE :
👉 Define a random experiment that randomly draws 3 numbers out of 1:10 and returns the minimum.
👉 Take a random sample of size 400 from the corresponding random variable.
👉 What is the best way to present this sample to your boss?
There are 3 different ways to define a random variable:
Suppose that we have a data vector skin that represents the skin colors of a population of 1,000 people.
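For illustration, such a vector could be built as follows (a hypothetical construction that matches the counts shown below):
# 400 Black, 100 Red, 300 White and 200 Yellow data points, randomly shuffled
skin = rep(c("Black","Red","White","Yellow"), c(400, 100, 300, 200)) %>% sample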
The distribution of skin can be represented as a numeric table of counts or proportions …
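A sketch of one way to build such a table (the exact code is an assumption):
# stack the counts and the proportions of skin into one comparison matrix
rbind(counts = table(skin), fractions = table(skin) %>% prop.table)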
          Black   Red White Yellow
counts    400.0 100.0 300.0  200.0
fractions   0.4   0.1   0.3    0.2
or we can visualize the distribution in two ways
par(mar=c(3,5,3,1), mfrow=c(1,2), cex=0.7)
table(skin) %>%                     # counts
  barplot(main="the distribution of skin colors", ylab="counts")
table(skin) %>% prop.table %>%      # proportions
  barplot(main="the distribution of skin colors", ylab="proportions")
🌷 Quite often we’d map the proportions in the distribution to the probabilities of their corresponding events. However, this mapping should be conducted carefully, as will be explained below.
We can define a random variable rvSkin on the random experiment exSkin, which randomly draws a data point from skin. To take a sample of size 100, we repeat the experiment 100 times.
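A minimal sketch, assuming exSkin() simply calls sample() on skin (the names exSkin and rvSkin100 follow the text and the output below; the code itself is an assumption):
# the random experiment: randomly draw one data point from skin
exSkin = function() sample(skin, 1)
# repeat the experiment 100 times to obtain a sample of size 100
rvSkin100 = replicate(100, exSkin())
table(rvSkin100) %>% prop.table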
The distribution of this sample is
rvSkin100
Black Red White Yellow
0.43 0.07 0.30 0.20
Is the distribution of the sample exactly the same as that of the data vector skin?
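A sketch of how such a comparison might be run (the exact code is an assumption); each column below corresponds to a sample of size 10^2, 10^3, …, 10^6:
# draw samples of increasing size (with replacement, i.e. repeating exSkin() many times)
# and put their distributions side by side
sapply(2:6, function(i)
  sample(skin, 10^i, replace=TRUE) %>% table %>% prop.table
) %>% round(5)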
[,1] [,2] [,3] [,4] [,5]
Black 0.33 0.375 0.4005 0.39781 0.40078
Red 0.10 0.101 0.0960 0.09805 0.10035
White 0.35 0.330 0.3022 0.30229 0.29985
Yellow 0.22 0.194 0.2013 0.20185 0.19902
As the size of the sample increases, the distribution of the sample becomes closer to that of the data. But it does not match the exact numbers even when we take a sample of one million (10^6) points.
Theoretically, the sample distribution asymptotically equals the data distribution as the sample size approaches infinity. This is exactly the underlying logic that lets us map the distribution of empirical data to the probabilities of the corresponding events.
🔑 KEY POINTS :
🔑 The probability of drawing a specific combination of categories is given by the multinomial formula …
\[ p = \frac{n!}{k_1!\,k_2!\,\cdots\,k_m!} \; \prod_{i=1}^{m} p_i^{k_i} \\ \frac{10!}{4!\,3!\,2!\,1!} \, (.4)^4 \, (.3)^3 \, (.2)^2 \, (.1)^1 \; \simeq \; 0.034836 \]
Randomly drawing 10 points from skin, the probability that we have 4 blacks, 3 whites, 2 yellows and 1 red is less than 3.5%. It’s hard to do the math or remember the formula though.
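For reference, the value above can be verified directly with base R’s dmultinom():
# P(4 Black, 3 White, 2 Yellow, 1 Red) given category probabilities .4, .3, .2, .1
dmultinom(c(4, 3, 2, 1), prob = c(.4, .3, .2, .1))
# [1] 0.03483648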
Fortunately, in R we can simulate the experiment and let the computer estimate the expected value for us. The asymptotic property implies that our estimate approaches the true value when the number of repetitions is large enough. The question is: how large is large enough?
Let’s define a function zSimulated.Prob(k) that repeats the experiment sample(skin, 10) k times and returns the proportion of outcomes that match the distribution of our data (4 Black, 3 White, 2 Yellow, 1 Red).
# convert skin into a factor for faster execution
sf = factor(skin, c("Black","White","Yellow","Red"))
# prepare for parallel simulation
pacman::p_load(future.apply)
plan(multisession)
# repeat sample(sf, 10) k times and return the proportion of outcomes
# whose counts are exactly (Black=4, White=3, Yellow=2, Red=1)
zSimulated.Prob = function(k) future_replicate(
    k, as.integer(table(sample(sf, 10))), future.seed=TRUE
  ) %>% apply(2, identical, 4:1) %>% mean
By increasing k, we approach ‘infinity’ gradually …
x = as.integer(10^seq(1, 5, 0.2))   # numbers of repetitions, from 10 up to 100,000
t0 = Sys.time()
y = sapply(x, function(k) { cat(k, "- "); zSimulated.Prob(k) })
Sys.time() - t0
10 - 15 - 25 - 39 - 63 - 100 - 158 - 251 - 398 - 630 - 1000 - 1584 - 2511 - 3981 - 6309 - 10000 - 15848 - 25118 - 39810 - 63095 - 100000 -
Time difference of 8.1151 secs
In a few lines of code we can simulate this somewhat complicated experiment, and the power of a modern computer can execute roughly 270 thousand repetitions of it (about 2.7 million random draws) in a matter of seconds.
par(mfrow=c(1,1), mar=c(5,5,3,2), cex=0.8)
plot(log(x,10), y, type="b")         # simulated probability vs. log10 of the number of repetitions
abline(h=0.0348, col='red', lty=2)   # the theoretical value
See? The result of the simulation converges toward the theoretical value (the red dashed line) as the number of repetitions increases.
🌞 TAKE AWAYS :
🌻 Modern business is full of complicated cases that are never covered in statistics textbooks.
🌻 When it is too hard to do the math, computer simulation may help us make a good enough estimate in a reasonable amount of time.
🌻 Master R so you can simulate whatever you can specify !!!