👨🏫 KEY POINTS :
🔑 Distribution: the connections among probability, probability density and the distribution functions (PDF/CDF).
🔑 3 different types of variables, each with a distinctive distribution form.
🔑 The same set of operations generally applies to all kinds of distributions.
🔑 Put distribution functions to use.
🔑 Define random variables by simulating random experiments.
There are 3 types of variables.
Most of the commonly used distribution functions are built into R (see the online help of Distributions for details):
dbeta()    dbinom()     dcauchy()   dchisq()
dexp()     df()         dgamma()    dhyper()
dlnorm()   dmultinom()  dnbinom()   dnorm()
dpois()    dt()         dunif()     dweibull()
and many more are available in various packages.
There are 4 common operations that generally apply to every distribution:

d<dist>(x, <args>) : the probability density function (PDF)
p<dist>(q, <args>) : the cumulative distribution function (CDF)
q<dist>(p, <args>) : the quantile function, the inverse of the CDF
r<dist>(n, <args>) : random sampling

where <dist> specifies a particular distribution and <args> its parameters, e.g., _norm(., mean, sd) for Normal[mean,sd] and _binom(., size, prob) for Binomial[size,prob]. These distribution operation functions will be elaborated in the following sections.
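For instance, the four operations applied to the standard Normal distribution (a minimal illustration):

dnorm(0)       # density at x = 0
pnorm(1.96)    # Prob[X <= 1.96]
qnorm(0.975)   # the q whereof Prob[X <= q] = 0.975
rnorm(3)       # 3 random draws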
d<dist>

For a continuous distribution such as the Normal, dnorm(x, mean, sd) returns the probability density at x for Normal[mean,sd].
For a discrete distribution such as the Binomial, dbinom(x, size, prob) returns the probability at x for Binomial[size,prob].
Usually the d functions are used to plot the PDF.
par(mfrow=c(1,2), cex=0.7)
curve(dnorm(x), -4, 4,
      main="PDF of Normal[0,1]", ylab="density")
barplot(dbinom(1:10,10,0.5), names=1:10,
        main="PDF of Binomial[10,0.5]", ylab="probability", xlab="x")
🦋 DISCUSSION :
Try to elaborate on the differences between continuous and discrete distributions in the following aspects …
🐞 the mathematical forms of PDF
🐞 the plotting codes
🐞 the label on the y-axis
🐞 What is probability density? How is it different from probability itself?
🚴 Exercise :
👉 Acquire the prob. density of the Standard Normal Dist. (Normal[mu=0,sd=1]) at -3:3.
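One possible solution:
dnorm(-3:3)   # defaults: mean=0, sd=1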
[1] 0.0044318 0.0539910 0.2419707 0.3989423 0.2419707 0.0539910 0.0044318
👉 Acquire and plot the prob. densities of Normal[mu=100,sd=20] at seq(50,120,10). Plot with type='h', lwd=5, col='pink' and label the plot with the appropriate main, xlab, ylab arguments in plot().
par(mar=c(4,4,3,2), cex=0.7)
x = seq(50,120,10)
y = dnorm(x, 100, 20)
plot(x, y, type="h", lwd=5, col='pink',
     main="Prob. Density at Specific X", ylab="density", xlab="X")
👉 Acquire the probabilities of Binomial[n=10,p=1/6] at c(0,2,4,6,8,10).
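One possible solution:
dbinom(c(0,2,4,6,8,10), size=10, prob=1/6)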
[1] 0.161505582890 0.290710049202 0.054265875851 0.002170635034 0.000018605443
[6] 0.000000016538
👉 Roll a fair die 10 times. What is the probability of getting exactly two 1’s?
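For example:
dbinom(2, size=10, prob=1/6)   # exactly two 1's in 10 rolls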
[1] 0.29071
👉 Roll a fair die 20 times. What is the probability of getting fewer than five 1’s?
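Summing the PDF over the event:
sum(dbinom(0:4, size=20, prob=1/6))   # fewer than five 1's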
[1] 0.76875
p<dist>
For any continuous distribution, there is no probability at any single point. Therefore we usually define an event as a specific range of values. If we were to acquire the probability of such an event with the PDF, we’d have to integrate over the range. By integrating the PDF from its left tail, the CDF offers an easier way to acquire the probability over a range. For example, the probability of \(x \in [-1,2]\) under the Standard Normal Dist. can be acquired by subtracting the CDF reading at the lower bound from that at the upper.
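In R, that is:
pnorm(2) - pnorm(-1)   # CDF(2) - CDF(-1) under Normal[0,1]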
[1] 0.81859
🚴 Exercise :
👉 Acquire the prob. of the Standard Normal Dist. in the interval [-Inf, -3].
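One possible solution:
pnorm(-3)   # the CDF reading at -3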
[1] 0.0013499
👉 Acquire the prob. of Normal[mu=100,sd=50] in the interval [0, 50].
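One possible solution:
pnorm(50, 100, 50) - pnorm(0, 100, 50)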
[1] 0.13591
👉 If the distribution of the temperature at this season follows Normal[mu=25,sd=10] Celsius, what is the probability that tomorrow’s average temperature is higher than 30°C?
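Using the upper tail of the CDF:
1 - pnorm(30, mean=25, sd=10)   # Prob[X > 30]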
[1] 0.30854
👉 Flipping a fair coin 100 times, what is the probability of getting fewer than 35 heads?
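One possible solution:
pbinom(34, size=100, prob=0.5)   # Prob[X <= 34]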
[1] 0.00089497
👉 Can you validate your answer to the above question by using the PDF?
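Summing the PDF gives the same value:
sum(dbinom(0:34, size=100, prob=0.5))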
[1] 0.00089497
👉 Roll a fair die 100 times. What is the probability of getting no more than 20 and no fewer than 10 1’s?
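One possible solution:
pbinom(20, 100, 1/6) - pbinom(9, 100, 1/6)   # Prob[10 <= X <= 20]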
[1] 0.82682
👉 Try to validate your answer to the above question by using the PDF.
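Again, summing the PDF over the event:
sum(dbinom(10:20, size=100, prob=1/6))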
[1] 0.82682
See? For discrete distributions, the PDF serves most purposes most of the time.
🚴 SIMULATION :
Let’s also try to estimate the above probability with sample() and replicate().
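The chunk that produced the estimate below isn’t shown; a sketch consistent with the later code (which refers to the theoretical probability p1) might be:

p1 = pbinom(20, 100, 1/6) - pbinom(9, 100, 1/6)      # theoretical probability
n = replicate(20000, sum(sample(1:6, 100, T) == 1))  # count the 1's in each of 20000 simulated runs
c(mean(n >= 10 & n <= 20), mean(n >= 10 & n <= 20) - p1)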
[1] 0.82710000 0.00028026
👉 Wrap the above code chunk into another replicate(), repeat 10 times, and observe …
replicate(10, {
  n = replicate(20000, sum(sample(1:6, 100, T) == 1))
  mean(n >= 10 & n <= 20) - p1
}) %>% summary
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.0053697 -0.0013572 -0.0000447 -0.0003247 0.0017553 0.0027303
What have we learned from the previous exercise?
When we have a theoretical distribution with known arguments, acquiring the probability directly from the corresponding PDF/CDF is easier, faster and more accurate than doing simulation.
Nonetheless, our real world is full of cases that are not covered by any of the pre-defined distributions. For example, if we wanted to estimate the probability that every side of the die occurs no fewer than 10 and no more than 20 times …
K = 20000
# for each trial, tabulate the 6 face counts and keep their min and max
n = replicate(K, sample(factor(1:6), 100, T) %>% table %>% range)
mean(n[1,] >= 10 & n[2,] <= 20)   # min count >= 10 and max count <= 20
[1] 0.25575
Run the above code chunk a few times. You’d see that the event occurs in about 25.8% of the K=20000 trials. To check convergence, we can wrap the above code chunk into future_replicate() (from the future.apply package), which replicates in parallel.
t0 = Sys.time()
a = future_replicate(24, {
  n = replicate(K, sample(factor(1:6), 100, T) %>% table %>% range)
  mean(n[1,] >= 10 & n[2,] <= 20)
}, future.seed=TRUE)
c(mean(a), sd(a))
Sys.time() - t0   # elapsed time
[1] 0.2590229 0.0042863
Time difference of 12.708 secs
In the 24 replications, the mean and standard deviation of the estimated probability are 0.2590 and 0.0043 respectively. The small standard deviation indicates that the trial size of K=20000 is large enough for convergence.
r<dist>
r<dist>(n, <args>) randomly draws n data points from the <dist> with known <args>. For example …
norm50 = rnorm(n=50, mean=50, sd=10)   # draw a sample of size 50 from Normal[50,10]
unif50 = runif(n=50, min=0, max=100)   # draw a 50-point sample from Uniform[0,100]
These random sampling functions are very useful in business simulation. Suppose we know the profit contribution of customer \(i\), \(\pi_i\) …
\[\pi_i(V_i, X_i) = \left\{\begin{matrix} \sqrt{X_i} & , & V_i \le 2\\ \sqrt{X_i+20} & , & V_i \in [3,4]\\ 20+\sqrt{X_i} & , & V_i \ge 5\\ \end{matrix}\right.\]
is determined by two random variables: \(V_i\) ~ Poisson[lambda=3.25] and \(X_i\) ~ Normal[mu=180,sd=15], as drawn in the code below.
It is difficult to determine the distribution of \(\pi\) with math. Let’s try to do this via simulation …
Pi.Trial = function(K) {
  pi = data.frame(
    V = rpois(K, lambda=3.25),      # draw K V's from Poisson[lambda=3.25]
    X = rnorm(K, mean=180, sd=15)   # draw K X's from Normal[mu=180,sd=15]
  ) %>% mutate(
    X = pmax(0, X),                 # truncate negative X at zero
    pi = case_when(                 # apply the piecewise profit function
      V <= 2 ~ sqrt(X),
      V >= 5 ~ 20 + sqrt(X),
      TRUE   ~ sqrt(X + 20)
    ) ) %>% pull(pi)
  c(mu=mean(pi), sigma=sd(pi))
}
Pi.Trial(1e5)
mu sigma
18.2564 8.2605
See? Within a second, we can create a sample of 100K (1e5) and estimate the mu and sigma of the distribution of pi.
To see the precision of our estimation, we can replicate the trial …
K = 1e5
future_replicate(24, Pi.Trial(K), future.seed=T) %>%
  t %>% as.data.frame %>%
  summarise_all(.funs=list(mean=mean, sd=sd))
mu_mean sigma_mean mu_sd sigma_sd
1 18.272 8.2646 0.027829 0.017738
🚴 Exercise :
According to the Central Limit Theorem, the standard deviation is inversely proportional to the square root of the sample size. So …
🐞 If the sd’s are to be cut in half, the sample size should be multiplied by …?
👉 Modify the above code chunk to cut the sd of our estimates in half.
future_replicate(24, Pi.Trial(4*K), future.seed=T) %>%
  t %>% as.data.frame %>%
  summarise_all(.funs=list(mean=mean, sd=sd))
mu_mean sigma_mean mu_sd sigma_sd
1 18.263 8.2599 0.012658 0.0081741
🐞 Is the Central Limit Theorem still working in the 21st century?
q<dist>
The q<dist> functions are the inverses of the p<dist> functions. For a random variable X ~ <dist>[<args>], q<dist>(p, <args>) returns the critical value q whereof Prob[X <= q] = p.
The q<dist> functions are widely used in the statistical method of hypothesis testing. Assuming X ~ Normal[mu, sigma], qnorm(c(0.025, 0.975), mu, sigma) returns the critical values x1 and x2 whereof Prob[ X ∈ [x1, x2] ] = 0.975 - 0.025 = 0.95. Because the probability that X falls in [x1, x2] is 95%, this interval is called the 95% confidence interval in statistics textbooks.
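For example, with mu=100 and sigma=20 (the parameters implied by the output below):
qnorm(c(0.025, 0.975), 100, 20)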
[1] 60.801 139.199
🚴 Exercise :
With X ~ Normal[mu=50, sigma=25] …
👉 Acquire the 99%, 95% and 90% confidence intervals of X.
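One possible solution, storing the six critical values in the vector x used by the plotting code below:
x = qnorm(c(0.005, 0.025, 0.05, 0.95, 0.975, 0.995), 50, 25)
x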
[1] -14.3957 1.0009 8.8787 91.1213 98.9991 114.3957
👉 Plot the PDF of X and mark these intervals in different colors.
par(cex=0.7)
curve(dnorm(x, 50, 25), -50, 150, lwd=3, ylab="density",
      main="PDF of Normal[mu=50,sd=25] with 90%, 95% & 99% CI")
abline(h=0)
rect(x[1], 0, x[6], 0.015, border=NA, col=rgb(1,0,0,0.2))   # 99% CI
rect(x[2], 0, x[5], 0.012, border=NA, col=rgb(0,1,0,0.2))   # 95% CI
rect(x[3], 0, x[4], 0.008, border=NA, col=rgb(0,0,1,0.2))   # 90% CI
legend("topright", c("CI.90%","CI.95%","CI.99%"),
       col=c('blue','green','pink'), lwd=2, bty="n")
🌞 TAKE AWAYS :
R is an integrated working environment where we can …
🌻 Draw sample data from known distributions
🌻 Generate sample data by simulating experiments, and
🌻 Replicate the experiments to create distributions
🌻 Fit data to theoretical distributions, or
🌻 Build empirical distributions directly from data
🌻 Operate on theoretical, empirical or simulated distributions with the same set of tools.