Dr. Peng Xiao
Statistical procedures use sample data to estimate the characteristics of the whole population from which the sample was drawn.
Statistical Inference is the process of using a sample to infer the properties of a population.
Unfortunately, populations are usually too large to measure fully. Consequently, researchers must use a manageable subset of that population to learn about it.
In statistics, probability theory plays a fundamental role in quantifying uncertainty, modeling randomness, and making predictions based on data.
Probability Lingo
Statistics Lingo
Graph
Table
Formula
Denote \(X\) = number of heads when tossing 3 fair coins.
\(X\sim \text{Binomial}(n=3,\ p=0.5)\)
The probability mass function of \(X\) is
\[f_X(x) = \binom{3}{x}\cdot (0.5)^x \cdot (0.5)^{3-x}, \text{ where } x=0,1,2,3 \]
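As a quick check of this formula, the probabilities can be tabulated in R with dbinom (a small sketch; the variable names are illustrative):

# Probability mass function of X ~ Binomial(n = 3, p = 0.5)
x <- 0:3
px <- dbinom(x, size = 3, prob = 0.5)
data.frame(x = x, probability = px)   # probabilities 0.125, 0.375, 0.375, 0.125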
Graph
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Parameters for the normal distribution (mean and standard deviation)
mean_height = 170 # mean height in centimeters
std_dev = 10 # standard deviation in centimeters
# Generate a range of heights for plotting (e.g., from 130 to 210 cm)
heights = np.linspace(130, 210, 1000)
# Calculate the probability density function (PDF) of the normal distribution
pdf_heights = norm.pdf(heights, loc=mean_height, scale=std_dev)
# Plotting the normal distribution curve
plt.figure(figsize=(10, 6))
plt.plot(heights, pdf_heights, color='blue', label='Normal Distribution')
# Highlight the mean and ±1 standard deviation range
plt.axvline(mean_height, color='red', linestyle='--', label='Mean Height')
plt.axvline(mean_height - std_dev, color='green', linestyle='--', label='Mean - 1 Std Dev')
plt.axvline(mean_height + std_dev, color='green', linestyle='--', label='Mean + 1 Std Dev')
# Adding labels, title, and legend
plt.xlabel('Height (cm)')
plt.ylabel('Probability Density')
plt.title('Normal Distribution of Adult Human Heights')
plt.legend()
# Show the plot
plt.show()
Note - Actual human height data may vary and could exhibit deviations from a perfect normal distribution due to factors such as genetic diversity, environmental influences, and sampling variability. However, the normal distribution is a useful theoretical model for describing and analyzing continuous variables like height in statistical contexts.
Table
Formula
Let \(X\) denote the height of an adult human.
If \(X\sim N(\mu,\sigma)\), then its probability density function is
\[ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\cdot e^{-\frac{(x-\mu)^2}{2\cdot \sigma^2}}, \text{ where } x\in (-\infty,\infty ) \]
The population distribution function describes the probabilities associated with every possible value of the random variable within the population.
Hypokalemia is diagnosed when blood potassium levels are below 3.5mEq/L. Let’s assume that we know a patient whose measured potassium levels vary daily according to a normal distribution N(μ = 3.8,σ = 0.2). If only one measurement is made, what is the probability that this patient will be misdiagnosed with Hypokalemia?
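The value printed below can be reproduced with a one-line pnorm call (the original code chunk is hidden, so this is a reconstruction of it):

# P(X < 3.5) when X ~ N(mean = 3.8, sd = 0.2)
pnorm(3.5, mean = 3.8, sd = 0.2)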
## [1] 0.0668072
Collect a Representative Sample
Visualize the Data
library(RColorBrewer)
data(VADeaths)
par(mfrow=c(2,3))
hist(VADeaths,breaks=10, col=brewer.pal(3,"Set3"),main="Set3 3 colors")
hist(VADeaths,breaks=3 ,col=brewer.pal(3,"Set2"),main="Set2 3 colors")
hist(VADeaths,breaks=7, col=brewer.pal(3,"Set1"),main="Set1 3 colors")
hist(VADeaths, breaks=2, col=brewer.pal(8,"Set3"), main="Set3 8 colors")
hist(VADeaths,col=brewer.pal(8,"Greys"),main="Greys 8 colors")
hist(VADeaths,col=brewer.pal(8,"Greens"),main="Greens 8 colors")
Select an Estimation Method
Parametric Estimation
Non-Parametric Estimation
Use the estimated population distribution to draw inferences and make predictions about the population parameters, probabilities of events, or future observations.
Methods for drawing conclusions about a population from sample data are called Statistical Inference.
Rather than directly estimating the population distribution, we estimate the sampling distribution of a statistic.
Population Distribution: Describes the distribution of a variable in the entire population of interest. It is often unknown and may be difficult or impossible to fully characterize, especially when dealing with large or unobservable populations.
Sampling Distribution: Describes the distribution of a statistic (e.g., sample mean, sample proportion) calculated from multiple samples drawn from the population. It represents the variability of the statistic across different possible samples.
Instead of estimating the population distribution, we focus on the sampling distribution, the distribution of a statistic.
The sampling distribution is the distribution of all possible values taken by the statistic when all possible samples of a fixed size \(n\) are taken from the population.
Mean of \(\overline{X}\) = Population Mean \(\mu\)
Standard Deviation of \(\overline{X}\) = Population Standard Deviation over \(\sqrt{n}\)
\[\sigma_\overline{X} = \sigma/\sqrt{n}\]
In a large population of adults, the mean IQ is 112 with standard deviation 20. Suppose 200 adults are randomly selected for a market research campaign.
The distribution of the sample mean IQ is:
Population Distribution : \(N(\mu=112,\sigma=20)\)
Sampling Distribution for \(n=200\) : \(N(\mu=112,\sigma/\sqrt{n}=1.414)\)
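The standard deviation of the sampling distribution is simply the population standard deviation divided by \(\sqrt{n}\); a quick check in R:

# Standard deviation of the sample mean for n = 200
20 / sqrt(200)   # approximately 1.414, matching the value above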
When randomly sampling from any population with mean \(\mu\) and standard deviation \(\sigma\), if \(n\) is large enough, the sampling distribution of \(\overline{X}\) is approximately normal: \(\overline{X}\sim N(\mu, \sigma/\sqrt{n})\).
“Randomly” – every individual in the population has an equal chance of being selected and every possible subset of a given size has an equal chance of being chosen.
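A small simulation can illustrate this result. The sketch below is not from the original notes; the exponential population, sample size, and number of replications are arbitrary choices for illustration. It compares the mean and standard deviation of many sample means with \(\mu\) and \(\sigma/\sqrt{n}\):

# Central Limit Theorem by simulation
set.seed(1)
n <- 50                       # sample size
reps <- 10000                 # number of samples
mu <- 2                       # exponential(rate = 0.5) has mean 2
sigma <- 2                    # and standard deviation 2
xbar <- replicate(reps, mean(rexp(n, rate = 0.5)))
mean(xbar)                    # close to mu = 2
sd(xbar)                      # close to sigma / sqrt(n) = 0.283
hist(xbar, breaks = 40, main = "Sampling distribution of the sample mean")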
Hypokalemia is diagnosed when blood potassium levels are below \(3.5\) mEq/L. Let’s assume that we know a patient whose measured potassium levels vary daily according to a normal distribution \(N(\mu = 3.8,\sigma = 0.2)\). If only one measurement is made, what is the probability that this patient will be misdiagnosed with Hypokalemia?
Instead, if measurements are taken on 4 separate days, what is the probability of a misdiagnosis?
We can first look at the graph of the Population Distribution and the Sampling Distribution
The sampling distribution is narrower than the population distribution by a factor of \(\sqrt{n}\)
When we only use one measurement
## [1] 0.0668072
When we use 4 measurements
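With \(n=4\), the standard deviation of the sample mean shrinks to \(0.2/\sqrt{4}=0.1\), and the same pnorm calculation (again a reconstruction of the hidden chunk) gives:

# P(X-bar < 3.5) when X-bar ~ N(3.8, 0.2 / sqrt(4))
pnorm(3.5, mean = 3.8, sd = 0.2 / sqrt(4))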
## [1] 0.001349898
Although the sample mean, \(\overline{x}\), is a unique number for any particular sample, if you pick a different sample you will probably get a different sample mean.
In fact, you could get many different values for the sample mean, and virtually none of them would actually equal the true population mean, \(\mu\).
But the sampling distribution of \(\overline{X}\) is narrower than the population distribution, by a factor of \(\sqrt{n}\)
Thus, the estimates obtained from our samples tend to be relatively close to the population parameter \(\mu\).
library(animation)
conf.int()  # animate repeated confidence intervals for the population mean
95% of all sample means will be within roughly 2 standard deviations (\(2\times \sigma/\sqrt{n}\)) of the population parameter \(\mu\).
This implies that the population mean \(\mu\) must be within roughly 2 standard deviations (\(2\times \sigma/\sqrt{n}\)) of the sample average \(\overline{x}\), in \(95\%\) of all samples.
The confidence interval is a range of values with an associated probability or confidence level. The probability quantifies the chance that the interval contains the true population mean.
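A simulation can make the "95% of all samples" statement concrete. The sketch below is not part of the original notes; it uses \(\mu = 3.8\), \(\sigma = 0.2\), and \(n = 4\) as in the hypokalemia example that follows, builds an interval from each simulated sample, and reports how often the intervals cover \(\mu\):

# Coverage of the 95% confidence interval for the mean (sigma known)
set.seed(1)
mu <- 3.8; sigma <- 0.2; n <- 4
reps <- 10000
covered <- replicate(reps, {
  xbar <- mean(rnorm(n, mu, sigma))
  half <- qnorm(0.975) * sigma / sqrt(n)   # half-width of the interval
  (xbar - half <= mu) && (mu <= xbar + half)
})
mean(covered)   # should be close to 0.95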
Population Distribution - \(N(3.8,0.2)\)
Sample Size = \(4\)
Sample mean = \(3.7\)
\(95\%\) confidence interval : \(3.7±1.96×0.2/\sqrt{4}\) = \(3.7±0.196\) = \((3.504, 3.896)\)
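The values printed below appear to be the normal probabilities evaluated at the two interval endpoints; a reconstruction of the hidden chunk:

# Normal probabilities at the interval endpoints 3.504 and 3.896
# (the interval itself is 3.7 + c(-1, 1) * qnorm(0.975) * 0.2 / sqrt(4))
pnorm(c(3.504, 3.896), mean = 3.7, sd = 0.2 / sqrt(4))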
## [1] 0.0249979 0.9750021
We are \(95\%\) confident that the actual value of \(\mu\) will be in \((3.504, 3.896)\).
With \(95\%\) chance, the actual value of \(\mu\) will be within \(0.196\) units of the value of \(\overline{x}\)
You are in charge of quality control in a food company. You randomly sample four packs of tomatoes, each labeled 1/2 lb. (227 g).
The average weight from your four packs is 222 g. Obviously, we cannot expect boxes filled with whole tomatoes to all weigh exactly half a pound.
There are two possibilities: either the true mean weight really is 227 g and we observed an unusually light sample just by chance, or the true mean weight is actually below 227 g.
One way to think about this is to ask how extreme the observed result (an average of 222 g) is. Can probability help us measure "how extreme"?
After carefully checking the assumptions, we can frame this problem as a probability problem as follows:
\[P(\overline{X} < 222 | \text{Assumptions})\] If assumptions are satisfied,
\[\overline{X} \sim N(227,5/\sqrt{4})\] (taking the population standard deviation of pack weights to be \(\sigma = 5\) g). Therefore,
\[P(\overline{X}<222) = P\left(Z<\frac{222-227}{5/\sqrt{4}}\right)=P(Z<-2) = 0.0228\]
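In R this tail probability is a single pnorm call (a reconstruction of the hidden chunk):

# P(Z < -2), equivalently pnorm(222, mean = 227, sd = 5 / sqrt(4))
pnorm(-2)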
## [1] 0.02275013
There is only a 2.28% chance that a random sample of four packs would have an average weight of 222 g or less.
Is it an extreme event?
An unusual event happened!
The purpose of hypothesis testing is to assess whether there is enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
The null hypothesis is a very specific statement about a parameter of the population(s). It is labeled \(H_o\).
It usually represents the default position that there is no effect, no difference, or no relationship.
It is typically the hypothesis that researchers aim to test against when conducting statistical hypothesis testing.
The alternative hypothesis is a statement that researchers aim to find evidence for.
It represents a departure from the null hypothesis and typically asserts that there is an effect, a difference, or a relationship in the population.
There are upper-tailed, lower-tailed, and two-tailed versions of the one-sample z-test.
\[ H_o : \mu = 227 \]
\[ H_a : \mu <227 \]
Tests of statistical significance quantify the chance of obtaining a particular random sample result if the null hypothesis were true. This quantity is the P-value.
This is a way of assessing the “believability” of the null hypothesis, given the evidence provided by a random sample.
With a small p-value we reject \(H_o\). The true property of the population is significantly different from what was stated in \(H_o\).
Small p-values are strong evidence AGAINST \(H_o\).
Oftentimes, a P-value of 0.05 or less is considered significant.
The observed phenomenon is unlikely to be due entirely to chance arising from random sampling.
Instead, the assumed value of the population parameter (\(\mu\)) is significantly different from the truth.
The significance level, \(\alpha\), is the largest P-value tolerated for rejecting a true null hypothesis (how much evidence against \(H_o\) we require). This value is decided before conducting the test.
If the P-value is equal to or less than \(\alpha\) (\(P ≤ \alpha\)), then we reject \(H_o\). If the P-value is greater than \(\alpha\) (\(P > \alpha\)), then we fail to reject \(H_o\).
When choosing the significance level \(\alpha\):
The power of a test of hypothesis with fixed significance level \(\alpha\) is the probability that the test will reject the null hypothesis when the alternative is true.
In other words, power is the probability that the data gathered in an experiment will be sufficient to reject a wrong null hypothesis.
Knowing the power of your test is important:
A Type I error is made when we reject the null hypothesis and the null hypothesis is actually true (incorrectly reject a true \(H_o\)).
The probability of making a Type I error is the significance level \(\alpha\).
A Type II error is made when we fail to reject the null hypothesis and the null hypothesis is false (incorrectly keep a false \(H_o\)).
The probability of making a Type II error is labeled \(\beta\).
The power of a test is \(1-\beta\).
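As an illustration (not part of the original notes), the power of the lower-tailed z-test in the tomato example can be computed directly, assuming \(\sigma = 5\) g, \(n = 4\), \(\alpha = 0.05\), and a hypothetical true mean of 222 g:

# Power of the lower-tailed one-sample z-test
mu0    <- 227                          # mean under H_o
mu1    <- 222                          # hypothetical true mean (assumed for illustration)
sigma  <- 5
n      <- 4
alpha  <- 0.05
se     <- sigma / sqrt(n)
cutoff <- mu0 + qnorm(alpha) * se      # reject H_o if the sample mean falls below this
power  <- pnorm((cutoff - mu1) / se)   # P(reject H_o | mu = mu1)
power                                  # roughly 0.64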