Dr. Peng Xiao
Statistical procedures use sample data to estimate the characteristics of the whole population from which the sample was drawn.
Statistical Inference is the process of using a sample to infer the properties of a population.
Unfortunately, populations are usually too large to measure fully. Consequently, researchers must use a manageable subset of that population to learn about it.
In statistics, probability theory plays a fundamental role in quantifying uncertainty, modeling randomness, and making predictions based on data.
Probability Lingo
Statistics Lingo
Graph
Table
Formula
Denote \(X\) = number of heads when tossing 3 fair coins.
\(X\sim \text{Binomial}(n=3,\ p=0.5)\)
The probability mass function of \(X\) is
\[f_X(x) = \binom{3}{x}\cdot (0.5)^x \cdot (0.5)^{3-x}, \text{ where } x=0,1,2,3 \]
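As a quick check of this formula, the probabilities can be tabulated in R with dbinom (a small sketch; the variable names are illustrative):

# Probability mass function of X ~ Binomial(n = 3, p = 0.5)
x <- 0:3
px <- dbinom(x, size = 3, prob = 0.5)
data.frame(x = x, probability = px)   # probabilities 0.125, 0.375, 0.375, 0.125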
Graph
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Parameters for the normal distribution (mean and standard deviation)
mean_height = 170 # mean height in centimeters
std_dev = 10 # standard deviation in centimeters
# Generate a range of heights for plotting (e.g., from 130 to 210 cm)
heights = np.linspace(130, 210, 1000)
# Calculate the probability density function (PDF) of the normal distribution
pdf_heights = norm.pdf(heights, loc=mean_height, scale=std_dev)
# Plotting the normal distribution curve
plt.figure(figsize=(10, 6))
plt.plot(heights, pdf_heights, color='blue', label='Normal Distribution')
# Highlight the mean and ±1 standard deviation range
plt.axvline(mean_height, color='red', linestyle='--', label='Mean Height')
plt.axvline(mean_height - std_dev, color='green', linestyle='--', label='Mean - 1 Std Dev')
plt.axvline(mean_height + std_dev, color='green', linestyle='--', label='Mean + 1 Std Dev')
# Adding labels, title, and legend
plt.xlabel('Height (cm)')
plt.ylabel('Probability Density')
plt.title('Normal Distribution of Adult Human Heights')
plt.legend()
# Show the plot
plt.show()
Note - Actual human height data may vary and could exhibit deviations from a perfect normal distribution due to factors such as genetic diversity, environmental influences, and sampling variability. However, the normal distribution is a useful theoretical model for describing and analyzing continuous variables like height in statistical contexts.
Table
Formula
Let \(X\) denote the height of an adult human.
If \(X\sim N(\mu,\sigma)\), then its probability density function is
\[ f_X(x) = \frac{1}{\sigma\sqrt{2\pi}}\cdot e^{-\frac{(x-\mu)^2}{2\cdot \sigma^2}}, \text{ where } x\in (-\infty,\infty ) \]
The population distribution function describes the probabilities associated with every possible value of the random variable within the population.
Hypokalemia is diagnosed when blood potassium levels are below 3.5mEq/L. Let’s assume that we know a patient whose measured potassium levels vary daily according to a normal distribution N(μ = 3.8,σ = 0.2). If only one measurement is made, what is the probability that this patient will be misdiagnosed with Hypokalemia?
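The value printed below can be reproduced with a one-line pnorm call (the original code chunk is hidden, so this is a reconstruction of it):

# P(X < 3.5) when X ~ N(mean = 3.8, sd = 0.2)
pnorm(3.5, mean = 3.8, sd = 0.2)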
## [1] 0.0668072
Collect a Representative Sample
Visualize the Data
library(RColorBrewer)
data(VADeaths)
par(mfrow=c(2,3))
hist(VADeaths,breaks=10, col=brewer.pal(3,"Set3"),main="Set3 3 colors")
hist(VADeaths,breaks=3 ,col=brewer.pal(3,"Set2"),main="Set2 3 colors")
hist(VADeaths,breaks=7, col=brewer.pal(3,"Set1"),main="Set1 3 colors")
hist(VADeaths, breaks=2, col=brewer.pal(8,"Set3"), main="Set3 8 colors")
hist(VADeaths,col=brewer.pal(8,"Greys"),main="Greys 8 colors")
hist(VADeaths,col=brewer.pal(8,"Greens"),main="Greens 8 colors")
Select an Estimation Method
Parametric Estimation
Non-Parametric Estimation
Use the estimated population distribution to draw inferences and make predictions about the population parameters, probabilities of events, or future observations.
Methods for drawing conclusions about a population from sample data are called Statistical Inference.
Rather than directly estimating the population distribution, we estimate the sampling distribution of a statistic.
Population Distribution: Describes the distribution of a variable in the entire population of interest. It is often unknown and may be difficult or impossible to fully characterize, especially when dealing with large or unobservable populations.
Sampling Distribution: Describes the distribution of a statistic (e.g., sample mean, sample proportion) calculated from multiple samples drawn from the population. It represents the variability of the statistic across different possible samples.
Instead of estimating the population distribution, we focus on the sampling distribution, the distribution of a statistic.
The sampling distribution is the distribution of all possible values taken by the statistic when all possible samples of a fixed size \(n\) are taken from the population.
Mean of \(\overline{X}\) = Population Mean \(\mu\)
Standard Deviation of \(\overline{X}\) = Population Standard Deviation over \(\sqrt{n}\)
\[\sigma_\overline{X} = \sigma/\sqrt{n}\]
In a large population of adults, the mean IQ is 112 with standard deviation 20. Suppose 200 adults are randomly selected for a market research campaign.
The distribution of the sample mean IQ is:
Population Distribution : \(N(\mu=112,\sigma=20)\)
Sampling Distribution for \(n=200\) : \(N(\mu=112,\sigma/\sqrt{n}=1.414)\)
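The standard deviation of the sampling distribution is simply the population standard deviation divided by \(\sqrt{n}\); a quick check in R:

# Standard deviation of the sample mean for n = 200
20 / sqrt(200)   # approximately 1.414, matching the value above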
When randomly sampling from any population with mean \(\mu\) and standard deviation \(\sigma\), if \(n\) is large enough, the sampling distribution of \(\overline{X}\) is approximately normal: \(\overline{X}\sim N(\mu, \sigma/\sqrt{n})\).
“Randomly” – every individual in the population has an equal chance of being selected and every possible subset of a given size has an equal chance of being chosen.
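A small simulation can illustrate this result. The sketch below is not from the original notes; the exponential population, sample size, and number of replications are arbitrary choices for illustration. It compares the mean and standard deviation of many sample means with \(\mu\) and \(\sigma/\sqrt{n}\):

# Central Limit Theorem by simulation
set.seed(1)
n <- 50                       # sample size
reps <- 10000                 # number of samples
mu <- 2                       # exponential(rate = 0.5) has mean 2
sigma <- 2                    # and standard deviation 2
xbar <- replicate(reps, mean(rexp(n, rate = 0.5)))
mean(xbar)                    # close to mu = 2
sd(xbar)                      # close to sigma / sqrt(n) = 0.283
hist(xbar, breaks = 40, main = "Sampling distribution of the sample mean")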
Hypokalemia is diagnosed when blood potassium levels are below \(3.5\) mEq/L. Let’s assume that we know a patient whose measured potassium levels vary daily according to a normal distribution \(N(\mu = 3.8,\sigma = 0.2)\). If only one measurement is made, what is the probability that this patient will be misdiagnosed with Hypokalemia?
Instead, if measurements are taken on 4 separate days, what is the probability of a misdiagnosis?
We can first look at the graph of the Population Distribution and the Sampling Distribution
The sampling distribution is narrower than the population distribution by a factor of \(\sqrt{n}\)
When we only use one measurement
## [1] 0.0668072
When we use 4 measurements
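With \(n=4\), the standard deviation of the sample mean shrinks to \(0.2/\sqrt{4}=0.1\), and the same pnorm calculation (again a reconstruction of the hidden chunk) gives:

# P(X-bar < 3.5) when X-bar ~ N(3.8, 0.2 / sqrt(4))
pnorm(3.5, mean = 3.8, sd = 0.2 / sqrt(4))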
## [1] 0.001349898
Although the sample mean, \(\overline{x}\), is a unique number for any particular sample, if you pick a different sample you will probably get a different sample mean.
In fact, you could get many different values for the sample mean, and virtually none of them would actually equal the true population mean, \(\mu\).
But the sampling distribution of \(\overline{X}\) is narrower than the population distribution, by a factor of \(\sqrt{n}\)
Thus, the estimates obtained from our samples tend to be relatively close to the population parameter \(\mu\).
library(animation)
conf.int()  # animate repeated confidence intervals for the population mean
95% of all sample means will be within roughly 2 standard deviations (\(2\times \sigma/\sqrt{n}\)) of the population parameter \(\mu\).
This implies that the population mean \(\mu\) must be within roughly 2 standard deviations (\(2\times \sigma/\sqrt{n}\)) of the sample average \(\overline{x}\), in \(95\%\) of all samples.
The confidence interval is a range of values with an associated probability or confidence level. The probability quantifies the chance that the interval contains the true population mean.
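A simulation can make the "95% of all samples" statement concrete. The sketch below is not part of the original notes; it uses \(\mu = 3.8\), \(\sigma = 0.2\), and \(n = 4\) as in the hypokalemia example that follows, builds an interval from each simulated sample, and reports how often the intervals cover \(\mu\):

# Coverage of the 95% confidence interval for the mean (sigma known)
set.seed(1)
mu <- 3.8; sigma <- 0.2; n <- 4
reps <- 10000
covered <- replicate(reps, {
  xbar <- mean(rnorm(n, mu, sigma))
  half <- qnorm(0.975) * sigma / sqrt(n)   # half-width of the interval
  (xbar - half <= mu) && (mu <= xbar + half)
})
mean(covered)   # should be close to 0.95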
Population Distribution - \(N(3.8,0.2)\)
Sample Size = \(4\)
Sample mean = \(3.7\)
\(95\%\) confidence interval : \(3.7±1.96×0.2/\sqrt{4}\) = \(3.7±0.196\) = \((3.504, 3.896)\)
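The values printed below appear to be the normal probabilities evaluated at the two interval endpoints; a reconstruction of the hidden chunk:

# Normal probabilities at the interval endpoints 3.504 and 3.896
# (the interval itself is 3.7 + c(-1, 1) * qnorm(0.975) * 0.2 / sqrt(4))
pnorm(c(3.504, 3.896), mean = 3.7, sd = 0.2 / sqrt(4))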
## [1] 0.0249979 0.9750021
We are \(95\%\) confident that the actual value of \(\mu\) will be in \((3.504, 3.896)\).
With \(95\%\) chance, the actual value of \(\mu\) will be within \(0.196\) units of the value of \(\overline{x}\)
You are in charge of quality control in a food company. You randomly sample four packs of tomatoes, each labeled 1/2 lb. (227 g).
The average weight from your four packs is 222 g. Obviously, we cannot expect boxes filled with whole tomatoes to all weigh exactly half a pound.
There are two possibilities: either the true mean weight really is 227 g and we observed an unusually light sample just by chance, or the true mean weight is actually below 227 g.
One way to think about this is to ask how extreme the observed result (an average of 222 g) is. Can probability help us measure "how extreme"?
After carefully checking the assumptions, we can frame this problem as a probability problem as follows:
\[P(\overline{X} < 222 | \text{Assumptions})\] If assumptions are satisfied,
\[\overline{X} \sim N(227,5/\sqrt{4})\] (taking the population standard deviation of pack weights to be \(\sigma = 5\) g). Therefore,
\[P(\overline{X}<222) = P\left(Z<\frac{222-227}{5/\sqrt{4}}\right)=P(Z<-2) = 0.0228\]
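In R this tail probability is a single pnorm call (a reconstruction of the hidden chunk):

# P(Z < -2), equivalently pnorm(222, mean = 227, sd = 5 / sqrt(4))
pnorm(-2)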
## [1] 0.02275013
There is only a 2.28% chance that a random sample of four packs would have an average weight of 222 g or less.
Is it an extreme event?
An unusual event happened!
The purpose of hypothesis testing is to assess whether there is enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
The null hypothesis is a very specific statement about a parameter of the population(s). It is labeled \(H_o\).
It usually represents the default position that there is no effect, no difference, or no relationship.
It is typically the hypothesis that researchers aim to test against when conducting statistical hypothesis testing.
The alternative hypothesis is a statement that researchers aim to find evidence for.
It represents a departure from the null hypothesis and typically asserts that there is an effect, a difference, or a relationship in the population.
There are upper-tailed, lower-tailed, and two-tailed versions of the one-sample z-test.
\[ H_o : \mu = 227 \]
\[ H_a : \mu <227 \]
Tests of statistical significance quantify the chance of obtaining a particular random sample result if the null hypothesis were true. This quantity is the P-value.
This is a way of assessing the “believability” of the null hypothesis, given the evidence provided by a random sample.
With a small p-value we reject \(H_o\). The true property of the population is significantly different from what was stated in \(H_o\).
Small p-values are strong evidence AGAINST \(H_o\).
Oftentimes, a P-value of 0.05 or less is considered significant.
The observed phenomenon is unlikely to be due entirely to chance arising from random sampling.
Instead, the assumed value of the population parameter (\(\mu\)) is significantly different from the truth.
The significance level, \(\alpha\), is the largest P-value tolerated for rejecting a true null hypothesis (how much evidence against \(H_o\) we require). This value is decided before conducting the test.
If the P-value is equal to or less than \(\alpha\) (\(P ≤ \alpha\)), then we reject \(H_o\). If the P-value is greater than \(\alpha\) (\(P > \alpha\)), then we fail to reject \(H_o\).
When choosing the significance level \(\alpha\):
The power of a test of hypothesis with fixed significance level \(\alpha\) is the probability that the test will reject the null hypothesis when the alternative is true.
In other words, power is the probability that the data gathered in an experiment will be sufficient to reject a wrong null hypothesis.
Knowing the power of your test is important:
A Type I error is made when we reject the null hypothesis and the null hypothesis is actually true (incorrectly reject a true \(H_o\)).
The probability of making a Type I error is the significance level \(\alpha\).
A Type II error is made when we fail to reject the null hypothesis and the null hypothesis is false (incorrectly keep a false \(H_o\)).
The probability of making a Type II error is labeled \(\beta\).
The power of a test is \(1-\beta\).
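As an illustration (not part of the original notes), the power of the lower-tailed z-test in the tomato example can be computed directly, assuming \(\sigma = 5\) g, \(n = 4\), \(\alpha = 0.05\), and a hypothetical true mean of 222 g:

# Power of the lower-tailed one-sample z-test
mu0    <- 227                          # mean under H_o
mu1    <- 222                          # hypothetical true mean (assumed for illustration)
sigma  <- 5
n      <- 4
alpha  <- 0.05
se     <- sigma / sqrt(n)
cutoff <- mu0 + qnorm(alpha) * se      # reject H_o if the sample mean falls below this
power  <- pnorm((cutoff - mu1) / se)   # P(reject H_o | mu = mu1)
power                                  # roughly 0.64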