4 PROBABILITY DISTRIBUTIONS
a. Sampling from distributions: Binomial distribution, normal distribution
b. t-test, z-test, Chi-Square test
c. Density functions
d. Data Visualization using ggplot: Box plot, histograms, scatter plot, line chart, bar chart, heat maps
a. Sampling from distributions: Binomial distribution, normal distribution
We will look at the sampling distribution of the sample mean to see how the Central Limit Theorem works. We will start with a uniform distribution.
unif <- c(1:8)
unif
Output:
[1] 1 2 3 4 5 6 7 8
mean(unif)
Output:
4.5
# Standard deviation of our uniform distribution
sd(unif)
Output:
2.44949
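To see the Central Limit Theorem at work on this distribution, one option is to draw many samples from unif with replacement and look at the distribution of their means. The sketch below is only an illustration: the sample size of 30, the 5000 repetitions and the seed are arbitrary assumptions, not values from the text.

```r
# Population: the discrete uniform values 1..8 used above
unif <- 1:8

set.seed(42)   # arbitrary seed, only for reproducibility
n    <- 30     # assumed sample size
reps <- 5000   # assumed number of repeated samples

# Draw `reps` samples of size n with replacement and keep each sample mean
sample_means <- replicate(reps, mean(sample(unif, size = n, replace = TRUE)))

# The sample means should centre on the population mean (4.5)
mean(sample_means)

# Their spread should be close to sigma / sqrt(n), where sigma is the
# population standard deviation (divides by N, not N - 1 as sd() does)
sigma <- sqrt(mean((unif - mean(unif))^2))
c(simulated = sd(sample_means), theoretical = sigma / sqrt(n))

# Histogram of the sample means: roughly bell-shaped although unif is flat
hist(sample_means,
     main = "Sampling distribution of the mean",
     xlab = "Sample mean")
```

Note that sd(unif) above uses the sample formula (N - 1 in the denominator), so it is slightly larger than the population value used for the theoretical spread here.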
(i) Binomial Distribution
To create the binomial probability distribution, we will use the function dbinom(x, size, prob), where
x = vector of numbers of successes,
size = number of trials (the sample size),
prob = probability of success on each trial.
For graphing, we will use the function plot(x, y, type = "h"), where x = vector of numbers of successes, y = dbinom() and type = "h" draws histogram-like vertical lines.
# Sample Size of 10
success <- c(0:10)
plot(success, dbinom(success, size = 10, prob = 0.25),
     type = "h",
     main = "Sample Size of 10, p = 0.25",
     xlab = "Number of Successes",
     ylab = "Probability of Success",
     lwd = 3)
Output: [plot of the binomial probabilities for sample size 10, p = 0.25]
Example: What is the probability of getting three or fewer "2s" in eight rolls of a die? # P(X <= 3)
Code: pbinom(3, 8, 1/6)
Output: 0.9693436
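As a cross-check on the example above, the cumulative probability from pbinom() should equal the sum of the individual dbinom() probabilities for 0 to 3 successes. The sketch below also draws random binomial samples with rbinom(); the seed and the number of simulated experiments (10000) are assumptions chosen only for illustration.

```r
# P(X <= 3) for 8 rolls with success probability 1/6, computed two ways
p_cdf <- pbinom(3, size = 8, prob = 1/6)
p_sum <- sum(dbinom(0:3, size = 8, prob = 1/6))
all.equal(p_cdf, p_sum)

# Sampling: simulate 10000 experiments of 8 rolls each and count the "2s"
set.seed(1)  # arbitrary seed, only for reproducibility
rolls <- rbinom(10000, size = 8, prob = 1/6)

# The share of experiments with three or fewer "2s" should be near p_cdf
mean(rolls <= 3)
```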
(ii) Normal Distribution
The standard normal distribution has a mean of 0 and a standard deviation of 1. Its curve is bell-shaped, symmetric and unimodal.
To calculate probabilities or tail areas of a normal distribution, use the function pnorm(q, mean, sd, lower.tail), where q is a vector of quantiles. With lower.tail = TRUE (the default), pnorm returns the area to the left of q, that is P(X <= q).
Example: On the normal curve with a mean of 0 and a standard deviation of 1, the area to the left of 0 is 0.5.
R Code:
pnorm(0, 0, 1)
Output: 0.5
Example: The heights of adult men in the United States are approximately normally distributed with a mean of 70 inches and a standard deviation of 3 inches. A man is randomly selected, and his height is 72 inches. What percentile is he in?
pnorm(72, mean = 70, sd = 3)
Output: 0.7475075
He is taller than about 75% of adult men, so he is at roughly the 75th percentile.
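The height example can be explored a little further. The sketch below shows the same probability via a z-score, the inverse lookup with qnorm(), the upper tail with lower.tail = FALSE, and a simulated check with rnorm(). The seed and the 10000 simulated heights are assumptions for illustration only.

```r
# Lower-tail probability for a height of 72 inches
p <- pnorm(72, mean = 70, sd = 3)

# Same probability from the z-score on the standard normal curve
z <- (72 - 70) / 3
all.equal(p, pnorm(z))

# qnorm() is the inverse of pnorm(): it recovers the height from the percentile
qnorm(p, mean = 70, sd = 3)

# Upper tail: the share of men taller than 72 inches
pnorm(72, mean = 70, sd = 3, lower.tail = FALSE)

# Sampling from the normal distribution and comparing with pnorm()
set.seed(7)  # arbitrary seed, only for reproducibility
heights <- rnorm(10000, mean = 70, sd = 3)
mean(heights <= 72)
```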
(b) t-test, z-test, Chi-Square test
The Chi-Square test is a statistical method to determine whether two categorical variables have a significant association. Both variables should come from the same population and should be categorical, such as Yes/No, Male/Female, Red/Green, etc.
Syntax
The function used for performing the Chi-Square test is chisq.test().
R Programming: chisq.test(data)
We will read the Results data set, which records the marks students obtained and their backlogs.
# Used to read table data from CSV
library(data.table)
res.data1 <- read.csv("Results.csv")
View(res.data1)
# Read the two columns of interest into a table
res.data <- fread("Results.csv", select = c("fObt", "fBacklogs"))
# Perform the Chi-Square test.
print(chisq.test(res.data))
Output:
print(chisq.test(res.data1))
        Pearson's Chi-squared test
data:  res.data1
X-squared = 1354.4, df = 202, p-value < 2.2e-16
Warning message:
In chisq.test(res.data1) : Chi-squared approximation may be incorrect
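Conclusion
The result shows a p-value of less than 0.05, which indicates a statistically significant association between the variables. The warning means that some expected counts are small, so the chi-squared approximation should be interpreted with caution.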
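Since Results.csv is not available here, the same test can be tried on a contingency table from a data set that ships with R. The sketch below uses the built-in HairEyeColor table, which is not part of the original exercise, and collapses it to hair colour by eye colour. For raw columns in a data frame, table(x, y) would build such a table first.

```r
# Collapse the built-in 3-way table (Hair x Eye x Sex) to a 2-way Hair x Eye table
hair_eye <- margin.table(HairEyeColor, margin = c(1, 2))
hair_eye

# Pearson's Chi-squared test of independence between hair and eye colour
res <- chisq.test(hair_eye)
res

# Degrees of freedom: (rows - 1) * (columns - 1)
res$parameter

# Expected counts under independence; very small values here are what
# trigger the "approximation may be incorrect" warning
res$expected
```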
Bar graph:
# Load ggplot2
library(ggplot2)
# Create data
data <- data.frame(name = c("A", "B", "C", "D", "E"),
                   value = c(3, 12, 5, 18, 45))
# Barplot
ggplot(data, aes(x = name, y = value)) + geom_bar(stat = "identity") + coord_flip()
Let’s create another data frame.
survey <- data.frame(fruit = c("Apple", "Banana", "Grapes", "Kiwi", "Orange", "Pears"),
                     people = c(40, 50, 30, 15, 35, 20))
# Change the ggplot theme to 'Minimal'
ggplot(survey, aes(x = fruit, y = people, fill = fruit)) +
  geom_bar(stat = "identity") +
  theme_minimal()
Let’s create the survey data frame with groups.
survey <- data.frame(group = rep(c("Men", "Women"), each = 6),
                     fruit = rep(c("Apple", "Kiwi", "Grapes", "Banana", "Pears", "Orange"), 2),
                     people = c(22, 10, 15, 23, 12, 18, 18, 5, 15, 27, 8, 17))
Now you can pass this data frame to the ggplot() function to create a stacked bar graph. Remember to map the categorical variable to fill.
ggplot(survey, aes(x = fruit, y = people, fill = group)) +
  geom_bar(stat = "identity")
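In a stacked bar graph, the height of each bar is the sum over the groups mapped to fill. The sketch below computes those totals with aggregate() and then draws a side-by-side variant using position = "dodge". The dodge variant and the requireNamespace() guard are additions for illustration, not part of the original exercise.

```r
# Grouped survey data, as defined above
survey <- data.frame(group = rep(c("Men", "Women"), each = 6),
                     fruit = rep(c("Apple", "Kiwi", "Grapes", "Banana", "Pears", "Orange"), 2),
                     people = c(22, 10, 15, 23, 12, 18, 18, 5, 15, 27, 8, 17))

# Total height of each stacked bar (Men + Women per fruit)
totals <- aggregate(people ~ fruit, data = survey, FUN = sum)
totals

# Side-by-side bars instead of stacked ones; the guard only makes sure
# the block still runs on a machine without ggplot2 installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(survey, aes(x = fruit, y = people, fill = group)) +
    geom_bar(stat = "identity", position = "dodge")
  print(p)
}
```

The totals per fruit match the people column of the ungrouped survey data frame used earlier.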
Scatter Plot:
# install.packages("ggplot2")
library(ggplot2)
ggplot(cars, aes(x = speed, y = dist, colour = dist)) +
  geom_point(show.legend = FALSE) +
  scale_color_gradient(low = "#67c9ff", high = "#f2bbfc")
Create a Scatter Plot of Multiple Groups
# Group points by 'Species' mapped to color
head(iris)
ggplot(iris, aes(x = Petal.Length, y = Petal.Width, colour = Species)) +
  geom_point()
Plotting the Regression Line
To add a regression line (line of best fit) to the scatter plot, use the stat_smooth() function and specify method = "lm".
ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  stat_smooth(method = "lm")
Output: [scatter plot of iris Petal.Width against Petal.Length with the fitted regression line]
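With method set to lm, the line drawn by stat_smooth() is an ordinary least-squares fit of y on x. To inspect the numbers behind it, the same model can be fitted directly with lm() from base R. This is a minimal sketch, and the petal length of 4 used for the prediction is an arbitrary illustrative value.

```r
# Fit the same straight line that stat_smooth() draws for these aesthetics
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Intercept and slope of the regression line
coef(fit)

# Proportion of variance in Petal.Width explained by Petal.Length
summary(fit)$r.squared

# Predicted petal width for a (hypothetical) petal length of 4
predict(fit, newdata = data.frame(Petal.Length = 4))
```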