Monday 3 July 2023

PROBABILITY DISTRIBUTIONS (R Programming)

 

4 PROBABILITY DISTRIBUTIONS

a. Sampling from distributions – Binomial distribution, normal distribution

b. tTest, zTest, Chi Square test

c. Density functions

d. Data Visualization using ggplot – Box plot, histograms, scatter plot, line chart, bar chart, heat maps.

 

a.      Sampling from distributions – Binomial distribution, normal distribution

We will look at the sampling distribution of the sample mean to see how the Central Limit Theorem works. We will start with a discrete uniform distribution on the integers 1 to 8.

unif <- c(1:8)

unif

 

Output:  [1] 1 2 3 4 5 6 7 8

mean(unif)

Output:  4.5

 

# Standard deviation of our uniform distribution

sd(unif)

Output:  2.44949
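
The values above describe the population only. To see the Central Limit Theorem at work, one possible approach is to draw many samples from unif and look at the distribution of their means. The following is a minimal sketch, not part of the original exercise; the seed, the number of samples and the sample size are assumptions chosen for illustration.

```r
# Population: the integers 1 to 8, each equally likely
unif <- c(1:8)

set.seed(123)          # arbitrary seed, only for reproducibility
n_samples <- 5000      # number of repeated samples (assumed)
sample_size <- 30      # observations per sample (assumed)

# Draw many samples with replacement and record the mean of each one
sample_means <- replicate(n_samples,
                          mean(sample(unif, size = sample_size, replace = TRUE)))

# The centre of the sample means should be close to the population mean, 4.5
mean(sample_means)

# Population SD (divides by n, unlike sd(), which divides by n - 1)
pop_sd <- sqrt(mean((unif - mean(unif))^2))

# The spread of the sample means should be close to pop_sd / sqrt(sample_size)
sd(sample_means)
pop_sd / sqrt(sample_size)

# The histogram of the sample means should look roughly bell-shaped
hist(sample_means,
     main = "Sampling distribution of the mean",
     xlab = "Sample mean")
```

Although the population is flat, the histogram of sample means should be approximately normal, which is what the Central Limit Theorem predicts.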

 

(i)                 Binomial Distribution

To create the binomial probability distribution,

we will use the function, dbinom(x, size, prob)

where x = vector of numbers of successes,

size = size of the sample,

prob = probability of success.

 

For graphing, we will use the function plot(x, y, type = "h"), where x = vector of numbers of successes, y = dbinom() and type = "h" draws histogram-like vertical lines.

 

# Sample Size of 10

success <- c(0:10)

plot(success, dbinom(success, size = 10, prob = 0.25),

     type = "h",

     main = "Sample Size of 10, p = 0.25",

     xlab = "Number of Successes",

     ylab = "Probability",

     lwd = 3)

Output:


Example: What is the probability of getting three or fewer “2s” in eight rolls of a fair die? That is, P(X <= 3).

Code:   pbinom(3, 8, 1/6)

Output: 0.9693436
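
The same probability can be checked by sampling. The sketch below is an illustration rather than part of the original exercise: it uses rbinom() to simulate the eight-roll experiment many times and compares the simulated proportion with pbinom(). The seed and the number of simulations are assumptions.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulate 10,000 experiments of 8 die rolls each, counting the 2s in each
twos <- rbinom(n = 10000, size = 8, prob = 1/6)

# Empirical estimate of P(X <= 3); should be close to the exact value
mean(twos <= 3)

# Exact value: the CDF equals the sum of the point probabilities for 0 to 3
exact <- pbinom(3, size = 8, prob = 1/6)
by_sum <- sum(dbinom(0:3, size = 8, prob = 1/6))
all.equal(exact, by_sum)
```

Here dbinom() gives P(X = x), pbinom() gives P(X <= x), and rbinom() draws random counts from the same distribution.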

(ii)               Normal Distribution

The standard normal distribution has a mean of 0 and a standard deviation of 1. Like every normal distribution, its curve is bell-shaped, symmetric and unimodal as shown below.



To calculate probabilities or tail areas of a normal distribution,

 use the function pnorm(q, mean, sd, lower.tail)

 where q is a vector of quantiles, mean and sd default to 0 and 1,

 and lower.tail = TRUE (the default) returns the area to the left of q.

Example: On the standard normal curve (mean of 0 and standard deviation of 1), the area to the left of 0 is 0.5.

R Code:

pnorm(0, 0, 1)

Output:  0.5

Example:

The heights of adult men in the United States are approximately normally distributed with a mean of 70 inches and a standard deviation of 3 inches.

 A man is randomly selected and his height is 72 inches. In which percentile is he?

pnorm(72, mean = 70, sd = 3)

Output:  0.7475075

He is at about the 75th percentile.
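
The height example can also be explored by sampling from the normal distribution. The following sketch is an illustration only: it draws simulated heights with rnorm(), computes the upper tail with lower.tail = FALSE, and uses qnorm() to go back from a probability to a height. The seed and the number of draws are assumptions.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

# Draw 10,000 simulated heights from a normal distribution (mean 70, sd 3)
heights <- rnorm(n = 10000, mean = 70, sd = 3)

# Share of simulated men no taller than 72 inches;
# should be close to pnorm(72, mean = 70, sd = 3)
mean(heights <= 72)

# Upper tail: probability of being taller than 72 inches
upper <- pnorm(72, mean = 70, sd = 3, lower.tail = FALSE)
upper

# qnorm() inverts pnorm(): the height at that percentile is 72 again
p <- pnorm(72, mean = 70, sd = 3)
qnorm(p, mean = 70, sd = 3)
```

The lower and upper tail areas always add up to 1, so either one can be derived from the other.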

 

(b) tTest, zTest, Chi Square test

 

The Chi-Square test is a statistical method to determine whether two categorical variables have a significant association. Both variables should come from the same population and should be categorical, such as Yes/No, Male/Female or Red/Green.

 Download Data Set (Results.CSV)

Syntax

The function used for performing chi-Square test is chisq.test().

 

R Programming : chisq.test(data)

 

 

We will read the Results data set, which records the marks students obtained and their backlogs.

# data.table provides fread() for reading CSV files

library(data.table)

# Read the whole CSV into a data frame and inspect it

res.data1 <- read.csv("Results.csv")

View(res.data1)



# Read only the marks-obtained and backlogs columns

res.data <- fread("Results.csv", select = c("fObt", "fBacklogs"))



# Perform the Chi-Square test.

print(chisq.test(res.data))

 

Output:

print(chisq.test(res.data1))

 

        Pearson's Chi-squared test

 

data:  res.data1

X-squared = 1354.4, df = 202, p-value < 2.2e-16

 

Warning message:

In chisq.test(res.data1) : Chi-squared approximation may be incorrect

 

Conclusion

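The p-value is less than 0.05, which indicates a statistically significant association between the variables. The warning means that some expected counts are small, so the chi-squared approximation, and therefore the p-value, should be treated with caution.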
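
A common way to test two categorical variables for independence is to cross-tabulate them first and pass the resulting contingency table to chisq.test(). The sketch below illustrates this with the Titanic data set that ships with R rather than Results.csv, so the variable names and results do not relate to the students data. For raw columns, table(x, y) builds the same kind of table.

```r
# Titanic ships with R as a 4-way table (Class, Sex, Age, Survived).
# Collapse it to a 2-way contingency table of Class by Survived.
class_by_survival <- margin.table(Titanic, margin = c(1, 4))
class_by_survival

# Test whether passenger class and survival are independent
result <- chisq.test(class_by_survival)
result

# Expected counts under independence; the chi-squared approximation is
# usually considered reliable when none of these are very small
result$expected
```

Inspecting result$expected is one way to judge whether a "Chi-squared approximation may be incorrect" warning is a concern for a given table.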




(d) Data Visualization using ggplot

Bar graph:

# Load ggplot2

library(ggplot2)

# Create data

data <- data.frame(name = c("A", "B", "C", "D", "E"), value = c(3, 12, 5, 18, 45))

# Horizontal bar plot: coord_flip() swaps the x and y axes

ggplot(data, aes(x = name, y = value)) + geom_bar(stat = "identity") + coord_flip()

Let’s create another data frame:

survey <- data.frame(fruit = c("Apple", "Banana", "Grapes", "Kiwi", "Orange", "Pears"), people = c(40, 50, 30, 15, 35, 20))

# Change the ggplot theme to 'Minimal'

ggplot(survey, aes(x=fruit, y=people, fill=fruit)) +

 geom_bar(stat="identity") +

 theme_minimal()

Let’s create the survey data frame with groups.

survey <- data.frame(group=rep(c("Men", "Women"),each=6),

fruit=rep(c("Apple", "Kiwi", "Grapes", "Banana", "Pears", "Orange"),2),

people=c(22, 10, 15, 23, 12, 18, 18, 5, 15, 27, 8, 17))

Now you can pass this data frame to the ggplot() function to create a stacked bar graph. Remember to map the grouping variable (group) to fill.

 

ggplot(survey, aes(x=fruit, y=people, fill=group)) +

 geom_bar(stat="identity")
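
To see what the stacked bars represent, it can help to look at the numbers behind them. The sketch below uses base R only and is an illustration: it cross-tabulates the same survey data so that each column holds the two segments of one bar, and the column totals give the full bar heights. The name counts is introduced here for the example.

```r
# Same survey data as above
survey <- data.frame(group = rep(c("Men", "Women"), each = 6),
                     fruit = rep(c("Apple", "Kiwi", "Grapes",
                                   "Banana", "Pears", "Orange"), 2),
                     people = c(22, 10, 15, 23, 12, 18, 18, 5, 15, 27, 8, 17))

# One row per group, one column per fruit: the segments of each stacked bar
counts <- xtabs(people ~ group + fruit, data = survey)
counts

# Column totals: the total height of each stacked bar
colSums(counts)
```

To place the groups side by side instead of stacking them, geom_bar() also accepts position = "dodge".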

 

Scatter Plot:

# install.packages("ggplot2")

library(ggplot2)

ggplot(cars, aes(x = speed, y = dist,

 colour = dist)) + geom_point(show.legend = FALSE) +

scale_color_gradient(low = "#67c9ff", high = "#f2bbfc")

Create a Scatter Plot of Multiple Groups

# Group points by 'Species' mapped to color

head(iris)

ggplot(iris, aes(x=Petal.Length, y=Petal.Width, colour=Species)) +

 geom_point()

Plotting the Regression Line

To add a regression line (line of best fit) to the scatter plot,

use the stat_smooth() function and specify method=lm.

ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) + geom_point() + stat_smooth(method=lm)
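
With method=lm, the line drawn is a least-squares fit of y on x. As a rough check, the same fit can be computed directly with lm() to read off its intercept and slope. This is a minimal sketch using base R; the name fit is introduced here for the example.

```r
# Fit the straight line that a least-squares smoother draws:
# a regression of Petal.Width on Petal.Length
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

# Intercept and slope of the fitted line
coef(fit)

# Share of the variation in Petal.Width explained by the line
summary(fit)$r.squared
```

A positive slope here corresponds to the upward-sloping line in the plot.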

Output:


