Study and Job Info: PROBABILITY DISTRIBUTIONS (R Programming)

4 PROBABILITY DISTRIBUTIONS

a. Sampling from distributions – Binomial distribution, normal distribution

b. tTest, zTest, Chi Square test

c. Density functions

d. Data Visualization using ggplot – Box plot, histograms, scatter plot, line chart, bar chart, heat maps.

a. Sampling from distributions – Binomial distribution, normal distribution

We will look at the sampling distribution of the sample mean to see how the Central Limit Theorem works. We will start with a discrete uniform distribution on the values 1 to 8.

unif <- c(1:8)

unif

Output: [1] 1 2 3 4 5 6 7 8

mean(unif)

Output: 4.5

# Standard deviation of our uniform distribution

sd(unif)

Output: 2.44949
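The Central Limit Theorem itself can be illustrated with a short simulation. The sketch below is not part of the original walkthrough: the sample size n = 30, the 5000 repetitions and the seed are arbitrary assumptions. It draws repeated samples (with replacement) from unif and looks at how their means are distributed.

```r
# Simulate the sampling distribution of the sample mean.
# n, reps and the seed are arbitrary illustrative choices (assumptions).
set.seed(42)
unif <- 1:8
n <- 30
reps <- 5000

# Each replicate: draw n values with replacement and take their mean
sample_means <- replicate(reps, mean(sample(unif, size = n, replace = TRUE)))

mean(sample_means)  # should be close to the population mean, 4.5
sd(sample_means)    # should be close to sigma / sqrt(n)

# Population standard deviation of the values 1..8 (divides by N, not N - 1)
sigma <- sqrt(mean((unif - mean(unif))^2))
sigma / sqrt(n)

hist(sample_means, breaks = 30,
     main = "Sampling distribution of the mean (n = 30)",
     xlab = "Sample mean")
```

Although the individual values are uniform, the histogram of the means is roughly bell-shaped. Note that sd(unif) divides by N - 1, so it is slightly larger than the population value sigma used here.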

(i) Binomial Distribution

To create the binomial probability distribution,

we will use the function dbinom(x, size, prob)

where x = vector of success counts,

size = number of trials,

prob = probability of success on each trial.

For graphing, we will use the function plot(x, y, type = "h"), where x = the vector of success counts, y = dbinom() evaluated at x, and type = "h" draws histogram-like vertical lines.

# Sample Size of 10

success <- c(0:10)

plot(success, dbinom(success, size = 10, prob = 0.25),

type = "h",

main = "Sample Size of 10, p = 0.25",

xlab = "Number of Successes",

ylab = "Probability",

lwd = 3)

Output: a plot of vertical lines showing the probability of each number of successes from 0 to 10.

Example: What is the probability of getting three or fewer 2s in eight rolls of a fair die? That is, P(X <= 3) with size = 8 and prob = 1/6.

Code: pbinom(3, 8, 1/6)

Output: 0.9693436
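The heading promises sampling, so here is a minimal sketch of how the same probability could be checked by simulation. The number of simulated experiments (10000) and the seed are assumptions chosen for illustration. rbinom() draws random binomial counts, and the cumulative probability from pbinom() should equal the sum of the individual dbinom() probabilities.

```r
# Draw 10000 simulated "eight rolls" experiments and count the 2s in each.
# The seed and the number of draws are arbitrary illustrative choices.
set.seed(123)
twos <- rbinom(10000, size = 8, prob = 1/6)

# Empirical estimate of P(X <= 3); should be close to the pbinom() value
mean(twos <= 3)

# The exact value two ways: summing the probabilities vs. the cumulative function
exact_sum <- sum(dbinom(0:3, size = 8, prob = 1/6))
exact_cdf <- pbinom(3, size = 8, prob = 1/6)
all.equal(exact_sum, exact_cdf)
```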

(ii) Normal Distribution

The standard normal distribution has a mean of 0 and a standard deviation of 1. Like every normal distribution, its curve is bell-shaped, symmetric and unimodal.

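To calculate probabilities or tail areas of a normal distribution,

use the function pnorm(q, mean, sd, lower.tail)

where q is a vector of quantiles,

and lower.tail = TRUE is the default, which returns the area to the left of q, P(X <= q). Set lower.tail = FALSE for the area to the right.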

Example: On the standard normal curve (mean 0, standard deviation 1), the area to the left of 0 is 0.5.

R Code:

pnorm(0, 0, 1)

Output: 0.5

Example:

The heights of adult men in the United States are approximately normally distributed with a mean of 70 inches and a standard deviation of 3 inches.

A man is randomly selected and his height is 72 inches. What percentile is he in?

pnorm(72, mean = 70, sd = 3)

Output: 0.7475075, so he is at roughly the 75th percentile.
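As a hedged companion to the height example, the sketch below samples from the same normal distribution with rnorm() and uses the related functions qnorm() (quantiles) and dnorm() (density). The sample of 10000 heights and the seed are assumptions made for illustration, not data from the original post.

```r
# Simulated heights; the seed and the number of draws are illustrative assumptions.
set.seed(2023)
heights <- rnorm(10000, mean = 70, sd = 3)

# Share of simulated men at or below 72 inches; should be close to pnorm(72, 70, 3)
mean(heights <= 72)

# Area to the right of 72 inches (upper tail)
upper <- pnorm(72, mean = 70, sd = 3, lower.tail = FALSE)

# qnorm() inverts pnorm(): it returns the height at a given percentile
back <- qnorm(pnorm(72, mean = 70, sd = 3), mean = 70, sd = 3)

# dnorm() gives the height of the density curve, highest at the mean
peak <- dnorm(70, mean = 70, sd = 3)
```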

(b) tTest, zTest, Chi Square test
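The heading names the t-test and z-test, so a minimal sketch of both is given here before the chi-square test. The data are simulated heights, which is an assumption for illustration only. t.test() is the base R function; base R has no built-in z-test, so the z statistic is computed by hand, assuming the population standard deviation (3 inches) is known.

```r
# Hypothetical sample: 40 simulated heights (not real data).
set.seed(1)
heights <- rnorm(40, mean = 70, sd = 3)

# One-sample t-test of H0: mean = 70 (standard deviation estimated from the sample)
t_res <- t.test(heights, mu = 70)
t_res$statistic
t_res$p.value

# One-sample z-test by hand, assuming the population sd is known to be 3
z <- (mean(heights) - 70) / (3 / sqrt(length(heights)))
p_z <- 2 * pnorm(-abs(z))  # two-sided p-value
z
p_z
```

A p-value below the chosen significance level (commonly 0.05) would lead to rejecting the hypothesis that the mean height is 70 inches.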

The chi-square test of independence is a statistical method for determining whether two categorical variables are significantly associated. Both variables should come from the same population and be categorical, for example Yes/No, Male/Female or Red/Green.

Download Data Set (Results.CSV)

Syntax

The function used for performing the chi-square test is chisq.test().

Basic usage: chisq.test(data), where data is a table or matrix of counts.

We will read the Results data set, which contains the marks students obtained and their backlogs.

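# data.table provides fread(); read.csv() is part of base R

library(data.table)

res.data1 <- read.csv("Results.csv")

View(res.data1)  # opens the data viewer (interactive sessions only)

# Read only the marks and backlogs columns into a data.table

res.data <- fread("Results.csv", select = c("fObt", "fBacklogs"))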

# Perform the Chi-Square test.

print(chisq.test(res.data))

Output (this run was on the full data frame res.data1):

print(chisq.test(res.data1))

Pearson's Chi-squared test

data: res.data1

X-squared = 1354.4, df = 202, p-value < 2.2e-16

Warning message:

In chisq.test(res.data1) : Chi-squared approximation may be incorrect

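Conclusion

The p-value is less than 0.05, so we reject the null hypothesis of independence: the result indicates a statistically significant association. The warning means that some expected counts are small, so the chi-squared approximation may be unreliable.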
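When chisq.test() is given a data frame or matrix, it treats the values as a contingency table of counts. For two categorical columns, the usual approach is to cross-tabulate them with table() first. Results.csv is not available here, so the sketch below uses the built-in mtcars data set as a stand-in; the choice of columns (cyl and am) is an assumption for illustration.

```r
# Cross-tabulate two categorical variables from the built-in mtcars data:
# number of cylinders (4, 6, 8) against transmission type (0 = automatic, 1 = manual).
tab <- table(cylinders = mtcars$cyl, transmission = mtcars$am)
tab

# Some expected counts are small here, which triggers the same
# "approximation may be incorrect" warning, so it is suppressed in this sketch.
res <- suppressWarnings(chisq.test(tab))

res$expected    # expected counts under independence
res$parameter   # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value
```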

(d) Data Visualization using ggplot

Bar graph:

# Load ggplot2

library(ggplot2)

# Create data

data <- data.frame(name = c("A", "B", "C", "D", "E"), value = c(3, 12, 5, 18, 45))

# Barplot

ggplot(data, aes(x=name, y=value)) + geom_bar(stat = "identity") + coord_flip()

Let’s create another data frame

survey <- data.frame(fruit = c("Apple", "Banana", "Grapes", "Kiwi", "Orange", "Pears"), people = c(40, 50, 30, 15, 35, 20))

# Change the ggplot theme to 'Minimal'

ggplot(survey, aes(x=fruit, y=people, fill=fruit)) +

geom_bar(stat="identity") +

theme_minimal()

Let’s create the survey data frame with groups.

survey <- data.frame(group=rep(c("Men", "Women"),each=6),

fruit=rep(c("Apple", "Kiwi", "Grapes", "Banana", "Pears", "Orange"),2),

people=c(22, 10, 15, 23, 12, 18, 18, 5, 15, 27, 8, 17))

Now you can pass this data frame to the ggplot() function to create a stacked bar graph. Remember to map the categorical variable to fill.

ggplot(survey, aes(x=fruit, y=people, fill=group)) +

geom_bar(stat="identity")
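A stacked bar shows totals well but makes the two groups harder to compare. As a small variation (a sketch, not part of the original post), the same data can be drawn side by side by setting position = "dodge". It assumes ggplot2 is installed and rebuilds the grouped survey data frame so that it runs on its own.

```r
library(ggplot2)

# Same grouped survey data as above
survey <- data.frame(group = rep(c("Men", "Women"), each = 6),
                     fruit = rep(c("Apple", "Kiwi", "Grapes", "Banana", "Pears", "Orange"), 2),
                     people = c(22, 10, 15, 23, 12, 18, 18, 5, 15, 27, 8, 17))

# position = "dodge" places the bars for each group next to each other
p <- ggplot(survey, aes(x = fruit, y = people, fill = group)) +
  geom_bar(stat = "identity", position = "dodge")

print(p)
```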

Scatter Plot:

# install.packages("ggplot2")

library(ggplot2)

ggplot(cars, aes(x = speed, y = dist,

colour = dist)) + geom_point(show.legend = FALSE) +

scale_color_gradient(low = "#67c9ff", high = "#f2bbfc")

Create a Scatter Plot of Multiple Groups

# Group points by 'Species' mapped to color

head(iris)

ggplot(iris, aes(x=Petal.Length, y=Petal.Width, colour=Species)) +

geom_point()

Plotting the Regression Line

To add a regression line (line of Best-Fit) to the scatter plot,

use stat_smooth() function and specify method=lm.

ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) + geom_point() + stat_smooth(method=lm)
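With method = lm, stat_smooth() fits an ordinary least-squares line to the plotted variables. To see the numbers behind the line, the same model can be fitted directly with lm(), as in this short sketch using the built-in iris data.

```r
# Fit the same straight line that stat_smooth(method = lm) draws
fit <- lm(Petal.Width ~ Petal.Length, data = iris)

coef(fit)                # intercept and slope of the best-fit line
summary(fit)$r.squared   # share of variance explained by the line

# Predicted petal width for a hypothetical petal length of 4 cm
predict(fit, newdata = data.frame(Petal.Length = 4))
```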

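Output: a scatter plot of petal width against petal length with the fitted regression line and its confidence band.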


Monday, 3 July 2023
