Study and Job Info: R Programming (SHAPE OF DATA AND DESCRIBING RELATIONSHIPS)

Introduction

Building high-performing machine learning algorithms depends on identifying the relationships between variables.

Create Table:

tab <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))

dimnames(tab) <- list(GATES = c("B.Tech", "M.Tech"),Branch = c("CSE", "CSE (AI)", "CSE(DS)"))

tab

Output:

        Branch
GATES    CSE CSE (AI) CSE(DS)
  B.Tech 762      327     468
  M.Tech 484      239     477

This approach uses the function rbind() to build a matrix row by row. The as.table() function lets R know that the matrix represents a contingency table of counts, and dimnames() labels its rows and columns.
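As a rough sketch of an alternative, the same counts can be entered with matrix(byrow = TRUE) and then summarised with addmargins() and prop.table(). The object names counts, tab2, tab_margins and row_props are only illustrative and do not come from the original post.

```r
# Same counts as above, entered row by row with matrix(byrow = TRUE)
counts <- matrix(c(762, 327, 468,
                   484, 239, 477),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(GATES  = c("B.Tech", "M.Tech"),
                                 Branch = c("CSE", "CSE (AI)", "CSE(DS)")))
tab2 <- as.table(counts)

# Add row and column totals (the extra margin is labelled "Sum")
tab_margins <- addmargins(tab2)
print(tab_margins)

# Proportion of each branch within a row (each row sums to 1)
row_props <- prop.table(tab2, margin = 1)
print(round(row_props, 3))
```

Row proportions make the two rows comparable even though their totals differ.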

Charts/ Plots:

The functions plot(), points(), lines(), text(), mtext(), axis(), identify() etc. form a suite that plots points, lines and text.

Function    Graph type
plot        Scatter plots and various others
barplot     Bar plots (including stacked and grouped bar plots)
hist        Histograms and (relative) frequency diagrams
curve       Curves of mathematical expressions
pie         Pie charts (for less scientific uses)
boxplot     Box-and-whisker plots
symbols     Like a scatter plot, but symbols are sized by another variable

Visualize the data using Bar chart and box plot:

R uses the function barplot() to create bar charts. Both vertical and horizontal bars can be drawn.

Syntax:

barplot(H, xlab, ylab, main, names.arg, col)

Parameters:

- H: a vector or matrix of numeric values to be drawn as bars.

- xlab: the label for the x-axis.

- ylab: the label for the y-axis.

- main: the title of the bar chart.

- names.arg: a vector of names to appear under each bar.

- col: the color (or vector of colors) used to fill the bars.

# Create the data for the chart

A <- c(17, 32, 8, 53, 1)

# Plot the bar chart

barplot(A, xlab = "X-axis", ylab = "Y-axis", main ="Bar-Chart")

Horizontal Bar Chart:

Approach: To create a horizontal bar chart:

1. Pass the same arguments that a simple bar chart requires.

2. Add the argument horiz = TRUE to draw the bars horizontally.

barplot(A, horiz = TRUE)

Adding Label, Title and Color in the Bar Chart

Labels, a title and colors can be added to a bar chart by passing the corresponding arguments to barplot(). In the calls below, title_name, x_label_name, y_label_name and color_name are placeholders for your own values.

Approach:

1. To add a title to the bar chart:

barplot(A, main = title_name)

2. To label the x-axis and y-axis:

barplot(A, xlab = x_label_name, ylab = y_label_name)

3. To set the color of the bars:

barplot(A, col = color_name)

# Create the data for the chart

A <- c(17, 2, 8, 13, 1, 22)

B <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")

# Plot the bar chart

barplot(A, names.arg = B, xlab = "Month",
        ylab = "Articles", col = "green",
        main = "GATES Students-Attendance chart")
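The heading of this part also mentions grouped bars and box plots, so here is a minimal sketch of both. It reuses the counts from the contingency table at the top of the post and the built-in mtcars data. The axis labels, titles, colors and the names mids and bp are assumptions made for illustration.

```r
# Counts from the contingency table at the top of the post
counts <- rbind(c(762, 327, 468), c(484, 239, 477))
dimnames(counts) <- list(c("B.Tech", "M.Tech"),
                         c("CSE", "CSE (AI)", "CSE(DS)"))

# Grouped bar chart: passing a matrix with beside = TRUE draws one
# cluster per column and one bar per row
mids <- barplot(counts, beside = TRUE,
                col = c("steelblue", "orange"),
                legend.text = rownames(counts),
                xlab = "Branch", ylab = "Count",
                main = "Counts by branch")

# Box plot: distribution of mpg for each cylinder count in mtcars
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Cylinders", ylab = "Miles per gallon",
              main = "Mileage by cylinder count")
```

Leaving out beside = TRUE would stack the two rows in a single bar per branch instead of placing them side by side.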

(ii) Scatter plot:

A scatterplot shows many points plotted in the Cartesian plane. Each point represents the values of two variables: one variable is placed on the horizontal axis and the other on the vertical axis. A simple scatterplot is created using the plot() function.

Syntax

plot(x, y, main, xlab, ylab, xlim, ylim, axes)

Following is the description of the parameters used:

- x is the data set whose values are the horizontal coordinates.

- y is the data set whose values are the vertical coordinates.

- main is the title of the graph.

- xlab is the label of the horizontal axis.

- ylab is the label of the vertical axis.

- xlim is the range of x values shown in the plot.

- ylim is the range of y values shown in the plot.

- axes indicates whether both axes should be drawn on the plot.

Example

We use the data set "mtcars" available in the R environment to create a basic scatterplot. Let's use the columns "wt" and "mpg" in mtcars.

input <- mtcars[,c('wt','mpg')]

print(head(input))

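Output:

                     wt  mpg
Mazda RX4         2.620 21.0
Mazda RX4 Wag     2.875 21.0
Datsun 710        2.320 22.8
Hornet 4 Drive    3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant           3.460 18.1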
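The example above only prepares the data. A minimal sketch of the actual plot() call could look like this. The axis labels, title and axis limits are assumptions picked to fit the mtcars values, not settings from the original post.

```r
# Select the two columns used for the scatter plot
input <- mtcars[, c("wt", "mpg")]

# Weight on the horizontal axis, mileage on the vertical axis
plot(x = input$wt, y = input$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     xlim = c(1.5, 5.5), ylim = c(10, 35),
     main = "Weight vs mileage")
```

The points fall from upper left to lower right, which suggests that heavier cars tend to have lower mileage.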

(b). Univariate data, measures of central tendency, frequency distributions, variation, and Shape.

Central tendency is one of the core ideas of descriptive statistics. A measure of central tendency tells us the value around which the observations in a dataset cluster.

There are 3 main measures of central tendency. They are as follows:

(i) Mean

(ii) Median

(iii) Mode

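(i) Mean:

The mean is the average value of the dataset. Mathematically, the mean is the sum of the observations (∑X) divided by the total number of observations (n). It is denoted by x̄.

Syntax of mean is:

mean(x, trim = 0, na.rm = FALSE)

Here x is the data vector, trim is the fraction (0 to 0.5) of observations removed from each end of the sorted data before averaging, and na.rm = TRUE drops NA values first. R is case-sensitive, so the function must be written as mean(), not Mean().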

# Let's create a vector.

x <- c(1,3,5,7,4,9,2)

my_mean <- mean(x)

print(my_mean)

Output: [1] 4.428571
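To illustrate the trim and na.rm arguments, the following sketch extends the same vector with a missing value and with one extreme value. The names x_na, x_out and the result variables are made up for this illustration.

```r
x <- c(1, 3, 5, 7, 4, 9, 2)

# A missing value makes the plain mean NA unless na.rm = TRUE
x_na <- c(x, NA)
mean_with_na <- mean(x_na)               # NA
mean_no_na   <- mean(x_na, na.rm = TRUE) # same as mean(x)

# One extreme value pulls the plain mean upwards
x_out <- c(x, 100)
mean_plain <- mean(x_out)                # 131 / 8 = 16.375

# trim = 0.2 removes floor(8 * 0.2) = 1 value from each end of the
# sorted data (here 1 and 100) and averages 2, 3, 4, 5, 7, 9
mean_trimmed <- mean(x_out, trim = 0.2)  # 30 / 6 = 5

print(c(plain = mean_plain, trimmed = mean_trimmed))
```

The trimmed mean stays close to the bulk of the data, while the plain mean is dominated by the single value 100.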

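Large Dataset: (By reading Excel sheet data)

# read_excel() comes from the readxl package

library(readxl)

Avg_marks <- read_excel("C:/Users/Gates/Data Science/IIStudents.xls")

print(Avg_marks)

# Store the result under a new name so it does not mask mean()

marks_mean <- mean(Avg_marks$Marks)

print(marks_mean)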

Output: 87.12

(ii) Median:

The median is another measure of central tendency. It is the middle value of the sorted dataset, i.e., it splits the dataset into two halves. When the number of observations is even, the median is the average of the two middle values.

Syntax of the median is:

median(x, na.rm = FALSE)

Parameters are:

x = data vector

na.rm = if TRUE, NA values are removed from x before the median is computed

Example :

# Let's create a vector.

x <- c(16,23,52,27,43,39,12,38,66,10,15,25,14,73,54,62)

median(x)

Output: 32.5
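A small sketch of how median() treats odd and even lengths and missing values. The vectors odd_x and even_x are made up for this illustration.

```r
odd_x  <- c(7, 1, 5)       # sorted: 1 5 7   -> middle value is 5
even_x <- c(7, 1, 5, 3)    # sorted: 1 3 5 7 -> average of 3 and 5

med_odd  <- median(odd_x)   # 5
med_even <- median(even_x)  # 4

# With an NA present the result is NA unless na.rm = TRUE
med_na    <- median(c(odd_x, NA))
med_clean <- median(c(odd_x, NA), na.rm = TRUE)  # 5

print(c(med_odd, med_even, med_na, med_clean))
```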

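Large Dataset: (By reading Excel sheet data)

# read_excel() comes from the readxl package

library(readxl)

Avg_marks <- read_excel("C:/Users/Gates/Data Science/IIStudents.xls")

print(Avg_marks)

# Store the result under a new name so it does not mask median()

marks_median <- median(Avg_marks$Marks)

print(marks_median)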

Output: 68

(iii) Mode:

The mode is the most frequently occurring value in a dataset. A dataset has no mode when every value occurs with the same frequency.

A dataset can have more than one mode when two or more values share the highest frequency.

R has no built-in function for the statistical mode (the base function mode() reports an object's storage mode instead).

Hence, we can write our own function for finding the mode, or use the modeest package.

Example 1: Single Value Mode

# defining vector

x <- c(13, 27, 51, 43, 28, 21, 44,

23, 44, 56, 78, 65, 56,

23, 44, 44, 37, 45, 71, 44, 98)

y <- table(x)

print(y)

# Mode of x

m <- names(y)[which(y == max(y))]

print(m)

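Output: [1] "44"

The mode is returned as a character string because names() of a table are always character; wrap the result in as.numeric() if a number is needed.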
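The same idea can be wrapped in a reusable function. This is only one possible sketch: the name stat_mode and its conventions (return every tied mode, return NULL when all values are equally frequent, assume a non-empty input) are assumptions, not part of base R or of the original post.

```r
# Hypothetical helper: returns all modes of a non-empty vector,
# or NULL when every distinct value occurs equally often
stat_mode <- function(v) {
  counts <- table(v)
  # More than one distinct value and all frequencies equal: no mode
  if (length(counts) > 1 && all(counts == counts[1])) {
    return(NULL)
  }
  modes <- names(counts)[counts == max(counts)]
  # names() are character, so convert back for numeric input
  if (is.numeric(v)) as.numeric(modes) else modes
}

x <- c(13, 27, 51, 43, 28, 21, 44,
       23, 44, 56, 78, 65, 56,
       23, 44, 44, 37, 45, 71, 44, 98)

single_mode <- stat_mode(x)                 # 44
two_modes   <- stat_mode(c(1, 2, 2, 3, 3))  # 2 and 3
no_mode     <- stat_mode(c(1, 2, 3))        # NULL

print(single_mode)
print(two_modes)
```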

Frequency Distribution :

The frequency distribution of a variable is a summary of how often its values occur in a collection of non-overlapping categories.

The table() function in R computes the frequency counts of the values in a vector, such as one column of a data frame. The result prints as a two-row structure, where the first row lists the distinct values and the second row lists their frequencies.

freq_table <- table(x), where x is a vector or a column of a data.table object

The cumulative frequency of a class is the sum of the frequencies of that class and all classes below it in the frequency distribution table. The cumsum() function can be used to calculate this.

cumsum(freq_table)
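Before the larger example, here is a minimal sketch of the two calls on a small hand-written vector. The values in x are made up, and prop.table() is added as one way to get relative frequencies.

```r
# Hypothetical data: eight observations with four distinct values
x <- c(2, 3, 3, 5, 2, 3, 4, 3)

# Absolute frequencies: 2 occurs twice, 3 four times, 4 and 5 once
freq_table <- table(x)
print(freq_table)

# Cumulative frequencies: 2, 6, 7, 8
cum_freq <- cumsum(freq_table)
print(cum_freq)

# Relative frequencies (proportions that sum to 1)
rel_freq <- prop.table(freq_table)
print(rel_freq)
```

The last cumulative frequency always equals the number of observations, which is a quick sanity check.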

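# Load data.table and create a data.table

# col1 is sampled at random, so its counts differ from run to run;

# col2 has 3 values and is recycled across the 9 rows

library(data.table)

data_table <- data.table(col1 = sample(6:9, 9, replace = TRUE),
                         col2 = letters[1:3],
                         col3 = c(1, 4, 1, 2, 2, 2, 1, 2, 2))

print("Original DataFrame")

print(data_table)

# Frequency of each value in col1

freq <- table(data_table$col1)

print("Modified Frequency Table")

print(freq)

print("Cumulative Frequency Table")

# Named cum_freq so the result does not mask the cumsum() function

cum_freq <- cumsum(freq)

print(cum_freq)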

Output:

[1] "Original DataFrame"
   col1 col2 col3
1:    9    a    1
2:    6    b    4
3:    6    c    1
4:    6    a    2
5:    8    b    2
6:    7    c    2
7:    6    a    1
8:    6    b    2
9:    8    c    2
[1] "Modified Frequency Table"

6 7 8 9
5 1 2 1
[1] "Cumulative Frequency Table"
6 7 8 9
5 6 8 9
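The heading of this part also lists variation and shape. As a hedged sketch, the base R functions range(), var(), sd() and IQR() cover variation. Base R has no skewness or kurtosis function, so the moment-based formulas below are one common convention (dividing by n), chosen here as an assumption. The vector is the one used in the median example.

```r
x <- c(16, 23, 52, 27, 43, 39, 12, 38, 66, 10, 15, 25, 14, 73, 54, 62)

# Measures of variation
x_range <- range(x)   # smallest and largest value
x_var   <- var(x)     # sample variance (divides by n - 1)
x_sd    <- sd(x)      # sample standard deviation, sqrt of the variance
x_iqr   <- IQR(x)     # distance between the 1st and 3rd quartile

# Measures of shape from central moments (population convention)
m  <- mean(x)
m2 <- mean((x - m)^2)
m3 <- mean((x - m)^3)
m4 <- mean((x - m)^4)
skewness <- m3 / m2^1.5   # > 0: longer right tail, < 0: longer left tail
kurtosis <- m4 / m2^2     # about 3 for a normal distribution

print(c(sd = x_sd, iqr = x_iqr, skewness = skewness, kurtosis = kurtosis))
```

Packages such as moments or e1071 offer ready-made skewness and kurtosis functions, possibly with slightly different conventions.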

(d) Relationship between two continuous variables – covariance, correlation coefficients, comparing multiple correlations

Covariance and Correlation are terms used in statistics to measure relationships between two random variables. Both of these terms measure linear dependency between a pair of random variables or bivariate data.

This section covers the cov() and cor() functions.

Covariance in R Programming:

Covariance is measured using the cov() function.

Covariance is a statistical measure of the direction of the linear relationship between two data vectors: a positive value means they tend to increase together, and a negative value means one tends to decrease as the other increases.

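Mathematically, the sample covariance that cov() computes can be represented as

$$\operatorname{Cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

where

x_i represents the i-th value of the x data vector

y_i represents the i-th value of the y data vector

x̄ represents the mean of the x data vector

ȳ represents the mean of the y data vector

N represents the total number of observations

Dividing by N - 1 rather than N gives the sample covariance, which is what cov() returns.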

Covariance Syntax in R

Syntax: cov(x, y, method)

where,

x and y represent the data vectors

method defines the type of method used to compute the covariance. The default is "pearson"; "kendall" and "spearman" are also accepted.

# Data vectors

x <- c(1, 3, 5, 10)

y <- c(2, 4, 6, 20)

# Print covariance (both calls use the default Pearson method)

print(cov(x, y))

print(cov(x, y, method = "pearson"))

Output:

[1] 30.66667

[1] 30.66667
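To see where 30.66667 comes from, the covariance can be computed by hand. This sketch uses the same vectors; the names cross, cov_sample and cov_population are illustrative only.

```r
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)
n <- length(x)

# Sum of the products of deviations from the means
cross <- sum((x - mean(x)) * (y - mean(y)))   # 92

# cov() divides by n - 1 (sample covariance)
cov_sample <- cross / (n - 1)                 # 92 / 3

# Dividing by n instead gives the population covariance
cov_population <- cross / n                   # 92 / 4 = 23

print(cov_sample)
print(all.equal(cov_sample, cov(x, y)))       # TRUE
```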

Correlation in R Programming:

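The cor() function in R computes the correlation coefficient. Correlation standardizes the covariance by the spread of each vector, so it measures how strongly the two vectors are linearly related on a scale from -1 to 1.

Mathematically, the Pearson correlation coefficient can be represented as

$$\operatorname{Corr}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \, \sum_{i=1}^{N} (y_i - \bar{y})^2}}$$

Correlation Syntax in R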

Syntax: cor(x, y, method)

where,

- x and y represent the data vectors

- method defines the type of method used to compute the correlation. The default is "pearson"; "kendall" and "spearman" are also accepted.

# Data vectors

x <- c(1, 3, 5, 10)

y <- c(2, 4, 6, 20)

# Print correlation (both calls use the default Pearson method)

print(cor(x, y))

print(cor(x, y, method = "pearson"))

Output:

[1] 0.9724702

[1] 0.9724702
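The following sketch ties correlation back to covariance, tries the rank-based methods, and compares several correlations at once with a correlation matrix. The choice of the mtcars columns mpg, wt and hp is an assumption made for illustration.

```r
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)

# Pearson correlation is the covariance scaled by both standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))
print(all.equal(r_manual, cor(x, y)))        # TRUE

# Rank-based alternatives: x and y increase together at every step,
# so both coefficients equal 1
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")

# Comparing multiple correlations: cor() on a data frame returns
# the matrix of all pairwise correlations
cor_mat <- cor(mtcars[, c("mpg", "wt", "hp")])
print(round(cor_mat, 2))
```

In the matrix, the sign and size of each entry can be compared directly because every coefficient lies between -1 and 1.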


Tuesday, 27 June 2023
