Introduction
Building high-performing machine learning algorithms depends on identifying the relationships between the variables.
Create Table:
tab <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(tab) <- list(GATES = c("B.Tech", "M.Tech"),
                      Branch = c("CSE", "CSE(AI)", "CSE(DS)"))
tab
Output:
        Branch
GATES    CSE CSE(AI) CSE(DS)
  B.Tech 762     327     468
  M.Tech 484     239     477
This approach uses the rbind() function to build a matrix row by row. The as.table() function tells R that the matrix represents a contingency table of counts.
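As a rough illustration of what the table class makes possible, the sketch below rebuilds the same counts and derives totals and row proportions with the base R functions addmargins() and prop.table(). The variable names totals and row_share are illustrative and do not come from the original example.

```r
# Rebuild the contingency table from the example above
tab <- as.table(rbind(c(762, 327, 468), c(484, 239, 477)))
dimnames(tab) <- list(GATES = c("B.Tech", "M.Tech"),
                      Branch = c("CSE", "CSE(AI)", "CSE(DS)"))

# Append row and column totals (adds a "Sum" row and a "Sum" column)
totals <- addmargins(tab)
print(totals)

# Share of each branch within a programme (each row sums to 1)
row_share <- prop.table(tab, margin = 1)
print(round(row_share, 3))
```

Passing margin = 2 instead would give the share of each programme within a branch.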
Charts/Plots:
The functions plot(), points(), lines(), text(), mtext(), axis(), identify() etc. form a suite that plots points, lines and text.
Function   Graph type
plot       Scatter plots and various others
barplot    Bar plot (including stacked and grouped bar plots)
hist       Histograms and (relative) frequency diagrams
curve      Curves of mathematical expressions
pie        Pie charts (for less scientific uses)
boxplot    Box-and-whisker plots
symbols    Like scatter plot, but symbols are sized by another variable
Visualize the data using a bar chart and box plot:
R uses the function barplot() to create bar charts. Both vertical and horizontal bars can be drawn.
Syntax:
barplot(H, xlab, ylab, main, names.arg, col)
Parameters:
- H: a vector or matrix containing the numeric values used in the bar chart.
- xlab: the label for the x axis.
- ylab: the label for the y axis.
- main: the title of the bar chart.
- names.arg: a vector of names appearing under each bar.
- col: the colors given to the bars in the graph.
# Create the data for the chart
A <- c(17, 32, 8, 53, 1)
# Plot the bar chart
barplot(A, xlab = "X-axis", ylab = "Y-axis", main = "Bar-Chart")
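When H is a matrix rather than a vector, barplot() draws one stacked bar per column, or one group of bars per column when beside = TRUE. The sketch below reuses the counts from the contingency table at the start of this post; the variable names counts and mids are illustrative assumptions.

```r
# Counts from the contingency table earlier in this post
counts <- rbind(B.Tech = c(762, 327, 468),
                M.Tech = c(484, 239, 477))
colnames(counts) <- c("CSE", "CSE(AI)", "CSE(DS)")

# Stacked bars: one bar per branch, one segment per programme
barplot(counts, main = "Stacked", legend.text = TRUE)

# Grouped bars: programmes drawn side by side within each branch
mids <- barplot(counts, beside = TRUE, main = "Grouped", legend.text = TRUE)

# With beside = TRUE, barplot() returns a matrix of bar midpoints,
# one row per programme and one column per branch
print(dim(mids))
```

The returned midpoints are useful for placing text labels above the bars with text().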
Horizontal Bar Chart:
To create a horizontal bar chart:
1. Take all the parameters required to make a simple bar chart.
2. Add the parameter horiz = TRUE to draw the bars horizontally.
barplot(A, horiz = TRUE)
Adding Label, Title and Color in the Bar Chart
Labels, a title and colors are properties of a bar chart that can be set by passing the corresponding arguments to barplot().
Approach:
1. To add a title to the bar chart:
barplot(A, main = title_name)
2. To label the X-axis and Y-axis:
barplot(A, xlab = x_label_name, ylab = y_label_name)
3. To set the color of the bars:
barplot(A, col = color_name)
# Create the data for the chart
A <- c(17, 2, 8, 13, 1, 22)
B <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun")
# Plot the bar chart
barplot(A, names.arg = B, xlab = "Month", ylab = "Articles",
        col = "green", main = "GATES Students-Attendance chart")
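The heading of this part also mentions box plots, which the bar chart examples do not cover. A minimal sketch, assuming the built-in mtcars data set rather than the attendance data used here: boxplot() accepts a formula and draws one box per group. The axis labels and the variable name bp are illustrative.

```r
# Box plot of fuel efficiency grouped by cylinder count (built-in mtcars data)
bp <- boxplot(mpg ~ cyl, data = mtcars,
              xlab = "Number of cylinders",
              ylab = "Miles per gallon",
              main = "Mileage by cylinder count",
              col = "lightblue")

# boxplot() invisibly returns the statistics it drew;
# row 3 of bp$stats holds the median of each group
print(bp$names)
print(bp$stats[3, ])
```

Adding horizontal = TRUE would draw the boxes sideways, mirroring horiz = TRUE in barplot().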
(ii) Scatter plot:
A scatterplot shows many points plotted in the Cartesian plane. Each point represents the values of two variables: one variable is placed on the horizontal axis and the other on the vertical axis. A simple scatterplot is created using the plot() function.
Syntax:
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
The parameters are:
- x is the data set whose values are the horizontal coordinates.
- y is the data set whose values are the vertical coordinates.
- main is the title of the graph.
- xlab is the label on the horizontal axis.
- ylab is the label on the vertical axis.
- xlim is the range of x values used for plotting.
- ylim is the range of y values used for plotting.
- axes indicates whether both axes should be drawn on the plot.
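Example
We use the data set "mtcars", available in the R environment, to create a basic scatterplot. Let's use the columns "wt" and "mpg" in mtcars.
input <- mtcars[, c('wt', 'mpg')]
print(head(input))
Output:
                     wt  mpg
Mazda RX4         2.620 21.0
Mazda RX4 Wag     2.875 21.0
Datsun 710        2.320 22.8
Hornet 4 Drive    3.215 21.4
Hornet Sportabout 3.440 18.7
Valiant           3.460 18.1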
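The example selects the two columns but stops short of drawing them. A minimal sketch of the plotting step follows, using the plot() parameters described in the syntax; the title, axis labels and axis limits are assumptions chosen to cover the range of the mtcars data.

```r
# Select the two columns used in the example
input <- mtcars[, c("wt", "mpg")]

# Draw weight against mileage; the limits are chosen to cover the data range
plot(x = input$wt, y = input$mpg,
     xlab = "Weight (1000 lbs)",
     ylab = "Miles per gallon",
     xlim = c(1, 6), ylim = c(10, 35),
     main = "Weight vs mileage")
```

The downward drift of the points suggests that heavier cars tend to have lower mileage, which the correlation section later in this post can quantify.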
(b) Univariate data: measures of central tendency, frequency distributions, variation, and shape.
Central tendency (CT) is one of the features of descriptive statistics. It describes how the values in a dataset cluster around the central value of the distribution.
There are 3 main measures of central tendency:
(i) Mean
(ii) Median
(iii) Mode
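(i) Mean:
The mean is the average value of the dataset. Mathematically, the mean is the sum of the observations (∑X) divided by the total number of observations (n). It is denoted by x̄.
Syntax of mean is:
mean(x, trim = 0, na.rm = FALSE)
Parameters are:
x = data vector
trim = fraction (0 to 0.5) of observations dropped from each end of x before the mean is computed
na.rm = If TRUE then it removes the NA values from x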
# Let's create a vector.
x <- c(1, 3, 5, 7, 4, 9, 2)
my_mean <- mean(x)
print(my_mean)
Output:
[1] 4.428571
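The optional arguments of mean() matter as soon as the data contain missing or extreme values. The sketch below extends the vector above with an NA and with one large value; the names x_na and x_out are illustrative and not part of the original example.

```r
x <- c(1, 3, 5, 7, 4, 9, 2)

# With a missing value, mean() returns NA unless na.rm = TRUE
x_na <- c(x, NA)
print(mean(x_na))                 # NA
print(mean(x_na, na.rm = TRUE))   # same result as mean(x)

# trim = 0.2 drops the given fraction of observations from each end
# of the sorted data before averaging, which dampens extreme values
x_out <- c(x, 100)                # add one extreme value
print(mean(x_out))
print(mean(x_out, trim = 0.2))
```

The trimmed mean stays close to the bulk of the data, while the plain mean is pulled upward by the single value of 100.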
Large Dataset: (By reading excel sheet data)
library(readxl)
Avg_marks <- read_excel("C:/Users/Gates/Data Science/IIStudents.xls")
print(Avg_marks)
mean_marks <- mean(Avg_marks$Marks)
print(mean_marks)
Output: 87.12
(ii) Median:
The median is another measure of central tendency. It is the middle value of the sorted dataset, i.e., it splits the dataset into 2 halves.
Syntax of the median is:
median(x, na.rm = FALSE)
Parameters are:
x = data vector
na.rm = If TRUE then it removes the NA values from x
Example:
# Let's create a vector.
x <- c(16, 23, 52, 27, 43, 39, 12, 38, 66, 10, 15, 25, 14, 73, 54, 62)
median(x)
Output:
[1] 32.5
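The result 32.5 does not appear in the vector because x has an even number of values, so median() averages the two middle ones (27 and 38). A small sketch with made-up vectors (odd, even and with_na are illustrative names) shows both cases and the effect of na.rm:

```r
# Odd number of values: the median is the single middle value
odd <- c(7, 1, 5)
print(median(odd))

# Even number of values: the median is the average of the two middle values
even <- c(7, 1, 5, 3)
print(median(even))

# Missing values propagate unless na.rm = TRUE
with_na <- c(even, NA)
print(median(with_na))
print(median(with_na, na.rm = TRUE))
```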
Large Dataset: (By reading excel sheet data)
library(readxl)
Avg_marks <- read_excel("C:/Users/Gates/Data Science/IIStudents.xls")
print(Avg_marks)
median_marks <- median(Avg_marks$Marks)
print(median_marks)
Output: 68
(iii) Mode:
The mode is the most frequently occurring value in the dataset. A dataset has no mode when every value occurs with the same frequency. It can also have more than one mode, when two or more values share the highest frequency.
R has no built-in function for the statistical mode (the base function mode() reports an object's storage mode instead). Hence, we can create our own function for finding the mode, or use the modeest package.
Example 1: Single Value Mode
# defining vector
x <- c(13, 27, 51, 43, 28, 21, 44,
       23, 44, 56, 78, 65, 56,
       23, 44, 44, 37, 45, 71, 44, 98)
# frequency of each value
y <- table(x)
print(y)
# Mode of x: the value(s) with the highest frequency
m <- names(y)[which(y == max(y))]
print(m)
Output (mode of x):
[1] "44"
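The same idea can be wrapped in a reusable function. The helper below, find_modes(), is a hypothetical name and only a sketch: it returns every value tied for the highest frequency, so it also handles datasets with several modes. Note that when all values are equally frequent it returns all of them rather than reporting "no mode".

```r
# Hypothetical helper: returns every value that occurs most often
find_modes <- function(v) {
  counts <- table(v)
  modes <- names(counts)[counts == max(counts)]
  # table() stores values as character names; convert back for numeric input
  if (is.numeric(v)) as.numeric(modes) else modes
}

print(find_modes(c(13, 27, 44, 23, 44, 56)))   # one mode
print(find_modes(c(1, 1, 2, 2, 3)))            # two modes
print(find_modes(c("a", "b", "b")))            # works for characters too
```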
Frequency Distribution:
The frequency distribution of a data variable is a summary of how often the data occur in a collection of non-overlapping categories.
The table() function in R computes the frequency counts of the values appearing in the specified column of a dataframe. The result is displayed as a two-row structure, where the first row shows the values of the column and the second row shows their corresponding frequencies.
freq_table <- table(x), where x is a column of the data.table object
The cumulative frequency of a class is the sum of its own frequency and the frequencies of all classes below it in the frequency distribution table. The value at any position is therefore the sum of the current value and all previous values. The cumsum() function can be used to calculate this.
cumsum(freq_table)
library(data.table)
# creating a data.table; col1 is sampled at random
data_table <- data.table(col1 = sample(6:9, 9, replace = TRUE),
                         col2 = letters[1:3],
                         col3 = c(1, 4, 1, 2, 2, 2, 1, 2, 2))
print("Original DataFrame")
print(data_table)
# frequency of each value in col1
freq <- table(data_table$col1)
print("Modified Frequency Table")
print(freq)
print("Cumulative Frequency Table")
# running total of the frequencies
cum_freq <- cumsum(freq)
print(cum_freq)
Output (col1 is sampled at random, so the values vary between runs):
[1] "Original DataFrame"
   col1 col2 col3
1:    9    a    1
2:    6    b    4
3:    6    c    1
4:    6    a    2
5:    8    b    2
6:    7    c    2
7:    6    a    1
8:    6    b    2
9:    8    c    2
[1] "Modified Frequency Table"
6 7 8 9
5 1 2 1
[1] "Cumulative Frequency Table"
6 7 8 9
5 6 8 9
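Frequencies are often easier to compare as proportions. The sketch below hard-codes the col1 values from the output above so that it is reproducible, then derives relative and cumulative relative frequencies with the base R function prop.table(); the names rel_freq and cum_rel are illustrative.

```r
# Fixed vector mirroring col1 in the output above
col1 <- c(9, 6, 6, 6, 8, 7, 6, 6, 8)

freq <- table(col1)            # absolute frequencies
rel_freq <- prop.table(freq)   # proportions that sum to 1
cum_rel <- cumsum(rel_freq)    # running total of the proportions

print(round(rel_freq, 3))
print(round(cum_rel, 3))
```

The last entry of the cumulative relative frequency is always 1, which is a quick sanity check on the table.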
(d) Relationship between two continuous variables: covariance, correlation coefficients, comparing multiple correlations
Covariance and correlation are terms used in statistics to measure relationships between two random variables. Both measure the linear dependency between a pair of random variables, or bivariate data.
In this exercise, we are going to discuss the cov() and cor() functions.
Covariance in R Programming:
Covariance is a statistical term that measures the direction of the linear relationship between two data vectors. In R, it is computed with the cov() function.
Mathematically, the sample covariance computed by cov() can be represented as
$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
where
x represents the x data vector
y represents the y data vector
x̄ represents the mean of the x data vector
ȳ represents the mean of the y data vector
N represents the total number of observations
Covariance Syntax in R
Syntax: cov(x, y, method)
where,
x and y represent the data vectors
method defines the type of method used to compute the covariance. The default is "pearson".
# Data vectors
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)
# Print covariance with the default method and with "pearson" given explicitly
print(cov(x, y))
print(cov(x, y, method = "pearson"))
Output:
[1] 30.66667
[1] 30.66667
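To see where 30.66667 comes from, the covariance can be computed by hand. The sketch below sums the products of the deviations from each mean and divides by one less than the number of observations, which is the denominator cov() uses; manual_cov is an illustrative name.

```r
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)
n <- length(x)

# Sum of the products of deviations from the means, divided by n - 1
manual_cov <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

print(manual_cov)
print(all.equal(manual_cov, cov(x, y)))
```

Here the deviation products sum to 92, and 92 / 3 gives the value printed by cov().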
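Correlation in R Programming:
The cor() function in R computes the correlation coefficient. Correlation is a statistical measure that scales the covariance by the spread of each variable to show how strongly two vectors are related; its value always lies between -1 and 1.
Mathematically, it can be represented as
$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{\sigma_x \, \sigma_y}$$
where σ_x and σ_y are the standard deviations of the x and y data vectors.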
Correlation in R
Syntax: cor(x, y, method)
where,
- x and y represent the data vectors
- method defines the type of method used to compute the correlation. The default is "pearson".
# Data vectors
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)
# Print correlation with the default method and with "pearson" given explicitly
print(cor(x, y))
print(cor(x, y, method = "pearson"))
Output:
[1] 0.9724702
[1] 0.9724702
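The method argument also accepts "spearman" and "kendall", which work on ranks rather than raw values, and cor() can compare several variables at once when given a data frame. The sketch below is one possible illustration: it rebuilds the Pearson value from cov() and sd(), tries the rank-based methods on the same vectors, and prints a correlation matrix for three mtcars columns chosen here as an assumption.

```r
x <- c(1, 3, 5, 10)
y <- c(2, 4, 6, 20)

# Pearson correlation rebuilt from covariance and standard deviations
manual_cor <- cov(x, y) / (sd(x) * sd(y))
print(all.equal(manual_cor, cor(x, y)))

# Rank-based alternatives; y rises whenever x rises, so both equal 1
print(cor(x, y, method = "spearman"))
print(cor(x, y, method = "kendall"))

# Correlation matrix for several variables at once (built-in mtcars data)
print(round(cor(mtcars[, c("mpg", "wt", "hp")]), 2))
```

The rank-based coefficients reach 1 even though the Pearson value does not, because the relationship is perfectly monotonic but not perfectly linear.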