
Data analysis using R

Last Updated : 09 Dec, 2022

Data analysis is a subset of data analytics. It is a process in which you define a clear objective, collect the relevant data, preprocess it, analyze it (understand the data and explore it for insights), and then visualize the results. The final step, visualization, matters because it helps people understand what is happening in the firm.

Steps involved in data analysis:

 

The process of data analysis includes all of these steps for a given problem statement. Example: analyze which products sell out quickly and identify the frequent customers of a retail shop.

  • Defining the problem statement – Understand the goal and what needs to be done. In this case, the problem statement is: “Which products sell out most often, and which customers visit the store most often?”
  • Collection of data – Not all of the company’s data is needed; identify the data relevant to the problem. Here the required columns are product ID, customer ID, and date visited.
  • Preprocessing – The data must be cleaned and put into a structured format before analysis. Typical steps:
  1. Remove outliers (noisy data).
  2. Remove irrelevant values from the columns.
  3. Handle missing (null) values: either drop the affected row (tuple) or fill the gap with the mean of that column.
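
The preprocessing steps above can be sketched in a few lines of R. This is a minimal illustration, not part of the original walkthrough: the `visits` data frame, its columns, and the choice of the 1.5 × IQR rule for spotting outliers are all assumptions made for the example.

```r
# Hypothetical toy data: one missing amount and one extreme amount
visits <- data.frame(
  customer_id = c(1, 2, 3, 4, 5, 6),
  amount      = c(20, 25, NA, 22, 24, 500)
)

# Step 1: flag outliers with the 1.5 * IQR rule (missing values are ignored)
q     <- quantile(visits$amount, c(0.25, 0.75), na.rm = TRUE)
fence <- 1.5 * IQR(visits$amount, na.rm = TRUE)
is_outlier <- !is.na(visits$amount) &
  (visits$amount < q[1] - fence | visits$amount > q[2] + fence)
clean_visits <- visits[!is_outlier, ]

# Step 2: fill the remaining missing amounts with the column mean
mean_amount <- mean(clean_visits$amount, na.rm = TRUE)
clean_visits$amount[is.na(clean_visits$amount)] <- mean_amount
```

Removing the outlier before imputing keeps the extreme value from inflating the mean that is used as the fill value.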

Data Analysis using the Titanic dataset

You can download the Titanic dataset (it contains data on real passengers of the Titanic) from here. Save the dataset in the current working directory; we can then start the analysis by getting to know the data.

R




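# Load the Titanic data from the current working directory
titanic <- read.csv("train.csv")
train <- titanic  # later snippets refer to the same data as `train`
head(titanic)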


Output:

  PassengerId Survived Pclass                                         Name    Sex
1         892        0      3                             Kelly, Mr. James   male
2         893        1      3             Wilkes, Mrs. James (Ellen Needs) female
3         894        0      2                    Myles, Mr. Thomas Francis   male
4         895        0      3                             Wirz, Mr. Albert   male
5         896        1      3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female
6         897        0      3                   Svensson, Mr. Johan Cervin   male
   Age SibSp Parch  Ticket    Fare Cabin Embarked
1 34.5     0     0  330911  7.8292              Q
2 47.0     1     0  363272  7.0000              S
3 62.0     0     0  240276  9.6875              Q
4 27.0     0     0  315154  8.6625              S
5 22.0     1     1 3101298 12.2875              S
6 14.0     0     0    7538  9.2250              S

The dataset contains columns such as the passenger’s name, age, and sex, the class they traveled in, and whether they survived. To see the class (data type) of each column, use the sapply() function.

R




sapply(train, class)


Output:

PassengerId    Survived      Pclass        Name         Sex         Age 
  "integer"   "integer"   "integer" "character" "character"   "numeric" 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
  "integer"   "integer" "character"   "numeric" "character" "character" 

Survived is coded as 0 (dead) and 1 (alive). Because it is a category rather than a quantity, we convert it, along with Sex, to a factor using as.factor().

R




train$Survived=as.factor(train$Survived)
train$Sex=as.factor(train$Sex)
sapply(train, class)


Output:

PassengerId    Survived      Pclass        Name         Sex         Age 
  "integer"    "factor"   "integer" "character"    "factor"   "numeric" 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
  "integer"   "integer" "character"   "numeric" "character" "character" 

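If readable labels are preferred over the raw 0/1 codes, `factor()` can attach them during the conversion. The sketch below uses a small made-up vector (`survived_raw`) rather than the real column, so the values are illustrative only.

```r
# Hypothetical mini-sample with the same 0/1 coding as the Survived column
survived_raw <- c(0, 1, 1, 0, 0)

# Map 0 -> "dead" and 1 -> "alive" while converting to a factor
survived <- factor(survived_raw,
                   levels = c(0, 1),
                   labels = c("dead", "alive"))

# Count how many values fall into each labelled level
survival_table <- table(survived)
```

The labelled levels then show up in `summary()` output and plot legends in place of 0 and 1.
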
The summary() function gives an overview of every column: the range and quartiles for numeric columns, the count per level for factors, and the length and type for character columns.

R




summary(train)


Output:

  PassengerId     Survived     Pclass          Name               Sex     
 Min.   : 892.0   0:266    Min.   :1.000   Length:418         female:152  
 1st Qu.: 996.2   1:152    1st Qu.:1.000   Class :character   male  :266  
 Median :1100.5            Median :3.000   Mode  :character               
 Mean   :1100.5            Mean   :2.266                                  
 3rd Qu.:1204.8            3rd Qu.:3.000                                  
 Max.   :1309.0            Max.   :3.000                                  
                                                                          
      Age            SibSp            Parch           Ticket         
 Min.   : 0.17   Min.   :0.0000   Min.   :0.0000   Length:418        
 1st Qu.:21.00   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
 Median :27.00   Median :0.0000   Median :0.0000   Mode  :character  
 Mean   :30.27   Mean   :0.4474   Mean   :0.3923                     
 3rd Qu.:39.00   3rd Qu.:1.0000   3rd Qu.:0.0000                     
 Max.   :76.00   Max.   :8.0000   Max.   :9.0000                     
 NA's   :86                                                          
      Fare            Cabin             Embarked        
 Min.   :  0.000   Length:418         Length:418        
 1st Qu.:  7.896   Class :character   Class :character  
 Median : 14.454   Mode  :character   Mode  :character  
 Mean   : 35.627                                        
 3rd Qu.: 31.500                                        
 Max.   :512.329                                        
 NA's   :1

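For the full training file, train.csv, the summary yields the observations below. (The sample output printed above comes from a 418-row file, so its counts differ from these figures.)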

  • Total passengers:  891
  • The number of total people who survived:  342
  • Number of total people dead:  549
  • Number of males in the titanic:  577
  • Number of females in the titanic:  314
  • Maximum age among all people in titanic:  80
  • Median age:  28
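
Each of these figures can also be computed directly instead of being read off the summary. The sketch below uses a made-up five-row data frame (`passengers`) with the same column names, so its numbers are illustrative and do not reproduce the Titanic totals.

```r
# Hypothetical five-row sample with the same column names as the Titanic data
passengers <- data.frame(
  Survived = c(0, 1, 1, 0, 0),
  Sex      = c("male", "female", "female", "male", "male"),
  Age      = c(22, 38, 26, NA, 35)
)

total_passengers <- nrow(passengers)              # number of rows
survival_counts  <- table(passengers$Survived)    # counts of 0 and 1
sex_counts       <- table(passengers$Sex)         # counts per sex
max_age    <- max(passengers$Age, na.rm = TRUE)   # skip missing ages
median_age <- median(passengers$Age, na.rm = TRUE)
```

`na.rm = TRUE` is needed because `max()` and `median()` return `NA` as soon as the column contains a missing value.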

Preprocessing comes before analysis, so missing values (NA in R) have to be found and removed.

R




sum(is.na(train))


Output:

177
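
A single total does not show which columns the missing values sit in. A per-column count, sketched here on a small made-up data frame (`sample_df`), makes that visible. Note that with the default settings of `read.csv()`, blank fields in character columns are read as empty strings, which `is.na()` does not count.

```r
# Hypothetical four-row sample with gaps in two numeric columns
sample_df <- data.frame(
  Age  = c(22, NA, 26, NA),
  Fare = c(7.25, 71.28, NA, 8.05),
  Sex  = c("male", "female", "female", "male")
)

# Number of missing values in each column
na_per_column <- colSums(is.na(sample_df))

# Total number of missing values in the whole data frame
na_total <- sum(is.na(sample_df))
```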

R




# Keep only the rows that have no missing values
dropnull_train <- train[complete.cases(train), ]


  • dropnull_train contains only 714 rows: the 891 rows in the dataset minus the 177 rows with a missing value (891 - 177 = 714).
  • Now we split these 714 rows into two data frames, one for the passengers who survived and one for those who did not.

R




survivedlist=dropnull_train[dropnull_train$Survived == 1,]
notsurvivedlist=dropnull_train[dropnull_train$Survived == 0,]
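
The same drop-then-split pattern can be tried on a small example. The sketch below is an alternative formulation, not the article's code: it uses a made-up data frame (`sample_df`) and `split()`, which returns one data frame per value of the grouping column.

```r
# Hypothetical four-row sample; one row has a missing age
sample_df <- data.frame(
  Survived = c(0, 1, 1, 0),
  Age      = c(22, NA, 26, 35)
)

# Drop rows that contain any missing value
complete_rows <- sample_df[complete.cases(sample_df), ]

# Split the remaining rows by survival status into a named list
groups <- split(complete_rows, complete_rows$Survived)
survived_rows     <- groups[["1"]]
not_survived_rows <- groups[["0"]]
```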


Now we can visualize the data with pie charts, histograms, and bar plots, starting with a pie chart of how many passengers died and how many survived.

R




mytable <- table(titanic$Survived)
lbls <- paste(names(mytable), "\n", mytable, sep="")
pie(mytable,
    labels = lbls,
    main="Pie Chart of Survived column data\n (with sample sizes)")


Output:

 

The pie chart shows a class imbalance in the target column, Survived: noticeably more passengers died than survived.
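
To put a number on the imbalance, the counts can be turned into percentages. The sketch below plugs in the survivor and death counts listed in the observations earlier in this article; the variable names are illustrative.

```r
# Counts taken from the observations listed earlier in the article
counts <- c(dead = 549, survived = 342)

# Convert the counts to percentages of the total, rounded to one decimal
shares <- round(100 * prop.table(counts), 1)
```

With these counts, `shares` comes out at about 61.6% dead and 38.4% survived.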

R




# Age distribution of the passengers who survived
hist(survivedlist$Age,
     main="Age of surviving passengers",
     xlab="age",
     ylab="frequency")


Output:

 

Now let’s draw a bar plot to visualize the number of males and females who did not survive the sinking of the Titanic.

R




barplot(table(notsurvivedlist$Sex),
        xlab="gender",
        ylab="frequency")


Output:

 

From the bar plot above, we can see that nearly 350 males and about 50 females did not survive.
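
The counts behind such a bar plot can be inspected for both outcomes at once with a two-way table. The sketch below uses a made-up six-row data frame (`sample_df`), so the numbers are illustrative only.

```r
# Hypothetical six-row sample using the same Sex and Survived coding
sample_df <- data.frame(
  Sex      = c("male", "male", "female", "male", "female", "male"),
  Survived = c(0, 0, 1, 1, 0, 0)
)

# Rows are Sex, columns are Survived (0 = dead, 1 = alive)
counts <- table(sample_df$Sex, sample_df$Survived)

# Share of each sex that died or survived (each row sums to 1)
shares <- prop.table(counts, margin = 1)
```

Passing the transposed table to `barplot(t(counts), beside = TRUE, legend.text = TRUE)` would draw the dead and surviving counts side by side for each sex.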

R




# Kernel density estimate of the fares themselves (not of their frequency table)
temp <- density(titanic$Fare, na.rm=TRUE)
plot(temp, type="n",
     main="Fare charged from Passengers")
polygon(temp, col="lightgray",
        border="gray")


Output:

 

The density plot shows that a few passengers paid extremely high fares. These values are outliers and can distort the analysis. Let’s confirm their presence using a boxplot.

R




boxplot(titanic$Fare,
        main="Fare charged from passengers")


Output:

 

The boxplot confirms that the Fare column contains some extreme outliers.
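
The points that a boxplot draws as outliers can also be listed as values with `boxplot.stats()`, which applies the same whisker rule (1.5 times the spread between the hinges by default). The fares below are a made-up sample, not the real column.

```r
# Hypothetical sample of fares with one extreme value
fares <- c(7.25, 8.05, 13, 26, 31.5, 512.33)

# Values lying beyond the boxplot whiskers
outliers <- boxplot.stats(fares)$out

# The same fares with the outliers removed
fares_without_outliers <- fares[!fares %in% outliers]
```

Whether to drop, cap, or keep such values depends on the analysis; listing them first makes that decision explicit.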


