What is Data Cleaning? How to Process Data for Analytics and Machine Learning Modeling? (2022)

Data cleaning plays an important role in data management as well as in analytics and machine learning. In this article, I will try to build intuition about why data cleaning matters and walk through the most common data cleaning processes.

Data cleaning is the process of identifying the incorrect, incomplete, inaccurate, irrelevant, or missing parts of the data and then modifying, replacing, or deleting them as needed. It is considered a foundational element of data science.

Data is the most valuable asset for analytics and machine learning, and in computing and business it is needed everywhere. Real-world data, however, very often contains incomplete, inconsistent, or missing values. If the data is corrupted, it can stall the process or produce inaccurate results. Let's look at some examples of why data cleaning matters.

Suppose you are the general manager of a company that collects data about the customers who buy its products. You want to know which products customers are most interested in so that you can increase production of those products. If the data is corrupted or contains missing values, you will be misled and end up making the wrong decision.

Ultimately, machine learning is data-driven AI: if the data is irrelevant or error-prone, it leads to an incorrect model.

The cleaner your data, the better your model. So we need to process, or clean, the data before using it. Without quality data, it would be unrealistic to expect a good outcome.

Now let’s take a closer look at the different ways of cleaning data.

Irrelevant columns:

If your DataFrame (a two-dimensional data structure in which data is aligned in a tabular fashion in rows and columns) contains columns that are irrelevant or that you are never going to use, you can drop them to focus on the columns you will actually work with. Let's see an example of how to deal with such a dataset by creating a small students dataset with a pandas DataFrame.

import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O

data = {'Name': ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H'],
        'Height': [5.2, 5.7, 5.6, 5.5, 5.3, 5.8, 5.6, 5.5],
        'Roll': [55, 99, 15, 80, 1, 12, 47, 104],
        'Department': ['CSE', 'EEE', 'BME', 'CSE', 'ME', 'ME', 'CE', 'CSE'],
        'Address': ['polashi', 'banani', 'farmgate', 'mirpur', 'dhanmondi', 'ishwardi', 'khulna', 'uttara']}
df = pd.DataFrame(data)
print(df)

Figure 2: The students DataFrame.

(Video) Python Machine Learning - Class 5 | Data Exploration - Data Cleaning | Machine Learning | Edureka

Here, if we want to remove the “Height” column, we can use pandas.DataFrame.drop to drop specified labels from rows or columns.

DataFrame.drop(self, labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Let us drop the Height column. To do this, you pass the column name to the columns keyword.

df=df.drop(columns='Height')
print(df.head())
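
If several columns are irrelevant, drop also accepts a list. A minimal sketch using the same DataFrame (Roll and Address are just example columns to discard, and the result is not assigned back, so df itself is unchanged):

# preview dropping several columns in one call
print(df.drop(columns=['Roll', 'Address']).head())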

Missing data:

It is rare to find a real-world dataset without any missing values; when you start working with real-world data, you will find that most datasets contain them. Handling missing values is important because leaving them as they are can distort your analysis and your machine learning models. So you need to check whether your dataset contains missing values, and if it does, you can do one of three things:
1. Leave them as they are
2. Fill in the missing values
3. Drop them
Different methods can be used to fill in the missing values. For example, Figure 4 shows that the airquality dataset has missing values.

airquality.head() # return top n (5 by default) rows of a data frame

Figure 4: The airquality dataset with missing values (NaN).

In Figure 4, NaN indicates that the dataset contains a missing value at that position. After finding missing values in your dataset, you can use pandas.DataFrame.fillna to fill them.

DataFrame.fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

You can use different statistical methods to fill the missing values according to your needs. For example, in Figure 5 we use the column mean to fill the missing values.

airquality['Ozone'] = airquality['Ozone'].fillna(airquality.Ozone.mean())
airquality.head()

Figure 5: The Ozone column after filling missing values with the column mean.

You can see that the missing values in the “Ozone” column are filled with the mean value of that column.
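
The mean is only one choice. A minimal sketch of two other common options, shown for illustration (the forward-fill line is left commented out as an alternative):

# median fill: more robust than the mean when the column contains outliers
airquality['Ozone'] = airquality['Ozone'].fillna(airquality['Ozone'].median())
# forward fill: carry the previous observation forward (useful for time-ordered data)
# airquality['Ozone'] = airquality['Ozone'].ffill()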

You can also drop the rows or columns where missing values are found, with the help of pandas.DataFrame.dropna. Here we drop the rows containing missing values.

airquality = airquality.dropna()  # drop the rows containing at least one missing value
airquality.head()

Figure 6: The airquality dataset after dropping rows that contain missing values.

Here, in Figure 6, you can see that the rows that had missing values in the Solar.R column have been dropped.

You can confirm that no missing values remain:

airquality.isnull().sum(axis=0)  # count the missing values remaining in each column

Outliers:

If you are new to data science, the first question that will arise in your head is “what do these outliers mean?” Let’s talk about outliers first, and then we will discuss how to detect them in the dataset and what to do after detecting them.
According to Wikipedia,
In statistics, an outlier is a data point that differs significantly from other observations.
That means an outlier is a data point that is significantly different from the other data points in the dataset. Outliers can be created by errors in the experiments or by variability in the measurements. Let’s look at an example to clarify the concept.

Figure 8: A sample dataset where the math column contains an outlier (20 among values between 90 and 95).

In Figure 8, all the values in the math column are in the range 90–95 except 20, which is significantly different from the others. It could be an input error in the dataset, so we can call it an outlier. One thing should be added here: not all outliers are bad data points. Some are errors, but others are valid values.

So now the question is: how can we detect the outliers in the dataset?
To detect the outliers we can use:
1. Box plot
2. Scatter plot
3. Z-score, etc.
We will use the scatter plot method here (a short Z-score sketch follows at the end of this section). Let’s draw a scatter plot of a dataset.

import matplotlib.pyplot as plt

dataset.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

Figure 9: Scatter plot of initial_cost vs. total_est_fee, with one outlier outlined in red.

Here, in Figure 9, there is an outlier, outlined in red. After detecting it, we can remove it from the dataset.

df_removed_outliers = dataset[dataset.total_est_fee < 17500]
df_removed_outliers.plot(kind='scatter', x='initial_cost', y='total_est_fee', rot=70)
plt.show()

Figure 10: Scatter plot after removing the outlier.
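
The Z-score method mentioned in the list above can also be applied directly with pandas and NumPy. A minimal sketch, assuming the same numeric total_est_fee column; the threshold of 3 standard deviations is a common rule of thumb, not a fixed rule:

import numpy as np

# Z-score: how many standard deviations each value lies from the column mean
z = (dataset['total_est_fee'] - dataset['total_est_fee'].mean()) / dataset['total_est_fee'].std()
dataset_no_outliers = dataset[np.abs(z) < 3]  # keep rows within 3 standard deviations of the mean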

Duplicate rows:

Datasets may contain duplicate entries, and deleting duplicate rows is one of the easiest cleaning tasks. To delete the duplicate rows you can use dataset_name.drop_duplicates(). Figure 11 shows a sample of a dataset having duplicate rows.

Figure 11: A sample dataset containing duplicate rows.

dataset = dataset.drop_duplicates()  # this will remove the duplicate rows
print(dataset)
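
drop_duplicates can also compare only a subset of columns. A small sketch, where the column name is just an example:

# treat rows as duplicates when the 'Name' column repeats, keeping the first occurrence
dataset = dataset.drop_duplicates(subset=['Name'], keep='first')
print(dataset)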

Tidy data set:

In a tidy dataset, each column represents a separate variable and each row represents an individual observation. In untidy data, columns may represent values rather than variables. Tidy data makes common data problems easier to fix, and you can turn untidy data into tidy data using pandas.melt.

import pandas as pd
pd.melt(frame=df,id_vars='name',value_vars=['treatment a','treatment b'])

Figure 13: The untidy dataset reshaped with pandas.melt.

You can also use pandas.DataFrame.pivot to un-melt tidy data back into wide form.
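
A minimal sketch of that reverse step, assuming the melt result above is stored in a variable (the 'variable' and 'value' column names are the pandas.melt defaults):

melted = pd.melt(frame=df, id_vars='name', value_vars=['treatment a', 'treatment b'])
# reshape back: one row per name, one column per treatment
wide = melted.pivot(index='name', columns='variable', values='value')
print(wide)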

Converting data types:

In a DataFrame, data can be of many types. For example:
1. Categorical data
2. Object data
3. Numeric data
4. Boolean data

Sometimes a column's data type needs to be changed, or a column ends up with an inconsistent data type. You can convert from one data type to another by using pandas.DataFrame.astype.

DataFrame.astype(self, dtype, copy=True, errors='raise', **kwargs)
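
A minimal sketch, reusing the students DataFrame from the beginning of the article; the target types here are only examples:

df['Roll'] = df['Roll'].astype(str)                      # numeric -> string
df['Department'] = df['Department'].astype('category')  # object -> categorical
print(df.dtypes)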

String manipulation:

One of the most important and interesting parts of data cleaning is string manipulation. In the real world, most data is unstructured. String manipulation means changing, parsing, matching, or analyzing strings, and for it you should have some knowledge of regular expressions. Sometimes you need to extract a value from a larger sentence, and this is where string manipulation gives us a strong benefit. Let's say we have:
“This umbrella costs $12 and he took this money from his mother.”
If you want to extract the "$12" information from the sentence, you have to build a regular expression matching that pattern. After that you can use Python libraries; there are many built-in and external libraries in Python for string manipulation.

import re

pattern = re.compile(r'\$\d*')  # a dollar sign followed by digits
result = pattern.match("$12312312")
print(bool(result))

This will give you an output showing “True”.
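
To actually pull the "$12" out of the example sentence, re.search can be used. A small sketch:

import re

sentence = "This umbrella costs $12 and he took this money from his mother."
match = re.search(r'\$\d+', sentence)  # a dollar sign followed by one or more digits
if match:
    print(match.group())  # prints: $12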

Data Concatenation:

In this modern era of data science, the volume of data is increasing day by day, and because of that volume, data is often stored in separate files. If you work with multiple files, you can concatenate them for simplicity using pandas.concat.

pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=None, copy=True)

Let’s see an example of how to concatenate two datasets. Figure 14 shows two different datasets loaded from two different files. We will concatenate them using pandas.concat.

Figure 14: Two datasets loaded from two different files.

concatenated_data=pd.concat([dataset1,dataset2])
print(concatenated_data)
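
If both files use the default integer index, the row labels will repeat after concatenation; passing ignore_index=True builds a fresh index. A small sketch with hypothetical file names:

dataset1 = pd.read_csv('data_part1.csv')  # hypothetical input file
dataset2 = pd.read_csv('data_part2.csv')  # hypothetical input file
concatenated_data = pd.concat([dataset1, dataset2], ignore_index=True)  # reset the row index
print(concatenated_data)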

Data cleaning is very important for making your analytics and machine learning models error-free. A small error in the dataset can cause a lot of problems and waste all your effort, so always try to keep your data clean.


FAQs

What is data cleaning in machine learning? ›

Data cleaning is the process of preparing data for analysis by removing or modifying data that is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.

What is data cleaning in data analytics? ›

Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled.

What is data cleaning explain with examples? ›

Data cleaning is a process by which inaccurate, poorly formatted, or otherwise messy data is organized and corrected. For example, if you conduct a survey and ask people for their phone numbers, people may enter their numbers in different formats.

Why is data cleaning important in data analysis? ›

Data cleansing is also important because it improves your data quality and in doing so, increases overall productivity. When you clean your data, all outdated or incorrect information is gone – leaving you with the highest quality information.

Is data cleaning part of data analysis? ›

However, data cleaning is also a vital part of the data analytics process. If your data has inconsistencies or errors, you can bet that your results will be flawed, too. And when you're making business decisions based on those insights, it doesn't take a genius to figure out what might go wrong!

What is the difference between data cleaning and data cleansing? ›

Data cleansing and data cleaning are often used interchangeably. However, international data management standards, such as DAMA's DMBoK and CMMI's DMM, refer to this process as data cleansing, so if you have to choose between the two, choose data cleansing.

What are the best practices for data cleaning? ›

Here are some best practices for cleansing your data.
  • Develop a data quality strategy. Set expectations for your data. ...
  • Correct data at the point of entry. ...
  • Validate the accuracy of your data. ...
  • Manage duplicates. ...
  • Append missing data. ...
  • Promote the use of clean data across the organisation.

What is the first step should a data analyst take to clean their data? ›

The first step in cleaning data is to carry out data profiling, which allows us to identify outlier values or identify problems in data collected. Once the field has been profiled, it is normalized, de-duplicated, and obsolete information is removed, among other things.

What are the data issues in data cleaning? ›

This process of making data accurate and consistent is riddled with many problems, few of which are mentioned below:
  • High Volume of Data: ...
  • Misspellings: ...
  • Lexical Errors: ...
  • Misfielded Value: ...
  • Domain Format Errors: ...
  • Irregularities: ...
  • Missing Values: ...
  • Contradiction:

How do you write a data cleaning report? ›

It's a good idea to consider the following questions when writing the report:
  1. What types of noise occurred in the data?
  2. What approaches did you use to remove the noise? Which techniques were successful?
  3. Are there any cases or attributes that could not be salvaged? Be sure to note data excluded due to noise.

What is data cleansing in ETL? ›

In data warehouses, data cleaning is a major part of the so-called ETL process. Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data.

Why is cleaning data such an important part of the data analysis process How can sorting and filtering help you clean data more effectively? ›

It removes major errors and inconsistencies that are inevitable when multiple sources of data are being pulled into one dataset. Using tools to clean up data will make everyone on your team more efficient as you'll be able to quickly get what you need from the data available to you.

What is data analysis explain in detail? ›

Data analysis is the process of cleaning, changing, and processing raw data, and extracting actionable, relevant information that helps businesses make informed decisions.

What is machine learning model? ›

A machine learning model is an expression of an algorithm that combs through mountains of data to find patterns or make predictions. Fueled by data, machine learning (ML) models are the mathematical engines of artificial intelligence.

What is Step 5 in machine learning? ›

Evaluation allows us to test the model against data that it has never seen before. The way the model performs is representative of how it is going to perform in the real world. Once the evaluation is done, we need to see if we can still improve our training. We can do this by tuning our parameters.

Do data scientists do data cleaning? ›

While many believe data science is all about using machine learning algorithms to build models and make business impact, data cleaning is also an essential part of being a data scientist.

Why is data cleaning difficult? ›

Data cleaning is tricky and time-consuming

Cleaning the data requires removal of duplications, removing or replacing missing entries, correcting misfielded values, ensuring consistent formatting and a host of other tasks which take a considerable amount of time.

Is data cleaning same as data preprocessing? ›

Data Preprocessing

Certain steps are followed and executed in order to convert the data into a smaller, clean data set. This set of steps is known as data preprocessing, and data cleaning is one of those preprocessing steps.

What is the difference between data cleansing and data transformation? ›

Data cleansing is the act of removing meaningless data from a data set to enhance consistency. In contrast, data transformation is about transforming data from one structure to another to make it easier to handle.

What is the difference between data cleaning and data wrangling? ›

Data cleaning focuses on removing erroneous data from your data set. In contrast, data-wrangling focuses on changing the data format by translating "raw" data into a more usable form.

What are the 5 steps in data analytics? ›

In this post we'll explain five steps to get you started with data analysis.
  • STEP 1: DEFINE QUESTIONS & GOALS.
  • STEP 2: COLLECT DATA.
  • STEP 3: DATA WRANGLING.
  • STEP 4: DETERMINE ANALYSIS.
  • STEP 5: INTERPRET RESULTS.

What are the 7 steps of data analysis? ›

Here are seven steps organizations should follow to analyze their data:
  • Define goals. Defining clear goals will help businesses determine the type of data to collect and analyze.
  • Integrate tools for data analysis. ...
  • Collect the data. ...
  • Clean the data. ...
  • Analyze the data. ...
  • Draw conclusions. ...
  • Visualize the data.

What are the advantages of data cleaning? ›

What are the Benefits of Data Cleansing?
  1. Improved decision making. Quality data deteriorates at an alarming rate. ...
  2. Boost results and revenue. ...
  3. Save money and reduce waste. ...
  4. Save time and increase productivity. ...
  5. Protect reputation. ...
  6. Minimise compliance risks.

What does data processing mean? ›

data processing, manipulation of data by a computer. It includes the conversion of raw data to machine-readable form, flow of data through the CPU and memory to output devices, and formatting or transformation of output. Any use of computers to perform defined operations on data can be included under data processing.

How do you clean data in Python? ›

Pythonic Data Cleaning With Pandas and NumPy
  1. Dropping Columns in a DataFrame.
  2. Changing the Index of a DataFrame.
  3. Tidying up Fields in the Data.
  4. Combining str Methods with NumPy to Clean Columns.
  5. Cleaning the Entire Dataset Using the applymap Function.
  6. Renaming Columns and Skipping Rows.

Which first step should a data analyst take to clean their data Accenture? ›

The first step is to identify the right set of data required for business problem analysis.

Why Data cleaning is important in Excel? ›

The reason data cleaning is important is to ensure that we achieve high data integrity. Data integrity is vital because it is the only way of ensuring that we have high quality data to make decisions upon. Since our decisions are typically based on data sets, if the data is of poor quality, our decisions will be too.

What is the key objective of data analytics? ›

The chief aim of data analytics is to apply statistical analysis and technologies on data to find trends and solve problems. Data analytics has become increasingly important in the enterprise as a means for analyzing and shaping business processes and improving decision-making and business results.

How do you clean data in Excel? ›

Excel has a built-in function to remove duplicates. To eliminate duplicate data, select the Data option in the toolbar and, in the Data Tools ribbon, choose the "Remove Duplicates" option.

Which of the following can be generally used to clean and prepare big data? ›

A data warehouse is generally used to clean and prepare big data.

What does it mean to clean scrub the data what activities are performed in this phase? ›

Cleaning or scrubbing the data means handling missing values and identifying and reducing noise in the data. Activities performed in this phase include finding and removing outliers and filling in missing values with the most appropriate values.

What is data cleansing process? ›

Data cleansing, also referred to as data cleaning or data scrubbing, is the process of fixing incorrect, incomplete, duplicate or otherwise erroneous data in a data set. It involves identifying data errors and then changing, updating or removing data to correct them.

What is meant by data cleaning? ›

Data cleansing or data cleaning is the process of identifying and correcting corrupt, incomplete, duplicated, incorrect, and irrelevant data from a reference set, table, or database. Data issues typically arise through user entry errors, incomplete data capture, non-standard formats, and data integration issues.

Which tool is used for data cleansing? ›

Datamatch Enterprise by Data Ladder is a visually-driven data cleaning application. Like many of the other tools on our list, it focuses on customer data. However, unlike others, it is designed specifically to resolve data quality issues within datasets that are already in a poor condition.

What is data cleaning in AI? ›

Data cleansing is the process of improving the quality of data by fixing errors and omissions based on certain standard practices.

What is the first step of data cleaning? ›

Removal of Unwanted Observations

Since one of the main goals of data cleansing is to make sure that the dataset is free of unwanted observations, this is classified as the first step to data cleaning. Unwanted observations in a dataset are of 2 types, namely; the duplicates and irrelevances.

What is machine learning? ›

Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

How many steps are in data cleaning? ›

One common breakdown discusses four different steps in data cleaning to make the data more reliable and to produce good results. After properly completing the data cleaning steps, you'll have a robust dataset that avoids many of the most common pitfalls.
