When machine learning (ML) is applied to real-world problems, data preparation is typically the first step. Before a dataset can be used to train machine learning or deep learning algorithms, it usually contains a number of inconsistencies that must be fixed. Common problems with unprepared data include:
- Missing values in the dataset
- Different file formats
- Outlying data points
- Inconsistency in variable values
- Irrelevant feature variables
STEPS INVOLVED IN DATA PREPARATION ARE AS FOLLOWS:
1. Gather data
Finding the appropriate data is the first step in the data preparation process. The data might come from an existing database or be added on the fly.
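As a rough illustration of the two sources mentioned above, the sketch below reads rows from a database and also builds a few records on the fly. The in-memory SQLite database, the `customers` table, and all column names and values are assumptions made up for this example; a real project would connect to its own data source.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for an existing data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", 34), (2, "Ben", 29)],
)
conn.commit()

# Option 1: pull rows from the existing database into a DataFrame.
from_db = pd.read_sql_query("SELECT * FROM customers", conn)
conn.close()

# Option 2: build a DataFrame on the fly from records created in code.
on_the_fly = pd.DataFrame([{"id": 3, "name": "Chen", "age": 41}])

# Combine both sources into a single dataset.
df = pd.concat([from_db, on_the_fly], ignore_index=True)
print(df)
```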
2. Discover and assess data
After collecting the data, it is important to explore each dataset. The goal of this step is to understand the data and what must be done before it becomes useful in a given context.
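One possible way to get a first impression of a dataset with pandas is sketched below. The small DataFrame and its columns (`age`, `city`, `salary`) are invented for illustration and deliberately contain gaps, an implausible age, and inconsistent spelling.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with typical quality problems.
df = pd.DataFrame({
    "age": [25, 31, np.nan, 42, 230],
    "city": ["Pune", "pune", "Delhi", None, "Delhi"],
    "salary": [50000, 62000, 58000, np.nan, 61000],
})

# Column types and non-null counts.
df.info()

# Summary statistics for numeric columns; the max age of 230 hints at an outlier.
summary = df.describe()
print(summary)

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Distinct values reveal inconsistent spellings such as "Pune" and "pune".
distinct_cities = df["city"].nunique()
print(df["city"].value_counts())
```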
3. Cleanse and validate data
Although it is generally the most time-consuming step in the data preparation process, cleaning the data is essential for removing inaccurate data and filling in any gaps. Key tasks here include:
- Eliminating irrelevant information and outliers
- Filling in values where there are gaps
- Standardizing data
- Masking sensitive or private data entries
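The cleaning tasks above could look roughly like this in pandas. This is a minimal sketch under several assumptions: the sample data and column names are made up, outliers are detected with the common 1.5 × IQR rule, gaps are filled with the median, and masking is done with a truncated SHA-256 hash. Other choices may suit a given dataset better.

```python
import hashlib

import numpy as np
import pandas as pd

# Hypothetical raw data.
df = pd.DataFrame({
    "email": ["a@example.com", "b@example.com", "c@example.com",
              "d@example.com", "e@example.com", "f@example.com"],
    "age": [25, 31, np.nan, 42, 230, 38],
    "city": [" Pune", "pune", "Delhi", "DELHI ", "Delhi", "Pune"],
    "row_note": ["x", "y", "z", "x", "y", "z"],  # irrelevant column
})

# 1. Eliminate irrelevant information.
clean = df.drop(columns=["row_note"])

# 2. Remove outliers with the 1.5 * IQR rule (an assumed choice).
#    Rows with a missing age are kept so they can be filled in the next step.
q1 = clean["age"].quantile(0.25)
q3 = clean["age"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
keep = clean["age"].between(lower, upper) | clean["age"].isna()
clean = clean[keep].copy()

# 3. Fill gaps with the median of the remaining ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

# 4. Standardize text values: trim whitespace and unify capitalization.
clean["city"] = clean["city"].str.strip().str.title()

# 5. Mask private data by replacing each email with a short hash.
clean["email"] = clean["email"].map(
    lambda e: hashlib.sha256(e.encode("utf-8")).hexdigest()[:12]
)

print(clean)
```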
4. Transform and enrich data
Transforming data means changing its format or value entries to achieve a specific result or to make the data more comprehensible to a wider audience. Enriching data means adding to it and connecting it with other relevant information in order to deliver deeper insights.
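To make the distinction concrete, here is a small sketch: the transformation part converts text columns to proper dates and numbers, and the enrichment part joins a lookup table onto the data. The `orders` and `regions` tables and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical order data where dates and amounts arrived as text.
orders = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-02-17", "2024-02-20"],
    "amount": ["120.50", "80", "42.25"],
    "city": ["Pune", "Delhi", "Pune"],
})

# Transform: change the format of existing values.
orders["order_date"] = pd.to_datetime(orders["order_date"])
orders["amount"] = pd.to_numeric(orders["amount"])
orders["order_month"] = orders["order_date"].dt.month

# Enrich: connect the data with additional relevant information.
regions = pd.DataFrame({
    "city": ["Pune", "Delhi"],
    "region": ["West", "North"],
})
enriched = orders.merge(regions, on="city", how="left")

print(enriched)
```

A left join keeps every order even when a city has no matching entry in the lookup table; such rows would simply get a missing `region`.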
5. Store data
Once the data is ready, it can be stored or sent to a third-party application, such as a business intelligence tool, where it can be processed and analyzed.
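Storing the prepared data can be as simple as writing it to a CSV file that another tool can load. The sketch below assumes a temporary directory and a made-up file name, `prepared_data.csv`; the DataFrame contents are also invented.

```python
import os
import tempfile

import pandas as pd

# Hypothetical prepared dataset.
prepared = pd.DataFrame({
    "city": ["Pune", "Delhi"],
    "avg_age": [31.5, 40.0],
})

# Write to a CSV file in a temporary directory (path is an assumption).
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "prepared_data.csv")
prepared.to_csv(out_path, index=False)  # index=False omits the row index column

# Read the file back to confirm it was stored as expected.
reloaded = pd.read_csv(out_path)
print(reloaded)
```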
Step 1: Importing pandas as pd in Python to use this library
import pandas as pd
import numpy as np
Step 2: Reading data from an Excel/CSV file
df