Learn with Anu Arora: June 2022

When applying ML to real-world problems, data preparation is typically the first step. A raw dataset usually contains a number of inconsistencies that must be fixed before it can be used to train machine learning or deep learning algorithms. Common problems with unprepared data include:

Missing Values in the Dataset
Different File Formats
Outlying Data Points
Inconsistency in variable values
Irrelevant feature variables

STEPS INVOLVED IN DATA PREPARATION ARE AS FOLLOWS:

1. Gather data

Finding the appropriate data is the first step in the data preparation process. The data might come from an existing database or be collected on the fly.
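As a rough illustration of pulling data from an existing database, here is a minimal sketch. The in-memory SQLite database, the `heights` table, and the values in it are all made up for this example; in practice you would connect to your own data source.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for an existing data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heights (father REAL, son REAL)")
conn.executemany(
    "INSERT INTO heights VALUES (?, ?)",
    [(65.0, 66.5), (70.2, 69.8), (68.1, 68.9)],  # made-up illustrative rows
)
conn.commit()

# Load the query result straight into a DataFrame.
df_gathered = pd.read_sql_query("SELECT father, son FROM heights", conn)
conn.close()

print(df_gathered.shape)
```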

2. Discover and assess data

After collecting the data, it is crucial to explore each dataset. The goals of this step are to understand the data and to work out what has to be done before it becomes useful in a given context.
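A first pass at assessing a dataset could look like the sketch below. The `sample` DataFrame and its `Unit` column are invented for illustration; the idea is simply to look at column types, count gaps, and spot inconsistent labels.

```python
import numpy as np
import pandas as pd

# Made-up sample with one gap and inconsistently spelled labels.
sample = pd.DataFrame({
    "Father": [65.0, 70.2, np.nan, 68.1],
    "Son": [66.5, 69.8, 67.0, 68.9],
    "Unit": ["in", "IN", "in", "inch"],
})

missing_per_column = sample.isna().sum()   # number of gaps in each column
distinct_units = sample["Unit"].unique()   # reveals inconsistent spellings

print(sample.dtypes)
print(missing_per_column)
print(distinct_units)
```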

3. Cleanse and validate data

Although it is generally the most time-consuming step in the data preparation process, cleaning the data is essential for eliminating inaccurate data and filling in any gaps. Key tasks here include:

Eliminating irrelevant information and outliers
Adding values where there are gaps
Standardizing data
Masking sensitive or private data entries
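The cleaning tasks above can be sketched in pandas roughly as follows. The data is made up, and the specific choices (the 1.5 × IQR rule for outliers, the median for filling gaps, a truncated SHA-256 hash for masking) are assumptions for this example, not the only reasonable options.

```python
import hashlib

import numpy as np
import pandas as pd

# Made-up data: one gap, one obvious outlier, inconsistent unit labels.
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Di", "Ed", "Flo"],
    "unit": ["in", "IN ", "in", " In", "in", "in"],
    "height": [65.0, 66.0, np.nan, 67.0, 68.0, 200.0],
})

# 1. Drop outliers using the 1.5 * IQR rule, keeping rows with gaps for now.
q1, q3 = raw["height"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = raw["height"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = raw[in_range | raw["height"].isna()].copy()

# 2. Fill the remaining gaps with the median of the column.
cleaned["height"] = cleaned["height"].fillna(cleaned["height"].median())

# 3. Standardize the labels: trim whitespace and lowercase.
cleaned["unit"] = cleaned["unit"].str.strip().str.lower()

# 4. Mask the sensitive column with a one-way hash.
cleaned["name"] = cleaned["name"].map(
    lambda s: hashlib.sha256(s.encode()).hexdigest()[:10]
)

print(cleaned)
```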

4. Transform and enrich data

Data transformation involves changing the format or value entries to achieve a specific result or to make the data more comprehensible to a larger audience. Enriching data means adding to it and connecting it with other relevant information in order to deliver deeper insights.
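To make the distinction concrete, here is a small sketch: a unit conversion as the transformation, and a join against a lookup table as the enrichment. The `family_id` key, the `regions` table, and all values are hypothetical.

```python
import pandas as pd

# Made-up measurements in inches.
heights = pd.DataFrame({
    "family_id": [1, 2, 3],
    "father_in": [65.0, 70.0, 68.0],
})

# Hypothetical lookup table carrying extra context for each family.
regions = pd.DataFrame({
    "family_id": [1, 2, 3],
    "region": ["North", "South", "North"],
})

# Transform: convert inches to centimetres.
heights["father_cm"] = heights["father_in"] * 2.54

# Enrich: attach the region to every row with a left join on the shared key.
enriched = heights.merge(regions, on="family_id", how="left")

print(enriched)
```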

5. Store data

When the data is ready, it can be saved or sent to a third-party program, like a business intelligence tool, opening the door for processing and analysis.
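Saving the prepared data can be as simple as writing a CSV file. The sketch below writes to an in-memory buffer instead of a real file so that it is self-contained; the data is made up, and with a real file you would pass a path to `to_csv` instead.

```python
import io

import pandas as pd

# Made-up prepared data.
prepared = pd.DataFrame({"Father": [65.0, 70.2], "Son": [66.5, 69.8]})

# Write the data as CSV, without the index column.
buffer = io.StringIO()
prepared.to_csv(buffer, index=False)

# Read it back to confirm that nothing was lost.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded)
```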

Step 1: Import pandas as pd and NumPy as np

import pandas as pd
import numpy as np

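Step 2: Read the data from a CSV file

# Load father-son.csv into a DataFrame and display it
df = pd.read_csv('father-son.csv')
df

# Summary statistics (count, mean, std, min, quartiles, max) for each numeric column
df.describe()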

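Step 3: Split the dataset into training and test sets

from sklearn.model_selection import train_test_split

# Features: every column except the target column 'Son'
X = df.drop("Son", axis=1)
# Target: the 'Son' column
y = df['Son']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=112)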
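To see what `test_size=0.2` and `random_state` do, here is a small sketch on a made-up ten-row frame (the column names mirror the ones above, but the values are synthetic). Eight rows end up in the training set and two in the test set, and repeating the call with the same `random_state` gives the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic ten-row frame, for illustration only.
demo = pd.DataFrame({
    "Father": np.arange(60.0, 70.0),
    "Son": np.arange(61.0, 71.0),
})
X_demo = demo.drop("Son", axis=1)
y_demo = demo["Son"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=112
)

# Same seed, same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=112
)

print(len(X_tr), len(X_te))
print(X_te.index.equals(X_te2.index))
```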

from sklearn.linear_model import LinearRegression

# Fit a linear regression model on the training set
my_model = LinearRegression()
result = my_model.fit(X_train, y_train)

# Predict the target for the held-out test set
predictions = result.predict(X_test)
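To get a feel for what a fitted `LinearRegression` holds, the sketch below fits one on noise-free synthetic points that lie on the line y = 2x + 1 and reads back the learned slope and intercept. The data is invented for this example; on a real dataset the fit would not be exact.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic points lying exactly on y = 2x + 1.
x_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_line = 2.0 * x_demo.ravel() + 1.0

demo_model = LinearRegression().fit(x_demo, y_line)

slope = demo_model.coef_[0]          # learned coefficient for the single feature
intercept = demo_model.intercept_    # learned constant term

print(slope, intercept)
print(demo_model.predict(np.array([[5.0]])))
```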

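import matplotlib.pyplot as plt

# Training points in cyan, predictions on the test set as a black line
# (this works as written when X has a single feature column)
plt.scatter(X_train, y_train, color='c')
plt.plot(X_test, predictions, color='k')
plt.show()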

from sklearn.metrics import r2_score

# Coefficient of determination (R^2) on the test set
r2_score(y_test, predictions)

from sklearn import metrics

# Error metrics on the test set (np was imported in Step 1)
print('MAE: ', metrics.mean_absolute_error(y_test, predictions))
print('MSE: ', metrics.mean_squared_error(y_test, predictions))
print('RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
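For reference, these metrics and the R² score can also be computed by hand with NumPy. The sketch below uses three made-up values, not the father-son data, to show what each number measures.

```python
import numpy as np

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))    # mean absolute error
mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```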

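import seaborn as sns

# Scatter plot of Father vs. Son with marginal distributions
# (x and y are passed as keywords; recent seaborn versions no longer accept them positionally)
sns.jointplot(x='Father', y='Son', data=df)

# The same relationship drawn with hexagonal bins
sns.jointplot(x='Father', y='Son', data=df, kind='hex')

# Pairwise relationships between all numeric columns
sns.pairplot(df)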

Tuesday, June 14, 2022

PREPARE DATA FOR MACHINE LEARNING
