Tuesday, June 14, 2022

PREPARE DATA FOR MACHINE LEARNING


When applying machine learning (ML) to real-world problems, data preparation is typically the first step. Before a dataset can be used to train machine learning or deep learning algorithms, it usually contains a number of inconsistencies that must be fixed. Common problems with unprepared data include:

  • Missing values in the dataset
  • Different file formats
  • Outlying data points
  • Inconsistency in variable values
  • Irrelevant feature variables

STEPS INVOLVED IN DATA PREPARATION ARE AS FOLLOWS:

1. Gather data

Finding the appropriate data is the first step in the data preparation process. The data might come from an existing database or be gathered ad hoc.

2. Discover and assess data

Once the data has been collected, it is important to explore each dataset. The goal of this step is to understand the data and what must be done before it becomes useful in a given context.
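
As a rough illustration of this step, the sketch below inspects a small DataFrame with pandas. The data and column names are invented for the example; the calls used (info, isna, describe) are standard pandas methods, but which checks matter depends on the dataset at hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names and values are invented for illustration.
df = pd.DataFrame({
    "Father": [65.0, 63.3, np.nan, 70.1, 68.4],
    "Son": [59.8, 63.2, 63.3, 69.0, np.nan],
})

# Column types and non-null counts
df.info()

# Number of missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns
summary = df.describe()
print(summary)
```

Looking at missing-value counts next to the summary statistics is usually enough to decide which cleaning tasks the dataset needs.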

3. Cleanse and validate data

Cleaning the data is generally the most time-consuming step in the data preparation process, but it is essential for eliminating inaccurate data and filling in any gaps. Key tasks here include:

  • Eliminating irrelevant information and outliers
  • Filling in values where there are gaps
  • Standardizing data
  • Masking sensitive or private data entries
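
Some of the cleaning tasks above can be sketched in pandas as follows. This is a minimal example on invented data, assuming median imputation for gaps, the 1.5 × IQR rule for outliers, and z-score standardization; these are common choices, not the only ones, and data masking is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one gap and one obvious outlier (values invented).
df = pd.DataFrame({"height": [64.0, 66.0, np.nan, 68.0, 70.0, 150.0]})

# 1. Fill gaps with the column median (one simple imputation choice among many).
median = df["height"].median()
df["height"] = df["height"].fillna(median)

# 2. Drop outliers using the 1.5 * IQR rule.
q1 = df["height"].quantile(0.25)
q3 = df["height"].quantile(0.75)
iqr = q3 - q1
in_range = df["height"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[in_range].copy()

# 3. Standardize to zero mean and unit variance (z-scores).
cleaned["height_z"] = (
    cleaned["height"] - cleaned["height"].mean()
) / cleaned["height"].std()

print(cleaned)
```

The order matters here: the gap is filled first so that the quantiles are computed on a complete column, and standardization comes last so that the outlier does not distort the mean and standard deviation.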

4. Transform and enrich data

Data transformation involves changing the format or value entries to achieve a specific result or to make the data understandable to a wider audience. Enriching data means adding to it and connecting it with other relevant information in order to deliver deeper insights.
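
To make the distinction concrete, here is a small sketch with invented data: a unit conversion stands in for transformation, and a left join against a lookup table stands in for enrichment. The tables, column names, and the "region" attribute are all hypothetical.

```python
import pandas as pd

# Hypothetical measurements in inches (values invented).
heights = pd.DataFrame({
    "family_id": [1, 2, 3],
    "father_in": [65.0, 70.0, 68.0],
})

# Transformation: convert inches to centimetres (1 in = 2.54 cm).
heights["father_cm"] = heights["father_in"] * 2.54

# Hypothetical lookup table carrying extra context for some families.
regions = pd.DataFrame({
    "family_id": [1, 2],
    "region": ["north", "south"],
})

# Enrichment: a left join keeps every measurement and adds the region where known.
enriched = heights.merge(regions, on="family_id", how="left")
print(enriched)
```

A left join is used so that rows without a match in the lookup table are kept (with a missing region) rather than silently dropped.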

5. Store data

When the data is ready, it can be saved or sent to a third-party program, like a business intelligence tool, opening the door for processing and analysis.
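
One simple way to store prepared data is to write it to a CSV file that other tools can read. The sketch below assumes a small invented DataFrame and a hypothetical file name, and writes to a temporary directory only so that the example cleans up after itself.

```python
import os
import tempfile

import pandas as pd

# Hypothetical prepared data (values invented).
prepared = pd.DataFrame({"Father": [65.0, 70.0], "Son": [66.5, 69.0]})

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prepared.csv")
    prepared.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    restored = pd.read_csv(path)

# Check that the round trip preserved the data
print(restored.equals(prepared))
```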

Step 1: Import pandas (as pd) and NumPy (as np)

import pandas as pd
import numpy as np

Step 2: Read the data from a CSV file

# Load the father/son dataset into a DataFrame
df = pd.read_csv('father-son.csv')
df

# Summary statistics for each numeric column
df.describe()

Step 3: Split the dataset into training and test sets


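from sklearn.model_selection import train_test_split

# Features: every column except the target; target: the 'Son' column
X = df.drop("Son", axis=1)
y = df['Son']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=112)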
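
from sklearn.linear_model import LinearRegression

# Fit a linear regression on the training split (fit returns the fitted model)
my_model = LinearRegression()
result = my_model.fit(X_train, y_train)

# Predict the 'Son' values for the held-out test rows
predictions = result.predict(X_test)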

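import matplotlib.pyplot as plt

# Training points (cyan) and the model's predictions over the test inputs (black)
plt.scatter(X_train, y_train, color='c')
plt.plot(X_test, predictions, color='k')
plt.show()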


from sklearn.metrics import r2_score
r2_score(y_test,predictions)


from sklearn import metrics

# Error metrics on the test set (np was imported in Step 1)
print('MAE: ', metrics.mean_absolute_error(y_test, predictions))
print('MSE: ', metrics.mean_squared_error(y_test, predictions))
print('RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
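
For readers who want to see what these metrics compute, the following sketch reproduces MAE, MSE, RMSE, and R² by hand with NumPy. The arrays are invented for illustration and are not taken from the father/son dataset.

```python
import numpy as np

# Hypothetical true and predicted values (invented for illustration).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print('MAE: ', mae)
print('MSE: ', mse)
print('RMSE: ', rmse)
print('R2: ', r2)
```

MAE and RMSE are in the same units as the target, which makes them easier to interpret than MSE; R² is unitless and compares the model against always predicting the mean.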

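import seaborn as sns

# Scatter plot with marginal histograms; newer seaborn releases accept x and y only as keyword arguments
sns.jointplot(x=df['Father'], y=df['Son'])
# Same relationship with hexagonal bins
sns.jointplot(x='Father', y='Son', data=df, kind='hex')

# Pairwise plots for every numeric column
sns.pairplot(df)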




Clustering in Machine Learning

Clustering is a type of unsupervised learning in machine learning where the goal is to group a set of objects in such a way that objects in...