Tuesday, June 14, 2022

PREPARE DATA FOR MACHINE LEARNING


When applying machine learning to real-world problems, data preparation is typically the first step. This is because a raw dataset usually contains a number of inconsistencies that must be fixed before it can be used to train machine learning or deep learning algorithms. Common problems with unprepared data include:

  • Missing Values in the Dataset
  • Different File Formats
  • Outlying Data Points
  • Inconsistent Variable Values
  • Irrelevant Feature Variables

STEPS INVOLVED IN DATA PREPARATION ARE AS FOLLOWS:

1. Gather data

Finding the appropriate data is the first step in the data preparation process. It might come from an existing database or be collected on the fly.

2. Discover and assess data

After the data has been collected, it is crucial to explore each dataset. The goal of this step is to understand the data and determine what needs to be done before it becomes useful in a given context.
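
For example, a few pandas calls are usually enough for an initial look at a freshly loaded dataset (the file name below is only a placeholder):

import pandas as pd

df = pd.read_csv('raw_data.csv')   # hypothetical file, for illustration only
df.head()                          # preview the first few rows
df.info()                          # column names, data types, and non-null counts
df.describe()                      # summary statistics for numeric columns
df.isnull().sum()                  # number of missing values per column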

3. Cleanse and validate data

Though generally the most time-consuming step in the data preparation process, cleaning the data is essential for eliminating inaccurate entries and filling in any gaps. Crucial tasks here include:

  • Eliminating irrelevant information and outliers
  • Filling in missing values
  • Standardizing data
  • Masking sensitive or private data entries
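
As a rough sketch, some of these tasks could look like this in pandas (assuming a DataFrame df, with placeholder columns 'age' and 'income'):

df = df.drop_duplicates()                          # remove duplicate records
df['age'] = df['age'].fillna(df['age'].median())   # fill gaps with the median age
mean, std = df['income'].mean(), df['income'].std()
df = df[(df['income'] - mean).abs() <= 3 * std]    # drop rows more than three standard deviations from the mean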

4. Transform and enrich data

Data transformation involves changing the format or value entries to achieve a specific result or to make the data more comprehensible to a larger audience. Enriching data refers to adding and connecting additional relevant information in order to deliver deeper insights.
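
As an illustration, both ideas might look like this in pandas (assuming the placeholder columns 'income' and 'city' and a hypothetical lookup file 'regions.csv'):

# Enrichment: attach region information from a separate lookup table
regions = pd.read_csv('regions.csv')
df = df.merge(regions, on='city', how='left')

# Transformation: standardize a numeric column and one-hot encode a categorical one
df['income_scaled'] = (df['income'] - df['income'].mean()) / df['income'].std()
df = pd.get_dummies(df, columns=['city'])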

5. Store data

When the data is ready, it can be saved or sent to a third-party program, like a business intelligence tool, opening the door for processing and analysis.
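
For instance, a prepared DataFrame can be written to a CSV file or loaded into a database (file and table names here are placeholders):

df.to_csv('prepared_data.csv', index=False)   # save to a flat file

import sqlite3
conn = sqlite3.connect('prepared.db')         # or push to a SQLite database
df.to_sql('prepared_data', conn, if_exists='replace', index=False)
conn.close()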

Step 1: Importing the pandas and NumPy libraries

import pandas as pd

import numpy as np

Step 2: Reading the data from a CSV file

df = pd.read_csv('father-son.csv')   # load the dataset into a DataFrame
df                                   # display the data

df.describe()   # summary statistics for the numeric columns

Step 3: Splitting the dataset into training and test sets


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Separate the feature (Father) from the target (Son)
X = df.drop("Son", axis=1)
y = df['Son']

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=112)

# Fit a linear regression model on the training data
my_model = LinearRegression()
result = my_model.fit(X_train, y_train)

# Predict on the held-out test set
predictions = result.predict(X_test)

# Plot the training points and the fitted regression line
plt.scatter(X_train, y_train, color='c')
plt.plot(X_test, predictions, color='k')
plt.show()


from sklearn.metrics import r2_score
r2_score(y_test, predictions)   # coefficient of determination on the test set


import numpy as np
from sklearn import metrics

# Common regression error metrics on the test set
print('MAE: ', metrics.mean_absolute_error(y_test, predictions))
print('MSE: ', metrics.mean_squared_error(y_test, predictions))
print('RMSE: ', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

import seaborn as sns

# Joint distribution of father and son heights
sns.jointplot(x='Father', y='Son', data=df)
sns.jointplot(x='Father', y='Son', data=df, kind='hex')

# Pairwise relationships between all numeric columns
sns.pairplot(df)




Thursday, May 19, 2022

Introduction to Python Programming

Python is a high-level, interpreted programming language known for its simplicity and readability. Created by Guido van Rossum and first released in 1991, Python has become one of the most popular programming languages in the world. It is widely used in various fields, from web development and data science to artificial intelligence and automation.

Why Choose Python?

Python’s design philosophy emphasizes code readability and simplicity, making it an ideal choice for both beginners and experienced developers. Here are some reasons why Python stands out:

  1. Ease of Learning: Python’s syntax is clear and straightforward, closely resembling the English language. This makes it easier to learn and write code.
  2. Versatility: Python is a general-purpose language, which means it can be used to build a wide variety of applications, including web applications, desktop applications, games, and more.
  3. Extensive Libraries and Frameworks: Python boasts a rich ecosystem of libraries and frameworks that simplify development tasks. For example, Django and Flask for web development, NumPy and pandas for data analysis, and TensorFlow and PyTorch for machine learning.
  4. Community Support: Python has a large and active community, which means you can find plenty of resources, tutorials, and forums to help you solve problems and learn new skills.
  5. Integration Capabilities: Python can easily integrate with other languages and technologies, making it a versatile tool in a developer’s toolkit.

Key Features of Python

  • Interpreted Language: Python is an interpreted language, meaning that your code is executed line by line. This allows for quick testing and debugging.
  • Dynamic Typing: Python uses dynamic typing, so you don’t need to declare the data type of a variable when you create one. This flexibility can speed up development.
  • Object-Oriented: Python supports object-oriented programming (OOP), which allows you to create classes and objects, enabling code reuse and modularity (see the short example after this list).
  • High-Level Language: Python abstracts away most of the complex details of the computer’s hardware, allowing you to focus on programming logic rather than low-level details.
  • Cross-Platform: Python is cross-platform, meaning it runs on various operating systems, including Windows, macOS, and Linux, without requiring modification.
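
A few lines of Python are enough to illustrate the dynamic typing and object-oriented features listed above:

value = 42            # the name is bound to an int
value = "forty-two"   # and can be rebound to a str, with no type declaration

class Greeter:
    def __init__(self, name):
        self.name = name

    def greet(self):
        return f"Hello, {self.name}!"

print(Greeter("Python").greet())   # prints: Hello, Python!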

Python’s Role in Modern Development

Python has cemented its role in modern development due to its applicability in emerging fields:

  • Data Science and Analytics: Python’s data manipulation and visualization libraries like pandas, Matplotlib, and Seaborn have made it a staple in data science.
  • Machine Learning and Artificial Intelligence: Libraries like TensorFlow, Keras, and Scikit-Learn have enabled Python to become a leading language in AI and machine learning development.
  • Web Development: Frameworks like Django and Flask have simplified the creation of robust and scalable web applications.
  • Automation and Scripting: Python’s simplicity makes it an excellent choice for writing scripts to automate repetitive tasks.

Getting Started with Python

To start programming in Python, you’ll need to set up your development environment:

  1. Install Python: Download and install the latest version of Python from the official website (python.org).
  2. Choose an Integrated Development Environment (IDE): Popular choices include PyCharm, VSCode, and Jupyter Notebook. These tools provide features like syntax highlighting, debugging, and project management.
  3. Write Your First Program: Open your IDE and write a simple “Hello, World!” program to get a feel for Python’s syntax.
print("Hello, World!")

Conclusion

Python is a powerful and versatile language that can open doors to numerous opportunities in various fields. Whether you’re a beginner looking to learn your first programming language or an experienced developer exploring new domains, Python provides the tools and resources you need to succeed.

Saturday, May 14, 2022

How to handle Outliers

Handling outliers in a dataset is an important step in data preprocessing to ensure that extreme values do not adversely affect the performance and accuracy of machine learning models. Here are some common approaches to handling outliers:

1. Detecting outliers:

  • Visual inspection: Plotting the data using box plots, scatter plots, or histograms can help identify outliers visually.
  • Statistical methods: Use statistical techniques such as z-scores, interquartile range (IQR), or modified z-scores to detect outliers based on their deviation from the mean or median.
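
As a sketch, the IQR rule can be applied with pandas (assuming a DataFrame df with a placeholder numeric column 'value'):

q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = df[(df['value'] < lower) | (df['value'] > upper)]   # rows flagged as outliers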

2. Handling outliers:

  • Removing outliers: If the outliers are due to data entry errors or represent extreme anomalies, it might be appropriate to remove them from the dataset. However, this should be done with caution, as removing too many outliers can lead to the loss of valuable information.
  • Transforming data: Applying mathematical transformations such as logarithmic, square root, or reciprocal transformations can help make the data more normally distributed and reduce the impact of outliers.
  • Winsorizing: Winsorization involves replacing extreme outlier values with less extreme values, for example capping the outliers at a certain percentile (e.g., replacing values above the 99th percentile with the 99th percentile value); see the sketch after this list.
  • Binning: Grouping continuous data into bins or intervals can help reduce the impact of outliers. Instead of using the raw values, you can assign the data points to the corresponding bin.
  • Imputation: If the outliers are due to missing values, imputation techniques such as mean, median, or regression-based imputation can be used to replace the outliers with plausible values.
  • Model-based approaches: Some machine learning algorithms are robust to outliers, such as robust regression or decision tree-based models. In such cases, using models that can handle outliers effectively might be a suitable approach.
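
For instance, winsorizing at the 1st and 99th percentiles and a log transformation might be sketched as follows (again assuming a DataFrame df with a placeholder column 'value'):

import numpy as np

low, high = df['value'].quantile([0.01, 0.99])
df['value_winsorized'] = df['value'].clip(lower=low, upper=high)   # cap extreme values at the chosen percentiles

df['value_log'] = np.log1p(df['value'])   # compress large values (assumes non-negative data)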

It's important to note that the choice of how to handle outliers depends on the specific dataset, the nature of the outliers, and the objectives of the analysis. It is advisable to carefully evaluate the impact of outlier handling on the overall data distribution and on downstream analysis or modeling tasks.

Additionally, it's crucial to document and report the handling of outliers in your analysis to ensure transparency and reproducibility.

Thursday, April 14, 2022

Machine Learning Algorithms: Python vs. R

Linear Regression


Using Python

#Import all necessary libraries like pandas, numpy, etc.

from sklearn import linear_model

#Load Train and Test datasets

#Identify feature(s) and response variable(s) and values must be numeric and numpy arrays 

X_train=input_variables_values_training_datasets 

y_train=target_variables_values_training_datasets 

x_test=input_variables_values_test_datasets

#Create linear regression object

linear = linear_model.LinearRegression()

#Train the model using the training sets and check score 

linear.fit(X_train, y_train) 

linear.score(X_train, y_train)

#Equation coefficient and Intercept

print('Coefficient: \n', linear.coef_)

print('Intercept: \n', linear.intercept_)

#Predict Output 

predicted = linear.predict(x_test)


Using R

#Load Train and Test datasets

#Identify feature and response variable(s); values must be numeric

X_train <- input_variables_values_training_datasets 

y_train <- target_variables_values_training_datasets 

x_test <- input_variables_values_test_datasets

x <- cbind(X_train, y_train)

#Train the model using the training sets and check score

linear <- lm(y_train ~ ., data = x)

summary(linear)

#Predict Output

predicted <- predict(linear, x_test)




Clustering in Machine Learning

Clustering is a type of unsupervised learning in machine learning where the goal is to group a set of objects in such a way that objects in...