Thursday, 25 August 2022

Machine Learning Basics - Cheat Sheet



Machine Learning Description

  • Machine Learning is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions.
  • Data from multiple fields is brought together, either through statistics or machine learning, so that it can be understood and insights extracted from it.
  • Usually there are two types of activities in data research :
    • Hypothesis Driven : Given a problem, what kind of data is needed to solve it.
    • Data Driven : Given data, what kind of problems could be solved with it.


Machine Learning Life Cycle

  • Identifying or understanding the business problem is very important
  • Collection of Data
    • Primary : Collecting new data ourselves, e.g. going for surveys
    • Secondary : Using existing files, tables
  • Data Cleaning, making sure to feed correctly formatted and accurate data to our ML model
    • Missing values imputations
    • Outliers
    • Generation of required columns
  • Data Transformation, converting all the data to quantitative data
  • Data Visualization, making it easy to understand the spread of and relationships within the data
    • input: Age, Standard, Subject, Gender, School
    • output: Marks
  • Model Data
  • Model Selection
    • Descriptive
    • Inferential
    • Predictive
  • Model Build
  • Model Evaluation
  • Insights / Prediction

Data Types 

  • Quantitative
    • Continuous
      • Ratio
        • Proportions can be taken by comparing values
        • Zero is absolute
        • Eg: A salary of 0 means no salary
        • Note: so we don't need to worry about data transformation here
      • Interval
        • Proportions cannot be taken
        • Zero is not absolute
        • Eg: Temperature at 0 degrees Celsius means the freezing point, but there are also values below it; we can transform this to Kelvin
        • Note: data can be transformed here
    • Discrete
      • Nominal
        • Red
        • Blue
        • Green
      • Ordinal
        • Good
        • Better
        • Best
      • Binary
        • Symmetric: equal importance
          • Female
          • Male
        • Asymmetric: significance given
          • Positive
          • Negative
  • Qualitative
    • Text
    • Images
    • Audio
    • Video


Data Cleaning

  • Process of turning raw data into analysable data
  • Note: Garbage In, Garbage Out
  • Methods :
    • Missing Values
      • Imputation using central tendency
        • Mean
          • 1, 2, 2,4,6,6
            • 1+2+2+4+6+6
            • 45/6
        • Median
        • Mode
      • Special Cases
        • Default value: Making a reasonable guess, based on the domain knowledge
          • 45,34,56,12,46,34,34 – Normal Class 
          • 65,89,12,90,56,60,60 – Upper Class
      • Imputation through predictions
        • Before or after values: also called nearest-neighbour imputation
        • Model Estimation: also called interpolation; using a model such as Linear Regression or KNN to predict the missing values
      • Dropping all records with missing values carries risk, so be careful; this should be the last option
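The imputation options above can be sketched with pandas; the marks values are made up for illustration:

```python
import pandas as pd

# Hypothetical marks column with two missing values
marks = pd.Series([45, 34, None, 12, 46, None, 34])

# Central-tendency imputation: fill gaps with the mean, median or mode
mean_filled   = marks.fillna(marks.mean())
median_filled = marks.fillna(marks.median())
mode_filled   = marks.fillna(marks.mode()[0])

# Nearest-neighbour style: carry the previous value forward
ffilled = marks.ffill()

print(mean_filled.tolist())
```

Dropping instead would be `marks.dropna()`, which loses two of the seven records.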
    • Outliers
      • Outlier values can affect the model's predictions.
      • Variance is important for prediction and should be controllable
      • There are two types of variance we must know :
        • Good Variance
          • Could be genuine; we need to take steps to handle it
          • Such variance can split the data into two sets
          • Eg: a Reliance supermarket, where we have both actual retailers and small shop owners; it is important to differentiate between them
        • Bad Variance
          • Could be because of wrong imputation
      • Identification of Outliers
        • Univariate
          • IQR
        • Bivariate
        • Multivariate
          • Regression/Residual analysis
          • So here we may need to build a model and use it to flag the outliers
      • Handling outliers
        • Identification as first step:
          • Eg: Age, Salary, Gender – for prediction of policy Buy/Not 
            • 32, 34, 40, 1, 80
            • 10L, 5L, 15L, 10K, 100L
            • 1,2,1,1,3
        • Scale
          • zscore
          • min-max
        • Replace
        • Remove, as a last option
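The IQR identification and z-score scaling steps above can be sketched as follows, using the made-up age column from the example:

```python
import numpy as np

# Hypothetical age column; 1 looks like a data-entry error
ages = np.array([32, 34, 40, 1, 80])

# Univariate identification with the IQR rule
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)  # [ 1 80]

# Z-score scaling shrinks the influence of extreme values
z = (ages - ages.mean()) / ages.std()
```

Whether 80 is bad variance (a typo) or good variance (a genuinely old customer) is a domain call, which is why removal is the last option.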


Data Transformation

  • All data needs to be converted to quantitative data
  • Models are stronger when the variance is small
  • Transformation is essential for distance-, slope-, or weight-based models, such as
    • Linear
    • Logistic
    • KNN
  • Transformation methods:
    • Log
    • MinMax
      • Xnorm = (X - Xmin) / (Xmax - Xmin)
      • Will always contain values between 0 and 1
    • Scaling
    • NLP
      • A set of preprocessing steps converting qualitative information to quantitative
      • Text Conversion : using a DTM (Document Term Matrix)
    • One Hot Encoding
      • Male/Female
        • 0,1
      • Red, Green, Blue
        • Red - [1,0,0]
        • Green - [0,1,0]
        • Blue - [0,0,1]
      • Each of the m categories becomes a vector of length m containing exactly one 1 (e.g. [r, g, b] becomes [[1,0,0],[0,1,0],[0,0,1]])
    • Label Encoding
      • Good, Bad, Ugly – 1,2,3
      • convert each distinct category into an arbitrary number (e.g. [r, g, b] becomes [1, 2, 3])
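A minimal pandas sketch of the min-max, one-hot, and label-encoding methods above, using made-up values:

```python
import pandas as pd

df = pd.DataFrame({"salary": [10, 5, 15, 20],
                   "colour": ["Red", "Green", "Blue", "Red"]})

# Min-max: Xnorm = (X - Xmin) / (Xmax - Xmin), always within [0, 1]
s = df["salary"]
df["salary_norm"] = (s - s.min()) / (s.max() - s.min())

# One-hot encoding: each category becomes its own 0/1 column
one_hot = pd.get_dummies(df["colour"])

# Label encoding: each category becomes an arbitrary integer code
df["colour_code"] = df["colour"].astype("category").cat.codes

print(df)
```

Note that label encoding imposes an artificial order (Blue < Green < Red here), which is why one-hot encoding is usually preferred for nominal data.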


Model Data

  • Modelling is the process of incorporating information into a tool which can forecast and make predictions. 
  • Usually, we are dealing with statistical modelling where we want to analyze relationships between variables. 
  • Formally, we want to estimate a function f (X) such that:
    • Y = f(X) + E
    • where 
      • X = (X1, X2, ..., Xp) represents the input variables
      • Y represents the output variable
      • E represents random error

Model Life Cycle

  • Model Selection
    • Descriptive
      • Visualization
    • Inferential
      • Statistical
      • When we want to understand the relationship between X and Y. We can no longer treat f̂ as a black box, since we want to understand how Y changes with respect to X = (X1, X2, ..., Xp)
    • Predictive
      • Machine Learning
      • Once we have a good estimate f̂(X), we can use it to make predictions on new data.
      • We treat f̂ as a black box, since we only care about the accuracy of the predictions, not why or how it works.
  • Model Build
  • Model Evaluation
    • Supervised
      • Regression
        • R2 score
        • Adj R2 score
      • Classification
        • Confusion matrix
        • Classification report
        • ROC
        • AUC



Model Selection

Inferential

  • Test of population parameters
  • Test of means
    • One Sample Test
    • Two Sample Test
    • ANOVA
    • ANCOVA
    • MANOVA
    • MANCOVA
    • N-Way ANOVA
  • Test of variance
    • Levene Test
    • F Test
  • Test of normality
    • Shapiro-Wilk Test
    • Anderson-Darling Test
  • Test of proportion
    • Chi-Square Test
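Several of the tests listed above are available in scipy.stats; a small sketch with simulated marks data for two groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(50, 5, 100)  # hypothetical marks, group A
b = rng.normal(55, 5, 100)  # hypothetical marks, group B

# Test of normality: Shapiro-Wilk (null: sample is normally distributed)
shapiro_stat, shapiro_p = stats.shapiro(a)

# Test of variance: Levene (null: the groups have equal variances)
levene_stat, levene_p = stats.levene(a, b)

# Test of means: two-sample t-test (null: the groups have equal means)
t_stat, t_p = stats.ttest_ind(a, b)
print(t_p < 0.05)  # True: the 5-point gap in means is detectable
```

In each case a small p-value (conventionally below 0.05) is evidence against the null hypothesis.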

Predictive

  • Supervised
    • Regression
      • The dependent variable is continuous
      • It gives point estimates
      • Models :
        • Linear Regression
        • KNN for regression
        • SVM for regression
        • Decision Tree for regression
        • Ensemble Techniques for Regression
          • Basic Ensemble Techniques
            • Max Voting
            • Averaging
            • Weighted Average
          • Advanced Ensemble Techniques
            • Stacking
            • Blending
            • Bagging
            • Boosting
          • Algorithms based on Bagging and Boosting
            • Bagging meta-estimator
            • Random Forest
            • AdaBoost
            • GBM
            • XGB
            • Light GBM
            • CatBoost
        • Neural Networks for Regression
        • Time Series Forecasting
          • The independent variable is always fixed, that is, time
          • Based on time, we predict the dependent variable
          • Models :
            • ARIMA model
              • Controlled noise
              • Also a parametric model
              • Eg: Sales forecast
            • ARCH and GARCH models
              • Uncontrolled data
              • Also nonparametric models
              • Eg: Stock market
    • Classification Models :
      • Logistic Regression
      • Naive Bayes
      • Discriminant Analysis
      • KNN for classification
      • SVM for classification
      • Decision Tree for classification
      • Neural Networks for classification

  • Unsupervised
    • Identifies groups but does not name them
    • After discovering the results, we can try to convert this further into Supervised Learning
    • Clustering
      • Finding similarities and dissimilarities
      • Find rows – grouping
      • Hierarchical
        • Agglomerative
        • Divisive
      • Non-Hierarchical
        • K Means
        • DAW
        • Clara
      • Fuzzy Clustering
      • Location Based Clustering
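A minimal K-Means sketch of the non-hierarchical clustering above, with made-up shopper data echoing the two-group supermarket example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two hypothetical groups: small shop owners vs bulk retailers
rng = np.random.default_rng(0)
small = rng.normal(loc=[10, 5], scale=1.0, size=(20, 2))
bulk  = rng.normal(loc=[100, 60], scale=5.0, size=(20, 2))
X = np.vstack([small, bulk])

# K-Means finds the K=2 row groupings without being told the labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

The model only outputs cluster numbers 0 and 1; naming them "retailers" and "shop owners" is up to us, which is the sense in which unsupervised learning identifies groups but does not name them.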
    • Dimension Reduction
      • Would also be used for preprocessing
        • column
        • correlation
      • Continuous Variables
        • Principal Component Analysis
        • Factor Analysis
      • Discrete Variables
        • Multiple Correspondence Analysis
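A short PCA sketch for the continuous-variable case, using made-up data in which two of three columns are strongly correlated:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=100)
# Three columns, two of them nearly collinear
X = np.column_stack([x,
                     2 * x + rng.normal(scale=0.1, size=100),
                     rng.normal(size=100)])

# Keep 2 principal components out of the 3 original columns
pca = PCA(n_components=2).fit(X)
reduced = pca.transform(X)
print(reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)   # most variance sits in the first component
```

Because the first two columns carry almost the same information, the first component alone captures most of the variance, which is exactly the correlation-based reduction mentioned above.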
  • Recommendation Systems
    • Eg: Between users and products, ranking
    • Models
      • Popularity Based
      • Market Basket Analysis
      • Association Analysis
      • Content Based Recommendation Systems
      • Collaborative Recommendation Systems
      • Model Based Recommendation Systems
      • Hybrid Recommendation Systems
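A popularity-based recommender, the simplest of the models above, can be sketched with made-up ratings:

```python
import pandas as pd

# Hypothetical user-product ratings
ratings = pd.DataFrame({
    "user":    ["u1", "u2", "u3", "u1", "u2"],
    "product": ["A",  "A",  "A",  "B",  "B"],
    "rating":  [5,    4,    5,    3,    2],
})

# Popularity-based: rank products by their average rating
popularity = (ratings.groupby("product")["rating"]
                     .mean()
                     .sort_values(ascending=False))
print(popularity.index[0])  # the product to recommend first
```

Every user gets the same ranking here; content-based and collaborative systems improve on this by personalising to each user.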

Model Evaluation

Classification

  • Confusion Matrix
    • An easy and popular method of diagnosing model performance.
      • TN : True Negative
      • TP : True Positive
      • FN : False Negative - Type II 
      • FP : False Positive - Type I
    • Eg:
Fig: Confusion Matrix
    • Type I and Type II errors are inversely related: intuitively, decreasing one tends to increase the other
    • Accuracy
      • Percentage of correct predictions
      • But in real-life scenarios, we may be more keen on precision and recall
      • For example, in fraud detection only about 1% of transactions are fraudulent, so a model that flags nothing can still score high accuracy; our real intention is to keep False Negatives low
      • Accuracy of model: (TP + TN) / (TP + TN + FP + FN)
    • Misclassification
      • Percentage of incorrect predictions
      • Formula : 
        • 1 - Accuracy
        • or 
        • (FP + FN) / (TP + TN + FP + FN)
  • Precision
    • Precision is the ratio TP / (TP + FP), where TP is the number of true positives and FP the number of false positives. 
    • Intuitively, precision captures the Type I error: the score ranges from 0 to 1 (convertible to a percentage) and drops as False Positives increase.
    • Eg:
      • 1)
        • TP = 1
        • FP = 3
        • Precision:
          • 1 / (1+3)
          • 1/4
          • .25
        • So here the score is only 0.25, as there are more FP
      • 2)
        • TP = 6
        • FP = 2
        • Precision :
          • 6 / (6+2)
          • .75
        • So here score is .75, as we have less FP
  • Recall/Sensitivity
    • Recall is the ratio TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives. 
    • Intuitively, recall captures the Type II error: the score ranges from 0 to 1 (convertible to a percentage) and drops as False Negatives increase.
  • Specificity
    • Specificity is the ratio TN / (TN + FP), where TN is the number of true negatives and FP the number of false positives. 
    • Intuitively, specificity is the ability to correctly identify negatives; it is tied to the Type I error (False Positives).
  • F1 Score
    • The harmonic mean of the precision and recall.
    • F1 score reaches its best value at 1 and worst score at 0.
    • Formula : (2 x Precision x Recall) / (Precision + Recall)
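All the formulas above map directly onto sklearn.metrics; the labels below are made up so the counts are easy to check by hand:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                   # 3 3 1 1
print(accuracy_score(y_true, y_pred))   # (TP+TN)/total = 6/8 = 0.75
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.75
```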
  • ROC
    • ROC stands for Receiver Operating Characteristics Curve
    • Used to measure the performance of a classification model at various threshold settings
    • This curve is plot using two parameters:
      • True Positive Rate
        • TPR = TP / (TP + FN)
        • Intuitively, the percentage of actual positives correctly identified
      • False Positive Rate
        • FPR = FP / (FP + TN)
        • Intuitively, the percentage of actual negatives incorrectly flagged as positive
    • This plot is drawn at different thresholds 
    • Visualization:
Fig: ROC

  • AUC
    • AUC stands for Area Under ROC Curve
    • Also written as AUROC (Area Under the Receiver Operating Characteristics)
    • AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
    • An AUC of 1 means the model separates the classes perfectly; in practice this is suspicious and often a sign of overfitting or data leakage, while an AUC of 0.5 means the model is no better than random guessing
    • The higher the AUC value, the better the performance
  • How AUC and ROC are used together
    • Case 1
      • When AUC is 1, the positive and negative score distributions never overlap; in practice, check for overfitting
      • No wrong predictions

    • Case 2
      • When AUC is 0.7, the distributions overlap a little
      • Probability of 0.3 wrong predictions
    • Case 3
      • When AUC is 0.5, the distributions overlap completely
      • Probability of 0.5 wrong predictions
    • Case 4
      • When AUC is 0, the distributions are perfectly swapped
      • Probability of all wrong predictions
    • Observations:
      • The threshold with the best TPR/FPR trade-off (the ROC point nearest the top-left corner) would be our best threshold
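A minimal ROC/AUC sketch with four made-up probability scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities
y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per threshold the scores allow
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```

AUC can be read as the probability that a randomly chosen positive is scored above a randomly chosen negative, which is why 0.5 corresponds to random guessing.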



Regression

  • RSquare
    • Goodness of fit of the best-fit model
    • The higher the R² value, the better the model
    • Formula : R^2 
      • = 1 - (SSRes/SSTotal)
      • = 1 - ( Σ (yi - ŷi)² / Σ (yi - ȳ)² )
    • SSRes
      • Sum of Squared Residuals : Σ (yi - ŷi)²
      • Residuals are the errors: actual Y values minus predicted values
      • The sum of the blue lines is SSRes
      • The reason we square is to make the negative values positive
    • SSTotal
      • Total Sum of Squares, around the mean : Σ (yi - ȳ)²
      • The mean line is in aqua blue
      • Here, instead of the slope (best fit), we use the mean line
      • And find the difference from each actual point to the Y mean
Credits: https://www.datasciencecentral.com/wp-content/uploads/2021/10/2742052271.jpg

    • Eg: Average Model
      • 1- 2/4
      • 1- 1/2 
      • 1- .5
      • 1-0.5
      • 0.5
    • Eg: Good Model
      • 1-1/4
      • 1-.25
      • .75
    • Eg: Bad Model
      • 1-4/4
      • 1-1
      • 0
    • Eg: Very Bad Model
      • 1-8/4
      • 1-2
      • -1
  • Adjusted R Square
    • Whenever new features are added, the R² value increases
    • So the Adjusted R² value is used when comparing 2 or more regression models with different independent variables
    • It helps us find out whether a newly added independent variable actually helps the model
    • Formula :
      • 1 - ( (1-R^2) * (N-1) / (N-P-1) )
      • N: no of rows or sample size
      • P: no of predictors or independent features
    • If the added independent variable is correlated with the target variable, Adjusted R² decreases only slightly (or increases).
    • Otherwise, Adjusted R² decreases sharply
  • Points to remember :
    • Every time we add an independent variable to a model, R² always increases
    • Even if there is no significant correlation with the target variable, it never declines
    • Whereas Adjusted R² increases only when the independent variable is significant and affects the dependent variable
    • Adjusted R Square value would always be less than or equal to R Square value
    • Reference : https://www.youtube.com/watch?v=WuuyD3Yr-js
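The R² and Adjusted R² formulas above in code, on simulated data (the coefficients are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=50)  # only column 0 matters

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))

# Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - P - 1)
n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 3), round(adj_r2, 3))  # Adjusted R^2 is never above R^2
```

Here the second column is pure noise, so it inflates R² slightly while the adjustment penalises it.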


Assumptions

Parametric Test

  • Parametric analyses are done when the population parameters are known or assumed known
  • They perform well even when the data is skewed or non-normal, given a large enough sample
  • The mean is used as the measure of central tendency
  • Interval and ratio are the measurement levels
  • Pros :
    • Proven to be highly powerful
  • Cons :
    • Expects population to be known
    • Complicated theory

Non Parametric Test

  • Nonparametric analyses are done when the population distribution is unknown.
  • The median is used as the measure of central tendency
  • Ordinal and nominal are the measurement levels
  • Pros :
    • Simple and easy to understand
    • No assumptions made
  • Cons :
    • Applicable only to nominal and ordinal scales
    • Assumes all data groups have the same spread
  • Refer : https://blog.minitab.com/blog/adventures-in-statistics-2/choosing-between-a-nonparametric-test-and-a-parametric-test

Balance

Overfitting

  • Overfitting refers to a model that learns the training data too well.

  • Reduces Overfitting :
    • Use a resampling technique to estimate model accuracy
    • Hold back a validation dataset

Underfitting

  • Underfitting refers to a model that can neither learn the training data nor generalize to new data.


Model Evaluation Error

  • Err(X) = Bias² + Variance + Irreducible Error

Random Error

  • It means the irreducible error
  • Usually the amount of noise in our data, which is unknown and uncontrollable

Bias error

  • Bias is the gap between the model's average prediction and the actual values
  • It is generally assessed via ROC_AUC in classification and RMSE in regression
  • Model with high bias pays very little attention to the training data and oversimplifies the model.
  • High bias is underfitting of the model
  • High bias in training will also lead to high bias in testing, so the model would be underfitted
  • Bias are the simplifying assumptions made by a model to make the target function easier to learn.
    • Low bias = Less assumptions
    • High bias = More assumptions
    • Low bias models :
      • Decision Tree, KNN, SVM
    • High bias models :
      • Linear, Logistic
  • Usually high bias and low variance occur in parametric ML algorithms, as they make assumptions about the population
  • Bias is also introduced by us, for example during feature selection: if we use too few features or add irrelevant ones, we add bias
  • Unscaled data also introduces bias, as features with larger values push the model toward them

Variance errors

  • Variance is the variability of model predictions: how consistently the model predicts across different samples of the same dataset
  • High variance means overfitting the model
  • High variance wins on the training set, but if a different set is introduced, the model's accuracy score suffers
  • Low variance: small changes to the estimate of the target function when the training dataset changes
  • High variance: large changes to the estimate of the target function when the training dataset changes
  • Usually high variance and low bias occur in non-parametric ML algorithms, as they make no assumptions about the population and have a lot of flexibility
  • High variance models :
    • Decision Tree, KNN, SVM
  • Low variance models :
    • Linear, Logistic

Bias-Variance Tradeoff

  • If our model is too simple, we may introduce high bias, which is underfitting
  • If our model is too complicated, we may introduce high variance, which is overfitting
  • Finding the sweet spot, where there is low variance and low bias, is important here
  • In reality there is no escaping the relationship between bias and variance in machine learning
    • Increasing the bias will decrease the variance
    • Increasing the variance will decrease the bias
  • So we need to find the optimal balance
Fig: Bias Tradeoff
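The tradeoff above can be seen by fitting polynomials of increasing degree to noisy data: degree 1 underfits (high bias), degree 15 overfits (high variance). The data here is simulated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample(n=30):
    # Noisy sine wave: the "true" relationship plus random error
    x = np.sort(rng.uniform(0, 1, n))
    return x[:, None], np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

X_train, y_train = sample()
X_test, y_test = sample()

errs = {}
for degree in (1, 4, 15):  # too simple, balanced, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    errs[degree] = (np.mean((model.predict(X_train) - y_train) ** 2),
                    np.mean((model.predict(X_test) - y_test) ** 2))
    print(degree, errs[degree])
```

Training error keeps falling as the degree grows, but test error is lowest at the balanced degree: that sweet spot is the low-bias, low-variance point.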


Problem of Imbalance

  • When there is an imbalance of class information, models tend to have low sensitivity
  • In such cases,
    • Over-sampling records to restore class balance might work
    • Ensemble methods might work
    • Changing the cut off or threshold of predicted probability might also work
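The over-sampling option above can be sketched with sklearn's resample utility; the class counts are made up:

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 95 negatives, 5 positives
X = np.arange(100)[:, None]
y = np.array([0] * 95 + [1] * 5)

# Over-sample the minority class (with replacement) up to the majority size
minority = X[y == 1]
upsampled = resample(minority, replace=True, n_samples=95, random_state=0)

X_bal = np.vstack([X[y == 0], upsampled])
y_bal = np.array([0] * 95 + [1] * 95)
print(np.bincount(y_bal))  # [95 95]
```

Over-sampling should be done only on the training split, never before the train/test split, or the duplicated rows leak into the test set.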


Cross Validation

  • When data is limited or imbalanced, we can try to address this using cross-validation (CV)
  • Usually we split the data into Train and Test.
  • But using CV we can make a number of train/test splits within the training data
  • As several samples are generated from the same training data, we have a much better chance of detecting overfitting
  • Used when :
    • When the data is noisy, (high outliers)
    • When the model is a greedy algorithm, because such a model never guarantees the optimal value
      • eg: decision tree, as in step 1 it does not know the further steps
    • Generally, when there are hyperparameters for the model
      • eg: KNN, where the k value is used for model building; the best k emerges only after analysis (could it be 3, 5, 11, ...?)
  • Types:
    • K-Fold Cross Validation
      • Take the training data and split it into K equal parts
      • Each of the K parts takes one turn as Test while the others form Train
      • Drawback: computationally expensive
      • Sample Source :
        • from sklearn.model_selection import cross_val_score
        • scores = cross_val_score(model, X, y, cv=5)  # model, X and y defined beforehand
        • print(scores)
      • Visualization:

    • Leave P-out Cross Validation
    • Leave One-out Cross Validation
    • Repeated Random Sub-sampling Method
    • Holdout Method
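A runnable version of the K-Fold sample above, on simulated regression data with a KNN model (both choices arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data; in practice X, y come from the cleaned training set
X, y = make_regression(n_samples=100, n_features=4, noise=10, random_state=0)

# 5-fold CV: each of the 5 parts takes one turn as the test fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsRegressor(n_neighbors=5), X, y, cv=cv)
print(scores)          # one R^2 score per fold
print(scores.mean())   # the averaged estimate of model performance
```

A large spread between fold scores, or a mean far below the training score, is the overfitting signal CV is designed to expose.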


References and Credit

https://www.wordstream.com/blog/ws/2017/07/28/machine-learning-applications

Several Medium, Wikipedia and other sources.

