Thursday, 25 August 2022

Machine Learning Basics - Cheat Sheet



Machine Learning Description

  • Machine Learning is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions.
  • Data from multiple fields is brought together, either through statistics or machine learning, so that it can be understood and insights extracted from it.
  • Usually there are two types of activities in data research :
    • Hypothesis Driven : Given a problem, what kind of data is needed to solve it.
    • Data Driven : Given data, what kind of problems could be solved with it.


Machine Learning Life Cycle

  • Identifying or understanding the business problem is very important
  • Collection of Data
    • Primary : Collecting new data ourselves, e.g. going for surveys
    • Secondary : Using existing files, tables
  • Data Cleaning, making sure to feed correctly formatted and accurate data to our ML model
    • Missing values imputations
    • Outliers
    • Generation of required columns
  • Data Transformation, converting all the data to quantitative data
  • Data Visualization, making it easy to understand the spread of and relationships within the data
    • input: Age, Standard, Subject, Gender, School
    • output: Marks
  • Model Data
  • Model Selection
    • Descriptive
    • Inferential
    • Predictive
  • Model Build
  • Model Evaluation
  • Insights / Prediction

Data Types 

  • Quantitative
    • Continuous
      • Ratio
        • Proportions can be taken by comparing values
        • Zero is absolute
        • Eg: A salary of 0 means no salary
        • Note: so we don't need to worry about data transformation here
      • Interval
        • Proportions cannot be taken
        • Zero is not absolute
        • Eg: Temperature at 0 degrees Celsius means the freezing point, but there are also values below it; we can transform this to Kelvin
        • Note: data can be transformed here
    • Discrete
      • Nominal
        • Red
        • Blue
        • Green
      • Ordinal
        • Good
        • Better
        • Best
      • Binary
        • Symmetric: equal importance
          • Female
          • Male
        • Asymmetric: significance given
          • Positive
          • Negative
  • Qualitative
    • Text
    • Images
    • Audio
    • Video


Data Cleaning

  • Process of turning raw data into analysable data
  • Note: Garbage In, Garbage Out
  • Methods :
    • Missing Values
      • Imputation using central tendency
        • Mean
          • 1, 2, 2,4,6,6
            • 1+2+2+4+6+6
            • 45/6
        • Median
        • Mode
      • Special Cases
        • Default value: Making a reasonable guess, based on the domain knowledge
          • 45,34,56,12,46,34,34 – Normal Class 
          • 65,89,12,90,56,60,60 – Upper Class
      • Imputation through predictions
        • Before or after values: also called nearest-neighbour imputation
        • Model Estimation: also called interpolation; using a model such as Linear Regression or KNN to predict the missing values
      • Dropping all records with missing values carries risk, so be careful; this should be the last option
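The imputation options above can be sketched with pandas; the marks values are made up for illustration:

```python
import pandas as pd

# Hypothetical marks column with two missing values
marks = pd.Series([45, 34, None, 12, 46, None, 34])

# Central-tendency imputation: fill gaps with the mean, median or mode
mean_filled   = marks.fillna(marks.mean())
median_filled = marks.fillna(marks.median())
mode_filled   = marks.fillna(marks.mode()[0])

# Nearest-neighbour style: carry the previous value forward
ffilled = marks.ffill()

print(mean_filled.tolist())
```

Dropping instead would be `marks.dropna()`, which loses two of the seven records.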
    • Outliers
      • Outlier values can affect the model's predictions.
      • Variance is important for prediction and should be controllable
      • There are two types of variance we must know :
        • Good Variance
          • Could be genuine; we need to take steps to handle it
          • Such variance can split the data into two sets
          • Eg: a Reliance supermarket, where we have both actual retailers and small shop owners; it is important to differentiate between them
        • Bad Variance
          • Could be because of wrong imputation
      • Identification of Outliers
        • Univariate
          • IQR
        • Bivariate
        • Multivariate
          • Regression/Residual analysis
          • So here we may need to build a model and use it to flag the outliers
      • Handling outliers
        • Identification as first step:
          • Eg: Age, Salary, Gender – for prediction of policy Buy/Not 
            • 32, 34, 40, 1, 80
            • 10L, 5L, 15L, 10K, 100L
            • 1,2,1,1,3
        • Scale
          • zscore
          • min-max
        • Replace
        • Remove, as a last option
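The IQR identification and z-score scaling steps above can be sketched as follows, using the made-up age column from the example:

```python
import numpy as np

# Hypothetical age column; 1 looks like a data-entry error
ages = np.array([32, 34, 40, 1, 80])

# Univariate identification with the IQR rule
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)  # [ 1 80]

# Z-score scaling shrinks the influence of extreme values
z = (ages - ages.mean()) / ages.std()
```

Whether 80 is bad variance (a typo) or good variance (a genuinely old customer) is a domain call, which is why removal is the last option.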


Data Transformation

  • All data needs to be converted to quantitative data
  • Models are stronger when the variance is small
  • Transformation is essential for distance-, slope-, or weight-based models, such as
    • Linear
    • Logistic
    • KNN
  • Transformation methods:
    • Log
    • MinMax
      • Xnorm = (X - Xmin) / (Xmax - Xmin)
      • Will always contain values between 0 and 1
    • Scaling
    • NLP
      • A set of preprocessing steps converting qualitative information to quantitative
      • Text Conversion : using a DTM (Document Term Matrix)
    • One Hot Encoding
      • Male/Female
        • 0,1
      • Red, Green, Blue
        • Red - [1,0,0]
        • Green - [0,1,0]
        • Blue - [0,0,1]
      • Each of the m categories becomes a vector of length m containing exactly one 1 (e.g. [r, g, b] becomes [[1,0,0],[0,1,0],[0,0,1]])
    • Label Encoding
      • Good, Bad, Ugly – 1,2,3
      • convert each distinct category into an arbitrary number (e.g. [r, g, b] becomes [1, 2, 3])
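A minimal pandas sketch of the min-max, one-hot, and label-encoding methods above, using made-up values:

```python
import pandas as pd

df = pd.DataFrame({"salary": [10, 5, 15, 20],
                   "colour": ["Red", "Green", "Blue", "Red"]})

# Min-max: Xnorm = (X - Xmin) / (Xmax - Xmin), always within [0, 1]
s = df["salary"]
df["salary_norm"] = (s - s.min()) / (s.max() - s.min())

# One-hot encoding: each category becomes its own 0/1 column
one_hot = pd.get_dummies(df["colour"])

# Label encoding: each category becomes an arbitrary integer code
df["colour_code"] = df["colour"].astype("category").cat.codes

print(df)
```

Note that label encoding imposes an artificial order (Blue < Green < Red here), which is why one-hot encoding is usually preferred for nominal data.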


Model Data

  • Modelling is the process of incorporating information into a tool which can forecast and make predictions. 
  • Usually, we are dealing with statistical modelling where we want to analyze relationships between variables. 
  • Formally, we want to estimate a function f (X) such that:
    • Y = f(X) + E
    • where 
      • X = (X1, X2, ..., Xp) represents the input variables
      • Y represents the output variable
      • E represents random error

Model Life Cycle

  • Model Selection
    • Descriptive
      • Visualization
    • Inferential
      • Statistical
      • When we want to understand the relationship between X and Y. We can no longer treat f̂ as a black box, since we want to understand how Y changes with respect to X = (X1, X2, ..., Xp)
    • Predictive
      • Machine Learning
      • Once we have a good estimate f̂(X), we can use it to make predictions on new data.
      • We treat f̂ as a black box, since we only care about the accuracy of the predictions, not why or how it works.
  • Model Build
  • Model Evaluation
    • Supervised
      • Regression
        • R2 score
        • Adj R2 score
      • Classification
        • Confusion matrix
        • Classification report
        • ROC
        • AUC



Model Selection

Inferential

  • Test of population parameters
  • Test of means
    • One Sample Test
    • Two Sample Test
    • ANOVA
    • ANCOVA
    • MANOVA
    • MANCOVA
    • N-Way ANOVA
  • Test of variance
    • Levene Test
    • F Test
  • Test of normality
    • Shapiro-Wilk Test
    • Anderson-Darling Test
  • Test of proportion
    • Chi-Square Test
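Several of the tests listed above are available in scipy.stats; a small sketch with simulated marks data for two groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(50, 5, 100)  # hypothetical marks, group A
b = rng.normal(55, 5, 100)  # hypothetical marks, group B

# Test of normality: Shapiro-Wilk (null: sample is normally distributed)
shapiro_stat, shapiro_p = stats.shapiro(a)

# Test of variance: Levene (null: the groups have equal variances)
levene_stat, levene_p = stats.levene(a, b)

# Test of means: two-sample t-test (null: the groups have equal means)
t_stat, t_p = stats.ttest_ind(a, b)
print(t_p < 0.05)  # True: the 5-point gap in means is detectable
```

In each case a small p-value (conventionally below 0.05) is evidence against the null hypothesis.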

Predictive

  • Supervised
    • Regression
      • The dependent variable is continuous
      • It gives point estimates
      • Models :
        • Linear Regression
        • KNN for regression
        • SVM for regression
        • Decision Tree for regression
        • Ensemble Techniques for Regression
          • Basic Ensemble Techniques
            • Max Voting
            • Averaging
            • Weighted Average
          • Advanced Ensemble Techniques
            • Stacking
            • Blending
            • Bagging
            • Boosting
          • Algorithms based on Bagging and Boosting
            • Bagging meta-estimator
            • Random Forest
            • AdaBoost
            • GBM
            • XGB
            • Light GBM
            • CatBoost
        • Neural Networks for Regression
        • Time Series Forecasting
          • The independent variable is always fixed, that is, time
          • Based on time, we predict the dependent variable
          • Models :
            • ARIMA model
              • Controlled noise
              • Also a parametric model
              • Eg: Sales forecast
            • ARCH and GARCH models
              • Uncontrolled data
              • Also nonparametric models
              • Eg: Stock market
    • Classification Models :
      • Logistic Regression
      • Naive Bayes
      • Discriminant Analysis
      • KNN for classification
      • SVM for classification
      • Decision Tree for classification
      • Neural Networks for classification

  • Unsupervised
    • Identifies groups but does not name them
    • After discovering the results, we can try to convert this further into Supervised Learning
    • Clustering
      • Finding similarities and dissimilarities
      • Find rows – grouping
      • Hierarchical
        • Agglomerative
        • Divisive
      • Non-Hierarchical
        • K Means
        • DAW
        • Clara
      • Fuzzy Clustering
      • Location Based Clustering
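A minimal K-Means sketch of the non-hierarchical clustering above, with made-up shopper data echoing the two-group supermarket example:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two hypothetical groups: small shop owners vs bulk retailers
rng = np.random.default_rng(0)
small = rng.normal(loc=[10, 5], scale=1.0, size=(20, 2))
bulk  = rng.normal(loc=[100, 60], scale=5.0, size=(20, 2))
X = np.vstack([small, bulk])

# K-Means finds the K=2 row groupings without being told the labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

The model only outputs cluster numbers 0 and 1; naming them "retailers" and "shop owners" is up to us, which is the sense in which unsupervised learning identifies groups but does not name them.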
    • Dimension Reduction
      • Would also be used for preprocessing
        • column
        • correlation
      • Continuous Variables
        • Principal Component Analysis
        • Factor Analysis
      • Discrete Variables
        • Multiple Correspondence Analysis
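A short PCA sketch for the continuous-variable case, using made-up data in which two of three columns are strongly correlated:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
x = rng.normal(size=100)
# Three columns, two of them nearly collinear
X = np.column_stack([x,
                     2 * x + rng.normal(scale=0.1, size=100),
                     rng.normal(size=100)])

# Keep 2 principal components out of the 3 original columns
pca = PCA(n_components=2).fit(X)
reduced = pca.transform(X)
print(reduced.shape)                   # (100, 2)
print(pca.explained_variance_ratio_)   # most variance sits in the first component
```

Because the first two columns carry almost the same information, the first component alone captures most of the variance, which is exactly the correlation-based reduction mentioned above.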
  • Recommendation Systems
    • Eg: Between users and products, ranking
    • Models
      • Popularity Based
      • Market Basket Analysis
      • Association Analysis
      • Content Based Recommendation Systems
      • Collaborative Recommendation Systems
      • Model Based Recommendation Systems
      • Hybrid Recommendation Systems
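A popularity-based recommender, the simplest of the models above, can be sketched with made-up ratings:

```python
import pandas as pd

# Hypothetical user-product ratings
ratings = pd.DataFrame({
    "user":    ["u1", "u2", "u3", "u1", "u2"],
    "product": ["A",  "A",  "A",  "B",  "B"],
    "rating":  [5,    4,    5,    3,    2],
})

# Popularity-based: rank products by their average rating
popularity = (ratings.groupby("product")["rating"]
                     .mean()
                     .sort_values(ascending=False))
print(popularity.index[0])  # the product to recommend first
```

Every user gets the same ranking here; content-based and collaborative systems improve on this by personalising to each user.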

Model Evaluation

Classification

  • Confusion Matrix
    • An easy and popular method of diagnosing model performance.
      • TN : True Negative
      • TP : True Positive
      • FN : False Negative - Type II 
      • FP : False Positive - Type I
    • Eg:
Fig: Confusion Matrix
    • Type I and Type II errors are inversely related: intuitively, decreasing one tends to increase the other
    • Accuracy
      • Percentage of correct predictions
      • But in real-life scenarios, we may be more keen on precision and recall
      • For example, in fraud detection only about 1% of transactions are fraudulent, so a model that flags nothing can still score high accuracy; our real intention is to keep False Negatives low
      • Accuracy of model: (TP + TN) / (TP + TN + FP + FN)
    • Misclassification
      • Percentage of incorrect predictions
      • Formula : 
        • 1 - Accuracy
        • or 
        • (FP + FN) / (TP + TN + FP + FN)
  • Precision
    • Precision is the ratio TP / (TP + FP), where TP is the number of true positives and FP the number of false positives. 
    • Intuitively, precision captures the Type I error: the score ranges from 0 to 1 (convertible to a percentage) and drops as False Positives increase.
    • Eg:
      • 1)
        • TP = 1
        • FP = 3
        • Precision:
          • 1 / (1+3)
          • 1/4
          • .25
        • So here the score is only 0.25, as there are more FP
      • 2)
        • TP = 6
        • FP = 2
        • Precision :
          • 6 / (6+2)
          • .75
        • So here score is .75, as we have less FP
  • Recall/Sensitivity
    • Recall is the ratio TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives. 
    • Intuitively, recall captures the Type II error: the score ranges from 0 to 1 (convertible to a percentage) and drops as False Negatives increase.
  • Specificity
    • Specificity is the ratio TN / (TN + FP), where TN is the number of true negatives and FP the number of false positives. 
    • Intuitively, specificity is the ability to correctly identify negatives; it is tied to the Type I error (False Positives).
  • F1 Score
    • The harmonic mean of the precision and recall.
    • F1 score reaches its best value at 1 and worst score at 0.
    • Formula : (2 x Precision x Recall) / (Precision + Recall)
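All the formulas above map directly onto sklearn.metrics; the labels below are made up so the counts are easy to check by hand:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Hypothetical true labels and model predictions
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                   # 3 3 1 1
print(accuracy_score(y_true, y_pred))   # (TP+TN)/total = 6/8 = 0.75
print(precision_score(y_true, y_pred))  # TP/(TP+FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))     # TP/(TP+FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.75
```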
  • ROC
    • ROC stands for Receiver Operating Characteristics Curve
    • Used to measure the performance of a classification model at various threshold settings
    • This curve is plot using two parameters:
      • True Positive Rate
        • TPR = TP / (TP + FN)
        • Intuitively, the percentage of actual positives correctly identified
      • False Positive Rate
        • FPR = FP / (FP + TN)
        • Intuitively, the percentage of actual negatives incorrectly flagged as positive
    • This plot is drawn at different thresholds 
    • Visualization:
Fig: ROC

  • AUC
    • AUC stands for Area Under ROC Curve
    • Also written as AUROC (Area Under the Receiver Operating Characteristics)
    • AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
    • An AUC of 1 means the model separates the classes perfectly; in practice this is suspicious and often a sign of overfitting or data leakage, while an AUC of 0.5 means the model is no better than random guessing
    • The higher the AUC value, the better the performance
  • How AUC and ROC are used together
    • Case 1
      • When AUC is 1, the positive and negative score distributions never overlap; in practice, check for overfitting
      • No wrong predictions

    • Case 2
      • When AUC is 0.7, the distributions overlap a little
      • Probability of 0.3 wrong predictions
    • Case 3
      • When AUC is 0.5, the distributions overlap completely
      • Probability of 0.5 wrong predictions
    • Case 4
      • When AUC is 0, the distributions are perfectly swapped
      • Probability of all wrong predictions
    • Observations:
      • The threshold with the best TPR/FPR trade-off (the ROC point nearest the top-left corner) would be our best threshold
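A minimal ROC/AUC sketch with four made-up probability scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities
y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# One (FPR, TPR) point per threshold the scores allow
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```

AUC can be read as the probability that a randomly chosen positive is scored above a randomly chosen negative, which is why 0.5 corresponds to random guessing.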



Regression

  • RSquare
    • Goodness of fit of the best-fit model
    • The higher the R² value, the better the model
    • Formula : R^2 
      • = 1 - (SSRes/SSTotal)
      • = 1 - ( Σ (yi - ŷi)² / Σ (yi - ȳ)² )
    • SSRes
      • Sum of Squared Residuals : Σ (yi - ŷi)²
      • Residuals are the errors: actual Y values minus predicted values
      • The sum of the blue lines is SSRes
      • The reason we square is to make the negative values positive
    • SSTotal
      • Total Sum of Squares, around the mean : Σ (yi - ȳ)²
      • The mean line is in aqua blue
      • Here, instead of the slope (best fit), we use the mean line
      • And find the difference from each actual point to the Y mean
Credits: https://www.datasciencecentral.com/wp-content/uploads/2021/10/2742052271.jpg

    • Eg: Average Model
      • 1- 2/4
      • 1- 1/2 
      • 1- .5
      • 1-0.5
      • 0.5
    • Eg: Good Model
      • 1-1/4
      • 1-.25
      • .75
    • Eg: Bad Model
      • 1-4/4
      • 1-1
      • 0
    • Eg: Very Bad Model
      • 1-8/4
      • 1-2
      • -1
  • Adjusted R Square
    • Whenever new features are added, the R² value increases
    • So the Adjusted R² value is used when comparing 2 or more regression models with different independent variables
    • It helps us find out whether a newly added independent variable actually helps the model
    • Formula :
      • 1 - ( (1-R^2) * (N-1) / (N-P-1) )
      • N: no of rows or sample size
      • P: no of predictors or independent features
    • If the added independent variable is correlated with the target variable, Adjusted R² decreases only slightly (or increases).
    • Otherwise, Adjusted R² decreases sharply
  • Points to remember :
    • Every time we add an independent variable to a model, R² always increases
    • Even if there is no significant correlation with the target variable, it never declines
    • Whereas Adjusted R² increases only when the independent variable is significant and affects the dependent variable
    • Adjusted R Square value would always be less than or equal to R Square value
    • Reference : https://www.youtube.com/watch?v=WuuyD3Yr-js
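The R² and Adjusted R² formulas above in code, on simulated data (the coefficients are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 2 * X[:, 0] + rng.normal(scale=0.5, size=50)  # only column 0 matters

model = LinearRegression().fit(X, y)
r2 = r2_score(y, model.predict(X))

# Adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - P - 1)
n, p = X.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(r2, 3), round(adj_r2, 3))  # Adjusted R^2 is never above R^2
```

Here the second column is pure noise, so it inflates R² slightly while the adjustment penalises it.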


Assumptions

Parametric Test

  • Parametric analyses are done when the population parameters are known or assumed known
  • They perform well even when the data is skewed or non-normal, given a large enough sample
  • The mean is used as the measure of central tendency
  • Interval and ratio are the measurement levels
  • Pros :
    • Proven to be highly powerful
  • Cons :
    • Expects population to be known
    • Complicated theory

Non Parametric Test

  • Nonparametric analyses are done when the population distribution is unknown.
  • The median is used as the measure of central tendency
  • Ordinal and nominal are the measurement levels
  • Pros :
    • Simple and easy to understand
    • No assumptions made
  • Cons :
    • Applicable only to nominal and ordinal scales
    • Assumes all data groups have the same spread
  • Refer : https://blog.minitab.com/blog/adventures-in-statistics-2/choosing-between-a-nonparametric-test-and-a-parametric-test

Balance

Overfitting

  • Overfitting refers to a model that learns the training data too well.

  • Reduces Overfitting :
    • Use a resampling technique to estimate model accuracy
    • Hold back a validation dataset

Underfitting

  • Underfitting refers to a model that can neither learn the training data nor generalize to new data.


Model Evaluation Error

  • Err(X) = Bias² + Variance + Irreducible Error

Random Error

  • It means the irreducible error
  • Usually the amount of noise in our data, which is unknown and uncontrollable

Bias error

  • Bias is the gap between the model's average prediction and the actual values
  • It is generally assessed via ROC_AUC in classification and RMSE in regression
  • Model with high bias pays very little attention to the training data and oversimplifies the model.
  • High bias is underfitting of the model
  • High bias in training will also lead to high bias in testing, so the model would be underfitted
  • Bias are the simplifying assumptions made by a model to make the target function easier to learn.
    • Low bias = Less assumptions
    • High bias = More assumptions
    • Low bias models :
      • Decision Tree, KNN, SVM
    • High bias models :
      • Linear, Logistic
  • Usually high bias and low variance occur in parametric ML algorithms, as they make assumptions about the population
  • Bias is also introduced by us, for example during feature selection: if we use too few features or add irrelevant ones, we add bias
  • Unscaled data also introduces bias, as features with larger values push the model toward them

Variance errors

  • Variance is the variability of model predictions: how consistently the model predicts across different samples of the same dataset
  • High variance means overfitting the model
  • High variance wins on the training set, but if a different set is introduced, the model's accuracy score suffers
  • Low variance: small changes to the estimate of the target function when the training dataset changes
  • High variance: large changes to the estimate of the target function when the training dataset changes
  • Usually high variance and low bias occur in non-parametric ML algorithms, as they make no assumptions about the population and have a lot of flexibility
  • High variance models :
    • Decision Tree, KNN, SVM
  • Low variance models :
    • Linear, Logistic

Bias-Variance Tradeoff

  • If our model is too simple, we may introduce high bias, which is underfitting
  • If our model is too complicated, we may introduce high variance, which is overfitting
  • Finding the sweet spot, where there is low variance and low bias, is important here
  • In reality there is no escaping the relationship between bias and variance in machine learning
    • Increasing the bias will decrease the variance
    • Increasing the variance will decrease the bias
  • So we need to find the optimal balance
Fig: Bias Tradeoff
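The tradeoff above can be seen by fitting polynomials of increasing degree to noisy data: degree 1 underfits (high bias), degree 15 overfits (high variance). The data here is simulated:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample(n=30):
    # Noisy sine wave: the "true" relationship plus random error
    x = np.sort(rng.uniform(0, 1, n))
    return x[:, None], np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)

X_train, y_train = sample()
X_test, y_test = sample()

errs = {}
for degree in (1, 4, 15):  # too simple, balanced, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    errs[degree] = (np.mean((model.predict(X_train) - y_train) ** 2),
                    np.mean((model.predict(X_test) - y_test) ** 2))
    print(degree, errs[degree])
```

Training error keeps falling as the degree grows, but test error is lowest at the balanced degree: that sweet spot is the low-bias, low-variance point.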


Problem of Imbalance

  • When there is an imbalance of class information, models tend to have low sensitivity
  • In such cases,
    • Over-sampling records to restore class balance might work
    • Ensemble methods might work
    • Changing the cut off or threshold of predicted probability might also work
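The over-sampling option above can be sketched with sklearn's resample utility; the class counts are made up:

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical imbalanced dataset: 95 negatives, 5 positives
X = np.arange(100)[:, None]
y = np.array([0] * 95 + [1] * 5)

# Over-sample the minority class (with replacement) up to the majority size
minority = X[y == 1]
upsampled = resample(minority, replace=True, n_samples=95, random_state=0)

X_bal = np.vstack([X[y == 0], upsampled])
y_bal = np.array([0] * 95 + [1] * 95)
print(np.bincount(y_bal))  # [95 95]
```

Over-sampling should be done only on the training split, never before the train/test split, or the duplicated rows leak into the test set.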


Cross Validation

  • When data is limited or imbalanced, we can try to address this using cross-validation (CV)
  • Usually we split the data into Train and Test.
  • But using CV we can make a number of train/test splits within the training data
  • As several samples are generated from the same training data, we have a much better chance of detecting overfitting
  • Used when :
    • When the data is noisy, (high outliers)
    • When the model is a greedy algorithm, because such a model never guarantees the optimal value
      • eg: decision tree, as in step 1 it does not know the further steps
    • Generally, when there are hyperparameters for the model
      • eg: KNN, where the k value is used for model building; the best k emerges only after analysis (could it be 3, 5, 11, ...?)
  • Types:
    • K-Fold Cross Validation
      • Take the training data and split it into K equal parts
      • Each of the K parts takes one turn as Test while the others form Train
      • Drawback: computationally expensive
      • Sample Source :
        • from sklearn.model_selection import cross_val_score
        • scores = cross_val_score(model, X, y, cv=5)  # model, X and y defined beforehand
        • print(scores)
      • Visualization:

    • Leave P-out Cross Validation
    • Leave One-out Cross Validation
    • Repeated Random Sub-sampling Method
    • Holdout Method
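A runnable version of the K-Fold sample above, on simulated regression data with a KNN model (both choices arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical data; in practice X, y come from the cleaned training set
X, y = make_regression(n_samples=100, n_features=4, noise=10, random_state=0)

# 5-fold CV: each of the 5 parts takes one turn as the test fold
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsRegressor(n_neighbors=5), X, y, cv=cv)
print(scores)          # one R^2 score per fold
print(scores.mean())   # the averaged estimate of model performance
```

A large spread between fold scores, or a mean far below the training score, is the overfitting signal CV is designed to expose.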


References and Credit

https://www.wordstream.com/blog/ws/2017/07/28/machine-learning-applications

Several Medium, Wikipedia and other sources.

