Machine Learning Description
- Machine Learning is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions.
- Data from multiple fields is brought together and, through statistical or machine learning methods, understood so that insights can be extracted from it.
- Usually there are two types of activities in data research :
- Hypothesis Driven : Given a problem, what kind of data is needed to solve it?
- Data Driven : Given data, what kinds of problems could be solved with it?
Machine Learning Life Cycle
- Identification or Understanding of the Business Problem is very important
- Collection of Data
- Primary : Collecting fresh data, e.g. by going for surveys
- Secondary : Existing files, tables
- Data Cleaning, making sure to feed the right format and correct data to our ML model
- Missing value imputation
- Outliers
- Generation of required columns
- Data Transformation, converting all the data to quantitative data
- Data Visualization, making it easy to understand the spread of and relationships within the data
- input: Age, Standard, Subject, Gender, School
- output: Marks
- Model Data
- Model Selection
- Descriptive
- Inferential
- Predictive
- Model Build
- Model Evaluation
- Insights / Prediction
Data Types
- Quantitative
- Continuous
- Ratio
- Proportions can be taken by comparing
- Zero is absolute
- Eg: Salary as 0, means no salary
- Note: so we don't need to worry about data transformation here
- Interval
- Proportions cannot be taken
- Zero is not absolute
- Eg: Temperature : 0 degrees means the freezing point here, but there are values below it too; we can transform this to Kelvin
- Note: Data can be transformed here
- Discrete
- Nominal
- Red
- Blue
- Green
- Ordinal
- Good
- Better
- Best
- Binary
- Symmetric: equal importance
- Female
- Male
- Asymmetric: significance given
- Positive
- Negative
- Qualitative
- Text
- Images
- Audio
- Video
Data Cleaning
- Process of turning raw data into analysable data
- Note: Garbage In, Garbage Out
- Methods :
- Missing Values
- Imputation using central tendency
- Mean
- 1, 2, 2,4,6,6
- 1+2+2+4+6+6
- 45/6
- Median
- Mode
- Special Cases
- Default value: Making a reasonable guess, based on the domain knowledge
- 45,34,56,12,46,34,34 – Normal Class
- 65,89,12,90,56,60,60 – Upper Class
- Imputation through predictions
- Before or after values: also called nearest-neighbour imputation
- Model Estimation: or called as interpolation, using a model to predict missing values, could be either Linear Regression, KNN, etc
- Drop all the records with missing values; note that this involves risk, so be careful, and treat it as the last option
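The imputation options above can be sketched with pandas; the tiny DataFrame and its column names are hypothetical:

```python
import pandas as pd

# A tiny hypothetical DataFrame with gaps in a numeric and a categorical column
df = pd.DataFrame({
    "age":  [25, 30, None, 40, None],
    "city": ["A", None, "B", "A", "A"],
})

# Numeric: impute with the mean (use the median when the data is skewed)
df["age"] = df["age"].fillna(df["age"].mean())

# Categorical: impute with the mode (the most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])
```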
- Outliers
- Outlier values can affect model predictions.
- Variance is important, for prediction and should be controllable
- There are two types of variance we must know :
- Good Variance
- Could be genuine; we need to take steps to handle it
- Such variance can warrant splitting into two data sets
- Eg: a Reliance supermarket, where we have actual retailers and small shop owners; it is important to differentiate between them
- Bad Variance
- Could be because of wrong imputation
- Identification of Outliers
- Univariate
- IQR
- Bivariate
- Multivariate
- Regression/Residual analysis
- So here we may need to build a model and predict outliers
- Handling outliers
- Identification as first step:
- Eg: Age, Salary, Gender – for prediction of policy Buy/Not
- 32, 34, 40, 1, 80
- 10L, 5L, 15L, 10K, 100L
- 1,2,1,1,3
- Scale
- zscore
- min-max
- Replace
- Remove, as a last option
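The IQR identification step can be sketched in plain Python; the ages here extend the hypothetical example above with a few plausible values:

```python
import statistics

# Hypothetical ages; 1 and 80 look suspicious relative to the rest
ages = [32, 34, 40, 35, 33, 36, 38, 31, 1, 80]

q1, _, q3 = statistics.quantiles(ages, n=4)     # Q1, Q2, Q3
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # Tukey fences

# Values outside the fences are flagged for scaling, replacing or removal
outliers = [a for a in ages if a < lower or a > upper]
```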
Data Transformation
- All data needs to be converted to the quantitative data
- Models are stronger when the variances are small
- Transformation is very much needed for Distance or Slope or Weight based models, such as
- Linear
- Logistic
- KNN
- Transformation methods:
- Log
- MinMax
- Xnorm = (X-Xmin)/(XMax-Xmin)
- Will always contain values between 0 and 1
- Scaling
- NLP
- Set of preprocessing steps converting qualitative information to quantitative
- Text Conversion : using a DTM – Document-Term Matrix
- One Hot Encoding
- Male/Female
- 0,1
- Red, Green, Blue
- Red - [1,0,0]
- Green - [0,1,0]
- Blue - [0,0,1]
- Each of the m categories becomes a vector of length m containing exactly one 1 (e.g. [r, g, b] becomes [[1,0,0],[0,1,0],[0,0,1]])
- Label Encoding
- Good, Bad, Ugly – 1,2,3
- convert each distinct category into an integer (e.g. [r, g, b] becomes [1, 2, 3])
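The MinMax formula and the two encodings above can be sketched in plain Python; for real use, sklearn provides MinMaxScaler, OneHotEncoder and LabelEncoder, and pandas provides get_dummies:

```python
# Min-max normalisation: Xnorm = (X - Xmin) / (Xmax - Xmin), range [0, 1]
x = [10, 20, 30, 40]
x_norm = [(v - min(x)) / (max(x) - min(x)) for v in x]

# Label encoding: each distinct category becomes an integer
categories = ["Blue", "Green", "Red"]
label = {c: i + 1 for i, c in enumerate(categories)}

# One-hot encoding: each category becomes a vector with a single 1
one_hot = {c: [1 if j == i else 0 for j in range(len(categories))]
           for i, c in enumerate(categories)}
```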
Model Data
- Modelling is the process of incorporating information into a tool which can forecast and make predictions.
- Usually, we are dealing with statistical modelling where we want to analyze relationships between variables.
- Formally, we want to estimate a function f(X) such that:
- Y = f(X) + E
- where
- X = (X1, X2, ..., Xp) represents the input variables
- Y represents the output variable
- E represents random error
Model Life Cycle
- Model Selection
- Descriptive
- Visualization
- Inferential
- Statistical
- When we want to understand the relationship between X and Y, we can no longer treat f̂ as a black box, since we want to understand how Y changes with respect to X = (X1, X2, ..., Xn)
- Predictive
- Machine Learning
- Once we have a good estimate f̂(X), we can use it to make predictions on new data.
- We treat f̂ as a black box, since we only care about the accuracy of the predictions, not why or how it works.
- Model Build
- Model Evaluation
- Supervised
- Regression
- R2 score
- Adj R2 score
- Classification
- Confusion matrix
- Classification report
- ROC
- AUC
Model Selection
Inferential
- Test of population parameters
- Test of means
- One Sample Test
- Two Sample Test
- ANOVA
- ANCOVA
- MANOVA
- MANCOVA
- N-Way ANOVA
- Test of variance
- Levene Test
- F Test
- Test of normality
- Shapiro-Wilk Test
- Anderson-Darling Test
- Test of proportion
- Chi-Square Test
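As a sketch of a one-sample test of means, the t statistic can be computed by hand on the Normal Class marks from the earlier example; scipy.stats.ttest_1samp would return the same t plus a p-value:

```python
import math
import statistics

# Hypothetical test: is the class average equal to 40?
sample = [45, 34, 56, 12, 46, 34, 34]   # the Normal Class marks
mu0 = 40                                # hypothesised population mean

n = len(sample)
mean = statistics.mean(sample)
sd = statistics.stdev(sample)           # sample standard deviation (n - 1)
t = (mean - mu0) / (sd / math.sqrt(n))  # compare |t| against the t-table
```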
Predictive
- Supervised
- Regression
- Dependent variables are continuous
- It gives point estimates
- Models :
- Linear Regression
- KNN for regression
- SVM for regression
- Decision Tree for regression
- Ensemble Techniques for Regression
- Basic Ensemble Techniques
- Max Voting
- Averaging
- Weighted Average
- Advanced Ensemble Techniques
- Stacking
- Blending
- Bagging
- Boosting
- Algorithms based on Bagging and Boosting
- Bagging meta-estimator
- Random Forest
- AdaBoost
- GBM
- XGB
- Light GBM
- CatBoost
- Neural Networks for Regression
- Time Series Forecasting
- The independent variable is always fixed, that is, time
- Based on time, the dependent variable is predicted
- Models :
- ARIMA model
- Controlled noise
- Also a parametric model
- Eg: Sales forecast
- ARCH and GARCH models
- Uncontrolled noise
- Also a nonparametric model
- Eg: Stock market
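As a hedged sketch of the basic time-series idea (not the full ARIMA machinery), the next value can be forecast from the previous one by fitting a simple AR(1) line via least squares; the series here is hypothetical, and real forecasting would use e.g. statsmodels' ARIMA:

```python
# Hypothetical series of sales-like values over time
series = [10, 12, 13, 15, 16, 18, 19, 21]

x = series[:-1]                      # value at time t-1
y = series[1:]                       # value at time t
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Ordinary least squares for y = intercept + slope * x
slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
         / sum((a - mx) ** 2 for a in x))
intercept = my - slope * mx

next_value = intercept + slope * series[-1]   # one-step-ahead forecast
```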
- Classification Models :
- Logistic Regression
- Naive Bayes
- Discriminant Analysis
- KNN for classification
- SVM for classification
- Decision Tree for classification
- Neural Networks for classification
- UnSupervised
- Identifies groups but does not name them
- After discovering the results, we can try to convert the problem further into Supervised Learning
- Clustering
- Finding similarities and dissimilarities
- Find rows – grouping
- Hierarchical
- Agglomerative
- Divisive
- Non-Hierarchical
- K-Means
- DAW
- CLARA
- Fuzzy Clustering
- Location Based Clustering
- Dimension Reduction
- Would also be used for preprocessing
- Finds columns to group
- Based on correlation
- Continuous Variables
- Principal Component Analysis
- Factor Analysis
- Discrete Variables
- Multiple Correspondence Analysis
- Recommendation Systems
- Eg: Between users and products, ranking
- Models
- Popularity Based
- Market Basket Analysis
- Association Analysis
- Content Based Recommendation Systems
- Collaborative Recommendation Systems
- Model Based Recommendation Systems
- Hybrid Recommendation Systems
Model Evaluation
Classification
- Confusion Matrix
- An easy and popular method of diagnosing model performance.
- TN : True Negative
- TP : True Positive
- FN : False Negative - Type II
- FP : False Positive - Type I
- Eg:
Fig: Confusion Matrix
- Type I and Type II errors are inversely related: intuitively, if the Type I error increases, the Type II error decreases
- Accuracy
- Percentage of correct predictions
- But in real life scenarios, we may be more keen in looking for precision and recall
- For example, in fraud detection only about 1% of all transactions are fraudulent, so a model can score well on accuracy while missing the fraud cases; our real intention is to keep False Negatives low.
- Accuracy of model: (TP+TN) / (TP+TN+FP+FN)
- Misclassification
- Percentage of incorrect predictions
- Formula :
- 1 – Accuracy
- or
- (FP+FN) / (TP+TN+FP+FN)
- Precision
- The precision is the ratio TP / (TP + FP), where TP is the number of true positives and FP the number of false positives.
- Precision intuitively measures the model's ability to avoid the Type I error; the score depends on the number of False Positives, ranges from 0 to 1, and can be converted to a percentage.
- Eg:
- 1)
- TP = 1
- FP = 3
- Precision:
- 1 / (1+3)
- 1/4
- .25
- So here only .25 score, as there are more FP
- 2)
- TP = 6
- FP = 2
- Precission :
- 6 / (6+2)
- .75
- So here score is .75, as we have less FP
- Recall/Sensitivity
- The recall is the ratio TP / (TP + FN), where TP is the number of true positives and FN the number of false negatives.
- Recall intuitively measures the model's ability to avoid the Type II error; the score depends on the number of False Negatives, ranges from 0 to 1, and can be converted to a percentage.
- Specificity
- The specificity is the ratio TN / (TN + FP), where TN is the number of true negatives and FP the number of false positives.
- Specificity intuitively measures the ability to correctly predict negatives; it is related to the Type I error, and equals 1 minus the False Positive Rate.
- F1 Score
- The harmonic mean of the precision and recall.
- F1 score reaches its best value at 1 and worst score at 0.
- Formula : (2 x Precision x Recall) / (Precision + Recall)
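The formulas above can be sketched from raw confusion-matrix counts; the counts here are hypothetical, and sklearn.metrics provides accuracy_score, precision_score, recall_score and f1_score for real use:

```python
# Hypothetical confusion-matrix counts
TP, TN, FP, FN = 6, 10, 2, 3

accuracy    = (TP + TN) / (TP + TN + FP + FN)
precision   = TP / (TP + FP)            # penalised by Type I errors (FP)
recall      = TP / (TP + FN)            # sensitivity; penalised by Type II errors (FN)
specificity = TN / (TN + FP)            # 1 - FPR

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
```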
- ROC
- ROC stands for Receiver Operating Characteristic Curve
- Used to measure the performance of a classification model at various threshold settings
- This curve is plot using two parameters:
- True Positive Rate
- TPR = TP / (TP + FN)
- Intuitively, the percentage of actual positive values correctly identified
- False Positive Rate
- FPR = FP / (FP + TN)
- Intuitively, the percentage of actual negative values incorrectly identified as positive
- This plot is drawn at different thresholds
- Visualization:
- AUC
- AUC stands for Area Under ROC Curve
- Also written as AUROC (Area Under the Receiver Operating Characteristics)
- AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
- An AUC of 1 on training data can be a sign of overfitting, and an AUC of 0 means the predictions are exactly inverted (0.5 means no discriminating power)
- But generally, the higher the AUC value, the better the performance
- How AUC and ROC are used together
- Case 1
- When AUC is 1, the TP and TN distributions never overlap; such perfect separation can indicate overfitting
- No wrong predictions
- Case 2
- When AUC is .7, it means the TP and TN have little overlap
- Probability of .3 wrong predictions
- Case 3
- When AUC is .5, it means the TP and TN have complete overlap
- Probability of .5 wrong predictions
- Case 4
- When AUC is 0, the TP and TN distributions overlap in the opposite way, meaning the predictions are exactly inverted
- Probability of all wrong predictions
- Observations:
- The threshold giving the best trade-off between TPR and FPR on the curve (e.g. maximising TPR - FPR) would be our best threshold
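A small sketch with sklearn's ROC utilities on hypothetical labels and scores:

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1]            # actual labels
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# The threshold maximising TPR - FPR is a common choice of operating point
best_threshold = thresholds[(tpr - fpr).argmax()]
```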
Regression
- RSquare
- Measures the goodness of fit of the model
- The higher the R^2 value, the better the model
- Formula : R^2
- = 1 – (SSRes/SSTotal)
- = 1 – ( Σ (yi - ŷi)^2 / Σ (yi - ȳ)^2 )
- SSRes
- Sum of Squared Residuals : Σ (yi - ŷi)^2
- Residuals mean errors: the actual Y value minus the prediction
- The Sum of Blue Lines are SSRes
- The reason we square is to make the negative differences positive
- SSTotal
- SSTotal is the Total Sum of Squares : Σ (yi - ȳ)^2
- Mean Line is in Aqua Blue
- Here, instead of the slope (best fit), we calculate the Mean line
- And find the differences from each actual point to the Y mean
Credits: https://www.datasciencecentral.com/wp-content/uploads/2021/10/2742052271.jpg
- Eg: Average Model
- 1- 2/4
- 1- 1/2
- 1- .5
- 1-0.5
- 0.5
- Eg: Good Model
- 1-1/4
- 1-.25
- .75
- Eg: Bad Model
- 1-4/4
- 1-1
- 0
- Eg: Very Bad Model
- 1-8/4
- 1-2
- -1
- Adjusted R Square
- When new features are added, the R2 value increases
- So the Adjusted R Square value is used when comparing 2 or more regression models with different independent variables
- It helps us find whether a newly added independent variable actually helps increase the R2 value
- Formula :
- 1 - ( (1-R^2) * (N-1) / (N-P-1) )
- N: no of rows or sample size
- P: no of predictors or independent features
- If the added independent variable is correlated with the Target variable, Adjusted R2 will decrease only slightly, if at all.
- Otherwise Adjusted R2 will decrease more sharply
- Points to remember :
- Every time we add an independent variable to a model, the R Square always increases
- Even if there is no significant correlation with Target variable, it will never decline
- Whereas Adjusted R Square increases only when independent variable is significant and affects dependent variable
- Adjusted R Square value would always be less than or equal to R Square value
- Reference : https://www.youtube.com/watch?v=WuuyD3Yr-js
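The Adjusted R^2 formula above can be sketched directly, with hypothetical values for N and P:

```python
# Hypothetical model summary values
r2 = 0.90   # R Square of the model
n  = 100    # number of rows (sample size)
p  = 5      # number of predictors (independent features)

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```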
Assumptions
Parametric Test
- Parametric analyses are done when the population distribution is known or assumed known
- They can perform well even when the data is skewed or non-normal, provided the sample is large enough
- Mean is used for the measure of tendency
- Interval and ratio are the measurement levels
- Pros :
- Proven to be highly powerful
- Cons :
- Expects population to be known
- Complicated theory
Non Parametric Test
- Non-parametric analyses are done when the population distribution is unknown.
- Median is used for the measure of tendency
- Ordinal and nominal are the measurement levels
- Pros :
- Simple and easy to understand
- No assumptions made
- Cons :
- Applicable for nominal and ordinal scale
- Assume all data groups have the same spread
- Refer : https://blog.minitab.com/blog/adventures-in-statistics-2/choosing-between-a-nonparametric-test-and-a-parametric-test
Balance
Overfitting
Overfitting refers to a model that learns the training data too well, including its noise, and therefore fails to generalize.
- Reduces Overfitting :
- Use a resampling technique to estimate model accuracy
- Hold back a validation dataset
Underfitting
- Underfitting refers to a model that can neither learn the training data nor generalize to new data.
Model Evaluation Error
- Err(X) = Bias^2 + Variance + Random Error
Random Error
- It means Irreducible Error
- Usually the amount of noise in our data, which is unknown and uncontrolled
Bias error
- Bias is the gap between the average predicted values and the actual values
- In practice it is reflected in scores such as ROC AUC for classification and RMSE for regression
- Model with high bias pays very little attention to the training data and oversimplifies the model.
- High bias is underfitting of the model
- High bias in training will also lead to high bias in testing, so the model would be underfitted
- Bias are the simplifying assumptions made by a model to make the target function easier to learn.
- Low bias = Less assumptions
- High bias = More assumptions
- Low bias models :
- Decision Tree, KNN, SVM
- High bias models :
- Linear, Logistic
- Usually High Bias and Low Variance occur in parametric ML algorithms, as they make assumptions about the population
- Bias is also introduced by us, such as when we select the features, if we have less features or irrelevant features added to model, we add bias
- When the data is not scaled we also introduce bias, as features with larger values tend to make the model biased towards them
Variance errors
- Variance is the variability of model prediction: how well the model predicts on different samples of the same dataset
- High variance is overfitting the model
- A high-variance model wins on the training set, but when a different set is introduced its accuracy score suffers
- Low variance, small change to estimate of the target function with the changes of the training dataset
- High variance, many changes to estimate of the target function with the changes of the training dataset
- Usually High Variance and Low Bias occur in non-parametric ML algorithms, as they make no assumptions about the population and have a lot of flexibility
- High variance models :
- Decision Tree, KNN, SVM
- Low variance models :
- Linear, Logistic
Bias-Variance Tradeoff
- If our model is too simple we may introduce High Bias, which is Underfitting
- If our model is too complicated we may introduce High Variance, which is Overfitting
- Finding the sweet spot is important here, where there is low variance and low bias
- In reality there is no escaping the relationship between bias and variance in machine learning
- Increasing the bias will decrease the variance
- Increasing the variance will decrease the bias
- So we need to find the optimal balance
Fig: Bias-Variance Tradeoff
Problem of Imbalance
- When there is an imbalance of class information, models tend to have low sensitivity
- In such cases,
- Oversampling records to restore the balance might work
- Ensemble methods might work
- Changing the cut off or threshold of predicted probability might also work
Cross Validation
- When the data is imbalanced, we can also try cross validation (CV) to get more reliable estimates
- Usually we have the data split into Train and Test.
- But using CV we can make a number of train/test splits within the training data
- As several samples are generated from the same training data, we have a much better chance of detecting overfitting
- Used when :
- When the data is noisy, (high outliers)
- When the model is a greedy algorithm, because such a model may not reach the optimal value
- eg: a decision tree: at step 1 it does not know the further steps
- Generally there are parameters for the model
- eg: KNN, where the k value is used for model building; the best k is found through analysis (it could be 3, 5, 11, ...)
- Types:
- K-Fold Cross Validation
- Take the train data and split it into K equal parts
- Then each of the K parts in turn is used as Test while the others form Train
- Drawback: computationally expensive
- Sample Source :
- from sklearn.model_selection import cross_val_score
- print(cross_val_score(model, X, y, cv=5))
- Visualization:
- Leave P-out Cross Validation
- Leave One-out Cross Validation
- Repeated Random Sub-sampling Method
- Holdout Method
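A minimal sketch of how K-Fold splits the training data, complementing the cross_val_score sample above; the ten-row dataset is hypothetical:

```python
from sklearn.model_selection import KFold

X = list(range(10))            # stand-in for 10 training rows
kf = KFold(n_splits=5)

folds = list(kf.split(X))      # 5 (train_indices, test_indices) pairs
for train_idx, test_idx in folds:
    # each fold holds out 2 rows for testing and trains on the other 8
    assert len(train_idx) == 8 and len(test_idx) == 2
```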
References and Credit
https://www.wordstream.com/blog/ws/2017/07/28/machine-learning-applications
Several Medium, Wikipedia and other sources.