Ensemble Technique — Wisdom of the crowd!!!

Ensemble technique is a Machine Learning method that combines several base estimators, i.e. machine learning algorithms, to produce one optimal predictive model. Both regression and classification can be done with ensemble learning. The technique rests on the belief that the judgement of a committee/jury of experts is more accurate than the decision of any individual expert. Point to be noted: jury trials were discontinued in India in 1973. Akshay Kumar's movie Rustom presents a jury trial very well.

Base estimators are the real ground workers who actually set the skeleton of an ensemble. Decision Trees are the default choice of the sklearn.ensemble library, but other models like KNN, Logistic Regression, SVM, etc. can also be tried depending on the dataset & requirement. Most ensemble methods use a single base learning algorithm to produce homogeneous base learners, i.e. learners of the same type, leading to homogeneous ensembles. However, there are also methods like Voting Ensemble/Stacking that support heterogeneous learners.
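A minimal sketch of a heterogeneous ensemble, assuming scikit-learn's StackingClassifier and a toy dataset; the particular base models (KNN, SVM) and the Logistic Regression meta-learner are illustrative choices, not the post's exact setup.

```python
# Heterogeneous ensemble sketch: different base-learner types stacked together.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()),   # distance-based learner
                ("svm", SVC(random_state=42))],    # margin-based learner
    final_estimator=LogisticRegression(max_iter=1000))  # meta-learner on top

print("CV accuracy:", cross_val_score(stack, X, y, cv=5).mean())
```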

Examples of heterogeneous & homogeneous ensemble methods

Voting is used for classification and averaging for regression to decide the final outcome (a minimal code sketch follows the list):

  1. Majority Voting (Hard Voting). Every model makes a prediction (votes) for each test instance. A hard voting classifier simply counts the votes of each classifier in the ensemble and picks the class that gets the most votes as the final output prediction.
  2. Weighted Voting. Unlike majority voting, where each model has equal say, more importance can be assigned to one or more models. In weighted voting the predictions of the better models are counted multiple times. Finding a reasonable set of weights is a very problem-specific exercise.
  3. Simple Averaging. For every instance of the test dataset, the average of the models' predictions is calculated. This method often reduces overfitting and creates a smoother regression model.
  4. Weighted Averaging. A slightly modified version of simple averaging, where the prediction of each model is multiplied by its weight before the average is calculated.
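A minimal sketch of hard and weighted voting, assuming scikit-learn's VotingClassifier on a toy dataset; the base models and the weights [2, 1, 1] are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

base = [("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("dt", DecisionTreeClassifier(random_state=42))]

# Majority (hard) voting: every model gets exactly one vote per test instance.
hard_vote = VotingClassifier(estimators=base, voting="hard")

# Weighted voting: the first model's vote is counted twice (the weights are a
# customized, problem-specific choice).
weighted_vote = VotingClassifier(estimators=base, voting="hard", weights=[2, 1, 1])

for name, clf in [("hard", hard_vote), ("weighted", weighted_vote)]:
    clf.fit(X_train, y_train)
    print(name, "accuracy:", clf.score(X_test, y_test))
```

For regression, scikit-learn's VotingRegressor performs the corresponding simple or weighted averaging via the same `weights` parameter.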

Ensemble technique is broadly divided into 2 categories: Bagging (parallel beauty) & Boosting (sequential glory).

BAGGING aka Bootstrap Aggregation aka Parallel Ensemble

Bootstrap aggregation combines multiple diverse, complex models, each trained on a different bootstrap sample of the data, and the combination often outperforms the best single model in the ensemble. The basic motivation of parallel methods is to exploit independence between the base learners, since the error can be reduced dramatically by averaging. The no-multicollinearity rule that is usually applied to independent features matters here for the base models too: the less correlated the estimators, the more the averaging helps. The idea of bagging comes from the fact that averaging models reduces variance. Since trees are notoriously noisy, they benefit greatly from the averaging.

  1. Sample sets are created by random sampling with replacement, so no sample is the same as the original set or as the other samples. On average a bootstrap sample contains approx. 63.2% of the distinct records of the original data.
  2. Samples may contain duplicate/triplicate records. Variance error averages out because no two samples are the same.
  3. Bagging classifiers generally benefit from complex individual models that can be trained in parallel.
  4. Combines (by averaging or voting) the learners to “smooth out” predictions, as in the sketch after this list.
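A minimal bagging sketch, assuming scikit-learn's BaggingClassifier and a toy dataset; the hyperparameter values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

bag = BaggingClassifier(
    n_estimators=100,   # number of bootstrap samples / base learners
    bootstrap=True,     # sample with replacement (~63.2% unique rows per sample)
    n_jobs=-1,          # base learners are independent, so train in parallel
    random_state=42)    # the default base estimator is a Decision Tree

print("CV accuracy:", cross_val_score(bag, X, y, cv=5).mean())
```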

Random Forest is also a bagging flavor, with a “feature selection” tweak: each tree gets the full set of features, but at each node only a random subset of features is considered. No pruning/regularization is needed here. Random forests have two main tuning parameters: the number of predictors considered at each split and the number of trees (i.e. the number of bootstrapped samples). No feature scaling (standardization or normalization) is required for Random Forest, as it uses a rule-based approach instead of distance calculations.
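A minimal Random Forest sketch showing the two tuning parameters named above; the values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

rf = RandomForestClassifier(
    n_estimators=200,      # number of bootstrapped trees
    max_features="sqrt",   # random subset of predictors tried at each split
    n_jobs=-1,
    random_state=42)

# Raw, unscaled features are fine: splits are rule-based, not distance-based.
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())
```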

BOOSTING aka AdaBoost & Gradient Boosting aka Sequential Ensemble

In AdaBoost (Adaptive Boosting), successive learners are created with a focus on the ill-fitted data of the previous learner. Each successive learner concentrates more and more on the harder-to-fit instances, i.e. the examples the previous estimator got wrong, whose weights are increased.
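A minimal AdaBoost sketch, assuming scikit-learn's AdaBoostClassifier (whose default weak learner is a depth-1 decision stump); the hyperparameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

ada = AdaBoostClassifier(
    n_estimators=200,    # learners are built sequentially, each one upweighting
    learning_rate=0.5,   # the instances the previous learner misclassified
    random_state=42)

print("CV accuracy:", cross_val_score(ada, X, y, cv=5).mean())
```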

Gradient Boosting: each learner is fit on a modified version of the original data, in which the targets are replaced by the residuals from the previous learner. By fitting new models to the residuals, the overall learner gradually improves in the areas where the residuals are initially high. Instead of tweaking the instance weights at every iteration as AdaBoost does, this method fits the new predictor to the residual errors made by the previous predictor, because the errors of early predictions indicate the “hard” examples. The three principal hyperparameters to tune in gradient boosting are the number of trees, the learning rate and the depth of the trees; all three affect model performance. The depth of the trees also affects the speed of training and prediction: the shorter, the faster. Popular flavors of gradient boosting on Kaggle are LightGBM, XGBoost and CatBoost.
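A minimal gradient boosting sketch with the three hyperparameters named above, assuming scikit-learn's GradientBoostingClassifier; the values are illustrative, not tuned.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=42)

gb = GradientBoostingClassifier(
    n_estimators=300,    # number of sequential trees
    learning_rate=0.05,  # shrinks each tree's contribution
    max_depth=3,         # shallow trees train and predict faster
    random_state=42)

print("CV accuracy:", cross_val_score(gb, X, y, cv=5).mean())
```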

Base estimators are built sequentially.

  1. Trains a large number of “weak” learners in sequence.
  2. Misclassified data points get higher weights for training the next model, so training has to be done in sequence.
  3. Boosting then combines all the weak learners into a single strong learner.
  4. Don’t regularize / prune Decision Trees in Ensemble arena

Enough of the lecture. Let's visualize a famous problem from the Machine Learning world:

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit.
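A minimal end-to-end sketch for this term-deposit problem, assuming the UCI bank-marketing CSV ("bank-additional-full.csv", ';'-separated, target column "y"); the file path, preprocessing and model choice here are assumptions for illustration, not the post's exact pipeline.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv("bank-additional-full.csv", sep=";")   # adjust path as needed
X = pd.get_dummies(df.drop(columns="y"))                # one-hot the categoricals
y = (df["y"] == "yes").astype(int)                      # subscribed: yes/no

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```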

The detailed problem statement & step-by-step execution, along with a synopsis of the final derived results, are available @ sonalsharmagithub.
