
Module 5 - Random Forest Introduction

What is Random Forest?

Random forest is an ensemble technique that combines many decision trees to predict an outcome. There are two types of random forest methods:

  1. Random forest regression: for predicting a continuous outcome variable (salary, income, enrollments, etc.).
  2. Random forest classifier: for predicting a class label (Yes vs. No, Active vs. Lapsed, etc.).
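As a quick illustration, here is a minimal scikit-learn sketch of both flavors; the synthetic data and parameter values are placeholders for illustration, not this module's dataset:

```python
# Minimal sketch of the two random forest flavors (synthetic data).
from sklearn.datasets import make_regression, make_classification
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Continuous outcome (e.g., salary): random forest regression.
X_reg, y_reg = make_regression(n_samples=200, n_features=5, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_reg, y_reg)
print(reg.predict(X_reg[:3]))  # continuous predictions

# Class label (e.g., Active vs. Lapsed): random forest classifier.
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_clf, y_clf)
print(clf.predict(X_clf[:3]))  # class labels (0/1)
```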

In this module, let us first understand some terms that will come up again in upcoming modules.

Ensemble learning: In machine learning, the term ensemble refers to combining the outputs of multiple models to produce the final prediction. Combining multiple models is usually more effective than predicting the outcome with a single model. Popular ensemble methods include stacking, bagging, and blending.

Stacking: Stacking is a machine learning process that trains multiple models on the same training data and then generalizes over their predictions to get the final output. The generalization of the predictions can be done in two ways (a short sketch of both follows below).

Averaging: In this method, the outputs of the predictive models are averaged to get the final prediction.


Meta-model: In this method, the outputs of the predictive models are used as input features for training a meta-model.
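To make the two flavors concrete, here is a minimal sketch on synthetic data; the base models and split are illustrative assumptions:

```python
# Minimal sketch of the two stacking flavors: averaging vs. meta-model.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [LinearRegression(), DecisionTreeRegressor(random_state=0)]
preds = np.column_stack(
    [m.fit(X_train, y_train).predict(X_test) for m in base_models]
)

# 1) Averaging: the final prediction is the mean of the base predictions.
avg_pred = preds.mean(axis=1)

# 2) Meta-model: the base predictions become input features for a second model.
#    (In practice the meta-model is fit on held-out folds, not test predictions.)
meta = LinearRegression().fit(preds, y_test)
meta_pred = meta.predict(preds)
```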


Bagging: Bagging (short for bootstrap aggregating) is an ensemble method in which multiple versions of a predictive model are trained on bootstrap samples of the training data, and the final prediction is obtained by averaging their predictions.
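A minimal bagging sketch with scikit-learn follows; the data and settings are illustrative (note that the base-model argument is named estimator in scikit-learn 1.2+, base_estimator in older versions):

```python
# Minimal bagging sketch: each tree is trained on a bootstrap sample
# drawn with replacement, and the predictions are averaged.
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
bag = BaggingRegressor(
    estimator=DecisionTreeRegressor(),  # the base model to replicate
    n_estimators=50,                    # number of bootstrap-trained copies
    bootstrap=True,                     # sample the training data with replacement
    random_state=0,
).fit(X, y)
print(bag.predict(X[:3]))  # averaged predictions across the 50 trees
```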


Blending: Blending is an ensemble machine learning technique that learns how to best combine the predictions derived from multiple models. Blending and stacking are often used interchangeably. The difference is that in stacking, two or more base models feed a meta-model that either averages all the predictions or uses them as its input features (see stacking above), whereas in blending the meta-model is usually a linear regression model (for a continuous outcome) or a logistic regression model (for a categorical outcome). Blending takes a weighted sum of the predictions, hence the term blending.
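Here is a minimal blending sketch on synthetic data, assuming the common setup of training the base models on one split and the linear meta-model on a held-out split:

```python
# Minimal blending sketch: a linear meta-model learns the blending
# weights for the base-model predictions on a held-out split.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=5, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, random_state=0
)

base_models = [LinearRegression(), DecisionTreeRegressor(random_state=0)]
holdout_preds = np.column_stack(
    [m.fit(X_train, y_train).predict(X_hold) for m in base_models]
)

# The linear meta-model's coefficients act as the blending weights.
blender = LinearRegression().fit(holdout_preds, y_hold)
print(blender.coef_)  # weight assigned to each base model's predictions
```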

Random forest (regression/classification trees): A random forest is an ensemble machine learning algorithm. It starts by running the regression/classification tree algorithm on several subsets of the training dataset, which addresses the overfitting problem. The algorithm uses bootstrapping to draw sample datasets (sampling with replacement) from the original training dataset and runs the data through the decision trees to get predictions. The final prediction is either the majority vote (classification) or the average (regression) of all the predictions made by the decision trees.
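To show the mechanism itself, here is a hand-rolled sketch of bootstrap-and-average on synthetic data; note that a real random forest additionally subsamples features at each split, which this sketch omits:

```python
# Hand-rolled sketch of the idea behind a random forest regressor:
# bootstrap samples drawn with replacement, one decision tree per sample,
# final prediction = average across the trees.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=300, n_features=5, random_state=0)

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap: sample with replacement
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Average the per-tree predictions to get the ensemble prediction.
ensemble_pred = np.mean([t.predict(X[:3]) for t in trees], axis=0)
print(ensemble_pred)
```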


Features & Feature Importance: In machine learning, variables are referred to as features. Consider a heart-disease dataset: heart disease is our dependent variable, and all the other variables, such as smoking, alcohol drinking, and physical activity, are our independent variables. These variables are called features in the machine learning world.



Are all the independent variables equally important in predicting heart disease?

No! Some variables carry far more importance than others in predicting heart disease. For example, race, physical activity, and diabetic status are much stronger predictors than the profession of the patient.

Whichever algorithm you use to predict a dependent variable, it will assign an importance to each independent feature and predict the outcome based on those importances. Importance is assigned by giving each feature a weight (a number). So, for example, race and physical activity will receive higher weights, while the profession of the patient will be assigned a lower weight.
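Here is a minimal sketch of reading those weights from a fitted random forest; the feature names are hypothetical stand-ins for the heart-disease variables mentioned above, attached to synthetic data:

```python
# Minimal sketch of inspecting feature importances from a random forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = ["smoking", "alcohol_drinking", "physical_activity",
                 "diabetic", "race", "profession"]  # hypothetical stand-ins
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds one weight per feature, summing to 1.
for name, weight in sorted(zip(feature_names, clf.feature_importances_),
                           key=lambda pair: pair[1], reverse=True):
    print(f"{name}: {weight:.3f}")
```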

Loss function: In machine learning, the loss function is defined as the difference, or distance, between the predicted outcome and the actual outcome.
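For example, a common regression loss is the mean squared error, sketched below with made-up numbers:

```python
# Mean squared error: average squared distance between predictions
# and actual values (illustrative numbers).
import numpy as np

y_actual = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.5, 5.0, 4.0])
mse = np.mean((y_actual - y_pred) ** 2)
print(mse)  # (0.25 + 0.0 + 2.25) / 3 = 0.8333...
```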


Overfitting: The term overfitting is used in machine learning when a predictive model works great on the training dataset but does not deliver comparable results on new data. In simple terms, the model is aligned too closely to the training data points and cannot generalize to other datasets. A model might show 99% accuracy on the training dataset, yet only around 50% accuracy when you run a new dataset through it; that gap is how you can tell the model is overfitting.
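A quick way to see this is to let an unconstrained decision tree memorize a small, noisy synthetic dataset and compare train vs. test accuracy; the data and settings below are illustrative:

```python
# Minimal sketch of spotting overfitting: near-perfect train accuracy,
# noticeably lower test accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, flip_y=0.3,
                           random_state=0)  # flip_y adds label noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)  # no depth limit
print("train accuracy:", tree.score(X_train, y_train))  # close to 1.0
print("test accuracy: ", tree.score(X_test, y_test))    # noticeably lower
```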

What causes overfitting?

  • The training dataset is too small.
  • The variance is too high.
  • The predictive model is too complex.
  • The training dataset contains noise (meaningless values).

How to address overfitting?

There are several ways to address overfitting. In this module, we will discuss L1 and L2 regularization (a short sketch contrasting the two follows the list).

  • L1 Regularization (Lasso regression): This regularization technique adds the absolute value of each coefficient as the penalty. It reduces the weights of the features, shrinking them toward 0. L1 regularization is very helpful when we have many features, because features whose coefficients reach (or come close to) 0 are dropped.

  • L2 Regularization (Ridge regression): This regularization technique adds the squared value of each coefficient as the penalty. It also shrinks the feature weights toward 0, though not exactly to 0. L2 regularization is helpful in the case of collinear predictors.
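A minimal sketch contrasting the two on synthetic data (the alpha penalty strength and the data are illustrative assumptions):

```python
# L1 (Lasso) drives some coefficients exactly to 0 (feature selection);
# L2 (Ridge) shrinks them toward 0 without zeroing them out.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # alpha controls the penalty strength
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", lasso.coef_)  # several exact zeros expected
print("Ridge coefficients:", ridge.coef_)  # small but mostly nonzero
```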