Linear Regression: Forward Selection vs Backward Elimination

Introduction

In this blog we will explore two methods for selecting significant features in regression models, and demonstrate each with a simple example.

Getting Started

For our demonstration we will be using a simple dataset of 50 startups, with one row per startup and 5 columns.

Now let us import the libraries required.

import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

Before we start building our model, let us get our data ready.

df = pd.read_csv('50_Startups.csv')
df.head()

output:

   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94

From the above table we can infer that Profit is our target variable. Hence, we will use all the other columns as independent variables to predict it.
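
To do that, we first separate the feature matrix X from the target vector y (a minimal snippet, assuming the column names shown in the table above):

#Separating the features and the target
X = df.drop('Profit', axis=1)
y = df['Profit']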

Also, since the State variable is categorical, we will one-hot encode it before using it in our model.

#Creating Dummy Variables for state
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
oe = OneHotEncoder()

X.State = le.fit_transform(X.State)
X_enc = oe.fit_transform(X[['State']]).toarray()

col = ['California', 'Florida', 'New York']
for i in range(len(X.columns.values)):
    col.append(X.columns.values[i])

data = pd.DataFrame(np.append(arr=X_enc, values=X, axis=1).astype(int), columns=col)

#Avoiding Dummy Variable Trap
X = data.drop(['State','California'],axis=1)
X.head()

output:

FloridaNew YorkR&D SpendAdministrationMarketing Spend
001165349136897471784
100162597151377443898
210153441101145407934
301144372118671383199
41014210791391366168
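
As a side note, the same encoding can be written more concisely with pandas. A small sketch (X_alt is just an illustrative name; the resulting dummy columns carry a 'State_' prefix, and drop_first avoids the dummy variable trap by dropping the first category):

#Equivalent encoding using pandas
X_alt = pd.get_dummies(df.drop('Profit', axis=1), columns=['State'],
                       drop_first=True, dtype=int)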

Now that our data is ready to be used in our regression model, let us move to the feature selection step.

Backward Elimination

Steps:

  • 1. Select a significance level to stay in the model (e.g. SL = 0.05).
  • 2. Fit the model with all possible predictors.
  • 3. Consider the predictor with the highest P-value.
  • 4. IF P > SL: remove the predictor. ELSE: EXIT.
  • 5. Fit the model without the removed predictor.
  • 6. Repeat steps 3 to 5 until EXIT.
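
These steps can also be wrapped in a small loop. Below is a minimal sketch (not part of the original walkthrough) that assumes y holds the Profit column and X is a feature matrix whose intercept column is named 'constant', as in the demonstration that follows; the helper name backward_elimination is purely illustrative.

##Sketch: automating the backward elimination steps above
def backward_elimination(X, y, sl=0.05):
    features = list(X.columns)
    while len(features) > 1:
        model = sm.OLS(endog=y, exog=X[features]).fit()
        pvalues = model.pvalues.drop('constant')  #never drop the intercept
        worst = pvalues.idxmax()                  #predictor with the highest P-value
        if pvalues[worst] > sl:
            features.remove(worst)                #P > SL: remove the predictor
        else:
            break                                 #P <= SL: exit
    return features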

Demonstration

Before we get started, we need to add a constant (intercept) column to our feature matrix, since statsmodels' OLS does not include an intercept by default.

##adding a constant column
c = np.ones((50, 1))
col = ['constant'] + list(X.columns)
X = pd.DataFrame(np.append(arr=c, values=X, axis=1).astype(int), columns=col)
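
(As an aside, statsmodels ships a helper that does the same thing; note that it names the added column 'const' rather than 'constant'.)

##Alternative: let statsmodels add the intercept column
#X = sm.add_constant(X)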

Now, let us first build the model with all the possible predictors.

significance_level = 0.05
model = sm.OLS(endog=y, exog=X).fit()
model.summary()

output:

                     coef    std err          t      P>|t|      [0.025      0.975]
constant         5.013e+04   6884.855      7.281      0.000    3.63e+04     6.4e+04
Florida           198.7542   3371.026      0.059      0.953   -6595.103    6992.611
New York          -42.0063   3256.058     -0.013      0.990   -6604.161    6520.148
R&D Spend           0.8060      0.046     17.368      0.000       0.712       0.900
Administration     -0.0270      0.052     -0.517      0.608      -0.132       0.078
Marketing Spend     0.0270      0.017      1.574      0.123      -0.008       0.062

Since New York and Florida have the highest P-values (0.990 and 0.953), both far above our significance level, we will drop those columns and re-fit the model without them. (Strictly speaking, backward elimination removes one predictor at a time, but as both dummy variables are clearly insignificant we drop them together.)

#New York and Florida have the highest P-values
X = X.drop(['New York', 'Florida'], axis=1)
model = sm.OLS(endog=y, exog=X).fit()
model.summary()

output:

                     coef    std err          t      P>|t|      [0.025      0.975]
constant         5.012e+04   6572.384      7.626      0.000    3.69e+04    6.34e+04
R&D Spend           0.8057      0.045     17.846      0.000       0.715       0.897
Administration     -0.0268      0.051     -0.526      0.602      -0.130       0.076
Marketing Spend     0.0272      0.016      1.655      0.105      -0.006       0.060

Since Administration now has the highest P-value (0.602), which is above our significance level, we will drop this column and re-fit the model without it.

#Administration has the highest P-value
X = X.drop('Administration', axis=1)
model = sm.OLS(endog=y, exog=X).fit()
model.summary()

output:

                     coef    std err          t      P>|t|      [0.025      0.975]
constant         4.698e+04   2689.941     17.464      0.000    4.16e+04    5.24e+04
R&D Spend           0.7966      0.041     19.265      0.000       0.713       0.880
Marketing Spend     0.0299      0.016      1.927      0.060      -0.001       0.061

Now the only remaining predictors are R&D Spend, which is highly significant, and Marketing Spend, whose P-value (0.060) is just above our significance level. We stop the process here and keep both. Hence, R&D Spend and Marketing Spend are the most significant features for our regression model.

Forward Selection

Steps:

  • 1. Select a significance level to enter the model (e.g. SL = 0.05).
  • 2. Fit all simple regression models y ~ Xn and select the one with the lowest P-value.
  • 3. Keep this predictor and fit all possible models with one extra predictor added to the one(s) you already have.
  • 4. Consider the predictor with the lowest P-value.
  • 5. IF P < SL: repeat step 3. ELSE: EXIT.
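
As with backward elimination, these steps can be wrapped in a loop. A minimal sketch, again assuming a feature matrix X with an intercept column named 'constant' and the target y; the helper name forward_selection is purely illustrative.

##Sketch: automating the forward selection steps above
def forward_selection(X, y, sl=0.05):
    selected = ['constant']                        #always keep the intercept
    remaining = [c for c in X.columns if c != 'constant']
    while remaining:
        #Fit one model per candidate and record that candidate's P-value
        pvals = {c: sm.OLS(endog=y, exog=X[selected + [c]]).fit().pvalues[c]
                 for c in remaining}
        best = min(pvals, key=pvals.get)           #candidate with the lowest P-value
        if pvals[best] < sl:
            selected.append(best)                  #P < SL: keep it and repeat
            remaining.remove(best)
        else:
            break                                  #otherwise, exit
    return selected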

Demonstration

First, we fit a simple regression for each possible predictor individually and compare their P-values. (Note that the backward elimination demonstration above dropped columns from X, so we rebuild the full feature matrix before starting.)

##Re-building the full feature matrix (backward elimination dropped columns from X)
X = data.drop(['State', 'California'], axis=1)
X.insert(0, 'constant', 1)

##All possible predictors
X1 = X[["constant", "Florida"]]
X2 = X[["constant", "New York"]]
X3 = X[["constant", "R&D Spend"]]
X4 = X[["constant", "Administration"]]
X5 = X[["constant", "Marketing Spend"]]

model1 = sm.OLS(endog=y, exog=X1).fit()
model2 = sm.OLS(endog=y, exog=X2).fit()
model3 = sm.OLS(endog=y, exog=X3).fit()
model4 = sm.OLS(endog=y, exog=X4).fit()
model5 = sm.OLS(endog=y, exog=X5).fit()

print("Model 1")
print(model1.summary())
print("Model 2")
print(model2.summary())
print("Model 3")
print(model3.summary())
print("Model 4")
print(model4.summary())
print("Model 5")
print(model5.summary())

From the output of the above code snippet we can see that R&D Spend has the lowest P-value among all the predictors. Hence, we select R&D Spend as a significant feature for our model and repeat the steps with the remaining predictors.

##All possible predictors
X1 = X[["constant", "R&D Spend", "Florida"]]
X2 = X[["constant", "R&D Spend", "New York"]]
X3 = X[["constant", "R&D Spend", "Administration"]]
X4 = X[["constant", "R&D Spend", "Marketing Spend"]]

model1 = sm.OLS(endog=y, exog=X1).fit()
model2 = sm.OLS(endog=y, exog=X2).fit()
model3 = sm.OLS(endog=y, exog=X3).fit()
model4 = sm.OLS(endog=y, exog=X4).fit()

print("Model 1")
print(model1.summary())
print("Model 2")
print(model2.summary())
print("Model 3")
print(model3.summary())
print("Model 4")
print(model4.summary())

From the output of the above code snippet we can see that, in conjunction with R&D Spend, Marketing Spend has the lowest P-value among the candidate predictors. Hence, we keep Marketing Spend alongside R&D Spend and repeat the steps with the remaining predictors.

##All possible predictors
X1 = X[["constant", "R&D Spend", "Marketing Spend", "Florida"]]
X2 = X[["constant", "R&D Spend", "Marketing Spend", "New York"]]
X3 = X[["constant", "R&D Spend", "Marketing Spend", "Administration"]]

model1 = sm.OLS(endog=y, exog=X1).fit()
model2 = sm.OLS(endog=y, exog=X2).fit()
model3 = sm.OLS(endog=y, exog=X3).fit()

print("Model 1")
print(model1.summary())
print("Model 2")
print(model2.summary())
print("Model 3")
print(model3.summary())

As the output shows that the P-values of all the remaining candidate predictors are greater than our significance level, we stop the process and choose the predictors from the previous step (i.e. R&D Spend and Marketing Spend) as our significant features.

Conclusion

Using both methods, Backward Elimination and Forward Selection, we came to the conclusion that R&D Spend and Marketing Spend are the most significant features for predicting Profit with our regression model.
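
To close the loop, here is a brief sketch (re-using the imports from the start of the post) of how a final model on just these two features could be trained and evaluated with scikit-learn; the split parameters are illustrative choices:

##Sketch: fitting a final model on the two selected features
X_final = df[['R&D Spend', 'Marketing Spend']]
y_final = df['Profit']

X_train, X_test, y_train, y_test = train_test_split(X_final, y_final,
                                                    test_size=0.2, random_state=0)

reg = LinearRegression()
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))  #R^2 on the held-out data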