Introduction
In this blog we will explore two of the methods used for selecting significant features in regression models, Backward Elimination and Forward Selection, and demonstrate each with a simple example.
Getting Started
For our demonstration we will use the 50 Startups dataset, which has 50 rows and 5 columns.
Now let us import the required libraries.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Before we start building our model, let us get our data ready.
df = pd.read_csv('50_Startups.csv')
df.head()
output:
| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.20 | 136897.80 | 471784.10 | New York | 192261.83 |
| 1 | 162597.70 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
From the above table we can see that the Profit attribute is our target variable, so we will use all the other columns as independent variables to predict it. Also, since the State variable is categorical, we will one-hot encode it before using it in our model.
#Separating the independent variables (X) and the target (y)
X = df.drop('Profit', axis=1)
y = df['Profit']
#Creating Dummy Variables for state
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
oe = OneHotEncoder()
X['State'] = le.fit_transform(X['State'])
X_enc = oe.fit_transform(X[['State']]).toarray()
#Column names: the three dummy columns followed by the original columns
col = ['California', 'Florida', 'New York']
for i in range(len(X.columns.values)):
    col.append(X.columns.values[i])
data = pd.DataFrame(np.append(arr=X_enc, values=X, axis=1).astype(int), columns=col)
#Avoiding Dummy Variable Trap
X = data.drop(['State','California'],axis=1)
X.head()
output:
| | Florida | New York | R&D Spend | Administration | Marketing Spend |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 165349 | 136897 | 471784 |
| 1 | 0 | 0 | 162597 | 151377 | 443898 |
| 2 | 1 | 0 | 153441 | 101145 | 407934 |
| 3 | 0 | 1 | 144372 | 118671 | 383199 |
| 4 | 1 | 0 | 142107 | 91391 | 366168 |
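As a side note, pandas can produce the same encoding more concisely; a minimal sketch (the resulting dummy columns are named State_Florida and State_New York rather than Florida and New York, and it is assigned to a throwaway variable X_alt so it does not overwrite the X built above):
#Alternative: one-hot encode State with pandas, dropping the first dummy
#(California) to avoid the dummy variable trap
X_alt = pd.get_dummies(df.drop('Profit', axis=1), columns=['State'], drop_first=True)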
Now that our data is ready to be used in our regression model, let us move to the feature selection step.
Backward Elimination
Steps:
1. Select a significance level to stay in the model (e.g. SL = 0.05).
2. Fit the model with all possible predictors.
3. Consider the predictor with the highest P-value.
4. If P > SL, remove that predictor; otherwise, EXIT.
5. Fit the model without the removed predictor.
6. Repeat steps 3 to 5 until EXIT.
Demonstration
Before we get started, we need to add a constant (intercept) column to our feature matrix, since sm.OLS does not add one automatically.
##creating a constant (intercept) column
c = np.ones((50, 1))
col = ['constant'] + list(X.columns)
X = pd.DataFrame(np.append(arr=c, values=X, axis=1).astype(int), columns=col)
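Note that statsmodels has a helper for exactly this step; an equivalent one-liner you could use instead of the manual append above (the intercept column is then named const rather than constant):
#Equivalent one-liner using statsmodels (column is named 'const' instead of 'constant'):
#X = sm.add_constant(X)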
Now, let us first build the model with all the possible predictors.
significance_level = 0.05
model = sm.OLS(endog=y, exog=X).fit()
model.summary()
output:
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| constant | 5.013e+04 | 6884.855 | 7.281 | 0.000 | 3.63e+04 | 6.4e+04 |
| Florida | 198.7542 | 3371.026 | 0.059 | 0.953 | -6595.103 | 6992.611 |
| New York | -42.0063 | 3256.058 | -0.013 | 0.990 | -6604.161 | 6520.148 |
| R&D Spend | 0.8060 | 0.046 | 17.368 | 0.000 | 0.712 | 0.900 |
| Administration | -0.0270 | 0.052 | -0.517 | 0.608 | -0.132 | 0.078 |
| Marketing Spend | 0.0270 | 0.017 | 1.574 | 0.123 | -0.008 | 0.062 |
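Rather than scanning the full summary each time, the P-values can be read directly from the fitted model; a small sketch:
#P-values of every term as a pandas Series
print(model.pvalues)
#Least significant predictor (excluding the constant)
print(model.pvalues.drop('constant').idxmax())   #'New York' for this fit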
Since New York and Florida have the highest P-values, far above our significance level, we drop those columns and re-fit the model without them.
#New York and Florida have the highest P-values
X = X.drop(['New York', 'Florida'], axis=1)
model = sm.OLS(endog=y, exog=X).fit()
model.summary()
output:
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| constant | 5.012e+04 | 6572.384 | 7.626 | 0.000 | 3.69e+04 | 6.34e+04 |
| R&D Spend | 0.8057 | 0.045 | 17.846 | 0.000 | 0.715 | 0.897 |
| Administration | -0.0268 | 0.051 | -0.526 | 0.602 | -0.130 | 0.076 |
| Marketing Spend | 0.0272 | 0.016 | 1.655 | 0.105 | -0.006 | 0.060 |
Since Administration now has the highest P-value, we drop this column and re-fit the model without it.
#Administration has the highest P-value
X = X.drop('Administration', axis=1)
model = sm.OLS(endog=y, exog=X).fit()
model.summary()
output:
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| constant | 4.698e+04 | 2689.941 | 17.464 | 0.000 | 4.16e+04 | 5.24e+04 |
| R&D Spend | 0.7966 | 0.041 | 19.265 | 0.000 | 0.713 | 0.880 |
| Marketing Spend | 0.0299 | 0.016 | 1.927 | 0.060 | -0.001 | 0.061 |
Now the P-value for R&D Spend is well below our significance level, and the P-value for Marketing Spend (0.060) is only marginally above it, so we stop the process here. Hence, R&D Spend and Marketing Spend are the most significant features for our regression model.
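For completeness, the whole elimination can also be scripted as a loop. Below is a minimal sketch, assuming X_full is a feature matrix containing the constant column plus all five predictors and y is the Profit series; note that with a strict 0.05 cutoff it would also drop Marketing Spend, whose P-value (0.060) sits just above the threshold.
#Sketch: automated backward elimination
def backward_elimination(X_full, y, sl=0.05):
    X_curr = X_full.copy()
    while True:
        model = sm.OLS(endog=y, exog=X_curr).fit()
        pvals = model.pvalues.drop('constant')   #never eliminate the intercept
        if pvals.max() <= sl:
            return list(X_curr.columns), model
        #drop the least significant predictor and re-fit
        X_curr = X_curr.drop(pvals.idxmax(), axis=1)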
Forward Selection
Steps:
1. Select a significance level to enter the model (e.g. SL = 0.05).
2. Fit all simple regression models y ~ Xn and select the one with the lowest P-value.
3. Keep this predictor and fit all possible models with one extra predictor added to the one(s) you already have.
4. Consider the new predictor with the lowest P-value.
5. If P < SL, repeat step 3; otherwise, EXIT.
Demonstration
First, we rebuild the full feature matrix (the constant column plus all five predictors, since X was whittled down during backward elimination) and calculate the P-values for each possible predictor individually.
##Rebuilding the full feature matrix for forward selection
X = data.drop(['State', 'California'], axis=1)
X.insert(0, 'constant', 1)
##All possible predictors
X1 = X[["constant", "Florida"]]
X2 = X[["constant", "New York"]]
X3 = X[["constant", "R&D Spend"]]
X4 = X[["constant", "Administration"]]
X5 = X[["constant", "Marketing Spend"]]
model1 = sm.OLS(endog=y, exog=X1).fit()
model2 = sm.OLS(endog=y, exog=X2).fit()
model3 = sm.OLS(endog=y, exog=X3).fit()
model4 = sm.OLS(endog=y, exog=X4).fit()
model5 = sm.OLS(endog=y, exog=X5).fit()
print("Model 1")
print(model1.summary())
print("Model 2")
print(model2.summary())
print("Model 3")
print(model3.summary())
print("Model 4")
print(model4.summary())
print("Model 5")
print(model5.summary())
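Printing five full summaries is verbose; for this comparison we only need the P-value of each candidate in its own model. A small sketch using the models fitted above:
#P-value of each candidate predictor in its single-predictor model
candidates = ['Florida', 'New York', 'R&D Spend', 'Administration', 'Marketing Spend']
for name, m in zip(candidates, [model1, model2, model3, model4, model5]):
    print(name, m.pvalues[name])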
From the output of the above code snippet we can see that R&D Spend has the lowest P-value among all the predictors. Hence, we select R&D Spend as a significant feature for our model and repeat the steps with the remaining predictors.
##All possible predictors
X1 = X[["constant", "R&D Spend", "Florida"]]
X2 = X[["constant", "R&D Spend", "New York"]]
X3 = X[["constant", "R&D Spend", "Administration"]]
X4 = X[["constant", "R&D Spend", "Marketing Spend"]]
model1 = sm.OLS(endog=y, exog=X1).fit()
model2 = sm.OLS(endog=y, exog=X2).fit()
model3 = sm.OLS(endog=y, exog=X3).fit()
model4 = sm.OLS(endog=y, exog=X4).fit()
print("Model 1")
print(model1.summary())
print("Model 2")
print(model2.summary())
print("Model 3")
print(model3.summary())
print("Model 4")
print(model4.summary())
From the output of the above code snippet we can see that Marketing Spend, in conjunction with R&D Spend, has the lowest P-value among the remaining predictors. Hence, we keep Marketing Spend along with R&D Spend and repeat the steps with the remaining predictors.
##All possible predictors
X1 = X[["constant", "R&D Spend", "Marketing Spend", "Florida"]]
X2 = X[["constant", "R&D Spend", "Marketing Spend", "New York"]]
X3 = X[["constant", "R&D Spend", "Marketing Spend", "Administration"]]
model1 = sm.OLS(endog=y, exog=X1).fit()
model2 = sm.OLS(endog=y, exog=X2).fit()
model3 = sm.OLS(endog=y, exog=X3).fit()
print("Model 1")
print(model1.summary())
print("Model 2")
print(model2.summary())
print("Model 3")
print(model3.summary())
From the output we can see that the P-values of all the remaining candidates are greater than our significance level, so we stop the process and keep the predictors from the previous step (i.e. R&D Spend and Marketing Spend) as our significant features.
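As with backward elimination, the whole procedure can be scripted; a minimal sketch, assuming X_full holds the constant column plus all candidate predictors and y is the Profit series. (With a strict 0.05 cutoff this loop would stop after selecting R&D Spend alone, since Marketing Spend's P-value of 0.060 is just above the threshold.)
#Sketch: automated forward selection
def forward_selection(X_full, y, sl=0.05):
    selected = ['constant']
    remaining = [c for c in X_full.columns if c != 'constant']
    while remaining:
        #P-value of each remaining candidate when added to the current model
        pvals = {}
        for c in remaining:
            model = sm.OLS(endog=y, exog=X_full[selected + [c]]).fit()
            pvals[c] = model.pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= sl:
            break   #no remaining candidate clears the threshold
        selected.append(best)
        remaining.remove(best)
    return selected, sm.OLS(endog=y, exog=X_full[selected]).fit()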
Conclusion
Using both methods, Backward Elimination and Forward Selection, we arrived at the same conclusion: R&D Spend and Marketing Spend are the most significant features for predicting Profit with our regression model.