Introduction
In this blog we will explore two of the methods used for selecting significant features in regression models, Backward Elimination and Forward Selection, and demonstrate each with a simple example.
Getting Started
For our demonstration we will use the 50 Startups dataset, which has 50 rows and 5 columns.
Now let us import the required libraries.
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
Before we start building our model, let us get our data ready.
df = pd.read_csv('50_Startups.csv')
df.head()
output:
| | R&D Spend | Administration | Marketing Spend | State | Profit |
|---|---|---|---|---|---|
| 0 | 165349.20 | 136897.80 | 471784.10 | New York | 192261.83 |
| 1 | 162597.70 | 151377.59 | 443898.53 | California | 191792.06 |
| 2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
| 3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
| 4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
From the above table we can see that the Profit attribute is our target variable, so we will use all the other columns as independent variables to predict it. Also, since the State variable is categorical, we will one-hot encode it before using it in our model.
#Separating the independent variables (X) and the target (y)
X = df.drop('Profit', axis=1)
y = df['Profit']
#Creating Dummy Variables for state
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
oe = OneHotEncoder()
X['State'] = le.fit_transform(X['State'])
X_enc = oe.fit_transform(X[['State']]).toarray()
#Column names: the three dummy columns followed by the original columns
col = ['California', 'Florida', 'New York']
for i in range(len(X.columns.values)):
    col.append(X.columns.values[i])
data = pd.DataFrame(np.append(arr=X_enc, values=X, axis=1).astype(int), columns=col)
#Avoiding Dummy Variable Trap
X = data.drop(['State','California'],axis=1)
X.head()
output:
| | Florida | New York | R&D Spend | Administration | Marketing Spend |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 165349 | 136897 | 471784 |
| 1 | 0 | 0 | 162597 | 151377 | 443898 |
| 2 | 1 | 0 | 153441 | 101145 | 407934 |
| 3 | 0 | 1 | 144372 | 118671 | 383199 |
| 4 | 1 | 0 | 142107 | 91391 | 366168 |
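As a side note, pandas can produce the same encoding more concisely; a minimal sketch (the resulting dummy columns are named State_Florida and State_New York rather than Florida and New York, and it is assigned to a throwaway variable X_alt so it does not overwrite the X built above):
#Alternative: one-hot encode State with pandas, dropping the first dummy
#(California) to avoid the dummy variable trap
X_alt = pd.get_dummies(df.drop('Profit', axis=1), columns=['State'], drop_first=True)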
Now that our data is ready to be used in our regression model, let us move to the feature selection step.
Backward Elimination
Steps:
1. Select a significance level to stay in the model (e.g. SL = 0.05).
2. Fit the model with all possible predictors.
3. Consider the predictor with the highest P-value.
4. If P > SL, remove that predictor; otherwise, EXIT.
5. Fit the model without the removed predictor.
6. Repeat steps 3 to 5 until EXIT.
Demonstration
Before we get started, we need to add a constant (intercept) column to our feature matrix, since sm.OLS does not add one automatically.
##creating a constant (intercept) column
c = np.ones((50, 1))
col = ['constant'] + list(X.columns)
X = pd.DataFrame(np.append(arr=c, values=X, axis=1).astype(int), columns=col)
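Note that statsmodels has a helper for exactly this step; an equivalent one-liner you could use instead of the manual append above (the intercept column is then named const rather than constant):
#Equivalent one-liner using statsmodels (column is named 'const' instead of 'constant'):
#X = sm.add_constant(X)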
Now, let us first build the model with all the possible predictors.
significance_level = 0.05
model = sm.OLS(endog=y, exog=X).fit()
model.summary()
output:
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| constant | 5.013e+04 | 6884.855 | 7.281 | 0.000 | 3.63e+04 | 6.4e+04 |
| Florida | 198.7542 | 3371.026 | 0.059 | 0.953 | -6595.103 | 6992.611 |
| New York | -42.0063 | 3256.058 | -0.013 | 0.990 | -6604.161 | 6520.148 |
| R&D Spend | 0.8060 | 0.046 | 17.368 | 0.000 | 0.712 | 0.900 |
| Administration | -0.0270 | 0.052 | -0.517 | 0.608 | -0.132 | 0.078 |
| Marketing Spend | 0.0270 | 0.017 | 1.574 | 0.123 | -0.008 | 0.062 |
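Rather than scanning the full summary each time, the P-values can be read directly from the fitted model; a small sketch:
#P-values of every term as a pandas Series
print(model.pvalues)
#Least significant predictor (excluding the constant)
print(model.pvalues.drop('constant').idxmax())   #'New York' for this fit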
Since New York and Florida have the highest P-values, far above our significance level, we drop those columns and re-fit the model without them.
#New York and Florida have the highest P-values
X = X.drop(['New York', 'Florida'], axis=1)
model = sm.OLS(endog=y, exog=X).fit()
model.summary()
output:
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| constant | 5.012e+04 | 6572.384 | 7.626 | 0.000 | 3.69e+04 | 6.34e+04 |
| R&D Spend | 0.8057 | 0.045 | 17.846 | 0.000 | 0.715 | 0.897 |
| Administration | -0.0268 | 0.051 | -0.526 | 0.602 | -0.130 | 0.076 |
| Marketing Spend | 0.0272 | 0.016 | 1.655 | 0.105 | -0.006 | 0.060 |
Since Administration now has the highest P-value, we drop this column and re-fit the model without it.
#Administration has the highest P-value
X = X.drop('Administration', axis=1)
model = sm.OLS(endog=y, exog=X).fit()
model.summary()
output:
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| constant | 4.698e+04 | 2689.941 | 17.464 | 0.000 | 4.16e+04 | 5.24e+04 |
| R&D Spend | 0.7966 | 0.041 | 19.265 | 0.000 | 0.713 | 0.880 |
| Marketing Spend | 0.0299 | 0.016 | 1.927 | 0.060 | -0.001 | 0.061 |
Now the P-value for R&D Spend is well below our significance level, and the P-value for Marketing Spend (0.060) is only marginally above it, so we stop the process here. Hence, R&D Spend and Marketing Spend are the most significant features for our regression model.
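For completeness, the whole elimination can also be scripted as a loop. Below is a minimal sketch, assuming X_full is a feature matrix containing the constant column plus all five predictors and y is the Profit series; note that with a strict 0.05 cutoff it would also drop Marketing Spend, whose P-value (0.060) sits just above the threshold.
#Sketch: automated backward elimination
def backward_elimination(X_full, y, sl=0.05):
    X_curr = X_full.copy()
    while True:
        model = sm.OLS(endog=y, exog=X_curr).fit()
        pvals = model.pvalues.drop('constant')   #never eliminate the intercept
        if pvals.max() <= sl:
            return list(X_curr.columns), model
        #drop the least significant predictor and re-fit
        X_curr = X_curr.drop(pvals.idxmax(), axis=1)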
Forward Selection
Steps:
1. Select a significance level to enter the model (e.g. SL = 0.05).
2. Fit all simple regression models y ~ Xn and select the one with the lowest P-value.
3. Keep this predictor and fit all possible models with one extra predictor added to the one(s) you already have.
4. Consider the new predictor with the lowest P-value.
5. If P < SL, repeat step 3; otherwise, EXIT.
Demonstration
First, we rebuild the full feature matrix (the constant column plus all five predictors, since X was whittled down during backward elimination) and calculate the P-values for each possible predictor individually.
##Rebuilding the full feature matrix for forward selection
X = data.drop(['State', 'California'], axis=1)
X.insert(0, 'constant', 1)
##All possible predictors
X1 = X[["constant", "Florida"]]
X2 = X[["constant", "New York"]]
X3 = X[["constant", "R&D Spend"]]
X4 = X[["constant", "Administration"]]
X5 = X[["constant", "Marketing Spend"]]
model1 = sm.OLS(endog=y, exog=X1).fit()
model2 = sm.OLS(endog=y, exog=X2).fit()
model3 = sm.OLS(endog=y, exog=X3).fit()
model4 = sm.OLS(endog=y, exog=X4).fit()
model5 = sm.OLS(endog=y, exog=X5).fit()
print("Model 1")
print(model1.summary())
print("Model 2")
print(model2.summary())
print("Model 3")
print(model3.summary())
print("Model 4")
print(model4.summary())
print("Model 5")
print(model5.summary())
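Printing five full summaries is verbose; for this comparison we only need the P-value of each candidate in its own model. A small sketch using the models fitted above:
#P-value of each candidate predictor in its single-predictor model
candidates = ['Florida', 'New York', 'R&D Spend', 'Administration', 'Marketing Spend']
for name, m in zip(candidates, [model1, model2, model3, model4, model5]):
    print(name, m.pvalues[name])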
From the output of the above code snippet we can see that R&D Spend has the lowest P-value among all the predictors. Hence, we select R&D Spend as a significant feature for our model and repeat the steps with the remaining predictors.
##All possible predictors
X1 = X[["constant", "R&D Spend", "Florida"]]
X2 = X[["constant", "R&D Spend", "New York"]]
X3 = X[["constant", "R&D Spend", "Administration"]]
X4 = X[["constant", "R&D Spend", "Marketing Spend"]]
model1 = sm.OLS(endog=y, exog=X1).fit()
model2 = sm.OLS(endog=y, exog=X2).fit()
model3 = sm.OLS(endog=y, exog=X3).fit()
model4 = sm.OLS(endog=y, exog=X4).fit()
print("Model 1")
print(model1.summary())
print("Model 2")
print(model2.summary())
print("Model 3")
print(model3.summary())
print("Model 4")
print(model4.summary())
From the output of the above code snippet we can see that Marketing Spend, in conjunction with R&D Spend, has the lowest P-value among the remaining predictors. Hence, we keep Marketing Spend along with R&D Spend and repeat the steps with the remaining predictors.
##All possible predictors
X1 = X[["constant", "R&D Spend", "Marketing Spend", "Florida"]]
X2 = X[["constant", "R&D Spend", "Marketing Spend", "New York"]]
X3 = X[["constant", "R&D Spend", "Marketing Spend", "Administration"]]
model1 = sm.OLS(endog=y, exog=X1).fit()
model2 = sm.OLS(endog=y, exog=X2).fit()
model3 = sm.OLS(endog=y, exog=X3).fit()
print("Model 1")
print(model1.summary())
print("Model 2")
print(model2.summary())
print("Model 3")
print(model3.summary())
From the output we can see that the P-values of all the remaining candidates are greater than our significance level, so we stop the process and keep the predictors from the previous step (i.e. R&D Spend and Marketing Spend) as our significant features.
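As with backward elimination, the whole procedure can be scripted; a minimal sketch, assuming X_full holds the constant column plus all candidate predictors and y is the Profit series. (With a strict 0.05 cutoff this loop would stop after selecting R&D Spend alone, since Marketing Spend's P-value of 0.060 is just above the threshold.)
#Sketch: automated forward selection
def forward_selection(X_full, y, sl=0.05):
    selected = ['constant']
    remaining = [c for c in X_full.columns if c != 'constant']
    while remaining:
        #P-value of each remaining candidate when added to the current model
        pvals = {}
        for c in remaining:
            model = sm.OLS(endog=y, exog=X_full[selected + [c]]).fit()
            pvals[c] = model.pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= sl:
            break   #no remaining candidate clears the threshold
        selected.append(best)
        remaining.remove(best)
    return selected, sm.OLS(endog=y, exog=X_full[selected]).fit()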
Conclusion
Using both methods, Backward Elimination and Forward Selection, we arrived at the same conclusion: R&D Spend and Marketing Spend are the most significant features for predicting Profit with our regression model.