Feature Engineering and the Creation of Dummy Variables

In Logistic Regression, the data often contains classification, or categorical, variables. An example of a categorical variable is the borough in New York City. When a 311 Service Request is made, it is made in the Bronx, Brooklyn, Manhattan, Queens, Staten Island, or the borough is left unspecified. This is different from a continuous variable, which is numerical. For example, when predicting the sales price of your car based on make, model, year, etc., the sales price is numerical and continuous.

In regression analysis, we want to see how our independent (X) variables affect our dependent variable (y). In order to do this with classification/categorical variables, we create Dummy Variables. A Dummy Variable is an artificial variable created to represent an attribute with two or more distinct categories or levels. In our example above, borough is a feature (column) with six different categories.

Creating dummy variables for this borough feature will create six different features (columns), one per borough value, each holding one of two values: 0 for false or 1 for true.

Now why do we do this? It is because in Regression analysis all independent (X) variables are treated as numerical. Our regression model knows that the 311 service request in Line 2 was made in Queens because there is a 1 in the b_Queens feature. At this point the Borough feature is redundant and can be deleted.
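To make the idea concrete, here is a minimal sketch on a made-up five-row table. The dataframe `toy`, the `unique_key` column, and all of its values are illustrative assumptions, not the real 311 data. It passes `dtype=int` so the new columns hold 0/1, since recent pandas versions otherwise return True/False.

```python
import pandas as pd

# Hypothetical toy data standing in for the 311 service requests
toy = pd.DataFrame({
    "unique_key": [101, 102, 103, 104, 105],
    "borough": ["Bronx", "Brooklyn", "Queens", "Manhattan", "Queens"],
})

# One 0/1 column per borough value; dtype=int gives integers instead of booleans
dummies = pd.get_dummies(toy["borough"], prefix="b", dtype=int)

# Attach the dummy columns, then drop the now-redundant text column
toy = pd.concat([toy, dummies], axis=1).drop(columns=["borough"])

print(toy)
```

Each row ends up with exactly one 1 across the b_ columns, which is how the model can tell which borough the request came from.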

How do we do this in Python?

import pandas as pd

dv = pd.get_dummies(aug['borough'], prefix='b')
aug = pd.concat([aug, dv], axis=1)

Line 3 specifies which feature we are creating dummy variables for; in this case it is borough. Each new feature will be prefixed with b, so Bronx becomes b_Bronx when it is created.

This has to be done for every categorical feature we want to consider. Creating dummy variables for the borough feature only added six more features (columns) to the dataframe.
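Rather than repeating the two lines for each feature, `pd.get_dummies` can also take the whole dataframe plus a list of columns to encode. The following is a sketch on a small stand-in dataframe; the name `requests`, the `open_hours` column, and the values are made up for illustration. Note that with `columns=` the original text columns are replaced by the dummies rather than kept alongside them.

```python
import pandas as pd

# Hypothetical stand-in for the 311 dataframe
requests = pd.DataFrame({
    "borough": ["Bronx", "Queens", "Queens"],
    "agency": ["NYPD", "DOT", "NYPD"],
    "open_hours": [2.5, 10.0, 1.0],  # made-up numeric feature, left untouched
})

# Encode several categorical features in one call.
# The prefix dict maps each column to the prefix used for its dummies.
encoded = pd.get_dummies(
    requests,
    columns=["borough", "agency"],
    prefix={"borough": "b", "agency": "agency"},
    dtype=int,
)

print(encoded.columns.tolist())
```

Whether to encode one feature at a time or all at once is mostly a matter of taste; the one-call form is convenient once the list of categorical features is settled.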

If we look at the agency feature, we find that there are 14 different values. Such values include: NYPD, DOT, DEP, HPD, DPR, DOB, DOHMH, and others.

dv = pd.get_dummies(aug['agency'], prefix='agency')
aug = pd.concat([aug, dv], axis=1)

This will create the features agency_NYPD, agency_DOT, agency_DEP and so on, for a total of 14 new features.

When we take a look at the feature complaint_type, we see that there are 151 different types, including Noise-Residential, Illegal Parking, Blocked Driveway, Rodent, etc.

dv = pd.get_dummies(aug['complaint_type'], prefix='complaint')
aug = pd.concat([aug, dv], axis=1)

We have to be careful with features that have many different values. Every value we turn into a dummy variable adds another feature (column), and as the number of features grows, so does the amount of time and computational power needed to make our prediction.
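One way to gauge this before encoding is to count the distinct values in each candidate feature, since each distinct value becomes one new column. A minimal sketch, assuming a hypothetical stand-in dataframe named `sample` with made-up rows:

```python
import pandas as pd

# Hypothetical stand-in data; the real dataframe would have many more rows
sample = pd.DataFrame({
    "borough": ["Bronx", "Queens", "Queens", "Brooklyn"],
    "agency": ["NYPD", "DOT", "NYPD", "NYPD"],
    "complaint_type": ["Rodent", "Illegal Parking", "Rodent", "Noise-Residential"],
})

categorical = ["borough", "agency", "complaint_type"]

# Distinct values per feature = number of dummy columns that feature would add
new_columns = sample[categorical].nunique()

print(new_columns)
print("total dummy columns:", int(new_columns.sum()))
```

Running the same count on the real dataframe shows in advance which features would add only a handful of columns and which would add hundreds.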
