Salary Classification Using Census Data
Predict whether an individual earns more than $50k or not, using US Census data
Problem Statement - An organization wants to identify people in an available census dataset who earn more than $50k, so that it can target them via emails to solicit funds for a charity.
Note - This project was done as a part of Udacity ML Nanodegree Program.
Exact details of the dataset and the analysis code can be found in this Jupyter Notebook. The aim of this article is to provide a high-level overview of the steps taken and the reasons behind them. This article complements the information present in the notebook, so you can quickly read the article for the main ideas and then go through the notebook for an in-depth treatment of the topic.
The project can be divided into 7 major steps -
- Data Exploration
- Data Preprocessing
- Trying different types of classification algorithms
- Evaluating selected models and choosing the best
- Model Tuning (Hyperparameter Optimization)
- Looking at Feature Importance to improve model explanability
- Feature Selection and evaluating performance
The above steps reflect the typical workflow used in a short ML project. Let’s go through each of them briefly in this article. A detailed description can be found in the Jupyter Notebook and/or the hyperlinks in the article.
Step 1 - Data Exploration
Aim - Familiarize yourself with the dataset, i.e. the variables available (their types and distributions) and the missing values (their types and the reasons for their presence); identify the target & feature variables and check for class imbalance.
Process - Use pandas’ describe() and isna() methods, make histograms to check the distribution of numerical variables, and use the value_counts() method to check the distribution of categorical variables.
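A minimal exploration sketch along these lines is shown below; the file name census.csv and the income column are assumptions and may differ from the actual notebook.

```python
# Minimal exploration sketch; file name and 'income' column are assumptions.
import pandas as pd

data = pd.read_csv("census.csv")

# Summary statistics and missing-value counts per column
print(data.describe())
print(data.isna().sum())

# Distribution of numerical variables
data.hist(figsize=(12, 8))

# Distribution of the (assumed) target variable and the class imbalance
print(data["income"].value_counts(normalize=True))
```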
Step 2 - Data Preprocessing
Aim - Many ML algorithms expect data in a particular format to work well. The major preprocessing steps involve (not necessarily in this order) -
- Dealing with missing data (removing or imputing)
- Outlier Treatment (removing or transforming data)
- Normalizing numerical data
- Encoding categorical variables (one-hot, label)
- Splitting data into train and test sets
Curious about what should be the order? Read this amazing post.
Process - Pandas and sklearn provide methods / classes for all of the above (check the Jupyter Notebook for the implementation).
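A rough preprocessing sketch is given below, continuing from the exploration sketch above. The column names (capital-gain, income, etc.) and the log-transform of skewed columns are assumptions based on the typical census dataset, not taken verbatim from the notebook.

```python
# Preprocessing sketch; column names and the skewed-column choice are assumptions.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

numerical = ["age", "education-num", "capital-gain", "capital-loss", "hours-per-week"]

features = data.drop("income", axis=1)
# Encode the binary target as 0/1 (assuming '>50K' / '<=50K' labels)
target = (data["income"] == ">50K").astype(int)

# Log-transform highly skewed columns to dampen the effect of outliers
features[["capital-gain", "capital-loss"]] = np.log1p(features[["capital-gain", "capital-loss"]])

# Normalize numerical features to [0, 1]
features[numerical] = MinMaxScaler().fit_transform(features[numerical])

# One-hot encode the remaining categorical columns
features = pd.get_dummies(features)

# Hold out a test set, preserving the class balance
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)
```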
Step 3 - Trying different types of classification algorithms
Aim - Since sklearn has made our life so simple by providing implementations of all the major ML algorithms, we just have to write a few lines of code to try them out on our preprocessed data, separated into features and targets.
Process - Write a function to pass the preprocessed data through different algorithms (for example, Logistic Regression, Naive Bayes, SVM, AdaBoost, Random Forest, etc.)
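A sketch of such a comparison loop is shown below; the exact models and the metric used in the notebook may differ, and the F0.5 score here is an assumption.

```python
# Sketch of a model-comparison loop over a few sklearn classifiers.
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import fbeta_score

def try_classifiers(X_train, y_train, X_test, y_test):
    """Fit several classifiers with default settings and report a score for each."""
    models = [
        LogisticRegression(max_iter=1000),
        GaussianNB(),
        SVC(),
        AdaBoostClassifier(),
        RandomForestClassifier(),
    ]
    results = {}
    for model in models:
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        results[type(model).__name__] = fbeta_score(y_test, preds, beta=0.5)
    return results
```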
Step 4 - Evaluating models
Aim - Define an evaluation metric and compare the models selected during step 3 using that same metric. The choice of evaluation metric depends on the business objective and the class imbalance of the dataset. For example, if the problem is classifying emails as spam, it is important to have as few false positives as possible because we don’t want an important non-spam email to end up in the spam folder. Therefore, we need to prioritize improving ‘precision’ more than ‘recall’ (a spam email landing in the normal folder is less harmful than a non-spam email landing in the spam folder).
Read this article to get clarity on precision and recall metrics.
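As an illustration, the snippet below compares precision, recall and an F-beta score that weights precision higher (beta < 1); it assumes model is one of the fitted classifiers from Step 3, and the beta=0.5 choice is an assumption about the project’s metric.

```python
# Illustrative comparison of precision, recall and a precision-weighted F-beta score.
from sklearn.metrics import precision_score, recall_score, fbeta_score

preds = model.predict(X_test)  # 'model' is assumed to be a fitted classifier
print("Precision:", precision_score(y_test, preds))
print("Recall:   ", recall_score(y_test, preds))
print("F0.5:     ", fbeta_score(y_test, preds, beta=0.5))
```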
Step 5 - Hyperparameter Optimization
Aim - After deciding on the best-performing algorithm, it’s time to tune the hyperparameters of the associated model to get the optimal model.
Process - Use sklearn’s GridSearchCV to define values for each hyperparameter and search for the best model over all value combinations. RandomizedSearchCV can also be used to reduce computation.
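A minimal GridSearchCV sketch follows; the tuned algorithm (AdaBoost), the parameter grid and the scorer are illustrative assumptions, not the notebook’s exact setup.

```python
# GridSearchCV sketch; the model and parameter grid are illustrative assumptions.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score

param_grid = {"n_estimators": [50, 100, 200], "learning_rate": [0.5, 1.0, 1.5]}
scorer = make_scorer(fbeta_score, beta=0.5)

# Exhaustive search over all parameter combinations with 5-fold cross-validation
grid = GridSearchCV(AdaBoostClassifier(), param_grid, scoring=scorer, cv=5)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
print(grid.best_params_)
```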
Steps 6 & 7 - Feature Importance and Feature Selection
Aim - Many times our model is trained on hundreds of features. It becomes hard to explain the model’s outputs and to make simple rules that broadly define the characteristics of each segment in our model. For example, business people would like simple definitions for people earning < $50k, such as unmarried people under 35 years of age, or natives of South East Asian countries with a college degree and occupations like craft repair or handlers-cleaners.
Therefore, we need to see which features in our feature set are most important for the classification, so that we can keep only those and drop the others. The accuracy (or any other evaluation metric) will most probably take a hit, but often we can find a balance.
In sklearn’s implementation of models like Random Forest, AdaBoost, etc., we can directly find feature importances via the feature_importances_ attribute of the trained model.
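The sketch below reads the importances from the tuned model of Step 5 and retrains on a reduced feature set; the choice of the top five features is an arbitrary assumption for illustration.

```python
# Feature importance and feature selection sketch; 'best_model', 'X_train' and
# 'y_train' come from the earlier sketches, and keeping five features is arbitrary.
import numpy as np
from sklearn.base import clone

importances = best_model.feature_importances_
top_idx = np.argsort(importances)[::-1][:5]
top_features = X_train.columns[top_idx]
print(top_features)

# Retrain an identical (unfitted) model on the reduced feature set and compare scores
reduced_model = clone(best_model).fit(X_train[top_features], y_train)
```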