How to find the importance of features in predicting defaults?


Understanding feature importance is essential for data science practitioners because it tells them which features to select when training a machine learning model. Feature importance analysis is particularly valuable in predictive analytics tasks such as credit or loan default forecasting: besides minimizing default risk through prediction, we also want to know which characteristics are associated with a customer defaulting. In this article, we will therefore perform a feature importance analysis on the popular default prediction problem using a Random Forest classifier. Below are the main points that we are going to discuss.

Contents

  1. Importance of features
  2. Predictive modeling and feature analysis
  3. Training a random forest classifier
  4. Analyzing feature importance

Let’s start by understanding the importance of feature importance.

Importance of features

Feature importance has become an essential part of the machine learning pipeline: it produces a list of features with corresponding importance scores, and once we have those scores we can select the features that matter most.

As mentioned in the title, we will be using a loan default dataset. Such a dataset usually has many features, so which features a data scientist should focus on is a big question. Because it is not feasible to examine every feature closely one by one, we need a method that yields a set of essential features, and this is where feature importance comes into play. With the help of tree-based algorithms such as random forest, we can obtain a list of important features.

Predictive modeling and feature analysis

Random forest has built-in feature importance. It uses the Gini impurity criterion to score features: a feature that helps the model decrease impurity at its splits is considered important, so the more a feature contributes to reducing impurity, the more important it becomes.
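
To make this concrete, below is a minimal, self-contained sketch of how Gini impurity is computed for a node and how a split's impurity decrease is measured. The node labels here are made up for illustration and are not taken from the dataset.

import numpy as np

def gini(labels):
    # Gini impurity of a node: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A hypothetical node with 10 samples (0 = no default, 1 = default)
parent = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 0, 0, 0])   # child node containing only non-defaults
right = np.array([1, 1, 1, 1])        # child node containing only defaults

# Impurity decrease = parent impurity minus the weighted child impurities
n = len(parent)
decrease = gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
print(decrease)  # 0.48 -- a perfect split removes all the impurity

Random forest accumulates such impurity decreases over every split in which a feature is used, weighted by how many samples reach the node, and normalizes them into the importance scores we will read off later.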

We will use the Loan Default dataset to demonstrate feature importance; loan default prediction is one of the most widely studied problems in machine learning. The dataset contains customer information such as a bank, a loan company, or a vehicle-finance company might hold. Our dataset consists of bank data: it contains information about customers who were granted loans in the past, along with a target variable indicating whether a particular customer defaulted. 0 means no default, 1 means default.

Training a random forest classifier

(Note: a few pre-processing steps are not covered in this article, but you can find them all in the Colab notebook referenced below.)

In this section, we will first perform classification on Kaggle's Loan Default dataset and then generate the feature importance scores.

First, import the essential libraries: pandas for data handling, NumPy for numerical computation, and seaborn and matplotlib for plotting. From scikit-learn we import the modules for preprocessing, splitting, and evaluation.

import pandas as pd 
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score

Read the CSV file and load it into a pandas DataFrame.

data_raw = pd.read_csv("/content/train.csv")
data_raw.head(7)
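
Before encoding anything, it can help to check which columns are categorical and how much data is missing. A quick, optional inspection (output not shown here):

data_raw.info()                                       # column dtypes and non-null counts
data_raw.isnull().sum().sort_values(ascending=False)  # missing values per column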

Some categorical features in the dataset need to be encoded. For 'years_in_current_job' we simply strip the 'years' text and keep the number of years. The other categorical features are label-encoded.


data_raw["years_in_current_job"] = data_raw["years_in_current_job"].replace({'-1': -1, '10+ years': 10, '8 years': 8, '6 years': 6, 
                                                                              '7 years': 7, '5 years': 5, '1 year': 1, '< 1 year': 0, 
                                                                              '4 years': 4, '3 years': 3, '2 years': 2, '9 years': 9})

data_raw["purpose"] = le.fit_transform(data_raw.purpose.values)
data_raw["home_ownership"] = le.fit_transform(data_raw.home_ownership.values)
data_raw["term"] = le.fit_transform(data_raw.term.values)

The next step is to fill in the NaN values. For 'months_since_last_delinquent' we fill NaN values with -1 because almost 50% of the column is missing, so imputing the mean makes little sense. Other features such as 'annual_income' and 'credit_score' are filled with their mean.

# Almost half the values are missing, so use a sentinel value instead of the mean
data_raw['months_since_last_delinquent'] = data_raw['months_since_last_delinquent'].fillna(-1)

# Fill the remaining columns with their (integer) mean
data_raw['annual_income'] = data_raw['annual_income'].fillna(int(data_raw['annual_income'].mean()))
data_raw['credit_score'] = data_raw['credit_score'].fillna(int(data_raw['credit_score'].mean()))
data_raw['years_in_current_job'] = data_raw['years_in_current_job'].fillna(int(data_raw['years_in_current_job'].mean()))
data_raw['bankruptcies'] = data_raw['bankruptcies'].fillna(int(data_raw['bankruptcies'].mean()))

Put all the features into the X variable and the target column into the y variable.

X = data_raw[['home_ownership', 'annual_income', 'years_in_current_job', 'tax_liens',
              'number_of_open_accounts', 'years_of_credit_history',
              'maximum_open_credit', 'number_of_credit_problems',
              'months_since_last_delinquent', 'bankruptcies', 'purpose', 'term',
              'current_loan_amount', 'current_credit_balance', 'monthly_debt',
              'credit_score']]


y = data_raw['credit_default']  # single brackets give a 1-D Series, which scikit-learn expects

We will now standardize the features to zero mean and unit variance.

scaler = StandardScaler()
X_scale = scaler.fit_transform(X)

Split the dataset in an 80:20 ratio: 80% for training and 20% for testing.

#split the data 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X_scale, y, test_size=0.2, random_state = 4)

Next, we’ll import and initialize the random forest classifier and fit the model to the training data.

from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X_train, y_train)
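
Since accuracy_score was imported earlier, we can also sanity-check the trained model on the held-out test set before reading anything into its importance scores. The exact number will depend on the split and the preprocessing choices above.

# Evaluate the classifier on the 20% held-out test set
y_pred = rfc.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))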

Analyzing feature importance

Now that we have trained the random forest classifier, let's proceed to the feature importance analysis. First we will convert the feature importances into a pandas Series, then print the importance scores.

feature_imp = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
feature_imp

This output shows that credit_score and current_loan_amount are the most important features for classifying credit defaults.

We can visualize these features using seaborn.

# Horizontal bar plot: importance score on the x-axis, feature name on the y-axis
f, ax = plt.subplots(figsize=(15, 7))
ax = sns.barplot(x=feature_imp, y=feature_imp.index)
ax.set_title("Visualize feature scores of the features")
ax.set_xlabel("Feature importance score")
ax.set_ylabel("Features")
plt.show()

The x-axis shows the feature importance score and the y-axis shows the feature name. credit_score has the longest bar, indicating that it is the most important feature for predicting default.
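
If the goal is to go one step further and actually drop the weak features, scikit-learn's SelectFromModel can reuse the fitted forest. This is a sketch of one reasonable approach rather than part of the original workflow; the "mean" threshold is an assumption and can be tuned.

from sklearn.feature_selection import SelectFromModel

# Keep only features whose importance exceeds the mean importance
selector = SelectFromModel(rfc, threshold="mean", prefit=True)
X_train_selected = selector.transform(X_train)

# Names of the features that survive the threshold
print(X.columns[selector.get_support()])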

Last words

In this article, we saw why feature importance is a key part of the machine learning pipeline and how random forests determine it. Finally, we analyzed feature importance by training a random forest on the loan default dataset and identified the features that matter most in the event of a default.

Reference
