A project on predicting loan defaulters. The same approach applies to credit card payment defaulters as well.

Data Source:

Since the datasets used are large, please refer to them on my Kaggle account. Click here for the datasets.

R Code:

There is only one R code file, and it is in Markdown format. Click here for the RCodes folder.

Once the dataset is loaded, we divide its columns into categorical and numerical groups to find the percentage of missing values in each.
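The split and the missing-value check can be sketched in base R as follows. This is a minimal illustration rather than the project's actual code: the `loans` data frame, its column names, and the `pct_missing` helper are all hypothetical.

```r
# Hypothetical toy data frame standing in for the loan dataset.
loans <- data.frame(
  income   = c(50000, NA, 72000, 0),
  age      = c(34, 45, NA, NA),
  contract = c("Cash", NA, "Revolving", "Cash"),
  stringsAsFactors = FALSE
)

# Split the column names by type.
is_num   <- vapply(loans, is.numeric, logical(1))
num_cols <- names(loans)[is_num]
cat_cols <- names(loans)[!is_num]

# Percentage of missing (NA) values per column.
pct_missing <- function(df) round(100 * colMeans(is.na(df)), 1)

num_missing <- pct_missing(loans[, num_cols, drop = FALSE])
cat_missing <- pct_missing(loans[, cat_cols, drop = FALSE])

print(num_missing)
print(cat_missing)
```

Columns with a very high missing percentage are usually candidates for dropping, while the rest can be imputed.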

Then we enhance the data using relationships between existing columns. We also do some other data cleaning and finally use median imputation for the numerical values. PS: Imputation only fills blank cells, not cells that contain '0'.
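To illustrate the point about blanks versus zeros, here is a small base R sketch of median imputation. The `median_impute` function and the sample data are hypothetical, and the project itself may have used a package routine instead.

```r
# Replace NA values in every numeric column with that column's median.
# Zeros are real values here, so they are left untouched.
median_impute <- function(df) {
  num_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  for (col in num_cols) {
    med <- median(df[[col]], na.rm = TRUE)
    df[[col]][is.na(df[[col]])] <- med
  }
  df
}

# Hypothetical sample: one blank income, one blank age, one zero income.
loans <- data.frame(
  income = c(50000, NA, 72000, 0),
  age    = c(34, 45, NA, 51)
)

imputed <- median_impute(loans)
print(imputed)
```

If a zero actually stands for "unknown" in a given column, it would need to be converted to `NA` before imputing.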

Then we used dummyVars to dummify all categorical columns, cleaned the column names, and balanced the train/test datasets using the ROSE sampling method.
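A rough sketch of these three steps, assuming the `caret` and `ROSE` packages are installed. The simulated `train` data frame, its columns, and the seed values are illustrative assumptions and not the project's data.

```r
library(caret)  # provides dummyVars()
library(ROSE)   # provides ROSE()

set.seed(42)

# Hypothetical imbalanced training data: 90 non-defaulters, 10 defaulters.
train <- data.frame(
  default  = factor(c(rep(0, 90), rep(1, 10))),
  income   = rnorm(100, mean = 50000, sd = 10000),
  contract = factor(sample(c("Cash loans", "Revolving loans"),
                           100, replace = TRUE))
)

# 1. Dummify the categorical predictors (the target is kept aside).
dv <- dummyVars(~ ., data = train[, c("income", "contract")])
x  <- as.data.frame(predict(dv, newdata = train))

# 2. Clean the column names so they are valid R names (no spaces).
names(x) <- make.names(names(x))

# 3. Balance the classes with ROSE.
x$default <- train$default
balanced  <- ROSE(default ~ ., data = x, seed = 42)$data

print(table(balanced$default))
```

ROSE generates synthetic rows rather than copying existing ones, so the dummy columns in `balanced` may no longer be exactly 0 or 1.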

We then built 4 Models:

  • RF (Accuracy: 90%+)
  • gbm (Accuracy: 70%)
  • Naive Bayes
  • rpart

Lastly, we predicted on the test_file using the gbm model, but you can also try RF by subsetting the data to the top 50 features.
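One way to pick the top features for RF is to rank them by variable importance and refit on that subset. The sketch below assumes the `randomForest` package and uses simulated data, with `top_n` set to 5 instead of 50 to keep the example small. It is not the project's actual code.

```r
library(randomForest)

set.seed(1)

# Hypothetical data: 10 numeric features, only V1 and V2 drive the target.
n <- 200
x <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
y <- factor(ifelse(x$V1 + x$V2 + rnorm(n, sd = 0.5) > 0, 1, 0))

# Fit a forest on all features and rank them by mean decrease in Gini.
rf  <- randomForest(x = x, y = y, ntree = 200)
imp <- importance(rf)[, "MeanDecreaseGini"]

top_n        <- 5
top_features <- names(sort(imp, decreasing = TRUE))[seq_len(top_n)]

# Refit using only the top-ranked features.
rf_top <- randomForest(x = x[, top_features, drop = FALSE], y = y, ntree = 200)

print(top_features)
```

The refitted model can then be used with `predict(rf_top, newdata = ...)` on a test set that contains the same columns.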

Conclusion & Learnings

The RF model gave the best predictions, with accuracy above 90%. Refer to the presentation deck for more technical details on the steps taken, the process, and the findings.

Apart from honing my skills in R, this project helped me develop soft skills such as critical thinking, team management, and project timeline management. It also gave me an opportunity to interact with people from various cultural and professional backgrounds and to build synergy toward completing the project.
