Default of Loan Payment


Background:-

A person’s creditworthiness is often inversely associated with the likelihood that they will default on loans.

We’re giving you anonymized data on about 1000 loan applications, along with a set of attributes about each applicant and whether they were considered high risk.

0 = Low credit risk, i.e., high chance of paying back the loan amount

1 = High credit risk, i.e., low chance of paying back the loan amount


Understanding the data:-

Applicant:- https://drive.google.com/file/d/1_QE3Bkd6Z0NPjC_IUXBkRhhTCEUIRsUQ/view?usp=share_link

The applicant dataset has 14 independent variables, and applicant_id is unique within it.




Loan:-https://drive.google.com/file/d/1-mm46_3cl60W4AeIpj7YLcFU-Phr5Mah/view?usp=share_link

The loan dataset has 10 independent variables, where loan_application_id and applicant_id are unique, and 1 target variable, high_risk_applicant.
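Since both files share applicant_id, they can be joined into a single table before analysis. The following is a minimal sketch, assuming the two files have already been loaded into pandas DataFrames; the tiny frames and their values are made up for illustration, and only the key and target column names come from the description above.

```python
import pandas as pd

# Hypothetical miniature versions of the two files; values are invented.
applicant = pd.DataFrame({
    "applicant_id": [101, 102, 103],
    "Gender": ["male", "female", "male"],
})
loan = pd.DataFrame({
    "loan_application_id": ["a1", "a2", "a3"],
    "applicant_id": [101, 102, 103],
    "high_risk_applicant": [0, 1, 0],
})

# applicant_id is unique in both tables, so a one-to-one merge is expected;
# validate raises an error if the key is duplicated on either side.
data = applicant.merge(loan, on="applicant_id", how="inner", validate="one_to_one")
print(data.shape)
```

The validate argument is a cheap guard: if the uniqueness assumption were wrong, the merge would fail loudly instead of silently duplicating rows.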




 


Now, let's visualize each variable separately. The two datasets combined (i.e., applicant and loan) contain categorical, ordinal, and numerical variables.

·       Categorical features: These features have categories (Gender, Marital_status, Housing, employment_status, purpose, loan_history)

·       Ordinal features: Categorical variables with some order involved (Number_of_dependents, Education, Property_Area)

·       Numerical features: These features have numerical values (Years_of_current_residence, Has_been_employed_for_at_least, Has_been_employed_for_at_most, Has_coapplicant, Has_guarantor, Number_of_existing_loans_at_this-bank)

 

Independent Variable (Categorical)

  # Gender

 

70% of applicants in the dataset are male and 30% of applicants are female.

 

 # Marital_status




53% of applicants in the dataset are single and 30% of applicants are married.







 # Housing



70% of applicants in the dataset own their house and 20% of applicants live in rented housing.






       

 # Purpose of loan


27% of applicants took the loan for electronic equipment; the least common purpose is career development, at 2%-3%.








 # Employment Status


60% of applicants are skilled employees and 20% of applicants are unskilled.


 # Loan history of applicants


52% of applicants have duly paid back their existing loans so far, and 30% of applicants have critical/pending loans at other banks.


Independent Variable vs Target Variable (High_risk_applicants)



 # Target Variable

We can see that 70% of the applicants in the dataset are not expected to default on payment, whereas 30% are expected to default.

 





    # Gender vs High risk applicants


Among male applicants, 28% are likely to default on payment, compared with 35% of female applicants.
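Percentages like these can be computed directly from the merged table. Because the target is coded 0/1, its mean within a group is that group's default rate. The sketch below uses a few made-up rows, so its numbers do not match the real data; only the column names follow the post.

```python
import pandas as pd

# Invented rows for illustration only.
df = pd.DataFrame({
    "Gender": ["male", "male", "male", "male", "female", "female"],
    "high_risk_applicant": [0, 0, 0, 1, 0, 1],
})

# Overall class balance of the target (share of 0s and 1s).
target_share = df["high_risk_applicant"].value_counts(normalize=True)

# Mean of a 0/1 target within each group = share of high-risk applicants.
default_rate = df.groupby("Gender")["high_risk_applicant"].mean()

print(target_share)
print(default_rate)
```

The same groupby pattern applies to any other categorical column compared against the target.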








 # Foreign Worker vs High risk applicants


We can see that applicants who are foreign workers have a 31% chance of defaulting.








 # Months_loan_taken_for vs High risk applicants













We can clearly see that if the loan period is more than 40 months, then the chance of defaulting increases.


# Has_been_employed_for_at_most  vs High risk applicants


A person who has been employed for more than 4 years has a lower chance of defaulting.






 # Creating age-group classes to check creditworthiness














Are young people more creditworthy?

Looking at the chart above, we can clearly see that applicants aged 21-30 have a high chance of defaulting on payments: 142 applicants default out of 253.


Would a person with more credit accounts be more creditworthy?

A person with more credit accounts is more creditworthy, because their past history of loan payments gives the banker confidence to lend money.

 

 

Observation :

An applicant who is older, has been employed for more than 4 years, and has a good credit record (all loans paid duly) has a higher chance of getting a loan.


TASK-02

1.   Explain your intuition behind the features used for modeling.

Since the data contains categorical variables, they are converted to numeric values using One Hot Encoding. We then divided the data into training and testing sets:

    • 80% data is training data
    • 20% data is testing data
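An 80/20 split like this is commonly done with scikit-learn's train_test_split. The sketch below is an assumption about how it might look, not the exact code behind the post; the toy frame and the random_state value are made up.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the encoded feature table.
df = pd.DataFrame({
    "Years_of_current_residence": range(10),
    "high_risk_applicant": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
})

X = df.drop(columns=["high_risk_applicant"])  # features
y = df["high_risk_applicant"]                 # target

# test_size=0.2 keeps 80% of rows for training and 20% for testing;
# random_state (an arbitrary choice here) makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```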

2.   Are you creating new derived features? If yes explain the intuition behind them .

 Yes, to check by gender and age group which group has the most defaulters; it was the 21-30 age group.
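An age-group feature of this kind can be derived with pandas.cut. In the sketch below the column name age, the bin edges, and the rows are all hypothetical; only the 21-30 label and the target name come from the post.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22, 27, 34, 45, 58, 63],  # hypothetical column name and values
    "high_risk_applicant": [1, 1, 0, 0, 1, 0],
})

# Right-closed bins: (20, 30] -> "21-30", (30, 40] -> "31-40", and so on.
bins = [20, 30, 40, 50, 60, 70]
labels = ["21-30", "31-40", "41-50", "51-60", "61-70"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)

# Number of defaulters ("sum") and number of applicants ("count") per group.
summary = (
    df.groupby("age_group", observed=True)["high_risk_applicant"]
    .agg(["sum", "count"])
)
print(summary)
```

Ages outside the listed bin edges would become NaN, so the edges should be widened to cover the real age range.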

 

3.   Are there missing values? If yes how you plan to handle it

 Yes. Since the missing values were in multiple columns, each column was treated differently.

·       Has_been_employed_for_at_least: NaN was replaced with 0

·       Has_been_employed_for_at_most: NaN was replaced with 0

·       Telephone: not relevant to the target variable, so this column was dropped.

·       Balance_in_existing_bank_account_(lower_limit_of_bucket): NaN was replaced with 0

·       Balance_in_existing_bank_account_(upper_limit_of_bucket): NaN was replaced with 0
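The treatment listed above can be sketched in pandas as follows. The column names are the ones from the list; the sample values are invented, and the real columns may need type cleaning first if they are stored as text.

```python
import numpy as np
import pandas as pd

# Invented sample rows with gaps in each affected column.
df = pd.DataFrame({
    "Has_been_employed_for_at_least": [1.0, np.nan, 4.0],
    "Has_been_employed_for_at_most": [4.0, 1.0, np.nan],
    "Telephone": ["yes", None, "yes"],
    "Balance_in_existing_bank_account_(lower_limit_of_bucket)": [np.nan, 0.0, 200.0],
    "Balance_in_existing_bank_account_(upper_limit_of_bucket)": [0.0, 200.0, np.nan],
})

zero_fill_cols = [
    "Has_been_employed_for_at_least",
    "Has_been_employed_for_at_most",
    "Balance_in_existing_bank_account_(lower_limit_of_bucket)",
    "Balance_in_existing_bank_account_(upper_limit_of_bucket)",
]

# Telephone is dropped entirely; the other columns get NaN replaced by 0.
df = df.drop(columns=["Telephone"])
df[zero_fill_cols] = df[zero_fill_cols].fillna(0)

print(df.isna().sum().sum())  # total number of remaining missing values
```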

 

4.   How categorical features are handled for modelling

 One Hot Encoding is used for the categorical features when modelling the datasets.

These features have categories in the datasets (Gender, Marital_status, Housing, employment_status, purpose, loan_history, 'Savings_account_balance', 'Property')
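One common way to one-hot encode such columns is pandas.get_dummies; whether the post used this function or scikit-learn's OneHotEncoder is not stated, so treat the sketch below as one possible approach. The rows are invented and only two of the listed columns are shown.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": ["male", "female", "male"],
    "Housing": ["own", "rent", "own"],
    "Years_of_current_residence": [2, 4, 1],
})

categorical_cols = ["Gender", "Housing"]

# Each category becomes its own 0/1 column named <column>_<category>;
# columns not listed in categorical_cols are left unchanged.
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=int)
print(encoded.columns.tolist())
```

For linear models such as Logistic Regression, passing drop_first=True is sometimes preferred to avoid perfectly collinear dummy columns.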

 

5.   Describe the features correlation using correlation matrix. Tell us about few correlated feature & share your understanding on why they are correlated.

 Age, Marital_status, Loan_history, number of years working, etc. are a few features that have a high correlation with the target variable. For example, if a person is young, married, and has little work experience, there is a high chance of default; the reason may be a lower salary and higher expenses.
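A correlation matrix for numeric (or already encoded) columns can be computed with DataFrame.corr, and the target's column then shows how strongly each feature moves with it. The data below is invented and the age column name is hypothetical, so the values only illustrate the mechanics.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22, 25, 31, 40, 52, 60],              # hypothetical column name
    "Months_loan_taken_for": [48, 36, 24, 12, 12, 6],
    "high_risk_applicant": [1, 1, 0, 0, 0, 0],
})

# Pearson correlation between every pair of columns.
corr = df.corr()

# Correlation of each feature with the target, strongest first by magnitude.
target_corr = (
    corr["high_risk_applicant"]
    .drop("high_risk_applicant")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

In this toy data, age correlates negatively with the target and loan duration positively, matching the direction of the observations in the charts.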

 

6.   Do you plan to drop the correlated feature? If yes then how.

 No, I would not drop any feature that is correlated with the target variable.

 

7.   Which ML algorithm you plan to use for modeling.

Since this is a supervised learning problem, the classification models we have used are:

·      Logistic Regression

·      Decision Trees

·      Random Forest

·      Support Vector Machine (SVM)

 

8.   Train two (at least) ML models to predict the credit risk & provide the confusion matrix for each model.
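As a rough sketch of this step, two scikit-learn classifiers can be trained and their confusion matrices printed as below. The data is synthetic (generated with make_classification to mimic roughly 1000 rows with a 70/30 class balance), so the numbers do not reproduce the post's results, and the hyperparameters are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded loan data.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

matrices = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Rows are true classes (0, 1); columns are predicted classes (0, 1).
    matrices[name] = confusion_matrix(y_test, model.predict(X_test), labels=[0, 1])
    print(name)
    print(matrices[name])
```

In each matrix, the bottom-left cell counts high-risk applicants predicted as low risk, which is usually the costliest error for a lender.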





 


 










   


9.   Which metric(s) you will choose to select between the set of models.

Of the 4 models, Logistic Regression and Support Vector Machine (SVM) gave the desired result: their training and test results were close to each other.
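Checking how close the training and test results are can be sketched as follows. The post does not name the metric, so accuracy is used here as an assumption; the data is synthetic and a decision tree is included only to show what a large gap looks like.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded loan data.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.7, 0.3], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    # A large positive gap suggests the model is overfitting the training set.
    scores[name] = (train_acc, test_acc, train_acc - test_acc)
    print(f"{name}: train={train_acc:.3f} test={test_acc:.3f}")
```

With a 70/30 class balance, accuracy alone can be misleading, so recall or F1 on the high-risk class may be worth reporting alongside it.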

 

 

Explain how you will export the trained models & deploy it for prediction in production.

Export the trained models and deploy them to AI Platform Prediction, then use them to serve predictions.
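The export half of this step can be sketched with joblib, a common way to persist scikit-learn models. The file name and the synthetic data are hypothetical, and uploading the saved file to AI Platform Prediction is not shown here.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a small model standing in for the trained classifier.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name
    joblib.dump(model, path)       # export the fitted model to disk
    restored = joblib.load(path)   # what a serving process would do at startup

# The restored model should give the same predictions as the original.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

The serving environment should use the same scikit-learn version as training, since persisted models are not guaranteed to load across versions.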


Kaggle tasks:-

https://www.kaggle.com/code/raghavchoudhary/task-01

https://www.kaggle.com/code/raghavchoudhary/task-02

LinkedIn:-

https://www.linkedin.com/in/raghavcho/


GitHub:-

https://github.com/dsraghav/


Feel free to connect with me if you find any issues with the analysis or have any suggestions.


Thank You.                   

   
