Default of Loan Payment
Background:-
A person’s creditworthiness is often inversely associated with the likelihood that they may default on loans.
We’re giving you anonymized data on about 1000 loan applications, along with a set of attributes about each applicant and whether they were considered high risk.
0 = Low credit risk, i.e. high chance of paying back the loan amount
Applicant:- https://drive.google.com/file/d/1_QE3Bkd6Z0NPjC_IUXBkRhhTCEUIRsUQ/view?usp=share_link
Loan:-https://drive.google.com/file/d/1-mm46_3cl60W4AeIpj7YLcFU-Phr5Mah/view?usp=share_link
We have 10 independent variables in the applicant dataset, where loan_application_id and applicant_id are unique identifiers, and 1 target variable, high_risk_applicant, in the loan dataset.
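To analyse attributes and the target together, the two files have to be combined. The sketch below is a dependency-free illustration of an inner join; it assumes that each loan row carries an applicant_id column to join on, which should be checked against the actual files. The toy rows and the helper name `join_on_applicant` are made up for illustration (with pandas, `DataFrame.merge` does the same job).

```python
# Toy rows standing in for the two CSV files; the real files have more columns.
applicants = [
    {"applicant_id": 1, "Gender": "male", "Housing": "own"},
    {"applicant_id": 2, "Gender": "female", "Housing": "rent"},
]
loans = [
    # Assumption: the loan file references the applicant through applicant_id.
    {"loan_application_id": "A-10", "applicant_id": 1, "high_risk_applicant": 0},
    {"loan_application_id": "A-11", "applicant_id": 2, "high_risk_applicant": 1},
]


def join_on_applicant(applicants, loans):
    """Inner join: one row per loan, enriched with the applicant's attributes."""
    by_id = {row["applicant_id"]: row for row in applicants}
    combined = []
    for loan in loans:
        applicant = by_id.get(loan["applicant_id"])
        if applicant is not None:  # skip loans without a matching applicant
            combined.append({**applicant, **loan})
    return combined


combined = join_on_applicant(applicants, loans)
```

Each combined row then holds both the applicant attributes and the high_risk_applicant label, which is the shape the later comparisons need.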
Now, let's visualize each variable separately. The variables in the two datasets combined (i.e. applicant and loan) are of three types: categorical, ordinal, and numerical.
· Categorical features: these features take one of a set of categories (Gender, Marital_status, Housing, employment_status, purpose, loan_history)
· Ordinal features: categorical variables whose categories have an order (Number_of_dependents, Education, Property_Area)
· Numerical features: these features have numerical values (Years_of_current_residence, Has_been_employed_for_at_least, Has_been_employed_for_at_most, Has_coapplicant, Has_guarantor, Number_of_existing_loans_at_this-bank)
Independent Variable (Categorical)
# Marital_status
53% of applicants in the dataset are single and 30% are married.
# Housing
70% of applicants in the dataset own their house and 20% live in rented housing.
# Purpose of loan
27% of applicants took the loan for electronic equipment; career development is the least common purpose, at 2%-3%.
# Employment Status
60% of applicants are skilled employees and 20% are unskilled.
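The percentages in this section are relative frequencies of each category. As a minimal sketch, they can be computed as follows; the toy list and the helper name `category_shares` are illustrative only, and the category "free" is a placeholder, not a value taken from the data.

```python
from collections import Counter


def category_shares(values):
    """Return the share of each category as a percentage of all values."""
    counts = Counter(values)
    total = len(values)
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


# Toy column: 7 owners, 2 renters, 1 placeholder category.
housing = ["own"] * 7 + ["rent"] * 2 + ["free"]
shares = category_shares(housing)
```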
Independent Variable vs Target Variable (High_risk_applicants)
We can see that in the dataset 70% of applicants are not expected to default on payment, whereas 30% are expected to default.
# Gender vs High risk applicants
Among male applicants, 28% are likely to default on payment, compared with 35% of female applicants.
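Comparisons like this one come down to the share of high-risk applicants within each group. A minimal sketch, assuming the target is coded 0/1 as described in the background; the toy rows and the helper name `default_rate_by_group` are hypothetical:

```python
from collections import defaultdict


def default_rate_by_group(rows, group_key, target_key="high_risk_applicant"):
    """Share of high-risk applicants (target == 1) within each group."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for row in rows:
        totals[row[group_key]] += 1
        defaults[row[group_key]] += row[target_key]
    return {group: defaults[group] / totals[group] for group in totals}


# Toy data: 1 of 4 male and 1 of 2 female applicants are high risk.
rows = (
    [{"Gender": "male", "high_risk_applicant": 1}]
    + [{"Gender": "male", "high_risk_applicant": 0}] * 3
    + [{"Gender": "female", "high_risk_applicant": 1}]
    + [{"Gender": "female", "high_risk_applicant": 0}]
)
rates = default_rate_by_group(rows, "Gender")
```

The same helper works for any other grouping column by changing `group_key`.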
# Foreign Worker vs High risk applicants
We can see that applicants who are foreign workers have a 31% chance of defaulting.
# Months_loan_taken_for vs High risk applicants
We can clearly see that if the loan period is more than 40 months, the chance of defaulting increases.
# Has_been_employed_for_at_most vs High risk applicants
Applicants who have worked for more than 4 years have a lower chance of defaulting.
# Grouping applicants into age classes to check creditworthiness
Are young people more creditworthy?
Looking at the chart above, we can clearly see that applicants aged 21-30 have a high chance of defaulting on payments: 142 out of 253 applicants defaulted.
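Counts like these come from binning ages into classes and tallying defaults per class. The following is a sketch under assumptions: only the 21-30 class is taken from the text, while the other bin edges, the labels and the sample rows are made up for illustration.

```python
from bisect import bisect_left

# Inclusive upper edge of each age class; only 21-30 comes from the analysis.
AGE_EDGES = [20, 30, 40, 50, 60]
AGE_LABELS = ["<=20", "21-30", "31-40", "41-50", "51-60", "60+"]


def age_class(age):
    """Map an age in years to the label of its class."""
    return AGE_LABELS[bisect_left(AGE_EDGES, age)]


def defaults_per_class(rows):
    """rows: (age, high_risk) pairs. Returns {label: (defaults, total)}."""
    summary = {}
    for age, high_risk in rows:
        label = age_class(age)
        defaults, total = summary.get(label, (0, 0))
        summary[label] = (defaults + high_risk, total + 1)
    return summary


# Toy sample, not taken from the dataset.
sample = [(23, 1), (27, 0), (30, 1), (45, 0), (62, 0)]
summary = defaults_per_class(sample)
```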
Would a person with more credit accounts be more creditworthy?
A person with more credit accounts is considered more creditworthy, because their history of past loan payments gives the banker confidence to lend money.
Observation:
If a person is older, has at most 4 years of employment experience, and has good creditworthiness (paying all loans duly), they have a higher chance of getting a loan.
TASK-02
1. Explain your intuition behind the features used for modeling.
Since the data contains categorical features, these are converted to numeric values using One Hot Encoding. We have then divided our data into training and testing data:
- 80% of the data is training data
- 20% of the data is testing data
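The two steps above can be sketched without any libraries. The helper names `one_hot` and `split_train_test`, the seed and the toy rows are assumptions for illustration; in a notebook, `pandas.get_dummies` and scikit-learn's `train_test_split` are the usual tools for the same steps.

```python
import random


def one_hot(rows, column):
    """Replace one categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and hold out test_fraction of them."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy rows; "free" is a placeholder category.
rows = [
    {"Housing": housing, "high_risk_applicant": i % 2}
    for i, housing in enumerate(["own", "rent", "own", "free", "own"] * 2)
]
encoded = one_hot(rows, "Housing")
train, test = split_train_test(encoded)  # 80% train, 20% test
```

Fixing the seed keeps the split reproducible, so different models are compared on the same held-out rows.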
2. Are you creating new derived features? If yes, explain the intuition behind them.
3. Are there missing values? If yes, how do you plan to handle them?
· Has_been_employed_for_at_least: nan values were replaced with 0
· Has_been_employed_for_at_most: nan values were replaced with 0
· Telephone: not relevant to the target variable, so this column has been dropped
· Balance_in_existing_bank_account_(lower_limit_of_bucket): nan values were replaced with 0
· Balance_in_existing_bank_account_(upper_limit_of_bucket): nan values were replaced with 0
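The cleaning rules listed above can be written as one small function. This is a sketch: the column names are the ones from the list, while the sample row, its value types and the helper names are assumptions for illustration (with pandas, `DataFrame.fillna` and `DataFrame.drop` cover the same steps).

```python
import math

ZERO_FILL_COLUMNS = [
    "Has_been_employed_for_at_least",
    "Has_been_employed_for_at_most",
    "Balance_in_existing_bank_account_(lower_limit_of_bucket)",
    "Balance_in_existing_bank_account_(upper_limit_of_bucket)",
]


def is_missing(value):
    """Treat None and float nan as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean_row(row):
    """Drop Telephone and replace missing values in the listed columns with 0."""
    cleaned = {k: v for k, v in row.items() if k != "Telephone"}
    for column in ZERO_FILL_COLUMNS:
        if column in cleaned and is_missing(cleaned[column]):
            cleaned[column] = 0
    return cleaned


# Toy row; the real value types may differ.
raw = {
    "Has_been_employed_for_at_least": float("nan"),
    "Has_been_employed_for_at_most": 4.0,
    "Balance_in_existing_bank_account_(lower_limit_of_bucket)": None,
    "Telephone": "yes",
}
cleaned = clean_row(raw)
```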
4. How are categorical features handled for modelling?
The categorical features in the datasets (Gender, Marital_status, Housing, employment_status, purpose, loan_history, Savings_account_balance, Property) are converted to numeric values using One Hot Encoding.
5. Describe the feature correlations using a correlation matrix. Tell us about a few correlated features and share your understanding of why they are correlated.
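One way to build the correlation matrix asked for here is the Pearson coefficient between every pair of numeric columns. The sketch below uses hypothetical columns a, b and c with made-up values, so it says nothing about which features in the loan data are actually correlated (with pandas, `DataFrame.corr` computes the same matrix).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """columns: {name: values}. Returns a nested dict of pairwise correlations."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical columns: b rises with a, c falls as a rises.
columns = {"a": [1, 2, 3, 4], "b": [2, 4, 6, 8], "c": [4, 3, 2, 1]}
matrix = correlation_matrix(columns)
```

Values close to 1 or -1 flag pairs of features that carry nearly the same information, which is what the next question about dropping correlated features builds on.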
6. Do you plan to drop the correlated features? If yes, then how?
7. Which ML algorithm do you plan to use for modeling?
Since this is a supervised learning problem, the classification models we have used are:
· Logistic Regression
· Decision Trees
· Random Forest
· Support Vector Machine (SVM)
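To give a feel for the first model in the list, here is a from-scratch sketch of logistic regression trained with batch gradient descent. It is an illustration only and not the implementation used in the analysis; the learning rate, the number of epochs and the single centred toy feature are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_regression(X, y, learning_rate=0.1, epochs=500):
    """Minimise the mean log-loss with plain batch gradient descent."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(X, y):
            z = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(z) - label  # derivative of log-loss w.r.t. z
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


def predict(weights, bias, features):
    """Predict 1 (high risk) when the estimated probability is at least 0.5."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return int(sigmoid(z) >= 0.5)


# Toy, centred feature: larger values go with the high-risk label.
X = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
y = [0, 0, 0, 1, 1, 1]
weights, bias = fit_logistic_regression(X, y)
predictions = [predict(weights, bias, features) for features in X]
```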
8. Train two (at least) ML models to predict the credit risk and provide the confusion matrix for each model.
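For a 0/1 target, the confusion matrix is a 2x2 table of actual versus predicted labels. A minimal sketch with made-up labels follows; rows are actual classes and columns are predicted classes, the same layout that scikit-learn's `confusion_matrix` uses.

```python
def confusion_matrix(y_true, y_pred):
    """2x2 matrix for labels 0/1: matrix[actual][predicted]."""
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[actual][predicted] += 1
    return matrix


# Made-up labels for illustration; 1 = high risk.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]
matrix = confusion_matrix(y_true, y_pred)
(tn, fp), (fn, tp) = matrix
```

Here `fn` counts high-risk applicants predicted as low risk, which is usually the costlier mistake for a lender.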
9. Which metric(s) will you choose to select between the set of models?
Out of the 4 models, Logistic Regression and Support Vector Machine (SVM) gave the desired output: their train and test scores were close to each other.
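The selection rule described here, preferring models whose train and test scores are close, can be sketched as below. The model names and scores are hypothetical placeholders, not results from this analysis, and accuracy is assumed as the score.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(int(a == p) for a, p in zip(y_true, y_pred))
    return correct / len(y_true)


def generalisation_gap(train_score, test_score):
    """Absolute difference between train and test score; a large gap hints at overfitting."""
    return abs(train_score - test_score)


# Hypothetical (train, test) accuracies, for illustration only.
scores = {"model_a": (0.99, 0.70), "model_b": (0.76, 0.74)}
gaps = {name: generalisation_gap(train, test) for name, (train, test) in scores.items()}
most_stable = min(gaps, key=gaps.get)
```

A small gap alone is not sufficient; the test score itself should also be acceptable before a model is chosen.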
10. Explain how you will export the trained models and deploy them for prediction in production.
We would deploy the trained models to AI Platform Prediction and use them to serve predictions.
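Before a model can be deployed anywhere, it has to be exported to a file. The sketch below uses Python's `pickle` module with a plain dict of made-up parameters standing in for a fitted model; the file name `model.pkl` and the helper names are assumptions, and the exact artifact format the prediction service expects should be checked in its documentation.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model: hypothetical parameters, not real results.
model = {"weights": [1.5, -0.7], "bias": 0.2}


def export_model(model, directory, filename="model.pkl"):
    """Serialise the model to directory/filename and return the path."""
    path = os.path.join(directory, filename)
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return path


def load_model(path):
    """Load a model previously written by export_model."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Round trip in a temporary directory to check the export works.
with tempfile.TemporaryDirectory() as directory:
    saved_path = export_model(model, directory)
    restored = load_model(saved_path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.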
kaggle task:- https://www.kaggle.com/code/raghavchoudhary/task-01
https://www.kaggle.com/code/raghavchoudhary/task-02
LinkedIn ID :-
https://www.linkedin.com/in/raghavcho/
github :-
Feel free to connect with me if you find any issues with the analysis or have any suggestions.
Thank You.