{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "accelerator": "GPU" }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "mORzER5q7ZsK" }, "source": [ "# Ensembles and Error Measures\n", "\n", "In this lab we will focus on Random Forest and XGBoosting methods created over the original data. For this, we will first import the data and re-train our original logistic regression.\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "rHyEtC9T8VGs" }, "source": [ "## Imports and preparation" ] }, { "cell_type": "code", "metadata": { "id": "XQnvq4b08TNp" }, "source": [ "# Import the csv files from last week.\n", "!gdown 'https://drive.google.com/uc?id=1LWRFLpJtTopAlRqTuUd9XZvGB6CoHa2z'\n", "!gdown 'https://drive.google.com/uc?id=1IvY78EGu-eizec_9agJUsQWDLT-wmSHF'\n", "!gdown 'https://drive.google.com/uc?id=1aDraDSR2OQbIMjIY07s-rD5cel2x_iS-'" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "7pYwtDRG-GVo" }, "source": [ "!pip install git+https://github.com/CBravoR/scorecardpy" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Kuz712WQ9rxQ" }, "source": [ "# Package loading\n", "import pandas as pd\n", "import numpy as np\n", "import scorecardpy as sc\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import OneHotEncoder" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "8HemZO3l8kho" }, "source": [ "# Import the files as Pandas datasets\n", "bankloan_train_WoE = pd.read_csv('train_woe.csv')\n", "bankloan_test_WoE = pd.read_csv('test_woe.csv')\n", "bankloan_data = pd.read_pickle('BankloanCleanNewVars.pkl')\n", "\n", "# Eliminate unused variables\n", "#bankloan_train_WoE.drop(columns=['OthDebt_woe'], inplace = True)\n", "#bankloan_test_WoE.drop(columns=['OthDebt_woe'], inplace = True)\n", "\n", "# Same train-test split as before (because of seed!)\n", "bankloan_train_noWoE, bankloan_test_noWoE = sc.split_df(bankloan_data.iloc[:, 1:],\n", " y = 'Default',\n", " ratio = 0.7,\n", " seed = 20190227).values()\n", "\n", "# Give breaks for WoE\n", "breaks_adj = {'Address': [1.0,2.0,8.0,17.0],\n", " 'Age': [30.0,45.0,50.0],\n", " 'Creddebt': [1.0, 6.0],\n", " 'Employ': [4.0,14.0,22.0],\n", " 'Income': [30.0,40.0,80.0,140.0],\n", " 'Leverage': [8.0,16.0,22.0],\n", " 'MonthlyLoad': [0.1,0.2,0.30000000000000004,0.7000000000000001],\n", " 'OthDebtRatio': [0.1]\n", " }\n", "\n", "# Apply breaks.\n", "bins_adj = sc.woebin(bankloan_train_noWoE, y=\"Default\",\n", " breaks_list=breaks_adj)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "g3diknG1AY4L" }, "source": [ "# Train logistic regression\n", "from sklearn.linear_model import LogisticRegressionCV\n", "\n", "bankloan_logreg = LogisticRegressionCV(penalty='elasticnet', # Type of penalization l1 = lasso, l2 = ridge\n", " Cs = 10, # How many parameters to try. Can also be a vector with parameters to try.\n", " tol=0.0001, # Tolerance for parameters\n", " cv = 3, # How many CV folds to try. 
3 or 5 should be enough.\n", " fit_intercept=True, # Use constant?\n", " class_weight='balanced', # Weights, see below\n", " random_state=20190301, # Random seed\n", " max_iter=100, # Maximum iterations\n", " verbose=0, # Show process. 1 is yes.\n", " solver = 'saga', # How to optimize.\n", " n_jobs = 2, # Processes to use. Set to number of physical cores. \n", " refit = True, # If to retrain with the best parameter and all data after finishing.\n", " l1_ratios = np.arange(0, 1, 0.1)\n", " )\n", "\n", "bankloan_logreg.fit(X = bankloan_train_WoE.iloc[:, 1:], # All rows and from the second var to end\n", " y = bankloan_train_WoE['Default'] # The target\n", " )\n", "\n", "# Calculate scorecard\n", "bankloan_sc = sc.scorecard(bins_adj, bankloan_logreg, \n", " bankloan_train_WoE.columns[1:], # The column names in the trained LR\n", " points0=750, # Base points\n", " odds0=0.01, # Base odds\n", " pdo=50) # PDO \n", "\n", "# Applying the credit score. Applies over the original data!\n", "train_score = sc.scorecard_ply(bankloan_train_noWoE, bankloan_sc, \n", " print_step=0)\n", "test_score = sc.scorecard_ply(bankloan_test_noWoE, bankloan_sc, \n", " print_step=0)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "fup34HDIJWwM" }, "source": [ "## Random Forests\n", "\n", "Now we will train a random forest. It is included in the ```sklearn.ensemble``` subpackage, function [```RandomForestClassifier```](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), so it is straightforward to use. It comes with many parameters, but in general there is a philosophy to follow:\n", "\n", "- In a Random Forest we want each tree to be large, and to learn as much as possible from its subset of data. We don't care too much if each tree is overadjusted, as we can always increase the number of trees to take care of this.\n", "\n", "- This said, a good idea is to limit the minimum number of samples per leaf when we have few cases (this is not usually a problem in large datasets).\n", "\n", "- We might want to limit the minimum impurity decrease to stop growing a tree if not much is happening.\n", "\n", "- There is also a class weight to include. It does include one automatically if we use the option ```balanced```.\n", "\n", "Let's train one and check the options." ] }, { "cell_type": "code", "metadata": { "id": "VHCrdH0rJgMk" }, "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "#Define the classifier\n", "bankloan_rf = RandomForestClassifier(n_estimators=1000, # Number of trees to train\n", " criterion='entropy', # How to train the trees. Also supports gini.\n", " max_depth=None, # Max depth of the trees. Not necessary to change.\n", " min_samples_split=2, # Minimum samples to create a split.\n", " min_samples_leaf=0.0001, # Minimum samples in a leaf. Accepts fractions for %. This is 0.1% of sample.\n", " min_weight_fraction_leaf=0.0, # Same as above, but uses the class weights.\n", " max_features='sqrt', # Maximum number of features per split (not tree!) by default is sqrt(vars)\n", " max_leaf_nodes=None, # Maximum number of nodes.\n", " min_impurity_decrease=0.00001, # Minimum impurity decrease. This is 10^-4.\n", " bootstrap=True, # If sample with repetition. For large samples (>100.000) set to false.\n", " oob_score=True, # If report accuracy with non-selected cases.\n", " n_jobs=2, # Parallel processing. Set to the number of cores you have. 
Watch your RAM!!\n", " random_state=20190305, # Seed\n", " verbose=1, # If to give info during training. Set to 0 for silent training.\n", " warm_start=False, # If train over previously trained tree.\n", " class_weight='balanced' # Balance the classes.\n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "6q4nSoXeUNPn" }, "source": [ "Now we are ready to train. We just give it our original training set variables and target." ] }, { "cell_type": "code", "metadata": { "id": "JzZRHQpZUd0x" }, "source": [ "# Create dummy variables for education\n", "categorical_features = [\"Education\"]\n", "categorical_transformer = OneHotEncoder(handle_unknown=\"ignore\")\n", "preprocessor = ColumnTransformer(\n", " transformers=[(\"cat\", categorical_transformer, categorical_features)],\n", " remainder='passthrough'\n", ")\n", "\n", "# Now we define a Pipeline to process everything\n", "clf = Pipeline(\n", " steps=[(\"preprocessor\", preprocessor), (\"classifier\", bankloan_rf)]\n", ")\n", "\n", "# Train the RF.\n", "clf.fit(bankloan_train_noWoE.drop(columns='Default'), # X \n", " bankloan_train_noWoE['Default'] # y\n", " ) " ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ouI571SgUsEs" }, "source": [ "We can see it used two jobs (two processors are available to us in this Google Colab server). It converges very quickly. Let's check how it did, this time we will print a nicer confusion matrix using seaborn, and will plot the ROC curve of the model. " ] }, { "cell_type": "code", "metadata": { "id": "HDZAnTvYUnpD" }, "source": [ "from sklearn.metrics import roc_auc_score, confusion_matrix, roc_curve\n", "\n", "# Apply the model to the test set.\n", "rf_pred_class_test = clf.predict(bankloan_test_noWoE.drop(columns='Default'))\n", "rf_probs_test = clf.predict_proba(bankloan_test_noWoE.drop(columns='Default'))" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "uwW2hVgBVkZX" }, "source": [ "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "%matplotlib inline\n", "\n", "# Calculate confusion matrix\n", "confusion_matrix_rf = confusion_matrix(y_true = bankloan_test_noWoE['Default'], \n", " y_pred = rf_pred_class_test)\n", "\n", "# Turn matrix to percentages\n", "confusion_matrix_rf = confusion_matrix_rf.astype('float') / confusion_matrix_rf.sum(axis=1)[:, np.newaxis]\n", "\n", "# Turn to dataframe\n", "df_cm = pd.DataFrame(\n", " confusion_matrix_rf, index=['good', 'bad'], columns=['good', 'bad'], \n", ")\n", "\n", "# Parameters of the image\n", "figsize = (10,7)\n", "fontsize=14\n", "\n", "# Create image\n", "fig = plt.figure(figsize=figsize)\n", "heatmap = sns.heatmap(df_cm, annot=True, fmt='.2f')\n", "\n", "# Make it nicer\n", "heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, \n", " ha='right', fontsize=fontsize)\n", "heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45,\n", " ha='right', fontsize=fontsize)\n", "\n", "# Add labels\n", "plt.ylabel('True label')\n", "plt.xlabel('Predicted label')\n", "\n", "# Plot!\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "w46guL4bXh6B" }, "source": [ "Looks a bit unbalanced, but otherwise ok. It's harder to predict the defaulters. Now let's see the ROC curve." 
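] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on to the ROC curve, a quick optional check: ```classification_report``` and ```balanced_accuracy_score``` from ```sklearn.metrics``` summarise the per-class precision and recall behind the confusion matrix above. The cell below is a minimal sketch that reuses the predictions we already computed; the ```target_names``` labels follow the same good/bad convention as the matrix." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Optional: per-class error measures for the Random Forest test predictions.\n", "from sklearn.metrics import classification_report, balanced_accuracy_score\n", "\n", "print(classification_report(bankloan_test_noWoE['Default'], # True labels\n", "                            rf_pred_class_test,             # Predicted classes from the RF pipeline\n", "                            target_names=['good', 'bad']))  # 0 = good, 1 = bad\n", "\n", "print('Balanced accuracy:',\n", "      balanced_accuracy_score(bankloan_test_noWoE['Default'], rf_pred_class_test))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With that aside done, let's plot the ROC curve and compute the AUC."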
] }, { "cell_type": "code", "metadata": { "id": "38LrSbkeVmEo" }, "source": [ "# Calculate the ROC curve points\n", "fpr, tpr, thresholds = roc_curve(bankloan_test_noWoE['Default'], rf_probs_test[:,1])\n", "\n", "# Save the AUC in a variable to display it. Round it first\n", "auc = np.round(roc_auc_score(y_true = bankloan_test_noWoE['Default'], \n", " y_score = rf_probs_test[:,1]),\n", " decimals = 3)\n", "\n", "# Create and show the plot\n", "plt.plot(fpr,tpr,label=\"Bankloan RF, auc=\"+str(auc))\n", "plt.legend(loc=4)\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Let's calculate the bootstrapped confidence interval for the AUC." ], "metadata": { "id": "7ZT0H7H37Yl-" } }, { "cell_type": "code", "source": [ "# Write a Bootstrap function that calculates the different AUCs\n", "def BootstrapPredErr(pred_prob, y_true, num_samples=1000):\n", " n = len(pred_prob)\n", " full_data = pd.DataFrame({'y_score':pred_prob, 'y_true':y_true})\n", " err = np.zeros(num_samples) \n", " for i in range(num_samples):\n", " d = full_data.sample(n, replace=True)\n", " err[i] = np.round(roc_auc_score(y_true=d['y_true'], \n", " y_score=d['y_score']),\n", " decimals=3\n", " )\n", " return err" ], "metadata": { "id": "1desTK_R7YT-" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Apply bootstrapping\n", "auc_boot = BootstrapPredErr(pred_prob=rf_probs_test[:,1], \n", " y_true=bankloan_test_noWoE['Default'])\n", "upper=np.quantile(auc-auc_boot, 0.975, axis=0)\n", "lower=np.quantile(auc-auc_boot, 0.025, axis=0)\n", "\n", "# Compute confidence interval\n", "boot_ci = [auc - upper, \n", " auc - lower]\n", "print(f\"AUC Confidence Interval: [{boot_ci[0]:.3f}, {boot_ci[1]:.3f}]\")" ], "metadata": { "id": "tSdB5g8-LptH" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "7bF-ic7GYN2g" }, "source": [ "Now, let's print the variable importance. The importance is calculated by averaging the accuracy of trees when the variables is included the tree, and comparing it to when it's NOT included the tree." ] }, { "cell_type": "code", "metadata": { "id": "NniaKfqnPEl0" }, "source": [ "# What are our new variable names?\n", "clf[:-1].get_feature_names_out()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "nfVlwHjk6dDt" }, "source": [ "# Plot variable importance\n", "importances = bankloan_rf.feature_importances_\n", "indices = np.argsort(importances)[::-1] \n", "\n", "f, ax = plt.subplots(figsize=(3, 8))\n", "plt.title(\"Variable Importance - Random Forest\")\n", "sns.set_color_codes(\"pastel\")\n", "sns.barplot(y=[clf[:-1].get_feature_names_out()[i] for i in indices], \n", " x=importances[indices], \n", " label=\"Total\", color=\"b\")\n", "ax.set(ylabel=\"Variable\",\n", " xlabel=\"Variable Importance (Entropy)\")\n", "sns.despine(left=True, bottom=True)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "wPP3ECvawIhT" }, "source": [ "That's it! Now we'll compare this with an XGBoost and see which one of our three models is better.\n" ] }, { "cell_type": "markdown", "metadata": { "id": "dspGHED8YVl6" }, "source": [ "## XGBoosting\n", "\n", "The stochastic gradient boosting model is the alternative to Random Forest. Now we want to create a series of small trees, which will be poorer in performance, but together they will be stronger. 
Training an XGBoost model is harder, because we need to control the model so it creates small trees, but it performs better in small data, something Random Forests do not necessarily accomplish.\n", "\n", "While scikit-learn does have its own implementation of XGB ([```sklearn.ensemble```](https://scikit-learn.org/stable/modules/ensemble.html)), there are a couple of very strong packages out there that implement the algorithm. ```xgboost``` and ```lightgbm``` are two of the best known ones. We will use [```xgboost```](https://xgboost.readthedocs.io/en/latest/python/) for this lab, available pretty much for every language out there.\n", "\n", "The first step is to define a classifier that we will use." ] }, { "cell_type": "code", "metadata": { "id": "p2WDsujNYNBD" }, "source": [ "from xgboost import XGBClassifier \n", "#Define the classifier.\n", "XGB_Bankloan = XGBClassifier(max_depth=2, # Depth of each tree\n", " learning_rate=0.1, # How much to shrink error in each subsequent training. Trade-off with no. estimators.\n", " n_estimators=50, # How many trees to use, the more the better, but decrease learning rate if many used.\n", " verbosity=1, # If to show more errors or not.\n", " objective='binary:logistic', # Type of target variable.\n", " booster='gbtree', # What to boost. Trees in this case.\n", " n_jobs=2, # Parallel jobs to run. Set your processor number.\n", " gamma=0.001, # Minimum loss reduction required to make a further partition on a leaf node of the tree. (Controls growth!)\n", " subsample=0.632, # Subsample ratio. Can set lower\n", " colsample_bytree=1, # Subsample ratio of columns when constructing each tree.\n", " colsample_bylevel=1, # Subsample ratio of columns when constructing each level. 0.33 is similar to random forest.\n", " colsample_bynode=1, # Subsample ratio of columns when constructing each split.\n", " reg_alpha=1, # Regularizer for first fit. alpha = 1, lambda = 0 is LASSO.\n", " reg_lambda=0, # Regularizer for first fit.\n", " scale_pos_weight=1, # Balancing of positive and negative weights. G / B\n", " base_score=0.5, # Global bias. Set to average of the target rate.\n", " random_state=20201108, # Seed\n", " tree_method='hist', # How to train the trees?\n", " #gpu_id=0 # With which GPU? \n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "D7NI0qGdK5Mo" }, "source": [ "This classifier can be used to tune the parameters of the model. We will use sklearn's ```GridSearchCV``` for this. It requires a dictionary of the parameters to look for. We will tune the number of trees (XGB overfits relatively easily, always tune this), the depth, and the learning rate." ] }, { "cell_type": "code", "metadata": { "id": "kv93PK3hK7n9" }, "source": [ "# Define the parameters. Play with this grid!\n", "param_grid = dict({'n_estimators': [50, 100, 150],\n", " 'max_depth': [2, 3, 4],\n", " 'learning_rate' : [0.01, 0.05, 0.1, 0.15]\n", " })" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "_jRGje4UK8vH" }, "source": [ "This training process can be very long. We will create a validation set for the sample." 
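] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see why: the grid above has 3 x 3 x 4 = 36 parameter combinations, and with 3-fold cross-validation that means 108 model fits. If that is too slow, ```RandomizedSearchCV``` from ```sklearn.model_selection``` samples a fixed number of combinations instead of trying all of them. The next cell is only a sketch of that alternative (the object name ```RandXGB``` and ```n_iter = 10``` are illustrative choices, not part of the lab); we will stick with the full grid search here." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Sketch of a cheaper alternative: randomized search over the same parameter grid.\n", "from sklearn.model_selection import RandomizedSearchCV\n", "\n", "RandXGB = RandomizedSearchCV(XGB_Bankloan,                    # The XGB classifier defined above\n", "                             param_distributions=param_grid,  # Reuse the grid as sampling distributions\n", "                             n_iter=10,                       # Try only 10 of the 36 combinations\n", "                             scoring='roc_auc',               # Same ranking criterion as the grid search\n", "                             cv=3,                            # Same number of folds\n", "                             random_state=20201108,           # Seed for the sampled combinations\n", "                             n_jobs=2                         # Parallel jobs\n", "                             )" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Either way, it is a good idea to tune over a smaller sample of the training set, which is what we do next."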
] }, { "cell_type": "code", "metadata": { "id": "FsU6yQ5vM-rc" }, "source": [ "# Always a good idea to tune on a reduce sample of the train set, as we will call many functions.\n", "val_train = bankloan_train_noWoE.sample(frac = 0.5, # The fraction to extract\n", " random_state = 20201108, # The seed.\n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "NU1ZZWfrNCVt" }, "source": [ "Now we can do a grid search over the parameter space. We will use the AUC (as this is a binary classification problem)" ] }, { "cell_type": "code", "metadata": { "id": "KipB5YMwNGOg" }, "source": [ "from sklearn.model_selection import GridSearchCV\n", "\n", "# Define grid search object.\n", "GridXGB = GridSearchCV(XGB_Bankloan, # Original XGB. \n", " param_grid, # Parameter grid\n", " cv = 3, # Number of cross-validation folds. \n", " scoring = 'roc_auc', # How to rank outputs.\n", " n_jobs = 2, # Parallel jobs. -1 is \"all you have\"\n", " refit = False, # If refit at the end with the best. We'll do it manually.\n", " verbose = 1 # If to show what it is doing.\n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "h7OzB8G8NH2b" }, "source": [ "# Create dummy variables for education\n", "categorical_features = [\"Education\"]\n", "categorical_transformer = OneHotEncoder(handle_unknown=\"ignore\")\n", "preprocessor = ColumnTransformer(\n", " transformers=[(\"cat\", categorical_transformer, categorical_features)],\n", " remainder='passthrough'\n", ")\n", "\n", "# Now we define a Pipeline to process everything\n", "clf = Pipeline(\n", " steps=[(\"preprocessor\", preprocessor), (\"classifier\", GridXGB)]\n", ")\n", "\n", "# Train the XGB.\n", "clf.fit(val_train.drop(columns='Default'), # X \n", " val_train['Default'] # y\n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "The output of the training process can be checked like this." ], "metadata": { "id": "2neLTXIL1BxH" } }, { "cell_type": "code", "source": [ "CV_results = pd.DataFrame(GridXGB.cv_results_)\n", "CV_results" ], "metadata": { "id": "NZThDQdM1Ezp" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Qbi74Z0vNT9c" }, "source": [ "Now we can output the optimal parameters." ] }, { "cell_type": "code", "metadata": { "id": "KHDHaikMNWH1" }, "source": [ "# Show best params\n", "print(f'The best AUC is {GridXGB.best_score_:.3f}')\n", "GridXGB.best_params_" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "lueLtcRDNZqB" }, "source": [ "It is telling us to use 5% learning rate with a max_depth of 2 and 100 trees. If any parameters were the limit, you need to run the search again increasing the limit so this does not happen. I leave this as an exercise.\n", "\n", "Now we can fit the final model!" ] }, { "cell_type": "code", "metadata": { "id": "r74qw-D7X6LG" }, "source": [ "# Create XGB with best parameters.\n", "XGB_Bankloan = XGBClassifier(max_depth=GridXGB.best_params_.get('max_depth'), # Depth of each tree\n", " learning_rate=GridXGB.best_params_.get('learning_rate'), # How much to shrink error in each subsequent training. Trade-off with no. 
estimators.\n", " n_estimators=GridXGB.best_params_.get('n_estimators'), # How many trees to use, the more the better, but decrease learning rate if many used.\n", " verbosity=1, # If to show more errors or not.\n", " objective='binary:logistic', # Type of target variable.\n", " booster='gbtree', # What to boost. Trees in this case.\n", " n_jobs=2, # Parallel jobs to run. Set your processor number.\n", " gamma=0.001, # Minimum loss reduction required to make a further partition on a leaf node of the tree. (Controls growth!)\n", " subsample=0.632, # Subsample ratio. Can set lower\n", " colsample_bytree=1, # Subsample ratio of columns when constructing each tree.\n", " colsample_bylevel=1, # Subsample ratio of columns when constructing each level. 0.33 is similar to random forest.\n", " colsample_bynode=1, # Subsample ratio of columns when constructing each split.\n", " reg_alpha=1, # Regularizer for first fit. alpha = 1, lambda = 0 is LASSO.\n", " reg_lambda=0, # Regularizer for first fit.\n", " scale_pos_weight=1, # Balancing of positive and negative weights.\n", " base_score=0.5, # Global bias. Set to average of the target rate.\n", " random_state=20201107, # Seed\n", " tree_method='gpu_hist', # How to train the trees?\n", " gpu_id=0 # With which GPU?\n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "H2OLfAk9Nl5_" }, "source": [ "# Train over all training data.\n", "# Create dummy variables for education\n", "categorical_features = [\"Education\"]\n", "categorical_transformer = OneHotEncoder(handle_unknown=\"ignore\")\n", "preprocessor = ColumnTransformer(\n", " transformers=[(\"cat\", categorical_transformer, categorical_features)],\n", " remainder='passthrough'\n", ")\n", "\n", "# Now we define a Pipeline to process everything\n", "clf = Pipeline(\n", " steps=[(\"preprocessor\", preprocessor), (\"classifier\", XGB_Bankloan)]\n", ")\n", "\n", "# Train the XGB.\n", "clf.fit(bankloan_train_noWoE.drop(columns='Default'), # X \n", " bankloan_train_noWoE['Default'] # y\n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "pwbV--RybzOl" }, "source": [ "Now we can evaluate our model. First we calculate the variable importance." ] }, { "cell_type": "code", "metadata": { "id": "2XcNvyXDN3as" }, "source": [ "# Plot variable importance\n", "importances = XGB_Bankloan.feature_importances_\n", "indices = np.argsort(importances)[::-1] \n", "\n", "f, ax = plt.subplots(figsize=(3, 8))\n", "plt.title(\"Variable Importance - XGBoosting\")\n", "sns.set_color_codes(\"pastel\")\n", "sns.barplot(y=[clf[:-1].get_feature_names_out()[i] for i in indices], x=importances[indices], \n", " label=\"Total\", color=\"b\")\n", "ax.set(ylabel=\"Variable\",\n", " xlabel=\"Variable Importance (Entropy)\")\n", "sns.despine(left=True, bottom=True)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "-Z5P991nN-Xm" }, "source": [ "What do you see here? Does it make sense to you?\n", "\n", "Let's finish by plotting the evaluation measures. How does it compare to Random Forest? Why do you think this is?" 
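] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One caveat before the error measures: ```feature_importances_``` in ```xgboost``` is only one of several possible rankings. The trained booster can also report how often each feature is used to split ('weight'), the average improvement those splits bring ('gain'), and how many cases they affect ('cover'). The cell below is a small optional sketch; note that the booster may label the features ```f0, f1, ...``` because the pipeline passes it a plain array rather than a named DataFrame." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Optional: compare different importance definitions reported by the trained booster.\n", "booster = XGB_Bankloan.get_booster()\n", "for imp_type in ['weight', 'gain', 'cover']:\n", "    print(imp_type, booster.get_score(importance_type=imp_type))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the confusion matrix and ROC curve over the test set."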
] }, { "cell_type": "code", "metadata": { "id": "Ezp9txTJcP24" }, "source": [ "# Calculate probability\n", "XGBClassTest = clf.predict(bankloan_test_noWoE.drop(columns=\"Default\"))\n", "xg_probs_test = clf.predict_proba(bankloan_test_noWoE.drop(columns=\"Default\"))\n", "xg_probs_test = xg_probs_test[:, 1]\n", "\n", "# Calculate confusion matrix\n", "confusion_matrix_xgb = confusion_matrix(y_true = bankloan_test_noWoE['Default'], \n", " y_pred = XGBClassTest)\n", "\n", "# Turn matrix to percentages\n", "confusion_matrix_xgb = confusion_matrix_xgb.astype('float') / confusion_matrix_xgb.sum(axis=1)[:, np.newaxis]\n", "\n", "# Turn to dataframe\n", "df_cm = pd.DataFrame(\n", " confusion_matrix_xgb, index=['good', 'bad'], columns=['good', 'bad'], \n", ")\n", "\n", "# Parameters of the image\n", "figsize = (10,7)\n", "fontsize=14\n", "\n", "# Create image\n", "fig = plt.figure(figsize=figsize)\n", "heatmap = sns.heatmap(df_cm, annot=True, fmt='.2f')\n", "\n", "# Make it nicer\n", "heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0, \n", " ha='right', fontsize=fontsize)\n", "heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45,\n", " ha='right', fontsize=fontsize)\n", "\n", "# Add labels\n", "plt.ylabel('True label')\n", "plt.xlabel('Predicted label')\n", "\n", "# Plot!\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "cpyA8KBxcfFH" }, "source": [ "Very similar results. Now there is no chance to use sample weights sadly, so the solution is a bit more unbalanced. This is not too much of an issue though, as we can always change the cutoff point to account for the unbalance.\n", "\n", "Let's check the ROC curve." ] }, { "cell_type": "code", "metadata": { "id": "OzfLTaHmwi-R" }, "source": [ "# Calculate the ROC curve points\n", "fpr, tpr, thresholds = roc_curve(bankloan_test_noWoE['Default'], \n", " xg_probs_test)\n", "\n", "# Save the AUC in a variable to display it. Round it first\n", "auc = np.round(roc_auc_score(y_true = bankloan_test_noWoE['Default'], \n", " y_score = xg_probs_test),\n", " decimals = 3)\n", "\n", "# Create and show the plot\n", "plt.plot(fpr,tpr,label=\"AUC - XGBoosting = \" + str(auc))\n", "plt.legend(loc=4)\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "And finally, let's calculate the AUC confidence interval." ], "metadata": { "id": "0oCF0VxgQLX_" } }, { "cell_type": "code", "source": [ "# Apply bootstrapping\n", "auc_boot = BootstrapPredErr(pred_prob=xg_probs_test, \n", " y_true=bankloan_test_noWoE['Default'])\n", "upper=np.quantile(auc-auc_boot, 0.975, axis=0)\n", "lower=np.quantile(auc-auc_boot, 0.025, axis=0)\n", "\n", "# Compute confidence interval\n", "boot_ci = [auc - upper, \n", " auc - lower]\n", "print(f\"AUC Confidence Interval: [{boot_ci[0]:.3f}, {boot_ci[1]:.3f}]\")" ], "metadata": { "id": "T85RD0KvQPru" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "What do you think? How do these results compare to the Random Forest ones?" ], "metadata": { "id": "dAu54_1VQZ_t" } }, { "cell_type": "markdown", "metadata": { "id": "iJWpe3IR9gpN" }, "source": [ "## TreeSHAP\n", "\n", "The variable importance plots are quite useful to get a first look at the impact of each variable in the model. However, there is a much better way of doing this using the well-known Shapley Values. The method, adapted to tree models, is called [SHAP](https://github.com/slundberg/shap). 
It works by using a game-theoretical approach to comparing when the variable is in a model versus when it is not. The technical details are published in [this paper](https://www.nature.com/articles/s42256-019-0138-9).\n", "\n", "The method requires an already trained XGB model. The package is fairly comprehensive, but we will focus on these two elements:\n", "\n", "\n", "1. How to measure the average impact of each variables on a sample of cases (the most important application!).\n", "\n", "2. How to measure the impact of each variable on prediction.\n", "\n", "Reading the documentation will allow you to make other plots.\n", "\n", "Let's start with loading the package." ] }, { "cell_type": "code", "metadata": { "id": "sZmGgqaXGTfY" }, "source": [ "!pip install shap" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Cjn3VIaMKuCu" }, "source": [ "import shap\n", "shap.initjs() # Import Java engine." ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "4aarAeRRMZue" }, "source": [ "Now, let's calculate the Shapley scores." ] }, { "cell_type": "code", "metadata": { "id": "9lsX1QN5LxRt" }, "source": [ "# Create transformed data from previous preprocessor\n", "data = pd.DataFrame(preprocessor.fit_transform(bankloan_test_noWoE.drop(columns=\"Default\")))\n", "data.columns = preprocessor.get_feature_names_out()\n", "\n", "\n", "# Trains the game-theoretic model. Really complex so requires sampling.\n", "explainer = shap.TreeExplainer(XGB_Bankloan, # The model \n", " data = shap.sample(data, 100) # Create a sample of 100 cases\n", " )\n", "\n", "# Applies model ot the full dataset.\n", "shap_values = explainer.shap_values(data, check_additivity=False)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "5x88J-onOOgA" }, "source": [ "Now we can run the first plot. We will calculate the contribution to explanations for all points in the sample." ] }, { "cell_type": "code", "metadata": { "id": "We409GI_NCUt" }, "source": [ "shap.summary_plot(shap_values, # The Shapley values.\n", " data, # The training sample\n", " show=False) # Whether to print the model or not\n", "\n", "# Let's save this as a PDF for later use.\n", "plt.savefig('ShapSummaryPlot.pdf', dpi=300, bbox_inches='tight')\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "xjgFZlXXQZcg" }, "source": [ "How to read this plot? It follows a similar logic to reading WoEs#\n", "\n", "1. The absolute value of the Shapley value (x-axis) is how important the variable is for predicting, while the sign says whether the prediction is towards a positive value (Defaulters in this case) or a zero value (non-defaulters).\n", "\n", "2. The colour in the plot says whether the feature value is high or low for that Shapley score.\n", "\n", "3. The y-axis shows all variables ordered from more useful to less useful in predicting.\n", "\n", "So, for example, let's consider Employ. This variable represents years of employment. We can see that high values are related to very high Shapley values, meaning that high values are very helpful (in fact, the most helpful of all sets of values in the dataset) to predict non-defaulters, while low values are helpful to predict defaulters, but not as much. Here the non-linear behaviour of the variable is shown quite clearly!\n", "\n", "Now, let's dig in the impact of each variable." 
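] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SHAP can also explain a single prediction, which is the second use listed above. The cell below is a minimal sketch of a force plot for one test case (row 0 is an arbitrary choice): it shows how each variable pushes that particular customer away from the average model output." ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Sketch: explain one individual prediction (row 0 is an arbitrary example).\n", "shap.force_plot(explainer.expected_value, # The average model output (the baseline)\n", "                shap_values[0, :],        # Shapley values for this single case\n", "                data.iloc[0, :])          # Feature values of the same case" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a single variable across all cases, the dependence plot below digs deeper into Employ."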
] }, { "cell_type": "code", "metadata": { "id": "ewWOPbIOSi0S" }, "source": [ "shap.dependence_plot(\"remainder__Employ\", # The variable to study\n", " shap_values, # The Shapley values.\n", " data, # The training sample\n", " show=False) # Whether to print the model or not\n", "\n", "plt.savefig('ShapEmploy.pdf', dpi=300, bbox_inches='tight')\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "gAM07AXLS1Ch" }, "source": [ "This plot shows the value of the variable in the X-axis and compares it with the value of the Shap coefficient in the Y-axis. The colour now represents the value of the most correlated variable in the dataset (in this case, MonthlyLoad) and it is given for reference only.\n", "\n", "In this plot, we can see how after 10 years of employment the people are strongly non-defaulters (with Shap values of around -1.5 to -2) while the neutral point (Shap value of 0) is around 3 years. Anything below 3 years helps the most in predicting defaulters. There does not seem to be much correlation with MonthlyLoad, so we are fairly sure this variable is independently predicting what it does.\n", "\n", "Now you can study variables in detail and make fairly detailed explanations!\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Se_ELvsedrdy" }, "source": [ "## Plotting multiple ROC curves\n", "\n", "The last thing we would like to do is to plot multiple ROC curves in one graph. This is fairly straightforward, we just pass the ```plt.plot``` command each of the ROC curves. I'll do it dynamically using a dictionary and a for loop. " ] }, { "cell_type": "code", "metadata": { "id": "s9h7TVw_dWuJ" }, "source": [ "# Predict probabilities of scorecard.\n", "logreg_probs_test = bankloan_logreg.predict_proba(bankloan_test_WoE.iloc[:, 1:])" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "RC-nLoGbfFjF" }, "source": [ "# Set models and probabilities. This structure is called a dictionary.\n", "models = [\n", "{\n", " 'label': 'Logistic Regression',\n", " 'probs': logreg_probs_test[:,1]\n", "},\n", "{\n", " 'label': 'Gradient Boosting',\n", " 'probs': xg_probs_test\n", "},\n", "{\n", " 'label': 'Random Forest',\n", " 'probs': rf_probs_test[:,1]\n", "}\n", "]\n", "\n", "# Loop that creates the plot. I will pass each ROC curve one by one.\n", "for m in models:\n", " auc = roc_auc_score(y_true = bankloan_test_noWoE['Default'], \n", " y_score = m['probs'])\n", " fpr, tpr, thresholds = roc_curve(bankloan_test_WoE['Default'], \n", " m['probs'])\n", " plt.plot(fpr, tpr, label=f'{m[\"label\"]} ROC (area = {auc:.3f})')\n", " \n", "\n", " \n", "# Settings\n", "plt.plot([0, 1], [0, 1],'r--')\n", "plt.xlim([0.0, 1.0])\n", "plt.ylim([0.0, 1.05])\n", "plt.xlabel('1-Specificity(False Positive Rate)')\n", "plt.ylabel('Sensitivity(True Positive Rate)')\n", "plt.title('Receiver Operating Characteristic')\n", "plt.legend(loc=\"lower right\")\n", " \n", "# Plot! \n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "n7-RF4xvQfik" }, "source": [ "Interesting results, no?" ] }, { "cell_type": "markdown", "metadata": { "id": "GCGjuwb2hW04" }, "source": [ "I introduced several new concepts here. First, a dictionary. A dictionary is a very useful structure, which allows to have values indexed by a name. Every item will have their own values for its name, here the 'label' and 'probs'. 
I use this list as the input for the for loop.\n", "\n", "Second, check the part\n", "\n", "```\n", "label=f'{m[\"label\"]} ROC (area = {auc:.3f})'\n", "```\n", "\n", "of the plot definition. That is an [f-string](https://zetcode.com/python/fstring/), a very powerful way of formatting text with variables. Inside the braces we name the variable we want and, after a colon, give the format specification: ```auc:.3f``` tells Python \"use three decimal digits for the auc object\". " ] } ] }