{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "KCdDMmB_v44r" }, "source": [ "# LGD Modelling\n", "\n", "In this lab, we will model the LGD using two techniques: A linear regression, a fitted distribution regression, and a random forest. LGD models are particularly tricky as they tend to have oddly-shaped distributions, thus traditional methods do not tend to create good fit for the models.\n", "\n", "First, we will load and study the data.\n" ] }, { "cell_type": "code", "source": [ "# bug in gdown\n", "!pip install gdown==v4.6.3" ], "metadata": { "id": "Pzt8Rt47mXiA" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "lHDwxH75v9Uy" }, "source": [ "!gdown https://drive.google.com/uc?id=1nldxUFNGDziLZgE-fJv5KmNjnbdM29na" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "bxDyuCDMwU_L" }, "source": [ "import numpy as np\n", "import pandas as pd\n", "from sklearn.model_selection import train_test_split\n", "\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline\n", "\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.linear_model import ElasticNetCV\n", "from sklearn.model_selection import GridSearchCV\n", "\n", "sns.set_theme(style=\"darkgrid\")" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "CY0FqCqBzKTX" }, "source": [ "LGD_data = pd.read_csv('LGD.csv')\n", "LGD_data.describe()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "PYkfGBd13SjV" }, "source": [ "Let's create a test / train split." ] }, { "cell_type": "code", "metadata": { "id": "wGmz9JwdzixC" }, "source": [ "x_train, x_test, y_train, y_LGD_test = train_test_split( LGD_data.iloc[:, 0:13], # Predictors\n", " LGD_data['LGD'], # Target variable\n", " test_size=0.33, # Test size percentage\n", " random_state=20201209 # Seed\n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "tKwHZthd3Rj_" }, "source": [ "And finally let's plot the LGD distribution." ] }, { "cell_type": "code", "metadata": { "id": "QvYbqyUM3jkU" }, "source": [ "# Create the figure with the density\n", "fig = sns.displot(LGD_data['LGD'], kind = 'kde')\n", "\n", "# Create a density histogram\n", "sns.histplot(LGD_data['LGD'], stat = 'density')\n", "\n", "# Plot the whole thing\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "4pSmK4FX4A_h" }, "source": [ "As we can see, the LGD has an unbalanced bimodal distribution between 0 and 1." ] }, { "cell_type": "markdown", "metadata": { "id": "SUnsKjEn4m-7" }, "source": [ "## Linear regression\n", "\n", "We will now try to fit a basic linear regression and see its performance. For this we will use the linear regression implementation of scikit-learn, [```linear_model```](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model). We will also regularize using ElasticNet." 
] }, { "cell_type": "code", "metadata": { "id": "39HgtUAi6mAJ" }, "source": [ "LGD_linear_model = ElasticNetCV(l1_ratio=np.arange(0.01, 1.01, 0.05), # l1_ratios to try.\n", " n_alphas=10, # How many alphas to try per l1_ratio\n", " fit_intercept=True, # Use constant?\n", " max_iter=1000, # Iterations\n", " tol=0.0001, # Parameter tolerance\n", " cv=3, # Number of cross_validation folds\n", " verbose=True, # Explicit or silent training\n", " n_jobs=2, # Cores to use\n", " random_state=20201209 # Random seed\n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "RzygFoI09Kg8" }, "source": [ "LGD_linear_model.fit(x_train, y_train)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "dIP3m8f_9iUE" }, "source": [ "Let's check the output." ] }, { "cell_type": "code", "metadata": { "id": "2WgcXjW_9X71" }, "source": [ "coef_df = pd.concat([pd.DataFrame({'column': x_train.columns}),\n", " pd.DataFrame(np.transpose(LGD_linear_model.coef_))],\n", " axis = 1\n", " )\n", "\n", "coef_df" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "eQrIy18ksCSS" }, "source": [ "LGD_linear_model.l1_ratio_" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "BxaF2PEe9W9e" }, "source": [ "The model is highly leaning towards LASSO. We can see some variables are not relevant. Let's check the goodness of fit over the test set." ] }, { "cell_type": "code", "metadata": { "id": "bQN8rVEc-TzP" }, "source": [ "# Predict over test set\n", "linear_pred_test = LGD_linear_model.predict(x_test)\n", "\n", "# Calculate the error\n", "linear_error = np.abs(linear_pred_test - y_LGD_test)\n", "\n", "# Print a scatter plot with distributions.\n", "fig, ax = plt.subplots(figsize=(11, 8.5))\n", "sns.scatterplot(x = y_LGD_test, # The x is the real value\n", " y = linear_pred_test, # The y value is the predictor\n", " hue = linear_error, # The colour represents the error\n", " legend = False\n", " )\n", "\n", "# Overlay a diagonal line\n", "X_plot = np.linspace(0, 1, 100)\n", "Y_plot = X_plot\n", "\n", "plt.plot(X_plot, Y_plot, color='r')\n", "\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "f8kWdEpr_k5H" }, "source": [ "We can see several values are predicted to be below 0, while many are shown to be below its correct value, particularly for large graphs. How dark a point shows the error magnitud. Let's see the average error magnitud." ] }, { "cell_type": "code", "metadata": { "id": "2FjVPXQFDTt3" }, "source": [ "linear_mse = mean_squared_error(y_LGD_test, linear_pred_test)\n", "print(f'The MSE for the linear model is {linear_mse:.4f}')" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "qJajm_OExOJO" }, "source": [ "## General Linear Regression Fit\n", "\n", "General regressions are not implemented in Python yet. This means we should use the GLM trick that we saw in the lectures to estimate a regression model that has an appropriate output distribution. Let's see how this would work out.\n", "\n", "The first step is to look for the best distribution for our data. For this we can use the [```fitter```](https://github.com/cokelaer/fitter) package, that tries to find the best distribution among all available in scipy. Let's install it and load it." 
] }, { "cell_type": "code", "metadata": { "id": "Wczyeihp2jGW" }, "source": [ "!pip install fitter\n", "import fitter" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Iebre4IN5PoS" }, "source": [ "Now we can look for the best distribution. The process is:\n", "1. Create the fitter object.\n", "2. Fit it over our LGD data.\n", "3. Pick the best distribution between all available." ] }, { "cell_type": "code", "metadata": { "id": "oThNl3VToPlE" }, "source": [ "# Generate the fitter object.\n", "dists_LGD = fitter.Fitter(LGD_data['LGD'], # The data\n", " timeout = 30, # How long to wait before timeout. Some distributions are very hard to fit!\n", " distributions = None, # Optionally you can give distributions. None means all of them, ironically.\n", " )\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "W_Z9dNRO5ifs" }, "source": [ "Not all distributions are good for our problem. This can greatly increase fitting time too. Let's restrict distributions to those we believe might be adequate for our case." ] }, { "cell_type": "code", "metadata": { "id": "iHiOH8nLu-tW" }, "source": [ "# Get the full list of distributions.\n", "dists_LGD.distributions" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "1tGU5SxYvD9M" }, "source": [ "# Pick a few.\n", "dists_LGD.distributions = ['beta', 'gamma', 'mielke', 'lognorm', 'burr12']" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "myeBlKIru9-f" }, "source": [ "# Fit it\n", "dists_LGD.fit(n_jobs = 2, # How many cores to use.\n", " progress = True # Show progress bar\n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "_Vqq4YuzrL97" }, "source": [ "dists_LGD.summary()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "OKL3Gi_awXFs" }, "source": [ "We can see the best distributions are the Mielke distribution (a mix between a beta and an F function common in physical phenomena) and the gamma distribution, a generalization of the beta distribution.\n", "\n", "Let's use [Mielke's distribution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mielke.html) for this example. Every dataset can have its own distribution!" ] }, { "cell_type": "code", "metadata": { "id": "sZn0q88XrzHm" }, "source": [ "dists_LGD.get_best()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "LGqLV4YpySRW" }, "source": [ "The function for Mielke's density distribution are k, s, loc and scale, which are conveniently given in the above dataframe. We see there is no translation made (loc is close to 0), but we do need to scale the distribution a bit (the fourth parameter). The k and s parameters give the shape of the distribution.\n", "\n", "What functions do we need? Well, the general process for a regression of this type is:\n", "\n", "1. Get where on the original cumulative distribution a point (the LGD) falls. For this we need the cumulative Mielke distribution, called the ```cdf``` function in scipy.\n", "2. Get where on the normal distribution that particular point falls. For this we need the inverse of the cumulative function, also called the **percent point function** or ```ppf``` in scipy.\n", "3. Apply this to all points in the dataset. Now everything is mapped to a normal variable.\n", "4. Run a linear regression between our regressors and the z-transformed variable. 
You can use LASSO, ElasticNet, etc to get the best model.\n", "5. Go back. For this you need the cumulative normal distribution function (cdf) and the inverse cumulative distribution function for our target distribution (Mielke's ppf function).\n", "\n", "Let's import all required functions." ] }, { "cell_type": "code", "metadata": { "id": "VQtm7F12w_MB" }, "source": [ "# Import the functions\n", "from scipy.stats import mielke, norm" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "sfXXfJdQ56nV" }, "source": [ "# Set the parameters to particular values.\n", "LGD_mielke = mielke(*dists_LGD.fitted_param['mielke']) # Asterisk means split the array into parameters for the function\n", "LGD_normal = norm()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "_3xv_i7Jvrin" }, "source": [ "dists_LGD.fitted_param['mielke']" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "FyOLEWXy8VNi" }, "source": [ "Let's begin the calculations. The first step is to get the CDF of all elements in the Mielke distribution and finding its corresponding z-value in the normal distribution." ] }, { "cell_type": "code", "metadata": { "id": "V2bTnVAf8RGG" }, "source": [ "# Get the Mielke cdf point.\n", "LGD_data['MielkeCDF'] = [LGD_mielke.cdf(x) for x in LGD_data['LGD']]\n", "\n", "# Get the corresponding z-value in the normal function\n", "LGD_data['Z-Mielke'] = [norm.ppf(x) for x in LGD_data['MielkeCDF']]\n", "LGD_data['Z-Mielke'].describe()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "50opUBGy9JfN" }, "source": [ "Our data is perfectly mapped to a normal regression. Now we are ready to run the regression! We can use the same code as before, but our target now will be the newly calculate Z-Mielke variable." ] }, { "cell_type": "code", "metadata": { "id": "LSfFU9Ng9FaP" }, "source": [ "LGD_mielke_model = ElasticNetCV(l1_ratio=np.arange(0.01, 1.01, 0.05), # l1_ratios to try.\n", " n_alphas=10, # How many alphas to try per l1_ratio\n", " fit_intercept=True, # Use constant?\n", " max_iter=1000, # Iterations\n", " tol=0.0001, # Parameter tolerance\n", " cv=3, # Number of cross_validation folds\n", " verbose=True, # Explicit or silent training\n", " n_jobs=2, # Cores to use\n", " random_state=20201209 # Random seed\n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Z0btkFRI-RM_" }, "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "x_train, x_test, y_train, y_mielke_test = train_test_split( LGD_data.iloc[:, 0:13], # Predictors\n", " LGD_data['Z-Mielke'], # Target variable\n", " test_size=0.33, # Test size percentage\n", " random_state=20201209 # Seed\n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "h5R6Kwhj-c2o" }, "source": [ "LGD_mielke_model.fit(x_train, y_train)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "4M--M4_R-mQo" }, "source": [ "coef_df = pd.concat([pd.DataFrame({'column': x_train.columns}),\n", " pd.DataFrame(np.transpose(LGD_mielke_model.coef_))],\n", " axis = 1\n", " )\n", "\n", "coef_df" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "A3n2-j0q-vVk" }, "source": [ "Similar variables are relevant now, but the weights have clearly changed! 
We can now apply this model to the test data and then calculate the corresponding LGD by reversing our procedure." ] }, { "cell_type": "code", "source": [ "# Predict over test set\n", "mielke_pred_test = LGD_mielke_model.predict(x_test)\n", "mielke_pred_test = norm.cdf(mielke_pred_test)\n", "mielke_pred_test = LGD_mielke.ppf(mielke_pred_test)\n", "\n", "# Calculate the error\n", "mielke_error = np.abs(mielke_pred_test - y_LGD_test)\n" ], "metadata": { "id": "GEbXk8txMlin" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "XnrJtV8R_dpW" }, "source": [ "Now that we have the estimates and the error, we can plot our results and calculate the MSE."
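, "\n", "\n", "But first, a quick sanity check (a small sketch using the objects we already created): the transformation should be invertible, so mapping the z-values back through the normal cdf and the Mielke ppf should approximately recover the original LGDs; small deviations are possible at the very extremes of the distribution." ] }, { "cell_type": "code", "metadata": { "id": "mielkeRoundTripCheck" }, "source": [ "# Sanity check (sketch): map the z-values back and compare against the original LGDs.\n", "back_mapped = LGD_mielke.ppf(norm.cdf(LGD_data['Z-Mielke']))\n", "print(np.nanmax(np.abs(back_mapped - LGD_data['LGD'])))" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "mielkeRoundTripNote" }, "source": [ "If this number is essentially zero, the round trip works and we can trust the reverse mapping for our predictions."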
] }, { "cell_type": "code", "metadata": { "id": "veNyYE_n_iGo" }, "source": [ "\n", "# Print a scatter plot with distributions.\n", "fig, ax = plt.subplots(figsize=(11, 8.5))\n", "sns.scatterplot(x = y_LGD_test, # The x is the real value\n", " y = mielke_pred_test, # The y value is the predictor\n", " hue = mielke_error, # The colour represents the error\n", " legend = False\n", " )\n", "\n", "# Overlay a diagonal line\n", "X_plot = np.linspace(0, 1, 100)\n", "Y_plot = X_plot\n", "\n", "plt.plot(X_plot, Y_plot, color='r')\n", "\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "0FztiCekAAlx" }, "source": [ "from sklearn.metrics import mean_squared_error\n", "\n", "linear_mse = mean_squared_error(y_LGD_test, mielke_pred_test)\n", "print('The MSE for the Mielke-distributed model is %.4f' % linear_mse)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "2pPgB6hP-uFo" }, "source": [ "# Predict over test set\n", "mielke_pred_test = LGD_mielke_model.predict(x_test)\n", "mielke_pred_test = norm.cdf(mielke_pred_test)\n", "mielke_pred_test = LGD_mielke.ppf(mielke_pred_test)\n", "\n", "# Calculate the error\n", "mielke_error = np.abs(mielke_pred_test - y_LGD_test)\n", "\n", "# Print a scatter plot with distributions.\n", "fig, ax = plt.subplots(figsize=(11, 8.5))\n", "sns.scatterplot(x = y_LGD_test, # The x is the real value\n", " y = mielke_pred_test, # The y value is the predictor\n", " hue = mielke_error, # The colour represents the error\n", " legend = False\n", " )\n", "\n", "# Overlay a diagonal line\n", "X_plot = np.linspace(0, 1, 100)\n", "Y_plot = X_plot\n", "\n", "plt.plot(X_plot, Y_plot, color='r')\n", "\n", "plt.show()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "E2qHzEMbBHO3" }, "source": [ "So we got a lower error! The improvement is not extreme in this dataset, but besides getting a better error we also get a better distribution: Our model starts at 0 and covers most of the original range. We can use this trick to create a regression for any distribution we want. Let's train now an XGBoosting method to compare." ] }, { "cell_type": "markdown", "source": [ "## XGBoosting\n", "\n", "The only difference between the models we ran in the XGB lab is the fact that this is a regression problem instead of a classification one. There is no issue with bounds (as opposed to a linear regression) as an XGB always produces estimates within the bounds of the model.\n", "\n", "First, we recover the original dataset." ], "metadata": { "id": "EEizz1-TXzXX" } }, { "cell_type": "code", "source": [ "x_train, x_test, y_train, y_LGD_test = train_test_split( LGD_data.iloc[:, 0:13], # Predictors\n", " LGD_data['LGD'], # Target variable\n", " test_size=0.33, # Test size percentage\n", " random_state=20201209 # Seed\n", " )" ], "metadata": { "id": "YoBwwZa7q7XQ" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "from xgboost import XGBRegressor\n", "\n", "#Define the classifier.\n", "XGB_LGD = XGBRegressor(max_depth=2, # Depth of each tree\n", " learning_rate=0.1, # How much to shrink error in each subsequent training. Trade-off with no. estimators.\n", " n_estimators=50, # How many trees to use, the more the better, but decrease learning rate if many used.\n", " verbosity=1, # If to show more errors or not.\n", " objective='reg:squarederror', # Type of target variable.\n", " booster='gbtree', # What to boost. 
Trees in this case.\n", " n_jobs=2, # Parallel jobs to run. Set your processor number.\n", " gamma=0.001, # Minimum loss reduction required to make a further partition on a leaf node of the tree. (Controls growth!)\n", " subsample=0.632, # Subsample ratio. Can set lower\n", " colsample_bytree=1, # Subsample ratio of columns when constructing each tree.\n", " colsample_bylevel=1, # Subsample ratio of columns when constructing each level. 0.33 is similar to random forest.\n", " colsample_bynode=1, # Subsample ratio of columns when constructing each split.\n", " reg_alpha=1, # Regularizer for first fit. alpha = 1, lambda = 0 is LASSO.\n", " reg_lambda=0, # Regularizer for first fit.\n", " random_state=20230331, # Seed\n", " tree_method='hist', # How to train the trees?\n", " #gpu_id=0 # With which GPU?\n", " eval_metric=mean_squared_error, # Metric used for monitoring the training result and early stopping.\n", " early_stopping_rounds=5 # Validation metric needs to improve at least once in every early_stopping_rounds round(s) to continue training.\n", " )" ], "metadata": { "id": "zf164-bYN8mr" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now we'll look for the best parameters following a grid-search." ], "metadata": { "id": "2p_vgUBTlh3B" } }, { "cell_type": "code", "source": [ "# Define the parameters. Play with this grid!\n", "param_grid = dict({'n_estimators': [50, 100, 150],\n", " 'max_depth': [2, 3, 4],\n", " 'learning_rate' : [0.01, 0.05, 0.1, 0.15]\n", " })" ], "metadata": { "id": "HDVbSW0QfGt7" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Define grid search object.\n", "GridXGB = GridSearchCV(XGB_LGD, # Original XGB.\n", " param_grid, # Parameter grid\n", " cv = 3, # Number of cross-validation folds.\n", " scoring = 'neg_mean_squared_error', # How to rank outputs.\n", " n_jobs = 2, # Parallel jobs. -1 is \"all you have\"\n", " refit = True, # If refit at the end with the best.\n", " verbose = 1 # If to show what it is doing.\n", " )" ], "metadata": { "id": "ZC2pnYddipdw" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "And we'll also create a validation sample for the early stopping." ], "metadata": { "id": "1Cx9QxM8luUK" } }, { "cell_type": "code", "source": [ "x_train_xgb, x_val_xgb, y_train_xgb, y_val_xgb = train_test_split(x_train, y_train, test_size=0.33, random_state=20230331)" ], "metadata": { "id": "FeTt-Ai1k9r4" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now we train! To see the training error both on the train and validation samples, we will provide both sets as evaluation sets." 
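, "\n", "\n", "Before launching it, it is worth counting how many model fits the grid implies. Here is a small sketch using scikit-learn's ```ParameterGrid``` (nothing below is specific to XGBoost; it simply enumerates the grid we defined above):" ], "metadata": { "id": "gridSizeNoteLeadIn" } }, { "cell_type": "code", "source": [ "from sklearn.model_selection import ParameterGrid\n", "\n", "# Count the parameter combinations the grid search will evaluate.\n", "n_combos = len(list(ParameterGrid(param_grid)))\n", "print(f'{n_combos} combinations x 3 CV folds = {n_combos * 3} fits, plus one final refit')" ], "metadata": { "id": "gridSizeNoteCode" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Early stopping keeps each of those fits relatively cheap. Also note that when several evaluation sets are passed, XGBoost uses the last one (here the validation sample) for early stopping."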
], "metadata": { "id": "koGKZJSrmjPo" } }, { "cell_type": "code", "source": [ "# Create train and validation sets\n", "\n", "GridXGB.fit(x_train_xgb, y_train_xgb,\n", " eval_set=[(x_train_xgb, y_train_xgb), (x_val_xgb, y_val_xgb)])" ], "metadata": { "id": "mJYTH9Yiixfg" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "GridXGB.best_estimator_" ], "metadata": { "id": "JTVLZQPTjROB" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "results = GridXGB.best_estimator_.evals_result()\n", "\n", "plt.figure(figsize=(10,7))\n", "plt.plot(results[\"validation_0\"][\"rmse\"], label=\"Training loss\")\n", "plt.plot(results[\"validation_1\"][\"rmse\"], label=\"Validation loss\")\n", "# Check if early stopping happened.\n", "if hasattr(GridXGB.best_estimator_, 'best_ntree_limit'):\n", " plt.axvline(GridXGB.best_estimator_.best_ntree_limit, color=\"gray\",\n", " label=\"Optimal tree number\")\n", "# Label the axis and show the plot.\n", "plt.xlabel(\"Number of trees\")\n", "plt.ylabel(\"Loss (RMSE)\")\n", "plt.legend()\n", "plt.show()" ], "metadata": { "id": "XMwYiUwll9mU" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Predict over test set\n", "XGB_pred_test = GridXGB.best_estimator_.predict(x_test)\n", "\n", "# Calculate the error\n", "XGB_errors = np.abs(XGB_pred_test - y_LGD_test)\n", "XGB_mse = mean_squared_error(XGB_pred_test, y_LGD_test)\n", "\n", "# Print it!\n", "print(f'The MSE for the XGB model is {XGB_mse:.3f}')" ], "metadata": { "id": "C5DKbAk3nPgD" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "XGB achieves, as expected, the lowest error of the models we are testing! This is great as LGD can legally be modelled using these techniques, as they are not used for customer-facing decision-making. Let's plot the error now and see exactly where the gains are." ], "metadata": { "id": "CLOhKCattgJB" } }, { "cell_type": "code", "source": [ "# Print a scatter plot with distributions.\n", "fig, ax = plt.subplots(figsize=(11, 8.5))\n", "sns.scatterplot(x = y_LGD_test, # The x is the real value\n", " y = XGB_pred_test, # The y value is the predictor\n", " hue = XGB_errors, # The colour represents the error\n", " legend = False\n", " )\n", "\n", "# Overlay a diagonal line\n", "X_plot = np.linspace(0, 1, 100)\n", "Y_plot = X_plot\n", "\n", "plt.plot(X_plot, Y_plot, color='r')\n", "\n", "plt.show()" ], "metadata": { "id": "U2eOa6H1njHr" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Much better! The model is able to capture much better the patterns, specially in the middle sector that the other two models failed. That said, the high LGDs are still not very well modelled, which probably means they are structurally different." ], "metadata": { "id": "ZtPYS1jCrAix" } } ] }