{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "view-in-github", "colab_type": "text" }, "source": [ "\"Open" ] }, { "cell_type": "markdown", "metadata": { "id": "wzkn3hcmOgJb" }, "source": [ "# Weight of Evidence Transformation\n", "\n", "In this lab we will apply a Weight of Evidence transformation to our data. The idea is to:\n", "\n", "- Split the data into a train/test set.\n", "- Generate a relevant set of cuts to our data.\n", "- Calculate the WoE for each variable.\n", "- Save the data we just created.\n", "\n", "We are assuming we have already cleaned the date of outliers and null values.\n", "\n", "In order to do this we will use the fantastic [```scorecardpy```](https://github.com/ShichenXie/scorecardpy) Python package. First we need to install it, as it is not a standard package.\n", "\n", "We use the OS python software ```pip``` for this." ] }, { "cell_type": "code", "metadata": { "id": "3o0aNJJAO-vO" }, "source": [ "!pip install git+https://github.com/CBravoR/scorecardpy" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "u2ENa3Yqmb3g" }, "source": [ "Now we import the data to use. I have created a clean version of the dataset from our last lab.\n", "\n", "---\n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "8dhhvMQIPqNb" }, "source": [ "import pandas as pd\n", "!gdown 'https://drive.google.com/uc?id=1-RiFAF4zU27N9MnoSYUlNuqFhR3VcuWs'\n", "bankloan_data = pd.read_pickle('BankloanClean.pkl')\n", "bankloan_data.describe()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "sum(bankloan_data.Default==0)" ], "metadata": { "id": "_09n-DMxhfDh" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "sum(bankloan_data.Default==1)" ], "metadata": { "id": "zmpAKnHhho6p" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "bankloan_data.isnull().any()" ], "metadata": { "id": "704Qu-PNugit" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "bankloan_data.OthDebt.fillna(value = 0, inplace=True)\n", "bankloan_data.fillna(bankloan_data.median(), inplace=True)" ], "metadata": { "id": "7xqKqp01uTGk" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Krr0JdtkUK3I" }, "source": [ "It is always a good idea to create new variables to extract information from our models. For example, the following variable improves OthDebt." ] }, { "cell_type": "code", "metadata": { "id": "hiDgfEa_9uQU" }, "source": [ "bankloan_data['OthDebtRatio'] = bankloan_data['OthDebt'] / bankloan_data['Income']" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "Dl36EUQ4mkH2" }, "source": [ "## Binning\n", "\n", "The first step is to properly bin the data. Usually, we will run a tree and manually adjust those cases that do not follow a logical pattern.\n", "\n", "However, as calculating WoE means we need to use the objective variable, we need to first create a train and test split. The scorecard package comes with a function to do so easily, ```split_df```, which takes as an argument the ratio and the seed.\n", "\n", "**Note: A random seed is used to generate a random split that will be reproducible (is there such as thing as randomness in a computer?). I expect for all coursework for you to use your student ID.**" ] }, { "cell_type": "markdown", "metadata": { "id": "tMCGh9cYD88G" }, "source": [] }, { "cell_type": "code", "metadata": { "id": "r7lfFEDQZIzL" }, "source": [ "import scorecardpy as sc\n", "import numpy as np" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Zc_TCHdZX6W6" }, "source": [ "# Split in train and test BEFORE we apply WoE\n", "# Use your Student ID as seed!\n", "\n", "train, test = sc.split_df(bankloan_data.iloc[:,1:],\n", " y='Default',\n", " ratio=0.7,\n", " seed=20190227).values()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "ynoVNOjoYm_t" }, "source": [ "test.describe()" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "nm5wr5atqZyi" }, "source": [ "train.describe()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "uMyuKFBbnUE4" }, "source": [ "Now we can bin the variables. The function ```woebin``` will do this automatically for us. It will use trees sequentially given the constraints we decide. It is good practice to not leave less than 5% of cases in each bin, and I am using 20 starting bins.\n", "\n", "**Tip: For larger datasets, use a relatively large number of bins (50 to 100), for smaller ones, use less.**" ] }, { "cell_type": "code", "metadata": { "id": "BnV_3pcoZNuW" }, "source": [ "bins = sc.woebin(train, y='Default',\n", " min_perc_fine_bin=0.02, # How many bins to cut initially into\n", " min_perc_coarse_bin=0.05, # Minimum percentage per final bin\n", " stop_limit=0.02, # Minimum information value\n", " max_num_bin=8, # Maximum number of bins\n", " method='tree'\n", " )" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ZMspXWGknuk1" }, "source": [ "Now we can plot the results. We need to be able to explain the results for each variable. We should be able to explain every trend. They do not need to necessarilly be linear, we just need a good explanation for the trend.\n", "\n", "**If you cannot explain the trend you need to adjust the bins.**" ] }, { "cell_type": "code", "metadata": { "id": "lbZ2ZjM2aJ5p" }, "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import seaborn as sns\n", "\n", "sc.woebin_plot(bins)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "vOgBTMR01ipI" }, "source": [ "bins" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "5jTW3xtwadEi" }, "source": [ "## Manual adjustment\n", "\n", "In this case, Creddebt and Income don't follow an explainable trend, so we need to make manual adjustments.\n", "\n", "Note that OthDebt has an IV of 0.0927. **The output of the binning function is the best possible binning from a statistical point of view. Changing it will only make the IV go down.** As the IV of OthDebt is really close to 1, you need to make a call. How many variables do you have available? If plenty, then you can drop the variable, otherwise, you should leave it and see if this can be sorted out later.\n", "\n", "Now we need to adjust Creddebt and Income. We can make this interactively within the package with the excelent function ```woebin_adj``` that allows us to make adjustments one by one, only for those variables with irregular trends (this can be changed in the options). We start by invoking the function, note the bar at the end." ] }, { "cell_type": "code", "metadata": { "id": "MKBTy6ALabkH" }, "source": [ "breaks_adj = sc.woebin_adj(train, \"Default\", bins, adj_all_var=True)" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "XL3NLMtJbc9i" }, "source": [ "For Creddebt we need to enter the following breaks:\n", "\n", "```1.0,6.0```\n", "\n", "For OthDebt we enter\n", "\n", "```0.4, 1.6, 3.2, 11.4```\n", "\n", "For Income, we enter the following breaks:\n", "\n", "```30.0,40.0,80.0,140.0```\n", "\n", "And for OthDebtRatio we enter these ones:\n", "\n", "```0.01,0.05,0.07,0.09,0.14```\n", "\n", "The variables are now much better behaved. Play around with the breaks to see if you can get to better solutions. OthDebt is a tricky variable: Its IV is 0.13 in the original (incorrect) binning, but the corrected one is (really slightly) below 0.1 so it would be discarded. Should we with a variable that close to 1? We could play around a bit more to see if we can get it to be rational.\n", "\n", "Now that we are happy with the binnings, **we need to apply it to both of our datasets**." ] }, { "cell_type": "code", "metadata": { "id": "t7IGq2pQhr4W" }, "source": [ "breaks_adj" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "2UsTbIEGd3zz" }, "source": [ "bins_adj = sc.woebin(train, y=\"Default\", breaks_list=breaks_adj) # Apply new cuts\n", "train_woe = sc.woebin_ply(train, bins_adj) # Calculate WoE dataset (train)\n", "test_woe = sc.woebin_ply(test, bins_adj) # Calculate WoE dataset (test)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "umItG0yverAA" }, "source": [ "train_woe.head()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "jpPrAh0cTH1c" }, "source": [ "The ```bins``` object is a dictionary. You can get each WoE table by calling the function ```get``` with the name of the variable as an argument." ] }, { "cell_type": "code", "metadata": { "id": "5-k9XXje3kYE" }, "source": [ "bins.get('MonthlyLoad')" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "DrlgxlWrS2vn" }, "source": [ "You can see the adjusted breaks by calling the object. It is a good idea to save it somewhere as it saves time if you need to recalculate everything again (see next lab)." ] }, { "cell_type": "code", "metadata": { "id": "2qxIScZ54LF7" }, "source": [ "breaks_adj" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "lw-uVPSVdTb8" }, "source": [ "## Information Value Filtering\n", "\n", "Now we can check the information value of our variables and remove those who are not predictive. We use the function ```iv```. In general:\n", "\n", "- $IV < 0.02$: No predictive ability, remove.\n", "- $0.02 \\le IV < 0.1$: Small predictive ability, suggest to remove.\n", "- $0.1 \\le IV < 0.3$: Medium predictive ability, leave.\n", "- $0.3 \\le IV < 1$: Good predictive ability, leave.\n", "- $1 \\le IV $: Strong predictive ability. Suspicious variable. Study if error in calculation (i.e. WoE leaves a category with 100% goods or bads) or if variable is capturing future information." ] }, { "cell_type": "code", "metadata": { "id": "EgYus86QbcH-" }, "source": [ "# Calculate variable importance\n", "iv_df = pd.DataFrame(sc.iv(train_woe, 'Default'))\n", "iv_df" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Plot variable importance\n", "sns.barplot(x=\"variable\", y=\"info_value\", data=iv_df)\n", "plt.xticks(rotation=90)\n", "plt.show()" ], "metadata": { "id": "La7SoUZUyCrq" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "cE324-TnexOF" }, "source": [ "As we can see, Education, and OthDebt are below the threshold. Income and OthDebt are a bit of an odd ones as they have a very close to one IV. **I recommend you leave Income as OthDebt was not explainable. Step 2 of variable selection can get rid of them if necessary**. We can easily remove the other variable manually." ] }, { "cell_type": "code", "metadata": { "id": "HLp4t1bAfsgi" }, "source": [ "# Check column order.\n", "train_woe.columns" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "tuu960ELiMpM" }, "source": [ "# Create range of accepted variables\n", "accepted_range = np.r_[0:3,4:7,8:10] # Note the last in each subrange is not used\n", "train_woe = train_woe.iloc[:, accepted_range]\n", "test_woe = test_woe.iloc[:, accepted_range]\n", "train_woe.head()\n", "\n", "# Alternative\n", "# train_woe.drop(axis = 0, index = ['Education_woe', any others], inplace = True)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "apX9iD-Fi7UG" }, "source": [ "test_woe.head()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "## Final correlation analysis\n", "\n", "Now we need to do a final correlation analysis. WoE might introduce correlation effects which were originally not present. For this, we can use seaborn and the corr function in numpy." ], "metadata": { "id": "p2XcgGhqydzx" } }, { "cell_type": "code", "source": [ "from string import ascii_letters\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ], "metadata": { "id": "Vs38kwraycSy" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "# Compute the correlation matrix\n", "corr = train_woe.corr()\n", "corr = np.abs(corr)\n", "\n", "# Generate a mask for the upper triangle\n", "mask = np.triu(np.ones_like(corr, dtype=bool))\n", "\n", "# Set up the matplotlib figure\n", "f, ax = plt.subplots(figsize=(11, 9))\n", "\n", "# Generate a custom diverging colormap\n", "cmap = sns.diverging_palette(230, 20, as_cmap=True)\n", "\n", "# Draw the heatmap with the mask and correct aspect ratio\n", "sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1, center=0,\n", " square=True, linewidths=.5, cbar_kws={\"shrink\": .5})\n", "\n", "# Show the plot\n", "plt.show()" ], "metadata": { "id": "H_SLzs7yyk7y" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "corr" ], "metadata": { "id": "yq-jyYKbywUK" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "There are a few variables that we need to key an eye on. First is Age and Address, with a correlation of 0.85. We can remove them at this stage, but let's keep an eye on the coefficients these variables give. In a logistic regression with WoE transformation, **all coefficients should be positive**, otherwise there are correlations effects at play." ], "metadata": { "id": "BLtbFtFdy7ab" } }, { "cell_type": "code", "source": [ "corr" ], "metadata": { "id": "_ccB6TIFXZMY" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "B0ivWL8Ui_QP" }, "source": [ "## Saving the results.\n", "\n", "Now we are ready to apply models! We have our train and test datasets ready. We can now save the csv into our local file system or our Google Drive. In the latter case, the process is a bit complicated, as it requires us to connect our accounts. The detailed instructions are [here](https://colab.research.google.com/notebooks/io.ipynb#scrollTo=vz-jH8T_Uk2c).\n", "\n", "We will download the data to our own hard drive. First we need to save our data as csv, using the function [```to_csv```](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)." ] }, { "cell_type": "code", "metadata": { "id": "1Op95-IDY-zM" }, "source": [ "train_woe.drop(columns=['Address_woe', 'OthDebtRatio_woe'], inplace=True)\n", "test_woe.drop(columns=['Address_woe', 'OthDebtRatio_woe'], inplace=True)\n", "train_woe.to_csv(\"train_woe.csv\", index = False)\n", "test_woe.to_csv(\"test_woe.csv\", index = False)\n", "bankloan_data.to_pickle('BankloanCleanNewVars.pkl')\n", "!ls # Linux commands to check what files are in the computer." ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "source": [ "Now you can download this data using the sidebar." ], "metadata": { "id": "F0wE-1pvzmME" } } ] }