{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [], "include_colab_link": true }, "kernelspec": { "name": "python3", "display_name": "Python 3" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "IccG7h1OAa0R" }, "source": [
    "# Logistic Regression and Scorecards\n",
    "\n",
    "In this lab we will finally start running models! For this we will use the excellent [```scikit-learn```](https://scikit-learn.org/stable/) package, which implements many, many data science methods. It is the go-to tool for any structured data analysis task.\n",
    "\n",
    "First, we will import the data from last week. We will download them from my Google Drive."
] }, { "cell_type": "code", "metadata": { "id": "FGwyxAAuBYZ3" }, "source": [
    "# Import the csv files from last week.\n",
    "!gdown 'https://drive.google.com/uc?id=1LWRFLpJtTopAlRqTuUd9XZvGB6CoHa2z'\n",
    "!gdown 'https://drive.google.com/uc?id=1IvY78EGu-eizec_9agJUsQWDLT-wmSHF'\n",
    "!gdown 'https://drive.google.com/uc?id=1aDraDSR2OQbIMjIY07s-rD5cel2x_iS-'"
], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "DGdEWvmjBaWo" }, "source": [ "!ls" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "pEzFvUnCPlqs" }, "source": [ "Now we install the scorecardpy package and clean our data." ] }, { "cell_type": "code", "metadata": { "id": "6leC6-rRPrs7" }, "source": [ "!pip install git+https://github.com/CBravoR/scorecardpy" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "Jr3dMH2OS3X0" }, "source": [
    "# Package loading\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import scorecardpy as sc\n",
    "from string import ascii_letters\n",
    "\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "pp0no9a1Ps4S" }, "source": [
    "# Import the files as Pandas datasets\n",
    "bankloan_train_WoE = pd.read_csv('train_woe.csv')\n",
    "bankloan_test_WoE = pd.read_csv('test_woe.csv')\n",
    "bankloan_data = pd.read_pickle('BankloanCleanNewVars.pkl')\n",
    "\n",
    "# Eliminate unused variables\n",
    "# bankloan_data.drop(columns=['Education'], inplace = True)\n",
    "\n",
    "# Same train-test split as before (because of seed!)\n",
    "bankloan_train_noWoE, bankloan_test_noWoE = sc.split_df(bankloan_data.iloc[:, 1:],\n",
    "                                                        y = 'Default',\n",
    "                                                        ratio = 0.7,\n",
    "                                                        seed = 20190227).values()\n",
    "\n",
    "# Give breaks for WoE\n",
    "breaks_adj = {'Address': [1.0,2.0,8.0,17.0],\n",
    "              'Age': [30.0,45.0,50.0],\n",
    "              'Creddebt': [1.0, 6.0],\n",
    "              'Employ': [4.0,14.0,22.0],\n",
    "              'Income': [30.0,40.0,80.0,140.0],\n",
    "              'Leverage': [8.0,16.0,22.0],\n",
    "              'MonthlyLoad': [0.1,0.2,0.3,0.7],\n",
    "              'OthDebt': [1.0,2.0,3.0],\n",
    "              'OthDebtRatio': [0.01,0.05,0.07,0.09,0.14]\n",
    "              }\n",
    "\n",
    "# Apply breaks.\n",
    "bins_adj = sc.woebin(bankloan_train_noWoE, y=\"Default\",\n",
    "                     breaks_list=breaks_adj)\n",
    ""
], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "gxHzupctVYi3" }, "source": [
    "## Generating a logistic regression object\n",
    "\n",
    "To train a logistic regression, we first need to create an object that stores how we want the model to be trained. ",
    "In general, all scikit-learn models work this way:\n",
    "\n",
    "- We create the model we want to train, with all required parameters. This model is **not trained yet**; it just stores the configuration we will use.\n",
    "\n",
    "- We apply the ```fit``` function to the object we just created. This takes as input the training set and the targets (if the model is supervised), and will update our model with trained parameters.\n",
    "\n",
    "- We then use our trained model on the test set to calculate outputs.\n",
    "\n",
    "Logistic regression is included in the [```linear_model``` subpackage](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model) and it comes pre-packaged with all regularization algorithms: the LASSO penalization, the Ridge penalization, and the ElasticNet method (refer to the lectures for the explanation of these, or read this [excellent tutorial](https://codingstartups.com/practical-machine-learning-ridge-regression-vs-lasso/)).\n",
    "\n",
    "In a nutshell, LASSO and Ridge penalize including variables by adding either a linear (LASSO) or quadratic (Ridge) term to the minimization objective, or a combination of the two if using Elastic Net.\n",
    "\n",
    "These methods have hyperparameters that need to be optimized. For this we will use a cross-validation procedure (again, refer to the lectures). Luckily for us, scikit-learn already comes with an object that allows cross-validated optimization of the penalization parameter. The function to call is [```LogisticRegressionCV```](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV).\n",
    "\n",
    "Let's start by creating this object.\n",
    "\n",
    "\n"
] }, { "cell_type": "code", "metadata": { "id": "KB56nPSfVU5E" }, "source": [
    "from sklearn.linear_model import LogisticRegressionCV\n",
    "\n",
    "bankloan_logreg = LogisticRegressionCV(penalty='elasticnet', # Type of penalization: l1 = lasso, l2 = ridge, elasticnet\n",
    "                                       Cs = 10, # How many parameters to try. Can also be a vector with parameters to try.\n",
    "                                       tol=0.000001, # Tolerance for parameters\n",
    "                                       cv = 3, # How many CV folds to try. 3 or 5 should be enough.\n",
    "                                       fit_intercept=True, # Use constant?\n",
    "                                       class_weight='balanced', # Weights, see below\n",
    "                                       random_state=20190301, # Random seed\n",
    "                                       max_iter=100, # Maximum iterations\n",
    "                                       verbose=2, # Show progress (0 = silent, higher = more detail).\n",
    "                                       solver = 'saga', # How to optimize.\n",
    "                                       n_jobs = 2, # Processes to use. Set to number of physical cores.\n",
    "                                       refit = True, # Whether to retrain with the best parameter and all data after finishing.\n",
    "                                       l1_ratios = np.arange(0, 1.01, 0.1), # The LASSO / Ridge ratios.\n",
    "                                       )"
], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "weIQIKZEaP6o" }, "source": [
    "Let's dig deeper into what is needed.\n",
    "\n",
    "**Penalty**\n",
    "\n",
    "The 'l1' penalty refers to LASSO regression (great at selecting variables), 'l2' to Ridge regression (not very good at selecting variables), and 'elasticnet' combines the two. My advice: as long as you have more samples than variables, start with LASSO; if it doesn't work or you are not happy with the results, move to elasticnet.\n",
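    "\n",
    "If you do want to start with a plain LASSO, the construction is the same; a minimal sketch (the object below, ```bankloan_logreg_lasso```, is a hypothetical alternative to the elasticnet object created above and is not used in the rest of the lab):\n",
    "\n",
    "```python\n",
    "from sklearn.linear_model import LogisticRegressionCV\n",
    "\n",
    "# Hypothetical alternative object; not used later in this notebook.\n",
    "# LASSO-only search: penalty='l1' needs a solver that supports it, such as 'saga'.\n",
    "bankloan_logreg_lasso = LogisticRegressionCV(penalty='l1',\n",
    "                                             Cs=10,\n",
    "                                             cv=3,\n",
    "                                             class_weight='balanced',\n",
    "                                             solver='saga',\n",
    "                                             max_iter=100,\n",
    "                                             random_state=20190301)\n",
    "```\n",
    "\n",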
    "**Penalty constants to try (```Cs```)**\n",
    "\n",
    "This refers to how many values of the penalization constant to try. This constant measures the weight of the prediction error versus the regularization (penalty) error. When optimizing the parameters, the solver will minimise the following:\n",
    "\n",
    "$$\n",
    "Error = Error_{prediction} + \\frac{1}{C} \\times Error_{penalty}\n",
    "$$\n",
    "\n",
    "So the $C$ constant balances both objectives. By giving a ```Cs``` larger than 1, it will try that many values of $C$.\n",
    "\n",
    "**Class weighting**\n",
    "\n",
    "Most interesting problems are unbalanced. This means the interesting class (Default in our case) has fewer cases than the opposite class. Models optimise the sum over **all** cases, so if we minimize the error, which class do you think will be better classified?\n",
    "\n",
    "This means we need to balance the classes so they carry equal weight. Luckily for us, scikit-learn includes automatic weighting that assigns the same error to both classes. The error becomes the following:\n",
    "\n",
    "$$\n",
    "Error = Weight_1 \\times Error_{predictionClass1} + Weight_2 \\times Error_{predictionClass2} + \\frac{1}{C} \\times Error_{penalty}\n",
    "$$\n",
    "\n",
    "The weights are selected so the theoretical maximum error in both classes is the same (see the help for the exact equation).\n",
    "\n",
    "**Random State**\n",
    "\n",
    "The random seed. Remember to use your student ID.\n",
    "\n",
    "**Iterations**\n",
    "\n",
    "The solution is found by an iterative procedure, so we specify a maximum number of iterations. Remember to check for convergence after it has been solved!\n",
    "\n",
    "**Solver**\n",
    "\n",
    "Data science models are complex, with thousands, millions, or even billions of parameters, so we need to use the best possible solver for our problem. Several are implemented in scikit-learn. The help states that:\n",
    "\n",
    "\n",
    "- For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and ‘saga’ are faster for large ones.\n",
    "- For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’ handle multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes.\n",
    "- ‘newton-cg’, ‘lbfgs’ and ‘sag’ only handle L2 penalty, whereas ‘liblinear’ and ‘saga’ handle L1 penalty.\n",
    "\n",
    "We will use 'saga', a very efficient solver. You can read all about it [here](https://www.di.ens.fr/~fbach/Defazio_NIPS2014.pdf).\n",
    "\n",
    "**refit**\n",
    "\n",
    "If your data is sufficiently small to fit in memory, you will be able to use all of the training data for the cross-validation process. If so, then with ```refit=True``` you will retrain the model after the parameter search, using the optimal parameter found.\n",
    "\n",
    "However, in large datasets this might not be possible. In this case:\n",
    "\n",
    "1. Obtain a **validation sample** from the original training data. Usually 20% of the data is used, but it depends on memory and time constraints.\n",
    "\n",
    "2. Run the cross-validation process over this validation data and find the optimal parameter. Let's call it $C^*$.\n",
    "\n",
    "3. Train a logistic regression with all training data, but with a fixed parameter $C^*$. For this you need to use the function [```LogisticRegression```](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression) in scikit-learn and give the parameter ```C=YOUR_OPTIMAL_C```. The rest of the parameters are similar to ```LogisticRegressionCV```. See the sketch right after this list.\n",
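    "\n",
    "A minimal sketch of the three steps, assuming ```X_train``` and ```y_train``` are placeholders for your full training matrix and target (the search here uses a plain LASSO penalty just to keep the example short):\n",
    "\n",
    "```python\n",
    "from sklearn.linear_model import LogisticRegression, LogisticRegressionCV\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# X_train / y_train: placeholders for the full training matrix and target.\n",
    "# 1. Take a validation sample (here 20%) just for the hyperparameter search.\n",
    "X_val, _, y_val, _ = train_test_split(X_train, y_train, train_size=0.2,\n",
    "                                      stratify=y_train, random_state=20190301)\n",
    "\n",
    "# 2. Cross-validate on the validation sample to find C*.\n",
    "search = LogisticRegressionCV(Cs=10, cv=3, penalty='l1', solver='saga',\n",
    "                              class_weight='balanced', max_iter=100,\n",
    "                              random_state=20190301).fit(X_val, y_val)\n",
    "C_star = search.C_[0]\n",
    "\n",
    "# 3. Train on all of the training data with the fixed parameter C*.\n",
    "final_model = LogisticRegression(C=C_star, penalty='l1', solver='saga',\n",
    "                                 class_weight='balanced', max_iter=100,\n",
    "                                 random_state=20190301).fit(X_train, y_train)\n",
    "```\n",
    "\n",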
    "The ```LogisticRegression``` object has another interesting parameter for big data models: ```warm_start```.\n",
    "\n",
    "**Warm start**\n",
    "\n",
    "Scikit-learn allows for multiple adjustments to the training. For example, you can first fit with a little bit of data just to check that everything is working and then, if you set ```warm_start = True```, the next call to ```fit``` will retrain starting from the previously learned parameters. This also allows for dynamic updating. ```warm_start = False``` means that whenever we give it new data, it will start from scratch, forgetting what it previously learned.\n",
    "\n",
    "**l1_ratios**\n",
    "\n",
    "These are the balance parameters between LASSO and Ridge for the ElasticNet optimization, with 0 <= l1_ratio <= 1. A value of 0 is equivalent to using penalty='l2', while 1 is equivalent to using penalty='l1'."
] }, { "cell_type": "markdown", "metadata": { "id": "QDmULPDedbKx" }, "source": [
    "## Training!\n",
    "\n",
    "Now we are ready to train. We simply apply the method ```fit``` to our data, giving it the training set and the target variable as inputs."
] }, { "cell_type": "code", "source": [ "bankloan_train_WoE.columns" ], "metadata": { "id": "2tZTn7OMprvp" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "aGzK11UOdUDE" }, "source": [
    "bankloan_logreg.fit(X = bankloan_train_WoE.iloc[:, 1:].drop(columns=\"Income_woe\"), # All rows and from the second variable to the end, minus Income\n",
    "                    y = bankloan_train_WoE['Default'] # The target\n",
    "                    )"
], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "hpaX6Nz5fS31" }, "source": [
    "Let's read the output:\n",
    "\n",
    "```convergence after 25 epochs took 0 seconds```\n",
    "\n",
    "The method was able to find a solution at the given tolerance; it took 25 epochs (your exact number may differ) and almost no time. **If the method says it did not converge, you need to increase the number of iterations, change C, or both!**\n",
    "\n",
    "The rest of the output describes what the solver did; it is not relevant at this stage.\n",
    "\n",
    "Done! We have a logistic regression! Let's check the parameters, sorted into a nice table.\n"
] }, { "cell_type": "code", "metadata": { "id": "l7Y-z7naeZk9" }, "source": [
    "coef_df = pd.concat([pd.DataFrame({'column': bankloan_train_WoE.columns[np.r_[1:6,7]]}),\n",
    "                     pd.DataFrame(np.transpose(bankloan_logreg.coef_))],\n",
    "                    axis = 1\n",
    "                    )\n",
    "\n",
    "coef_df"
], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "51iSCJE6gPo0" }, "source": [ "We can see the parameter for each variable now. This does not include the constant, which we can get with:" ] }, { "cell_type": "code", "metadata": { "id": "NpNZLhOOfytV" }, "source": [ "bankloan_logreg.intercept_" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "W4HYfdaXnJPk" }, "source": [
    "We can see all variables are being used, and the intercept is really close to 0. This is expected in a balanced logistic regression that uses the WoE transform, and it is a way to check that everything is working as intended.\n",
    "\n",
    "We can also see that all coefficients are correctly determined, even in the presence of correlations. This happens because the **ElasticNet penalty deals with correlations gracefully**. This would NOT be the case with a pure LASSO regression. Try it yourself and see. In that case, you would need to manually eliminate variables so everything works correctly."
] }, { "cell_type": "markdown", "metadata": { "id": "19LLh3CzGEyP" }, "source": [
    "We can also check the optimal hyperparameters found.\n",
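    "\n",
    "Alongside them, it is worth confirming that the solver stayed within ```max_iter``` on every fold; the fitted object stores the iteration counts in its ```n_iter_``` attribute. A quick, illustrative check:\n",
    "\n",
    "```python\n",
    "# Largest iteration count across classes, folds and candidate hyperparameters;\n",
    "# if this equals max_iter (100 here), the solver may not have converged.\n",
    "print(bankloan_logreg.n_iter_.max())\n",
    "```"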
] }, { "cell_type": "code", "metadata": { "id": "LS6iDSe0M3Fy" }, "source": [
    "print(bankloan_logreg.l1_ratio_)\n",
    "print(bankloan_logreg.C_)"
], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "tyhGIFcfn1sO" }, "source": [
    "## Applying to the test set\n",
    "\n",
    "We can now apply the model to the test set and check the results. Most models in scikit-learn have a ```predict``` method, which applies the model to new data and gives the 0/1 class prediction. Alternatively (and more usefully) we can use the ```predict_proba``` method, which gives the probability of each class."
] }, { "cell_type": "code", "metadata": { "id": "L8ckdDCGntCt" }, "source": [
    "pred_class_test = bankloan_logreg.predict(bankloan_test_WoE.iloc[:, 1:].drop(columns=\"Income_woe\"))\n",
    "probs_test = bankloan_logreg.predict_proba(bankloan_test_WoE.iloc[:, 1:].drop(columns=\"Income_woe\"))\n",
    "print(probs_test[0:5], pred_class_test[0:5])"
], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "q3BPKP7Zope6" }, "source": [
    "Scikit-learn will give, by default, one probability per class. The second column is the one that applies to the class Default = 1.\n",
    "\n",
    "We will now get the confusion matrix to check our accuracy. It is included in the subpackage ```sklearn.metrics```, and we will plot it using seaborn."
] }, { "cell_type": "code", "metadata": { "id": "DvllRUyvoj8-" }, "source": [ "from sklearn.metrics import confusion_matrix" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "seZjBboipd6G" }, "source": [
    "\n",
    "# Calculate confusion matrix\n",
    "confusion_matrix_cs = confusion_matrix(y_true = bankloan_test_WoE['Default'],\n",
    "                                       y_pred = pred_class_test)\n",
    "\n",
    "\n",
    "# Normalize each row (true class) so entries are rates\n",
    "confusion_matrix_cs = confusion_matrix_cs.astype('float') / confusion_matrix_cs.sum(axis=1)[:, np.newaxis]\n",
    "\n",
    "# Turn to dataframe\n",
    "df_cm = pd.DataFrame(\n",
    "    confusion_matrix_cs, index=['Non Defaulter', 'Defaulter'],\n",
    "    columns=['Non Defaulter', 'Defaulter'],\n",
    ")\n",
    "\n",
    "# Parameters of the image\n",
    "figsize = (10,7)\n",
    "fontsize=14\n",
    "\n",
    "# Create image\n",
    "fig = plt.figure(figsize=figsize)\n",
    "heatmap = sns.heatmap(df_cm, annot=True, fmt='.2f')\n",
    "\n",
    "# Make it nicer\n",
    "heatmap.yaxis.set_ticklabels(heatmap.yaxis.get_ticklabels(), rotation=0,\n",
    "                             ha='right', fontsize=fontsize)\n",
    "heatmap.xaxis.set_ticklabels(heatmap.xaxis.get_ticklabels(), rotation=45,\n",
    "                             ha='right', fontsize=fontsize)\n",
    "\n",
    "# Add labels\n",
    "plt.ylabel('True label')\n",
    "plt.xlabel('Predicted label')\n",
    "\n",
    "# Plot!\n",
    "plt.show()"
], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "FwMDtGwdq35F" }, "source": [ "Pretty good model! Let's create a scorecard now!" ] }, { "cell_type": "markdown", "metadata": { "id": "smaB3Mq3OKAk" }, "source": [
    "## Scorecards\n",
    "\n",
    "The package ```scorecardpy``` has the function ```scorecard```, which receives the WoE bins, a logistic regression model trained over the WoE-transformed data **of those same variables**, and the list of matching column names (that is, the columns, in order, on which the model was trained). As optional arguments it receives a PDO (points to double the odds), a base score, and decimal base odds (so odds of 1:50 are given as 0.02).\n",
    "\n",
    "You should adjust these values so the scores fall in an acceptable range, typically between 0 and 1000.\n",
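    "\n",
    "To build some intuition for these arguments, the usual scaling arithmetic looks roughly like the sketch below. This is only an illustration of the PDO idea (odds are written as bads:goods to match the comment in the next cell); the exact formula and conventions are handled internally by ```scorecardpy```.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "points0, odds0, pdo = 750, 0.01, 50        # same values as in the next cell\n",
    "factor = pdo / np.log(2)                   # scale implied by the points-to-double-the-odds\n",
    "offset = points0 + factor * np.log(odds0)  # anchors the scale: odds0 maps to points0\n",
    "\n",
    "# Example: a borrower with an estimated default probability of 5%\n",
    "p_default = 0.05\n",
    "odds_bad = p_default / (1 - p_default)\n",
    "score = offset - factor * np.log(odds_bad)\n",
    "print(round(score))\n",
    "```"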
] }, { "cell_type": "code", "metadata": { "id": "OCipEb4CO-RJ" }, "source": [ "bankloan_sc = sc.scorecard(bins_adj, # bins from the WoE\n", " bankloan_logreg, # Trained logistic regression\n", " bankloan_train_WoE.columns[np.r_[1:6,7]], # The column names in the trained LR\n", " points0=750, # Base points\n", " odds0=0.01, # Base odds bads:goods\n", " pdo=50\n", " ) # PDO\n" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "5m82EUFKPDTR" }, "source": [ "bankloan_sc" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "2z72hOjuPD1p" }, "source": [ "# Applying the credit score. Applies over the original data!\n", "train_score = sc.scorecard_ply(bankloan_train_noWoE, bankloan_sc,\n", " print_step=0)\n", "test_score = sc.scorecard_ply(bankloan_test_noWoE, bankloan_sc,\n", " print_step=0)" ], "execution_count": null, "outputs": [] }, { "cell_type": "code", "metadata": { "id": "5XxmR_1SPIBP" }, "source": [ "train_score.describe()" ], "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ftudtXeaPMEI" }, "source": [ "And that's it! We have a fully functional credit scorecard. In later labs we will contrast this with two more models: a Random Forest and an XGBoost model." ] } ] }