{ "cells": [ { "cell_type": "markdown", "id": "3da7c691-7c3c-4ca9-826d-a2e3ed5dc510", "metadata": {}, "source": [ "# Xepelein Challenge\n", "## Answer 2: Model training" ] }, { "cell_type": "markdown", "id": "cf6f976f-03ca-4717-abdf-e31e01d37f36", "metadata": {}, "source": [
"- A LightGBM classification model is trained.\n",
"- The prepare_data method converts the dependent variable overdueDays into 0 / 1 depending on whether the payment is more than 30 days overdue, so that 1 indicates an alert. It also drops the business and payer ids, since I consider that the other variables already carry the relevant information about them (and because there is too little data to use the ids as categorical variables).\n",
"- The impute_missing method (suggested in the template) is removed because this type of model handles nulls natively and because I have no better heuristic for filling them (besides, the provided data contains no nulls).\n",
"- The fit method trains the LGBM model.\n",
"- The model_summary method reports accuracy, precision, recall and F1, as well as a confusion matrix, all of which are relevant for assessing the quality of a classification model.\n",
"- The predict method makes predictions.\n",
"\n",
"- Metrics obtained:\n",
"  - Model Evaluation:\n",
"    - Accuracy: 0.85\n",
"    - Precision: 0.32\n",
"    - Recall: 0.15\n",
"    - F1 Score: 0.20\n",
"\n",
"- Comment: The model scores well on accuracy (0.85), although with only about 14% positive cases this mostly reflects the class imbalance. Given the nature of the business problem, recall is the key metric for this model, since failing to detect a true alert is relatively more costly (that is, lending money and not being repaid within 30 days of the due date). Here the model has low recall (0.15) and low precision (0.32), but I have not focused on improving these metrics given the instructions of the exercise, which focus on putting the model into production rather than on its performance metrics.\n" ] },
{ "cell_type": "code", "execution_count": 5, "id": "bef9dcc3-b957-4175-a6aa-ab71c9936617", "metadata": {}, "outputs": [], "source": [ "import pickle\n", "import pandas as pd\n", "import numpy as np\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score\n", "from lightgbm import LGBMClassifier\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt" ] },
{ "cell_type": "code", "execution_count": 6, "id": "6f5eeef6-336f-4cf1-b5db-4e97ba0d3b41", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "count 3000.000000\n", "mean 18.477000\n", "std 77.756463\n", "min -169.000000\n", "25% -7.000000\n", "50% 0.000000\n", "75% 10.000000\n", "max 539.000000\n", "Name: overdueDays, dtype: float64" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv(\"data/dataTest.csv\")\n", "df = df.drop([\"Unnamed: 0\"], axis = 1)\n", "df.overdueDays.describe()" ] },
{ "cell_type": "code", "execution_count": 7, "id": "6f2b16d5-abdd-4282-bb86-fbfb09f7fee4", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "invoiceId 0\n", "businessId 0\n", "payerId 0\n", "receiptAmount 0\n", "relationDays 0\n", "relationRecurrence 0\n", "issuerInvoicesAmount 0\n", "issuerCancelledInvoices 0\n", "activityDaysPayer 0\n", "clients12Months 0\n", "overdueDays 0\n", "dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna().sum()" ] },
{ "cell_type": "code", "execution_count": 8, "id": "193fb0d7-46c7-42c4-90ec-99dc674a4039", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "overdye_days_alert\n", "0 2583\n", "1 417\n",
"Name: count, dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v_dep = pd.Series( np.where(df[\"overdueDays\"] > 30, 1, 0), name = \"overdye_days_alert\")\n", "v_dep.value_counts()" ] }, { "cell_type": "code", "execution_count": 9, "id": "a2e49ccb-e67c-48e6-96e0-3876c3bd2199", "metadata": {}, "outputs": [], "source": [ "# guardo algunos datos para validar al final: \n", "df_train, df_validation = train_test_split(df, test_size=0.2, random_state=42)\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "5dd10d85-0b27-4318-899f-93f48a578c5a", "metadata": {}, "outputs": [], "source": [ "class Model:\n", " def __init__(self, sample_df: pd.DataFrame):\n", " \"\"\"\n", " Initialize the class.\n", " \"\"\"\n", " self.data = self.prepare_data(sample_df)\n", " self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(\n", " self.data.drop('overdueDays', axis=1),\n", " self.data['overdueDays'],\n", " test_size=0.2,\n", " random_state=42\n", " )\n", " self.model = None\n", "\n", " def prepare_data(self, df: pd.DataFrame = None) -> pd.DataFrame:\n", " \"\"\"\n", " Prepare data.\n", " \"\"\"\n", " df = df.drop([\"businessId\", \"payerId\"], axis=1)\n", " df = df.set_index(\"invoiceId\")\n", " df[\"overdueDays\"] = np.where(df[\"overdueDays\"] > 30, 1, 0)\n", " return df.copy()\n", "\n", " def fit(self) -> None:\n", " \"\"\"\n", " Fit the model on the training data passed in the constructor, assuming it has\n", " been prepared by the function prepare_data \n", " \"\"\"\n", " self.model = LGBMClassifier()\n", " self.model.fit(self.X_train, self.y_train)\n", "\n", " def model_summary(self) -> str:\n", " \"\"\"\n", " Create a short summary of the model you have fit.\n", " \"\"\"\n", " if self.model is not None:\n", " y_pred = self.model.predict(self.X_test)\n", " confusion_mat = confusion_matrix(self.y_test, y_pred)\n", " accuracy = accuracy_score(self.y_test, y_pred)\n", " precision = 
precision_score(self.y_test, y_pred)\n", " recall = recall_score(self.y_test, y_pred)\n", " f1 = f1_score(self.y_test, y_pred)\n", "\n", " summary = f\"Model Evaluation:\\nAccuracy: {accuracy}\\nPrecision: {precision}\\nRecall: {recall}\\nF1 Score: {f1}\\n\\nConfusion Matrix:\\n{confusion_mat}\"\n", " return summary\n", " else:\n", " return \"Model not fitted. Call the fit method first.\"\n", "\n", " def predict(self, df: pd.DataFrame = None) -> pd.Series:\n", " \"\"\"\n", " Make a set of predictions with the model.\n", " \"\"\"\n", " if self.model is not None:\n", " predictions = pd.Series(self.model.predict(df), index=df.index)\n", " return predictions\n", " else:\n", " raise ValueError(\"Model not fitted. Call the fit method first.\")\n", "\n", " def save(self, path: str) -> None:\n", " \"\"\"\n", " Save the model as .pkl\n", " \"\"\"\n", " if self.model is not None:\n", " with open(path, 'wb') as file:\n", " pickle.dump(self.model, file)\n", " else:\n", " raise ValueError(\"Model not fitted. 
Call the fit method first.\")" ] }, { "cell_type": "code", "execution_count": 11, "id": "4374aadd-6b9a-43d6-ad61-e5ed8a6f8596", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[LightGBM] [Info] Number of positive: 275, number of negative: 1645\n", "[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000389 seconds.\n", "You can set `force_row_wise=true` to remove the overhead.\n", "And if memory is not enough, you can set `force_col_wise=true`.\n", "[LightGBM] [Info] Total Bins 1672\n", "[LightGBM] [Info] Number of data points in the train set: 1920, number of used features: 7\n", "[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.143229 -> initscore=-1.788725\n", "[LightGBM] [Info] Start training from score -1.788725\n", "Model Evaluation:\n", "Accuracy: 0.8541666666666666\n", "Precision: 0.32142857142857145\n", "Recall: 0.15\n", "F1 Score: 0.20454545454545456\n", "\n", "Confusion Matrix:\n", "[[401 19]\n", " [ 51 9]]\n" ] } ], "source": [ "model = Model(df_train)\n", "model.fit()\n", "print(model.model_summary())" ] }, { "cell_type": "code", "execution_count": 12, "id": "73e75a06-d071-473b-b0a2-cf9cf285f7ef", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.15" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "9 / 60" ] }, { "cell_type": "code", "execution_count": 13, "id": "48461cc9-b482-4e55-9838-9a9574d80b3b", "metadata": {}, "outputs": [], "source": [ "df_validation = model.prepare_data(df_validation)\n", "X_validation = df_validation.drop(\"overdueDays\", axis=1)\n", "y_true = df_validation[\"overdueDays\"]\n", "y_pred = model.predict(X_validation)\n" ] }, { "cell_type": "code", "execution_count": 14, "id": "82ed07b1-1016-4f0c-bc8c-bd269e3858fb", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "conf_matrix = confusion_matrix(y_true, y_pred)\n", "plt.figure(figsize=(8, 6))\n", "sns.heatmap(conf_matrix, annot=True, fmt=\"d\", cmap=\"Blues\", xticklabels=[\"Not Overdue\", \"Overdue\"], yticklabels=[\"Not Overdue\", \"Overdue\"])\n", "plt.xlabel(\"Predicted\")\n", "plt.ylabel(\"True\")\n", "plt.title(\"Confusion Matrix\")\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "e2f7bc84-a889-4de9-b1af-82821be3db34", "metadata": {}, "source": [ "Comentario: Los colores aquí pueden ser engañosos. Es preciso prestar atención a la fila de abajo, que es la que muestra los casos de mora, y en partticular, al cuadrante izquierdo, que muestra los casos en que tuvieron mora pero se predijo que no la tendrían, ya que estos son cruciales para nuestro problema de negocio. " ] }, { "cell_type": "code", "execution_count": 15, "id": "e807f315-7519-4ca7-b167-eaab45107159", "metadata": {}, "outputs": [], "source": [ "model.save(\"model/lightgbm_deuda.pkl\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 5 }