{ "cells": [ { "cell_type": "code", "execution_count": null, "source": [ "# let's set things up\r\n", "from IPython.core.interactiveshell import InteractiveShell\r\n", "import matplotlib as mpl\r\n", "import matplotlib.pyplot as plt\r\n", "import numpy as np\r\n", "import pandas as pd\r\n", "import seaborn as sns\r\n", "\r\n", "InteractiveShell.ast_node_interactivity = \"all\"\r\n", "%matplotlib inline\r\n", "plt.style.use('default')\r\n", "sns.set()\r\n", "pd.options.display.float_format = '{:,.2f}'.format" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# original data source: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt\r\n", "db_loc = \"./data/diabetes.tab.txt\"\r\n", "db_data = pd.read_csv(db_loc, sep='\\t', header=(0))" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# let's add categorical column based on TSH level\r\n", "def lbl_tsh(row):\r\n", " if row['S4'] < 0.5:\r\n", " return \"low\"\r\n", " elif row['S4'] <= 2.0:\r\n", " return \"ref low\"\r\n", " elif row['S4'] <= 4.0:\r\n", " return \"ref high\"\r\n", " elif row['S4'] <= 6.0:\r\n", " return \"high\"\r\n", " else:\r\n", " return \"very high\"\r\n", "\r\n", "db_data['tsh'] = db_data.apply(lbl_tsh, axis=1)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# let's also classify/categorize BMI\r\n", "\r\n", "def lbl_bmi(row):\r\n", " if row['BMI'] < 18.5:\r\n", " return \"underweight\"\r\n", " elif row['BMI'] < 25.0:\r\n", " return \"normal\"\r\n", " elif row['BMI'] < 30.0:\r\n", " return \"overweight\"\r\n", " else:\r\n", " return \"obese\"\r\n", "\r\n", "db_data['bmi_class'] = db_data.apply(lbl_bmi, axis=1)" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# for reference\r\n", "db_data.head()" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# 
let's start with something simple\r\n", "# I am going to show both regplot and lmplot, but as regplot's features are a subset of those of lmplot\r\n", "# I will use lmplot in future and possibly other Seaborn plots\r\n", "# Note the use of semicolons to prevent output from the plot functions\r\n", "sns.regplot(x='BMI', y=\"Y\", data=db_data);\r\n", "sns.lmplot(x='BMI', y=\"Y\", data=db_data);" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Other than the shape of the plot area, you will note that they are identical.\r\n", "\r\n", "Now what about our crazily distributed S4 attribute?" ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# note the use of jitter, try it without to see what happens\r\n", "sns.lmplot(x='S4', y=\"Y\", data=db_data, x_jitter=.075);" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# let's try splitting by sex and using an estimator\r\n", "# for comparison let's first display lmplot split by sex, without the estimator\r\n", "sns.lmplot(x='S4', y=\"Y\", data=db_data, col='SEX', hue='SEX', x_jitter=.075);" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# now the same split by sex, this time using an estimator (the mean of Y at each unique S4 value)\r\n", "sns.lmplot(x='S4', y=\"Y\", data=db_data, x_estimator=np.mean, col='SEX', hue='SEX');" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "In the last plot, you can see where some of the data was collapsed into a mean along with a confidence interval.\r\n", "\r\n", "In this case a goodly number of the means lie along the fitted regression line, especially at higher values. Not quite sure how to interpret that information, but it likely means something.\r\n", "\r\n", "I did look at trying to reduce the extra points, but didn't like any of the results. Don't know enough to understand the consequences of the parameters I tried (*x_bins* and *x_ci*). 
But here's a look using *x_bins*, first with a bin count and then with explicit bin centres." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# just so we can see something different, let's try 40 bins for each\r\n", "sns.lmplot(x='S4', y=\"Y\", data=db_data, x_estimator=np.mean, col='SEX', hue='SEX', x_bins=40);" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# now let's try explicit bin centres: 16 of them, from 2.0 to 9.5 in steps of 0.5\r\n", "b_cntr = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]\r\n", "sns.lmplot(x='S4', y=\"Y\", data=db_data, x_estimator=np.mean, col='SEX', hue='SEX', x_bins=b_cntr);" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Well, that's better. Though I'm still not sure exactly what it is telling us." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# let's draw lmplot for each continuous variable with respect to sex\r\n", "# I'll do this in batches for easier viewing\r\n", "batch1 = ['AGE', 'BMI', 'BP']\r\n", "batch2 = ['S1', 'S2', 'S3']\r\n", "batch3 = ['S4', 'S5', 'S6']" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# couldn't get it to work with a facetgrid or with matplotlib subplots\r\n", "for var in batch1:\r\n", " sns.lmplot(data=db_data, x=var, y='Y', col='SEX', hue='SEX', palette=\"Set2\", height=4, aspect=1.3);" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "for var in batch2:\r\n", " sns.lmplot(data=db_data, x=var, y='Y', col='SEX', hue='SEX', palette=\"Set2\", height=4, aspect=1.3);" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "for var in batch3:\r\n", " sns.lmplot(data=db_data, x=var, y='Y', col='SEX', hue='SEX', palette=\"Set2\", height=4, aspect=1.3);" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# took a bit of research, but sorted 
something\r\n", "fig, axs = plt.subplots(3, 2, figsize=(24,20), sharey=True)\r\n", "parameters = {'axes.labelsize': 18,\r\n", " 'axes.titlesize': 22}\r\n", "plt.rcParams.update(parameters)\r\n", "y_s1 = db_data.loc[db_data['SEX'] == 1, \"Y\"]\r\n", "y_s2 = db_data.loc[db_data['SEX'] == 2, \"Y\"]\r\n", "# y_s1.head()\r\n", "# y_s2.head()\r\n", "sns.regplot(x=db_data.loc[db_data['SEX'] == 1, \"AGE\"], y=y_s1, color='g', ax=axs[0,0]);\r\n", "# axs[0,0].set_title('SEX = 1', fontsize=18);\r\n", "axs[0,0].set_title('SEX = 1');\r\n", "# axs[0,0].set_ylabel(\"Y \", rotation=\"horizontal\", fontsize=\"large\");\r\n", "axs[0,0].set_ylabel(\"Y \", rotation=\"horizontal\");\r\n", "sns.regplot(x=db_data.loc[db_data['SEX'] == 2, \"AGE\"], y=y_s2, color='orange', ax=axs[0,1]);\r\n", "axs[0,1].set_title('SEX = 2');\r\n", "sns.regplot(x=db_data.loc[db_data['SEX'] == 1, \"BMI\"], y=y_s1, color='g', ax=axs[1,0]);\r\n", "sns.regplot(x=db_data.loc[db_data['SEX'] == 2, \"BMI\"], y=y_s2, color='orange', ax=axs[1,1]);\r\n", "sns.regplot(x=db_data.loc[db_data['SEX'] == 1, \"BP\"], y=y_s1, color='g', ax=axs[2,0]);\r\n", "sns.regplot(x=db_data.loc[db_data['SEX'] == 2, \"BP\"], y=y_s2, color='orange', ax=axs[2,1]);\r\n", "axs[1,0].set_ylabel(\"Y \", rotation=\"horizontal\");\r\n", "axs[2,0].set_ylabel(\"Y \", rotation=\"horizontal\");" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "\r\n", "fig1, axs1 = plt.subplots(len(batch2), 2, figsize=(24,20), sharey=True)\r\n", "# y_s1.head()\r\n", "# y_s2.head()\r\n", "for i, var in enumerate(batch2):\r\n", " sns.regplot(x=db_data.loc[db_data['SEX'] == 1, var], y=y_s1, color='g', ax=axs1[i,0]);\r\n", " sns.regplot(x=db_data.loc[db_data['SEX'] == 2, var], y=y_s2, color='orange', ax=axs1[i,1]);\r\n", "axs1[0,0].set_title('SEX = 1');\r\n", "axs1[0,1].set_title('SEX = 2');\r\n", "axs1[0,0].set_ylabel(\"Y \", rotation=\"horizontal\");\r\n", "axs1[1,0].set_ylabel(\"Y \", 
rotation=\"horizontal\");\r\n", "axs1[2,0].set_ylabel(\"Y \", rotation=\"horizontal\");" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "fig2, axs2 = plt.subplots(len(batch3), 2, figsize=(24,20), sharey=True)\r\n", "# y_s1.head()\r\n", "# y_s2.head()\r\n", "for i, var in enumerate(batch3):\r\n", " sns.regplot(x=db_data.loc[db_data['SEX'] == 1, var], y=y_s1, color='g', ax=axs2[i,0]);\r\n", " sns.regplot(x=db_data.loc[db_data['SEX'] == 2, var], y=y_s2, color='orange', ax=axs2[i,1]);\r\n", " axs2[i,0].set_ylabel(\"Y \", rotation=\"horizontal\");\r\n", "axs2[0,0].set_title('SEX = 1');\r\n", "axs2[0,1].set_title('SEX = 2');\r\n" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "Well, those plots all look a lot alike to me. Not sure I am getting a lot of info from them.\r\n", "\r\n", "But let's try something else. Mostly as an example of what can be done." ], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "sns.set(font_scale=1.4)\r\n", "g = sns.lmplot(data=db_data, x=\"BP\", y=\"Y\", col=\"bmi_class\", col_order=['underweight', 'normal', 'overweight', 'obese'], row=\"SEX\", hue=\"SEX\", facet_kws={'margin_titles':True})" ], "outputs": [], "metadata": {} }, { "cell_type": "code", "execution_count": null, "source": [ "# let's try that using a facetgrid and regplot()\r\n", "sns.set(font_scale=1)\r\n", "g = sns.FacetGrid(db_data, col=\"tsh\", col_order=['ref low', 'ref high', 'high', 'very high'], row=\"SEX\", hue=\"SEX\", margin_titles=True)\r\n", "g.map_dataframe(sns.regplot, x=\"BMI\", y=\"Y\");" ], "outputs": [], "metadata": {} }, { "cell_type": "markdown", "source": [ "I find some of those charts interesting, but I really don't know what they are trying to tell me.\r\n", "\r\n", "So, I think I am going to leave it here for now.\r\n", "\r\n", "Started *Machine Learning with Python-From Linear Models to Deep Learning* (MITx) last week. 
The math is driving me nuts and taking a lot of time. Have to admit I was not really well focused on the subject at hand in this notebook. Sorry.\r\n", "\r\n", "Am concerned I'll be missing a post or two as a result." ], "metadata": {} } ], "metadata": { "orig_nbformat": 4, "language_info": { "name": "python", "version": "3.9.2", "mimetype": "text/x-python", "codemirror_mode": { "name": "ipython", "version": 3 }, "pygments_lexer": "ipython3", "nbconvert_exporter": "python", "file_extension": ".py" }, "kernelspec": { "name": "python3", "display_name": "Python 3.9.2 64-bit ('ds-3.9': conda)" }, "interpreter": { "hash": "a27d3f2bf68df5402465348834a2195030d3fc5bfc8e594e2a17c8c7e2447c85" } }, "nbformat": 4, "nbformat_minor": 2 }