{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# let's set things up\r\n",
    "from IPython.core.interactiveshell import InteractiveShell\r\n",
    "import matplotlib as mpl\r\n",
    "import matplotlib.pyplot as plt\r\n",
    "import numpy as np\r\n",
    "import pandas as pd\r\n",
    "import seaborn as sns\r\n",
    "\r\n",
    "InteractiveShell.ast_node_interactivity = \"all\"\r\n",
    "%matplotlib inline\r\n",
    "plt.style.use('default')\r\n",
    "sns.set()\r\n",
    "pd.options.display.float_format = '{:,.2f}'.format"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# original data source: https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt\r\n",
    "db_loc = \"./data/diabetes.tab.txt\"\r\n",
    "db_data = pd.read_csv(db_loc, sep='\\t', header=0)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# disease progression (Y) for each sex\r\n",
    "sns.violinplot(x=\"SEX\", y=\"Y\", data=db_data, hue='SEX', orient='v', palette=\"Set3\");"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# let's look at some variations\r\n",
    "ax = sns.violinplot(x=\"SEX\", y=\"Y\", inner='quartile', data=db_data)\r\n",
    "ax.set_title('Distribution of disease progression', fontsize=16);"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "The medians are almost equal, as are the first quartiles. The upper quartile is a little higher for sex 2 than for sex 1, which suggests slightly more dispersed results for sex 2. Overall, the two distributions are very similar.\r\n",
    "\r\n",
    "Unfortunately, this kind of assessment only works when categorical features are present in the dataset, and it would be nice to have more than one. So, let's add another one."
   ],
   "metadata": {}
  },
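  {
   "cell_type": "markdown",
   "source": [
    "The quartile lines inside the violins can be hard to compare by eye. As a rough check, assuming `db_data` is loaded as above, the next cell tabulates the quartiles of `Y` for each `SEX` value."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# quartiles of Y per sex, one row per SEX value and one column per quantile\r\n",
    "db_data.groupby('SEX')['Y'].quantile([0.25, 0.5, 0.75]).unstack()"
   ],
   "outputs": [],
   "metadata": {}
  },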
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
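    "# let's add a categorical column based on S4, read here as a TSH level\r\n",
    "# (note: scikit-learn's description of this dataset lists s4 as tch, total cholesterol / HDL)\r\n",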
    "def lbl_tsh(row):\r\n",
    "  if row['S4'] < 0.5:\r\n",
    "    return \"low\"\r\n",
    "  elif row['S4'] <= 2.0:\r\n",
    "    return \"ref low\"\r\n",
    "  elif row['S4'] <= 4.0:\r\n",
    "    return \"ref high\"\r\n",
    "  elif row['S4'] <= 6.0:\r\n",
    "    return \"high\"\r\n",
    "  else:\r\n",
    "    return \"very high\"\r\n",
    "\r\n",
    "db_data['tsh'] = db_data.apply(lbl_tsh, axis=1)"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "ax = sns.violinplot(x=\"tsh\", y=\"Y\", hue=\"SEX\", split=True, data=db_data, order=['ref low', 'ref high', 'high', 'very high'])\r\n",
    "ax.set_title('Distribution of progression by TSH level', fontsize=16);\r\n",
    "plt.legend(loc='lower right');"
   ],
   "outputs": [],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "Okay. The median of disease progression appears to increase with TSH level, but given how much the distributions overlap, that hardly seems conclusive.\r\n",
    "\r\n",
    "The distributions for the two sexes are similar in the three higher classes, but the sex 2 distribution is noticeably different when TSH is in the bottom half of the reference range. That may suggest that, for sex 2 cases, TSH in the *ref low* class has no connection with disease progression."
   ],
   "metadata": {}
  },
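  {
   "cell_type": "markdown",
   "source": [
    "Violin shapes depend on how many cases sit in each group, and a group with only a few cases can look different by chance. A quick sketch of a check, assuming the `tsh` column created above: count the cases in each TSH class by sex."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# number of cases in each TSH class (rows) for each SEX value (columns)\r\n",
    "pd.crosstab(db_data['tsh'], db_data['SEX'])"
   ],
   "outputs": [],
   "metadata": {}
  },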
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# let's try classifying BMI\r\n",
    "\r\n",
    "def lbl_bmi(row):\r\n",
    "  if row['BMI'] < 18.5:\r\n",
    "    return \"underweight\"\r\n",
    "  elif row['BMI'] < 25.0:\r\n",
    "    return \"normal\"\r\n",
    "  elif row['BMI'] < 30.0:\r\n",
    "    return \"overweight\"\r\n",
    "  else:\r\n",
    "    return \"obese\"\r\n",
    "\r\n",
    "db_data['bmi_class'] = db_data.apply(lbl_bmi, axis=1)"
   ],
   "outputs": [],
   "metadata": {}
  },
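  {
   "cell_type": "markdown",
   "source": [
    "The row-wise `apply` above works, but the same binning can probably be expressed with `pd.cut`. The cell below is a sketch only: the names `bmi_bins`, `bmi_labels` and `bmi_cut` are made up for illustration, and it assumes `BMI` has no missing values (for a missing value `lbl_bmi` would return obese, while `pd.cut` returns NaN)."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# hypothetical names, used only for this comparison\r\n",
    "bmi_bins = [-np.inf, 18.5, 25.0, 30.0, np.inf]\r\n",
    "bmi_labels = ['underweight', 'normal', 'overweight', 'obese']\r\n",
    "\r\n",
    "# right=False closes each bin on the left, matching the `<` tests in lbl_bmi\r\n",
    "bmi_cut = pd.cut(db_data['BMI'], bins=bmi_bins, labels=bmi_labels, right=False)\r\n",
    "\r\n",
    "# True if both approaches assign the same class to every row\r\n",
    "(bmi_cut.astype(str) == db_data['bmi_class']).all()"
   ],
   "outputs": [],
   "metadata": {}
  },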
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "ax = sns.violinplot(x=\"bmi_class\", y=\"Y\", hue=\"SEX\", split=True, data=db_data, order=['underweight', 'normal', 'overweight', 'obese'])\r\n",
    "ax.set_title('Distribution of progression by BMI level', fontsize=16);\r\n",
    "plt.legend(loc='upper left');"
   ],
   "outputs": [],
   "metadata": {}
  },
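  {
   "cell_type": "markdown",
   "source": [
    "To read exact numbers off the split violins, a small table helps. Assuming the `bmi_class` column from above, this sketch lists the case count and median of `Y` for every BMI class and sex combination that occurs in the data."
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# case count and median progression for each (BMI class, sex) pair present in the data\r\n",
    "db_data.groupby(['bmi_class', 'SEX'])['Y'].agg(['count', 'median'])"
   ],
   "outputs": [],
   "metadata": {}
  },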
  {
   "cell_type": "markdown",
   "source": [
    "Wow, there are no cases of underweight sex 1 individuals. Again, median disease progression appears to increase in the overweight and obese classes. The median looks identical for the two lower classes (underweight and normal).\r\n",
    "\r\n",
    "The distributions for the two sexes also look slightly different in the two higher classes. However, in the three higher classes, the peaks (whether uni- or multimodal) appear to go up with the classification.\r\n",
    "\r\n",
    "That was fun, but I think I will leave it there. Lots more to learn; but, perhaps, a step in the right direction."
   ],
   "metadata": {}
  }
 ],
 "metadata": {
  "orig_nbformat": 4,
  "language_info": {
   "name": "python",
   "version": "3.9.2",
   "mimetype": "text/x-python",
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "pygments_lexer": "ipython3",
   "nbconvert_exporter": "python",
   "file_extension": ".py"
  },
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3.9.2 64-bit ('ds-3.9': conda)"
  },
  "interpreter": {
   "hash": "a27d3f2bf68df5402465348834a2195030d3fc5bfc8e594e2a17c8c7e2447c85"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}