Machine-Learning-Models/ProstateCancerClassificationModel/main_inWorking.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Prostate Cancer Classification Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction + Set-up"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This tutorial will explain how to utilize TPUs efficiently, load in image data, build and train a convolution neural network, finetune and regularize the model, and predict results. Data augmentation is not included in the model because X-ray scans are only taken in a specific orientation, and variations such as flips and rotations will not exist in real X-ray images.\n",
    "\n",
    "Run the following cell to load the necessary packages. Make sure to change the Accelerator on the right to TPU."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "TensorFlow is a powerful tool to develop any machine learning pipeline, and today we will go over how to load Image+CSV combined datasets, how to use Keras preprocessing layers for image augmentation, and how to use pre-trained models for image classification.\n",
    "\n",
    "Skeleton code for the DataGenerator Sequence subclass is credited to Xie29's NB.\n",
    "\n",
    "Run the following cell to import the necessary packages. We will be using the GPU accelerator to efficiently train our model. Remember to change the accelerator on the right to GPU. We won't be using a TPU for this notebook because data generators are not safe to run on multiple replicas. If a TPU is not used, change the TPU_used variable to False."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Checking System\n",
      "\n",
      "Tensorflow version : 2.11.0-dev20220812\n",
      "Number of replicas: 1\n",
      "2.11.0-dev20220812\n",
      "\n",
      "System check done in 0.0063 seconds\n",
      "System Checked!\n"
     ]
    }
   ],
   "source": [
    "from email.mime import image\n",
    "import os\n",
    "from PIL import Image\n",
    "import time\n",
    "import math\n",
    "import warnings\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import tensorflow as tf\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import train_test_split\n",
    "from keras.utils import load_img\n",
    "\n",
    "start_time = time.perf_counter()\n",
    "print(\"Checking System\\n\")\n",
    "\n",
    "SEED = 1337\n",
    "print('Tensorflow version : {}'.format(tf.__version__))\n",
    "\n",
    "try:\n",
    "    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()\n",
    "    tf.config.experimental_connect_to_cluster(tpu)\n",
    "    tf.tpu.experimental.initialize_tpu_system(tpu)\n",
    "    strategy = tf.distribute.experimental.TPUStrategy(tpu)\n",
    "except ValueError:\n",
    "    strategy = tf.distribute.get_strategy() # for CPU and single GPU\n",
    "    print('Number of replicas:', strategy.num_replicas_in_sync)\n",
    "    \n",
    "print(tf.__version__)\n",
    "\n",
    "end_time = time.perf_counter()\n",
    "\n",
    "print(f\"\\nSystem check done in {end_time - start_time:0.4f} seconds\")\n",
    "print(\"System Checked!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data loading"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Chest X-ray data we are using from Cell divides the data into train, val, and test files. There are only 16 files in the validation folder, and we would prefer to have a less extreme division between the training and the validation set. We will append the validation files and create a new split that resembes the standard 80:20 division instead.\n",
    "\n",
    "The first step is to load in our data. The original PANDA dataset contains large images and masks that specify which area of the mask led to the ISUP grade (determines the severity of the cancer). Since the original images contain a lot of white space and extraneous data that is not necessary for our model, we will be using tiles to condense the images. Basically, the tiles are small sections of the masked areas, and these tiles can be concatenated together so the only the masked sections of the original image remains."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loading Files...\n",
      "\n",
      "Directories detected:\n",
      "./DataFrames/ProstateCancerGradeAssessment\n",
      "./DataFrames/PandaTiles/train\n",
      "./DataFrames/PandaTiles/masks\n",
      "####DATAFRAME INFO####\n",
      "                               image_id data_provider  isup_grade  \\\n",
      "0      0005f7aaab2800f6170c399693a96917    karolinska           0   \n",
      "1      000920ad0b612851f8e01bcc880d9b3d    karolinska           0   \n",
      "2      0018ae58b01bdadc8e347995b69f99aa       radboud           4   \n",
      "3      001c62abd11fa4b57bf7a6c603a11bb9    karolinska           4   \n",
      "4      001d865e65ef5d2579c190a0e0350d8f    karolinska           0   \n",
      "...                                 ...           ...         ...   \n",
      "10611  ffd2841373b39792ab0c84cccd066e31       radboud           0   \n",
      "10612  ffdc59cd580a1468eac0e6a32dd1ff2d       radboud           5   \n",
      "10613  ffe06afd66a93258f8fabdef6044e181       radboud           0   \n",
      "10614  ffe236a25d4cbed59438220799920749       radboud           2   \n",
      "10615  ffe9bcababc858e04840669e788065a1       radboud           4   \n",
      "\n",
      "      gleason_score  \n",
      "0               0+0  \n",
      "1               0+0  \n",
      "2               4+4  \n",
      "3               4+4  \n",
      "4               0+0  \n",
      "...             ...  \n",
      "10611      negative  \n",
      "10612           4+5  \n",
      "10613      negative  \n",
      "10614           3+4  \n",
      "10615           4+4  \n",
      "\n",
      "[10616 rows x 4 columns]\n",
      "####DATAFRAME INFO####\n",
      "\n",
      "Directories detected in 0.1114 seconds\n",
      "Files are loaded!\n"
     ]
    }
   ],
   "source": [
    "print(\"Loading Files...\\n\")\n",
    "\n",
    "start_time = time.perf_counter()\n",
    "\n",
    "MAIN_DIR = './DataFrames/ProstateCancerGradeAssessment'\n",
    "TRAIN_IMG_DIR = './DataFrames/PandaTiles/train'\n",
    "TRAIN_MASKS_DIR = './DataFrames/PandaTiles/masks'\n",
    "train_csv = pd.read_csv('./DataFrames/ProstateCancerGradeAssessment/train.csv')\n",
    "\n",
    "print(\"Directories detected:\")\n",
    "print(MAIN_DIR)\n",
    "print(TRAIN_IMG_DIR)\n",
    "print(TRAIN_MASKS_DIR)\n",
    "print(\"####DATAFRAME INFO####\")\n",
    "print(train_csv)\n",
    "print(\"####DATAFRAME INFO####\")\n",
    "\n",
    "end_time = time.perf_counter()\n",
    "\n",
    "print(f\"\\nDirectories detected in {end_time - start_time:0.4f} seconds\")\n",
    "print(\"Files are loaded!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some of the images could not be converted to tiles because the masks were too small or the image was too noisy. We need to take these images out of our DataFrame so that we do not run into a `FileNotFoundError`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Renaming files for categorisation...\n",
      "####DATAFRAME INFO####\n",
      "                               image_id data_provider  isup_grade  \\\n",
      "0      0005f7aaab2800f6170c399693a96917    karolinska           0   \n",
      "1      000920ad0b612851f8e01bcc880d9b3d    karolinska           0   \n",
      "2      0018ae58b01bdadc8e347995b69f99aa       radboud           4   \n",
      "3      001c62abd11fa4b57bf7a6c603a11bb9    karolinska           4   \n",
      "4      001d865e65ef5d2579c190a0e0350d8f    karolinska           0   \n",
      "...                                 ...           ...         ...   \n",
      "10611  ffd2841373b39792ab0c84cccd066e31       radboud           0   \n",
      "10612  ffdc59cd580a1468eac0e6a32dd1ff2d       radboud           5   \n",
      "10613  ffe06afd66a93258f8fabdef6044e181       radboud           0   \n",
      "10614  ffe236a25d4cbed59438220799920749       radboud           2   \n",
      "10615  ffe9bcababc858e04840669e788065a1       radboud           4   \n",
      "\n",
      "      gleason_score  \n",
      "0               0+0  \n",
      "1               0+0  \n",
      "2               4+4  \n",
      "3               4+4  \n",
      "4               0+0  \n",
      "...             ...  \n",
      "10611      negative  \n",
      "10612           4+5  \n",
      "10613      negative  \n",
      "10614           3+4  \n",
      "10615           4+4  \n",
      "\n",
      "[10616 rows x 4 columns]\n",
      "####DATAFRAME INFO####\n",
      "\n",
      "File Categorization done in 24.9017 seconds\n",
      "File categorisation completed.\n"
     ]
    }
   ],
   "source": [
    "print(\"Renaming files for categorisation...\")\n",
    "\n",
    "start_time = time.perf_counter()\n",
    "\n",
    "valid_images = tf.io.gfile.glob(TRAIN_IMG_DIR + '/*_0.png')\n",
    "img_ids = train_csv['image_id']\n",
    "print(\"####DATAFRAME INFO####\")\n",
    "print(train_csv)\n",
    "print(\"####DATAFRAME INFO####\")\n",
    "\n",
    "end_time = time.perf_counter()\n",
    "\n",
    "print(f\"\\nFile Categorization done in {end_time - start_time:0.4f} seconds\")\n",
    "print(\"File categorisation completed.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Filing started\n",
      "####DATAFRAME INFO####\n",
      "                               image_id data_provider  isup_grade  \\\n",
      "0      0005f7aaab2800f6170c399693a96917    karolinska           0   \n",
      "1      000920ad0b612851f8e01bcc880d9b3d    karolinska           0   \n",
      "2      0018ae58b01bdadc8e347995b69f99aa       radboud           4   \n",
      "3      001c62abd11fa4b57bf7a6c603a11bb9    karolinska           4   \n",
      "4      001d865e65ef5d2579c190a0e0350d8f    karolinska           0   \n",
      "...                                 ...           ...         ...   \n",
      "10611  ffd2841373b39792ab0c84cccd066e31       radboud           0   \n",
      "10612  ffdc59cd580a1468eac0e6a32dd1ff2d       radboud           5   \n",
      "10613  ffe06afd66a93258f8fabdef6044e181       radboud           0   \n",
      "10614  ffe236a25d4cbed59438220799920749       radboud           2   \n",
      "10615  ffe9bcababc858e04840669e788065a1       radboud           4   \n",
      "\n",
      "      gleason_score  \n",
      "0               0+0  \n",
      "1               0+0  \n",
      "2               4+4  \n",
      "3               4+4  \n",
      "4               0+0  \n",
      "...             ...  \n",
      "10611      negative  \n",
      "10612           4+5  \n",
      "10613      negative  \n",
      "10614           3+4  \n",
      "10615           4+4  \n",
      "\n",
      "[10616 rows x 4 columns]\n",
      "####DATAFRAME INFO####\n"
     ]
    },
    {
     "ename": "AttributeError",
     "evalue": "'DataFrame' object has no attribute 'labels'",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mAttributeError\u001b[0m                            Traceback (most recent call last)",
      "\u001b[1;32mf:\\Machine Learning\\ML Project\\ProstateCancerClassificationModel\\main.ipynb Cell 11\u001b[0m in \u001b[0;36m<cell line: 10>\u001b[1;34m()\u001b[0m\n\u001b[0;32m     <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X13sZmlsZQ%3D%3D?line=10'>11</a>\u001b[0m     file_name \u001b[39m=\u001b[39m TRAIN_IMG_DIR \u001b[39m+\u001b[39m \u001b[39m'\u001b[39m\u001b[39m/\u001b[39m\u001b[39m'\u001b[39m \u001b[39m+\u001b[39m img_id \u001b[39m+\u001b[39m \u001b[39m'\u001b[39m\u001b[39m_0.png\u001b[39m\u001b[39m'\u001b[39m\n\u001b[0;32m     <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X13sZmlsZQ%3D%3D?line=11'>12</a>\u001b[0m     \u001b[39mif\u001b[39;00m file_name \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m valid_images:\n\u001b[1;32m---> <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X13sZmlsZQ%3D%3D?line=12'>13</a>\u001b[0m         train_csv \u001b[39m=\u001b[39m df\u001b[39m.\u001b[39mdrop(df[(df[\u001b[39m'\u001b[39;49m\u001b[39mimage_id\u001b[39;49m\u001b[39m'\u001b[39;49m] \u001b[39m!=\u001b[39;49m img_id)]\u001b[39m.\u001b[39;49mlabels, inplace\u001b[39m=\u001b[39m\u001b[39mTrue\u001b[39;00m)\n\u001b[0;32m     <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X13sZmlsZQ%3D%3D?line=13'>14</a>\u001b[0m         \u001b[39m# train_csv = train_csv[train_csv['image_id'] != img_id]\u001b[39;00m\n\u001b[0;32m     <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X13sZmlsZQ%3D%3D?line=14'>15</a>\u001b[0m         \n\u001b[0;32m     <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X13sZmlsZQ%3D%3D?line=15'>16</a>\u001b[0m \u001b[39m# radboud_csv = train_csv[train_csv['data_provider'] == 'radboud']\u001b[39;00m\n\u001b[0;32m     <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X13sZmlsZQ%3D%3D?line=16'>17</a>\u001b[0m \u001b[39m# karolinska_csv = train_csv[train_csv['data_provider'] != 'radboud']\u001b[39;00m\n\u001b[0;32m     <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X13sZmlsZQ%3D%3D?line=18'>19</a>\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39m\"\u001b[39m\u001b[39m####DATAFRAME INFO####\u001b[39m\u001b[39m\"\u001b[39m)\n",
      "File \u001b[1;32my:\\Anaconda\\lib\\site-packages\\pandas\\core\\generic.py:5575\u001b[0m, in \u001b[0;36mNDFrame.__getattr__\u001b[1;34m(self, name)\u001b[0m\n\u001b[0;32m   5568\u001b[0m \u001b[39mif\u001b[39;00m (\n\u001b[0;32m   5569\u001b[0m     name \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_internal_names_set\n\u001b[0;32m   5570\u001b[0m     \u001b[39mand\u001b[39;00m name \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_metadata\n\u001b[0;32m   5571\u001b[0m     \u001b[39mand\u001b[39;00m name \u001b[39mnot\u001b[39;00m \u001b[39min\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_accessors\n\u001b[0;32m   5572\u001b[0m     \u001b[39mand\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_info_axis\u001b[39m.\u001b[39m_can_hold_identifiers_and_holds_name(name)\n\u001b[0;32m   5573\u001b[0m ):\n\u001b[0;32m   5574\u001b[0m     \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39m[name]\n\u001b[1;32m-> 5575\u001b[0m \u001b[39mreturn\u001b[39;00m \u001b[39mobject\u001b[39;49m\u001b[39m.\u001b[39;49m\u001b[39m__getattribute__\u001b[39;49m(\u001b[39mself\u001b[39;49m, name)\n",
      "\u001b[1;31mAttributeError\u001b[0m: 'DataFrame' object has no attribute 'labels'"
     ]
    }
   ],
   "source": [
    "print(\"Filing started\")\n",
    "start_time = time.perf_counter()\n",
    "\n",
    "print(\"####DATAFRAME INFO####\")\n",
    "print(train_csv)\n",
    "print(\"####DATAFRAME INFO####\")\n",
    "\n",
    "df = pd.DataFrame(train_csv)\n",
    "\n",
    "for img_id in img_ids:\n",
    "    file_name = TRAIN_IMG_DIR + '/' + img_id + '_0.png'\n",
    "    if file_name not in valid_images:\n",
    "        train_csv = train_csv[train_csv['image_id'] != img_id]\n",
    "        \n",
    "radboud_csv = train_csv[train_csv['data_provider'] == 'radboud']\n",
    "karolinska_csv = train_csv[train_csv['data_provider'] != 'radboud']\n",
    "\n",
    "print(\"####DATAFRAME INFO####\")\n",
    "print(train_csv)\n",
    "print(\"####DATAFRAME INFO####\")\n",
    "\n",
    "print(\"####DATAFRAME INFO####\")\n",
    "print(radboud_csv)\n",
    "print(\"####DATAFRAME INFO####\")\n",
    "\n",
    "print(\"####DATAFRAME INFO####\")\n",
    "print(karolinska_csv)\n",
    "print(\"####DATAFRAME INFO####\")\n",
    "\n",
    "end_time = time.perf_counter()\n",
    "print(f\"\\nFiling done in {end_time - start_time:0.4f} seconds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want both our training dataset and our validation dataset to contain images from both the Karolinska Institute and Radboud University Medical Center data providers. The following cell will split the each datafram into a 80:20 training:validation split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Spliting Data...\n"
     ]
    },
    {
     "ename": "ValueError",
     "evalue": "With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.",
     "output_type": "error",
     "traceback": [
      "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[1;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[1;32mf:\\Machine Learning\\ML Project\\ProstateCancerClassificationModel\\main.ipynb Cell 13\u001b[0m in \u001b[0;36m<cell line: 4>\u001b[1;34m()\u001b[0m\n\u001b[0;32m      <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X15sZmlsZQ%3D%3D?line=0'>1</a>\u001b[0m \u001b[39mprint\u001b[39m(\u001b[39m\"\u001b[39m\u001b[39mSpliting Data...\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[0;32m      <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X15sZmlsZQ%3D%3D?line=1'>2</a>\u001b[0m start_time \u001b[39m=\u001b[39m time\u001b[39m.\u001b[39mperf_counter()\n\u001b[1;32m----> <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X15sZmlsZQ%3D%3D?line=3'>4</a>\u001b[0m r_train, r_test \u001b[39m=\u001b[39m train_test_split(\n\u001b[0;32m      <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X15sZmlsZQ%3D%3D?line=4'>5</a>\u001b[0m     radboud_csv,\n\u001b[0;32m      <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X15sZmlsZQ%3D%3D?line=5'>6</a>\u001b[0m     test_size\u001b[39m=\u001b[39;49m\u001b[39m0.2\u001b[39;49m, random_state\u001b[39m=\u001b[39;49mSEED\n\u001b[0;32m      <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X15sZmlsZQ%3D%3D?line=6'>7</a>\u001b[0m )\n\u001b[0;32m      <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X15sZmlsZQ%3D%3D?line=8'>9</a>\u001b[0m k_train, k_test \u001b[39m=\u001b[39m train_test_split(\n\u001b[0;32m     <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X15sZmlsZQ%3D%3D?line=9'>10</a>\u001b[0m     karolinska_csv,\n\u001b[0;32m     <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X15sZmlsZQ%3D%3D?line=10'>11</a>\u001b[0m     test_size\u001b[39m=\u001b[39m\u001b[39m0.2\u001b[39m, random_state\u001b[39m=\u001b[39mSEED\n\u001b[0;32m     <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X15sZmlsZQ%3D%3D?line=11'>12</a>\u001b[0m )\n\u001b[0;32m     <a href='vscode-notebook-cell:/f%3A/Machine%20Learning/ML%20Project/ProstateCancerClassificationModel/main.ipynb#X15sZmlsZQ%3D%3D?line=13'>14</a>\u001b[0m end_time \u001b[39m=\u001b[39m time\u001b[39m.\u001b[39mperf_counter()\n",
      "File \u001b[1;32my:\\Anaconda\\lib\\site-packages\\sklearn\\model_selection\\_split.py:2433\u001b[0m, in \u001b[0;36mtrain_test_split\u001b[1;34m(test_size, train_size, random_state, shuffle, stratify, *arrays)\u001b[0m\n\u001b[0;32m   2430\u001b[0m arrays \u001b[39m=\u001b[39m indexable(\u001b[39m*\u001b[39marrays)\n\u001b[0;32m   2432\u001b[0m n_samples \u001b[39m=\u001b[39m _num_samples(arrays[\u001b[39m0\u001b[39m])\n\u001b[1;32m-> 2433\u001b[0m n_train, n_test \u001b[39m=\u001b[39m _validate_shuffle_split(\n\u001b[0;32m   2434\u001b[0m     n_samples, test_size, train_size, default_test_size\u001b[39m=\u001b[39;49m\u001b[39m0.25\u001b[39;49m\n\u001b[0;32m   2435\u001b[0m )\n\u001b[0;32m   2437\u001b[0m \u001b[39mif\u001b[39;00m shuffle \u001b[39mis\u001b[39;00m \u001b[39mFalse\u001b[39;00m:\n\u001b[0;32m   2438\u001b[0m     \u001b[39mif\u001b[39;00m stratify \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n",
      "File \u001b[1;32my:\\Anaconda\\lib\\site-packages\\sklearn\\model_selection\\_split.py:2111\u001b[0m, in \u001b[0;36m_validate_shuffle_split\u001b[1;34m(n_samples, test_size, train_size, default_test_size)\u001b[0m\n\u001b[0;32m   2108\u001b[0m n_train, n_test \u001b[39m=\u001b[39m \u001b[39mint\u001b[39m(n_train), \u001b[39mint\u001b[39m(n_test)\n\u001b[0;32m   2110\u001b[0m \u001b[39mif\u001b[39;00m n_train \u001b[39m==\u001b[39m \u001b[39m0\u001b[39m:\n\u001b[1;32m-> 2111\u001b[0m     \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\n\u001b[0;32m   2112\u001b[0m         \u001b[39m\"\u001b[39m\u001b[39mWith n_samples=\u001b[39m\u001b[39m{}\u001b[39;00m\u001b[39m, test_size=\u001b[39m\u001b[39m{}\u001b[39;00m\u001b[39m and train_size=\u001b[39m\u001b[39m{}\u001b[39;00m\u001b[39m, the \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m   2113\u001b[0m         \u001b[39m\"\u001b[39m\u001b[39mresulting train set will be empty. Adjust any of the \u001b[39m\u001b[39m\"\u001b[39m\n\u001b[0;32m   2114\u001b[0m         \u001b[39m\"\u001b[39m\u001b[39maforementioned parameters.\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m.\u001b[39mformat(n_samples, test_size, train_size)\n\u001b[0;32m   2115\u001b[0m     )\n\u001b[0;32m   2117\u001b[0m \u001b[39mreturn\u001b[39;00m n_train, n_test\n",
      "\u001b[1;31mValueError\u001b[0m: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters."
     ]
    }
   ],
   "source": [
    "print(\"Spliting Data...\")\n",
    "start_time = time.perf_counter()\n",
    "\n",
    "r_train, r_test = train_test_split(\n",
    "    radboud_csv,\n",
    "    test_size=0.2, random_state=SEED\n",
    ")\n",
    "\n",
    "k_train, k_test = train_test_split(\n",
    "    karolinska_csv,\n",
    "    test_size=0.2, random_state=SEED\n",
    ")\n",
    "\n",
    "end_time = time.perf_counter()\n",
    "print(f\"\\nData split done in {end_time - start_time:0.4f} seconds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Concatenate the dataframes from the two different providers and we have our training dataset and our validation dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Concatenating data\\n\")\n",
    "start_time = time.perf_counter()\n",
    "\n",
    "train_df = pd.concat([r_train, k_train])\n",
    "valid_df = pd.concat([r_test, k_test])\n",
    "\n",
    "print(\"Training data shape: \",train_df.shape)\n",
    "print(\"Validation data shape: \",valid_df.shape)\n",
    "\n",
    "end_time = time.perf_counter()\n",
    "print(f\"\\nConcatenating data done in {end_time - start_time:0.4f} seconds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generally, it is better practice to specify constant variables than it is to hard-code numbers. This way, changing parameters is more efficient and complete. Specfiy some constants below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Specifying variables\\n\")\n",
    "start_time = time.perf_counter()\n",
    "\n",
    "IMG_DIM = (1536, 128)\n",
    "CLASSES_NUM = 6\n",
    "BATCH_SIZE = 32\n",
    "EPOCHS = 100\n",
    "N=12\n",
    "\n",
    "LEARNING_RATE = 1e-4\n",
    "FOLDED_NUM_TRAIN_IMAGES = train_df.shape[0]\n",
    "FOLDED_NUM_VALID_IMAGES = valid_df.shape[0]\n",
    "STEPS_PER_EPOCH = FOLDED_NUM_TRAIN_IMAGES // BATCH_SIZE\n",
    "VALIDATION_STEPS = FOLDED_NUM_VALID_IMAGES // BATCH_SIZE\n",
    "\n",
    "end_time = time.perf_counter()\n",
    "print(f\"\\nVariables specified in {end_time - start_time:0.4f} seconds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `tf.keras.utils.Sequence` is a base object to fit a dataset. Since our dataset is stored both as images and as a csv, we will have to write a DataGenerator that is a subclass of the Sequence class. The DataGenerator will concatenate all the tiles from each original image into a newer image of just the masked areas. It will also get the label from the ISUP grade column and convert it to a one-hot encoding. One-hot encoding is necessary because the ISUP grade is not a continuous datatype but a categorical datatype."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Generating Data...\")\n",
    "start_time = time.perf_counter()\n",
    "\n",
    "class DataGenerator(tf.compat.v1.keras.utils.Sequence):\n",
    "    \n",
    "    def __init__(self,\n",
    "                image_shape,\n",
    "                batch_size, \n",
    "                df,\n",
    "                img_dir,\n",
    "                mask_dir,\n",
    "                is_training=True\n",
    "                ):\n",
    "        \n",
    "        self.image_shape = image_shape\n",
    "        self.batch_size = batch_size\n",
    "        self.df = df\n",
    "        self.img_dir = img_dir\n",
    "        self.mask_dir = mask_dir\n",
    "        self.is_training = is_training\n",
    "        self.indices = range(df.shape[0])\n",
    "        \n",
    "    def __len__(self):\n",
    "        return self.df.shape[0] // self.batch_size\n",
    "    \n",
    "    def on_epoch_start(self):\n",
    "        if self.is_training:\n",
    "            np.random.shuffle(self.indices)\n",
    "    \n",
    "    def __getitem__(self, index):\n",
    "        batch_indices = self.indices[index * self.batch_size : (index+1) * self.batch_size]\n",
    "        image_ids = self.df['image_id'].iloc[batch_indices].values\n",
    "        batch_images = [self.__getimages__(image_id) for image_id in image_ids]\n",
    "        batch_labels = [self.df[self.df['image_id'] == image_id]['isup_grade'].values[0] for image_id in image_ids]\n",
    "        batch_labels = tf.one_hot(batch_labels, CLASSES_NUM)\n",
    "        \n",
    "        return np.squeeze(np.stack(batch_images).reshape(-1, 1536, 128, 3)), np.stack(batch_labels)\n",
    "        \n",
    "    def __getimages__(self, image_id):\n",
    "        fnames = [image_id+'_'+str(i)+'.png' for i in range(N)]\n",
    "        images = []\n",
    "        for fn in fnames:\n",
    "            img = np.array(Image.open(os.path.join(self.img_dir, fn)).convert('RGB'))[:, :, ::-1]\n",
    "            images.append(img)\n",
    "        result = np.stack(images).reshape(1, 1536, 128, 3) / 255.0\n",
    "        return result\n",
    "\n",
    "end_time = time.perf_counter()\n",
    "print(f\"\\nData generated in {end_time - start_time:0.4f} seconds\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will use the DataGenerator to create a generator for our training dataset and for our validation dataset. At each iteration of the generator, the generator will return a batch of images."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_generator = DataGenerator(image_shape=IMG_DIM,\n",
    "                                batch_size=BATCH_SIZE,\n",
    "                                df=train_df,\n",
    "                                img_dir=TRAIN_IMG_DIR,\n",
    "                                mask_dir=TRAIN_MASKS_DIR)\n",
    "\n",
    "valid_generator = DataGenerator(image_shape=IMG_DIM,\n",
    "                                batch_size=BATCH_SIZE,\n",
    "                                df=valid_df,\n",
    "                                img_dir=TRAIN_IMG_DIR,\n",
    "                                mask_dir=TRAIN_MASKS_DIR)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize our input data\n",
    "\n",
    "Run the following cell to define the method to visualize our input data. This method displays the new images and their corresponding ISUP grade."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_tiles(image_batch, label_batch):\n",
    "    plt.figure(figsize=(20,20))\n",
    "    for n in range(10):\n",
    "        ax = plt.subplot(1,10,n+1)\n",
    "        plt.imshow(image_batch[n])\n",
    "        decoded = np.argmax(label_batch[n])\n",
    "        plt.title(decoded)\n",
    "        plt.axis(\"off\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "image_batch, label_batch = next(iter(train_generator))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The following 12 tiles were from a single image but has been converted to 12 tiles to reduce white space. We see that only the sections that led to the ISUP grade has been preserved."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "show_tiles(image_batch, label_batch)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build our model + Data augmentation\n",
    "\n",
    "We will be utilizing the Xception pre-trained model to classify our data. The PANDA competition scores submissions using the quadratic weighted kappa. The TensorFlow add-on API contains the Cohen Kappa loss and metric functions. Since we want to use the newest version of TensorFlow through tf-nightly to utilize the pretrained EfficientNet model, we will refrain from using the TFA API as it has not been moved over yet to the tf-nightly version. However, feel free to create your own Cohen Kappa Metric and Loss class using the TensorFlow API."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data augmentation is helpful when dealing with image data as it prevents overfitting. Data augmentation introduces artificial but realistic variance in our images so that our model can learn from more features. Keras has recently implemented `keras.layers.preprocessing` that allows the model to streamline the data augmentation process."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the base model has already been trained with imagenet weights, we do not want to weights to change, so the base mode must not be trainable. However, the number of classes that our model has differs from the original model. Therefore, we do not want to include the top layers because we will add our own Dense layer that has the same number of nodes as our output class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_model():\n",
    "    data_augmentation = tf.keras.Sequential([\n",
    "        tf.keras.layers.experimental.preprocessing.RandomContrast(0.15, seed=SEED),\n",
    "        tf.keras.layers.experimental.preprocessing.RandomFlip(\"horizontal\", seed=SEED),\n",
    "        tf.keras.layers.experimental.preprocessing.RandomFlip(\"vertical\", seed=SEED),\n",
    "        tf.keras.layers.experimental.preprocessing.RandomTranslation(0.1, 0.1, seed=SEED)\n",
    "    ])\n",
    "    \n",
    "    base_model = tf.keras.applications.VGG16(input_shape=(*IMG_DIM, 3),\n",
    "                                            include_top=False,\n",
    "                                            weights='imagenet')\n",
    "    \n",
    "    base_model.trainable = True\n",
    "    \n",
    "    model = tf.keras.Sequential([\n",
    "        data_augmentation,\n",
    "        \n",
    "        base_model,\n",
    "        \n",
    "        tf.keras.layers.GlobalAveragePooling2D(),\n",
    "        tf.keras.layers.Dense(16, activation='relu'),\n",
    "        tf.keras.layers.BatchNormalization(),\n",
    "        tf.keras.layers.Dense(CLASSES_NUM, activation='softmax'),\n",
    "    ])\n",
    "    \n",
    "    model.compile(optimizer=tf.keras.optimizers.RMSprop(),\n",
    "                    loss='categorical_crossentropy',\n",
    "                    metrics=tf.keras.metrics.AUC(name='auc'))\n",
    "    \n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's build our model!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the model\n",
    "\n",
    "And now let's train it! Learning rate is a very important hyperparameter, and it can be difficult to choose the \"right\" one. A learning rate that it too high will prevent the model from converging, but one that is too low will be far too slow. We will utilize multiple callbacks, using the `tf.keras` API to make sure that we are using an ideal learning rate and to prevent the model from overfitting. We can also save our model so that we do not have to retrain it next time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def exponential_decay(lr0, s):\n",
    "    def exponential_decay_fn(epoch):\n",
    "        return lr0 * 0.1 **(epoch / s)\n",
    "    return exponential_decay_fn\n",
    "\n",
    "exponential_decay_fn = exponential_decay(0.01, 20)\n",
    "\n",
    "lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay_fn)\n",
    "\n",
    "checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(\"panda_model.h5\",\n",
    "                                                    save_best_only=True)\n",
    "\n",
    "early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10,\n",
    "                                                    restore_best_weights=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "history = model.fit(\n",
    "    train_generator, epochs=EPOCHS,\n",
    "    steps_per_epoch=STEPS_PER_EPOCH,\n",
    "    validation_data=valid_generator,\n",
    "    validation_steps=VALIDATION_STEPS,\n",
    "    callbacks=[checkpoint_cb, early_stopping_cb, lr_scheduler]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predict results\n",
    "\n",
    "For this competition, the test dataset is not available to us. But I wish you all the best of luck, and hopefully this NB served as a helpful tutorial to help you get started."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.13 ('base')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "5819c1eaf6d552792a1bbc5e8998e6c2149ab26a1973a0d78107c0d9954e5ba0"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}