[UPDATE] Notebook Inputs only
This commit is contained in:
@@ -107,6 +107,365 @@
"TRAIN_MASKS_DIR = '../input/panda-tiles/masks'\n",
"train_csv = pd.read_csv(os.path.join(MAIN_DIR, 'train.csv'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some of the images could not be converted to tiles because the masks were too small or the image was too noisy. We need to take these images out of our DataFrame so that we do not run into a `FileNotFoundError`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"valid_images = tf.io.gfile.glob(TRAIN_IMG_DIR + '/*_0.png')\n",
"img_ids = train_csv['image_id']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for img_id in img_ids:\n",
"    file_name = TRAIN_IMG_DIR + '/' + img_id + '_0.png'\n",
"    if file_name not in valid_images:\n",
"        train_csv = train_csv[train_csv['image_id'] != img_id]\n",
"        \n",
"radboud_csv = train_csv[train_csv['data_provider'] == 'radboud']\n",
"karolinska_csv = train_csv[train_csv['data_provider'] != 'radboud']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We want both our training dataset and our validation dataset to contain images from both data providers, the Karolinska Institute and Radboud University Medical Center. The following cell splits each DataFrame into an 80:20 training:validation split."
]
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"r_train, r_test = train_test_split(\n",
"    radboud_csv,\n",
"    test_size=0.2, random_state=SEED\n",
")\n",
"\n",
"k_train, k_test = train_test_split(\n",
"    karolinska_csv,\n",
"    test_size=0.2, random_state=SEED\n",
")"
]
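As a side note, the same provider-balanced split can be expressed in a single call with `train_test_split`'s `stratify` parameter instead of splitting each provider's DataFrame separately. A minimal sketch with made-up rows (the column names mirror the notebook's; the data is invented):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for train_csv: 10 rows, 5 per data provider.
df = pd.DataFrame({
    'image_id': [f'img_{i}' for i in range(10)],
    'data_provider': ['radboud'] * 5 + ['karolinska'] * 5,
})

# stratify keeps the provider ratio identical in both splits.
train_df, valid_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df['data_provider']
)
print(valid_df['data_provider'].value_counts())  # one row per provider
```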
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Concatenate the DataFrames from the two providers, and we have our training dataset and our validation dataset."
]
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_df = pd.concat([r_train, k_train])\n",
"valid_df = pd.concat([r_test, k_test])\n",
"\n",
"print(train_df.shape)\n",
"print(valid_df.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generally, it is better practice to specify named constants than to hard-code numbers: changing parameters then stays efficient and consistent. Specify some constants below."
]
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"IMG_DIM = (1536, 128)\n",
"CLASSES_NUM = 6\n",
"BATCH_SIZE = 32\n",
"EPOCHS = 100\n",
"N = 12\n",
"\n",
"LEARNING_RATE = 1e-4\n",
"FOLDED_NUM_TRAIN_IMAGES = train_df.shape[0]\n",
"FOLDED_NUM_VALID_IMAGES = valid_df.shape[0]\n",
"STEPS_PER_EPOCH = FOLDED_NUM_TRAIN_IMAGES // BATCH_SIZE\n",
"VALIDATION_STEPS = FOLDED_NUM_VALID_IMAGES // BATCH_SIZE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`tf.keras.utils.Sequence` is a base class for feeding a dataset to a model. Since our dataset is stored both as images and as a CSV, we will write a DataGenerator that subclasses Sequence. The DataGenerator will concatenate all the tiles from each original image into a new image of just the masked areas. It will also get the label from the ISUP grade column and convert it to a one-hot encoding. One-hot encoding is necessary because the ISUP grade is a categorical, not a continuous, datatype."
]
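To make the one-hot step concrete, here is a minimal NumPy sketch of what `tf.one_hot` produces for ISUP grades (the helper name is ours, not part of the notebook):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return one indicator row per integer label, as tf.one_hot would."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

# ISUP grades 0-5 become 6-dimensional indicator vectors:
print(one_hot([0, 3, 5], 6))
```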
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class DataGenerator(tf.keras.utils.Sequence):\n",
"    \n",
"    def __init__(self,\n",
"                 image_shape,\n",
"                 batch_size, \n",
"                 df,\n",
"                 img_dir,\n",
"                 mask_dir,\n",
"                 is_training=True\n",
"                 ):\n",
"        \n",
"        self.image_shape = image_shape\n",
"        self.batch_size = batch_size\n",
"        self.df = df\n",
"        self.img_dir = img_dir\n",
"        self.mask_dir = mask_dir\n",
"        self.is_training = is_training\n",
"        self.indices = np.arange(df.shape[0])\n",
"        \n",
"    def __len__(self):\n",
"        return self.df.shape[0] // self.batch_size\n",
"        \n",
"    def on_epoch_end(self):\n",
"        if self.is_training:\n",
"            np.random.shuffle(self.indices)\n",
"        \n",
"    def __getitem__(self, index):\n",
"        batch_indices = self.indices[index * self.batch_size : (index+1) * self.batch_size]\n",
"        image_ids = self.df['image_id'].iloc[batch_indices].values\n",
"        batch_images = [self.__getimages__(image_id) for image_id in image_ids]\n",
"        batch_labels = [self.df[self.df['image_id'] == image_id]['isup_grade'].values[0] for image_id in image_ids]\n",
"        batch_labels = tf.one_hot(batch_labels, CLASSES_NUM)\n",
"        \n",
"        return np.squeeze(np.stack(batch_images).reshape(-1, *IMG_DIM, 3)), np.stack(batch_labels)\n",
"        \n",
"    def __getimages__(self, image_id):\n",
"        fnames = [image_id+'_'+str(i)+'.png' for i in range(N)]\n",
"        images = []\n",
"        for fn in fnames:\n",
"            img = np.array(PIL.Image.open(os.path.join(self.img_dir, fn)).convert('RGB'))[:, :, ::-1]\n",
"            images.append(img)\n",
"        result = np.stack(images).reshape(1, *IMG_DIM, 3) / 255.0\n",
"        return result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will use the DataGenerator to create a generator for our training dataset and one for our validation dataset. At each iteration, a generator returns one batch of images and labels. We pass `is_training=False` for the validation generator so that its indices are never shuffled."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_generator = DataGenerator(image_shape=IMG_DIM,\n",
"                                batch_size=BATCH_SIZE,\n",
"                                df=train_df,\n",
"                                img_dir=TRAIN_IMG_DIR,\n",
"                                mask_dir=TRAIN_MASKS_DIR)\n",
"\n",
"valid_generator = DataGenerator(image_shape=IMG_DIM,\n",
"                                batch_size=BATCH_SIZE,\n",
"                                df=valid_df,\n",
"                                img_dir=TRAIN_IMG_DIR,\n",
"                                mask_dir=TRAIN_MASKS_DIR,\n",
"                                is_training=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize our input data\n",
"\n",
"Run the following cell to define the method to visualize our input data. This method displays the new images and their corresponding ISUP grade."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def show_tiles(image_batch, label_batch):\n",
"    plt.figure(figsize=(20,20))\n",
"    for n in range(10):\n",
"        ax = plt.subplot(1,10,n+1)\n",
"        plt.imshow(image_batch[n])\n",
"        decoded = np.argmax(label_batch[n])\n",
"        plt.title(decoded)\n",
"        plt.axis(\"off\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"image_batch, label_batch = next(iter(train_generator))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following tiles all come from a single original image that was converted into 12 tiles to reduce white space. We see that only the sections that contributed to the ISUP grade have been preserved."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"show_tiles(image_batch, label_batch)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Build our model + Data augmentation\n",
"\n",
"We will use the pre-trained VGG16 model to classify our data. The PANDA competition scores submissions using the quadratic weighted kappa. The TensorFlow Addons API contains Cohen Kappa loss and metric functions, but since we want to use the newest version of TensorFlow through tf-nightly, and TensorFlow Addons has not yet been ported to it, we will refrain from using the TFA API. However, feel free to create your own Cohen Kappa metric and loss classes using the TensorFlow API."
]
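Since the TFA implementation is off the table, the competition metric can also be written by hand. A minimal NumPy sketch of quadratic weighted kappa for integer label vectors (the function name and signature are our own, not from this notebook or TFA):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, num_classes=6):
    """Quadratic weighted kappa between two integer label vectors."""
    observed = np.zeros((num_classes, num_classes))   # confusion matrix
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Quadratic penalty grows with the squared distance between grades.
    weights = np.array([[(i - j) ** 2 for j in range(num_classes)]
                        for i in range(num_classes)], dtype=float)
    weights /= (num_classes - 1) ** 2
    hist_true = observed.sum(axis=1)
    hist_pred = observed.sum(axis=0)
    expected = np.outer(hist_true, hist_pred) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Perfect agreement scores 1.0; chance-level agreement scores ~0.
print(quadratic_weighted_kappa([0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]))
```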
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Data augmentation is helpful when dealing with image data because it prevents overfitting. It introduces artificial but realistic variance into our images so that our model can learn from more features. Keras has recently implemented `keras.layers.preprocessing` layers that let the model streamline the data augmentation process."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since the base model has already been trained with ImageNet weights, we do not want those weights to change, so the base model must not be trainable. However, our model has a different number of classes than the original, so we do not include the top layers; instead we add our own Dense layer with the same number of nodes as our number of output classes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def make_model():\n",
"    data_augmentation = tf.keras.Sequential([\n",
"        tf.keras.layers.experimental.preprocessing.RandomContrast(0.15, seed=SEED),\n",
"        tf.keras.layers.experimental.preprocessing.RandomFlip(\"horizontal\", seed=SEED),\n",
"        tf.keras.layers.experimental.preprocessing.RandomFlip(\"vertical\", seed=SEED),\n",
"        tf.keras.layers.experimental.preprocessing.RandomTranslation(0.1, 0.1, seed=SEED)\n",
"    ])\n",
"    \n",
"    base_model = tf.keras.applications.VGG16(input_shape=(*IMG_DIM, 3),\n",
"                                             include_top=False,\n",
"                                             weights='imagenet')\n",
"    \n",
"    # Freeze the ImageNet weights so only our new head is trained.\n",
"    base_model.trainable = False\n",
"    \n",
"    model = tf.keras.Sequential([\n",
"        data_augmentation,\n",
"        \n",
"        base_model,\n",
"        \n",
"        tf.keras.layers.GlobalAveragePooling2D(),\n",
"        tf.keras.layers.Dense(16, activation='relu'),\n",
"        tf.keras.layers.BatchNormalization(),\n",
"        tf.keras.layers.Dense(CLASSES_NUM, activation='softmax'),\n",
"    ])\n",
"    \n",
"    model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=LEARNING_RATE),\n",
"                  loss='categorical_crossentropy',\n",
"                  metrics=[tf.keras.metrics.AUC(name='auc')])\n",
"    \n",
"    return model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's build our model!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model = make_model()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training the model\n",
"\n",
"And now let's train it! The learning rate is a very important hyperparameter, and it can be difficult to choose the \"right\" one. A learning rate that is too high will prevent the model from converging, but one that is too low will train far too slowly. We will use several callbacks from the `tf.keras` API to make sure that we use a sensible learning rate and to prevent the model from overfitting. We also save our model so that we do not have to retrain it next time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def exponential_decay(lr0, s):\n",
"    def exponential_decay_fn(epoch):\n",
"        return lr0 * 0.1 ** (epoch / s)\n",
"    return exponential_decay_fn\n",
"\n",
"exponential_decay_fn = exponential_decay(0.01, 20)\n",
"\n",
"lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay_fn)\n",
"\n",
"checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(\"panda_model.h5\",\n",
"                                                   save_best_only=True)\n",
"\n",
"early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10,\n",
"                                                     restore_best_weights=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"history = model.fit(\n",
"    train_generator, epochs=EPOCHS,\n",
"    steps_per_epoch=STEPS_PER_EPOCH,\n",
"    validation_data=valid_generator,\n",
"    validation_steps=VALIDATION_STEPS,\n",
"    callbacks=[checkpoint_cb, early_stopping_cb, lr_scheduler]\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predict results\n",
"\n",
"For this competition, the test dataset is not available to us. But I wish you all the best of luck, and hopefully this notebook served as a helpful tutorial to get you started."
]
}
],
"metadata": {