Introduction
According to the U.S. National Cancer Institute, histopathological analysis is the description of a tumor based on how abnormal the cancer cells and tissue look under a microscope. Histopathological examination is usually considered the best way to tell whether cancer is present. Besides the laboriousness of manually identifying tumor tissue, an individual pathologist's prior experience may limit his or her ability to recognize certain tumors, given the wide heterogeneity of histologic images. This project aims to train a convolutional neural network (CNN) to aid pathologists in identifying the presence of tumor tissue in histologic images, and to analyze the sensitivity, specificity, and overall accuracy attained by CNN-based automated histopathological processing.
Description of Technology
The core technology used in this project is the convolutional neural network. Specifically, the model was built on top of a pretrained image-recognition network, VGG16. Its convolutional base comprises 13 convolutional layers with 14,714,688 parameters in total. By leveraging the pretrained VGG16 network, we expect to attain better predictive power with a relatively small computational burden. Model training was conducted using keras and tensorflow-gpu, and the model was trained on Amazon Web Services (a p3.2xlarge instance).
Data source and description
The images and labels were obtained from the Histopathologic Cancer Detection challenge hosted by Kaggle.
Examples of tumor positive images:
Examples of tumor negative images:
Label file: train_labels.csv
| | id | label |
|---|---|---|
| 1 | f38a6374c348f90b587e046aac6079959adf3835 | 0 |
| 2 | c18f2d887b7ae4f6742ee445113fa1aef383ed77 | 1 |
| 3 | 755db6279dae599ebb4d39a9123cce439965282d | 0 |
| 4 | bc3f0c64fb968ff4a8bd33af6971ecae77c75e08 | 0 |
| 5 | 068aba587a4950175d04c680d38943fd488d6a9d | 0 |
| 6 | acfe80838488fae3c89bd21ade75be5c34e66be7 | 0 |
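As a quick check of class balance (this snippet is not part of the original notebook, and assumes train_labels.csv sits in the working directory), counting the labels confirms that the data lean toward negative cases, which becomes relevant when interpreting sensitivity versus specificity later:

import pandas as pd

# tally the two classes; the data are slightly skewed toward negative (label 0) cases
labels = pd.read_csv('train_labels.csv', header=0)
print(labels['label'].value_counts(normalize=True))  # fraction of each class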
Preprocess data
Two steps were taken in data preprocessing. First, tumor-positive and tumor-negative images were separated into different folders (named 'pos' and 'neg'); second, the data were split into training, validation, and test sets. We employed a 70:20:10 ratio for the three sets.
import os
import pandas as pd
data_dir = r'C:\Users\zkuang\Desktop\histopathologic-cancer-detection'
os.chdir(data_dir)
labels = pd.read_csv('train_labels.csv',header=0)
# no directories need to be created up front; split_data below creates
# the pos/neg sub-directories for each split on demand
train_cut = int(0.7*labels.shape[0]) # first cutpoint at 0.7
val_cut = int(0.9*labels.shape[0]) # second cutpoint at 0.9
def split_data(start, stop, labels, data_dir, sub_dir):
    # move each image listed in labels[start:stop] into <sub_dir>/pos or <sub_dir>/neg
    if not os.path.exists(os.path.join(data_dir, sub_dir, 'pos')):
        os.makedirs(os.path.join(data_dir, sub_dir, 'pos'))
    if not os.path.exists(os.path.join(data_dir, sub_dir, 'neg')):
        os.makedirs(os.path.join(data_dir, sub_dir, 'neg'))
    for i in range(start, stop):
        filename = labels.loc[i, 'id']
        if labels.loc[i, 'label'] == 1:
            try:
                os.rename(os.path.join(data_dir, 'raw', filename + '.tif'),
                          os.path.join(data_dir, sub_dir, 'pos', filename + '.tif'))
            except FileNotFoundError:
                print(filename + ' not found!')
        else:
            try:
                os.rename(os.path.join(data_dir, 'raw', filename + '.tif'),
                          os.path.join(data_dir, sub_dir, 'neg', filename + '.tif'))
            except FileNotFoundError:
                print(filename + ' not found!')
split_data(0, train_cut, labels, data_dir, 'train')             # create training data
split_data(train_cut, val_cut, labels, data_dir, 'validation')  # create validation data (start at train_cut, not train_cut+1, so no rows are skipped)
split_data(val_cut, labels.shape[0], labels, data_dir, 'test')  # create test data
Breakdown of training data set-up
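A small tally (not part of the original notebook) can confirm how many images landed in each split and class, and that the breakdown is roughly 70:20:10:

# count images per split and class
for sub_dir in ['train', 'validation', 'test']:
    for cls in ['pos', 'neg']:
        n_img = len(os.listdir(os.path.join(data_dir, sub_dir, cls)))
        print(sub_dir, cls, n_img)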
Training with pre-trained CNN
import keras
keras.__version__
import logging
import os
Using TensorFlow backend.
#from google.colab import drive
#drive.mount('/content/gdrive')
#os.chdir(r'/content/gdrive/My Drive/FinalProject')
basedir = data_dir
#os.chdir(r'Ubuntu/FinalProject')
#basedir = r'/home/ubuntu/FinalProject'
#logging.basicConfig(filename = r'/content/gdrive/My Drive/FinalProject/test.log',level = logging.INFO)
#logging.basicConfig(filename = r'C:\Users\zkuang\Google Drive\FinalProject\test.log',level = logging.INFO)
logging.basicConfig(filename = os.path.join(basedir,'test2.log'),level = logging.DEBUG)
In this cell we set some global parameters for our data:
im_size = 96     # the images are 96 x 96 pixels
n_channel = 3    # RGB channels
batch_size = 64  # global batch size used by the generators
Using Keras ImageDataGenerator
Load the pretrained model and specify the trainable blocks. We allow only the last convolutional block of VGG16 to be trainable.
from keras.applications import VGG16
conv_base = VGG16(weights='imagenet',
                  include_top=False,
                  input_shape=(im_size, im_size, n_channel))
conv_base.summary(print_fn=logging.debug)  # route the summary to the log; summary() returns None, so logging.debug(print(...)) would only log "None"
conv_base.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 96, 96, 3) 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 96, 96, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 96, 96, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 48, 48, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 48, 48, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 48, 48, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 24, 24, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 24, 24, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 24, 24, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 24, 24, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 12, 12, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 12, 12, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 12, 12, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 12, 12, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 6, 6, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 6, 6, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 6, 6, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 6, 6, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 3, 3, 512) 0
=================================================================
Total params: 14,714,688
Trainable params: 14,714,688
Non-trainable params: 0
_________________________________________________________________
None
This is the part that enables fine-tuning: every layer before block5_conv1 is frozen, and the layers from block5_conv1 onward remain trainable.
conv_base.trainable = True
set_trainable = False
for layer in conv_base.layers:
    if layer.name == 'block5_conv1':
        set_trainable = True
    if set_trainable:
        layer.trainable = True
    else:
        layer.trainable = False
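As a quick sanity check (not in the original notebook), one can confirm that only the layers from block5_conv1 onward are left trainable:

# print each layer's trainable flag; only layers from block5_conv1 onward should print True
for layer in conv_base.layers:
    print(layer.name, layer.trainable)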
Model set up
We added a flattening layer and a dense classifier on top of the convolutional base. The last layer uses a sigmoid activation function, since our output is binary.
from keras import models
from keras import layers
from keras import optimizers
from keras.preprocessing.image import ImageDataGenerator
from keras.models import load_model
import json
model = models.Sequential()
model.add(conv_base)  # reuse the pretrained convolutional base
model.add(layers.Flatten())
model.add(layers.Dense(256, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))
Set up the data generators. Notice that pixel values were rescaled to [0, 1] before training (rescale=1./255). The batch size used here is the global batch size of 64.
datagen = ImageDataGenerator(rescale=1./255)
train_dir = os.path.join(basedir,'train')
validation_dir = os.path.join(basedir,'validation')
train_generator = datagen.flow_from_directory(
    # this is the target directory
    train_dir,
    # all images will be resized to im_size x im_size (96 x 96)
    target_size=(im_size, im_size),
    batch_size=batch_size,
    class_mode='binary')
validation_generator = datagen.flow_from_directory(
    validation_dir,
    target_size=(im_size, im_size),
    batch_size=batch_size,
    class_mode='binary')
Found 154017 images belonging to 2 classes.
Found 44004 images belonging to 2 classes.
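One detail worth confirming: flow_from_directory assigns class indices in alphanumeric order of the subfolder names, so 'neg' maps to 0 and 'pos' to 1, consistent with the original labels. A quick check (not in the original notebook):

print(train_generator.class_indices)  # expected: {'neg': 0, 'pos': 1}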
Model compilation and fitting. We saved our model as an h5 object. This if/else statement ensures that model fitting only takes place when a trained model is not already available; this was for the convenience of debugging and model diagnostics. We ran 120 epochs of training, with 100 steps per epoch.
model.compile(loss='binary_crossentropy',
              optimizer=optimizers.RMSprop(lr=1e-5),
              metrics=['acc'])
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
vgg16 (Model) (None, 3, 3, 512) 14714688
_________________________________________________________________
flatten_1 (Flatten) (None, 4608) 0
_________________________________________________________________
dense_1 (Dense) (None, 256) 1179904
_________________________________________________________________
dense_2 (Dense) (None, 1) 257
=================================================================
Total params: 15,894,849
Trainable params: 8,259,585
Non-trainable params: 7,635,264
_________________________________________________________________
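The trainable parameter count can be verified by hand: the three fine-tuned block-5 convolutions contribute 3 × 2,359,808 = 7,079,424 parameters, and the new dense head adds 4608 × 256 + 256 = 1,179,904 and 256 × 1 + 1 = 257, giving 8,259,585 trainable parameters; the remaining 14,714,688 − 7,079,424 = 7,635,264 parameters stay frozen.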
Model run
if os.path.isfile(os.path.join(basedir, 'hist_path.h5')):  # only train when a saved model does not already exist
    model = load_model(os.path.join(basedir, 'hist_path.h5'))
    with open('model_call_back.json', 'r') as f:
        history = json.load(f)
else:
    history = model.fit_generator(
        train_generator,
        steps_per_epoch=100,
        epochs=120,
        validation_data=validation_generator,
        validation_steps=50)
    history = history.history
    model.save(os.path.join(basedir, 'hist_path.h5'))
    with open('model_call_back.json', 'w') as f:
        json.dump(history, f)
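Note that with steps_per_epoch=100 and a batch size of 64, each epoch draws 100 × 64 = 6,400 images, only a small fraction of the 154,017 training images; the generator cycles onward through the full training set, so successive epochs see different images.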
Plotting loss and accuracy. This is where we conducted internal model diagnostics: loss and accuracy on the training and validation samples were plotted against epochs.
import matplotlib.pyplot as plt
%matplotlib inline
acc = history['acc']
val_acc = history['val_acc']
loss = history['loss']
val_loss = history['val_loss']
epochs = range(len(acc))
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.savefig('accuracy.png')
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.savefig('loss.png')  # save before show(); calling savefig() after show() writes out an empty figure
plt.show()
Investigate accuracy, sensitivity and specificity
The verification steps below yielded the expected images, indicating that the data were handled properly for the test step.
from PIL import Image
import numpy as np
test_dir = os.path.join(basedir,'test')
test_n_pos = len(os.listdir(os.path.join(test_dir,'pos')))
pos_features = np.zeros(shape=(test_n_pos, im_size, im_size, 3))
i = 0
for file in os.listdir(os.path.join(test_dir, 'pos')):  # load and rescale positive test images
    pos_features[i] = np.divide(np.array(Image.open(os.path.join(test_dir, 'pos', file))), 255)
    i += 1
test_n_neg = len(os.listdir(os.path.join(test_dir, 'neg')))
neg_features = np.zeros(shape=(test_n_neg, im_size, im_size, 3))
i = 0
for file in os.listdir(os.path.join(test_dir, 'neg')):  # load and rescale negative test images
    neg_features[i] = np.divide(np.array(Image.open(os.path.join(test_dir, 'neg', file))), 255)
    i += 1
sample_tif = (pos_features[1]*255).astype(int)  # sanity-check one positive test image
plt.imshow(sample_tif)
plt.figure()
sample_tif = (neg_features[1]*255).astype(int)  # sanity-check one negative test image
plt.imshow(sample_tif)
if os.path.isfile(os.path.join(basedir, 'y_pos.txt')):  # reuse saved predictions if available
    y_pos = np.loadtxt("y_pos.txt")
    y_neg = np.loadtxt("y_neg.txt")
else:
    y_pos = model.predict(pos_features)
    y_pos = (y_pos > 0.5) * 1  # threshold the predicted probabilities at 0.5
    np.savetxt("y_pos.txt", y_pos, delimiter=",")
    y_neg = model.predict(neg_features)
    y_neg = (y_neg > 0.5) * 1
    np.savetxt("y_neg.txt", y_neg, delimiter=",")
print('true positive = ')
print(sum(y_pos))
print('false negative = ')
print(sum(1-y_pos))
print('true negative = ')
print(sum(1-y_neg))
print('false positive = ')
print(sum(y_neg))
true positive =
7987.0
false negative =
911.0
true negative =
12212.0
false positive =
892.0
print('sensitivity =')
print(sum(y_pos)/len(y_pos))
print('specificity =')
print(sum(1-y_neg)/len(y_neg))
print('overall accuracy = ')
print((sum(y_pos) + sum(1-y_neg)) /
(len(y_pos)+len(y_neg)))
sensitivity =
0.8976174421218251
specificity =
0.931929181929182
overall accuracy =
0.9180529042814289
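These figures follow directly from the counts above: sensitivity = TP/(TP + FN) = 7987/(7987 + 911) ≈ 0.898, specificity = TN/(TN + FP) = 12212/(12212 + 892) ≈ 0.932, and overall accuracy = (TP + TN)/(TP + FN + TN + FP) = 20199/22002 ≈ 0.918.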
The model attained a sensitivity of 0.898 and a specificity of 0.932. The overall accuracy of our model is 0.918, which is comparable to state-of-the-art supervised machine learning results for histopathological image recognition. Given that our images are of lower resolution than those used in most studies (often ultra-high-resolution images with more than one million pixels), the results are all the more remarkable. It is not surprising that we attained better specificity than sensitivity, given that our data are slightly skewed towards negative cases.
# visualization of misclassified images
print('Mis-classified positive images')
j = 0
for i in range(len(y_pos)):
    if y_pos[i] == 0:  # a true positive predicted as negative
        tif = (pos_features[i]*255).astype(int)
        plt.figure()
        plt.imshow(tif)
        j += 1
    if j == 5:  # show at most five examples
        break
Mis-classified positive images
print('Mis-classified negative images')
j = 0
for i in range(len(y_neg)):
    if y_neg[i] == 1:  # a true negative predicted as positive
        tif = (neg_features[i]*255).astype(int)
        plt.figure()
        plt.imshow(tif)
        j += 1
    if j == 5:  # show at most five examples
        break
Mis-classified negative images
It is difficult to intuitively explain why some of the images are misclassified (as shown in the previous section). One can conjecture that lightly stained images, such as 1) and 5), are more likely to be read as negative by our algorithm (yielding false negatives among the misclassified positive images), whereas slides with high internal heterogeneity are more likely to be falsely classified as cancerous (yielding the misclassified negative images). Therefore, it would be interesting to bring together human pathologists and data scientists to better illuminate the strengths and weaknesses of CNN-based histopathological analyses.
Due to limitations in computational power, we did not attempt hyperparameter tuning for our model. We conjectured that a deep convolutional neural network and a relatively large number of epochs (n = 120) should suffice to attain a reasonable approximation of the optimal model. Indeed, the internal diagnostics showed validation accuracy tapering off after roughly 80 epochs. Whether fine-tuning the learning rate can improve model performance remains to be examined. Similarly, we did not experiment with making different VGG16 blocks trainable, the effect of which also remains to be examined.