
Using Machine Learning Predictive Models for Breast Cancer Diagnosis

Sep 27, 2016 | Blog

I am an 11th-grade student at Monta Vista High School in Cupertino, CA.

Machine Learning is an up-and-coming field with many applications in finance, healthcare, and technology. I used my internship this summer at ScoreData, a very early-stage startup in Palo Alto, CA, to explore different applications of Machine Learning (ML) in healthcare. ScoreData has a cloud-based ML platform called ScoreFast™ with an interface that was easy to explore.

My first step was to get familiar with the ScoreFast™ platform. I used the public datasets on the platform to build models, understand them, and make predictions. I had to understand training/validation datasets, the ROC curve, the Confusion Matrix (CM), variable importance, and, at a high level, what each algorithm does. I also wrote JavaScript and Python code to test the REST Predict API. After feeling comfortable with the platform, I wanted to do a specific project end-to-end. I searched online for public datasets and found two on the UCI Machine Learning Repository: the Wisconsin Breast Cancer dataset and the Pima Indian Diabetes dataset. Both datasets were interesting to me and had enough data points for building a machine learning model.

Breast Cancer Project

The Wisconsin Breast Cancer (WBC) dataset was created in 1992 by Dr. William H. Wolberg by examining fine needle aspirations of cells. It consists of 10 attributes (features) for each of 699 patients. The features are: Sample code number (unique id of the patient), Clump Thickness (1-10), Uniformity of Cell Size (1-10), Uniformity of Cell Shape (1-10), Marginal Adhesion (1-10), Single Epithelial Cell Size (1-10), Bare Nuclei (1-10), Bland Chromatin (1-10), Normal Nucleoli (1-10), Mitoses (1-10), and Diagnosis (0 for benign, 1 for malignant). The numbers in parentheses (1-10) represent the range of values. Please refer to [1] for a more detailed description.

Here are the steps I followed:

Data Preparation:

Since the dataset was small enough, I used Excel to look at the data. I did some pivoting to understand the data better with respect to the target variable. I uploaded the data to the ScoreFast™ platform, performed a data quality audit, and received a quality score of 81.3%. I also looked at various statistical properties of each feature, such as the mean, standard deviation, and a histogram, to try to understand the data better. Understanding the data is a very important part of building any model.
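
As an illustration (this is not the ScoreFast™ audit itself), the same kind of exploration could be sketched in Python with pandas; the file name and column names below are my own assumptions based on the feature list above.

```python
import pandas as pd

# Assumed local copy of the UCI file; column names follow the feature list above
columns = [
    "sample_code", "clump_thickness", "cell_size_uniformity",
    "cell_shape_uniformity", "marginal_adhesion",
    "single_epithelial_cell_size", "bare_nuclei", "bland_chromatin",
    "normal_nucleoli", "mitoses", "diagnosis",
]
df = pd.read_csv("breast-cancer-wisconsin.data", names=columns, na_values="?")

# The raw UCI file codes benign as 2 and malignant as 4; recode to 0/1 as described above
df["diagnosis"] = df["diagnosis"].map({2: 0, 4: 1})

# Basic statistics (mean, standard deviation, quartiles) for every feature
print(df.describe())

# Average of each feature by diagnosis, similar to pivoting in Excel
print(df.groupby("diagnosis").mean())
```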

Model Building:

Building a machine learning model requires three chunks of data: training, validation, and test. I used the split functionality of the ScoreFast™ platform to split the WBC dataset into training/validation/test sets using a 50/30/20 ratio. I used the ScoreFast™ platform to build the four models shown below.
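
A rough equivalent of that split (the ScoreFast™ split function itself is not shown here) could be sketched with scikit-learn, reusing the DataFrame from the earlier sketch; the variable names are my own.

```python
from sklearn.model_selection import train_test_split

# Drop the id column; the target is the 0/1 diagnosis
features = df.drop(columns=["sample_code", "diagnosis"])
target = df["diagnosis"]

# 50% training, then split the remaining 50% into 30%/20% of the total
# (0.6 of the remainder = 30% overall)
X_train, X_rest, y_train, y_rest = train_test_split(
    features, target, train_size=0.5, random_state=42, stratify=target)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, train_size=0.6, random_state=42, stratify=y_rest)
```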

Figure 1: Models in ScoreFast™ Platform

All the models have a good AUC (Area Under the Curve) and accuracy. Although I could build the models easily using the ScoreFast™ console, I wanted to explore writing some code, so I wrote a Python program to build the GBM (Gradient Boosting Machine) model.
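
My actual program used the platform; as a stand-in, here is a minimal sketch of training a GBM on the same split with scikit-learn's GradientBoostingClassifier and reporting the validation AUC. This is only an illustration under the assumptions of the earlier sketches, not the ScoreFast™ code.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Fill the few missing Bare Nuclei values with the training median, then fit a GBM
median = X_train.median()
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gbm.fit(X_train.fillna(median), y_train)

# AUC on the validation set
valid_scores = gbm.predict_proba(X_valid.fillna(median))[:, 1]
print("Validation AUC:", roc_auc_score(y_valid, valid_scores))

# Variable importance from the fitted model (compare with Figure 3 below)
for name, importance in zip(features.columns, gbm.feature_importances_):
    print(f"{name}: {importance:.3f}")
```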

During the process, I found a research paper [1] whose authors used the same data and compared the performance of various classification algorithms. The following graph shows the model accuracy of 9 models: the first 5 are from the paper [1] and the last 4 were built using the ScoreFast™ platform. All four of my models performed as well as or better than the SMO model (Weka's SMO SVM) used in [1].

Figure 2: Model accuracy of different algorithms

Compared to the multi-classifiers described in the research paper [1], our models did just as well, if not better. In particular, our Generalized Linear Model (GLM) had a higher accuracy than all of the other models.

Predict

By clicking a button, the models were pushed into the run-time system of ScoreFast™, ready to test. There is a simple UI for checking predictions, but I decided to use the API to test the data. After I configured the model, I wrote a simple call to access it.

(Screenshot: sample call to the REST Predict API)
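
The actual endpoint, credentials, and field names come from the ScoreFast™ model configuration page and are not reproduced here; the snippet below is only an illustrative sketch of the kind of call I made, with a hypothetical URL and API key.

```python
import requests

# Hypothetical endpoint and key; the real values come from the platform's model configuration
PREDICT_URL = "https://api.example.com/v1/predict"
API_KEY = "YOUR_API_KEY"

# One patient's feature values (ranges 1-10, as described above)
payload = {
    "clump_thickness": 5,
    "cell_size_uniformity": 1,
    "cell_shape_uniformity": 1,
    "marginal_adhesion": 1,
    "single_epithelial_cell_size": 2,
    "bare_nuclei": 1,
    "bland_chromatin": 3,
    "normal_nucleoli": 1,
    "mitoses": 1,
}

response = requests.post(
    PREDICT_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=10,
)
print(response.json())  # e.g. the predicted class and its probability
```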

I scored the model with the 20% dataset (also called a test or hold-out dataset), and the following table shows the confusion matrix at threshold = 0.4081. The Confusion Matrix (CM) is one of the ways to get a good idea of the impact of model scoring for classification problems. The CM is different at different thresholds.

(Screenshot: confusion matrix on the test set at threshold = 0.4081)

Both precision and recall are very good at this threshold. For healthcare-related models, achieving high recall is very important, since a false negative means a malignant case would be labeled benign. A good discussion is captured in this blog.
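
To make the threshold idea concrete, here is a sketch of computing the confusion matrix, precision, and recall at a chosen cutoff on the held-out 20% test set, again using the scikit-learn stand-in model from the earlier sketches rather than ScoreFast™ itself.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

THRESHOLD = 0.4081  # the cutoff reported above

# Score the held-out 20% test set and apply the threshold to the malignant probability
test_scores = gbm.predict_proba(X_test.fillna(median))[:, 1]
test_pred = (test_scores >= THRESHOLD).astype(int)

# Rows are actual benign/malignant, columns are predicted benign/malignant
print(confusion_matrix(y_test, test_pred))
print("Precision:", precision_score(y_test, test_pred))
print("Recall:", recall_score(y_test, test_pred))
```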

Figure 3: Variable Importance

Building a dashboard

When I began to build the simple UI for the model, I initially made it a client-side script that used jQuery to send the POST request to the REST API, but after some research I realized I would need a server-side script to send this request. I decided to use Node.js to create the server and handle all GET/POST requests. The UI is very simple; I used MaterializeCSS to create a material form for entering the different features for diagnosis. Once users have entered the features, they can click submit and are taken to a new page where the diagnosis is shown as either benign or malignant. As an added feature, it shows the model's confidence in the diagnosis so that the user knows how sure the model was.

Figure 4: Predictor form

Response

Figure 5: Breast Cancer Predictor Response

Alongside the breast cancer diagnosis model, I also developed a diabetes predictor model. I used the Pima Indian Diabetes dataset to train the model on the ScoreFast™ platform and also created a web form that anyone can fill out to get a prediction of whether they might have diabetes. Note that both the diabetes and cancer datasets are limited and not comprehensive. I used public datasets, which are good for study but not for actual use. Always consult your doctor.

Summary

Learning to apply ML algorithms to different medical issues and creating something that could potentially be used in a hospital was an amazing experience. I am very thankful to Prasanta Behera from ScoreData for this wonderful opportunity to create and learn. As a next step, I plan to learn other concepts such as unsupervised learning and deep learning during the coming year. I will be sharing what I learn.

[1] Salama, G., Abdelhalim, M., & Zeid, M. (2012). Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers. International Journal of Computer and Information Technology (ISSN 2277-0764), September 2012.