Using Machine Learning Predictive Models for Breast Cancer Diagnosis
I am an 11th grade student at Monta Vista High School in Cupertino, CA.
Machine Learning is an up-and-coming field with many applications in finance, healthcare, and technology. This summer I used my internship at ScoreData, a very early-stage startup in Palo Alto, CA, to explore the different applications of Machine Learning (ML) in healthcare. ScoreData has a cloud-based ML platform called ScoreFast™ with an interface that was easy to explore.
Breast Cancer Project
The Wisconsin Breast Cancer (WBC) dataset was created in 1992 by Dr. William H. Wolberg by examining fine needle aspirations of cells. It consists of 10 attributes plus a diagnosis label for 699 patients. The attributes are: Sample code number (unique id of the patient), Clump Thickness (1-10), Uniformity of Cell Size (1-10), Uniformity of Cell Shape (1-10), Marginal Adhesion (1-10), Single Epithelial Cell Size (1-10), Bare Nuclei (1-10), Bland Chromatin (1-10), Normal Nucleoli (1-10), and Mitoses (1-10). The label is Diagnosis (0 for benign, 1 for malignant). The numbers in parentheses (1-10) give the range of values. Please refer to  for a more detailed description.
Here are the steps I followed:
Since the dataset was small enough, I used Excel to look at the data. I did some pivoting to understand the data better with respect to the target variable. I uploaded the data to the ScoreFast™ platform, performed a data quality audit, and received a quality score of 81.3%. I also looked at statistical properties of each feature, such as the mean, standard deviation, and histogram, to understand the data better. Understanding the data is a very important part of building any model.
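The per-feature statistics mentioned above can also be computed in a few lines of code. Here is a minimal sketch using only the Python standard library. The function name `summarize` and the sample values are made up for illustration; they are not taken from the WBC dataset or from the ScoreFast™ platform.

```python
from collections import Counter
from statistics import mean, stdev


def summarize(values):
    """Return the mean, sample standard deviation, and value counts of one feature."""
    return {
        "mean": mean(values),
        "std": stdev(values),
        "histogram": dict(sorted(Counter(values).items())),
    }


# Made-up Clump Thickness values on the dataset's 1-10 scale (illustrative only).
clump_thickness = [5, 5, 3, 6, 4, 8, 1, 2, 2, 4]
summary = summarize(clump_thickness)

print(f"mean={summary['mean']:.2f} std={summary['std']:.2f}")
for value, count in summary["histogram"].items():
    # One '#' per observation gives a quick text histogram.
    print(f"{value:>2} {'#' * count}")
```

Running the same function over every feature column gives a quick picture of which values are common and which are rare.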
Building a machine learning model requires three chunks of data: training, validation, and test. I used the split functionality of the ScoreFast™ platform to split the WBC dataset into training, validation, and test sets in a 50/30/20 ratio. I then used the platform to build the four models shown below.
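For readers who want to see what a 50/30/20 split involves, here is a minimal sketch in plain Python. It is not how the ScoreFast™ platform does it internally (that is not described here); the function name `split_rows`, the fixed seed, and the stand-in row ids are assumptions for illustration.

```python
import random


def split_rows(rows, ratios=(0.5, 0.3, 0.2), seed=42):
    """Shuffle rows and cut them into training, validation, and test lists."""
    shuffled = list(rows)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_valid = int(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    # The test set takes the remainder so that no row is dropped by rounding.
    test = shuffled[n_train + n_valid:]
    return train, valid, test


rows = list(range(699))  # stand-in row ids, one per patient record
train, valid, test = split_rows(rows)
print(len(train), len(valid), len(test))
```

Shuffling before cutting matters: if the file happens to be sorted in some way, an unshuffled split would give the three sets different mixes of benign and malignant cases.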
All the models have a good AUC (Area Under the Curve) and accuracy. Although I could build the models easily using the ScoreFast™ console, I wanted to explore writing some code, so I wrote a Python program to build the GBM (Gradient Boosting Machine) model.
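The program itself is not reproduced in this post. To give a rough idea of what building a GBM in Python can look like, here is a minimal sketch that uses scikit-learn's `GradientBoostingClassifier`. This is an assumption: it is not the ScoreFast™ API, and the hyperparameters are illustrative defaults. The data is synthetic (random integers from 1 to 10 with a made-up labelling rule) so that the sketch runs end to end; it is not the WBC dataset, and its scores say nothing about the real models.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score


def fit_and_evaluate_gbm(X_train, y_train, X_valid, y_valid, seed=0):
    """Fit a GBM on the training split and score it on the validation split."""
    model = GradientBoostingClassifier(
        n_estimators=100, learning_rate=0.1, max_depth=3, random_state=seed
    )
    model.fit(X_train, y_train)
    # Column 1 of predict_proba is the probability of class 1 (malignant).
    prob_malignant = model.predict_proba(X_valid)[:, 1]
    auc = roc_auc_score(y_valid, prob_malignant)
    acc = accuracy_score(y_valid, model.predict(X_valid))
    return model, auc, acc


# Synthetic stand-in data (NOT the WBC dataset): nine integer features in 1-10,
# labelled 1 when the row mean is high, just so the example can run.
rng = np.random.default_rng(0)
X = rng.integers(1, 11, size=(600, 9))
y = (X.mean(axis=1) > 5.5).astype(int)

model, auc, acc = fit_and_evaluate_gbm(X[:400], y[:400], X[400:], y[400:])
print(f"validation AUC={auc:.3f}, accuracy={acc:.3f}")
```

With the real data, `X` would hold the nine cell measurements (leaving out the sample code number) and `y` the diagnosis column.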
During the process, I found a research paper (Salama et al., 2012) in which the authors used the same data and compared the performance of various classification algorithms. The following graph shows the accuracy of 9 models: the first 5 are from the paper and the last 4 were built using the ScoreFast™ platform. All four of my models performed as well as or better than the SMO model (Weka's SMO SVM model) used in the paper.
Compared to the multi-classifiers described in the paper, our models did just as well, if not better. Specifically, our Generalized Linear Model (GLM) had a higher accuracy than all of the other models.
With the click of a button, the models were pushed into the ScoreFast™ run-time system, ready to test. There is a simple UI for checking a model, but I decided to use the API to test the data. After I configured the model, I wrote a simple call to access it.
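The actual ScoreFast™ API is not documented in this post, so the following is only a sketch of the general shape of such a call: a JSON body with the nine features sent as a POST request. The URL, the field names, the `Authorization` header, and the function name `build_score_request` are all hypothetical placeholders, not the real ScoreFast™ interface.

```python
import json
import urllib.request

# HYPOTHETICAL endpoint and field names; the real ones are not given in this post.
SCORING_URL = "https://example.invalid/api/score"
FEATURE_NAMES = [
    "clump_thickness", "uniformity_of_cell_size", "uniformity_of_cell_shape",
    "marginal_adhesion", "single_epithelial_cell_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses",
]


def build_score_request(feature_values, api_key):
    """Build (but do not send) a JSON POST request for one patient record."""
    if len(feature_values) != len(FEATURE_NAMES):
        raise ValueError("expected one value per feature")
    if not all(1 <= value <= 10 for value in feature_values):
        raise ValueError("every feature must be in the range 1-10")
    body = json.dumps(dict(zip(FEATURE_NAMES, feature_values))).encode("utf-8")
    return urllib.request.Request(
        SCORING_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_score_request([5, 1, 1, 1, 2, 1, 3, 1, 1], api_key="YOUR_KEY")
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it; that step is left out here
# because the URL above is only a placeholder.
```

Checking the value ranges before sending keeps obviously bad input from ever reaching the model.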
I scored the model with the 20% dataset (also called a test or hold-out dataset), and the following table shows the confusion matrix at threshold = 0.4081. The Confusion Matrix (CF) is one of the ways to get a good idea of the impact of model scoring for classification problems. The CF is different at different thresholds.
Both precision and recall are very good at this threshold. For healthcare-related models, getting high recall is very important, because a false negative here means a malignant case reported as benign. A good discussion is captured in this blog.
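To make the effect of the threshold concrete, here is a small sketch that counts the confusion matrix cells at a given threshold and derives precision and recall from them. The labels and scores are made up for illustration and are not the model's real output; only the threshold 0.4081 comes from the text above.

```python
def confusion_at_threshold(y_true, scores, threshold):
    """Count TP/FP/TN/FN when scores at or above the threshold are called malignant."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


def precision_recall(cm):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted_positive = cm["tp"] + cm["fp"]
    actual_positive = cm["tp"] + cm["fn"]
    precision = cm["tp"] / predicted_positive if predicted_positive else 0.0
    recall = cm["tp"] / actual_positive if actual_positive else 0.0
    return precision, recall


# Made-up labels (1 = malignant) and model scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.95, 0.60, 0.35, 0.45, 0.10, 0.05, 0.80, 0.30]

for threshold in (0.3, 0.4081, 0.7):
    cm = confusion_at_threshold(y_true, scores, threshold)
    precision, recall = precision_recall(cm)
    print(f"threshold={threshold}: {cm} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, lowering the threshold catches every malignant case at the cost of more false alarms, while raising it does the opposite. That trade-off is why the threshold has to be chosen with the cost of a missed diagnosis in mind.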
Building a dashboard
When I began to build the simple UI for the model, I initially made it a simple client-side script that used jQuery to send the POST request to the REST API, but after some research I realized I would need a server-side script to send this request. I decided to use Node.js to create the server and handle all GET/POST requests. The UI is very simple; I used MaterializeCSS to create a material form for entering the different features for diagnosis. Once the user has entered the features, they can click submit and will be taken to a new page where the sample is diagnosed as either benign or malignant. As an added feature, the page shows the model's confidence in the diagnosis so that the user knows how sure the model was.
Alongside the breast cancer diagnosis model, I also developed a diabetes prediction model. I used the Pima Indians Diabetes dataset to train the model on the ScoreFast™ platform and also created a web form that anyone can fill out to find out whether they have diabetes. Note that both the diabetes and cancer datasets are limited and not comprehensive. I used public datasets, which are good for study but not for actual use. Always consult your doctor.
Learning to apply ML algorithms to different medical issues and creating something that could potentially be used in a hospital was an amazing experience. I am very thankful to Prasanta Behera from ScoreData for this wonderful opportunity to create and learn. As a next step, I plan to learn other concepts such as unsupervised learning and deep learning during the coming year. I will be sharing what I learn.
Salama, G., Abdelhalim, M., Zeid, M. (2012). Breast Cancer Diagnosis on Three Different Datasets Using Multi-Classifiers. International Journal of Computer and Information Technology (2277-0764), Sept. 2012.