Behavior of Various Machine Learning Models in the Face of Noisy Data

Michael D. Blechner, M.D.

MIT HST.951 Final project

Fall 2005

 

Abstract

Although a great deal of attention has been focused on the future potential for molecular-based cancer diagnosis, histologic examination of tissue specimens remains the mainstay of diagnosis. The process of histologic diagnosis entails the identification of visual features from a slide, followed by the recognition of a feature pattern to which the case belongs. The combination of image analysis and machine learning imitates this process and in certain circumstances may be able to aid the pathologist. However, there is a great deal of variability and noise inherent in such an approach. Therefore, a classification model developed from data at one institution is likely to perform acceptably at other institutions only if the model can handle such variability. This paper compares the performance of machine learning models based on fuzzy rules (FR), fuzzy decision trees (FDT), artificial neural networks (aNN) and logistic regression (LR), and examines how these models behave in the face of noisy and variant data. Results suggest that FDT models may be more resistant to data noise.

 

 

Background

Although a great deal of attention has been focused on the future potential of molecular-based cancer identification, histologic examination of tissue specimens remains the mainstay of diagnosis. The process of histologic diagnosis entails the identification of visual features from a slide, followed by the recognition of a feature pattern to which the case belongs. The pattern is associated with a high or low probability of cancer. For example, a pathologist examining a breast biopsy may identify breast epithelial cells with large, irregularly shaped nuclei and irregularly clumped chromatin, growing in poorly arranged sheets and invading the surrounding connective tissue with an associated fibrotic reaction. These findings compose a pattern that is highly correlated with malignancy and would warrant such a diagnosis.

 

Imaging equipment and image analysis software can partially, and perhaps eventually completely, automate the process of feature extraction.1, 2 Given a list of previously identified visual features for a large number of cases, machine learning techniques can be used to discern patterns relevant to separating malignant from benign cases. The process of discerning such patterns from data results in a model of the domain. Diagnostic predictions can be made by applying such models to the data generated from new cases.

 

Wolberg et al. demonstrated the correspondence between human histologic diagnosis and the combined techniques of image analysis and machine learning, using the cytologic diagnosis of breast cancer for illustration.3 Breast cancer is the most common cancer in women and the second leading cause of female cancer deaths. Cancer screening involves mammography followed by tissue sampling and histologic examination of any mammographically worrisome area. Tissue samples are also obtained without mammography in the setting of palpable breast lumps. Initial tissue sampling in either situation is typically by needle core biopsy or fine needle aspiration (FNA). Core biopsy provides more tissue and retains tissue architecture for evaluation, while FNA typically yields a smaller sample and destroys or severely alters the tissue architecture. Although more invasive, core biopsy is the initial tissue procurement technique of choice in most situations. However, FNA is less invasive, can be performed in the physician’s office at a moment’s notice and is less expensive, and it is therefore still widely used. In addition, FNA is used more extensively for cancer diagnosis and screening in many other organ systems.

 

The histologic features used to diagnose breast cancer fall into 2 major categories: architectural and cytologic. Architectural features describe how groups of cells relate to one another and to the surrounding connective tissue. They include characteristics such as the presence or absence of irregular, distorted or excessively cellular glands, too many glandular structures, and single epithelial cells invading into connective tissue. By and large, these features cannot be reliably ascertained in FNA specimens. Cytologic features describe characteristics of single cells and include cell size, nuclear size, nuclear membrane irregularity and nuclear chromatin distribution, to name a few. The FNA diagnosis of breast cancer is largely based on the nuclear cytologic features of increased nuclear size, nuclear membrane irregularity and irregularity of chromatin distribution. These features are relatively easily assessed by FNA.

 

Wolberg and his colleagues examined 569 breast FNA specimens.3 Semi-automated image analysis techniques were applied to digital photomicrographs taken from each case. The image analysis process identified the nuclear outline of 10-20 human-selected cells within each image. Given a rough estimate of the location of a cell nucleus, image analysis techniques used variations in pixel values to automatically identify a nuclear contour. For each nucleus, the nuclear outline and the pixel values within the nucleus were used to calculate the following 10 values: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. These attributes are all representations of the 3 key attributes mentioned above: nuclear size, nuclear membrane irregularity and irregularity of chromatin distribution. The values from each of the 10-20 selected cells were used to calculate the mean and standard error for each variable within each case. In addition, the three worst (largest) values within a case were used to calculate a worst mean value for each attribute. The resulting data set consists of 30 variables for 569 cases. A 31st variable is the class assignment of benign or malignant, based on the pathologist’s final cytologic diagnosis, which was confirmed by subsequent histologic examination of any additional biopsies as well as by clinical follow-up. The data set includes 212 cases of cancer and 357 cases of benign breast changes.
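
The per-case summary described above can be sketched in a few lines of Python. This is only an illustration, not the original authors' code: the function name summarize_case and the sample measurements are hypothetical, and the standard error is assumed here to be the sample standard deviation divided by the square root of the number of nuclei.

```python
import math
import statistics

def summarize_case(values):
    """Return (mean, standard error, worst) for one attribute of one case.

    `values` holds one measurement per selected nucleus.
    "Worst" is the mean of the three largest measurements.
    """
    n = len(values)
    mean = statistics.fmean(values)
    # Assumed definition: standard error of the mean = sample SD / sqrt(n).
    std_err = statistics.stdev(values) / math.sqrt(n)
    # Mean of the three largest values in the case.
    worst = statistics.fmean(sorted(values, reverse=True)[:3])
    return mean, std_err, worst

# Hypothetical nuclear radii for one case (shortened to 5 nuclei for readability;
# the study used 10-20 nuclei per case).
radii = [10.0, 12.0, 14.0, 16.0, 18.0]
mean, std_err, worst = summarize_case(radii)
```

Applying such a function to each of the 10 nuclear attributes would yield the 30 variables per case.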

 

Wolberg and his colleagues subsequently applied 2 supervised machine learning algorithms to the data and then evaluated the diagnostic performance of these models. The algorithms used were logistic regression and a decision tree algorithm known as Multisurface Method-Tree (MSM-T). In order to avoid over-fitting the training data, a stepwise approach was used to select 3 of the 30 variables, one to represent each of nuclear size, texture and shape. The attributes worst area, worst smoothness and mean texture demonstrated a classification accuracy of 96.2% using logistic regression and 97.5% using MSM-T. Both results represent averages from 10-fold cross validation.
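
For readers unfamiliar with k-fold cross validation, the following minimal Python sketch shows how a 10-fold average accuracy is computed. It does not reproduce logistic regression or MSM-T: the model is a deliberately trivial majority-class baseline used only as a placeholder, and the function names and the ordering of the label list are assumptions made for illustration. Only the class counts (357 benign, 212 malignant) come from the text.

```python
def k_fold_indices(n_cases, k=10):
    """Yield (train_idx, test_idx) pairs that partition range(n_cases) into k folds."""
    indices = list(range(n_cases))
    for fold in range(k):
        test_idx = indices[fold::k]  # every k-th case, offset by the fold number
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

def cross_validated_accuracy(labels, fit_predict, k=10):
    """Average held-out accuracy over k folds.

    fit_predict(train_idx, test_idx) must return one predicted label per test index.
    """
    fold_accuracies = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        predicted = fit_predict(train_idx, test_idx)
        correct = sum(p == labels[i] for p, i in zip(predicted, test_idx))
        fold_accuracies.append(correct / len(test_idx))
    return sum(fold_accuracies) / k

# Class counts from the text; the ordering of cases here is hypothetical.
labels = ["B"] * 357 + ["M"] * 212

def majority_baseline(train_idx, test_idx):
    """Placeholder model: always predict the most common training label."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return [majority] * len(test_idx)

baseline_accuracy = cross_validated_accuracy(labels, majority_baseline)
```

Any real classifier can be substituted for majority_baseline as long as it is fitted only on the training indices of each fold.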

 

Although their purpose was not to develop an actual diagnostic technique for laboratory use, the general idea of combining image analysis and machine learning can be used for the automation of visual classification tasks in medicine. However, there is a great deal of variability and noise inherent in such an approach. The optical components of imaging equipment would likely vary from one laboratory to another, resulting in variability in image capture that could alter the results of feature extraction. Different image analysis software would likely add further variability. Even if the imaging equipment and software were standardized, differences in tissue processing from one lab to the next would result in significant variability. For example, the use of different varieties and concentrations of tissue fixatives, as well as variations in fixation times, can significantly alter nuclear size and staining of chromatin. In addition, biological variability, even within cancer of a single tissue type like breast epithelium, generates a great deal of variability in the histologic features. Therefore, a prediction model developed from data at one institution is likely to perform acceptably at other institutions only if the model can handle this variability.

 

Fuzzy logic is an extension of Boolean logic that replaces binary truth values with degrees of truth. It was introduced in 1965 by Prof. Lotfi Zadeh at the University of California, Berkeley.4 Since fuzzy logic allows for set membership values between 0 and 1, it can arguably provide a more realistic representation of biologic and image analysis data, which are inherently noisy and imprecise. Fuzzy logic provides a way to arrive at a definitive classification decision based on such ambiguous data.
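
As a small illustration of degrees of truth, the Python sketch below defines a simple ramp-shaped membership function and the standard Zadeh operators (minimum for AND, maximum for OR, complement for NOT). The cutoff values, the example rule and the 0.5 decision threshold are hypothetical and are not taken from the models in this study.

```python
def ramp_membership(x, low, high):
    """Degree (0 to 1) to which x belongs to a fuzzy set such as "large".

    Membership is 0 at or below `low`, 1 at or above `high`, and linear in between.
    """
    if x <= low:
        return 0.0
    if x >= high:
        return 1.0
    return (x - low) / (high - low)

def fuzzy_and(a, b):
    return min(a, b)  # Zadeh intersection

def fuzzy_or(a, b):
    return max(a, b)  # Zadeh union

def fuzzy_not(a):
    return 1.0 - a  # Zadeh complement

# Hypothetical rule: IF area is large AND texture is coarse THEN malignant.
area_is_large = ramp_membership(750.0, low=500.0, high=1000.0)
texture_is_coarse = ramp_membership(26.0, low=10.0, high=30.0)
malignant_degree = fuzzy_and(area_is_large, texture_is_coarse)

# A crisp decision is obtained by thresholding the degree (threshold assumed).
is_malignant = malignant_degree >= 0.5
```

A Boolean rule would force each measurement to be either "large" or "not large"; the fuzzy version lets a borderline value contribute partially to the conclusion.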

 

This paper compares the performance of machine learning models based on fuzzy rules (FR), fuzzy decision trees (FDT), artificial neural networks (aNN) and logistic regression (LR). The study hypothesizes that fuzzy-logic-based modeling approaches will exhibit significantly more stable classification performance with increasingly noisy test data. All models were built using an identical training set and evaluated on an unaltered holdout test set as well as multiple versions of the same test set distorted with noise to simulate variance from image analysis and biologic variance.
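
The evaluation design can be sketched as follows, using the split sizes and noise levels given under Data pre-processing in Materials & Methods (first 380 cases for training, remaining 189 for testing, zero-mean Gaussian noise with six standard deviations). This is a minimal sketch, not the code used in the study: the function names, the random seed and the stand-in data are all hypothetical.

```python
import random

# Standard deviations of the zero-mean Gaussian noise, one per noisy test set.
NOISE_SDS = [0.001, 0.01, 0.1, 1, 10, 100]

def split_cases(cases, n_train=380):
    """First n_train cases form the training set; the rest form the test set."""
    return cases[:n_train], cases[n_train:]

def add_gaussian_noise(cases, sd, rng):
    """Return a copy of `cases` with independent N(0, sd) noise added to every value."""
    return [[value + rng.gauss(0.0, sd) for value in case] for case in cases]

rng = random.Random(0)  # fixed seed (assumed) so the sketch is repeatable

# Hypothetical stand-in data: 569 cases with 3 attributes each.
cases = [[rng.uniform(0.0, 1.0) for _ in range(3)] for _ in range(569)]

train_set, test_set = split_cases(cases)
noisy_test_sets = {sd: add_gaussian_noise(test_set, sd, rng) for sd in NOISE_SDS}
```

Note that, as described, the same absolute standard deviation is applied to every variable regardless of that variable's scale. The training set and the original test set are left unmodified.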

 

 

Materials & Methods

Data set: The Wisconsin Diagnostic Breast Cancer (WDBC) dataset was obtained from the UCI Machine Learning Repository.A The dataset was created by Wolberg, Street and Mangasarian and consists of data from 569 breast FNA cases, each with 30 descriptive attributes and one binary classification variable (benign or malignant). The descriptive attributes were obtained by semi-automated image analysis applied to digital photomicrographs of the FNA slides. The case distribution includes 357 cases of benign breast changes and 212 cases of malignant breast cancer. The descriptive attributes are recorded with four significant digits and include the nuclear radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. The mean, standard error and mean of the worst 3 measurements are recorded for each of these ten attributes, for a total of 30 variables. There are no missing attribute values.

 

Data pre-processing: The original dataset was divided into a training set containing the first 380 cases and a test set consisting of the remaining 189 cases. Models were constructed and tested using both the full 30-variable dataset and a limited dataset consisting of only the 3 variables used in the Wolberg models (worst area, worst smoothness and mean texture). Six additional test sets were created by adding increasing amounts of noise to the original test set data. The noise for each variable in each case was drawn at random from a normal distribution with a mean of zero and a standard deviation of 0.001, 0.01, 0.1, 1, 10 and 100 for the six increasingly noisy datasets, respectively. These six datasets were intended to simulate the noise that might result from variability in image analysis and tissue processing. In the results and discussion, these test datasets will be referred to as “noisy”