

Università degli Studi di Pisa

DIPARTIMENTO DI FISICA
Corso di Laurea Magistrale in Fisica

CNN (Convolutional Neural Network) based classifier for breast density assessment

Candidate: Francesca Lizzi

Supervisor: Prof. M. Evelina Fantacci


To that which gives the courage to overcome frontiers, architectural barriers and the fear of illness. To the love of life, to my friend Claudia.


Contents

Introduction

1 Mammography and density standard
  1.1 X-ray spectra: anode and filters
  1.2 Photon interactions with breast tissues and X-ray energy range
  1.3 Imaging Systems
  1.4 RADIOMA project
  1.5 Mammographic density standard
    1.5.1 BIRADS guidelines for breast density assessment

2 Materials and methods
  2.1 Image selection
  2.2 Dataset distribution
  2.3 Neural Networks and Deep Learning
  2.4 Neural networks: lexicon and first definitions
    2.4.1 Learning algorithm
    2.4.2 Overfitting, underfitting and regularization
    2.4.3 Gradient-based learning
    2.4.4 The “horse problem”
    2.4.5 Convolutional Neural Network

3 CNN development and data analysis
  3.1 Two-class classifier: A and D class
    3.1.1 Accuracy and loss trend over input image size
    3.1.2 Accuracy and loss trend over the mammographic projections
  3.2 Two-class classifier: dense/non-dense breast
    3.2.1 Accuracy trend over input image size
    3.2.2 Accuracy trend over projections
  3.3 Four-class CNN-based classifier
    3.3.1 Accuracy trend over projections

4 Results and discussion
  4.1 Ground truth
  4.2 Dataset
  4.4 Validation
  4.5 Further improvements

Conclusions


Acronyms

ALARA - As Low As Reasonably Achievable (principle)
ANN - Artificial Neural Network
AOUP - Azienda Ospedaliero Universitaria Pisana
ATNO - Azienda USL Toscana Nord Ovest
BIRADS - Breast Imaging Reporting and Data System
CIFAR - Canadian Institute For Advanced Research
CNN - Convolutional Neural Network
CSAE - Convolutional Sparse AutoEncoder
DCIS - Ductal Carcinoma in Situ
FFDM - Full Field Digital Mammography
ICRP - International Commission on Radiological Protection
ILSVRC - ImageNet Large Scale Visual Recognition Challenge
INFN - Istituto Nazionale di Fisica Nucleare
MLP - MultiLayer Perceptron
RADIOMA - RADiazioni IOnizzanti in MAmmografia
SGD - Stochastic Gradient Descent


Introduction

In this master thesis, an automatic method to classify mammograms into density classes (following the BIRADS standard) is presented [1]. The method is based on a Convolutional Neural Network and was validated on a self-collected clinical dataset.

Breast cancer is one of the most frequently diagnosed cancers all over the world: it has been estimated that one woman in eight will develop breast cancer in her life. It is also widely accepted that early diagnosis is one of the most powerful instruments we have in fighting this cancer [2]. For these reasons, in Tuscany, mammographic screening programs are performed every two years on asymptomatic women at risk, in the age range between 45 and 74 years. Full Field Digital Mammography (FFDM) is a non-invasive, highly sensitive method for early-stage breast cancer detection and diagnosis, and represents the reference imaging technique to explore the breast in a complete way [3]. Since mammography is a 2D X-ray imaging technique, it suffers from some intrinsic problems: a) breast structures overlap; b) malignant masses absorb X-rays similarly to benign ones; c) the sensitivity for detecting masses or microcalcification clusters is lower in denser breasts. In fact, a mammogram with a very high percentage of fibro-glandular tissue is less readable, because dense tissue has an X-ray absorption coefficient similar to that of cancer. Furthermore, to reach a sufficient sensitivity in a dense breast, a higher dose has to be delivered to the patient [4]. Moreover, breast density is itself a risk factor for developing cancer [5].

Screening programs present other aspects that should be considered, such as the high number of false positives and the low prevalence of cancer (<0.5%) in a screening population. Since many healthy women are called to participate in screening programs, dose delivery should be carefully controlled. Furthermore, the European Directive 59/2013/EURATOM states that patients must be well informed about the amount of radiation dose they receive. For these reasons, the RADIOMA project (“RADiazioni IOnizzanti in MAmmografia”, funded by Fondazione Pisa; partners: the “Dipartimento di Fisica” of University of Pisa, the Istituto Nazionale di Fisica Nucleare (INFN), the “Fisica Sanitaria” of the “Azienda Ospedaliero Universitaria Pisana” (AOUP) and the “Dipartimento di Ricerca Traslazionale e delle Nuove Tecnologie in Medicina e Chirurgia” of University of Pisa) was born with the aim of developing a personalized and reliable dosimetric quantitative index for mammographic examinations. The purpose of this master thesis was to build a breast density classifier in order to personalize the new dosimetric index according to breast density. Since the most used density standard has been established


by the American College of Radiology (ACR) in 2013, we decided to use those classes to train the classifier. This standard is defined in the Breast Imaging Reporting and Data System (BIRADS) Atlas [1] and is made of four qualitative classes: almost entirely fatty (“A”), scattered areas of fibroglandular density (“B”), heterogeneously dense (“C”) and extremely dense (“D”). Thanks to the screening programs, many mammograms can be collected and used for the development of analysis software.

In this work, a technique based on deep learning methods, specifically a Convolutional Neural Network (CNN), has been explored in order to build the classifier. In the last few years, deep learning-based methods have been successfully applied to a wide range of medical image analysis problems [6]. Since mammographic density assessment made by radiologists suffers from a non-negligible intra- and inter-observer variability, automatic classifiers have been produced with this technique in order to make the classification reproducible. The first problem in applying such techniques is the lack of large public mammogram datasets, which makes the comparison among different methods difficult. Furthermore, many previous approaches [7] use a two-step classification, which implies that the classification is not completely automatic: first, they extract features from the exams or apply a segmentation method and, afterwards, they train a classifier such as a Support Vector Machine or another deep learning method. In [8], Carneiro et al. trained a multi-layer perceptron on extracted texture and histogram features to produce a BIRADS classification. In [9], Fonseca et al. used a convolutional neural network only to extract features in order to train an SVM. In [10], Wu et al. trained a CNN to perform BIRADS classification on a self-collected dataset, obtaining good results.

Since deep learning needs a huge amount of data, the “Azienda Ospedaliero-Universitaria Pisana” (AOUP) collected about 2000 exams (each consisting of 4 images) from the Senology Department. The exams have been selected by a physician specialized in mammography and a radiology technician; the dataset has been anonymized and extracted from the AOUP database. Once the dataset was obtained, I first tried to solve a simpler classification problem with CNNs, using only two of the four BIRADS classes: the A class, made of the least dense breasts, and the D class, made of the densest breasts. After obtaining good results from this classifier, a more complex classifier has been built. Since one of the main problems in mammogram classification is to build a classifier able to discriminate between dense and non-dense breasts, I trained a CNN to solve this problem. In fact, some clinical decisions depend on the possible masking effect that dense tissue could produce on a mammogram. In the BIRADS density standard, this means classifying two classes: the first made of mammograms belonging to the A and B classes, and the second made of mammograms belonging to the C and D classes. Finally, I trained a Convolutional Neural Network with a residual architecture [11] to build a complete BIRADS classifier.

In the following text, in Chapter 1, an introduction about mammography and related issues is presented. In Chapter 2, the description of the collected dataset and an overview of deep learning methods are presented.
In Chapter 3, training and optimization of the CNN-based classifier and its performances are reported.


In Chapter 4, an evaluation of the results, a comparison with related works and future developments are presented.


Chapter 1

Mammography and density standard

Breast cancer is one of the most fatal cancers all over the world, and it has been demonstrated that early diagnosis reduces mortality [12]. In order to achieve early diagnosis, breast cancer screening programs are performed in developed countries, with mammography as the main investigation instrument. Mammography is a radiographic procedure optimized for breast examination, performed with X-rays of appropriate energy, which measures the X-ray attenuation through breast tissues. Breast cancer signs are:

• Morphology of the tumor mass, which includes irregular margins or spiculation (Figure 1.1, left).

• Mineral deposits of calcium hydroxyapatite or phosphate, which can be seen as little grains called microcalcifications (Figure 1.1, right).

• Architectural distortion of the normal breast pattern, which can be seen as straight lines radiating from a central area and retraction or bulging of a contour (Figure 1.2, left).

• Asymmetry in corresponding regions of the left and right breast (Figure 1.2, right).

To better visualize such signs, a mammogram has to show high contrast between breast structures and background; contrast is generated by differences in attenuation coefficients among different tissues. In Figure 1.3, the X-ray attenuation coefficients over energy are shown for the three main tissues in the breast: adipose tissue, fibroglandular tissue and infiltrating ductal carcinoma [13].

As energy increases, differences in attenuation between breast tissues decrease. Furthermore, as shown in Figure 1.3, the attenuation coefficients of fibroglandular tissue and cancer tissue are very similar; this similarity makes cancer detection difficult. In order to have sufficient diagnostic power, mammography needs high spatial resolution, especially to visualize the margins of masses: the irregularities on the edge of masses are of the order of magnitude of 50 µm.


Figure 1.1: On the left, a hyperdense mass with an irregular shape and a spiculated margin, which has been proved to be an invasive ductal carcinoma. On the right, microcalcification clusters which have been proved to be multifocal DCIS (Ductal Carcinoma In Situ) with areas of invasive carcinoma.

Figure 1.2: On the left, an example of an architectural distortion. On the right, an asymmetrical distortion between the left and right breast that has been proved to be adenocarcinoma.


Figure 1.3: Attenuation coefficient versus X-ray energy

Furthermore, breast tissue is radiosensitive. For this reason, in an optimal mammographic examination, and in particular in screening programs, the delivered dose should be kept as low as possible while maintaining a high diagnostic quality of the image.

1.1 X-ray spectra: anode and filters

To produce X-rays of the right energy, the X-ray tube is specifically designed for the task, and its combination with filters produces the required energy spectrum [14]. Mammographic X-ray tubes use rotating anodes. The most commonly used anode materials are molybdenum (Mo, Z=42), rhodium (Rh, Z=45) and tungsten (W, Z=74). The choice of these materials is due to their spectra, which are mainly made of bremsstrahlung radiation and characteristic X-rays specific to the target material. Characteristic X-rays are particularly important in mammography: the characteristic radiation energies are 17.5 and 19.6 keV for molybdenum and 20.2 and 22.7 keV for rhodium, which are in the range required for discriminating cancer from normal tissue. In Figure 1.4, the molybdenum spectrum at 25 kVp and 1 mGy of final air kerma is reported.

The low-energy bremsstrahlung X-rays deliver a high dose with little contribution to the clinical capability of the image. Furthermore, high-energy bremsstrahlung X-rays make the subject contrast decrease. For these reasons, filters are used to reduce the low- and high-energy bremsstrahlung photons. In Figure 1.5, the molybdenum spectrum with a 30 µm molybdenum filter is reported.


Figure 1.4: Molybdenum spectrum obtained with a simulation on https://health.siemens.com/booneweb/index.html

Figure 1.5: Molybdenum spectrum with filtration obtained with a simulation on https://health.siemens.com/booneweb/index.html

These filters are often made of the same material as the anode, i.e. molybdenum or rhodium, because they stop undesired X-rays and transmit characteristic X-rays. As shown in Figure 1.5, the molybdenum filter attenuates both the X-rays in the low energy range and those above its own K-absorption edge, while the characteristic X-rays pass through the filter with high efficiency. Since the atomic number of rhodium is higher than that of molybdenum, its spectrum is harder; thus rhodium anodes offer advantages for thicker and denser breasts. In Figure 1.6, the rhodium spectrum is reported. Multiple targets or filters are commonly included in modern X-ray tubes.

1.2 Photon interactions with breast tissues and X-ray energy range

As shown in Figure 1.3, the X-ray energy range used in mammography is different from that of other X-ray systems [3].


Figure 1.6: Rhodium spectrum obtained with a simulation on https://health.siemens.com/booneweb/index.html

In the energy range between 20 and 40 keV, the most important photon interactions occurring with breast tissues are the photoelectric effect and scattering processes. The photoelectric effect is predominant at energies below about 22 keV; it causes most of the energy absorption from the incident X-rays and hence is the main source of the breast dose. There are two scattering processes at these energies: Compton (incoherent) and Rayleigh (coherent). In Rayleigh scattering, no energy is transferred between the particles involved in the interaction, while in Compton scattering there is a transfer of energy between the interacting particles. Scattering is a source of degradation for mammographic images, because scattered photons deviate from their path and hit the detector in a false position. To prevent this degradation, anti-scatter grids are commonly used: since scattered photons are no longer parallel to the X-ray beam, they cannot easily pass through the grid, which is kept in movement in order to collect enough signal. Photon interactions between the incident X-rays and the breast tissues lead to the attenuation of the X-rays, and the different composition of breast tissues results in different X-ray attenuation. However, as shown in Figure 1.3, these differences are small: in particular, ductal carcinoma and fibroglandular tissue show very little difference in attenuation. The ratio between the percentage of fibroglandular tissue and fat tissue, as seen in a mammogram, defines breast density, and the small difference in attenuation between “dense” tissue and cancer is responsible for the “masking effect”: because of this small difference, dense tissue may sometimes “mask”, i.e. hide, a cancer. The attenuation differences decrease as X-ray energy increases; the highest difference is found at very low X-ray energies, and this is the reason why the energy range in mammography is substantially different from that of other X-ray systems. Another important factor of image quality is subject contrast, i.e. the contrast of an object of interest in the imaged scene with respect to the background. In X-ray imaging, it depends on the X-ray spectrum and on the attenuation of the object and the background. An adequate subject contrast is necessary to detect the small difference between fibroglandular tissue and cancer. In Figure 1.7, the subject contrast as a function of kVp is reported for a 5 cm thick breast of 50% adipose, 50% glandular composition with a 2 mm nodule, for three different spectra (Mo/Mo, Mo/Rh and Rh/Rh) [15].


Figure 1.7: Subject contrast as a function of kVp for a 5 cm thick breast of 50% adipose, 50% glandular composition with a 2 mm nodule, in three different spectra (Mo/Mo, Mo/Rh and Rh/Rh)

Subject contrast strongly depends on X-ray energy and decreases as energy increases. Both the differential attenuation and the subject contrast characteristics between normal and malignant tissues require that mammography operates at low energy for the best screening and diagnostic capabilities. As said above, the probability of the photoelectric effect is high at low energy, and a strong photoelectric contribution means an increase of both the absorbed dose and the exposure time for the patient.
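To make the role of the attenuation differences concrete, the following minimal sketch computes a Beer-Lambert subject contrast for a lesion embedded in background tissue. The attenuation coefficients and thickness are illustrative placeholders chosen only to mimic the trend of Figure 1.3, not measured values:

    import numpy as np

    # Sketch of subject contrast from the Beer-Lambert law. The coefficients
    # below are hypothetical placeholders, NOT measured data: at low energy
    # the tissue/lesion attenuation difference is larger, so the computed
    # contrast is higher.
    def subject_contrast(mu_bg, mu_lesion, lesion_thickness_cm):
        # Intensity ratio between a path through background tissue only and a
        # path where part of the background is replaced by the lesion.
        t_ratio = np.exp(-(mu_lesion - mu_bg) * lesion_thickness_cm)
        return 1.0 - t_ratio

    # Hypothetical linear attenuation coefficients (1/cm) for fibroglandular
    # tissue and carcinoma at a "low" and a "high" X-ray energy.
    print(subject_contrast(0.80, 0.90, 0.2))  # low energy: larger contrast
    print(subject_contrast(0.50, 0.52, 0.2))  # high energy: smaller contrast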

1.3 Imaging Systems

Digital mammography is the widely accepted method to perform screening programs. In digital mammography, X-rays are captured on a specially designed digital detector that converts them into an electronic signal. The digital image can be visualized on a high-resolution monitor, and the physician can use software tools to manipulate it. In this work, all the imaging systems used are digital mammography systems. A schematic mammographic system is shown in Figure 1.8.


Figure 1.8: Scheme of a mammographic system

The system is mainly made of the X-ray tube, a compression plate, a support for the breast, an anti-scatter grid and the detector. The goal of mammography is to achieve the image quality required for a given detection task, while keeping the absorbed dose As Low As Reasonably Achievable (ALARA principle). To achieve this goal, the mammographic unit is specifically designed for the examination of breast tissues. The patient may be examined standing or sitting, with her breast resting on a support plate. The X-ray tube and the support plate are mounted on a support which can rotate in order to achieve different projection angles. The two standard projections, shown in Figure 1.9, are the craniocaudal and the mediolateral oblique. An anti-scatter grid is placed between the breast support and the image receptor; it reduces scattering effects on the images. Compression is applied to the breast using a plastic compression plate: thanks to compression, overlapping of structures and motion artifacts are minimized. In this work, the mammographic systems used are: Giotto Image SDL [16], Selenia Dimensions [17], Senograph DS version ADS 54.11 and Senograph DS version ADS 53.40 [18].


Figure 1.9: On the left, a craniocaudal mammogram. On the right, a mediolateral oblique mammogram. These mammograms are taken from the AOUP database.

1.4 RADIOMA project

The breast is a highly radiosensitive organ. In fact, the International Commission on Radiological Protection (ICRP) has varied its estimate of the contribution of breast radiation exposure to total body detriment over time, changing the tissue-weighting factor for this organ from 0.05 in 1991 [19] to 0.12 in 2007 [20]. Thus, it is very important to take into account the dose absorbed by the breast during a mammographic exposure, trying to reduce the radiation dose absorbed by patients in every mammographic examination without impairing its diagnostic quality. This follows the guidelines contained in the European Directive 2013/59/EURATOM issued on 5 December 2013 [21], which sets challenging targets for all stakeholders in terms of justification and optimization of the procedures using ionizing radiation. In any mammographic quality assurance program, special care is required for evaluating and monitoring the radiation doses delivered to the breast. In addition, article 58 of the above-mentioned directive [21] requires that patients are informed about the risk associated with ionizing radiation and that detailed information on the exposure of the patient is included in the report of the radiological procedure. RADIOMA (RADiazioni IOnizzanti in MAmmografia) [22] is a project co-financed by “Fondazione Pisa”, whose partners are the “Dipartimento di Fisica” of University of Pisa, the “Istituto Nazionale di Fisica Nucleare” (INFN), the “Fisica Sanitaria” of AOUP (Azienda Ospedaliero Universitaria Pisana) and the “Dipartimento di Ricerca Traslazionale e delle Nuove Tecnologie in Medicina e Chirurgia” of University of Pisa, and which arose with the aim of transposing the European Directive 2013/59/EURATOM.


The final task of RADIOMA is to define a new accurate, reliable and reproducible dose index for mammography. This target is really important because mammographic screening programs expose many asymptomatic women to ionizing radiation. Furthermore, the Regional Council of Tuscany, following the guidelines of the Ministry of Health, approved on 06/09/2016 a resolution to widen by 10 years the age window of women called for breast screening: until then the age range was from 50 to 69 years, while now it goes from 45 to 74. This extension of the screening program started in 2017, with a gradual implementation over five years. Because the breast is a radiosensitive organ, a new personalized dosimetric index should take breast density into account: in fact, to reach a sufficient sensitivity, a higher dose is given to patients with dense breasts [4]. For this reason, I worked on this project in order to produce a classifier able to classify mammograms according to density. In order to achieve this goal, a public database of mammograms is going to be built. AOUP (Azienda Ospedaliero Universitaria Pisana), with the informed consent of the women, made available to us about 8000 images from its mammographic systems, classified by density. We decided to choose only negative mammograms: in fact, this software should be used in screening programs, in which less than 0.5% of patients result positive for cancer. Since the chosen dataset influences the behavior of the classifier, we tried to build a dataset as close as possible to a screening mammographic dataset. We also decided to begin a prospective collection of screening mammograms with the collaboration of the “Azienda USL Toscana Nord-Ovest” (ATNO), in order to increase the size of the sample in the future.


1.5 Mammographic density standard

The assessment of breast density is important to develop a personalized dose index in mammography. Furthermore, medical research towards the prevention of breast cancer has shown that breast parenchymal density is a strong indicator of cancer risk. Specifically, the risk of developing breast cancer is increased by only 5% in relation to mutations in the genetic biomarkers BRCA 1 and 2, while it is increased by 30% for breast densities higher than 50% [23] [24]. A higher breast density is also responsible for a lower sensitivity of mammograms, because dense tissue has about the same absorption coefficient as cancer. Defining and sharing a classification standard is a fundamental starting point in breast classification. In 1976, Wolfe [25] empirically defined four classes of density, showing some classified mammograms and describing a few of their features. Beyond the controversial efficiency of this classification method, Wolfe had the merit of laying the basis for studying the effective correlation between breast density and the increased risk of developing a cancer. Nowadays, the worldwide recognized standard has been established by the American College of Radiology (ACR) and is called the BIRADS Atlas (Breast Imaging-Reporting And Data System) [1]. These classes have been established to standardize mammographic reports, in order to reduce interpretative confusion on mammograms. In the previous BIRADS edition, published in 2003, the four density classes were identified with percentage indications as follows:

1. B1: the least dense breasts, with fibroglandular tissue less than 25%

2. B2: breasts with a percentage of fibroglandular tissue between 25% and 50%

3. B3: breasts with a percentage of fibroglandular tissue between 50% and 75%

4. B4: the densest breasts, with fibroglandular tissue more than 75%

The first difficulty in this density standard is the definition of which pixels represent fat tissue and which instead represent fibroglandular tissue. In fact, the high tissue variability among women and the different conditions in which mammograms can be acquired make thresholding methods ineffective: a pixel value assessed as “fat” in one woman can mean “fibroglandular” in another. Tissue variability is a problem not only among different women but also in the same individual over time, and it depends on several factors that should be kept in mind when breast density is assessed, such as Body Mass Index (BMI), age, use of hormonal therapies, weight and diet. The second main problem is the lack of reproducibility [26]: studies on inter-reader agreement with the k-statistic showed low agreement [27]. For these reasons, in the fifth edition of the BIRADS Atlas [1], the percentage indication has been replaced with guidelines based on textual descriptions of mammograms. This standard is widely used in North America and in Europe, and it plays an important role in assessing the relations between breast density and cancer detectability.


At the same time, automated methods for assessing density classes have been developed to overcome the limitations of area-based evaluations, which are subjective and time-consuming and hence not suited for large epidemiological studies. Furthermore, automated classification software makes the density assessment truly reproducible. Some of these software packages are already available [28] and tested. The best known is CumulusV (University of Toronto), an interactive software to segment and to estimate breast density according to the BIRADS standard. Cumulus is not completely automatic but operator-dependent, which means that it suffers from the same problem discussed above. Other software packages for breast density assessment are available, such as Volpara (Volpara Solutions) and Quantra (Hologic, Danbury, Conn), but they do not classify density in the BIRADS standard. However, these automated software packages have the merit of making density measures reliable, unlike those assessed by radiologists. Finally, the inherent problem is that we want to measure mammographic density, which is a 3D quantity, from mammograms, which are 2D projections. This kind of difficulty may be overcome by defining a volumetric density standard using new breast imaging techniques such as MRI, but mammography is still the most used breast imaging system all over the world. In this master thesis, I chose the BIRADS standard, as established in the fifth edition of the BIRADS Atlas.

1.5.1 BIRADS guidelines for breast density assessment

In the fifth edition of the BIRADS Atlas, density assessment is defined as an overall assessment of the volume of attenuating tissues in the breast. Density evaluation helps to indicate the relative possibility that a lesion could be obscured by normal tissue, and that the sensitivity of the examination may thereby be compromised by dense breast tissue. Since mammography does not detect all breast cancers, clinical breast examination is a complementary element of screening. The four density categories are named “A”, “B”, “C” and “D”, and they are defined by the visually estimated content of fibroglandular-density tissue within the breasts. If the two breasts of the same individual do not appear to belong to the same density class, the denser one should be considered in the assessment. The sensitivity of mammography for non-calcified lesions decreases as the BIRADS breast density category increases: the denser the breast, the larger the lesion(s) that may be obscured. There is considerable intra- and inter-observer variation in visually estimating breast density between any two adjacent density categories. The least dense class is “A”: breasts belonging to this class are almost entirely fatty. In this case, mammography shows the highest possible sensitivity, and the probability of a masking effect is really low. In Figure 1.10, a mammogram of an almost entirely fatty breast is reported.

In the second density class, “B”, there are breasts with scattered areas of fibroglandular density, which cannot be considered as mammographic findings. In Figure 1.11, a mammogram classified as B is reported.

The category “C” includes heterogeneously dense breasts. It is common that some areas of the breast are relatively dense while other areas are almost fatty.


Figure 1.10: A: The breasts are almost entirely fatty.


Figure 1.12: C: The breasts are heterogeneously dense, which may obscure small masses.

In these cases, it is useful to describe the locations of the denser areas in the medical density report: in these areas, small uncalcified lesions may be obscured. Some text examples are reported in the BIRADS Atlas, such as “The dense tissue is located anteriorly in both breasts, and the posterior portions are mostly fatty” or “Primarily dense tissue is located in the upper outer quadrants of both breasts; scattered areas of fibroglandular tissue are present in the remainder of the breasts”. In Figure 1.12, an example of a C-classified breast is shown.

The densest class is “D”, and it includes breasts with such an extreme density that it lowers the sensitivity of mammography. An example of dense breasts is reported in Figure 1.13.

The fourth edition of BIRADS, unlike previous editions, indicated quartile ranges of percentage dense tissue (increments of 25% density) for each of the four density categories, with the expectation that the assignment of breast density would be distributed more evenly across categories than the historical distribution of 10% fatty, 40% scattered, 40% heterogeneously dense, and 10% extremely dense. However, it has since been demonstrated in clinical practice that there has been essentially no change in this historical distribution across density categories, despite the 2003 guidance provided in the BIRADS Atlas. The distribution of density classes of 3,865,070 screening mammography examinations over 13 years is reported in Figure 1.14.


Figure 1.13: D: The breasts are extremely dense, which lowers the sensitivity of mammography.


Figure 1.14: U.S. Radiologists’ Use of BI-RADS Breast Density Descriptors, 1996-2008


Chapter 2

Materials and methods

Public research databases with digital mammograms are not available. In fact, public mammographic databases contain mainly digitized analog mammograms and too few mammographic exams. Furthermore, not all public databases are provided with a ground truth about density [29] [30]. Since the proposed approach requires a huge number of images, a collection of digital mammograms was needed. For this reason, the “Azienda Ospedaliero-Universitaria Pisana” (AOUP) made available to us 1962 mammographic exams and 50 tomosynthesis scans, classified and collected by a radiologist specialized in mammography and a radiology technician.

2.1 Image selection

In order to have a sufficiently clean dataset, we adopted the following selection criteria:

• All exam reports were negative. Where possible, later mammographic exams in the medical records have been examined to verify the current health state of the women.

• Badly exposed X-ray mammograms have not been collected.

• Only women with all four projections usually acquired in mammography (craniocaudal and mediolateral oblique of the left and right breast) have been chosen.

Moreover:

• The mammographic imaging systems used were GIOTTO IMAGE SDL, SELENIA DIMENSIONS, GE Senograph DS VERSION ADS 54.11 and GE Senograph DS VERSION ADS 53.40 (Table 2.1).

• The assessment of density class has been made by the radiologist with the support of medical reports, written by other radiologists of the AOUP.


Imaging system                Number of exams
GIOTTO IMAGE SDL              230
SELENIA DIMENSIONS            50
GE Senograph DS ADS 54.11     121
GE Senograph DS ADS 53.40     1561
TOTAL                         1962

Table 2.1: Mammographic imaging systems as reported in DICOM files

2.2 Dataset distribution

The 1962 mammographic exams were distributed over density and age as reported in Table 2.2. In Figures 2.1, 2.2, 2.3 and 2.4, the age distribution for each density class is reported. As expected, both the average age and the median age decrease as density increases.

                     A      B      C      D
N. of exams          264    611    888    199
Average age          67.6   63.7   58.1   53.0
Standard deviation   11.3   11.5   9.5    6.7
Median               68     62     56     52

Table 2.2: Distribution of the exams over density class and age


Figure 2.1: Class A: age distribution


Figure 2.3: Class C: age distribution


2.3 Neural Networks and Deep Learning

In recent years, neural networks and, in particular, deep learning have become one of the most interesting and studied instruments to analyze data [31]. The advent of “Big Data” forced researchers and industries to create new theories, new computational instruments and new ideas for data representation. Data mining is the process of discovering patterns in large data sets, involving methods at the intersection of machine learning, statistics and database systems. The success of such networks on natural image classification has caused a revolution in medical image analysis too [6]. In Figures 2.5 and 2.6, some deep learning applications are shown [32].

Figure 2.5: Segmentation and classification of a natural image performed by a deep network.

2.4 Neural networks: lexicon and first definitions

Neural networks are a class of learning algorithms which constitutes the basis of most deep learning methods. The term “neural network” is due to the attempt to imitate the neural connections of human and non-human brains. Even though this sort of correspondence between the human brain and artificial neural networks is dangerous, because the knowledge about how the brain works is too limited to assert that such algorithms perform in about the same way, some hypotheses about the biological functioning of the brain have been fruitfully applied in artificial neural networks.


Figure 2.6: Depicted are (a) original image, (b) dense tissue according to expert Cumulus-like threshold, and (c) dense tissue according to a deep neural network called Convolutional Sparse AutoEncoder (CSAE).

An artificial neural network (ANN) is a network of simple elements called neurons, which receive input, change their internal state (activation) according to that input, and produce output depending on input and activation. The network is formed by connecting the outputs of certain neurons to the inputs of other neurons, composing a directed weighted graph. An example of an oriented weighted graph is reported in Figure 2.7. A weighted graph is a graph in which a number (the weight) is assigned to each edge, while an oriented graph is a directed graph in which at most one of (x, y) and (y, x) may be an arrow of the graph.


The weights, as well as the functions that compute the activation, can be modified by a process called learning, which is governed by a learning rule. One of the most important features of a neural network is that a non-linear function is applied when passing from one neuron to the next [33]. This means that we cannot rebuild an input from an output as a product of matrices, but also that the network can distinguish data that are not linearly separable. Deep feedforward networks, also called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models (Figure 2.8).

Figure 2.8: An example of a multi-layer perceptron.
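As a minimal illustration of these definitions (a sketch, not the code used in this work), the forward pass of a small two-layer network can be written in a few lines of Python; without the non-linear activation, the two weight matrices would collapse into a single matrix product:

    import numpy as np

    # A sketch, not the thesis code: each neuron computes a weighted sum of
    # its inputs plus a bias, then applies a non-linear activation.
    rng = np.random.default_rng(0)

    x = rng.normal(size=3)                          # input vector
    W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer weights
    W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # output layer weights

    relu = lambda z: np.maximum(z, 0.0)             # non-linear activation

    h = relu(W1 @ x + b1)                           # hidden activations
    y = W2 @ h + b2                                 # network output
    print(h, y)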

The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the values of the parameters θ that result in the best function approximation. For most of these models, the process of approximation starts from the input x, flows through f and ends at the output y; in this kind of network there are no feedback connections in which the outputs of the model are fed back into itself. Feedforward neural networks are called networks because they are typically represented by composing together many different functions. For example, we might have f(1), f(2) and f(3), connected into a graph to form f(x) = f(3)(f(2)(f(1)(x))). In this case, f(1) is called the first layer, f(2) the second layer and f(3) the third and last layer. The last layer is the one which produces the output of the neural network. The number of layers is called “depth”; the term “deep learning” refers precisely to the number of layers in the network, even though there is no precise definition of what is deep and what is not. The dimensionality of each layer is called its width. The choice of width, depth and of the operations that link layers at different depths and widths is what defines the architecture of the network. The training data provide us examples of f*(x) evaluated at different training points. Each example x is associated with a label y ≈ f*(x). The training examples specify directly what the output layer has to do at each point: it must produce a value that is close to y.


The behavior of the other layers is not directly specified by the training data. The learning algorithm has to decide how to use those layers to produce the desired output, but the training data do not say what each individual layer should do: the learning algorithm must decide how to use these layers to best implement an approximation of f*(x). Because the training data do not show the desired output for each of these layers, they are called “hidden” layers.
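As a concrete sketch of this layer composition (illustrative only: the sizes below are assumptions, not the network developed in this thesis), the chain f(x) = f(3)(f(2)(f(1)(x))) maps directly onto a sequential model in Keras, where depth is the number of stacked layers and width is the number of units in each:

    import tensorflow as tf

    # Sketch only: a small MLP classifier, not the thesis architecture.
    # Depth = number of stacked layers; width = units per layer.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu",
                              input_shape=(784,)),     # f(1): first layer
        tf.keras.layers.Dense(64, activation="relu"),  # f(2): second layer
        tf.keras.layers.Dense(4, activation="softmax") # f(3): output layer
    ])
    model.summary()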

2.4.1 Learning algorithm

A neural network is always guided by a learning algorithm, which defines what learning means. In 1997, Mitchell defined a learning algorithm as follows [34]: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E”. Experience, tasks and performance measures can be very different among different problems, and learning algorithms can be very different from one another. The task is the final target that we want our network to be able to achieve. There are many tasks on which an ANN can be trained, but the process of learning is not itself the task: learning means acquiring the capability to solve the task. In this work, the task is image classification: the program is asked to specify to which class a mammogram belongs, among a set of k classes. However, there are many other tasks, such as transcription, in which the machine learning system is asked to observe a relatively unstructured representation of some kind of data and to transcribe the information into discrete textual form, or synthesis and sampling, in which the algorithm is asked to generate new examples that are similar to those in the training data. The performance measure P is a quantitative measure we need to assess the ability of a machine learning algorithm. For tasks such as classification or transcription, P is often the accuracy, i.e. the capability of the network to predict the right output on our data. For tasks that have continuous output, the use of accuracy does not make sense, and one of the most used performance measures is the average log-probability the model assigns to some examples. Usually we are interested in how a trained network performs on unseen data, especially because we want to use it in the real world; to assess this, we have to evaluate the performance of our network on a truly unseen set of data, separated from training and validation data. Finally, machine learning algorithms can be roughly divided into supervised and unsupervised. An unsupervised learning algorithm tries to learn a structure in a given feature dataset, i.e. a probability distribution of some features of the training set; unsupervised methods are used, for example, to solve clustering problems. A supervised learning algorithm instead tries to reproduce a certain output after being trained on several similar examples, each paired with a label, which is the ground truth for the network. Generally, a classification problem is solved with supervised learning and, in this master thesis, I have chosen a supervised learning method.
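The two performance measures mentioned above can be made concrete with a short sketch (the arrays are illustrative, not thesis data): accuracy for discrete predictions, and the average log-probability the model assigns to the true classes for probabilistic output:

    import numpy as np

    # Illustrative arrays, not thesis data.
    y_true = np.array([0, 2, 1, 1, 3])        # ground-truth class labels
    y_pred = np.array([0, 2, 1, 0, 3])        # predicted class labels
    accuracy = np.mean(y_pred == y_true)      # fraction of correct predictions

    # Model-assigned class probabilities for each example (rows sum to 1).
    probs = np.array([[0.7, 0.1, 0.1, 0.1],
                      [0.1, 0.1, 0.7, 0.1],
                      [0.2, 0.6, 0.1, 0.1],
                      [0.5, 0.3, 0.1, 0.1],
                      [0.1, 0.1, 0.1, 0.7]])
    avg_log_prob = np.mean(np.log(probs[np.arange(len(y_true)), y_true]))
    print(accuracy, avg_log_prob)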


2.4.2 Overfitting, underfitting and regularization

The most important challenge in machine learning is to predict the right output on unseen data, which are not part of the training data. The ability of a network to perform well on unseen data is called generalization. Typically, when we train a machine learning algorithm, we compute an error on the training set called the training error, and the task is to minimize this error: this is an optimization problem. The difference between this procedure and a classical optimization problem is that we also want to minimize the error on unseen data, called the test error. The test error, also called generalization error, is estimated on a test set of examples different from the training set. The main problem is: how can we minimize the generalization error while observing only the training error? The answer is that we make some assumptions on the statistical distributions of both the training and test sets: if the two sets were arbitrarily collected, we could not assert anything about the latter by only measuring the former. The first assumption is that each example is independent of the others. The second is that both the training set and the test set are identically distributed, drawn from the same probability distribution; we call this shared underlying probability distribution the data-generating distribution. If we want to build a classifier for a real problem, this distribution must coincide with the real data distribution. These two assumptions allow us to study mathematically the relationship between training and test sets: we sample the training set, we use it to choose the parameters that minimize the training error, and then we sample the test set and measure the test error. Under these assumptions, the expected test error is greater than or equal to the expected training error. In a learning process, the two main goals are:

1. Make the training error small.

2. Make the gap between training and test error as small as possible.

These two points describe the two central machine learning challenges: underfitting and overfitting. Underfitting occurs when we cannot reach a sufficiently small training error; this means that our network is not able to discriminate and recognize the data on which it has been trained. Overfitting occurs when the gap between training error and test error is too wide. We can control underfitting and overfitting by modifying the network capacity, i.e. its capability to fit a wide range of functions. If we increase the capacity of the network, we reduce the training error, but we might not be able to reduce the gap; if the capacity is too small, our algorithm is not able to fit the training data. A graphic example of this problem is reported in Figure 2.9, where three models of a training set are represented. The training data were generated synthetically, by randomly sampling x values and choosing y deterministically by evaluating a quadratic function. In the figure on the left, a linear function fit to the data suffers from underfitting: it cannot capture the curvature that is present in the data. In the figure in the center, a quadratic function fit to the data generalizes well to unseen points; it suffers from neither overfitting nor underfitting.


Figure 2.9: Three models fit to an example training set.

In the figure on the right, a polynomial of degree 9 fit to the data suffers from overfitting.
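The experiment of Figure 2.9 is easy to reproduce in sketch form (synthetic data and degrees are illustrative assumptions): fitting polynomials of increasing degree to noisy samples of a quadratic function shows a high training error for degree 1 (underfitting) and a large train/test gap for degree 9 (overfitting):

    import numpy as np

    # Synthetic data as in Figure 2.9: noisy samples of a quadratic function.
    rng = np.random.default_rng(42)

    def sample(n):
        x = rng.uniform(-1, 1, size=n)
        return x, 2.0 * x**2 - x + 0.1 * rng.normal(size=n)

    x_train, y_train = sample(12)
    x_test, y_test = sample(100)

    for degree in (1, 2, 9):
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
        # degree 1 underfits (high train error); degree 9 overfits (large gap)
        print(degree, train_mse, test_mse)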

In order to increase the model capacity, we can increase the number of parameters of our network and, at the same time, the number of training examples; in the example of Figure 2.9, this corresponds to increasing the number of points and the degree of the polynomial. There are, indeed, many ways to change a model's capacity, and capacity is not determined only by the choice of the model. The model specifies which family of functions the learning algorithm can choose from when varying the parameters in order to reduce a training objective; this is called the representational capacity of the model. In many cases, finding the best function in this family is a very difficult problem: in practice, when we train a network, we just search for the best parameters that minimize the training error. In order to produce an algorithm that generalizes, several regularization methods are available. One possibility is to add to the loss function a term R(W):

L(W) = \frac{1}{N} \sum_{i=1}^{N} L_i(f(x_i, W), y_i) + \lambda R(W)    (2.1)

which should simplify the training so that the model also works on unseen data. Several different regularizers can be used: for example, the L2 regularizer is R(W) = \sum_k \sum_l W_{k,l}^2, while the L1 one is R(W) = \sum_k \sum_l |W_{k,l}|. However,

in deep learning there are other regularization methods which are commonly used. One of them is dropout, which was presented for the first time in 2012 by Krizhevsky et al. [35] in the AlexNet network. Dropout [36] consists in randomly setting a fraction of the input units to 0 at each update during training time, in order to help prevent overfitting. In fact, standard backpropagation learning builds up brittle co-adaptations that work for the training data but do not generalize to unseen data. Random dropout breaks up these co-adaptations by making the presence of any particular hidden unit unreliable.


Figure 2.10: A graphical representation of dropout

One of the drawbacks of dropout is that it increases the training time: a dropout network typically takes 2-3 times longer to train than a standard neural network of the same architecture [36]. A graphical example that shows how dropout works on a standard neural network is presented in Figure 2.10. Another commonly used regularizer is batch normalization. Batch normalization [37] allows us to use much higher learning rates and to be less careful about initialization. It acts on the “internal covariate shift”, i.e. the change in the distribution of each layer's inputs during training as the parameters of the previous layers change. To control this shift, we would need a perfect choice of hyperparameters and an accurate kernel initialization, but these operations are really difficult because the number of both hyperparameters and kernels is really high. Batch normalization consists in normalizing the hidden layers by computing the mean on every single batch. This way, I could choose a higher learning rate, especially in the first part of the training, because this normalization ensures that there are no activations that are too high or too low. It also reduces overfitting because, like dropout, it adds some noise to the hidden layers' activations.
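The regularizers discussed in this section map onto standard Keras building blocks. The following is a sketch under my own assumptions (layer sizes, rates and coefficients are illustrative, not the configuration used for the thesis network):

    import tensorflow as tf

    # L2 weight penalty: adds lambda * sum(W^2) to the loss, as in Eq. (2.1).
    l2 = tf.keras.regularizers.l2(1e-4)

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,),
                              kernel_regularizer=l2),
        tf.keras.layers.BatchNormalization(),  # per-batch normalization
        tf.keras.layers.Dropout(0.5),          # randomly zeroes 50% of units
        tf.keras.layers.Dense(4, activation="softmax"),
    ])
    model.compile(optimizer="sgd",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])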

2.4.3 Gradient-based learning

Designing and training a neural network is not much different from training any other machine learning model with gradient descent. The most important difference between a neural network and other machine learning models, such as Support Vector Machines, is non-linearity: in a neural network, the functions between layers are non-linear. Neural networks are usually trained by using iterative gradient-based optimizers, which drive the cost function to a very low value. As with other machine learning techniques, we have to define a cost function that will be minimized by a gradient-based optimizer.


The choice of the cost function is a crucial task in designing a deep neural network. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models: in most cases our model defines a distribution p(y|x; θ) and we simply use the principle of maximum likelihood. The cost function is often a cross-entropy computed between the training data and the model's predictions, and it often includes a regularization term whose purpose is to avoid overfitting the data. One of the most important algorithms used in deep neural networks is the backpropagation of the error. While in the feedforward pass the information flows only in one direction (from input to output), the error signal can also be propagated in the reverse direction, i.e. from output to input. With a supervised learning paradigm, we give many examples to the network together with their labels. All the examples flow from the first layer to the last and, once they arrive at the last layer, the network produces an output. The difference between the output and the label of the input, computed by the loss function, represents the error of the network in solving the problem. In backpropagation, this error is propagated backwards through the network: we tell the network which is the right result associated with that input, and we ask it to adjust all the weights of all the hidden layers in order to produce that output. The term back-propagation is often misunderstood as meaning the whole learning algorithm for multi-layer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient. To clarify, I will describe these two ways of using the gradient in the following subsections.

Stochastic Gradient Descent (SGD)

Stochastic gradient descent (SGD) is one of the most used optimizers in learning problems [38]. It is a simplification of the more general gradient descent algorithm. Each single example z_t of a classification problem is a pair (x, y), where x is an arbitrary input and y is a scalar. We consider a loss function \ell(\hat{y}, y) which measures the error in predicting \hat{y} when y is the ground truth. Given a family of functions F, we want to find the f_w(x) \in F, parametrized by a weight vector w, that minimizes the loss:

Q(z, w) = \ell(f_w(x), y)    (2.2)

averaged over the examples. The problem is that we want to average over an unknown distribution dP(z), while we usually have only a set of examples z_1 \dots z_n. So we would need to know dP(z) in order to calculate the expected risk:

E(f) = \int \ell(f(x), y) \, dP(z)    (2.3)

It is called expected because it measures the generalization performance, that is, the expected performance on future examples. We can also define the empirical risk:

E_n(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)    (2.4)

which measures the training set performance. Statistical learning theory states that, if F is sufficiently restricted, it is justified to minimize the empirical risk instead of the expected risk.

It has been proposed many times to minimize E_n(f) using Gradient Descent (GD). At each iteration, we update the weights w using the gradient of E_n(f_w):

w_{t+1} = w_t - \gamma \frac{1}{n} \sum_{i=1}^{n} \nabla_w Q(z_i, w_t)    (2.5)

where γ is the gain. In machine learning, γ is called the learning rate. Stochastic gradient descent (SGD) is a simplification of GD: instead of calculating the gradient of E_n(f_w) exactly, at each iteration the gradient is estimated from a single randomly selected example z_t:

w_{t+1} = w_t - \gamma_t \nabla_w Q(z_t, w_t)    (2.6)

This stochastic process depends on the randomly picked z_t. We assume that Equation (2.6) behaves like Equation (2.5), except for the noise introduced by the simplification. The convergence speed is limited by the approximation of the true gradient and strongly depends on the choice of the learning rate, which represents the size of the step the update takes towards the minimum. In Figure 2.11, the trend of the loss over epochs for different learning rates is reported; an epoch is an iteration over the entire provided dataset [39]. As shown in Figure 2.11, with a very high learning rate the loss function diverges; with a high learning rate, the loss function can reach a plateau and never get to the minimum; with a low learning rate, the learning process can be too slow. To choose a good learning rate, we should monitor the loss function and search for the best possible value. One way to do this is to lower the learning rate when the loss function reaches a plateau.
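A minimal sketch of the SGD update of Equation (2.6), applied to a linear model with squared loss and including the plateau heuristic just described, may look as follows (synthetic data and hyperparameters are illustrative assumptions):

    import numpy as np

    # Synthetic linear-regression data (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + 0.01 * rng.normal(size=200)

    w = np.zeros(3)
    lr = 0.1
    prev_epoch_loss = np.inf
    for epoch in range(20):
        for i in rng.permutation(len(X)):        # one random example at a time
            err = X[i] @ w - y[i]                # prediction error on z_t
            grad = err * X[i]                    # gradient of 0.5*err^2 w.r.t. w
            w -= lr * grad                       # w_{t+1} = w_t - gamma * grad
        epoch_loss = 0.5 * np.mean((X @ w - y) ** 2)
        if epoch_loss > 0.999 * prev_epoch_loss: # loss reached a plateau
            lr *= 0.5                            # lower the learning rate
        prev_epoch_loss = epoch_loss
    print(w)  # should approach true_w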

Chain rule and back-propagation algorithm

The term back-propagation refers to the algorithm used to compute the gradient. As said above, in a feedforward network the flow of information starts at the input x, goes through the hidden layers, and then the output \hat{y} is computed. Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient. The difference between back-propagation and, for example, stochastic gradient descent is that the former is used to compute the gradient, while the latter is an algorithm which uses this gradient to perform learning. The main instrument we use to compute the gradient is the chain rule of calculus, a method to calculate the derivatives of composed functions. Suppose that x \in R^m, y \in R^n, g maps from R^m to R^n and f maps from R^n to R.


If y = g(x) and z = f(y), then:

\frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i}    (2.7)

In vector notation, this is equal to:

\nabla_x z = \left( \frac{\partial y}{\partial x} \right)^{\top} \nabla_y z    (2.8)

where \frac{\partial y}{\partial x} is the n × m Jacobian matrix of g. So, the gradient with respect to a variable x can be obtained by multiplying the Jacobian matrix \frac{\partial y}{\partial x} by the gradient \nabla_y z.

The back-propagation algorithm consists in performing such a Jacobian-gradient product for each operation in the graph. Usually we apply the back-propagation algorithm to tensors of arbitrary dimensionality, not merely to vectors; conceptually, this is exactly the same as back-propagation with vectors, the only difference being how the numbers are arranged in a grid to form a tensor. We could imagine flattening each tensor into a vector before running back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients.
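A short numeric sketch of one such Jacobian-gradient product (illustrative shapes and values): for y = g(x) = Wx and z = f(y) = Σ y², the backward step of Equation (2.8) reproduces the finite-difference derivative:

    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.normal(size=(4, 3))   # g: R^3 -> R^4, Jacobian dy/dx = W (4x3)
    x = rng.normal(size=3)

    y = W @ x                     # forward pass through g
    z = np.sum(y**2)              # forward pass through f

    grad_y = 2 * y                # nabla_y z for f(y) = sum(y^2)
    grad_x = W.T @ grad_y         # Eq. (2.8): (dy/dx)^T @ nabla_y z

    # Check against a finite-difference approximation of dz/dx_0.
    eps = 1e-6
    x_eps = x.copy(); x_eps[0] += eps
    z_eps = np.sum((W @ x_eps) ** 2)
    print(grad_x[0], (z_eps - z) / eps)   # the two numbers should match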

2.4.4 The “horse problem”

To understand why, in image analysis, non-linearity can be an important and sometimes necessary property, I will discuss a problem on natural images. An example of the weights of a parametric linear classifier trained on CIFAR10 is presented in Figure 2.12. CIFAR10 is a public dataset containing 50000 training images and 10000 test images, classified in ten classes, which are represented in Figure 2.12. One of the most interesting representations produced by this linear classifier is the horse one: the horse representation is a horse with two heads. This is due to the presence in the training set of horses with different orientations: if we represent a horse as a linear combination of the horses in the training set, we obtain a single representation which contains all the information. A neural network, instead, does not produce a linear combination of the input images: the non-linearity produces different representations of the images in each layer. A deep neural network with many layers will produce a higher number of hierarchical representations; these representations are combined together by the learning method in order to output a classification score. This means that an artificial neural network can learn complex concepts. One of the biggest problems with ANN representations is that we do not know how “importance” is distributed to each neuron, and we do not have a widely accepted instrument to evaluate the information content of each hidden layer. One of the scopes of this research is to minimize the number of representations and, at the same time, maximize the content of information necessary to solve the problem. In this way, we will have neural networks with higher accuracy and shorter convergence time.


Figure 2.12: Weights of a parametric linear classifier trained on CIFAR10. The learned representation of a horse is a horse with two heads, although no two-headed horse exists: the training set contains horses with different orientations, and their linear combination is the two-headed template.

Figure 2.13: A graphical representation of AlexNet, the convolutional neural network which won the ImageNet competition in 2012.

2.4.5 Convolutional Neural Network

A CNN (Figure 2.13) is a neural network used to analyze structured data, whose architecture contains convolutional layers. If $A$ is a matrix of dimension $M \times N$ and $H$ is a square matrix of size $k \times k$, where $k$ is an odd number, the convolution between $A$ and $H$ is:

$$C_{AH}(i,j) = (A \otimes H)(i,j) = \sum_{p=0}^{k-1} \sum_{q=0}^{k-1} A(i-p,\, j-q)\, H(p,q) \qquad (2.9)$$

H is the filter, commonly called kernel, that slides across the entire image in steps whose size, called the “stride”, can be chosen. The result of the convolution between the input image and a kernel is called an “activation map”. Furthermore, a CNN has two main differences compared to a multi-layer perceptron: the “receptive field” and the pooling layers. The idea of a receptive field comes from biological considerations. In 1968, Hubel and Wiesel [40] studied the response of the striate cortex in monkeys. They found that any small part of the striate cortex can be activated or suppressed in response to specific visual stimuli. When we use images, because of their high dimensionality, it is impractical to connect each neuron to all neurons in the previous layer. For example, a 500x500 pixel image has 250000 pixels: if we fully connect all these pixels to a hidden layer of a hundred neurons, we obtain about 25 million connections. In deep networks, more than one hidden layer is usually used, so fully connecting all neurons leads to an unmanageable number of parameters. For this reason, each neuron is connected only to a local region of the input. The extent of this region is called the receptive field and it is a hyperparameter equivalent to $k$ in Equation 2.9. Spatial dimensions are treated in an asymmetric way: the connections are local in space (along width and height), but always full along the entire depth of the input volume. This means that if we have a three-channel image (RGB), we can choose the kernel height and width, but the depth of the filters in the first layer will always be three. The use of receptive fields makes the network invariant to input translations, so that a feature is recognized regardless of its position in the image; this also lowers the number of trainable parameters. A pooling layer, instead, is a layer in which pixel values are aggregated through a permutation-invariant function such as the maximum. The output of these layers is a downsampled image. An example of max pooling is presented in Figures 2.14 and 2.15. Pooling therefore not only lowers the number of parameters, it also produces downsampled activation maps [39].

Figure 2.14: Max pooling: an input layer of size 224x224x64 is reduced to a tensor of dimension 112x112x64, producing a downsampling of the image.
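To make Equation 2.9 and the pooling operation concrete, the following toy numpy sketch (not part of the thesis code) implements a valid convolution with kernel flipping and a 2x2 max pooling with stride 2; the Laplacian-like kernel is an arbitrary example.

import numpy as np

def conv2d(A, H, stride=1):
    # Valid convolution of image A with a k x k kernel H, as in Eq. 2.9
    # (up to an index offset): the kernel is flipped, then slid across A.
    k = H.shape[0]
    Hf = H[::-1, ::-1]                       # flip: A(i-p, j-q) H(p, q)
    M, N = A.shape
    out_h = (M - k) // stride + 1
    out_w = (N - k) // stride + 1
    C = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = A[i*stride:i*stride+k, j*stride:j*stride+k]
            C[i, j] = np.sum(patch * Hf)     # one kernel position -> one pixel
    return C

def max_pool2x2(A):
    # Downsample by taking the maximum over non-overlapping 2x2 blocks.
    M, N = A.shape[0] // 2 * 2, A.shape[1] // 2 * 2
    A = A[:M, :N]
    return A.reshape(M // 2, 2, N // 2, 2).max(axis=(1, 3))

A = np.arange(36, dtype=float).reshape(6, 6)
H = np.array([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])  # example kernel
print(max_pool2x2(conv2d(A, H)).shape)       # (2, 2): 6x6 -> 4x4 -> 2x2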

VGG and ResNet architectures

In this work, two well-known architectures have been used and adapted to mammogram classification. The first is a VGG network, a convolutional neural network which won the ImageNet challenge in 2014. ImageNet is formally a project aimed at manually labeling and categorizing images into almost 22000 separate object categories for the purpose of computer vision research. However, in the context of deep learning and Convolutional Neural Networks, the term refers to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The goal of this image classification challenge is to train a model that can correctly classify an input image into 1000 separate object categories. Models are trained on about 1.2 million training images, with another 50000 images for validation and 100000 images for testing. In particular [41], the advancement brought by VGG is the stacking of convolutional layers. While AlexNet [35] has an architecture mainly made of convolutional, pooling and normalization layers in a stack, in VGG three convolutional layers with a small receptive field (3x3) are stacked in order to obtain a larger total receptive field (7x7) with fewer parameters. The second main architecture used is ResNet [11], which won the ImageNet challenge in 2015. This CNN is made of several blocks called residual blocks. A residual block is made of stacked convolutional layers combined with a skip connection, so that the CNN learns the residual mapping between the block input and output. Both the VGG and the ResNet of this work have been trained from scratch and evaluated with accuracy as metric on a test set, kept separate from the training and validation sets.

Figure 2.15: A max pooling of size 2 applied to a slice of a layer: the slice is divided into 2x2 submatrices and only the maximum value of each submatrix is passed to the subsequent neuron.
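As an illustration of these two design ideas, the following Keras sketch builds a VGG-style stack of three 3x3 convolutions and a basic residual block with a skip connection. It assumes the tensorflow.keras API; the filter counts, input size and layer counts are illustrative, not those of the networks trained in this work.

from tensorflow.keras import layers, Input, Model

def vgg_stack(x, filters=64):
    # Three stacked 3x3 convolutions: total receptive field 7x7,
    # with fewer parameters than a single 7x7 convolution.
    for _ in range(3):
        x = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    return layers.MaxPooling2D((2, 2))(x)

def residual_block(x, filters=64):
    # The stacked convolutions learn the residual F(x);
    # the skip connection adds the block input back to the output.
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    y = layers.Conv2D(filters, (3, 3), padding='same')(y)
    y = layers.Add()([shortcut, y])
    return layers.Activation('relu')(y)

inp = Input(shape=(64, 64, 3))
out = residual_block(vgg_stack(inp, 64), 64)
model = Model(inp, out)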


Chapter 3

CNN development and data analysis

In this chapter, the development of the algorithm, the optimization of the hyperparameters and the results are reported. As a preliminary test, I trained a neural network on a simpler classification problem: in order to understand the feasibility of a density classifier trained on mammograms, I trained a convolutional neural network to classify mammograms belonging to the A and D BIRADS classes only. Second, I trained a convolutional neural network to discriminate between dense and non-dense breasts: this algorithm should be able to distinguish between the A and B BIRADS classes on one side and the C and D BIRADS classes on the other. Finally, I trained a convolutional neural network to classify mammograms in the four BIRADS classes with a one-view approach, in which four CNNs are trained on the different projections and then the results are merged together. The software I used to train, fit and evaluate the CNNs is Keras, an API written in Python with TensorFlow as backend. The hardware, made available by “Istituto Nazionale di Fisica Nucleare” (INFN), consists of:

• CPUs: 2x 10-core Intel Xeon E5-2640v4 @ 2.40 GHz;

• RAM: 64 GB;

• GPUs: 4x nVidia Tesla K80, each with 2x Tesla GK210 GPUs, 24 GB RAM and 2496 CUDA cores.

Exams used to train the convolutional neural networks have been provided by the “Azienda Ospedaliero-Universitaria Pisana” (AOUP) in DICOM image format. Each exam contains four images, corresponding to the four standard mammographic projections, i.e. right and left cranio-caudal and medio-lateral oblique. In order to make these exams readable by Keras, I converted them into the Portable Network Graphics (PNG) format at 8 bits, maintaining the original size. Even though the exams had been saved at 12 bits, I had to convert them to 8 bits because Keras does not support 12- or 16-bit images. All the PNG images have been checked one by one and automatically divided according to density class and mammographic projection. Mammograms that were too different from the majority have been excluded. In Figure 3.1, some of the samples used are reported; an example of an outlier is reported in Figure 3.2. The presence of such outliers is due to the mammographic system used: the mammogram in Figure 3.2 was taken with the Selenia Dimensions mammographic system, which has been in use at the AOUP for only one year. The different resolution and appearance of these exams made them unusable for the purpose of this master thesis. However, in the near future the number of exams taken with the Selenia Dimensions is going to increase, and there will be sufficient statistics to apply CNNs to these mammograms too. At the same time, new standardization techniques are being developed, such as the standard mammographic form [42]. The purpose of these techniques is to find a way to normalize images according to acquisition factors such as the kVp and mAs of the X-ray tube. In this master thesis, all the CNN performances have been evaluated with accuracy as metric, i.e. the capability of the network to assign the right label to unseen images.
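A minimal sketch of such a conversion step is shown below, assuming the pydicom and Pillow libraries; the file names are hypothetical and the actual conversion script of this work may differ, for example in how the 12-bit range is rescaled.

import numpy as np
import pydicom
from PIL import Image

def dicom_to_png8(dicom_path, png_path):
    # Read the DICOM file and extract the (12-bit) pixel data.
    ds = pydicom.dcmread(dicom_path)
    arr = ds.pixel_array.astype(np.float32)
    # Rescale to [0, 255] and quantize to 8 bits; the original size is kept.
    arr = (arr - arr.min()) / max(arr.max() - arr.min(), 1e-6)
    img8 = (arr * 255.0).astype(np.uint8)
    Image.fromarray(img8).save(png_path)

dicom_to_png8('exam_rcc.dcm', 'exam_rcc.png')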


Figure 3.1: An example of mammograms used to train the convolutional neural network.


3.1 Two-class classifier: A and D class

As a first density classification test, I trained a convolutional neural network to classify mammograms belonging to the A and D classes. This is a simpler problem with respect to the four-class one. According to official BIRADS data, the distribution of the A and D classes over women should be identical. I divided the dataset, which consisted of 440 mammograms, into a training set (150 mammograms per class), a validation set (20 per class) and a test set (20 per class). Since I had to keep the number of images equal for each class, I did not consider 60 mammograms belonging to class A. The images have been divided carefully to ensure that patients used in validation and testing would not appear in training. In Figure 3.3, an example of a pair of mammograms used during this training is reported.

Figure 3.3: An example from the training dataset. On the left, a cranio-caudal mammogram of class A; on the right, a cranio-caudal mammogram of class D.
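A patient-wise split of this kind can be expressed, for instance, with scikit-learn's GroupShuffleSplit; the sketch below is purely illustrative (file names and patient IDs are hypothetical) and is not the code actually used in this work.

from sklearn.model_selection import GroupShuffleSplit

# Toy lists: one entry per mammogram, with the corresponding patient ID.
images = ['p1_rcc.png', 'p1_lcc.png', 'p2_rcc.png', 'p2_lcc.png',
          'p3_rcc.png', 'p3_lcc.png', 'p4_rcc.png', 'p4_lcc.png']
patients = ['p1', 'p1', 'p2', 'p2', 'p3', 'p3', 'p4', 'p4']

# Hold out 25% of the patients (not of the images) for testing, so that
# no patient contributes images to both training and test sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(images, groups=patients))
print([images[i] for i in test_idx])   # both images of one held-out patient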

As explained in Chapter 2, each collected exam is made of four images related to a single patient: the four standard mammographic projections, i.e. right and left cranio-caudal projection (viewed from top to bottom) and right and left medio-lateral oblique projection (side view). In this first test, only the right cranio-caudal projections have been used; in a second phase, I studied how accuracy changes as the number of projections used increases. I divided the training process into several steps in order to understand how the CNN behavior changed with different hyperparameters. As a first step, I preprocessed the images to perform both data augmentation and data cleaning. To achieve the preprocessing, I used ImageDataGenerator [39], a function provided by Keras that generates batches of tensor image data. With “samplewise_center = True”, ImageDataGenerator computes the mean value across one whole sample and subtracts it from each pixel. In this way, the images are normalized so that the mean value of each data sample is equal to 0. This method is effective in reducing training time and allows better accuracy to be reached. Before training, I also performed a little data augmentation by rotating images by at most 20 degrees and allowing horizontal flips. This little data augmentation helps in preventing overfitting. The images have then been resized to the desired dimension: I could not use the full-resolution images because of the available RAM, since the combination of high-resolution images and such deep networks exceeded the available memory. The training has been performed in batches of four images. This means that four images were fed forward into the network before the gradient was computed. Training a network in batches helps in reducing training time, possibly to the detriment of accuracy. I used mini-batches of four images because I did not have a huge number of images to train this network, but I wanted to accelerate the training. The main architecture of these CNNs is reported in Figure 3.4. The selected optimizer is the stochastic gradient descent, described in Chapter 2 by Equation 2.6. The loss function, through which the classification error was computed, is the categorical cross-entropy, which measures the total classification error based on the probability assigned to each class by the model. As said above, accuracy was the metric selected to evaluate the performance.
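In Keras, the preprocessing configuration described above could look as follows; this is a sketch, with a hypothetical directory layout and target size, not the exact training script of this work.

from keras.preprocessing.image import ImageDataGenerator
from keras.optimizers import SGD

# Per-sample mean subtraction plus mild augmentation, as described above.
datagen = ImageDataGenerator(samplewise_center=True,
                             rotation_range=20,
                             horizontal_flip=True)

# Directory layout ('train/A', 'train/D') and image size are hypothetical.
train_gen = datagen.flow_from_directory('train/', target_size=(250, 250),
                                        color_mode='grayscale',
                                        class_mode='categorical',
                                        batch_size=4)

# The model (built in the next section) would then be compiled and fit with:
# model.compile(optimizer=SGD(), loss='categorical_crossentropy',
#               metrics=['accuracy'])
# model.fit_generator(train_gen, epochs=..., validation_data=...)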

The chosen CNN is made of four similar blocks. The main structure of each block is made of three convolutional layers in a stack, with LeakyReLU (α = 0.3) as activation function and batch normalization as regularizer. In Figure 3.5, a REctified Linear Unit (ReLU) and a LeakyReLU are compared. As shown, the main difference is on the negative side of the x axis: a ReLU squashes to 0 every neuron that has a negative value, while the LeakyReLU assigns a small non-zero value to those neurons. This keeps more neurons active and helps prevent the vanishing gradient.

Every convolutional layer has been initialized with a random-normal distribution of weights. Then, dropout has been applied with a rate equal to 0.5 and, at the end of each block, I put a max pooling layer of size 2x2. I tried two different solutions for this problem. In the first, the convolution has been performed with 8 kernels in the first and second blocks, while the other two blocks were identical but with 16 kernels for each convolutional layer. After these four blocks, the activation maps have been flattened into a vector and three fully connected layers have been appended. The first two fully connected layers have 10 neurons each, while the last one has only 2, as the number of classes. The activation function of the last layer is a softmax. Softmax squashes a K-dimensional input vector $z$ of arbitrary real values into another vector $\sigma(z)$ with components in the range (0, 1), computing:

$$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$$
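To make the block structure concrete, the following Keras sketch reproduces the described architecture; the 3x3 kernel size, the input size, the hidden dense activations and the exact ordering of batch normalization and dropout are assumptions, since they are not fully specified above.

from keras.models import Sequential
from keras.layers import (Conv2D, LeakyReLU, BatchNormalization,
                          Dropout, MaxPooling2D, Flatten, Dense)
from keras.initializers import RandomNormal

def add_block(model, n_kernels, input_shape=None):
    # Three stacked convolutions, each followed by LeakyReLU and batch
    # normalization; dropout and a 2x2 max pooling close the block.
    for i in range(3):
        kwargs = {'input_shape': input_shape} if (i == 0 and input_shape) else {}
        model.add(Conv2D(n_kernels, (3, 3), padding='same',
                         kernel_initializer=RandomNormal(), **kwargs))
        model.add(LeakyReLU(alpha=0.3))
        model.add(BatchNormalization())
    model.add(Dropout(0.5))
    model.add(MaxPooling2D((2, 2)))

model = Sequential()
add_block(model, 8, input_shape=(250, 250, 1))   # blocks 1 and 2: 8 kernels
add_block(model, 8)
add_block(model, 16)                             # blocks 3 and 4: 16 kernels
add_block(model, 16)
model.add(Flatten())
model.add(Dense(10, activation='relu'))          # hidden dense activations
model.add(Dense(10, activation='relu'))          # are assumed, not specified
model.add(Dense(2, activation='softmax'))        # softmax over the 2 classes

model.compile(optimizer='sgd', loss='categorical_crossentropy',
              metrics=['accuracy'])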
