Evaldas Padervinskis THE VALUE OF AUTOMATIC VOICE CATEGORIZATION SYSTEMS BASED ON ACOUSTIC VOICE PARAMETERS AND QUESTIONNAIRE DATA IN THE SCREENING OF VOICE DISORDERS

LITHUANIAN UNIVERSITY OF HEALTH SCIENCES MEDICAL ACADEMY

Evaldas Padervinskis

THE VALUE OF AUTOMATIC

VOICE CATEGORIZATION SYSTEMS BASED

ON ACOUSTIC VOICE PARAMETERS AND

QUESTIONNAIRE DATA IN THE

SCREENING OF VOICE DISORDERS

Doctoral Dissertation Biomedical Sciences,

Medicine (06B)

The dissertation was prepared at the Lithuanian University of Health Sciences, Medical Academy, Department of Otorhinolaryngology, during the period 2011–2015.

Scientific Supervisor

Prof. Dr. Habil. Virgilijus Ulozas (Lithuanian University of Health Sciences, Medical Academy, Biomedical Sciences, Medicine – 06B).

Dissertation is defended at the Medical Research Council of the Lithuanian University of Health Sciences, Medical Academy:

Chairman

Prof. Dr. Habil. Limas Kupcinskas (Lithuanian University of Health Sciences, Medical Academy, Biomedical Sciences, Medicine – 06B).

Members:

Prof. Dr. Habil. Daiva Rastenyte (Lithuanian University of Health Sciences, Medical Academy, Biomedical Sciences, Medicine – 06B);

Prof. Dr. Dalia Zaliuniene (Lithuanian University of Health Sciences, Medical Academy, Biomedical Sciences, Medicine – 06B);

Prof. Dr. Vaidotas Marozas (Kaunas University of Technology, Technological Sciences, Electrical and Electronics Engineering – 01T);

Prof. Dr. Habil. Kazimierz Niemczyk (Medical University of Warsaw, Biomedical Sciences, Medicine – 06B).

The dissertation will be defended at the open session of the Medical Research Council of the Lithuanian University of Health Sciences on June 16th, 2016, at 2 p.m. in Auditorium 204 of the Faculty of Pharmacy of the Lithuanian University of Health Sciences.

Address: Sukileliu 13, LT-50009 Kaunas, Lithuania.

LIETUVOS SVEIKATOS MOKSLŲ UNIVERSITETAS MEDICINOS AKADEMIJA

Evaldas Padervinskis

AUTOMATINĖS BALSO KATEGORIZAVIMO

SISTEMOS, PAREMTOS AKUSTINIŲ BALSO

PARAMETRŲ BEI PACIENTŲ KLAUSIMYNŲ

DUOMENŲ ANALIZE, VERTĖ PIRMINEI

BALSO SUTRIKIMŲ ATRANKAI

Daktaro disertacija Biomedicinos mokslai,

medicina (06B)

Disertacija rengta 2011–2015 metais Lietuvos sveikatos mokslų universiteto Medicinos akademijos Ausų, nosies ir gerklės ligų klinikoje.

Mokslinis vadovas

prof. habil. dr. Virgilijus Ulozas (Lietuvos sveikatos mokslų universitetas, Medicinos akademija, biomedicinos mokslai, medicina – 06B).

Disertacija ginama Lietuvos sveikatos mokslų universiteto Medicinos akademijos medicinos mokslo krypties taryboje:

Pirmininkas

prof. habil. dr. Limas Kupčinskas (Lietuvos sveikatos mokslų universitetas, biomedicinos mokslai, medicina – 06B).

Nariai:

prof. habil. dr. Daiva Rastenytė (Lietuvos sveikatos mokslų universitetas, biomedicinos mokslai, medicina – 06B);

prof. dr. Dalia Žaliūnienė (Lietuvos sveikatos mokslų universitetas, biomedicinos mokslai, medicina – 06B);

prof. dr. Vaidotas Marozas (Kauno technologijos universitetas, technologijos mokslai, elektros ir elektronikos inžinerija – 01T);

prof. habil. dr. Kazimierz Niemczyk (Varšuvos medicinos universitetas, biomedicinos mokslai, medicina – 06B).

Disertacija ginama viešame Medicinos mokslo krypties tarybos posėdyje 2016 m. birželio 16 d. 14 val. Lietuvos sveikatos mokslų universiteto Farmacijos fakulteto 204 auditorijoje.

I dedicate this book to my Parents,

Ilona and Edmundas Padervinskis

Thank you

“Life is divided into three periods:

what was, what is, and what will be.

What we are doing now is brief, what we will do is

uncertain, what we have done is certain.”

Seneca

ABBREVIATIONS

RF – Random forest
SVM – Support vector machine
GERD – Gastro-oesophageal reflux disease
F0 – Fundamental frequency
SNR – Signal-to-noise ratio
HNR – Harmonic-to-noise ratio
NNE – Normalized noise energy
CCR – Correct classification rate
EER – Equal error rate
GFI – Glottal functional index
VQOL – Voice-disordered quality of life
VHI – Voice handicap index
GFI-LT – Glottal functional index, Lithuanian version
LVQ – Learning vector quantization
GMM – Gaussian mixture model
HMM – Hidden Markov model
LDA – Linear discriminant analysis
k-NN – k-nearest neighbours
MLP – Multi-layer perceptron
CC – Cepstral coefficient
BEC – Band energy coefficient
RASTA-PLP – Relative spectral transform perceptual linear prediction
LPC – Linear predictive coding
GNE – Glottal-to-noise excitation
VLS – Video laryngostroboscopy
SD – Standard deviation
CART – Classification and regression tree
t-SNE – t-distributed stochastic neighbour embedding algorithm
DET – Detection error trade-off curve
ROC – Receiver operating characteristic curve
AUC – Area under the curve
OOB – Out-of-bag data classification accuracy (%)

INTRODUCTION

Over the past 200 000 years, humans have used the lungs, larynx, tongue, and lips to produce and modify highly intricate arrays of sounds for verbal communication and emotional expression [1]. The vocal folds have evolved into a key organ in the creation of the human voice: their vibrations are the origin of the primary voice signal. The process of voice production is called phonation, and it is the preliminary stage of speech production [2]. So what is a normal, healthy voice? In 1956, Jahnson et al. described a healthy voice as one of pleasant quality and colour that reflects the speaker’s age and sex, has normal loudness, and allows adequate changes in loudness and tone [3].

Voice is the main and easiest instrument of communication between people, yet it is one of the most challenging means of transmitting information from a person to a computer. When we speak, we give other people direct, verbal information; however, we also supply indirect, non-verbal information about ourselves, such as our psychological and emotional state, personality, identity, and aesthetic orientation [4]. Every day the voice is influenced by external and internal factors. External factors include dust, dry or very humid air, temperature, background noise, and incorrect body posture when speaking [5]. Internal factors include viral infections, GERD, laryngeal pathologies, neurological diseases, and hormone fluctuations [6–8].

A search of the PubMed database shows that only 60 articles on the subject of voice were published in 1960, whereas a search from 2007 onwards returns more than 1000 articles on voice processing, and a search for 2015 returns more than 1700. It can therefore be concluded that the number of experts, doctors, and engineers investigating voice processing grows every year. Modern medicine is focused on screening programs. Screening programs are chosen for their cost-effectiveness, and screening has been proved to reduce incidence rates as well as both disease-specific and overall mortality rates. It is also recommended by all relevant major organizational guidelines [9]. Consequently, all types of screening programs need a simple, reliable, user-friendly diagnostic tool that could be used by every family doctor, or even by patients themselves, to diagnose voice diseases or to follow up after treatment. Present-day medical engineers and doctors are therefore working towards such a tool. As technology develops, what seemed impossible 10 years ago is already possible today.

Moreover, when talking to another person, a speaker conveys not only verbal information but also a great deal of non-verbal information. This is the main reason why it is so complicated to transfer information from a human to a machine and to process it.

1. THE AIM AND OBJECTIVES OF THE STUDY

The aim of the study

To develop an automatic voice categorization system based on the analysis of acoustic voice parameters and data obtained from patient questionnaires, and to evaluate the system’s effectiveness for voice screening purposes.

The objectives of the study

1. To evaluate the reliability of the measurements of acoustic voice parameters obtained using the oral and contact (throat) microphones simultaneously, and to investigate the utility of the combined use of these microphones for voice categorization.

2. To evaluate the possibilities of the Random Forest classifier for the categorization of voice into normal and pathological classes using a different number of acoustic voice parameter sets.

3. To evaluate the reliability of the measurements of acoustic voice parameters obtained simultaneously using oral and smartphone microphones, and to investigate the utility of smartphone microphone signal data for voice categorization for voice screening purposes.

4. To evaluate the value of fusion of information obtained from special voice questionnaires and analysis of acoustic voice parameters for automated voice categorization into normal and pathological voice classes.

5. To develop and to test the software intended for voice categorization into normal and pathological voice classes based on automatic analysis of acoustic voice parameters and voice related query data.

2. ORIGINALITY OF THE STUDY

A reliable, automatic, and objective detector of voice disorders from speech signals is a tool sought by many voice clinicians and engineers, as well as by general practitioners. Multi-parameter monitoring and diagnosis, viewed as a way to improve healthcare quality, to reduce feedback time, and to enable home care at one’s own responsibility, has become possible thanks to progress in diverse computer applications. A detection instrument that could be used for low-cost, non-invasive mass screening, diagnosis, and early detection of voice pathology would help to improve survival and reduce mortality from laryngeal cancer among professionals who use their voice as the main occupational tool, individuals working in risky environments such as chemical factories, smokers, and the general population [10]. If diseases are detected earlier, treatment can be started earlier, too; consequently, the cost of treatment would be lower than that of treating advanced stages of the disease. Voice contains a lot of data, the extraction of which is still a difficult task. However, the data accumulated during multi-parameter monitoring, supported by historical records and by the specific context in which the patient is acting, may be exploited for the early detection of possible diseases [11].

Despite years of research and significant progress in voice pathology detection and classification, the rates of correct detection and classification of various pathology stages are still insufficient for reliable and trusted large-scale screening. Research in this field generally splits into two stages: first, extraction of meaningful feature sets, and second, use of these features to classify speech recordings into healthy cases and different pathologies [11]. Currently, there is an increasing demand for robust voice quality measures. However, comprehensive, systematic, and routine measurement of acoustic voice parameters for diagnostic and/or voice screening purposes, or for later-stage treatment, is only possible in hospitals with voice laboratory facilities.

One possible solution for automated voice screening based on acoustic analysis could be the use of telemedicine and/or telephony-based voice pathology assessment with various automated analysis algorithms. Several earlier studies showed that the performance of speech and speaker recognition systems declined when processing telephone-quality signals, compared with systems that used high-quality recordings.

Only sporadic articles in the scientific literature analyse the possibility of using smartphones as a voice analysis instrument. However, the smartphone is currently one of the most rapidly developing technologies. The number of people using a smartphone is rapidly increasing; consequently, it would be of great benefit to use this device to analyse the human voice and to refer the patient to a specialist while the disease is still in its early stages.

Our experimental study aimed at a more substantial investigation of different types of microphones, namely contact (throat), oral, and smartphone microphones.

It was also intended to test oral and contact (throat) microphones in a hospital with voice laboratory facilities, using simultaneously recorded voices, and to establish the differences.

For the first time, we compared acoustic voice data recorded simultaneously from oral and smartphone microphones.

Moreover, in the current study we aimed to fuse the data gathered from the questionnaires with the data from the acoustic analysis in order to show how this fusion affects the rate of classification into two classes (healthy and pathological) obtained with the usual statistical methods. To the best of our knowledge, we were the first to address this subject.

In addition, a Random Forest classifier was applied to the data gathered from the questionnaires and the acoustic voice analysis in order to see how it improves the rate of classification into the two classes (healthy and pathological) compared with the usual statistical methods.

Finally, an objective was set to develop software that integrates the data gathered from the questionnaire and the acoustic voice analysis, and to create a user-friendly tool for detecting laryngeal pathology with high accuracy using non-invasive data.

3. SCIENTIFIC LITERATURE REVIEW

3.1. Voice and computer

In our age of technology, computer scientists and speech specialists are asked from time to time about the possibility of operating a computer by voice command alone. Many specialists, including those outside medicine, already use voice recognition programs; within medicine, radiologists, pathologists, and general practitioners are eager to use programs that enter data through a microphone in order to save precious office time. Voice recognition programs currently available on the market are designed to enter all the information transmitted through the microphone. Installing such programs could save a person’s or a company’s time and money [12, 13]. These factors relate to the communication process between the human and the interactive system, the objective being to develop or improve the safety, utility, effectiveness, and usability of interactive computer-based products. Consequently, not only progress in computer science (the domain of software engineers) but also human physical and mental characteristics (also known as human factors) are of great importance, as is the context in which the interaction takes place [14, 15].

Nevertheless, the early steps of voice recognition programs were met with a lot of scepticism. In 1969, the influential John Pierce wrote an article questioning the prospects of speech recognition technology and criticised “the mad inventors and unreliable engineers” working in the field. In his article entitled “Whither speech recognition”, Pierce argued that speech recognition was futile because the task of speech understanding is too difficult for any machine. It must be noted that Pierce wrote at a time when such technological possibilities did not exist. What Pierce’s article failed to foretell was that even limited success in speech recognition (simple small-vocabulary speech recognizers) would find important and gratifying applications, especially within the telecommunications industry [16]. The voice-processing market was projected to exceed $1.5 billion by 1994 and was growing at about 30 % a year [17].

All speech recognition programs work quite well if the recorded voice is healthy and sounds as usual, if there is no background noise, and if the pronunciation in the given language is faultless. Otherwise, data input may be burdened with problems. According to Alcantud et al [18], the problems that are not related to the larynx but prevent satisfactory interaction between the human and the computer are as follows: personal phonation differences, acoustic ambiguity, variable utterance of speech sounds, phonetic variation, coarticulation, time variation, and background noise.

3.2. Acoustic voice analysis

In our knowledge-based societies, communication skills have become more and more important in everyday duties. Voice disorders have become a socio-economic factor: in 2000, one study estimated the resulting losses to the Gross National Product of the USA at up to $186 billion annually, on the basis that approximately 6–9 % of the entire population suffer from communication or voice disorders [19–21]. In 2015, Mehta et al published a study showing that voice disorders have been estimated to affect approximately 30 % of the adult population in the United States at some point in their lives, with 6.6–7.6 % of individuals affected at any given point in time [22].

In 1960, acoustic voice analysis was identified as essential, and it has been increasingly used both for research and for objective assessment of voice disorders in clinical settings. In 2001, Dejonckere et al presented a protocol of the European Laryngological Society (ELS) comprising five multidimensional aspects: visual analysis, perceptual evaluation, acoustic analysis, aerodynamic measures, and self-evaluation by the patient [23, 24]. Acoustic measures of the severity of dysphonia are now commonly used in voice clinics because they lend themselves to algorithms for automated voice analysis and screening, provide objective non-invasive voice data, and make it feasible to document and quantify changes in dysphonia and the outcomes of therapeutic and surgical treatment of voice problems [25]. Voice signals have traditionally been analysed in the time, amplitude, frequency, and quefrency domains [26].

According to Titze, acoustic voice signals could be classified into three types:

Type 1 signals are nearly periodic,

Type 2 signals contain intermittency, strong subharmonics, or modulations,

Type 3 signals are chaotic or random.

Therefore, different methods of voice analysis should be applied depending on the voice signal type. For type 1 signals, perturbation analysis has considerable utility and reliability. For type 2 signals, visual displays (e.g., spectrograms, phase portraits, or next-cycle parameter contours) are most useful for understanding the physical characteristics of the oscillating system. For type 3 signals, perceptual ratings of roughness (and any other auditory manifestation of aperiodicity) are likely to be the best measures for clinical assessment [27].

The acoustic parameters most frequently used for acoustic voice analysis in the scientific literature are jitter, shimmer, fundamental frequency (F0), harmonic-to-noise ratio (HNR), signal-to-noise ratio (SNR), and normalized noise energy (NNE). Perturbation, the cycle-to-cycle variation present in a waveform, is commonly analysed for an acoustic signal using the parameters of jitter and shimmer.

Jitter measures the cycle-to-cycle frequency variation of a signal. Shimmer measures the cycle-to-cycle amplitude variation. Perturbation parameters of percent jitter and percent shimmer were calculated for the voice sample segments [28]. F0 quantifies the vibratory frequency of the vocal folds. SNR reflects the dominance of the harmonic signal over noise (measured in dB) [29]. HNR is a measure of voice purity; it is the ratio of the energy of the harmonics to the noise energy present in the voice (measured in dB) [30]. NNE is computed automatically from the voice signal using an adaptive comb filtering method performed in the frequency domain [31].
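The perturbation measures described above can be illustrated with a minimal sketch. It is not the algorithm of any particular voice analysis package: it assumes the common "local" definition, in which percent jitter (or shimmer) is the mean absolute difference between consecutive cycle periods (or peak amplitudes) divided by their mean. The function names and the sample values are hypothetical, and the cycle periods and amplitudes are assumed to have been extracted already by a pitch detection step.

```python
from statistics import mean


def percent_perturbation(values):
    """Mean absolute cycle-to-cycle difference as a percentage of the mean value.

    Assumed "local" perturbation definition; other programs may use
    different smoothing or normalisation.
    """
    if len(values) < 2:
        raise ValueError("at least two consecutive cycles are required")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return 100.0 * mean(diffs) / mean(values)


def percent_jitter(periods_s):
    """Cycle-to-cycle variation of the glottal period (frequency perturbation)."""
    return percent_perturbation(periods_s)


def percent_shimmer(peak_amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (amplitude perturbation)."""
    return percent_perturbation(peak_amplitudes)


def mean_f0_hz(periods_s):
    """Mean fundamental frequency as the reciprocal of the mean period."""
    return 1.0 / mean(periods_s)


# Hypothetical values for five consecutive cycles of a sustained vowel.
periods = [0.0100, 0.0101, 0.0099, 0.0100, 0.0102]   # seconds
amplitudes = [0.80, 0.82, 0.79, 0.81, 0.80]          # arbitrary linear units

print(f"F0      = {mean_f0_hz(periods):.1f} Hz")
print(f"jitter  = {percent_jitter(periods):.2f} %")
print(f"shimmer = {percent_shimmer(amplitudes):.2f} %")
```

Because both measures depend on how the individual cycles are detected, values obtained with this sketch should not be expected to match those of a specific commercial or free program.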

The sustained vowel (/a/, /e/, or /i/) is a classical and widely used material for acoustic analysis. Currently, acoustic analysis is performed by selecting a particular segment from each voice signal and analysing it with defined acoustic algorithms. Titze suggested that only periodic or nearly periodic voice signals should be analysed using acoustic measures [27]. Many studies published in the English-language scientific literature use two or three of the acoustic voice parameters already mentioned in this thesis [32–36]. Some authors try to use continuous speech analysis in clinical practice because some voice disorders, such as adductor spasmodic dysphonia, can be characterized by a relatively normal voice during sustained vowel production, whereas the voice produced in connected speech is often more severely compromised [35]. Authors such as Krom [37] and Revis et al [38] reported no significant differences between the ratings of a sustained vowel and running speech. In another study, Wolfe et al [39] found a significant difference between the ratings of the two sample types; this finding was supported in part by Zraick et al [40], who reported a statistically significant difference between the judgments of sustained vowels and recordings of a picture description. In our current study, it was decided to analyse sustained vowel samples with the 6 acoustic voice parameters most often described in the scientific literature. The following reasons for this preference were given by Maryn et al:

First, a sustained vowel represents relatively time-invariant phonation, whereas continuous speech involves rapid and frequent changes caused by glottal and supraglottal mechanisms.

Second, in contrast to continuous speech, sustained mid-vowel segments do not contain unvoiced phonemes, fast voice onsets and terminations, or prosodic fluctuations in fundamental frequency and amplitude.

Third, sustained vowels are not affected by speech rate, vocal pauses, phonetic context, and stress.

Fourth, classic fundamental frequency or period perturbation and amplitude perturbation measures strongly rely on pitch detection and extraction algorithms. As a consequence, they lose precision in continuous speech analyses, in which perturbation is significantly affected by intonational patterns, voice onsets and offsets, and unvoiced fragments.

Fifth, sustained vowels can be elicited and produced with less effort and in a more standardized manner than that of continuous speech.

Sixth, there is no linguistic loading in a sustained vowel, resulting in relative immunity from influences related to dialect and region, language, cognition, and so on [41].

The replacement of analogue recording systems with digital recording systems, the availability of automated analysis algorithms, and the non-invasiveness of acoustic measures, combined with the fact that acoustic parameters provide easy quantification of dysphonia improvement during the treatment process, have led to considerable interest in clinical voice quality measurement using acoustic analysis techniques [42]. Automatic systems for detecting illness related to abnormalities of the vocal signal have been developed; they are mainly based on signal processing or on machine learning and data mining techniques. There is considerable experience with algorithmic approaches to the automatic analysis of signals. Software tools, both commercial and freely available (e.g. Multi-Dimensional Voice Program (MDVP), WinPitch, Praat, VOICEBOX), allow specialists to manipulate and analyse voice signals efficiently [43]. Mendes et al [44] described the automatic voice analysis programs currently on the market; they are listed in Table 3.2.1.

Table 3.2.1. Voice analysis software

Freely available software

• Audacity 2.0.0
• EMU Speech Database System 2.3.0
• WaveSurfer 1.8.5
• Praat 5.3.04
• Speech Filing System (SFS) 4.8
• SFS|WASP 1.51
• SIL International: Speech Analyser 3.0.1

Commercial software

• Dr. Speech, version 4: Vocal Assessment, Real Analysis, Speech Training, ScopeView, Phonetogram, Speech Therapy 4
• FonoTools
• KayPENTAX: Multi-Speech, Model 3700; Voice Range Profile (VRP), Model 4326; Multi-Dimensional Voice Program (MDVP), Model 5105; Motor Speech Profile (MSP), Model 5141
• LingWAVES Voice Clinic Suite Pro
• Seegnal: MasterPitch Pro, VoiceStudio, SingingStudio
• Estill Voice International: VoicePrint, Estill Voiceprint Plus
• Time Frequency Analysis Software - TF32
• Video Voice Speech Training System 3.0
• VoxMetria
• Vocalgrama

Since 1998, the Department of Otorhinolaryngology of the Hospital of Lithuanian University of Health Sciences Kauno Klinikos has used Tiger Electronics Dr. Speech (Voice assessment 3.0) software. Tiger DRS software is one of the most frequently used acoustic voice analysis programs and is comparable to the Multi-Dimensional Voice Program (MDVP, Kay Pentax, NJ, USA), which is considered the gold standard [45, 46]. It is also comparable to the free, open-source voice analysis program Praat. Unlike the programs mentioned above, Praat runs on Windows and Macintosh, on the free Linux operating system, and on other systems such as FreeBSD, SGI, Solaris, and HPUX, so it can be installed on any equipment without the need for a specific operating system. However, when compared with the other voice analysis programs, it shows weak or moderate correlations in frequency perturbation and moderate or strong correlations in amplitude perturbation [47]. The reason for this difference is the use of distinct algorithms to extract voice data from voice samples.

Thus, first of all, anyone choosing a program must be aware that inter-program reliability is limited, because the analysis algorithms and methods for the same parameters differ between software packages; this makes it difficult to establish common thresholds for acoustic voice parameters. Secondly, the sampling rate should be 44.1 kHz, and the recording should be left uncompressed, typically in wav-file format [24]. A further requirement for the use of objective acoustic analysis in research or clinical practice is a high level of accuracy and reliability of the hardware, which was analysed thoroughly by Svec [48].
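The recording format requirement above can be verified before a file is passed to any analysis program. The following sketch is only an illustration under stated assumptions: it uses the Python standard-library wave module, which reads uncompressed PCM WAV files, and it treats 44.1 kHz as a minimum rather than an exact value. The function name check_recording and its parameter are hypothetical.

```python
import wave


def check_recording(path, min_rate_hz=44100):
    """Report the sampling rate, bit depth and channel count of a WAV file.

    Assumption: a file that the standard-library reader cannot parse as
    uncompressed PCM is treated as not meeting the requirement.
    """
    try:
        with wave.open(path, "rb") as wav:
            rate = wav.getframerate()
            bits = 8 * wav.getsampwidth()
            channels = wav.getnchannels()
    except wave.Error:
        return {"sample_rate_hz": None, "bits_per_sample": None,
                "channels": None, "ok": False}
    return {"sample_rate_hz": rate, "bits_per_sample": bits,
            "channels": channels, "ok": rate >= min_rate_hz}
```

A call such as check_recording("vowel_a.wav") (a hypothetical file name) returns a dictionary whose "ok" entry is False for, say, an 8 kHz telephone-quality recording.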

3.3. Microphone and acoustic voice analysis

This part of the literature review addresses the factors that influence the accuracy and comparability of the measurements of acoustic voice parameters. These factors may arise from variations in the data acquisition environment [49], microphone types or placements [48–50], recording systems, and methods of voice signal analysis [46, 51–54].

During voice and speech production, vibrations from the vocal folds are transmitted through the vocal tract and through the body tissue to the skin surface. Unlike microphones that record in the air, contact microphones and/or accelerometers (i.e., vibration sensors that use the piezoelectric effect to convert mechanical energy into electrical energy in response to applied stress) can sense these skin surface vibrations. Their output signal, which mirrors the sound signal generated by the vocal fold vibrations, can be used to transmit voice signals into analysis systems [55, 56] and can even represent rapid subglottal pressure vibrations [57]. Compared with the conventional acoustic microphones routinely used for voice recordings, contact microphones are less sensitive to background noise from the surrounding environment. Moreover, contact microphones and/or accelerometers have the potential to eliminate the acoustic effects of the vocal tract, thus providing enhanced voice signal clarity in environments with elevated ambient noise [56, 58].

Microphones are the basic tools for registering voice signals; they convert the sound pressure signal into an electric signal with the same characteristics. Consequently, the type and technical characteristics of the microphone may determine the final results of acoustic voice analysis.

Despite the fact that voice and speech recordings and measurements are carried out routinely for clinical and research purposes, microphone selection remains a matter of some controversy [48–50, 59]. According to Dejonckere et al [23], microphones have to meet the following conditions to enable acceptable voice recordings:

1. Condenser type. Cardioid characteristics are recommended because they allow the microphone to focus more directly on the voice signal [48, 50].

2. Frequency range of 20–20000 Hz to cover the whole spectrum of the human voice [48].

3. The frequency response curve should be flat, with a maximum variation of 2 dB over 20–8000 Hz, preferably up to 20000 Hz [48].

4. The voice signal should be protected as much as possible from the equivalent noise level generated by every component of the microphone. The voice signal must be loud enough to exceed the intrinsic noise by at least 15 dB [48].

5. Maximum sound pressure level of 126 dB for 3 % total harmonic distortion [27].

6. High sensitivity, so that a higher gain level, and hence a higher noise level, can be avoided. Condenser microphones with a sensitivity level lower than 60 dB are not recommended for clinical voice investigations [24, 27].

Although voice recordings have been carried out in clinical practice for many years, the debate on microphone selection is still going on. The validity and reliability of acoustic measurements are strongly affected by background noise [60]. Owing to its proximity to the voice source, a contact microphone is less sensitive to background noise and provides enhanced voice signal clarity in noisy environments [56, 61–63]. It is suggested that the acoustic environment should have a signal-to-noise ratio of at least 30 dB to produce valid results in audio analysis [60]. This recommendation is easily fulfilled when voice recordings are made in a special sound-proof booth. However, it may not be feasible when voice recordings are obtained in an ordinary environment for voice disorder screening.
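The 30 dB recommendation can be checked on a test recording before a screening session. The sketch below is illustrative and not taken from the cited studies: it assumes that a segment containing the sustained vowel and a segment containing only room noise have been cut from the same recording, and it estimates the ratio from their root-mean-square amplitudes. The function names and the threshold parameter are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    if not samples:
        raise ValueError("empty segment")
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(voice_segment, noise_segment):
    """Estimated signal-to-noise ratio in dB from two segments of one recording.

    Assumption: the voiced segment is dominated by the voice, so its RMS
    stands in for the signal level.
    """
    noise_level = rms(noise_segment)
    if noise_level == 0:
        raise ValueError("noise segment is digital silence")
    return 20.0 * math.log10(rms(voice_segment) / noise_level)


def environment_is_adequate(voice_segment, noise_segment, min_snr_db=30.0):
    """True when the estimated ratio reaches the recommended minimum."""
    return snr_db(voice_segment, noise_segment) >= min_snr_db
```

Since amplitude ratios are converted with 20·log10, a voice segment whose RMS amplitude is 100 times that of the noise segment corresponds to 40 dB, which satisfies the recommendation, whereas a factor of 10 corresponds to 20 dB, which does not.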

Nevertheless, several studies of contact microphones revealed decreased speech signal intelligibility compared with headset microphones [56, 63, 64]. Moreover, contact microphones are not very effective in transmitting consonant sounds and high frequencies [65]. The elastic properties of the underlying body tissues, which act as a low-pass filter with a 3 kHz cut-off frequency [66], limit the frequency range of the resulting signal.

It was demonstrated that in the case of non-stationary background noise, the use of contact microphones can significantly improve the accuracy of separating voice recordings of healthy subjects from those of subjects experiencing voice-related problems [67–69]. By using recordings from both types of microphones, Dupont et al [66] achieved 80 % recognition accuracy when discriminating between pathological and normal cases. Mubeen et al [70] achieved some increase in performance when combining features of one type (weighted linear predictive cepstral coefficients) extracted from both types of recordings. Erzin [71] proposed a new framework that learns joint sub-phone patterns of contact and acoustic microphone recordings using a parallel-branch HMM structure. Application of this technique resulted in a significant improvement in throat-only speech recognition. In other studies, accelerometers have been found useful for voice and speech measurements, namely for detecting glottal vibrations, extracting the voice fundamental frequency (F0), and measuring frequency perturbation [58]; evaluating acoustic voice characteristics before and after intubation [72]; voice accumulation/dosimetry [61, 73]; estimating sound pressure levels of voiced speech [61]; mapping neck surface vibrations during vocalized speech [74]; and measuring facial bone vibration in resonant voice production [75, 76].

There is a lack of data in the scientific literature comparing the applicability of contact microphones for acoustic voice measurements for voice screening purposes and/or the combined use of standard and contact (throat) microphones.

Therefore, one of the objectives of this research was to validate the suitability of the throat microphone signal for voice screening, to evaluate the reliability of acoustic voice parameters obtained simultaneously using oral and contact (throat) microphones, and to investigate the utility of the combined use of these microphones for voice categorization.


3.4. Questionnaires and voice analysis

Questionnaire data, providing essential statements related to various aspects of a subject’s health, are easily obtained and constitute an important, yet under-exploited, source of information obtained non-invasively. In 1997, Jacobson et al [77] were the first to use a questionnaire composed of 30 questions, the Voice Handicap Index (VHI). It was the first questionnaire created to investigate how voice diseases affect different aspects of disability: physical, emotional, and functional. In 2005, Franic et al [78] published a study comparing the psychometric properties of voice-disordered quality of life (VQOL) instruments. Nine VQOL instruments were identified through a comprehensive literature search. The instruments were evaluated against 11 measurement standards related to item information, versatility, practicality, breadth and depth of health measure, reliability, validity, and responsiveness. In comparison with the other 8 questionnaires, the VHI questionnaire showed much better results. The VHI questionnaire was validated in 12 different languages [79]. One problem that may arise with the use of the VHI is its length. In routine diagnostics, voice patients may need to undergo several further measurements. Therefore, the 30 items of the VHI might require too much time (about 10–15 min) [80]. After the VHI became widely used in clinical practice, shorter versions of the questionnaire were created: VHI-12 and VHI-10 [81], and VHI-9 [80]. The shorter versions of the VHI have been adopted in the German and French languages [80].

In 2005, Bach et al [82] created a simple, short, self-administered symptom index of 4 items with excellent criterion-based and construct validity: the glottal function index (GFI) questionnaire. The GFI questionnaire has been used in the Center for Voice Disorders of Wake Forest University (Winston-Salem, NC) and was initially conceived as an instrument for evaluating glottal insufficiency and its response to therapy. A strong correlation was identified between the total GFI and total VHI scores (correlation coefficient 0.61, P<0.001). The GFI questionnaire was translated into Lithuanian and validated in 2011 by Pribuišienė et al [83]. Based on the normative data, Bach et al [82] considered a GFI score higher than 4.0 (mean + 2 SDs) to be abnormal. In the GFI-LT study, a score higher than 3.0 was found to be the limiting value distinguishing patients from healthy controls, with a sensitivity of 88 % and a specificity of 84 %. The same cut-off score was obtained using ROC curves and was also reported by Cohen et al [84] during the validation of the GFI for children. The GFI questionnaire was used successfully for monitoring the results of surgery, and it was found to present statistically significant differences between pre- and postoperative groups [85].
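The scoring rule described above can be sketched in a few lines of Python. The sketch assumes the four GFI items are each rated from 0 to 5 (as listed in Table 4.5.1) and uses the GFI-LT cut-off of 3.0 as a default; the function names and the example values in the usage note are hypothetical and serve illustration only.

```python
def gfi_total(g1, g2, g3, g4):
    """Sum of the four GFI items, each rated from 0 to 5 (total 0 to 20)."""
    items = (g1, g2, g3, g4)
    for value in items:
        if not 0 <= value <= 5:
            raise ValueError("each GFI item must lie between 0 and 5")
    return sum(items)

def gfi_screen_positive(total, cutoff=3.0):
    """True when the total exceeds the cut-off (3.0 was reported for GFI-LT)."""
    return total > cutoff

def sensitivity_specificity(totals, is_patient, cutoff=3.0):
    """Sensitivity and specificity of the cut-off on labelled GFI totals."""
    tp = sum(1 for t, p in zip(totals, is_patient) if p and t > cutoff)
    fn = sum(1 for t, p in zip(totals, is_patient) if p and t <= cutoff)
    tn = sum(1 for t, p in zip(totals, is_patient) if not p and t <= cutoff)
    fp = sum(1 for t, p in zip(totals, is_patient) if not p and t > cutoff)
    return tp / (tp + fn), tn / (tn + fp)
```

For example, item ratings of 1, 0, 2 and 1 give a total of 4, which exceeds the GFI-LT cut-off of 3.0 but not the value of 4.0 that Bach et al treated as the upper limit of normal.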

Responses to specific questions may contain information that is not present in the acoustic or visual modalities. Analysis of query data can be used for preventive healthcare of the larynx, yet very few attempts have been made to use it in screening [86]. To identify the most important statements in the questionnaires, certain authors used a genetic search over different classifiers and applied the selected statements in an SVM to categorize the questionnaire data into a healthy class and two classes of pathology: nodular and diffuse [87].

3.5. Classifiers and voice analysis

Usually, classifiers are used in acoustic voice analysis to find the combination of features that achieves the best classification rate. The following classifiers are used in the scientific literature: learning vector quantization (LVQ) [88], Gaussian mixture model (GMM) [89], hidden Markov model (HMM) [90], and linear discriminant analysis (LDA) [91]. The following discriminative methods are also used: decision tree, random forest (RF) [92], k-nearest neighbours (k-NN) [93], multi-layer perceptron (MLP) [94], and support vector machine (SVM) [95]. Ensemble methods, which combine separate classifiers into a multiple classifier system, are also sometimes used [86]. In 2012, Arjmandi et al published a study comparing classifiers; it was determined that the SVM was the strongest among the classifiers investigated for voice quality assessment [96]. Nevertheless, various authors agree that the SVM classifier has some disadvantages: for example, there are cases in which the SVM cannot be trained, and its output values can be difficult to interpret.

RF is a popular and efficient algorithm for classification and regression, based on ensemble methods. Its advantages were validated and consolidated by the inventors [92, 97]:

it is applicable when there are more predictors than observations, performs embedded data selection, and is relatively insensitive to a large number of irrelevant data;

it incorporates interactions between predictors;

it is based on the theory of ensemble learning, which allows the algorithm to learn both simple and complex classification functions accurately;

it is applicable to both binary and multi-category classification tasks and, according to its inventors, does not require much fine-tuning of parameters; the default parameterization often leads to excellent performance.


There is a lack of literature comparing these two classifiers. Statnikov et al. found that random forests are outperformed by support vector machines both when no gene selection is performed and when several popular gene selection methods are used; however, in 2012 Englund et al [98] determined that for some specific tasks RF has advantages over the SVM. At present, the two classifiers perform very similarly, and both have shown high classification results.

Some previous attempts to recognize the pathology in the larynx, using voice signal features, are summarized in Table 3.5.1.

The combined use (fusion) of different information sources, such as voice analysis and questionnaire data, in non-invasive measurement-based tools enabling preventive screening for laryngeal disorders is one of the main objectives of this thesis. One of the aims was voice pathology detection from the combined use of non-invasive laryngeal data, specifically voice recordings and responses to a questionnaire, and, by combining those results, the creation of a user-friendly, highly accurate tool for laryngeal pathology detection using non-invasive data. Currently, the tool is oriented to experts working at departments of otolaryngology, but in the near future the tool should run on a smart phone, including voice recording, and become much more versatile. Both modalities can be easily collected using off-the-shelf solutions. Because of missing data in the query modality, imputation before decision-level fusion is compared to complete-case analysis: this part investigates whether any gain can be achieved by imputing RF decisions instead of discarding instances with missing modalities in fusion. The query modality is additionally explored by extracting rules using affinity analysis [86].


Table 3.5.1. History of non-invasive screening for voice pathology

No. | Database (recordings) | Features used | Classifier (accuracy, %) | Reference
1 | KobriElkobba (15 norm, 20 path) | RASTA-PLPCCs + ∆ | HMM (sens. 87.5, spec. 100) | Saudi et al. (2012) [99]
2 | UCLA-RABTA (50 norm, 50 path) | Wavelet transform | MLP (sens. 90, spec. 100) | Salhi et al. (2008) [100]
3 | MEEI (53 norm, 44 Edem, …) | MFCCs | MLP (norm 99, Edem 96, other 93) | Marinus et al. (2009) [101]
4 | MEEI (53 norm, 82 path) | MFCCs + ∆ + ∆∆ | GMM (94) | Godino-Llorente et al. (2001) [102]
5 | MEEI (53 norm, 82 path) | MFCCs + ∆ | MLP (94), LVQ (96) | Godino-Llorente and Vilda (2004) [103]
6 | MEEI (53 norm, 163 path) | HNRs at 4 freq. bands | k-NN (94.28) | Shama et al. (2007) [104]
7 | MEEI (53 norm, 173 path) | MFCCs, noise | SVM (95) | Godino-Llorente et al. (2005) [95]
8 | MEEI (53 norm, 173 path) | MFCCs + ∆ | GMM (94) | Godino-Llorente et al. (2006) [105]
9 | MEEI (53 norm, 173 path) | MFCCs | MLP (89.6) | Sáenz-Lechón et al. (2006) [106]
10 | MEEI (53 norm, 173 path) | (MFCCs, HNR, NNE, GNE) + ∆ | SVM (93.01), GMM (94.35) | Sáenz-Lechón et al. (2008) [107]
11 | MEEI (53 norm, 173 path) | MFCCs | MLP (88.3) | Fraile et al. (2009) [108]
12 | MEEI (53 norm, 173 path) | Chaos (TISEAN) | MLP (99.69) | Henriquez et al. (2009) [109]
13 | MEEI (53 norm, 173 path) | Modulation spectrum | SVM (94.1) | Markaki and Stylianou (2009, 2011) [110, 111]
14 | MEEI (53 norm, 173 path) | MFCCs | GMM-SVM (96.1) | Wang et al. (2010) [112]
15 | MEEI (53 norm, 173 path) | MFCCs, HNR, NNE, GNE | GMM (94.8) | Martínez et al. (2012) [113]
16 | MEEI (53 norm, 175 path) | Noise (LPC-derived) | LDA (96.5) | Parsa and Jamieson (2000) [114]
17 | MEEI (53 norm, 175 path) | Perturbation, spectral, noise | LDA (98.7) | Parsa and Jamieson (2001) [115]
18 | MEEI (53 norm, 224 path) | Cochlear filter-bank | k-NN (89.19) | Shama et al. (2004) [116]
19 | MEEI (53 norm, 638 path) | Perturbation, noise | k-NN (96.1) | Hadjitodorov and Mitev (2002) [45]
20 | MEEI (53 norm, 657 path) | MFCCs, MDVP | HMM (98.3) | Dibazar et al. (2002) [117]
21 | MEEI (53 norm, 657 path) | MFBECs | k-NN (99.59), LDA (98.48) | Hariharan et al. (2009) [118]
22 | Doctor Negrin (85 norm, 57 path: 3 GRBAS levels) | Chaos (TISEAN) | MLP (82.47) | Henriquez et al. (2009) [109]
23 | Doctor Negrin (100 norm, 68 path) | Jitter, shimmer, spectral, noise, chaos | MLP (92.76) | Alonso et al. (2005) [119]
24 | PdA (100 norm, 100 path) | Modulation spectrum | SVM (81.2) | Markaki and Stylianou (2009) [110]
25 | SVD (650 norm, 1320 path) | MFCCs, HNR, NNE, GNE | GMM (79.4) | Martínez et al. (2012) [113]
26 | LUHS (75 norm, 237 path) | Various audio features | 5 SVMs (95.13) | Gelzinis et al. (2008) [120]
27 | LUHS (75 norm, 75 diffuse, 162 nodular) | Various audio features | 4 SVMs (84.65) | Gelzinis et al. (2008) [120]
28 | LUHS (103 norm, 671 path) | MFCCs | GMM-SVM (89) | Vaiciukynas et al. (2012) [121]
29 | LUHS (103 norm, 212 diffuse, 459 nodular) | MFCCs | GMM-SVM (70) | Vaiciukynas et al. (2012) [121]
30 | LUHS (139 norm, 112 path) | Various audio features | 50 RFs (86.86) | Vaiciukynas et al. (2014a, 2014b) [86, 122]

1st deriv., or velocity (∆); 2nd deriv., or acceleration (∆∆); Mel freq. (MF); cepstral coefficient (CC); band energy coefficient (BEC); relative spectral transform perceptual linear prediction (RASTA-PLP); linear predictive coding (LPC); harmonic-to-noise ratio (HNR); normalized noise energy (NNE); glottal-to-noise excitation (GNE); TISEAN (Hegger et al., 1999) [123]; MDVP (Hema et al., 2009) [124].


3.6. Acoustic voice analysis and smartphones

Automated acoustic analysis-based voice screening could be one of the potential approaches to help primary care physicians and other public health care services identify patients who require early otolaryngological referral, thereby improving the diagnostics and management of patients suffering from voice disorders. The main goal of automated pathological voice/speech detection systems is to categorize any input voice as either normal or pathological [125]. Currently, there is an increasing demand for robust measures of voice quality. However, comprehensive, systematic and routine measurement of acoustic voice parameters for diagnostic and/or voice screening purposes, or for following treatment, is only possible in hospitals with voice laboratory facilities [126]. One possible solution for providing automated acoustic analysis-based voice screening could be the use of telemedicine and/or telephony-based voice pathology assessment using various automated analysis algorithms. Several earlier studies showed that the performance of speech and speaker recognition systems deteriorated when processing telephone-quality signals, compared to systems using high-quality recordings [127]. In 2008 and 2014, Vogel et al [128, 129] published findings comparing modern recording devices intended for speech analysis: smart phones, landline telephones, laptops, and hard disc recorders. Speech samples were acquired simultaneously from 15 healthy adults using four devices, and the samples were analysed acoustically for measures of timing and voice quality. The four devices were compared with the benchmark device, a high-quality recorder coupled with a condenser microphone. The authors concluded that acoustic analyses cannot be assumed to be comparable if different recording methods are used to record the speech.

However, more recent studies highlighted the real possibility of cost-effective remote detection and assessment of voice pathology over telephone channels, reaching normal/pathological voice classification accuracy close to 90 % [125, 130–132].

Current progress in digital technologies has enhanced access to portable devices capable of recording acoustic signals in high-quality audio formats, as well as transmitting the digitized audio files via the computer network. The high sampling rate (48.0–90.0 kHz) afforded by contemporary smart phone models may prove to be an important aspect, making the smart phone an easily accessible audio recording tool for collecting voice recordings that preserve sufficient acoustic detail for voice analysis and monitoring [133].


As a result, some sporadic reports regarding the applicability and effectiveness of iPhone-based voice recordings for acoustic voice assessment have already appeared in the scientific literature [133]. A more recent study by Mat Baki et al demonstrated that voice recordings performed with the iPod’s internal microphone and analysed with the OperaVox™ software application installed on an iPod touch (4th generation) were statistically comparable to the gold standard, i.e., the Multidimensional Voice Program (MDVP, KayPentax, NJ, USA) [126]. In 2013 and 2015, Mehta et al published studies on a smart phone-based ambulatory voice health monitor connected to an accelerometer on the neck surface below the larynx, used to acquire and analyse a large set of ambulatory data from patients with hyperfunctional voice disorders (before and after the treatment stages) and to compare the findings with matched control subjects. These authors determined that wearable voice monitoring systems have the potential to provide more reliable and objective measures of voice use during everyday activities, which can enhance the diagnostic and treatment strategies for voice disorders [134, 135].

Therefore, one of the aims of the present study was to evaluate the reliability of acoustic voice parameters obtained simultaneously using oral and smart phone (SP) microphones, to investigate the benefit of the combined use of the SP microphone signal and the GFI questionnaire data for voice categorization with regard to voice screening, and to develop software targeted at otolaryngologists for laryngeal disorder screening.


4. METHODS

4.1. Ethics

The current study was approved by the Kaunas Regional Biomedical Research Ethics Committee (No. P2-24/2013). All patients and healthy volunteers provided written informed consent. The processing of patients’ personal data in this clinical study was approved by the State Data Protection Inspectorate (No. 2R-648 (2.6-1)).

4.2. Study design

During the period of 2011–2015, 656 participants were recruited for our study. Nine patients were not included in this study, as their data were lost. There were 337 healthy volunteers and 319 patients who attended the Department of Otorhinolaryngology of the LUHS. The present study comprised 4 parts.

The normal voice subgroup was composed of 336 selected healthy volunteers who considered their voice to be normal. They had no complaints concerning their voice and no history of chronic laryngeal diseases or other long-lasting voice disorders. All of them were free from any known hearing problems and free from common cold or upper respiratory infections at the time of voice recording. The voices of this group of individuals were also evaluated as healthy voices by clinical voice specialists. Furthermore, no pathological alterations in the larynx of the subjects of the normal voice subgroup were found during video laryngostroboscopy (VLS). Digital high-quality VLS recordings were performed with an XION EndoSTROB DX device (XION GmbH, Berlin, Germany) using a 70° rigid endoscope. Acoustic voice signal parameters of these normal voice subgroup subjects, obtained using Dr. Speech software (Tiger Electronics, Seattle, WA; subprogram: voice assessment, version 3.0), were within the normal range.

The pathological voice subgroup consisted of 319 patients who represented a rather common and clinically discriminative group of laryngeal diseases, that is, mass lesions of the vocal folds and paralysis. Mass lesions of the vocal folds included in the study consisted of nodules, polyps, cysts, papillomata, keratosis, and carcinoma. There were also patients with neurological diseases (Parkinson’s disease, Huntington’s chorea). Pathological voice group patients were recruited from consecutive patients who were diagnosed with the laryngeal diseases mentioned previously. The clinical diagnosis was based on typical clinical signs revealed during VLS and direct microlaryngoscopy. Patients with neurological diseases were referred to us from the Neurological department of our clinic; their diagnosis was based on the typical clinical signs. In all cases of mass lesions of the vocal folds, the final diagnosis was confirmed by the results of the histological examination of the removed tissue.

Demographic data of the total study group and diagnoses of the pathological voice subgroup are presented in Table 4.2.1 and Table 4.2.2. These patients were serially enrolled and, therefore, likely represent the real incidence of pathologies in our series and can be considered to be clinically representative of the population of voice-disordered patients.

Table 4.2.1. The demographic data of total study group

Diagnosis | Total number (n=656) | Male (n=253) | Female (n=403) | Age in years, mean | Age in years, SD
Healthy volunteers | 337 | 110 | 227 | 37.7 | 12.86
Patients | 319 | 143 | 176 | 46.7 | 14.90

SD – standard deviation.

In study No I (analysis of oral and throat microphones using discriminant analysis), we admitted 157 individuals. The normal voice subgroup was composed of 105 selected healthy volunteers. The pathological voice subgroup consisted of 52 patients who represented a rather common and clinically discriminative group of laryngeal diseases, that is, mass lesions of the vocal folds and paralysis.

In study No II (analysis of oral and throat microphones using Random Forest classifier) we admitted 273 subjects (163 normal voices and 110 pathological voices) of both genders, ranging from 19 to 85 years of age.

In study No III (analysis of oral and smart phone microphone data using the Random Forest classifier) we admitted 118 individuals examined at our Department of Otorhinolaryngology. The normal voice subgroup was composed of 34 selected healthy volunteers. The pathological voice subgroup consisted of 84 patients who represented a rather common, clinically discriminative group of laryngeal diseases including mass lesions of the vocal folds (nodules, polyps, cysts, papillomata, keratosis, and carcinoma), paralysis and reflux laryngitis.


Table 4.2.2. The demographic data of pathological voice subgroup

Diagnosis | Total number (n=319) | Male (n=143) | Female (n=176) | Age in years, mean | Age in years, SD
Vocal folds nodules | 45 | 3 | 42 | 35.6 | 11.84
Vocal fold polyp | 81 | 32 | 49 | 44.8 | 11.63
Vocal fold cyst | 15 | 1 | 14 | 40.6 | 12.62
Vocal fold cancer | 31 | 30 | 1 | 61.8 | 8.47
Vocal fold polypoid hyperplasia (Mb. Reinke-Hajek) | 40 | 11 | 29 | 52.4 | 9.44
Vocal fold keratosis | 8 | 7 | 1 | 50.8 | 15.35
Vocal fold papilloma | 21 | 12 | 9 | 37.2 | 13.43
Unilateral vocal fold paralysis | 20 | 10 | 10 | 54.7 | 13.54
Bilateral vocal fold paralysis | 2 | 1 | 1 | 61.5 | 6.36
Chronic hyperplastic laryngitis | 24 | 21 | 3 | 53.4 | 14.21
Cystis vestibulum laryngis | 2 | 0 | 2 | 63 | 4.23
Dysphonia | 6 | 1 | 5 | 36.8 | 13.81
Sulcus glottidis | 4 | 1 | 3 | 38.3 | 22.2
GERD (gastroesophageal reflux disease) | 11 | 8 | 3 | 46.3 | 15.63
Granuloma | 2 | 0 | 2 | 26.5 | 6.36
Acute laryngitis | 5 | 4 | 1 | 51 | 13.06
Presbylaryngis | 2 | 0 | 2 | 75.5 | 4.95
Monochorditis | 1 | 1 | 0 | 48 | –

SD – standard deviation.

In study No IV (testing of the Voice Test software), a database of 273 subjects of both genders (163 normal voices and 110 pathological voices), ranging from 19 to 85 years of age, was used to train the RF classifier. A mixed gender database containing 596 subjects (106 healthy men, 221 healthy women, 118 pathological men, 151 pathological women) was used to train the RF classifier on query data. Forty-five unseen subjects (9 healthy and 36 pathological) were admitted for testing the program.


4.3. Voice recordings

In studies No I and No II, voice recordings of a sustained phonation of the vowel sound /a/ (as in the English word “large”) were used. The subjects were asked to sustain the vowel /a/ at a comfortable pitch and loudness level for at least 5 seconds. Voice samples obtained from each subject were recorded in a sound-proof booth simultaneously with two microphones, as shown in Fig. 4.3.1: an oral cardioid AKG Perception 220 microphone (AKG Acoustics, Vienna, Austria) placed at a distance of 10.0 cm from the mouth (the subjects were seated with a head rest), keeping a microphone-to-mouth angle of about 90°, and a low-cost small contact (throat) microphone Stryker/Triumph PC (Clearer Communications, Inc, Burnaby, Canada) placed on the projection of the lamina of the thyroid cartilage and fixed with an elastic band. The location of the throat microphone on the thyroid lamina was chosen to acquire the strongest signal, because the average magnitude of the acceleration tends to be greatest on and in the immediate vicinity of the larynx [74]. The voice recordings were made in the wav file format on separate tracks using Audacity software (http://audacity.sourceforge.net/) at a rate of 44,100 samples per second, as shown in Fig. 4.3.2. Sixteen bits were allocated for each sample. An external M-Audio sound card (Cumberland, RI) was used for digitization of the voice recordings.

Fig. 4.3.1. Voice recording in a soundproof booth simultaneously with two microphones: oral cardioid AKG Perception 220 and contact (throat) microphone Stryker/Triumph PC.


Fig. 4.3.2. The voice recordings were made in the wav file format on separate tracks using Audacity software
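The recording format stated above (wav files, 44,100 samples per second, 16 bits per sample) can be verified before analysis. The following is a minimal sketch using Python's standard `wave` module; the function names are hypothetical, and the assumption that both microphone tracks are stored as channels of one file is made here for illustration and is not stated in the text.

```python
import wave

def wav_format(source):
    """Return (channels, sample_rate_hz, bits_per_sample) of a WAV file.

    `source` may be a file name or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getsampwidth() * 8

def matches_protocol(source, sample_rate_hz=44100, bits_per_sample=16):
    """True when the file has the sampling rate and bit depth described above."""
    _, rate, bits = wav_format(source)
    return rate == sample_rate_hz and bits == bits_per_sample
```

Such a check would flag, for instance, a file that was accidentally exported at 48,000 samples per second.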

In study No III, voice recordings of a sustained phonation of the vowel sound /a/ were used. Voice samples obtained from each subject were recorded in a sound-proof booth simultaneously with two microphones, as shown in Fig. 4.3.3: an oral cardioid AKG Perception 220 microphone (AKG Acoustics, Vienna, Austria) and the internal microphone of a Samsung Galaxy Note 3 smart phone. Both microphones were placed side by side at a distance of 10.0 cm from the mouth (the subjects were seated with a head rest), keeping a microphone-to-mouth angle of about 90°. The subjects were asked to sustain the vowel /a/ at a comfortable pitch and loudness level for at least 5 seconds.

In study No IV voice recordings of a sustained phonation of the vowel sound /a/ (as in an English word “large”) were used. The subjects were asked to utter a speech sound: a sustained vowel /a/ at a comfortable pitch and loudness level for at least 5 seconds. Voice samples obtained from each subject were recorded in a soundproof booth with the help of a microphone: oral cardioid AKG Perception 220 (AKG Acoustics, Vienna, Austria) microphone placed at a 10.0 cm distance from the mouth (the subjects were seated with a head rest), keeping at about 90° microphone-to-mouth angle.


Fig. 4.3.3. Voice samples were recorded in a sound-proof booth simultaneously with two microphones: oral cardioid AKG Perception 220 and internal smart phone Samsung Galaxy Note 3 microphone.

4.4. Acoustic analysis

In studies No I and No III, segments of at least 5 seconds of the sustained vowel /a:/ from the separate voice samples of each recording session were analysed using Dr. Speech software (subprogram: voice assessment, version 3.0). Acoustic voice signal data were measured for F0, percent jitter and shimmer, normalized noise energy (NNE), signal-to-noise ratio (SNR), and harmonic-to-noise ratio (HNR). According to the results of our previous study, no statistically significant differences between the means of male and female acoustic voice parameters (except the mean F0) were revealed [136]. Therefore, in this study, we did not separate the parameters of acoustic voice analysis between males and females. However, the F0 parameter was analysed separately, taking the gender of the subjects into account.
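For orientation, the perturbation measures named above can be illustrated with a commonly used "local" definition: the mean absolute difference between consecutive cycles, divided by the mean value and expressed in percent. This is a sketch under that assumption; the exact formulas implemented in Dr. Speech are not given in the text and may differ, and the function names are hypothetical.

```python
def perturbation_percent(values):
    """Mean absolute cycle-to-cycle difference relative to the mean, in percent."""
    if len(values) < 2:
        raise ValueError("at least two cycles are needed")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value

def jitter_percent(periods_s):
    """Local jitter from consecutive glottal cycle durations (seconds)."""
    return perturbation_percent(periods_s)

def shimmer_percent(peak_amplitudes):
    """Local shimmer from consecutive cycle peak amplitudes."""
    return perturbation_percent(peak_amplitudes)

def mean_f0_hz(periods_s):
    """Mean fundamental frequency as the reciprocal of the mean period."""
    return len(periods_s) / sum(periods_s)
```

With cycle durations of 10.0, 10.2, 9.8 and 10.0 ms, this definition gives a mean F0 of 100 Hz and a jitter of about 2.67 %.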

In study No II, multiple feature sets containing 1051 features in total were used. Feature extraction was performed in Matlab. The features are described in detail elsewhere [11, 120, 137].

In study No IV, aiming to obtain a comprehensive description, each audio recording was represented by 14 feature subsets, resulting in a feature vector of 927 elements. Technical details of feature subsets 1–14 can be found in an article by Gelzinis et al [120].


4.5. Questionnaire data

There are several questionnaires related to voice symptoms in the scientific literature. In this study, in order to collect the demographic data, symptoms, complaints, etc. of our patients, we used the best-performing questions selected earlier by our team, as published by Bacauskiene et al in 2012 [138]. Additionally, each participant of the study (normal and pathological voice subgroups) filled in the GFI-LT questionnaire at baseline along with the voice recordings, at least 1 week before the treatment. The query data were collected from subject responses to the set of questions summarized in Table 4.5.1.

Table 4.5.1. Questionnaire questions used to collect patient data

Question content | Units (or scale) of measurement
1. Subject’s gender | {Man, woman}
2. Subject’s age | Discrete number
3. Average duration of intensive speech use | Hours / day
4. Average duration of intensive speech use | Days / week
5. Smoking | {Yes, no}
6. Smoking intensity | Cigarettes / day
7. Smoking history | Years
8. Maximum phonation time | Seconds
9. SSA of voice function quality | Visual analogue scale from 0 to 100
10. SSA of voice hoarseness | From 0 (no hoarseness) to 100 (severe hoarseness)
11. Voice handicap progressing | Grade from 1 to 4
12. SSA of daily experienced stress level | From 0 (no stress) to 100 (very much stress)
13. Frequency of singing | Grade from 1 to 5
14. Frequency of talking / singing in a smoke-filled room | Grade from 1 to 5
15. SSA of experienced discomfort due to voice disorder | From 0 (no discomfort) to 100 (huge discomfort)
16. SSA of “too weak voice” | From 0 (no) to 100 (very clear)
17. SSA of repetitive “loss of voice” | From 0 (no) to 100 (very clear)
18. SSA of reduced voice | From 0 (no) to 100 (very distinctly)
19. SSA of reduced ability to sing | From 0 (no) to 100 (very distinctly)
20. Frequency of voice cracks or aberrant voice | From 0 (no) to 100 (very often)
21. Level of vocal usage | Level from 1 to 4
22. Speaking took extra effort (G1) | From 0 (no problem) to 5 (severe problem)
23. Throat discomfort or pain after voice usage (G2) | From 0 (no problem) to 5 (severe problem)
24. Voice weakens while talking, vocal fatigue (G3) | From 0 (no problem) to 5 (severe problem)
25. Voice cracks or sounds different (G4) | From 0 (no problem) to 5 (severe problem)
26. Glottal function index [23, 24] (GFI = G1 + G2 + G3 + G4) | Grade from 0 to 20


4.6. Statistical evaluation, classifiers

In studies No I and No III, the statistical analysis was performed using IBM SPSS Statistics software for Windows (version 20.0, IBM Corporation, Armonk, NY). The data were presented as mean ± standard deviation (SD). The Student t test was used to test hypotheses about the equality of means. The size of the differences among the mean values of the groups was evaluated by calculating the type II error, β. The size of the difference was considered significant if β ≤ 0.2 (i.e., the power of the statistical test ≥ 0.8) at a type I error α = 0.05. Fisher discriminant analysis was performed in order to determine the limiting values of the acoustic voice parameters discriminating normal and pathological voice groups, and to select an optimum set of parameters for the classification task. The correct classification rate (CCR) was used to evaluate the feasibility of the acoustic voice parameters for classifying normal and pathological voice classes. The correlations among the acoustic voice parameters were evaluated using the Pearson correlation coefficient (r). The level of statistical significance for hypothesis testing was 0.05.
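Two of the quantities named above have simple closed forms, sketched here in Python for illustration. The analysis itself was carried out in SPSS; the function names below are hypothetical, and CCR is taken to mean the percentage of recordings assigned to their true class.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correct_classification_rate(true_labels, predicted_labels):
    """Percentage of cases whose predicted class equals the true class."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * hits / len(true_labels)
```

For example, if three of four recordings are assigned to the correct class, the CCR is 75 %.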

In studies No II, No III, and No IV, the random forest classifier was used. Random forest (RF) is a popular and efficient algorithm for classification and regression, based on ensemble methods. The core idea of RF is to combine many binary decision trees, built using different bootstrap samples of the original data set, to obtain an accurate predictor. Such a tree-based ensemble is known to be robust against over-fitting, and, as the number of trees increases, the generalization error is observed to converge to a limit.

Decision tree within RF is the classification and regression tree (CART). Given a training set Z, consisting of n observations and p features, RF is constructed in the following steps:

1. Choosing the forest size t as the number of trees to grow and the subspace size q ≤ p as the number of features to be provided for each node of a tree.

2. Taking a bootstrap sample of Z and randomly selecting q features.

3. Growing an unpruned CART using the bootstrap sample.

Steps 2–3 are repeated until the size of the forest reaches t. To classify an unseen observation x, each tree of the forest is provided with x and outputs a decision. The resulting votes for each class are counted, and the class that collects the most votes is considered the winner; that is, the decision is based on the majority voting scheme, as illustrated in Fig. 4.6.1.
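The construction and voting steps above can be sketched in plain Python. This is a didactic sketch only, not the implementation used in the studies: it assumes numeric features, uses the Gini index as the CART splitting criterion, draws q random features at every node, and requires class labels that are not tuples (strings, for example). All function names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Best (feature, threshold) by weighted Gini, or None if nothing improves."""
    best, best_score, n = None, gini(labels), len(rows)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score:
                best, best_score = (f, threshold), score
    return best

def grow_tree(rows, labels, q, rng):
    """Unpruned tree (step 3); q random features are tried at each node."""
    if len(set(labels)) == 1:
        return labels[0]
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), q))
    if split is None:
        return Counter(labels).most_common(1)[0][0]
    f, threshold = split
    left = [i for i, row in enumerate(rows) if row[f] <= threshold]
    right = [i for i, row in enumerate(rows) if row[f] > threshold]
    return (f, threshold,
            grow_tree([rows[i] for i in left], [labels[i] for i in left], q, rng),
            grow_tree([rows[i] for i in right], [labels[i] for i in right], q, rng))

def grow_forest(rows, labels, t, q, seed=0):
    """Steps 1-3: grow t trees, each on its own bootstrap sample (step 2)."""
    rng = random.Random(seed)
    n, forest = len(rows), []
    for _ in range(t):
        idx = [rng.randrange(n) for _ in range(n)]  # sampling with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], q, rng))
    return forest

def tree_predict(tree, x):
    """Follow the splits until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        f, threshold, left, right = tree
        tree = left if x[f] <= threshold else right
    return tree

def forest_predict(forest, x):
    """Majority vote over the decisions of all trees."""
    votes = Counter(tree_predict(tree, x) for tree in forest)
    return votes.most_common(1)[0][0]
```

An odd forest size t avoids tied votes in a two-class task such as normal versus pathological voice.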


Fig 4.6.1. A general Random forest architecture, where k stands for class label

A data proximity matrix derived from RF was used in this study for the exploration and visualization of data and decisions. To map data and decisions onto the 2D space, the t-distributed stochastic neighbour embedding (t-SNE) algorithm was used. The t-SNE algorithm often outperforms other state-of-the-art techniques for dimensionality reduction and data visualization [139].
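A common definition of the RF proximity between two observations is the fraction of trees in which both end up in the same leaf. The sketch below assumes this definition and assumes that the leaf reached by every observation in every tree is already known; how the proximities were computed and passed to t-SNE in this study is not detailed in the text, and the function name is hypothetical.

```python
def proximity_matrix(leaf_ids):
    """Pairwise RF proximities.

    leaf_ids[t][i] is the identifier of the leaf reached by observation i
    in tree t. The result is a symmetric n-by-n matrix with ones on the
    diagonal.
    """
    n_trees = len(leaf_ids)
    n = len(leaf_ids[0])
    counts = [[0] * n for _ in range(n)]
    for leaves in leaf_ids:
        for i in range(n):
            for j in range(n):
                if leaves[i] == leaves[j]:
                    counts[i][j] += 1
    return [[c / n_trees for c in row] for row in counts]
```

One possible way to obtain a dissimilarity for an embedding method is to use 1 minus the proximity; this is an assumption made here for illustration.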

We evaluated the performance of a classifier using the following measures: a) detection error trade-off (DET) curve and equal error rate (EER); b) receiver operating characteristic (ROC) curve and area under the curve (AUC). The DET, EER, ROC, and AUC measures were estimated using an interpolated version of the ROC obtained through the pool adjacent violators algorithm, namely the ROC convex hull method, available in the BOSARIS toolkit [140].
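To make these measures concrete, the sketch below computes a plain empirical ROC curve, its AUC by the trapezoidal rule, and an approximate EER. It is a simplification and does not reproduce the ROC convex hull interpolation of the BOSARIS toolkit; labels are assumed to be 1 for pathological and 0 for normal voices, higher scores are assumed to indicate pathology, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Empirical ROC as (false positive rate, true positive rate) pairs."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def equal_error_rate(points):
    """Error rate at the ROC point where miss and false-alarm rates are closest."""
    fpr, tpr = min(points, key=lambda p: abs((1.0 - p[1]) - p[0]))
    return ((1.0 - tpr) + fpr) / 2.0
```

With few test cases the empirical curve is a coarse staircase, which is one reason an interpolated ROC may be preferred in practice.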

The ease of use was evaluated taking the ISO-9241 standard into account, because it is impossible to evaluate the ease of use without taking into account the users’ understanding [141].

As mentioned before, user satisfaction is another important factor that greatly influences the success of software implementation. The developed software was evaluated according to the seven principles of the ISO 9241 standard:

1. Suitability for the task. Software is suitable for the task if the user can easily understand what tasks it can do.

2. Self-descriptiveness. This principle is evaluated by checking whether the software can be understood intuitively, with little or no additional information needed. It also requires that any possible usage mistake be followed by relevant information.

3. Controllability. Software controllability is achieved by creating a user interface that allows the task to be completed in one sequence of steps.

4. Conformity with user expectations. Software conforms to the user's expectations if it is consistent and complies with the characteristics of the user.

5. Error tolerance. A computer program is considered error tolerant if its usage requires no additional effort except in the event of obviously faulty usage.

6. Suitability for individualization. Software is suitable for individualization if it allows personal configuration for each user.

7. Suitability for learning. Software is suitable for learning if minimum effort for usage is required and help information is provided [138, 141].


5. RESULTS

5.1. Results study No I

(Analysis of oral and throat microphones using discriminant analysis) data

The mixed-gender database of voice recordings used in this study contained 157 digital voice recordings of sustained phonation of the vowel sound /a/. Demographic study data are presented in Table 5.1.1.

Table 5.1.1. Demographic data of study I

Diagnosis                                            Total number  Female    Male     Mean   SD
                                                     (n=157)       (n=102)   (n=55)
Normal voice                                         105           71        34       46.2   6.70
Nodules                                              7             7         0        25.4   6.00
Polyps                                               14            8         6        41.1   11.70
Carcinoma                                            6             0         6        62     7.00
Vocal fold polypoid hyperplasia (Mb. Reinke-Hajek)   9             7         2        50     7.10
Vocal fold papilloma                                 7             3         4        40     13.50
Other (cyst, granuloma, monochorditis)               9             6         3        45.7   8.10

SD – standard deviation.

The mean values and SD of the acoustic voice parameters obtained with both the oral and throat microphones in the total study group are presented in Table 5.1.2. Generally, no statistically significant differences (P > 0.05) were found between the oral and throat microphone measurements for any of the parameters reflecting frequency and amplitude perturbations of the voice signal. The only exceptions were the SNR and HNR parameters, which showed slight but statistically significant differences between the microphone measurements. However, these differences were only within the range of 5.64–5.78 %. The observed statistically significant differences in HNR and SNR between the two microphones could be due to the rather different frequency response curves of the microphones.

