CHAPTER 4.
THE ISLE CORPUS
4.1 Personal interest
Writing my thesis about language learners was an obvious choice for me, as I have been a second language learner myself for a long time. After I followed the course taught by Professor Belinda Crawford Camiciottoli, which introduced the notion of learner corpora, a world of curiosity opened up to me. My interest in this particular area of linguistics increased when I discovered that this kind of research has existed only since the late 1980s. I therefore hope that my contribution to such a recent and unsaturated field can add something new and be of further use in foreign language acquisition as well as in foreign language teaching.
I wanted my thesis to be not only a piece of linguistic research but also a personal investigation of my own second language learning. Being a bilingual speaker of Dutch and Italian, I am fascinated by both German and the Romance languages. I wished to discover more about both, preferably in contact with each other and in relation to the learning of English.
Traditionally, pronunciation errors have received less attention than morphological or grammatical errors in foreign language acquisition and teaching. Little research on pronunciation has addressed Italian and/or German students specifically, so this study may offer something new.
I decided to focus my thesis on the topic of pronunciation challenges in learning English among native speakers of Italian and German, and to explore this topic through learner corpora. Professor Belinda Crawford Camiciottoli, who is supervising my master's thesis, therefore suggested using the ISLE speech corpus. The ISLE corpus of non-native spoken English includes both Italian and German speakers, and is thus ideal for this project.
4.2 Procuring the ISLE speech corpus
Several steps were necessary to obtain access to the ISLE speech corpus, and Professor Crawford helped me with the procedures. She contacted Valérie Mapelli,1 an independent researcher and staff member of the Evaluations and Language resources Distribution Agency (ELDA). Valérie Mapelli, together with the team of the European Language Resources Association (ELRA) who created this corpus, made this excellent research resource available to me and my supervisor. I would therefore like to show my gratitude by devoting a few words to ELRA and ELDA.
ELDA was founded in 1995, in parallel to ELRA.2 ELRA is a non-profit organisation created with a twofold mission: to promote language resources for the Human Language Technologies field and to boost the evaluation of language engineering technologies. Its seat is in Luxembourg and its headquarters in Paris.
The term ‘language resource’ refers to a set of oral speech or written language data and descriptions in machine readable form, used for building, improving or evaluating natural language and speech systems. Furthermore, language resources can be used as fundamental resources for language studies. Language resources can be written and spoken corpora, terminology databases, speech collections, etc. Figure 4 illustrates the total amount of the products and services of the organisation.
1 For further reading, see Mapelli’s articles: https://www.researchgate.net/profile/Valerie_Mapelli
2 It is possible to find all kinds of additional information about ELRA (history, members, catalogues, etc.) on the association’s website: http://elra.info/en/about/
Figure 4. ELRA supply3
Sectors such as telecommunications, multilingual business, education, and translation can benefit significantly from language resources in various ways. They can reach a wider audience, reduce their costs, develop a better understanding of their customers, improve internal and external communication, and improve services, for example by organising documents or creating databases.
ELDA is the association’s operational body. It is in charge of developing and executing ELRA’s missions and of managing all the commercial tasks of the association. ELDA makes more than 1100 language resources available in the ELRA catalogue, which can be browsed online. These resources are available in a large number of languages and modalities. The quality of the language resources allows users to develop and/or evaluate a broad range of language technology systems.
The ISLE speech corpus can be found in the ELRA catalogue. In order to download a language resource from the catalogue, a completed and signed order form must first be returned by email to Valérie Mapelli. This order form is available for download on the ELRA website.4 On receipt of our order, Valérie Mapelli sent the End-User agreement, to be signed and returned to her.
On receipt of a copy of the agreement signed by Professor Crawford, Valérie Mapelli proceeded with the delivery through a download procedure, which is free of charge. To enable the download, she sent the access information, containing a link and a password valid for 30 days.5
4.3 The Interactive Spoken Language Education (ISLE) Project
A corpus of non-native speech data was collected by W. Menzel, E. Atwell, P. Bonaventura, D. Herron, P. Howarth, R. Morton, and C. Souter for the purpose of developing pronunciation training tools for second language learning. This corpus consists of almost 18 hours of spoken utterances recorded by intermediate-level German and Italian learners of English. The corpus is based on 250 utterances selected from typical second language learning exercises. It has been annotated at the word and the phone level, to highlight pronunciation errors such as phone realisation problems and misplaced word stress assignments. The data has been used to develop and evaluate language learning systems, which can be used to produce corrective feedback to a language learner.
4.3.1 The aim of the project
Originally, the ISLE project had the goal of integrating speech recognition technologies in order to improve the performance of computer-based English language learning systems. Speech recognition enables the recognition and translation of spoken language into text by computers (Granger, 2002). The field of ELT is showing an increasing awareness of the potential of speech and language technology (Atwell et al., 2003). Speech recognition makes it possible to correct the pronunciation errors that learners make when speaking.
4 http://catalog.elra.info/purchase_procedure.php
5 http://freelrs.elda.org//data/public/3e0263658bced260da5e596c0fb9d9a3.php?lang=en password: 9yGa4OHO5Wg3
In theory, the technologies used and developed are valid for any language pair, but ISLE focuses on German and Italian learners of English. According to Menzel et al. (2000), a corpus of non-native speech is needed in order to develop pronunciation training tools. They mention three reasons for this:
1. to train the parameters and rules used in the recognition and diagnosis systems;
2. to test the performance of the system on a known data set;
3. to evaluate the contribution of speaker adaptation for improving the reliability of the native British English recognizer. (p. 2)
In other words, to test the recognition and error detection capacities of the computer-based systems, non-native speech data have been collected. These data are necessary to capture typical pronunciation errors made by non-native speakers of English in controlled language learning situations.
4.3.2 The speakers
Non-native speech data were collected from 23 Italian and 23 German adult intermediate-level speakers of English. These 46 volunteer speakers are English language learners from different project sites in Germany and Italy. The initial intention was to balance the speakers for native language (German/Italian), sex, age, and proficiency level (Menzel et al., 2000). Unfortunately, this was only achieved to a limited degree, because of the small number of speakers. Table 3 gives an overview of the speaker sample. Proficiency ratings are based on a self-judgement of the speakers (Menzel et al., 2000).
L1        Sex        Proficiency level*     Total
          M    F     1    2    3    4
German    13   10    -    -    8    15     23
Italian   19   4     7    11   4    1      23
*1=Elementary English, 2=Intermediate English, 3=Upper-Intermediate English, 4=Advanced English
Table 3. Speaker sample (Menzel et al., 2000)
Furthermore, data from two native speakers of English (Atwell and Howarth) were collected for test calibration purposes (Atwell et al., 2003). Native speaker samples make it possible to compare correct native pronunciation with erroneous non-native pronunciation.
4.3.3 The linguistic material
The linguistic material included in ISLE is divided into seven blocks. Two main sets of data were collected from each speaker. The first portion of data (i.e., blocks A, B, and C) was used only to produce speaker-adapted non-native speech models for testing the adaptation of the speech recognition system. The linguistic material for this purpose was chosen from a non-fictional, autobiographical text describing the ascent of Mount Everest (Hunt, 1996). It was selected so that the speakers would not have to deal with reported speech or foreign words, which may cause them to alter their pronunciation. Speakers were asked to read aloud a passage of approximately 1300 words of the text (82 sentences).
The second portion of data (i.e., blocks D, E, F, and G) was collected with the aim of capturing typical pronunciation errors made by non-native speakers of English. The blocks are divided by exercise type: simple exercises, such as minimal pairs or multiple choice (block D), exercises entirely focused on reading (block E), and moderately complex description exercises (blocks F and G). The latter exercises are useful for measuring speakers’ pronunciation competence. They consist of approximately 1100 words (165 phrases). These exercises are provided in Appendix A. In selecting the exercises, the focus was placed on “problem phones, weak forms, words with potentially tricky stress patterns, and difficult consonant clusters” (Menzel et al., 2000, p. 3). According to Menzel et al. (2000), since the actual system was planned to have both simple and complex exercises, it was desirable to test the system on exercises of varying complexity. Table 4 provides an overview of the exercises the non-native language learners were given.
Block    Linguistic issue      Exercise type        Examples
A, B, C  Wide vocabulary       Adaptation /         “This is the story of how two men reached the top of Everest on the
         coverage              Reading              twenty-ninth of May nineteen fifty-three and came back safely to
                                                    their friends below.”
                                                    “When we came back to England, a group of students interviewed us.”
                                                    “The third difficulty is the climbing itself.”
D        Problem phones        Minimal pair         “I said white not bait.”
         Weak forms            Item selection /     “What can you see in the picture? a ginger biscuit - a singer
                               combination          singing - a man's finger - a bell ringer”
E        Stress                Reading              “A student visa permits them to stay longer.”
         Weak forms                                 “He has his own photographic studio.”
         Problem phones
         Consonant clusters
F        Weak forms            Description /        “I would like beef and for pudding I would like vanilla ice cream.”
         Problem phones        Item selection /
                               combination
G        Weak forms            Item selection /     “This summer I’d like to visit Rome for a few days.”
         Problem phones        combination
Table 4. Linguistic material in ISLE
A list of weak forms can be found within the corpus in the file “RecordingInstr.doc”, and the exercise sentences contain examples of all the major weak forms in English: a, am, an, and, are, as, at, be, been, but, can, does, for, from, had, has, have, he, her, his, must, she, some, than, that, the, them, there, to, us, was, were.
4.3.4 Recording procedures
Recording instructions for the speakers are provided by the corpus in the file “RecordingInstr.doc”. Speakers required between twenty minutes and one hour to record the complete set of phrases. The data were digitally recorded directly into WAV format.
Before beginning recording, speakers needed to complete an enrolment procedure. They were presented with an electronic form to collect demographic data: name, sex, age, country of origin, native language, and own English proficiency judgement (Menzel et al., 2000). The date and location of recording were also gathered. Furthermore, if a speaker needed a break, the session could be suspended and the rest of the sentences could be recorded later.
The phrases were recorded with high-quality headset microphones in noiseless environments (Atwell et al., 2003). In order to minimise boredom effects, the recording data were presented in different orders. The constraints were that the relatively long Everest text should appear neither first nor last, because it might put people off. Difficult or long blocks (A, B, C and E) were thus separated from the easier ones (D, F and G). Furthermore, the Everest text (blocks A, B and C) was distributed so that no two Everest blocks appeared in sequence. Each set begins with fifteen practice sentences, which also serve as instructions on how to use the recording tool.
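The ordering constraints described above can be made concrete with a short sketch. It assumes, purely for illustration, that a session is one ordering of the seven blocks A to G and that only the two explicit constraints apply (no Everest block first or last, and no two Everest blocks in sequence). The names is_valid and valid_orders are hypothetical and are not part of the ISLE tooling; the original recording tool may have applied further criteria.

```python
from itertools import permutations

BLOCKS = "ABCDEFG"
EVEREST = set("ABC")  # blocks taken from the Everest text


def is_valid(order):
    """Check the two ordering constraints stated in the text."""
    # The Everest text should appear neither first nor last.
    if order[0] in EVEREST or order[-1] in EVEREST:
        return False
    # No two Everest blocks may appear in sequence.
    return not any(a in EVEREST and b in EVEREST
                   for a, b in zip(order, order[1:]))


def valid_orders():
    """Enumerate every block order that satisfies both constraints."""
    return ["".join(p) for p in permutations(BLOCKS) if is_valid(p)]


if __name__ == "__main__":
    orders = valid_orders()
    print(len(orders), "admissible orders, for example", orders[0])
```

Under these two constraints alone, the Everest blocks can only occupy the second, fourth and sixth positions, which leaves 144 admissible orders.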
After the data had been collected, the corpus was carefully annotated. The first portion of data, i.e., the adaptation data containing a large vocabulary, only needed to be verified at the word level. This part was therefore not annotated at the phone level; only word-level errors were transcribed. The second part, which focuses on learners’ pronunciation errors in the utterances, was instead annotated at the phone level. The transcription and annotation of the data were carried out in a sequence of partly automated steps. Atwell et al. (2003) state that:
The annotation contains a transcription of how the utterance was spoken by the speaker in relation to a reference transcription containing a canonical native pronunciation. (p. 5)
The phone-level transcriptions of each utterance were produced automatically using a British English recogniser, namely the “UK English speech-recogniser” (Atwell et al., 2003, p. 5). Phone annotations were then added. For this task, the chosen phone set was “Entropic’s UK English phone set” (Menzel et al., 2000). Atwell et al. (2003) argue that this phone set, compared to the International Phonetic Alphabet (IPA) provided in Appendix B, was simpler to adopt as it does not require special fonts for display and printing. I believe that, from a linguistic perspective, IPA labelling is preferable, and I will therefore adopt IPA for this analysis and transcription. Table 5 shows Entropic’s UK English phone set, as provided in the appendix of Menzel et al. (2000), supplemented with the corresponding IPA symbols.
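The correspondence between the two symbol sets can be expressed as a simple lookup table. The sketch below copies the symbol pairs from Table 5 and converts a sequence of phone labels into an IPA string. It assumes that the labels are written exactly as printed in Table 5 (the corpus files themselves may use different casing or formatting), and the names ENTROPIC_TO_IPA, SILENCE and to_ipa are hypothetical helpers introduced only for illustration.

```python
# Symbol pairs copied from Table 5 (phone set label -> IPA symbol).
ENTROPIC_TO_IPA = {
    # Vowels
    "Aa": "ɑ", "Ae": "æ", "Ah": "ʌ", "Ao": "ɔ:", "Aw": "aʊ", "Ax": "ə",
    "Ay": "aɪ", "Eh": "e", "Er": "ɜ:", "Ey": "eɪ", "Ih": "ɪ", "Iy": "i:",
    "Oh": "ɒ", "Ow": "əʊ", "Oy": "ɔɪ", "Uh": "ʊ", "Uw": "u:",
    # Semi-vowels (as grouped in Table 5)
    "L": "l", "R": "r", "W": "w", "Y": "j", "Hh": "h",
    # Plosives
    "B": "b", "D": "d", "G": "g", "K": "k", "P": "p", "T": "t",
    # Fricatives
    "Dh": "ð", "Th": "θ", "F": "f", "V": "v", "S": "s", "Sh": "ʃ",
    "Z": "z", "Zh": "ʒ",
    # Affricates
    "Ch": "tʃ", "Jh": "dʒ",
    # Nasals
    "M": "m", "N": "n", "Ng": "ŋ",
}

# Silence markers have no IPA counterpart in Table 5 and are skipped.
SILENCE = {"Sil", "Sp"}


def to_ipa(phones):
    """Convert a list of phone labels into one IPA string."""
    # An unknown label raises KeyError, so typos do not pass silently.
    return "".join(ENTROPIC_TO_IPA[p] for p in phones if p not in SILENCE)


if __name__ == "__main__":
    print(to_ipa(["Th", "Ih", "Ng"]))  # the Table 5 example word "thing"
```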
Symbol  IPA   Example       Symbol  IPA   Example
Vowels                      Plosives
Aa      ɑ     balm          B       b     bet
Ae      æ     bat           D       d     debt
Ah      ʌ     but           G       g     get
Ao      ɔ:    bought        K       k     cat
Aw      aʊ    bout          P       p     pet
Ax      ə     about         T       t     tat
Ay      aɪ    bite          Fricatives
Eh      e     bet           Dh      ð     that
Er      ɜ:    bird          Th      θ     thin
Ey      eɪ    bait          F       f     fan
Ih      ɪ     bit           V       v     van
Iy      i:    beet          S       s     sue
Oh      ɒ     box           Sh      ʃ     shoe
Ow      əʊ    boat          Z       z     zoo
Oy      ɔɪ    boy           Zh      ʒ     measure
Uh      ʊ     book          Affricates
Uw      u:    boot          Ch      tʃ    cheap
Semi-Vowels                 Jh      dʒ    jeep
L       l     led           Nasals
R       r     red           M       m     met
W       w     wed           N       n     net
Y       j     yet           Ng      ŋ     thing
Hh      h     hat           Silence
                            Sil           silence
                            Sp            short pause
Table 5. Entropic’s UK English phone set supplemented with the IPA symbol set
After the utterances had been transcribed automatically, manual phone-level annotations were added. This was done by a team of six linguists from the Language Unit and Linguistics Department at Leeds University (Menzel et al., 2000). They were asked to correct the automatically annotated phone sequences. They also added an overall proficiency rating for each speaker (Atwell et al., 2003). The annotators needed some training to familiarise themselves with the UK phone set, which was used instead of the IPA symbol set. After practice, the annotators could finish the work for each speaker in 5-6 hours. In total, the annotation took approximately 300 hours (Menzel et al., 2000, p. 4).
The annotators edited the annotations attached to the waveform whenever the speaker had said something other than the canonical pronunciation provided in the reference transcription. Such changes consist of “deletions, insertions and substitutions of phones or a stress shift” (Atwell et al., 2003, p. 8).
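The three phone-level change types can be illustrated by aligning a canonical phone sequence with the sequence a speaker actually produced. The following is a minimal sketch using Python's standard difflib module, not the procedure used by the ISLE annotators, whose corrections were made by hand. The function name phone_edits is hypothetical, the example pronunciation is invented rather than taken from the corpus, and stress shifts are not covered.

```python
from difflib import SequenceMatcher

# difflib's opcode names mapped to the terms used in the annotation scheme.
LABELS = {"replace": "substitution", "delete": "deletion", "insert": "insertion"}


def phone_edits(canonical, realised):
    """List the differences between two phone sequences.

    Each result is (change type, canonical phones, realised phones).
    """
    matcher = SequenceMatcher(a=canonical, b=realised, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((LABELS[tag], canonical[i1:i2], realised[j1:j2]))
    return edits


if __name__ == "__main__":
    # Invented example: "thing" realised with [t] for [θ] and an added [g].
    for edit in phone_edits(["θ", "ɪ", "ŋ"], ["t", "ɪ", "ŋ", "g"]):
        print(edit)
```

An automatic alignment of this kind only approximates a human judgement: when several phones differ in a row, the split into substitutions, deletions and insertions is not unique.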
4.4 Analytical procedure
To analyse the pronunciation errors of German and Italian learners of English, theoretical knowledge of error interpretation, second language learning and learners, and learner corpora was needed. My first step was therefore to read several studies, papers, and reviews on these subjects in order to familiarise myself with them. I then wrote the theoretical chapters, which made it possible to introduce the ISLE corpus and its analysis.
When the first three chapters had taken a reasonable form, I started to explore the ISLE corpus. My very first contact with the corpus was confusing, as it contains millions of files, and it was not easy to find my way through it. At first, I opened files at random, without really understanding what type of information they contained. Some contained text and others contained audio data. At that time, I had no idea where to start the analysis. Fortunately, I discovered the files “README.TXT” and “RecordingInstr.doc”. Thanks to the information provided by these descriptive files, the organisation of the corpus became clearer to me. I was now able to differentiate between the files belonging to the German speakers’ recording sessions and those belonging to the Italian speakers. The corpus presents a separate folder for each reader, listed by session number. Each of these folders contains an index of the original prompts to be read and audio files containing the speaker’s recordings. Each of the 165 sentences is recorded in a separate audio file.
Then, I had to decide which words to transcribe. This decision was influenced by the statistics provided by the corpus in the file “STATS”. This file contains lists of the most error-prone words and lists of words with erroneous pronunciation for both the Italian and the German speakers. Following these statistics, I started to search for the audio files containing mispronounced words. However, before transcribing the errors, I read some studies on English, German, and Italian phonology (Antonsen, 2007; Krämer, 2009). I decided to use the IPA for the transcription of the erroneously pronounced words.
No single language contains all the phones of the IPA. Therefore, I decided to select only the vowel and consonant phones that occur in English, Italian, and German. Table 6 shows the IPA consonant symbols that occur in the languages of interest here.
                     Bilabial     Labiodental  Dental  Alveolar     Postalveolar       Palatal  Velar        Glottal
Nasal                m m m        ɱ                    n n n                           ɲ        ŋ ŋ ŋ
Plosive              p p p b b b                       t t t d d d                              k k k g g g
Fricative                         f f f v v v  θ ð     s s s z z z  ʃ ʃ ʃ ʒ ʒ          ç        x            h h
Affricate                         pf                   ts ts dz     tʃ tʃ tʃ dʒ dʒ dʒ
Trill                                                  r r r
Approximant                                                                            j j j    w w
Lateral approximant                                    l l l                           ʎ
English consonants   Italian consonants   German consonants
Table 6. Selected IPA consonant symbols
For every language, I use a distinct colour to avoid confusion. Figure 5 illustrates the selected IPA vowel symbols.
Figure 5. Selected IPA vowel symbols
After this preliminary preparation, I was ready to start transcribing. I started with the Italian speakers’ pronunciation errors and continued with the German speakers’ errors. The transcription task was highly time-consuming, particularly for the Italian speakers, who made several different types of pronunciation errors. Sometimes a single word was mispronounced in nine different ways by the Italian speakers. The German speakers made fewer errors per word, which made their pronunciation errors somewhat easier to transcribe.
After transcribing the most frequent pronunciation errors of both the Italian and the German speakers in the ISLE corpus, I analysed these errors. This analysis is the focus of the next chapter.