
Department of Engineering
Doctoral Programme in "Cyber Physical Systems", XXXIII Cycle

Audio Recognition Techniques: Signal Processing Approaches with Secure Cloud Storage

Doctoral Thesis of: Murtadha Arif Bin Sahbudin
Supervisor: Prof. Marco Scarpa
The Chair of the Doctoral Program: Prof. Antonio Puliafito

Declaration of Authorship

A thesis submitted in fulfillment of the requirements for the Doctor of Philosophy

I hereby certify that the thesis I am submitting is entirely my own original work except where otherwise indicated. I am aware of the University’s regulations concerning plagiarism, including those regulations concerning disciplinary actions that may result from plagiarism. Any use of the works of any other author, in any form, is properly acknowledged at their point of use.

Murtadha Arif Bin Sahbudin
Messina, 10 November 2020

Acknowledgment

A special thanks to Prof. Marco Lucio Scarpa for being a great mentor and supervisor; I am truly grateful for his advice, guidance, patience, and insight throughout this project. I wish to express my deepest gratitude to Dr. Salvatore Serrano, who has dedicated his time and effort to providing countless assistance. Thanks also to Prof. Antonio Puliafito for providing a wonderful opportunity to do research at the University of Messina.

I extend my gratitude to all the professors and team members at MDSLab, Department of Engineering, University of Messina. I also had the great pleasure of working with the company's expert collaborators. I would also like to acknowledge the invaluable assistance provided to me by Universiti Sains Islam Malaysia during my research internship period.

I am indebted to both of my parents, Prof. Sahbudin Shaari and Siti Zaharah Shaikh Muksin, who have always been there to support me from the start. My sisters, Dr. Ilfita Kamaliah, Dr. Siti Haura, and Dr. Siti Humaira, are a huge inspiration for me in every single way. I am also grateful to Nor Afifa for being a good friend and a mother to my daughter.

Finally, to my beautiful daughter Ilhan Safia, I love you always.

Murtadha Arif Bin Sahbudin


Abstract

In this thesis, we address an audio identification task through audio signal processing for fingerprint extraction, emphasizing the complex task of designing a highly robust system for signals from Frequency Modulation (FM) radio broadcasting. We create a system capable of retrieving analog signals from the radio channel with the proposed Internet of Things (IoT) devices and Application Programming Interface (API), in conjunction with creating models of database clustering. To complement the rich number of suggested methods and research in this field, we propose the design of yet another fingerprinting method.

The challenges and research problems of an audio recognition system in regular use span many different aspects. Significant aspects are near similarity of distinct original audio, invariance to noise and spectral or temporal distortions, the minimum length of song track needed for identification, retrieval speed, and computing load.

The study proposes a novel, efficient, highly accurate, and precise fingerprinting through the Short Time Power Spectral Density (ST-PSD) method. We propose matching features using an efficient Hamming distance search on binary fingerprints, and we subsequently integrate a verification stage for match hypotheses to maintain high precision and specificity on challenging datasets. We gradually refine this method from its early concept by introducing new components such as the Mel frequency filter bank and a progressive probability evaluation score.

Our proposed ST-PSD based fingerprint extraction technique and its improvements can recognize a piece of music with an accuracy close to 100%. Even with white noise at 5 dB, 10 dB, 15 dB, and 20 dB in the sample query, it still outperformed other established methods. Moreover, an API integrated into a smartphone app is also included in the research.

weaknesses of the proposed systems. We make use of this dataset in this thesis for an extensive evaluation of our method. Finally, we show the possibility of building a sequence detection program on top of the fingerprinter so that long query recordings can be monitored for either interactive analysis or fully automated reporting of results.

Finally, this research introduces a database storage framework that addresses data confidentiality problems when multi-cloud storage services are used. The evaluation results show how practical the approach is in real-time usage. Furthermore, the framework is intended to be used for fingerprint collections, or even to provide a new platform for users to choose their preferred multi-cloud storage services, such as Google Drive, Dropbox, and OpenStack, combining flexibility and security at the same time.


List of Publication Contribution

The publications listed below cover three project areas pursued during the doctoral research: Song Recognition (major course), Secure Storage Multi-Cloud Environment (minor course), and Location-Based Encryption (external). This thesis presents only the major and minor courses.

Song Recognition:

Sahbudin, M. A. B., Scarpa, M., & Serrano, S. (2019a). MongoDB clustering using k-means for real-time song recognition. In 2019 International Conference on Computing, Networking and Communications (ICNC). IEEE. https://doi.org/10.1109/ICCNC.2019.8685489

Sahbudin, M. A. B., Chaouch, C., Scarpa, M., & Serrano, S. (2019b). IoT based song recognition for FM radio station broadcasting. In 2019 7th International Conference on Information and Communication Technology (ICoICT). IEEE. https://doi.org/10.1109/ICoICT.2019.8835190

Sahbudin, M. A. B., Chaouch, C., Scarpa, M., & Serrano, S. (2020a). Audio fingerprint based on power spectral density and hamming distance measure. Journal of Advanced Research in Dynamical and Control Systems, 12(04-Special Issue), 1533–1544. https://doi.org/10.5373/JARDCS/V12SP4/20201633

Sahbudin, M. A. B., Chaouch, C., Scarpa, M., & Serrano, S. (2020b). Mobile application-programming interface (API) for song recognition systems. In review for journal publication.

Chaouch, C., Sahbudin, M. A. B., Scarpa, M., & Serrano, S. (2020). Audio fingerprint database structure using k-modes clustering. Journal of Advanced Research in Dynamical and Control Systems, 12(04-Special Issue), 1545–1554. https://doi.org/10.5373/JARDCS/V12SP4/20201634

Serrano, S., Sahbudin, M. A. B., Chaouch, C., & Scarpa, M. (2020). A new effective method of fingerprint generation for song recognition. In review for journal publication.

Secure Storage Multi-Cloud Environment:

Sahbudin, M. A. B., Di Pietro, R., & Scarpa, M. (2019c). A web client secure storage approach in multi-cloud environment. In 2019 4th International Conference on Computing, Communications and Security (ICCCS). IEEE. https://doi.org/10.1109/CCCS.2019.8888062

Location-Based Encryption:

Sahbudin, M. A. B., Ali Pitchay, S., & Scarpa, M. (2020c). Geo-COVID: Movement monitoring based on geo-fence framework for COVID-19 pandemic crisis. Advances in Mathematics: Scientific Journal, 9(9), 7385–7395. https://doi.org/10.37418/amsj.9.9.85


Table of Contents

Declaration of Authorship
Acknowledgment
Abstract
List of Publication Contribution
List of Abbreviation
List of Figures
List of Tables

1 Introduction
   1.1 Motivation
   1.2 Challenges
   1.3 Legacy Song Recognition System
   1.4 Research Problem
   1.5 Contributions
   1.6 Music Information Retrieval
   1.7 Audio Fingerprinting
   1.8 Song Recognition System Architecture
   1.9 Performance Measures of Song Recognition
   1.10 Organization of Thesis

2 Literature Overview
   2.1 Music Background
   2.2 Audio Representation
      2.2.1 Digital Audio
      2.2.2 Music Audio File Formats
   2.3 Audio Feature Extraction
   2.4 Audio Signal Processing
      2.4.2 Power Spectral Density
      2.4.3 Mel-scale Filterbanks
   2.5 Hamming Distance Measure
   2.6 Literature on Music Information Retrieval
   2.7 Chapter Summary

3 Real-Time Song Recognition and Clustered Database Technique
   3.1 Chapter Overview
   3.2 Research Background and Related Works
      3.2.1 Song Fingerprint Database
      3.2.2 Recognition Approach in Real-Time Broadcast Stream
   3.3 Fingerprints Distribution Analysis
   3.4 Proposed Method: K-means Clustering in MongoDB Database
      3.4.1 Stepwise Implementation of K-means
         3.4.1.1 Step 1: Batch Data Processing
         3.4.1.2 Step 2: Building Cluster Segmentation in MongoDB
   3.5 Song Recognition and Information Retrieval
      3.5.1 Real-Time Slide Window
      3.5.2 Fingerprint Cluster Identification and Recognition
      3.5.3 Candidates Scoring
   3.6 Evaluation and Performance
      3.6.1 Recognition Accuracy
      3.6.2 Recognition Speed
         3.6.2.1 Windows Performance
         3.6.2.2 Overall Performance
      3.6.3 Query Optimization
   3.7 Chapter Summary

4 Distributed IoT Device for FM Audio Capture
   4.1 Chapter Overview
   4.2 Research Background and Related Works
      4.3.1 IoT Device Design
      4.3.2 Communication Protocol
      4.3.3 Recognition Server
      4.3.4 K-Means Clustered Fingerprint Database
   4.4 Song Recognition
      4.4.1 Real-Time Sliding Window
      4.4.2 Candidates Scoring
   4.5 Evaluation and Performance
      4.5.1 Recognition Accuracy
      4.5.2 Recognition Speed
   4.6 Chapter Summary

5 K-modes Clustering Model for Audio Fingerprint Database
   5.1 Research Background and Related Works
   5.2 Proposed Method: Song Recognition with K-Modes Clustering
      5.2.1 Fingerprints Extraction
         5.2.1.1 Audio Pre-processing
         5.2.1.2 Framing and Short-Time Power Spectral Density Evaluation
         5.2.1.3 Binary Linekeys Extraction
         5.2.1.4 Linekeys Windowing
      5.2.2 K-Modes Training & Modeling
         5.2.2.1 Batch Data Processing
      5.2.3 Building Cluster Segmentation in MongoDB
      5.2.4 Fingerprint Recognition
   5.3 Performance Evaluation
      5.3.1 System Parameters
      5.3.2 Evaluation Steps
   5.4 Chapter Summary

6 Audio Signal Processing: Power Spectral Density Fingerprint
   6.1 Chapter Overview
   6.2 Research Background and Related Works
   6.3 Proposed Method: Power Spectral Density Song Fingerprinting
      6.3.1 Audio Stream Signal Buffer
      6.3.2 Short Time Power Spectral Density Estimation
      6.3.3 ST-PSD Binary Linekeys
      6.3.4 Collection and Fingerprint Generation
   6.4 Improved Method: Mel-PSD Audio Fingerprinting
      6.4.1 ST-PSD on Mel Frequency Scale
      6.4.2 Improved Binary Fingerprints
   6.5 Song Recognition by Hamming Distance Measure
   6.6 Performance Evaluation: ST-PSD Fingerprint
      6.6.1 Candidates Scoring
      6.6.2 Evaluation Steps
   6.7 Performance Evaluation: Mel-PSD Fingerprint
      6.7.1 Comparative Evaluation
      6.7.2 White Noise Query Evaluation
   6.8 Chapter Summary

7 Development of Mobile Application-Programming Interface (API) for Song Recognition Systems
   7.1 Chapter Overview
   7.2 Proposed Method: Mobile Music Recognition App
      7.2.1 Audio Recording
   7.3 API Design
      7.3.1 Fingerprint Packets Structure
      7.3.2 Web Server and API Protocol
   7.4 Testing and Evaluation
   7.5 Chapter Summary

8 Secure Multi-Cloud Storage Framework
   8.1 Chapter Overview
   8.3 Related Works
      8.3.1 SSME Architecture
   8.4 Proposed Framework: Web Client Secure Storage
      8.4.1 Web Client Service Application
      8.4.2 Configurations Management
      8.4.3 File Upload
      8.4.4 File Download
      8.4.5 SSME Server
      8.4.6 Trusted Cloud Service (TCS)
      8.4.7 Confidentiality and Integrity
      8.4.8 Multi-Cloud Environment
   8.5 Performance Evaluation
   8.6 Chapter Summary

9 Conclusion, Discussion and Future Works

List of Abbreviation

AAC  Advanced Audio Coding
AES  Advanced Encryption Standard
AIFF  Audio Interchange File Format
AM  Amplitude Modulation
API  Application Programming Interface
APK  Android Application Package
ARM  Auto-Regressive Modeling
ASE  Amplitude Spectrum Envelope
AU  Audio UNIX
AVD  Android Virtual Device
BPM  Beats Per Minute
CD  Compact Disc
CF  Crest Factor
CPU  Central Processing Unit
CS  Candidate Song
cURL  Client URL
DB  Database
dB  Decibel
DBMS  Database Management System
DDR  Double Data Rate
DFT  Discrete Fourier Transform
DVD  Digital Optical Disc
DWCH  Daubechies Wavelet Coefficient Histogram
DWT  Discrete Wavelet Transform
ECIES  Elliptic Curve Integrated Encryption Scheme
FCC  Fourier Cepstrum Coefficient
FFMAP  Simple Frequency Map
FFT  Fast Fourier Transform
FLAC  Free Lossless Audio Codec
FM  Frequency Modulation
FN  False Negative
FP  False Positive
Fs  Sampling Rate
GB  Gigabyte
GET  Request data from a specified resource
GUI  Graphical User Interface
H  Hamming Matrix
HD  High-Definition Video
HDD  Hard Disk Drive
HTML  Hypertext Markup Language
HTTP  Hypertext Transfer Protocol
Hz  Hertz
ID  Identification
IDE  Integrated Development Environment
IFPI  International Federation of the Phonographic Industry
IoT  Internet of Things
IP  Internet Protocol
IS  Identity Service
JAR  Java ARchive
JSON  JavaScript Object Notation
kHz  Kilohertz
LPCC  Linear Predictive Cepstral Coefficients
LPCM  Linear Pulse Code Modulation
MATLAB  Matrix Laboratory (MathWorks)
MB  Megabyte
Mbps  Megabits per second
Mel-PSD  Mel-frequency & Power Spectral Density
MEM  Measurement unit for the number of memory accesses
MFCC  Mel-Frequency Cepstral Coefficient
MHz  Megahertz
Min.  Minimum
MIR  Music Information Retrieval
MongoDB  Cross-platform document-oriented database program
MONO  Audio recorded and reproduced from a single source
MP3  MPEG Audio Layer-3
MPEG  Moving Picture Experts Group
ms  Millisecond
MWT  Morlet Wavelet Transforms
NOOBS  Raspberry Pi OS installer
NoSQL  Non-structured Query Language
OSCC  Octave Scale Cepstral Coefficient
OSS  Object Storage Service
PC  Personal Computer
PCM  Pulse-Code Modulation
PHP  Hypertext Preprocessor
POST  Send data to a server in an HTTP request
PSD  Power Spectral Density
RAM  Random-Access Memory
RAW  RAW Audio
REST  Representational State Transfer
RMS  Root Mean Square energy
RTL-DSR  Realtek RTL2832U Software-Defined Radio
RTSP  Real Time Streaming Protocol
SAH  Spatial Adaptive Hashing
SB  Spectral Bandwidth
SC  Spectral Centroid
SCF  Spectral Crest Factor
SD  Secure Digital
SDK  Software Development Kit
Sec  Second
SF  Spectral Flux
SFM  Spectral Flatness Measure
SM  Statistical Moments
SPSF  Stereo Panning Spectrum Features
SR  Spectral Roll-off
SSD  Solid State Drive
SSME  Secure Storage Multi-Cloud Environment
ST-PSD  Short-Time Power Spectral Density
SWIFT  OpenStack Object Storage
TCP  Transmission Control Protocol
TCS  Trusted Cloud Service
TN  True Negative
TP  True Positive
TPH  Target Hashes per second
TV  Television
UML  Unified Modeling Language
URL  Uniform Resource Locator
USB  Universal Serial Bus
USD  United States Dollar
VoIP  Voice over IP
VPN  Virtual Private Network
WAV  Waveform Audio File Format
WMA  Windows Media Audio
XML  EXtensible Markup Language
ZCR  Zero-Crossing Rate

List of Figures

1.1 2.4 · 10^9 fingerprint distribution
1.2 A Sample Audio Wave to Linekeys Conversion
1.3 Overview of General Song Recognition System Architecture
1.4 Song Recognition Framework
2.1 Sampled Audio Waveform from .WAV format
2.2 Overview of Audio Features Level
2.3 First 1000 milliseconds of the audio signal of song track "ARIZONA-Oceans Away" represented in MATLAB
2.4 Signal in the time and frequency domain
2.5 Real part of X(k) is even
2.6 First 1000 milliseconds Fast Fourier Transform of the audio signal of song track "ARIZONA-Oceans Away" represented in MATLAB
2.7 Power spectral density (PSD) estimates equivalent to the periodogram using FFT
2.8 Filter bank on a Mel scale
3.1 2.4 billion individual fingerprint distribution in database collection
3.2 K-means clustering implementation overview
3.3 Fingerprint Windows Slides Recognition
3.4 Windows Sizes Performance Comparison
3.5 Speed comparison on different window sizes and non-cluster
3.6 Hybrid Physical Disk Management
4.1 Overview of IoT based Song Recognition Framework
4.2 Raspberry Pi with RTL-DSR module hardware
4.3 Overview of framework of IoT device protocol to Recognition Server
4.4 K-Means Clustered Linekey Database Structure
4.5 Sliding Linekey Window Recognition
4.6 Precision & Recall
4.7 Elapsed Audio Recognition Time Methods Comparison
4.9 Average Time Comparison on Different Methods
5.1 Song Recognition with K-modes Clustering Framework
5.2 Fingerprint Extraction Procedure
5.3 K-modes clustering implementation overview
5.4 Recognition accuracy for different sizes of the fingerprint: (a) 10 linekeys, (b) 20 linekeys, (c) 30 linekeys, (d) 40 linekeys, (e) 50 linekeys
5.5 Recognition rate for different fingerprint sizes using the best combination of parameters α and Th
6.1 Overview of PSD Audio Fingerprint
6.2 Audio Signal Windowing to extract linekeys
6.3 Windowing to apply Welch PSD estimation
6.4 ST-PSD for several windows
6.5 Frequency variant threshold
6.6 Binary Linekeys Extraction
6.7 Binary Linekeys Extraction
6.8 Fingerprint Window Recognition
6.9 Recognition accuracy for different sizes of the fingerprint: (a) 10 linekeys, (b) 20 linekeys, (c) 30 linekeys, (d) 40 linekeys, (e) 50 linekeys
6.10 Recognition rate for different fingerprint sizes using the best combination of parameters α and Td
6.11 Approximation of the probability density function of distance ∆ for the best combinations of parameters: (a) W = 10, α = 0.2, Td = 40; (b) W = 20, α = 0, Td = 20; (c) W = 30, α = 0.1, Td = 40; (d) W = 40, α = 0.1, Td = 20
6.12 Accuracy varying the T∆ threshold for the best combinations of parameters: W = 10, α = 0.2, Td = 40; W = 20, α = 0, Td = 20; W = 30, α = 0.1, Td = 40; W = 40, α = 0.1, Td = 20
6.13 Accuracy comparison between PSD-Hamming and Landmark-Based
6.14 Number of unique linekeys extracted varying the size of the corpus
6.15 Accuracy comparison between proposed approach and Landmark-Based
6.16 Accuracy comparison between proposed approach and Landmark-Based introducing 5 dB White Noise Query
6.17 Accuracy comparison between proposed approach and Landmark-Based introducing 10 dB White Noise Query
6.18 Accuracy comparison between proposed approach and Landmark-Based introducing 15 dB White Noise Query
6.19 Accuracy comparison between proposed approach and Landmark-Based introducing 20 dB White Noise Query
7.1 Client and server protocol interface
7.2 Class MainActivity for RecordAudio
7.3 Android Java classes function cycle
7.4 User interface for Start and Stop recording Audio
7.5 Web server UML classes
7.6 Initiation of server listener
7.7 Listening for incoming HTTP requests
7.8 Successful HTTP requests
8.1 Overview of Web Client Secure Storage
8.2 Main interface for Upload and Download Selection
8.3 Schema of interaction of PHP Web application files
8.4 Upload File
8.5 Download File
8.6 Performance by Upload and Download Time
8.7 Download Time Overhead

List of Tables

1.1 Songs fingerprints datasets
2.1 Key elements of music
2.2 Sample rate in different types of audio
2.3 Types of common audio files
2.4 Types of low-level features
2.5 First 1000 milliseconds of fingerprints (lines per 100 ms) of song track "ARIZONA-Oceans Away.wav" represented by low-level feature extraction techniques
2.6 Music Information Retrieval Systems - Classification and Applications
2.7 Music Information Retrieval Systems - Feature Extraction and Similarity
3.1 Songs fingerprints datasets
3.2 Precision, Recall and Airtime Accuracy Results
5.1 Binary Linekeys Categorical Attributes
5.2 Parameters used for performance evaluation
6.1 Parameters used for performance evaluation

Chapter 1

Introduction

This study presents our research findings on song recognition in a radio broadcast environment, reporting extensive tests, explanations, and evaluations. The study outlines our progressive attempts to solve problems in the song recognition field, refined into a realistic, successful framework that can satisfy the requirements imposed by different application scenarios.

1.1 Motivation

An audio representation includes a recording of a musical piece's output. Digital sound recordings are based on sampling the analog audio signal. Sampling is achieved by capturing the signal amplitude at a specified sampling rate and storing these samples in binary format. In terms of recording quality, the sampling rate and the bit depth (the number of bits used to store each sample) are the two most important variables. Audio CDs use a 44.1 kHz sampling rate, or 44,100 samples per second, with 16 bits per sample, which is essentially the industry standard.
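As a quick worked example of how these two variables determine data volume, the sketch below computes the raw data rate of CD-quality audio; the stereo channel count is an assumption added for illustration:

```python
# Raw data rate of CD-quality audio (44.1 kHz sampling rate, 16-bit samples).
sample_rate_hz = 44_100       # samples per second
bits_per_sample = 16
channels = 2                  # assumption: standard stereo CD audio

bits_per_second = sample_rate_hz * bits_per_sample * channels
print(f"{bits_per_second / 1e6:.3f} Mbit/s")                  # ~1.411 Mbit/s
print(f"{bits_per_second / 8 * 60 / 1e6:.1f} MB per minute")  # ~10.6 MB/min
```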

Common audio streaming sites host millions of audio files, and thousands of broadcast stations transmit audio content at any given time. The ever-increasing amount of audio material, whether online or on personal devices, generates tremendous interest in the ability to recognize audio material. This can be achieved by song recognition, an identification technology that seeks to work at the highest degree of accuracy and specificity.

Song recognition identifies a song segment from either a digital or an analog audio source. Different applications of such a system include song rankings based on radio/TV broadcasting or streaming, copyright protection for songs, and automatic recognition of songs that a person wishes to identify while listening to them. Important information such as song title, artist name, and album title can be provided instantly. To create detailed lists of the particular content played at any given time, the industry uses audio fingerprinting systems to monitor radio and TV broadcast networks. Royalty processing relies on automatic fingerprinting devices, as broadcasters are required to produce accurate lists of the content being played.

Given the high demand for such an application, several approaches based on song fingerprint recognition have already been studied, such as Bellettini et al. (2010), Deng et al. (2011), Ellis (2014), Malekesmaeili et al. (2014), and Shustef (2015). Nowadays, the state-of-the-art recognition techniques are those developed by Shazam (Wang et al., 2003; Wang et al., 2014; Wang et al., 2018) and SoundHound (Mont-Reynaud et al., 2016), and the detection system by A. Master et al. (2010). These services are widely known for their mobile device applications.

Audio streams of many broadcast channels or recordings of different events are typically analyzed using fingerprint systems for media monitoring. As these systems work on massive quantities of data, the data models involved should be as small as possible, while the systems need to run efficiently on massive and growing reference databases. In addition, the media monitoring application imposes high robustness criteria. Although sensitivity to noise may not be the primary concern for this use case, the systems need to identify audio content that may have been altered by different effects.

1.2 Challenges

The number of songs in the music industry has increased significantly in recent years, according to a report by Murthy et al. (2018). With massive databases, the management and identification of songs using a conventional relational database management system have become more difficult. For large datasets, a common linear search technique that checks the existence of any fingerprint in an array one at a time shows a noticeable decrease in efficiency (Saini et al., 2014). The stored information therefore needs a scalable database system that meets the execution time, memory use, and computing resource requirements for retrieval, as suggested in Sreedhar et al. (2017).

Song recognition systems usually operate on vast amounts of data and are expected to meet several robustness requirements depending on the actual use case. Robustness to different kinds of lossy audio compression and to a certain degree of noise would seem to be the minimum requirement. Systems designed to detect microphone recordings of short audio segments require high robustness to background interference, such as noise and distortion, or even multiple songs playing simultaneously in the surroundings.

Robust and quick recognition is crucial for effective song information retrieval. Major consumers, such as music labels, manufacturers, promoters, and radio stations, need details about trending tracks, airtime schedules, and song versions. They therefore demand an application capable of generating information quickly and precisely.

In the field of real-time song recognition, the entertainment industry, particularly music, with its extensive digital collections and commercial interest, is opening new doors to research. In 2017, the global digital music industry expanded by 8.1 percent, with total revenues of US$17.3 billion, according to IFPI's Global Music Report 2018 (Domingo et al., 2018). In the same survey, for the first time, 54 percent of revenue came from digital music alone.

However, the most challenging application for bringing new songs to listeners is still the FM radio station. Music broadcasting in European countries still relies actively on FM radio channels (Hallett et al., 2010). Radio stations and music companies have been working to advance music industry data analytics by creating ways of analyzing broadcast songs through new services and platforms.

It is of interest to broadcasters and advertisers to measure radio audience size and listening patterns over a broadcast radio station to secure a source of revenue (Keith, 2012). Moreover, variations in a station's promotional material across regions can be predicted based on the demographics and psychographics (psychological criteria) of the station's target audience (Potter, 2002).

In addition to robustness criteria, the severity and effect of incorrect results must be considered, along with the required recognition performance characteristics of fingerprinting systems. For instance, when a song recognition system misses a match, the user may waste storage space; on the flip side, systems that report false positives, recognizing a specific song as a false match, should be avoided.

Most critically, false positives are expensive for large-scale media monitoring: revenue might be attributed to the wrong artist. False negatives, the other form of error, may lead to hours of unidentified material that has to be checked manually. Either form of error increases the maintenance cost of a system.

To overcome this, we have proposed a more scalable big data framework using fingerprint clustering. In addition, a new recognition algorithm was required for the new clustered collection. We also compared the performance of the legacy (non-clustered) system and the new clustered database.

We define the extent of scale modifications that are likely to be encountered when developing a framework for monitoring FM radio broadcasting stations, investigating our dataset of reported radio segment output through the percentage of accuracy. This estimate serves as the gold standard throughout this study; that is, a device should be robust to at least this range of scale noise but may need to cope with even more severe distortions.

1.3 Legacy Song Recognition System

This work originates in a collaboration, started in 2018, with a company specializing in parallel recognition of streaming songs in real time. In this analysis, the fingerprint database collection provided by the company was used. The collaboration aimed to gain insight into the industrial requirements and issues in this area. However, due to legal privacy and confidentiality arrangements, the company remains anonymous.

For song recognition, the legacy solution was to pre-load all fingerprint data into central memory. Two versions were introduced: in one case, a 32-bit instance running on 71.5 GB of central memory; in the other, a 64-bit instance running on 12 GB of central memory. However, the downside of this approach is that it could not accommodate the growing number of fingerprints in the future.

We performed a fundamental analysis of the data using histograms. The purpose of this analysis was to obtain a representation of the fingerprint values and their range. Beyond that, this analysis justified using the database clustering method.

The first approach was to collect samples from the datasets in the ranges of Table 1.1. The series shown in Figure 1.1 is based on 2.4 billion samples altogether. The X-axis describes the fingerprint value range, while the Y-axis represents the frequency of fingerprints falling into evenly spaced bins (512 and 1024 bins were used) between 0 and 4.2 · 10^9.

Sample No.   Total fingerprints   Selection Segments
1            2 · 10^4             Single random song
2            2 · 10^5             Sorted by song keys
3            2 · 10^6             Sorted by song keys
4            2 · 10^6             Random
5            2 · 10^7             Random
6            2 · 10^9             Complete

Table 1.1: Songs fingerprints datasets
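This kind of histogram analysis can be sketched as follows; the data here is synthetic, standing in for the proprietary samples of Table 1.1, and the 1024-bin count follows the text:

```python
import numpy as np

# Sketch of the fingerprint-distribution analysis: count how many
# fingerprint values fall into evenly spaced bins spanning the full
# 32-bit range. Synthetic data; the real dataset is proprietary.
rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2**32, size=2_000_000, dtype=np.uint64)

counts, edges = np.histogram(fingerprints, bins=1024, range=(0, 2**32))
print(counts.max(), edges[counts.argmax()])  # tallest bin and its lower edge
```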


Figure 1.1: 2.4 · 10^9 fingerprint distribution

All the samples presented a very similar distribution of fingerprints. We inferred that any given sample of this particular dataset would have the same distribution of fingerprints.

This is based on the evidence of similar pattern characteristics:

• A mirror axis between 2.0 · 10^9 and 2.5 · 10^9, at approximately the halfway point of the overall dataset.

• Two maximum peaks at close range, around 1.5 · 10^9 and 3.0 · 10^9.

• Maximum and minimum boundary values ranging between 0 and 3.0 · 10^9.

The fingerprint frequency diagram suggests that the fingerprint distribution is not uniform and could be modeled by a multimodal distribution; thus, a clustering approach could be effective.

At present, we cannot reveal the company's technique for generating these fingerprints due to patent rights. The fingerprints are represented as follows:


• Each song is represented by a sequence of fingerprints.

• Each fingerprint is an integer value.

• Each fingerprint represents a chunk of real audio playtime; we denote its time length with δ.

• The database contains a hundred thousand songs that are associated with approximately 2.4 billion fingerprints.

This vast number of fingerprints is stored in a single array, sorted by song number (key). Here, the critical difficulty during retrieval is the non-classification of fingerprint value boundaries, which results in an exhaustive search. We try to identify songs from a continuous sampling of music without any details on their boundaries.

Focusing on these findings, we performed both K-modes and K-means computation to obtain a clustered fingerprint collection associated with centroid values.
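A minimal sketch of the K-means variant of this idea is shown below; it uses scikit-learn on synthetic scalar values shaped like the bimodal distribution of Figure 1.1, whereas the thesis's actual implementation (Chapter 3) clusters the collection inside MongoDB:

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster scalar fingerprint values around K centroids, then restrict a
# query's search to the nearest cluster instead of scanning everything.
# Synthetic data; the real system clusters ~2.4 billion values.
rng = np.random.default_rng(1)
values = np.concatenate([
    rng.normal(1.5e9, 2e8, 50_000),   # two modes, mimicking Figure 1.1
    rng.normal(3.0e9, 2e8, 50_000),
]).reshape(-1, 1)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(values)

query = np.array([[2.9e9]])
cluster = kmeans.predict(query)[0]           # search only this cluster
members = values[kmeans.labels_ == cluster]
print(cluster, len(members))
```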

1.4 Research Problem

Fingerprint databases, recognition process instances, and FM transceivers make up the legacy system in commercial use. The problems of an audio recognition system in normal use can be described from the following aspects:

a) Near similarity: This occurs when two or more perceptually different audio recordings produce virtually the same audio fingerprints, leading to serious problems in the recognition process. Therefore, a key goal when developing an audio fingerprinting algorithm is to keep the probability of collision as low as possible (Serra et al., 2010).

b) Invariance to noise and spectral or temporal distortions: An audio signal is usually degraded to some degree by various sounds and vibrations when captured or played in real environments. The audio fingerprints of the degraded audio signal can remain the same as those of the original signal, provided the important features are unchanged. High robustness must be obtained in such cases by the fingerprinting technique (Bellettini et al., 2010).

c) Minimum length of song track needed for identification: Due mainly to time and storage limitations, matching the entirety of an unknown audio track in real-time music recognition is still impractical. Ideally, only a few seconds of the track should be required to identify the unknown audio (Wang et al., 2018).

d) Retrieval speed and computing load: In most real-time applications, it is desirable that the recognition results be provided within a few seconds. However, with the increase in song recordings in the audio reference database, locating the matching object correctly in real time becomes very difficult (Ellis, 2014).

A fingerprint is a distinct digital representation of a song's waveform. It can be obtained by collecting significant characteristics from the various audio properties. Without sacrificing its signature, the created fingerprint may also be segmented into several parts. Moreover, fingerprints can be processed at a much smaller scale relative to the audio waveform's initial form (Bertin-Mahieux et al., 2011).

The items listed below are several criteria that should be taken into consideration for a robust audio fingerprint:

• Consistent Feature Extraction: The key requirement of fingerprint generation is that it can replicate an identical audio fingerprint for the same section of music.

• Fingerprint Size: The fingerprint file size has to be small enough that more music collections can be stored in the database. In practice, a lightweight fingerprint offers effective memory allocation during processing.

• Robustness: Fingerprints can still be used for identification even if external signal noise has affected the source audio.

1.5 Contributions

We provide contributions to the academic literature that satisfy all the success criteria listed above. We show that our method is efficient despite our large comparison sets, addressing the extensive search problem caused by the invariances of the hashes and their robustness to signal modifications. Notably, we designed the proposed device for low-cost hardware, demonstrated its capabilities, and avoided costly CPU processing.

Studies on the identification of songs and fingerprints exist from other researchers, such as Shazam (Wang et al., 2018) and SoundHound (Mont-Reynaud et al., 2016). We demonstrate a clustering design using K-means in an experiment on a real fingerprint database, with a set of 2.4 billion fingerprints provided by the company as the dataset. We would like to emphasize that a database of this scale seldom appears in the research literature, though it may exist elsewhere. The next critical aspect was the audio recognition system. Although the initial K-means computation over the collection is resource-intensive, we achieved significant speed efficiency in the end. The proposed architecture and algorithm lead to new insight into song recognition.

Moreover, in this study we introduce an IoT-based solution to song recognition in a cloud environment. We developed a recognition system with the capability to integrate audio streams from remote FM radio stations. We conducted a song recognition technique based on the K-means clustered cloud database in MongoDB. We supported tests with various fingerprint lengths to ensure the best accuracy and reliability.

Significant findings also came from the new fingerprint extraction technique based on Short Time Power Spectral Density (ST-PSD), whose binary encoding attributes later enabled the reliability of K-modes. Besides, this study clarified the identification methodology, primarily through the Hamming distance measure within the predetermined cluster table. The findings were obtained by sampling 400 random 5-second queries from the initial song set in the experiment. With this method, the optimal combination of parameters yields an identification ratio of 90%.
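A minimal sketch of Hamming-distance matching over binary linekeys follows; the 32-bit integer representation is taken from Section 1.7, while the helper names and toy values are purely illustrative:

```python
# Hamming distance between two linekeys stored as integers:
# XOR the keys, then count the differing bits.
def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def best_match(query: int, candidates: dict[str, int]) -> tuple[str, int]:
    """Return the (song, distance) pair with the smallest Hamming distance."""
    return min(((song, hamming(query, lk)) for song, lk in candidates.items()),
               key=lambda pair: pair[1])

candidates = {"song A": 0b1011_0010, "song B": 0b1111_0000}  # toy values
print(best_match(0b1011_0110, candidates))                   # ('song A', 1)
```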

We have already introduced extracting fingerprints from audio based on the ST-PSD calculation in our previous works. In this thesis, we introduce several significant improvements to the previously proposed algorithm that enhance the robustness of the linekeys to temporal shift and the consistency of the fingerprints for perceptually distinct audio signals. The proposed fingerprints are based on calculating the audio signal's short-time power spectral density (ST-PSD) on the Mel frequency scale.

According to our tests and performance measurements, the framework can be used in different areas of fingerprinting applications. Beyond typical fingerprint applications, these include, for instance, the identification of audio copies, media tracking, and copyright identification of songs.

Sections of the work discussed here have been published in four documents:

• Chapter 3:

Sahbudin, M. A. B., Scarpa, M., & Serrano, S. (2019a). MongoDB clustering using k-means for real-time song recognition. In 2019 International Conference on Computing, Networking and Communications (ICNC). IEEE. https://doi.org/10.1109/ICCNC.2019.8685489

• Chapter 4:

Sahbudin, M. A. B., Chaouch, C., Scarpa, M., & Serrano, S. (2019b). IoT based song recognition for FM radio station broadcasting. In 2019 7th International Conference on Information and Communication Technology (ICoICT). IEEE. https://doi.org/10.1109/ICoICT.2019.8835190

• Chapter 5:

Chaouch, C., Sahbudin, M. A. B., Scarpa, M., & Serrano, S. (2020). Audio fingerprint database structure using k-modes clustering. Journal of Advanced Research in Dynamical and Control Systems, 12(04-Special Issue), 1545–1554. https://doi.org/10.5373/JARDCS/V12SP4/20201634

• Chapter 6:

Sahbudin, M. A. B., Chaouch, C., Scarpa, M., & Serrano, S. (2020a). Audio fingerprint based on power spectral density and hamming distance measure. Journal of Advanced Research in Dynamical and Control Systems, 12(04-Special Issue), 1533–1544. https://doi.org/10.5373/JARDCS/V12SP4/20201633

Serrano, S., Sahbudin, M. A. B., Chaouch, C., & Scarpa, M. (2020). A new effective method of fingerprint generation for song recognition. In review for journal publication.

Other sections of this thesis, such as Chapter 7, are unpublished and are presented here for the first time.

• Chapter 7:

Sahbudin, M. A. B., Chaouch, C., Scarpa, M., & Serrano, S. (2020b). Mobile application-programming interface (API) for song recognition systems. In review for journal publication.

In addition, the scope of Secure Storage in a Multi-Cloud Environment is presented in Chapter 8 and has been published in the following publication:

• Chapter 8:

Sahbudin, M. A. B., Di Pietro, R., & Scarpa, M. (2019c). A web client secure storage approach in multi-cloud environment. In 2019 4th International Conference on Computing, Communications and Security (ICCCS). IEEE. https://doi.org/10.1109/CCCS.2019.8888062

1.6 Music Information Retrieval

Music retrieval systems are designed to help users identify music in extensive collections according to a particular similarity measure. Casey et al. (2008) and Grosche et al. (2012) suggest classifying retrieval scenarios by precision and granularity, down to recognizing a specific audio signal at the exact time position in the original piece. Some of the most common music retrieval applications are described below, with references to the respective scientific and commercial usage. The purpose here is to retrieve or recognize the same music recording fragment under some robustness criteria.

Audio alignment or matching is a music retrieval scenario in which the objective is to associate time positions between two music signals, in addition to locating a given audio fragment. Furthermore, depending on the robustness of the audio features, various renditions of the same item may also be aligned. A cover version is a different arrangement of a previously released song. Since the cover may vary from the original song in timbre, tempo, form, chords, arrangement, or the language of the vocals, the automated recognition of cover songs in a given music compilation is very complicated. As reviewed by Serra et al. (2010), systems for version recognition are often based on the interpretation of the melody or harmony of music signals and the synchronization of these descriptors by local or global alignment methods.

The mentioned applications are based on a comparison between a target music signal and a database, often referred to as query sampling, but users may also want to identify music that meets particular requirements: for example, songs played in a specific key or with a faster beat (BPM), or, in simple use cases, music described by a particular genre such as "classical" or "rock". Semantic-based or category-based retrieval frameworks, such as those suggested by Turnbull et al. (2009), are based on methods that estimate semantic labels of music.


In recent years, we have seen a steady rise in the adoption of Internet of Things (IoT) technologies in developing cities to boost the quality of life of urban communities. From a human biological point of view, we might regard cities' infrastructures and ecosystems as living organisms, with organs transmitting messages through the nervous system to the brain. In practical terms, these IoT systems consist of low-cost sensors, actuators, and other smart devices streaming data over interconnected networks, via either private or public cloud networks (Campobello et al., 2017).

Radio stations and music publishers have aimed to advance data analysis in music markets, particularly by finding methods to track broadcast songs across digital networks and channels. It has been essential for broadcasters and marketers to measure radio audience size and listening patterns over a commercial radio station to secure a source of revenue (Keith, 2012). Moreover, variations in a station's advertising material across areas can be predicted based on the demographics and psychographics (psychological criteria) of the station's target audience (Potter, 2002).

1.7 Audio Fingerprinting

Song recognition requires a fingerprint to be extracted for each audio song file stored in the database. An unknown audio clip is then labeled when its fingerprint is compared against the reference database. The study by Bellettini et al. (2010) addresses broadcast analysis and audio distortion with a robust fingerprinting algorithm and a suitable retrieval method. As in Deng et al. (2011), there are also several audio fingerprint extraction techniques based on spectral energy structure and non-negative matrix factorization. The authors in Malekesmaeili et al. (2014) propose extracting fingerprints from time-chroma image characteristics. Besides, a method by Shustef (2015) offers a plug-in module for music recognition associated with a music management repository.


Shazam's cutting-edge process for song recognition is based on the innovative fingerprint approach that first appeared in the article by Wang et al. (2003). Furthermore, LabROSA (Ellis, 2014) released the Landmark-based method as the open-source tool Audfprint (Ellis, 2009), a scientific instrument used as the gold-standard comparison for this experiment.

The necessary steps in the technique by Wang et al. (2003) are as follows: each song track is analyzed to find significant frequency-concentrated onsets, as onsets are assumed to survive a certain amount of noise and distortion. After the significant peaks have been identified, groups of peaks are designated as landmarks. This is achieved by identifying a region right after each peak; the maximum peak is then paired with the peaks in that region, forming landmarks.

Each single song track is referenced by many landmarks from the previous phases and their relative times. This information is stored in the fingerprint database as an inverted index. To identify a sample track (short clip), the clip is likewise converted to landmark fingerprints. The database is then searched to find all reference tracks that share landmarks with the query.
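The pairing step can be sketched as follows; this is an illustrative reading of the landmark idea, not Shazam's actual implementation, and the peak list, fan-out, and hash layout are assumptions:

```python
# Pair each spectral peak with a few peaks that follow it, and hash
# (f1, f2, dt) into a single integer landmark.
# peaks: (time_frame, frequency_bin) pairs, assumed already extracted.
peaks = [(0, 30), (3, 42), (5, 18), (9, 40), (12, 33)]

FANOUT, MAX_DT = 3, 10   # pair with up to 3 peaks within 10 frames

def landmarks(peaks):
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1 : i + 1 + FANOUT]:
            if 0 < t2 - t1 <= MAX_DT:
                # 32-bit layout: 9 bits f1, 9 bits f2, delta-t in low bits
                yield (f1 << 23) | (f2 << 14) | (t2 - t1), t1

for h, t in landmarks(peaks):
    print(hex(h), t)   # landmark hash and the anchor time to index it under
```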

Figure 1.2 shows audio fingerprints extracted from the first 9.00 seconds of a song and how the file size is reduced by 98% relative to the original audio wave. This song fingerprint is represented by a sequence of 32-bit unsigned integers called linekeys, generated by a proprietary extraction algorithm owned by the company's legacy systems.


Figure 1.2: A Sample Audio Wave to Linekeys Conversion

1.8 Song Recognition System Architecture

Figure 1.3 demonstrates the general architecture of an audio fingerprint search method, which consists of two key stages. First, an audio fingerprint analysis produces an audio fingerprint for undefined audio. For reference database creation, the retrieved fingerprints are processed in a manner that facilitates efficient retrieval; once constructed, the reference database holds all the information that might potentially be found by the application. The second stage uses a retrieval technique to match an audio fingerprint query against a broad database of the original items' audio fingerprints. By returning the information of the closest audio fingerprints in the database, we can infer the unknown audio query's meta-information.

The precision of the system should be considered in light of the goals of the audio fingerprint search system. Furthermore, in real situations, the music may be changed in various ways, such as altered pitch and speed, added background, or mixing with other tracks. In that case, we can use an audio

fingerprint extraction algorithm with high precision and pair it with an excellent search algorithm to ensure that the song is matched correctly. Second, given the complexity of the extraction and search process, we need to improve efficiency by reducing extraction and scanning time when handling a vast number of audio fingerprints in the database and multiple queries simultaneously.

Figure 1.3: Overview of General Song Recognition System Architecture

Throughout this thesis, we will partly adapt the general architecture to the particular approach we suggest and concentrate on the details of each component. Ultimately, we arrive at definitions of all elements and subcomponents. The outcome is a full audio fingerprinting system capable of dealing with distortion and extreme changes from a streaming audio source. Figure 1.4 illustrates the final architecture of our particular system. In the following parts, we will use these figures with highlighted components where appropriate.


Figure 1.4: Song Recognition Framework
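The index-then-vote skeleton around these components can be illustrated with a self-contained toy; the fingerprint function below is a stand-in, not the thesis's extraction method (Chapters 5 and 6):

```python
# Two-stage flow of Figure 1.3: (1) fingerprint every reference song into
# an inverted index; (2) fingerprint the unknown clip and vote over matches.
def toy_fingerprints(samples, width=4):
    """Yield (offset, fingerprint); the fingerprint is a tuple of sign bits."""
    for i in range(len(samples) - width):
        yield i, tuple(int(s > 0) for s in samples[i : i + width])

db = {}   # inverted index: fingerprint -> [(song_id, offset), ...]
songs = {"song A": [1, 2, -3, 4, -5, -6, 7], "song B": [-1, 5, 5, -2, -2, 6, -7]}
for song_id, samples in songs.items():
    for off, fp in toy_fingerprints(samples):
        db.setdefault(fp, []).append((song_id, off))

query = [-3, 4, -5, -6, 7]   # a clip cut from song A, 2 samples in
votes = {}
for off, fp in toy_fingerprints(query):
    for song_id, ref_off in db.get(fp, []):
        key = (song_id, ref_off - off)          # consistent time offset
        votes[key] = votes.get(key, 0) + 1
print(max(votes, key=votes.get))                # ('song A', 2)
```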

1.9 Performance Measures of Song Recognition

The following notation is used in defining our performance measures for recognition accuracy:

TP - true positives: the number of query retrievals recognized as positive where the reference exists.

TN - true negatives: the number of query retrievals recognized as negative where the reference does not exist.

FP - false positives: the number of query retrievals recognized as positive that do not match the reference.

FN - false negatives: the number of query retrievals recognized as negative although the reference exists.


Accordingly, we define the accuracy measure as in Equation (1):

\[
\mathrm{Accuracy} = \frac{\sum \mathrm{TP} + \sum \mathrm{TN}}{\sum(\mathrm{TP} + \mathrm{FP} + \mathrm{FN} + \mathrm{TN})} \tag{1}
\]
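As a check on the definition, Equation (1) translates directly into code; the counts are illustrative only:

```python
# Accuracy as defined in Equation (1), from aggregated counts.
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + fp + fn + tn)

print(accuracy(tp=380, tn=10, fp=6, fn=4))  # 0.975 (illustrative counts)
```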

Another discrimination variable was added to improve the proposed recognition system's efficiency, based on the distance between the two maximum F-scores obtained in the search process for each fingerprint. Specifically, the distribution of this interval (called ∆) was analyzed both when the fingerprint was correctly identified and when it was misclassified.

1.10 Organization of Thesis

This thesis is organized as follows. In this chapter, we have presented the motivation and the definitions used throughout the work. Next, we review the literature on music and song fingerprinting in Chapter 2.

In the first part of this research, detailed in Chapter 3, we propose a K-means clustering of fingerprint collections in MongoDB. We also present a recognition algorithm for clustered fingerprint songs and then test our recognition approach's efficiency and accuracy using a real-time input audio stream. At this stage, we do not generate any audio fingerprints; instead, we concentrate on studying the data distribution and classification of the song fingerprint database to improve the method of recognition. This chapter is based on the following publication:

Chapter 3: "MongoDB Clustering using K-means for Real-Time Song Recognition", Sahbudin et al. (2019a)

The next step, covered in Chapter 4, focuses on combining two main areas: using IoT devices as FM receivers and using our own original early-stage song recognition algorithm, based on K-means clustering, in a cloud service. In particular, we concentrate on broadcast monitoring as continuous tracking of the audio source from radio stations. The proposed architecture is tested and compared with LabROSA's Landmark-based Audfprint (Ellis, 2009) and the current legacy framework implementation of the company, our industrial partner. This chapter is based on the following publication:

Chapter 4: "IoT based Song Recognition for FM Radio Station Broadcasting", Sahbudin et al. (2019b)

In addition, Chapter 5 expands the earlier research to include a novel approach to database clustering using a non-numeric data structure and a K-modes search algorithm. A new extraction technique that produces binary encoded fingerprints is also introduced. Subsequently, we evaluate the efficiency and accuracy of the recognition approach using the proposed new format. This chapter is based on the following publication:

Chapter 5: "Audio Fingerprint Database Structure using K-modes Clustering", Chaouch, Sahbudin et al. (2020)

Moving to the next stage, Chapter 6 discusses our new technique for producing fingerprints based on the estimation of the short-time power spectral density (ST-PSD) of the audio signal. The sampled signal is buffered and windowed to obtain the ST-PSD, with a 128-point Fast Fourier Transform performed for each window. The basic concept is a compact representation of the ST-PSD over several adjacent windows; this compact representation of the ST-PSD is referred to as the "linekeys" in this thesis. This chapter is based on the following publication:

Chapter 6: "Audio Fingerprint based on Power Spectral Density and Hamming Distance Measure", Sahbudin et al. (2020a)
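A hedged sketch of this pipeline is shown below; the 128-point FFT comes from the text, while the hop size, Hann window, and median threshold are illustrative assumptions rather than the exact parameters of Chapter 6:

```python
import numpy as np

# Window the signal, estimate the short-time PSD with a 128-point FFT,
# and binarize each frame against a threshold to get compact linekeys.
def st_psd_linekeys(x, n_fft=128, hop=64):
    window = np.hanning(n_fft)
    keys = []
    for start in range(0, len(x) - n_fft, hop):
        frame = x[start : start + n_fft] * window
        psd = np.abs(np.fft.rfft(frame)) ** 2           # |FFT|^2 per frame
        bits = (psd > np.median(psd)).astype(np.uint8)  # 1 bit per bin
        keys.append(np.packbits(bits))                  # compact binary linekey
    return keys

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)        # one second of a 440 Hz tone
keys = st_psd_linekeys(x)
print(len(keys), keys[0])              # number of linekeys and the first one
```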


Extending Chapter 6, we use the full spectrum information over short parts of the song to produce fingerprints that identify the song and each time interval within it. We use the same short-time power spectral density (ST-PSD) approach with a different computing method for the string of bits representing the fingerprints. The essential contribution here is the use of the Mel frequency scale to enhance the fingerprints' selectivity; with this change, we achieve very high precision in identifying songs, superior to both the landmark approach and our previous implementation. The produced fingerprints can also be organized in clusters for efficient search during the online recognition process. This chapter is based on results that are still under review:

"A New Effective Method of Fingerprint Generation for Song Recognition", Serrano, Sahbudin et al. (2020)
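The Mel-scale filtering step can be sketched as follows, using the standard Mel mapping; the band count, FFT size, and sampling rate here are illustrative, not the values used in Chapter 6:

```python
import numpy as np

# Triangular Mel-scale filter bank applied to a PSD frame, so linekey bits
# follow perceptually spaced bands instead of linear frequency bins.
def hz_to_mel(f):  return 2595 * np.log10(1 + f / 700)
def mel_to_hz(m):  return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_bands=20, n_fft=128, fs=8000):
    edges_hz = mel_to_hz(np.linspace(0, hz_to_mel(fs / 2), n_bands + 2))
    bins = np.floor((n_fft + 1) * edges_hz / fs).astype(int)
    fb = np.zeros((n_bands, n_fft // 2 + 1))
    for i in range(n_bands):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return fb

fb = mel_filterbank()
psd = np.ones(65)                 # flat PSD frame, for illustration only
print(fb.shape, (fb @ psd)[:5])   # 20 Mel-band energies
```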

In our research, as described in Chapter 7, we fostered interest in song recognition among our department's undergraduate students, who took it up as part of their final projects. Here, the extension development involves an external API build that interacts with the proposed song recognition service instance. The API is demonstrated in the development of Android mobile applications and Java applications. This chapter is an unpublished article that appears here for the first time and is still being written:

Chapter 7: "Mobile application-programming interface (API) for Song Recognition Systems", Sahbudin et al. (2020b)

This extended feature was part of the collective final-year project supervision of undergraduate students, which should be notably mentioned as a contribution to the whole song recognition project:

Grasso, A., Sahbudin, M. A. B., & Scarpa, M. (2019). Development of Musical Recognition in the Android Environment (Bachelor's Thesis). University of Messina


Di Luca Cardillo, G., Sahbudin, M. A. B., Scarpa, M., & Serrano, S. (2020). Audio Fingerprint Extraction using Power Spectral Density Methodology on Raspberry Pi RTL-SDR FM Module (Bachelor's Thesis). University of Messina

Siracusa, F., Sahbudin, M. A. B., & Scarpa, M. (2019). Method-based Audio Fingerprint Generation Differential (Bachelor’s Thesis). University of Messina

Di Liberto, A., Sahbudin, M. A. B., & Scarpa, M. (2020). Distributed architecture for music recognition API (Bachelor’s Thesis). University of Messina

Next, in Chapter 8, we present a web client application supporting a secure storage approach that guarantees confidentiality and integrity of data stored in a multi-cloud environment. This research presents and discusses a web application framework and its implementation, and evaluates the proposal by conducting several experiments with the web application in a real multi-cloud environment scenario. This storage can also be used for the song fingerprint database. The chapter is based on the following publication:

Chapter 8: "A Web Client Secure Storage Approach in Multi-Cloud Environment", Sahbudin et al. (2019c)
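The encrypt-then-split idea behind the framework can be sketched as follows; this only illustrates client-side confidentiality across providers, using the Python cryptography library, and it omits the framework's actual SSME protocol, integrity checks, and key management:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Encrypt client-side, then distribute ciphertext chunks across independent
# cloud providers so no single provider holds a readable copy of the file.
def encrypt_and_split(data: bytes, n_providers: int = 3):
    key = AESGCM.generate_key(bit_length=256)     # stays with the client
    nonce = os.urandom(12)
    ciphertext = AESGCM(key).encrypt(nonce, data, None)
    size = -(-len(ciphertext) // n_providers)     # ceiling division
    chunks = [ciphertext[i * size:(i + 1) * size] for i in range(n_providers)]
    return key, nonce, chunks                     # one chunk per provider

key, nonce, chunks = encrypt_and_split(b"fingerprint database backup")
restored = AESGCM(key).decrypt(nonce, b"".join(chunks), None)
print(len(chunks), restored)
```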

Finally, Chapter 9 concludes the overall research with several discussions and potential future directions.


Chapter 2

Literature Overview

In this chapter, we begin by introducing music and its key digital audio representations. We then introduce the audio signal processing literature, which encapsulates the methodologies used in this thesis. Finally, we conclude the chapter by discussing some standard song recognition systems and the results researchers have achieved in this field.

2.1 Music Background

Music is a science or art of composing sounds or tones in sequence, in combination, and in temporal relationships to create a synthesis with harmony and consistency (“Music”, 2020). The sources of music, commonly songs, are vocal, instrumental, or mechanical sounds having rhythm, melody, or harmony. Some fundamental elements of music and sound are useful to understand before proceeding, although detailed music theory is outside this study's scope. In general, the elements are described in Table 2.1 (Estrella, 2020).


Each element is listed with its definition followed by its main characteristics:

• Beat: Provides music its rhythmic pattern and is the basic unit of time. A beat can follow regular or irregular patterns.

• Meter: Recurring patterns of accents, such as bars and beats, produced by grouping together strong and weak beats. A meter may contain two or more beats in a measure.

• Dynamics: The volume of a performance, that is, the variation in loudness between notes or phrases. Punctuation marks, abbreviations, and symbols signify moments of emphasis.

• Harmony: The sound produced when two or more notes are played simultaneously. Harmony supports the melody, gives it texture, and may be analyzed by hearing.

• Melody: The musical tune created by playing a succession or series of notes. A composition may have a single melody or multiple melodies.

• Pitch: Measured as the frequency of vibration and the size of the vibrating object. The slower the vibration and the larger the vibrating object, the lower the pitch, and vice versa.

• Rhythm: The pattern or placement of sounds in time and beats. Rhythm includes elements such as tempo and is formed by a meter.

• Tempo: The speed at which the underlying beat of a piece of music is played. Tempo is indicated and measured in BPM (beats per minute).

• Texture: The number and types of layers combined in a composition. A texture could be a single line, multiple lines, or a main melody accompanied by chords, and it determines the overall quality of the music.

• Timbre: The sound quality that distinguishes different types of sound production, whether from a voice or an instrument. Timbre, with its own unique tone color, can range from dull to lush and from dark to bright.

Table 2.1: Key elements of music

The significance of understanding the musical elements depends on the objectives of the audio recognition system. One or more musical elements may be selected as features representing the audio data. In cases where recognition must relate to time position, tempo and beat can be critical; in contrast, recognition of similarity can rely on elements such as rhythm and melody. Thus, the right initial analysis is important for the outcome.


2.2 Audio Representation

In reality, anything within the range of human hearing is an audio signal. Music is one of the subsets of audio, in addition to voice, broadcasting, and telecommunications media. Starting with the demonstration of the first phonograph by Thomas Edison in 1877 (Edison, 1878), it became possible to capture these pressure waves in a physical medium and then replicate them later by regenerating the same pressure waves. Analog audio media, such as phonograph records and cassette tapes, represent the waveform structure directly, using the groove depth for a record or the magnetization level for a tape. Analog recording can replicate an incredible collection of sounds, but it suffers from noise issues. Notably, more noise is added each time an analog recording is copied, decreasing its quality.

2.2.1 Digital Audio

Digital recording works differently: it measures the waveform at evenly spaced time points, representing each sample as an exact number. Digital files do not decay with time, whether stored on a compact disc (CD), MiniDisc (Ikeda et al., 1993), or hard disk storage (e.g., computer, MP3 players, iPod, SD card). Furthermore, they can be copied indefinitely without introducing any extra noise. A sampled audio waveform is depicted in Figure 2.1.

Figure 2.1: Sampled Audio Waveform from .WAV format

Digital audio can be modified and remastered to reduce noise. Besides, many digital effects can be added, for example, change of speed, pitch, tempo, loudness, reverb, and other modifications.

Two factors significantly affect the quality of a digital audio recording: the sample rate and the sample size (Mazzoni et al., 2014). Increasing the sample rate or the number of bits in each sample improves the recording quality, and also increases the amount of space used by audio files on the device or disk.

• Sample rate is measured in hertz (Hz), or cycles per second. This is the number of samples taken per second to represent the waveform. Higher sample rates allow the representation of higher audio frequencies. Sample rates are also commonly expressed in units of 1000 Hz (kHz). Table 2.2 lists typical audio usages for various sample rates.

• Sample format is determined by the number of bits used to represent each sample. The more bits used, the more exact each sample's representation. Increasing the number of bits thus increases the full dynamic range of the audio file, i.e., the volume difference between the loudest and softest audible signals; a worked example follows below.
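As a worked example of the relationship between bit depth and dynamic range just described, linear PCM follows the standard approximation of roughly 6.02 dB per bit:

```python
import math

def dynamic_range_db(bits: int) -> float:
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
    return 20.0 * math.log10(2 ** bits)

print(round(dynamic_range_db(8), 1))   # 48.2 dB
print(round(dynamic_range_db(16), 1))  # 96.3 dB, the familiar CD-audio figure
print(round(dynamic_range_db(24), 1))  # 144.5 dB
```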


• 8,000 Hz (8 kHz): Adequate for human speech. Used in low-bandwidth communication such as telephone or walkie-talkie.

• 11,025 Hz (11.02 kHz): Used for lower-quality PCM and MPEG audio and for audio analysis.

• 16,000 Hz (16 kHz): Used in most Voice over IP (VoIP) systems; an extension of telephone quality.

• 22,050 Hz (22.05 kHz): Used for lower-quality PCM and MPEG audio and for audio analysis of low-frequency energy.

• 44,100 Hz (44.1 kHz): Commonly used for Audio CD mass distribution and MPEG-1 audio.

• 48,000 Hz (48 kHz): The basic sample rate used by professional digital video equipment; it can reconstruct frequencies of up to 24 kHz.

• 88,200 Hz (88.2 kHz): Used by some professional recording equipment for mixing, equalizers, compressors, reverb, crossovers, and recording devices. Double the Audio CD rate.

• 96,000 Hz (96 kHz): DVD-Audio, Blu-ray audio tracks, HD DVD audio tracks.

• 192,000 Hz (192 kHz): Used with audio on professional video equipment. DVD-Audio, LPCM DVD tracks, Blu-ray audio tracks, HD DVD audio tracks.

Table 2.2: Sample rates in different types of audio

2.2.2 Music Audio File Formats

An audio file format is a format for storing digital audio data on a computer system. The bit representation of the audio data is called the audio encoding format, which can be uncompressed or compressed to minimize the file's size, for example with lossy compression. Data may be a raw bitstream in an audio encoding format, but it is typically embedded in a container file or an audio data format with a specified storage layer.
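As a small illustration of the distinction between container and encoding, the Python standard library's wave module can read the header of an uncompressed WAV container; the file name here is a placeholder.

```python
import wave

# "track.wav" is a placeholder; any PCM-encoded WAV file will do.
with wave.open("track.wav", "rb") as wav:
    print("channels:   ", wav.getnchannels())
    print("sample rate:", wav.getframerate(), "Hz")
    print("sample size:", wav.getsampwidth() * 8, "bits")
    print("duration:   ", wav.getnframes() / wav.getframerate(), "s")
```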

Audio file compression is a method used to minimize the size of a stereo or multi-channel audio file. A smaller, compressed file decreases the storage space needed to hold music or video on a music player or hard drive, and enables data to be transmitted more quickly over the Internet or between storage media. There are two basic audio compression techniques, lossless and lossy, and several formats exist for each. Table 2.3 shows common audio file types along with their usage, platform, extension, compression type, and proprietary status.


Lossless compression ensures that no audio data is removed during encoding. It provides CD-quality sound for the audio enthusiast who wants the highest possible sound quality.

Lossy compression implies that some audio data is permanently discarded from the audio format, which results in smaller files. However, there is no way to restore the audio data to its original form. The objective of lossy audio compression is to minimize the data required while preserving acceptable audio quality for content delivery.


Table 2.3: Types of common audio files

• PCM (cross-platform; .pcm; no compression; not proprietary): Pulse-Code Modulation is one of the digital representations of an analog signal; the method of representation is called "analog signal coding to digital form."

• RAW (cross-platform; .raw; no compression; not proprietary): RAW audio is simply a raw audio format without any header data (sample rate, bit depth, endianness, or number of channels), used for storing uncompressed audio files.

• WAV (cross-platform; .wav; optional lossy compression; not proprietary): Waveform Audio File stores waveform data. It is a standard audio file format developed by IBM and Microsoft for storing bitstream audio on PCs.

• AIFF (Mac; .aif, .aiff; no compression; not proprietary): Audio Interchange File Format is a standard for storing sound data on personal computers and other electronic audio devices; Apple Inc. developed the format in 1988.

• AU (Unix/Linux; .au, .snd; optional µ-law lossy compression; not proprietary): The Au file format was developed by Sun Microsystems and was initially headerless, using 8-bit µ-law-encoded data.

• MP3 (cross-platform; .mp3; MPEG lossy compression; license required for distribution): The advantage of the MPEG audio layer 3 format is compression that saves space while preserving the original sound almost perfectly, which makes MP3 very popular for mobile audio players.

• AAC (cross-platform; .m4a, .m4b, .m4p, .m4v, .m4r, .3gp, .mp4, .aac; AAC lossy compression; license required for distribution): Advanced Audio Coding provides respectably high-quality sound improved by advanced coding, though it has never been one of the most common audio formats, particularly for music files.

• WMA (Windows; .wma; WMA lossy compression; proprietary): Windows Media Audio is a Windows-based counterpart to the more common and popular MP3 format; its encoding ensures high audio consistency through all forms of changes.

• FLAC (cross-platform; .flac; FLAC lossless compression; open-source, not proprietary): Free Lossless Audio Codec compresses audio into a smaller file size without loss; it is a complex file type less commonly used than other audio formats.


2.3 Audio Feature Extraction

The extraction of features is a critical part of analyzing and finding associations between different audio signals (Friberg et al., 2011). Raw audio data cannot be directly understood by models; feature extraction translates it into a comprehensible format. It is a process that captures most of the detail in a comprehensible manner, and it solves the problem of representing the instances to be categorized in terms of feature vectors or pairwise similarities.

Directly processing high-quality audio tracks for information retrieval demands large amounts of memory and processing time, according to a review by Jitendra et al. (2020). Mostly, numeric features are extracted that emulate signal characteristics and concisely represent the original audio songs. An extensive range of features has been proposed, supported by different signal processing techniques and statistical methods, to simplify audio processing tasks. The research by Fu et al. (2010) categorizes audio features into three levels, low-level and mid-level features and top-level labels, as illustrated in Figure 2.2 and described in Table 2.4.

Low-level features are further sub-categorized into short-term and long-term features. Short-term features, such as spectral and timbre features, typically capture the audio signal's characteristics in frames with a length of 10-100 ms. Long-term features, such as temporal features, capture long-term effects and signal interaction and are generally derived from local windows of longer duration. Works such as Allamanche et al. (2001), Li et al. (2003), Lu et al. (2005), Benetos et al. (2006), and Cheng et al. (2008) use spectral analysis to extract low-level features. Two of the simplest short-term features are sketched below.
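To make the short-term idea concrete, here is a minimal sketch of two of the simplest low-level features listed in Table 2.4, the zero-crossing rate (ZCR) and the spectral centroid (SC), computed on a single frame; window length and any pre-processing are left out for brevity.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose sign changes."""
    signs = np.sign(frame)
    return np.mean(np.abs(np.diff(signs)) > 0)

def spectral_centroid(frame, fs):
    """Magnitude-weighted mean frequency of the frame's spectrum."""
    mags = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    return np.sum(freqs * mags) / (np.sum(mags) + 1e-12)  # guard vs. silence
```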

Mid-level features are mainly based on three types of characteristics: rhythm, pitch, and harmony. Top-level labels provide semantic information about how humans perceive and interpret music, such as genre, mood, and style.

Figure 2.2: Overview of Audio Features Level. (The figure organizes features into top-level labels such as instrument, mood, genre, and other characteristics; mid-level features such as rhythm, pitch, harmony, and other statistical values; and low-level features, split into short-term spectral and timbre features and long-term temporal features.)

Short-term features:

• Spectral: spectral centroid (SC), spectral roll-off (SR), spectral flux (SF), spectral bandwidth (SB), stereo panning spectrum features (SPSF), spectral flatness measure (SFM), spectral crest factor (SCF), zero-crossing rate (ZCR), short-time power spectral density (ST-PSD).

• Timbre: amplitude spectrum envelope (ASE), constant Q-transform (QT), crest factor (CF), discrete wavelet transform (DWT), Daubechies wavelet coefficient histogram (DWCH), Fourier cepstrum coefficient (FCC), linear predictive cepstral coefficients (LPCC), Mel-frequency cepstral coefficient (MFCC), Morlet wavelet transforms (MWT), octave scale cepstral coefficient (OSCC), root mean square energy (RMS).

Long-term features:

• Temporal: statistical moments (SM), amplitude modulation (AM), auto-regressive modeling (ARM).

Table 2.4: Types of low-level features

In this thesis, we use the short-term feature short-time power spectral density (ST-PSD) in Chapter 6, and we also extend the method with a Mel frequency scale filter bank to improve the selectivity of the fingerprints.
