A map-matching algorithm dealing with sparse cellular fingerprint observations

(1)

Full Terms & Conditions of access and use can be found at

https://www.tandfonline.com/action/journalInformation?journalCode=tgsi20

ISSN: 1009-5020 (Print) 1993-5153 (Online) Journal homepage: https://www.tandfonline.com/loi/tgsi20

A map-matching algorithm dealing with sparse

cellular fingerprint observations

Andrea Dalla Torre, Paolo Gallo, Donatella Gubiani, Chris Marshall, Angelo

Montanari, Federico Pittino & Andrea Viel

To cite this article: Andrea Dalla Torre, Paolo Gallo, Donatella Gubiani, Chris Marshall, Angelo Montanari, Federico Pittino & Andrea Viel (2019) A map-matching algorithm dealing with sparse cellular fingerprint observations, Geo-spatial Information Science, 22:2, 89-106, DOI: 10.1080/10095020.2019.1616933

To link to this article: https://doi.org/10.1080/10095020.2019.1616933

Published online: 05 Jun 2019.

Submit your article to this journal

Article views: 161

(2)

A map-matching algorithm dealing with sparse cellular

ﬁngerprint

observations

Andrea Dalla Torrea_{, Paolo Gallo}b_{, Donatella Gubiani} c_{, Chris Marshall}d_{, Angelo Montanari}b_,

Federico Pittino a_{and Andrea Viel}b

a_{Cellular Product Center, u-blox Italia SpA, Trieste, Italy;}b_{Department of Mathematics, Computer Science and Physics, University of} Udine, Udine, Italy;c_{Center for Information Technologies and Applied Mathematics, University of Nova Gorica, Nova Gorica, Slovenia;} d_{Cellular Technology, u-blox UK Ltd, Reigate, UK}

ABSTRACT

The widespread availability of mobile communication makes mobile devices a resource for the collection of data about mobile infrastructures and user mobility. In these contexts, the problem of reconstructing the most likely trajectory of a device on the road network on the basis of the sequence of observed locations (map-matching problem) turns out to be particularly relevant. Different contributions have demonstrated that the reconstruction of the trajectory of a device with good accuracy is technically feasible even when only a sparse set of GNSS positions is available. In this paper, we face the problem of coping with sparse sequences of cellularfingerprints. Compared to GNSS positions, cellular fingerprints provide coarser spatial information, but they work even when a device is missing GNSS positions or is operating in an energy saving mode. We devise a new map-matching algorithm, that exploits the well-known Hidden Markov Model and Random Forests to successfully deal with noisy and sparse cellular observations. The performance of the proposed solution has been tested over a medium-sized Italian city urban environment by varying both the sampling of the observations and the density of the fingerprint map as well as by including some GPS positions into the sequence offingerprint observations.

ARTICLE HISTORY Received 16 November 2018 Accepted 19 March 2019 KEYWORDS Map-matching algorithm; trajectory; cellular ﬁngerprint; Hidden Markov Model

1. Introduction

Nowadays, the widespread availability of mobile com-munication makes the mobile phone a resource that may collect information about urban mobility and cellular networks at a very low cost. It can be exploited, for instance, to determine the approximate position of a phone user on the basis of information about the signal of received cells only, if there are previous georeferenced data to match against. Moreover, the collection of sequences of such obser-vations relative to a given mobile phone allows one to guess the path followed by the user carrying that phone. As a matter of fact, reconstructing the path of a moving device is an important task in many

application ﬁelds, ranging from location-based

ser-vices to asset tracking andﬂeet management.

As people do not usually move randomly, but follow the underlying transportation infrastructures, there is the chance to use information about road networks to improve both position detection and trajectory reconstruction. The map-matching pro-blem is mainly about reconstructing the most likely trajectory of a device across a road network given a sequence of observed locations. A variety of

approaches to such a problem can be found in the literature. For a survey on them see, for instance,

(Quddus, Ochieng, and Noland 2007). They diﬀer

in various respects and make use of different technol-ogies. In particular, they exhibit a different trade-off among cost, energy consumption, and accuracy.

The problem has been extensively studied in the case of observations that include location information from a Global Navigation Satellite System (GNSS),

like the Global Positioning System (GPS)

(Thiagarajan et al. 2011; Zhuang, Kim, and Singh

2010; Paek et al. 2011; Li et al. 2014). However,

whenever GPS data are not available, locations must be inferred on the basis of accessible information. In this work, we focus on solutions based on cellular

networks, where signal ﬁngerprinting provides

a complementary/alternative source of information with respect to GPS. Compared to those based on

GPS, solutions based on signal ﬁngerprinting also

guarantee a lower energy consumption. For such a reason, despite the fact that the accuracy in posi-tioning that they guarantee is, in general, lower, in energy constrained environments they are often the preferred choice. Moreover, since in most of the considered systems a communication capability is CONTACTDonatella Gubiani donatella.gubiani@ung.si

This paper is an extended version of the paper“Map matching with sparse cellular ﬁngerprint observations” from 2018 Ubiquitous Positioning, Indoor Navigation and Location-Based Services (UPINLBS), DOI: 10.1109/UPINLBS.2018.8559918

2019, VOL. 22, NO. 2, 89_–106

https://doi.org/10.1080/10095020.2019.1616933

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

(3)

an additional requirement, using a single device (the cellular module) for both communication and posi-tioning turns out to be a cost-eﬀective solution.

Fingerprint positioning systems are based on the observation of the received signals, whose strength, which depends on the current location of the receiver

device, may be diﬀerent (Chen et al. 2006;

Benikovsky, Brida, and Machaj 2010). In general

terms, a ﬁngerprint can be deﬁned as the set of the

values of the signal strengths of the observable trans-mitters. In the speciﬁc case of cellular networks,

a ﬁngerprint is the collection of the values of the

signal strengths of the available cell towers. One of

the fundamental modules of aﬁngerprint positioning

system is the database, that stores observations taken at known locations. When a device asks the system for an estimation of its current position, it compares

the ﬁngerprint of the device, that is, the observed

signals, with the set of past observations recorded in the database. To execute such a comparison, the system has to distinguish among the signals coming from diﬀerent cells (at any given time, signals coming from multiple cells are usually received by a device). Since each cell broadcasts a set of identifying para-meters, it can be easily recognized (Hoy2015).

Whileﬁngerprint positioning is more energy

eﬃ-cient than an approach using GPS, its overall perfor-mance strongly depends on the quantity and

distribution of available ﬁngerprints. A problematic

case for map matching is given by scenarios where observations are temporally sparse, i.e. the frequency at which devices collect data is so low that a few minutes, or even more, may elapse from one observa-tion to the next. Observaobserva-tions can be temporally sparse to reduce energy consumption, but also to satisfy speciﬁc constraints on data storage or trans-mission. Solutions that work well with temporally dense data become less eﬀective when applied to temporally sparse data. The problem of map match-ing when data are spatially noisy and temporally sparse has been addressed in the GPS setting. In order to deal with the limited amount of information, probabilistic algorithms are exploited that return the most likely trajectory with a certain degree of

accu-racy. One of the most eﬀective solutions, which

makes use of Hidden Markov Models (HMM), is

reported in (Newson and Krumm2009). It has been

exploited and further improved in subsequent

con-tributions (Osogami and Raymond 2013; Oran and

Jaillet2013; Jagadeesh and Srikanthan2015).

In this paper, we go a step further and propose a solution to the problem of reconstructing the tra-jectory of a device when data are spatially noisy and temporally sparse using only cellularﬁngerprints. The worst scenario is probably the one in which, due to a low sampling rate, limited temporal information is available and spatial information has a low accuracy.

To deal with sparse observations in the form of

cellularﬁngerprints, we design an HMM-based

map-matching algorithm that works oﬄine. It consists of

four main steps. It starts with (i) a preprocessing step, followed by (ii) the generation of the sets of states and (iii) the deﬁnition of the set of transitions between them. Once the model has been generated, (iv) the most likely map-matched trajectory is

extracted from it. To compareﬁngerprints, the

algo-rithm makes use of a machine learning method based on decision tree ensembles, that turns out to be

competitive against state-of-the-art ﬁngerprint

com-parison functions (Viel et al.2018).

To demonstrate that the proposed algorithm can reconstruct trajectories even withﬁngerprint observa-tions that intrinsically have a low spatial accuracy, we tested it in the urban environment of a medium-sized Italian city. Moreover, in order to evaluate its robust-ness, we applied it to a variety of scenarios obtained by (artiﬁcially) changing the sampling of the

observa-tions and the density of the ﬁngerprint map. We

completed the experimentation by checking its

beha-viour over mixed sequences of GPS andﬁngerprints.

The paper is organised as follows. InSection 2, we brieﬂy analyse related work on cellular ﬁngerprinting positioning systems, the map-matching problem, and

the application of HMM to it. Then, inSection 3, we

illustrate the proposed algorithm, with a special attention to the applied methods. The considered

dataset is described in Section 4. In Section 5, we

show the algorithm at work on real data, and then, in Section 6, we evaluate the experimental results.

Finally, in Section 7, we summarise the achieved

results and outline future work directions.

2 Related work

Before presenting details and outcomes of the pro-posed solution, we give a short account of the rele-vant literature. Weﬁrst focus on positioning systems, that make use of cellular ﬁngerprints (Section 2.1), and then we discuss the map-matching problem, and how it can be addressed by means of HMM (Section 2.2).

2.1. Cellularﬁngerprint positioning systems

Cellular ﬁngerprint positioning systems aim at

esti-mating the current position of a device by comparing

its ﬁngerprint with the ﬁngerprints of known

posi-tions stored in the system database. Such

a comparison of the observed ﬁngerprint with the

recorded ones is a crucial component of the beha-viour of the system. As the devices communicate over

a network, ﬁngerprints always provide information

about the serving cell and its signal strength. Additional information about the set of neighbour

(4)

cells, possibly including the strength of their signals, is sometimes taken into consideration as well. In the best case, all the physical parameters of all the observed cells are available.

Various metrics have been used to evaluate the

distance between pairs of ﬁngerprints in order to

identify, among the ﬁngerprints stored in the

data-base, the most similar ones (the underlying assump-tion is that the more similar theﬁngerprints are, the closer are their positions). The most commonly used

metric is the Euclidean distance (Bahl and

Padmanabhan 2000). Others metrics proposed in

lit-erature are the Spearman Correlation (Zekavat and

Buehrer 2011), which measures similarity between

two ordered sets using ranking, and Hyperbolic

Fingerprinting (Kjærgaard and Munk 2008), that

compares signal strength between pairs of bacon. An innovative approach that makes use of decision tree ensembles has been developed in (Viel et al.

2018).

A variety of localisation methods that exploit information from cellular networks have been

pro-posed in the literature (Deblauwe 2008). The

distin-guishing features of signalﬁngerprinting are that (i) it does not require any assistance from the network operators and (ii) it does not impose any signiﬁcant

modiﬁcation to the network infrastructure. Signal

ﬁngerprinting has been successfully used in various contexts, including Wi-Fi and cellular networks, and indoor and outdoor environments. As for Wi-Fi net-works, the RADAR system (Bahl and Padmanabhan

2000) proves that Wi-Fi access points can be

exploited to achieve accurate indoor location

estima-tions. As for cellular networks, GSMﬁngerprinting in

an outdoor context has been investigated in (Chen et al.2006). In such a contribution, it clearly emerges

that the quality and the quantity of the ﬁngerprints

available for the position estimation signiﬁcantly

aﬀect the outcomes of the process. Cellular

ﬁnger-prints have also been paired with information coming from other sensors to increase the performance of a positioning system. This is the case, for instance,

with (Aly and Moustafa 2013), where the authors

make use of signals coming from cellular networks to reset the accumulated error of a system exploiting inertial sensors.

In general, a variety of factors inﬂuence the

per-formance of ﬁngerprint positioning. The density of

cells is one of them: usually, the higher the density, the higher the precision is. With a high density,

indeed, it becomes easier to diﬀerentiate the

ﬁnger-prints, even in small areas. An important role is played also by the quantity and the distribution of the observations recorded in the database: a highly

populated ﬁngerprint map generally produces more

accurate estimations. A particularly advantageous

characteristic of the ﬁngerprinting method, that

distinguishes it from other cellular positioning ones, like, for instance, those based on the Timing of

Arrival, is its tolerance toward signal ﬂuctuations

and multi-path eﬀects, which is a very helpful feature in urban environments, where also GPS is mostly at a disadvantage. A signiﬁcant drawback of the method is that a sampling of the territory is needed to obtain

a map of signal ﬁngerprints, tagged with the GPS

position of the collecting site. However, once the ﬁngerprint map has been produced, the position esti-mation can be obtained in a very fast way, even in those situations where GPS can take a long time to get aﬁx.

In this paper, ﬁngerprint comparison is done by

means of decision trees. Decision trees are a popular method for many supervised machine learning tasks. Their success is due to the fact they are easily inter-pretable and quite eﬃcient, during both the learning and the prediction phases. On the negative side, the predictions obtained by means of decision trees are usually less accurate than those that can be achieved with other methodologies. Moreover, decision trees

have a tendency to over-ﬁt training data. A way to

signiﬁcantly improve the accuracy of the method is to

combine a set of diﬀerent decision trees into a

so-called Random Forest (Breiman 2001). This solution

builds a set of trees during the training phase, and then it outputs the average prediction of the indivi-dual trees as the result. Such a technique has been

exploited to estimate ﬁngerprint similarity in (Viel

et al.2018), by making use of an ensemble of decision trees. Basically, the similarity between a pair of ﬁn-gerprints is assessed by means of a set of features, which includes the number of common cells and other parameters related to signal strength. An experimental evaluation of the method has shown

a signiﬁcant improvement in accuracy compared to

other methods such as, for instance, the one based on the Euclidean distance. The method has also other advantages over traditional ones. By using features that characterize the diﬀerences between ﬁngerprints,

rather than the ﬁngerprints themselves, the model is

not speciﬁc to the area in which it was generated, that is, at least in principle, it is not strictly necessary to train a diﬀerent model for each region. Moreover, the model provides a similarity measure in meters, which is an advantage compared to other metrics in the literature that report a similarity value devoid of an actual spatial interpretation.

2.2. The map-matching problem

Map matching aims at integrating positioning data with spatial information about the road network. The problem of map matching has been largely explored in the literature and various algorithms have been proposed, which diﬀer from each other in the

(5)

available sensors, sampling rate, road network model, and applicationﬁeld.

There is not a widely accepted classiﬁcation of the proposed solutions. Some possible criteria have been

suggested in (Quddus, Ochieng, and Noland 2007).

Considering the way observations are processed, one can distinguish between online (or real time) and oﬄine algorithms. In the ﬁrst case, the observations are elaborated incrementally, in the second one, there is a post-processing phase that involves all the obser-vations of a path at once. Alternatively, one can distinguish among geometrical, topological, and probabilistic map-matching algorithms. A partition can also be done on the basis of the frequency of observations, that allows us to split existing solutions into two classes: low frequency and high frequency. Finally, there exist some GPS-based solutions, that are respectively based on probabilistic theory,

Kalman ﬁlter, fuzzy logic, and belief theory, which

are not easily ascribable to any of the above classes

(Quddus, Ochieng, and Noland2007).

A critical analysis of several map-matching solu-tions, which pay a special attention to their

perfor-mance, can be found in (Hashemi and Karimi2014).

Its outcomes can be summarised as follows: (i) the choice of the map-matching algorithm to use in

a speciﬁc application strongly depends on the

assumptions and the expected accuracy, (ii) there are few map-matching algorithms that fully exploit topology networks, taking into consideration para-meters like, for instance, direction and turn-restriction, and (iii) there exists a trade-oﬀ between accuracy and simplicity of map-matching algorithms.

2.2.1. HMM-based map matching

The Hidden Markov Model (HMM) is a statistical model designed to deal with sequential data with an underlying hidden structure. It has a number of applications in domains such as language processing, speech recognition, and biological analyses (Juang

and Rabiner 1991; Jurafsky and Martin 2009).

Recently, HMM techniques have been successfully exploited in traﬃc monitoring and positioning (Aly and Moustafa2015; Thiagarajan et al.2009).

In the last years, it has been successfully applied in the context of map-matching trajectories based on GPS observations, which are, at the same time, spa-tially noisy and temporally sparse. The proposed solutions extensively used HMM to describe systems whose states cannot be accessed directly, but only through external observations. Indeed, HMM turns out to be particularly useful when matching a location observation, especially, for example, in the case of cellular observations, for situations when the closest road on the map cannot be distinguished

reliably (Algizawy, Ogawa, and El-Mahdy 2017).

Map-matching algorithms based on HMM build a probabilistic model that, on the basis of the avail-able observations, describes the routes that a device may have followed. States can be viewed as a set of candidate road segments for each observed location, while transitions are interpreted as paths between pairs of road segments. States and transitions are paired with an emission and a transition probability, respectively. The emission probability expresses the likelihood that the measurement was done in that location, while the transition probability is the prob-ability that the path was taken. Most HMM methods proposed in literature heavily rely upon the computa-tion of those probabilities. As an example, in CTrak

(Thiagarajan et al.2011), HMM was exploited for the

mapping of trajectories in a scenario based on cellular ﬁngerprints. Once HMM is built, the most likely sequence of states is computed by means of the

Viterbi algorithm (Forney 1973; Newson and

Krumm2009).

HMM-based map-matching algorithms are usually applied once all the location data are collected to generate the full path. An on-line alternative is to use a sliding window method, where the path is computed excluding the n-th latest observations

(Lou et al. 2009). This approach is well suited for

real-time applications, but, unfortunately, it consider-ably reduces the accuracy of the algorithm. In (Goh

et al. 2012), a diﬀerent approach that makes use of

a variable-width sliding window is outlined, which guarantees that the optimal solution is still found.

Finally, the problem of dealing with scenarios where the positions in the sequence are sparsely dis-tributed and aﬀected by noise is addressed in

(Newson and Krumm 2009). The proposed

HMM-based map-matching algorithm has been experimen-tally evaluated with a GPS trace across Seattle, adding artiﬁcial noise in observations and removing points to obtain a sparse trace, and it proved itself quite successful. For this reason, it has been largely reused and improved by a number of authors in subsequent contributions.

2.2.2. A short account of existing contributions In the following, we brieﬂy describe the most signiﬁ-cant developments of the work by Newson and Krumm.

Lou et al. (2009) use a candidate graph model, in some way similar to the HMM-based solution outline

in (Newson and Krumm2009), that exploits the road

network and the spatial constraints on the trajectory. Such a solution was later improved by Yuan et al.

(2010), whose proposal takes into account the

weighted mutual inﬂuences among GPS points by

means of a suitable voting mechanism. Osogami

and Raymond (2013) make use of the model

(6)

apply a diﬀerent transition probability function, that, in order to estimate the probability of traversing a certain path, takes into consideration the number of turns. A similar approach is followed by Oran and Jaillet (2013), where an alternative transition func-tion, based on a cumulative proximity-weight formu-lation, is used.

The map-matching problem in the presence of cellu-lar positioning data entered the research agenda only very recently. Existing solutions diﬀer from the one given in the present work in the assumptions about the available data. More precisely, Thiagarajan et al. (2011)

propose a ﬁngerprint-based map-matching algorithm

that requires a very high sampling rate (about aﬁngerprint per second) and the availability of acceler-ometers and magnetacceler-ometers. Jagadeesh and Srikanthan

(2015) develop a variant of the HMM algorithm in

(Newson and Krumm2009). In order to address noisy

and sparse smart-phone location data (both Wi-Fi and cellular), that, unlike the case of cellularﬁngerprints, is already in the form of latitude/longitude pairs, they sui-tably change the probability functions.

All the above-described methods use data which can be collected by the device itself, e.g. ﬁngerprints. It is worth mentioning that there exists a diﬀerent class of methods that work with data collected by the cellular operator for purposes generally unrelated to positioning. This is the case, for instance, with call logs, which are available to the infrastructure, but, usually, not to the

client modem or handset. Leontiadis et al. (2014)

describe a solution that, in order to reconstruct the trajectory of a mobile user, exploits cellular handover logs from the network operator, which contain coarse spatial and temporal information as well as knowledge

about the coverage of the cells. A diﬀerent solution is outlined in a work by Schulze, Horn, and Kern (2015), where, in order to determine the most likely path, one constructs a street graph by using the sequence of visited cells.

3. Methods

In the following, we describe our solution to the problem of reconstructing the trajectory of a device in a scenario where only spatially noisy and temporally sparse data from cellularfingerprints are given. We assume that no GPS data from the device are available, and only rely on (i) basic information about the road network and (ii) cellularfingerprints to be matched against a database of previously-collectedfingerprints with an associated GPS position.

The proposed map-matching algorithm exploits a suitable adaptation of the HMM construction. One of its distinctive features is an original technique for the generation of the states of the probabilistic model, start-ing from theﬁngerprint observations, that makes use of a machine learning method based on decision tree ensembles (Viel et al.2018). The algorithm can also be easily adapted to deal with mixed sequences including

bothﬁngerprints and GPS ﬁxes.

The map-matching algorithm consists of four main steps (seeFigure 1for a graphical illustration): (1) preprocessing: theﬁrst step removes data which might be incorrect or simply redundant for reconstructing the trajectory;

(2) emission probability: by exploiting decision

trees, for each ﬁngerprint observation in the

(7)

sequence associated with a device, the second step retrieves the most relevantﬁngerprints in

the database and uses them to ﬁnd the most

probable roads nearby;

(3) transition probability: the third step connects the roads and assigns them a probability which depends on the given spatial and temporal constraints;

(4) trajectory generation:ﬁnally, in the fourth step, the most probable trajectory is generated by applying the Viterbi algorithm on HMM. Most of the above phases are implemented by means

of PL/pgSQL (Procedural Language/PostgreSQL)

(PostgreSQL Global Development Group 2008). PL/

pgSQL is a fully-ﬂedged programming language that allows much more procedural control than plain SQL. For instance, it gives the possibility to use loops and other control structures. As for the emission probability phase, we exploited the machine learning software Weka (Waikato Environment for Knowledge Analysis) (Hall

et al., 2009) for the generation of random forests.

Details about the data and tools used to compute the shortest path between pairs of points on the road map during transition probability phase are described in

Section 4.

3.1. Preprocessing

The preprocessing phase aims at removing data that does not really help in reconstructing the trajectory. As an example, a user travelling around in a city with a mobile phone can acquire a lot of observations even for a short trip. Moreover, he/she can stay almost still for a while, thus recording the same set of visible cells repeatedly. Finally, during such an activity, the measurements may be aﬀected by several errors (a list of typical measurement errors is reported in (Ulm, Widhalm, and Brndle2015)).

In addition, since cellular ﬁngerprints are generally

imprecise, using observations which are temporally very close can easily generate some noise, without producing any clear beneﬁt. Last but not least, a huge number of observations means a large HMM, and thus an increase in the time needed to compute the trajectory.

The problem ofﬁltering the collected trip data has been extensively investigated in the literature. However, it is not the main focus of the present work, and thus we limit ourselves to remove observa-tions that are temporally close to each other (below

10 seconds). More complex ﬁltering criteria can be

possibly adopted without aﬀecting the algorithm in any signiﬁcant way.

3.2. Emission probability

Once the unnecessary observations have been removed, the next step consists of building an

HMM in order to compute the most likely trajectory followed by the device on the map. HMM-based map-matching algorithms work by constructing a probabilistic model, basically a graph, which repre-sents the possible routes that a device may have taken on the basis of the available observations. For every location observation, a set of candidate road segments are identiﬁed, which becomes a set of hidden states of the model. Two states obtained by subsequent obser-vations are linked together by a transition, which represents a path between the two road segments. This representation exploits the connectivity of the road network.

The HMM building process starts with the identi-ﬁcation of the hidden states and the computation of their emission probabilities. The overall process can

be split into two phases, as shown in Figure 1: (i)

comparison of ﬁngerprints and (ii) selection of the

states and assignment of an emitted probability to each of them.

The similarity between a pair of ﬁngerprints is

assessed as follows. Initially, for each pair of ﬁnger-prints, a set of features, like the number of common cells and other characteristics related to the signal strength, is extracted. These features are then fed into a Random Forest model, as proposed in (Viel et al. 2018), trained to report the distance in meters

between the locations at which the ﬁngerprints were

observed. The output of the comparison phase is the

set of ﬁngerprints which are most similar to the

observed one. An example is given in Figure 2,

where the ﬁngerprints closest to the input one are

labelled as f1; f2; f3, and f4 (Figure 2(a)).

In HMM, each input ﬁngerprint (observation)

generates several states, which basically represent candidate points on the roads. A probability must be associated with any such state, which represents the probability of the device being in that place at that step of the trajectory. Since in cellular ﬁngerprints spatial information is not explicit, the computation of the states is more complex than in the case of GPS observations.

Once we have identiﬁed the most similar stored

ﬁngerprints, the next step consists of the identiﬁca-tion of the set of candidate road segments in the map by exploiting information about the road network. As a preliminary step, road segments longer than a certain threshold are split into multiple segments. Then, the output of the Random Forest, which is expressed in meters, is used as an upper bound to the search radius for the roads nearby (plus some tolerance). This makes the method more robust in

the case of a lacking ﬁngerprint map. As shown in

Figure 2(b), f1 and f2 are related to r1, f3 is related to

r1, r2, and f4 is related to r4, r5, and r6. It is worth

pointing out that roads r7, r8, and r9 do not intersect

(8)

around ﬁngerprints are, in general, diﬀerent, as they depend on the distance returned by the Random Forest algorithm.

Next, for each ﬁngerprint, we determine the set

of closest points on the corresponding roads com-puted by the previous step, which become the states of HMM. This is done on the basis of the Euclidean distance between the coordinates of the

stored ﬁngerprints and the road segments. As an

example, inFigure 2(c), the ﬁngerprint f3 produces

the states s3;1, s3;2, and s3;3 on the roads r1, r2, and

r3, respectively, while the ﬁngerprints f1 and f2

respectively produce the states s1;1 and s2;1 both

belonging to the road r1. Since similar ﬁngerprints

are generally close to each other, there can be multiple states on the same road segment. This is the case, for instance, with s2;1 and s3;1 in Figure 2

(c), which are related to two very close

ﬁnger-prints for the same observation (input ﬁngerprint).

If they are close enough, they are merged into one

single point (the blue point in Figure 2(c)), which

ﬁnally becomes a single state of the model (points s2;1¼ s3;1 in Figure 2(d)). Such an approximation

is fairly natural as, given the limited precision of ﬁngerprinting locations, it is not necessary to dis-criminate between close locations. Moreover, it reduces the number of states and, consequently, the running time. Finally, since we are interested in reconstructing the trajectory of the device, if an intermediate point is shifted along the same road,

the ﬁnal trajectory remains mostly unchanged.

The probability of each state of the model is essen-tially given by the distance reported by the Random Forest for theﬁngerprint that generated it. Formally, the emission probability for a state si, generated by

aﬁngerprint observation oj, is determined as follows.

First, we compute the values: pðojjsiÞ ¼ max

k dRFðoj; fkÞ dRFðoj; fcÞ (1)

where dRF is the distance returned by the Random

Forest model, fc is the ﬁngerprint that generated si,

and each fk is aﬁngerprint close to oj. The values are

then suitably normalised to obtain a proper probabil-ity measure. As a result, the states generated by a closerﬁngerprint have a higher probability.

In the case of a state which has been generated by merging multiple points, the probability is given by theﬁngerprint at the shortest distance. We would like to remark that we decided to take the lowest value, and not the sum, as otherwise, we would take into

account the density of the ﬁngerprint map in the

area, which does not depend on the current position of the device. Of course, this would be a reasonable

alternative if the ﬁngerprint map was collected in

a crowd-sourced effort and density was due to the inherent popularity of a location. Taking the average into consideration, instead, could increase the influ-ence of noise. Hinflu-ence, using the one at the shortest distance seems a reasonable choice, considering also that the model tends to overestimate rather than to underestimate the distance (Viel et al.2018).

Figure 2.An example of the process of computing the emission probability. It starts with the matching of closestfingerprints using RF (a). Then, it uses the RF distance in meters to select nearby roads for each matchedfingerprint (b). Next, the closest points on the roads (dotted circle in green) for eachfingerprint (solid circle in grey) are computed (c). Finally, there is a check for close enough points, which are merged together, producing the insertion of blue point in (c), thus obtaining thefinal set of points (d).

(9)

Last but not least, even though the proposed

algo-rithm has been designed to work with cellular

ﬁnger-prints, its structure makes it easy to adapt it to the

case of mixed sequences of ﬁngerprints and GPS

observations. In such a case, following the approach

outlined in (Newson and Krumm 2009), each GPS

measurement is matched to the nearest point of the nearby roads (up to a certain range, that is, 60 m). In

comparison to ﬁngerprinting, it can be seen as

a simpliﬁed case, because high precision coordinates are already available. These points become states of the model and a probability is assigned to them according to a Gaussian distribution which gives a higher probability to the road segments closer to the GPS point. The deviation is given by the accuracy reported by the GPS module.

More precisely, the emission probability for a state sj obtained by a GPS observation oi is deﬁned as:

pðojjsiÞ ¼ 1 ffiffiffiffiffiffi 2π p σexp d2 2σ2 ₍₂₎

where d is the distance between the point pj, from which

sjis obtained, and the GPS measurement oi. Usually, the

standard deviationσ is set to a ﬁxed value, obtained by

estimating the common noise of a GPS ﬁx. However,

using the accuracy reported by the GPS module could be arguably more accurate, since it varies with each measurement. Another difference between GPS and fin-gerprint observations, in the way we use them, is the range for searching road segments around observations. For GPS observations, we can choose afixed distance (60 m), considering that the precision given by GPS is gen-erally higher and the probability outside that range should be minimal in any case. Finally, as in the base case, road segments longer than a certain threshold are split into multiple segments. We use different thresholds for GPS and cellularfingerprints, 10m and 100m respec-tively, taking into account their different accuracy.

3.3. Transition probability

States and transitions are respectively paired with an emission and a transition probability. The former gives the likelihood that the measurement was taken in that location, the latter represents the probability that the path was taken.

As illustrated inFigure 3, in HMM, pairs of states generated by consecutive measurements are linked by a transition, that represents the path on the road network that connects the road segments correspond-ing to those states. The path is computed by the Dijkstra algorithm suitably conﬁgured to return the shortest route.

To assign a probability to transitions, we adopted the probability function proposed in (Jagadeesh and

Srikanthan 2015). Here, we only give a general

account of it, and refer the reader to (2015) for

details. It has a spatial component, inspired by

Newson and Krumm (2009), and a temporal

compo-nent, that deals with the diﬀerence between the time

actually elapsed from the ﬁrst measurement to

the second one and the estimated travel time. The spatial component of the probability function is given by the diﬀerence between the great-circle distance among the points on the road segments associated with the states and the length of the path that connects them following the road network. Given two states si and sj, which have been obtained

from two consecutive observations, its value is com-puted as follows:

Ddðsi; sjÞ Dsðsi; sjÞ

Δt (3)

where the function Ddreturns their driving distance,

the function Dscomputes their straight-line distance,

andΔt reports the time actually elapsed between the

two observations. The rationale behind such

Figure 3.In HMM, the map-matched path is computed by choosing a state and a transition for each observation. States from

(10)

a component of the function is that straight paths are more likely than tortuous ones and thus they are given a higher probability.

Since we are managing sparse and noisy observa-tions, it is clearly advantageous to exploit as much as possible information coming from the road network. As an example, on the basis of the length and the speed limits of the roads, it is possible to approximate the time necessary to travel a path.

The temporal component of the probability func-tion assigns a lower probability to paths which require a time higher than the actual travel time

given by the timestamps of the trajectory

measurements:

maxððTeðsi; sjÞ ΔtÞ; 0Þ

Δt (4)

where the function Tecomputes the estimated driving

time between the two states and Δt reports the time

actually elapsed between the two observations. As a matter of fact, when the estimated time is far higher than the actual travel time, the path is discarded since it is unrealistic (and would have been assigned a very low probability anyway). Notice that the opposite case, that is, the case in which the device takes more time than expected, is not penalised, as a long travel time may reﬂect bad traﬃc conditions. The two equa-tions are then combined as shown in (Jagadeesh and Srikanthan2015).

Finally, map data generally include a classiﬁcation of road segments as highway, pathway, and so on. Whenever the current travelling method is known, the map matching algorithm makes use of it to sim-plify the transition computation by ignoring those roads that cannot be used by the vehicle under con-sideration. This does not only improve the accuracy of the method, but it also improves the running time, as it reduces the search space of the Dijkstra algorithm.

3.4. Generating the trajectory

Once HMM is built, we can obtain a map-matched trajectory as the result of the merging of the more likely paths between each pair of consecutive states. To this end, we apply to well-known Viterbi

algo-rithm (Forney1973; Newson and Krumm2009).

An example is given inFigure 3, where the state to the left is the starting point for each possible sequence. For each observation in the sequence (from time t0 to time tn), there exists a set of

candi-date states. With reference to the example inFigure 2, the states associated with the observation taken at

time t1 are collected in column inside the red box.

A sample path from the starting node to the ending one is coloured in bright green.

It is worth pointing out that there is a chance that synthesising the trajectory turns out to be impossible. This happens, for instance, whenever the set of states, relative to a certain observation, with an incoming transition and the set of those with an outgoing transition are disjoint (as we already observed, some transitions could have been discarded because they require an unreasonable time to be travelled). There are three ways to cope with this unfortunate situation: keep all the transitions, remove all the states related to that observation from the model, or split the tra-jectory into two shorter ones. We opted for the sec-ond approach, because such a situation is an indication that those states are probably too far from the correct position and that explains the unrea-sonable time. In any case, this is a last resort method, and in fact it has been necessary just in one case in our evaluation.

4. Fingerprint data collection

In this section, we describe the collection of data, the way in which they have been acquired, and the tools that we used to test the proposed map-matching algorithm against a real scenario.

Data consist of a set offingerprints paired with GPS location information collected in an urban environ-ment. We chose to evaluate the proposed solution in an urban scenario, because it is characterized by quite complex road networks from which one generates a large number of road candidates. The presence of a lot of buildings, interfering with the signal propaga-tion, is an additional challenging factor to cope with. In comparison, reconstructing a trajectory in rural scenar-ios seems to be more straightforward and less error-prone. Moreover, to evaluate the robustness of the proposed solution, starting from the collected urban dataset, we generated and tested different scenarios by simply removing sets offingerprints.

Data have been collected in the historical centre of Udine, a medium-sized Italian city, in order to prove that the trajectory of a moving device can be reconstructed on

the basis of a set of ﬁngerprint observations with an

intrinsically low spatial accuracy. The overall data collec-tion process took several days using a Sony Xperia Z3 Compact phone equipped with a basic Android applica-tion developed for the purpose. In fact, data collecapplica-tion can be done with any cellular device capable of recording logical and physical parameters of visible cells, not neces-sarily a mobile phone. Different devices may obviously experience better (or worse) signal reception. However, these differences do not significantly affect the overall performance of the system as long as the same device is used for both sampling and testing.

As shown in Figure 4, the process of data

collec-tion involved over 4000 GSM ﬁngerprints. Each

(11)

and GPS-provided locations at the time of capture. Let us take a closer look at the data. Table 1 reports

a sequence of 10ﬁngerprints from the collected

data-set. The ﬁrst three columns respectively contain the

identifier of the fingerprint (ID), the serving cell identifier (Serving), and the capture time (Time). The fourth column (Cell Signature) contains the set of all visible cells, including the serving one, while the fifth column records the signal strengths (Signal Strength). It is worth pointing out that, even if not listed in the table, each record also includes other typical GPS information, such as position, altitude, speed, bearing, number of satellites, and accuracy.

In order to evaluate the proposed algorithm and its performance, besides a set offingerprints with the asso-ciated GPSfixes, it was necessary to collect some sparse fingerprint data, devoid of GPS information, describing a few paths. To this purpose, we extracted from the

entire set of observations three paths between diﬀerent points of interests in Udine, respectively path A, path B,

and path C (Figure 5). Those traces were originally

collected at a high frequency sample rate, with just a few seconds between each observation; in order to simulate the case of sparse observations, we artiﬁcially degraded the sequences to obtain lower sampling peri-ods of approximately 30, 60, and 120 seconds. In such a way, we have been able to obtain new traces with diﬀerent levels of sampling rate, that allowed us to measure the impact of the changes in the sampling period on the outputs of the proposed solution. An account of the distinctive characteristics of the consid-ered traces is given inTable 2.

All data were collected by bike. There are various motivations for such a choice. First of all, it is a common way of travelling, especially in small- and medium-sized cities. Second, it basically represents

Figure 4. Fingerprint observations collected in the historical centre of Udine, an Italian city. Map Data © OpenStreetMap contributors, CC BY-SA.

Table 1.An example of the collectedﬁngerprints (GPS data are omitted).

ID Serving Time Cell Signature Signal Strength 00 73896 17:46:03 73895,73896,74004,74091,75838 13,18,10,9,12 01 73896 17:46:18 73895,73896,74004,74091,75838,75841 15,19,12,13,13,12 02 73896 17:46:21 73895,73896,74004,74056,74091,75841 14,18,12,13,14,12 03 73896 17:46:25 73895,73896,74004,74056,74091,75861 13,19,12,12,13,12 04 73896 17:46:28 73895,73896,74004,74091,75838,75861 14,18,12,11,11,13 05 73896 17:46:31 73895,73896,74004,74091,75838,75861 14,18,12,10,11,13 06 73896 17:46:35 73895,73896,74004,74091,74685,75861 14,17,12,11,10,13 07 73896 17:46:39 73895,73896,74004,74056,74091,75861 14,18,14,13,14,13 08 73896 17:46:42 73895,73896,74004,74056,74091,75861 15,19,14,15,14,12 09 73896 17:46:46 73895,73896,74056,74091,75838,75841 14,22,12,13,13,12 10 73896 17:46:48 73896,74091,75838,75841 21,13,14,14

(12)

a middle ground between car driving and walking. On the one hand, a cyclist is normally slower than a car, apart from the case of abnormal traﬃc situa-tions, and thus the travelled distance is shorter, which helps map matching in the presence of sparse infor-mation. On the other hand, a cyclist has more free-dom of movement, being able to ride his/her bike in several streets and paths that a car simply cannot access. Finally, as pointed out inSection 3, knowledge about the travelling method can be used to restrict the possible routes, resulting in an improvement in both running time and accuracy. The fact that all the paths were traversed by bike allowed us to exclude during computations some of the roads, like the ones classiﬁed as foot-way or steps.

The above data was imported in a PostgreSQL

(PostgreSQL Global Development Group 2008)

rela-tional DataBase Management System (DBMS). To deal with spatial data, we used its spatial extension

PostGIS (PostGIS Project Steering Committee 2018).

Information about the road network, for the area

under consideration, was obtained from

OpenStreetMap. Once information about the relevant points on the road was computed, the additional library pgRouting for PostgreSQL (pgRouting

Com-munity 2018) has been exploited to determine the

shortest paths between pairs of points of the road network. In particular, we used Dijkstra’s algorithm

to compute the shortest paths between road points representing the states in the model.

5. The execution of the proposed map-matching algorithm

The algorithm described in Section 3 has been

exe-cuted on the dataset given in Section 4 (Figure 5),

including three extracted test paths and a set of GPS-tagged observations. For each path, we considered diﬀerent sampling frequencies.

To explain how the algorithm works, we illustrate in detail some of its steps with respect to one of the testing paths, namely, the trace C. The trace is composed of

122 ﬁngerprints, collected at the locations shown in

Figure 6(a) (red points). To simulate the sparse obser-vation scenario, we retain one obserobser-vation per minute. The resulting picture is depicted in Figure 6(b) (red-yellow stars), which only contains 8 observations. Obviously, the GPS positions of path C are not con-sidered by the algorithm; they are only used to show how it works.

The ﬁrst step involves the preprocesssing of the

sequence of observations for removing the unneces-sary or erroneous data. This is crucial step, especially for high sampling rates, for which observations which are unnecessarily close together may generate too many states. In the considered example, however, where the sampling period is 60 seconds, such a ﬁlter-ing step turned out not to be necessary.

Then, each ﬁngerprint from the processed

sequence is compared to those stored into the data-base using the Random Forest model to identify the

most similar ones (Section 3). The training of the

Random Forest was done over theﬁngerprint dataset

by taking a random subset of all the observations and Figure 5.The three paths between diﬀerent points of interests collected in the city of Udine © OpenStreetMap contributors, CC BY-SA.

Table 2.Characteristics of the traces used in the evaluation.

Trace A B C Length (m) 1227 1846 1722 Time (min) 12 9 6 Observations 112 111 101 Observations (30s) 22 16 13 Observations (60s) 12 8 7 Observations (120s) 6 5 4

(13)

coupling them with otherﬁngerprints to obtain pairs.

Then, from each pair ofﬁngerprints, we extracted the

features that are fed to the tree, which is trained to predict the distance in meters between the positions where they were collected. The result of the

compar-ison process is shown in Figure 6(c), where the

ﬁn-gerprints closest to observations are highlighted as green points.

On the basis of the estimated distances, the algo-rithm extracts the candidate roads, which are

high-lighted in Figure 6(d). Next, matching points on the

road segments are computed, thus obtaining the states of HMM with the associated emitted probabil-ities. As a matter of fact, all the considered observa-tions ended up with at least one state.

Then, in order to assign a transition probability to each pair of consecutive states, the algorithm searches for the shortest route between each state and all the next ones in the sequence.

The last step of the algorithm is the reconstruction of the trajectory by means of the generated HMM. This is done by executing the Viterbi algorithm on the model, which produces the most likely path among the states. The path generated by the algorithm in the considered scenario (Figure 6(b)) is shown inFigure 9(d).

The results of the execution of the algorithm on the three paths at diﬀerent sampling rates are reported in

Figures 7,8, and9. Trace A is the shortest and straightest one among the three traces. InFigure 7, it is possible to see that the ground truth and the generated trajectories are quite similar for all the sampling frequencies. In particular, notice that the reduction of the frequency did not aﬀect the shape of the trajectory, apart from shortening the path thanks to the initialization of the algorithm. Trace B, reported inFigure 8, is more twisted and longer than trace A. The path generated with sam-pling intervals at 30 seconds is the one thatﬁts best to the original track. Increasing the sample rate probably intro-duces noise that prointro-duces some deviations from the original path as a result of trying to include some outliers. On the other hand, reducing the sample rate up to 120 seconds did not degrade too much the reconstruction of the trajectory. Trace C, depicted inFigure 9, goes from the train station to a university building, passing through the city centre. It is longer than the other two traces. In this case, when moving from 30 seconds sampling

onward to 120 seconds, the generated path ﬁts quite

well with respect to the ground truth. Increasing the sampling frequency over 30 seconds produces minor outliers that can be explained as a detour to reach some Figure 6.Some intermediate steps of the processing on the trace C, leaving only one observation every 60 seconds. Map Data © OpenStreetMap contributors, CC BY-SA.

(14)

candidate points quite close to the trajectory, but not really on the original path.

We would like to conclude the section with some remarks about the computational eﬃciency of the positioning algorithm we exploited, which is one of distinctive features of the proposed solution. As shown in detail in Viel et al. (2018), in comparison to the classic Euclidean one, the Random Forest posi-tioning method is an order of magnitude slower in computing a single distance estimation, but it turns out to be more accurate. Moreover, the time required by the Random Forest to compute the estimation

(one millisecond) is deﬁnitely exceeded by the cost

of querying the database to retrieve the relevant

ﬁn-gerprints, leading to an overall cost of about 30 milliseconds.

6. Experimental evaluation

In order to evaluate the behaviour of the proposed algorithm on test data, we used some metrics to analyse the accuracy and assess whether it changes

when lowering the sampling rate (Section 6.1).

Figure 7.Path A: ground-truth and map-matched trajectory at various samplings. The path goes from the top to the bottom. Map Data © OpenStreetMap contributors, CC BY-SA.

Figure 8. Path B: ground-truth and map-matched trajectory at various sampling rates. The path goes from the top to the bottom. Map Data © OpenStreetMap contributors, CC BY-SA.

Figure 9.Path C: ground-truth and map-matched trajectory at various samplings. The path goes from the bottom to the top. Map Data © OpenStreetMap contributors, CC BY-SA.

(15)

Moreover, to investigate the influence of changes in the quantity and the distribution of the stored finger-prints on the behaviour of the system, we also simu-lated the removal of some of them from the original GPS-taggedfingerprint dataset (Section 6.2). Finally,

we tested a mixed solution whereﬁngerprint

observa-tions are randomly enriched with sparse GPS obser-vations (Section 6.3).

6.1. Accuracy results

The accuracy evaluation of the algorithm was done using two standard metrics, namely, precision and recall, respectively denoted by P and R, which are deﬁned as follows:

P¼length of the correct matched segments

length of the generated trajectory

(5)

R¼length of the correct matched segments

length of the ground truth trajectory (6) These metrics are commonly used in map matching as they focus on the similarity between the generated trajectory and the one actually followed by the device, rather than looking at the position where each

observation is matched (Jagadeesh and Srikanthan

2015; Thiagarajan et al.2011).

Precision and recall values are reported inTable 3. Interestingly, degrading the sequences by decreasing the sampling does not always change their values significantly. In a few cases, such a decrease produces the opposite effect, like in path B where the lowest precision (62%) is obtained when the sampling is the highest. The effect is even more evident in Figure 10, which describes the impact of the changes in the sampling period of the sequences on the recall. Notice that this supports the choice of removing

redundantﬁngerprints during the preprocessing step.

We conclude by commenting on the spatial char-acteristics of the generated paths. In almost all the cases, the most noticeable error in the trajectories is that they result in shortening of the ends. This is probably due to the fact that a shorter trajectory usually has a higher probability in the model, and can thus favour this kind of behaviour at the extremes of the path.

6.2. Inﬂuence of the ﬁngerprint map size

Let us now briefly discuss the influence of the finger-print map size on map matching. The algorithm described in the previous sections makes use of

a ﬁngerprint database which maps quite densely the

city centre of Udine, obtaining a map-matched tra-jectory very similar to the one actually followed by the device. However, assuming a similar density for a larger region can be unrealistic, as there can

cer-tainly be areas which have a lower number of

ﬁnger-prints stored in the database.

To simulate such a scenario, where a sparse

ﬁn-gerprint map is available, the map-matched trajec-tories were recomputed multiple times, using

increasingly larger random subsets of all the

ﬁnger-prints available in the database. We set three thresh-olds at 50%, 70% and 100% of the full database. In

Table 4, we show the resulting precision and recall Table 3. Accuracy of the map-matched trajectory using

ﬁn-gerprint observations.

Trace Sampling time (s) P (%) R (%)

A 0 100 96 30 100 88 60 100 92 120 100 80 B 0 62 80 30 100 97 60 99 90 120 96 90 C 0 98 95 30 100 95 60 100 99 120 83 85 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Trace A Trace B Trace C

0 s 30 s 60 s 120 s

(16)

values. It is worth pointing out that removing

ran-dom ﬁngerprints from the map is stronger than

removing them uniformly. The random removal of ﬁngerprints is indeed better for testing the robustness

of the algorithm in the case of aﬁngerprint map with

low density.

To make it easier to follow, Figure 11 shows the

recall values for increasingly larger subsets of the ﬁngerprint map, maintaining the sampling period ﬁxed at 120 seconds. Looking at these values, it is possible to notice that there is no strong correlation

between theﬁngerprint map size and the accuracy of

the trajectory. Even with just half of the ﬁngerprint

map available, the accuracy of the generated trajec-tory remains basically unchanged. In most cases, the reported values are very similar and, in certain situa-tions, a larger subset produced a worse trajectory.

6.3. Using mixed GPS andﬁngerprint

measurements

The proposed map-matching algorithm is ﬂexible

enough to deal also with mixed sequences of cellular ﬁngerprints and GPS observations. In general,

obtain-ing a GPS ﬁx is battery-wise more costly than

a passively collected ﬁngerprint. However, GPS

observations are particularly useful when the device is moving in an area with no cellular radio coverage,

or when there are no matching ﬁngerprints to

com-pare within the database.

Hence, we decided to check whether a small num-ber of GPS observations can improve the accuracy of the map-matched trajectories. To this end, we

ran-domly substituted about a third of theﬁngerprints of

each sequence with the corresponding GPS positions and run the map-matching algorithm on the

modi-ﬁed data. Table 5 shows the accuracy results in the

case of the mixed sequence.

In principle, GPS observations should work as high-precision anchors reducing the uncertainty in HMM. Hence, the expected result was a better esti-mation of the trajectory. As a matter of fact, that was not the case: compared to the results in Table 3, the improvement obtained from a mixed sequence is modest. Moreover, in most cases, the precision and recall results do not change. We can interpret such an outcome as a further conﬁrmation of the quality of the output of the algorithm when applied to

sequences ofﬁngerprints only. The added GPS points

would have indeed changed completely the resulting trajectories if they were oﬀ from the real path. As Table 5. Accuracy of the map-matched trajectory using mixed GPS/ﬁngerprint observations.

Trace Sampling time (s) P (%) R (%)

A 0 94 96 30 100 88 60 100 92 120 100 88 B 0 70 85 30 100 97 60 100 90 120 77 97 C 0 100 95 30 100 95 60 96 99 120 87 81 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

50% 75% 100%

Figure 11.Recall measures at 120 seconds of sampling for the diﬀerent traces while taking increasingly larger subsets of the ﬁngerprint map.

Table 4.Accuracy of the map-matched trajectory with di ﬀer-ent subsets ofﬁngerprints.

Trace A Trace B Trace C Sampling time Subset P R P R P R (s) (%) (%) (%) (%) (%) (%) (%) 50 100 100 71 83 98 95 0 75 100 100 65 80 98 95 100 100 96 62 80 98 95 50 100 92 80 76 100 90 30 75 100 88 64 83 100 95 100 100 88 100 97 100 95 50 100 88 99 90 73 59 60 75 100 92 73 90 100 99 100 100 92 99 90 100 99 50 99 80 66 83 87 86 120 75 99 80 95 90 87 85 100 100 80 98 90 83 85

(17)

depicted inFigure 12, the Recall for the 120 seconds sampling with mixed map matching is even lower in the trace C. This is probably related to the outcomes of adding a new GPS point, that forces a detour.

7. Conclusions

In this paper, we proposed an original map-matching

algorithm aiming at coping with sparse cellular

ﬁn-gerprint observations. The algorithm tries to recon-struct the path of mobile devices exploiting only the signals received from the cellular networks. It pre-sents some similarities to recent research

contribu-tions focusing on the problem of ﬁguring out the

path followed by a device on the basis of sparse GPS observations. However, it only exploits cellular ﬁngerprint observations. Cellular ﬁngerprints, unlike GPS measurements, do not contain explicit spatial information and thus cannot be easily mapped into geographical coordinate pairs. The main contribution of the paper is a novel technique for the generation of

HMM states exploiting cellular ﬁngerprints, without

the need of converting them into a single location. To

assess the similarity amongﬁngerprints, it makes use

of a state-of-the-art machine learning approach based on decision tree ensembles.

The proposed map-matching algorithm was tested on the urban environment of the city centre of Udine, a medium-sized Italian city. The outcomes of the experimental evaluation showed that it is actually possible to reconstruct the trajectory of a moving

device with high accuracy using just sparse

finger-print cellular observations. Moreover, it showed that increasing progressively the sampling interval for the sequence of observations, the accuracy of the map-matched trajectory does not exhibit a significant wor-sening. A possible explanation is that the accuracy of fingerprinting is not directly influenced by the spar-sity of observations. Indeed, in our testing scenario, we experienced that the trajectory of a device can be reconstructed by means of a small number of

observations, a higher amount of possibly noisy mea-surements being often useless. The evaluation phase also tested the impact of varying the density of the ﬁngerprint map on the generated path. The experi-mental results demonstrated that it is not strictly

necessary to have a dense ﬁngerprint map to obtain

a good accuracy. Indeed, even using a small subset (even half) of the available data, the generated trajec-tory showed an accuracy similar to the one obtained

by using the fullﬁngerprint map. Finally, we

consid-ered the case in which one deals with sparse

observa-tions mixing ﬁngerprints and GPS ﬁxes. The

experiments showed that, in most of the cases, the insertion of a small number of GPS observations did not change the overall resulting trajectory, which was always quite aligned to the ground truth.

As for future work, we are thinking of further testing the proposed map-matching algorithm in other scenarios, e.g. on larger areas. The experimen-tal evaluation involved a small portion of a city, and the algorithm took a reasonable amount of resources to compute a trajectory. It would be interesting to check whether the algorithm scales well when copy-ing with a larger area and a high number of nodes to process. Since the experimental results made use of observations taken from a single device, it would be interesting to evaluate also the impact of the use of a dataset where contributions are collected by a set of diﬀerent devices. Finally, the availability of data coming from several devices makes it possible to compare the generated trajectory to the pre-viously stored ones. Such a possibility has been already explored in the literature in the GPS case, while, to the best of our knowledge, it has not been systematically addressed in the case of cellular ﬁngerprints.

Notes on contributors

Andrea Dalla Torre got his PhD at the University of Padova (Italy) in 2004 and he is currently manager in the

0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%

Fingerprints Mixed

(18)

Location Service Team in u-blox Italia S.p.A. His work is focusing on the positioning feature in cellular modems.

Paolo Gallo obtained his MSc in Computer Science from the University of Udine (Italy) in 2009. Currently, he is research fellow at the University of Udine. His main research interests are in data science and geographical information systems.

Donatella Gubianireceived her PhD at the University G. d'Annunzio of Chieti-Pescara (Italy) in 2008 and she is currently a researcher in Computer Science at the University of Nova Gorica (Slovenia). Her research inter-ests mainly focus on data science and spatiotemporal databases.

Chris Marshall graduated from Cambridge University in 1980, and then gained his PhD at Imperial College, London (UK). His research and development interests have spanned wireless and positioning technologies, and he is currently a senior principal engineer at u-blox in the UK.

Angelo Montanari got his PhD at the University of Amsterdam (The Netherlands) on 1996 and he is currently full professor of Computer Science at the University of Udine (Italy). His major research interests are in formal methods, AI, and spatiotemporal databases.

Federico Pittinoreceived his PhD in co-tutelle between the University of Udine (Italy) and the Institut Polytechnique de Grenoble (France) in 2015. He has worked at u-blox on localisation algorithms and is currently a researcher at Carinthian Tech Research (Austria) focusing on machine learning, AI, and data science.

Andrea Viel received his PhD at the University of Udine (Italy) in 2019 and he is currently working at u-blox Italia S.p.A. on location-based services. His research interests are in data science and spatiotemporal databases.

ORCID

Donatella Gubiani http://orcid.org/0000-0001-7684-5200

Federico Pittino http://orcid.org/0000-0003-3248-5391

References

Algizawy, E., T. Ogawa, and A. El-Mahdy. 2017. “Real-Time Large-Scale Map Matching Using Mobile Phone Data.” ACM Transactions on Knowledge Discovery from Data 11 (4): 52:1–52:38.

Aly, H., and Y. Moustafa. 2013. “Dejavu: An Accurate Energy-Eﬃcient Outdoor Localization System.” In Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM-GIS), Orlando, USA, November 5–8. Aly, H., and Y. Moustafa. 2015. “semMatch: Road

Semantics-Based Accurate Map Matching for Challenging Positioning Data.” In Proceedings of the 23rd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM SIGSPATIAL GIS), Seattle, Washington, November 3–6. Bahl, P., and V. N. Padmanabhan. 2000. “RADAR: An in-Building RF-based User Location and Tracking System.” In Proceedings of the 19th Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM), Tel Aviv, Israel, March 26–30.

Benikovsky, J., P. Brida, and J. Machaj.2010.“Localization in Real GSM Network with Fingerprinting Utilization.” In Proceedings of the 2nd International Conference on Mobile Lightweight Wireless Systems (MOBILIGHT), Barcelona, Spain, May 10–12.

Breiman, L.2001.“Random Forests.” Machine Learning 45 (1): 5–32.

Chen, M. Y., T. Sohn, D. Chmelev, D. Haehnel, J. Hightower, J. Hughes, A. LaMarca, F. Potter, I. Smith, and A. Varshavsky. 2006. “Practical Metropolitan-Scale Positioning for GSM Phones.” In Proceedings of the 8th International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), Orange County, CA, September 17–21.

Deblauwe, N. 2008. GSM-based Positioning: Techniques and Applications. Belgium: Vrije Universiteit Brussel . Forney, G. D.1973. “The Viterbi Algorithm.” Proceedings

of the IEEE 61 (3): 268–278.

Goh, C. Y., J. Dauwels, N. Mitrovic, M. T. Asif, A. Oran, and P. Jaillet. 2012. “Online MapMatching Based on Hidden Markov Model for Real-Time Traﬃc Sensing Applications.” In Proceedings of the 15th International Conference on Intelligent Transportation Systems (ITSC), Anchorage, AK, September 16–19.

Hall, M., E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. 2009. “The WEKA Data Mining Software: An Update.” ACM SIGKDD Explorations Newsletter 11 (1): 10–18.

Hashemi, M., and H. A. Karimi. 2014.“A Critical Review of Real-Time Map-Matching Algorithms: Current Issues and Future Directions.” Computers, Environment and Urban Systems 48: 153–165.

Hoy, J.2015. Forensic Radio Survey Techniques for Cell Site Analysis. Chichester: John Wiley & Sons.

Jagadeesh, G. R., and T. Srikanthan. 2015. “Probabilistic Map Matching of Sparse and Noisy Smartphone Location Data.” In Proceedings of the 18th International Conference on Intelligent Transportation Systems (ITSC), Las Palmas de Gran Canaria, Spain, September 15–18. Juang, B. H., and L. R. Rabiner. 1991. “Hidden Markov

Models for Speech Recognition.” Technometrics 33 (3): 251–272.

Jurafsky, D., and J. H. Martin.2009. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. New Jersey: Prentice Hall.

Kjærgaard, M. B., and C. V. Munk. 2008. “Hyperbolic Location Fingerprinting: A Calibration- Free Solution for Handling Diﬀerences in Signal Strength (Concise Contribution).” In Proceedings of the 6th International Conference on Pervasive Computing and Communications (PerCom), Hong Kong, March 17–21. Leontiadis, I., A. Lima, H. Kwak, R. Stanojevic,

D. Wetherall, and K. Papagiannaki. 2014. “From Cells to Streets: Estimating Mobile Paths with Cellular-Side Sata.” In Proceedings of the 10th International Conference on Emerging Networking EXperiments and Technologies (CoNEXT), Sydney, Australia, December 2–5.

Li, X., X. Zhang, K. Chen, and S. Feng. 2014. “Measurement and Analysis of Energy Consumption on Android Smartphones.” In Proceedings of the 4th International Conference on Information Systems and Technologies (ICIST), Valencia, Spain, March 22–24. Lou, Y., C. Zhang, Y. Zheng, X. Xie, W. Wang, and Y. Huang.

2009. “Map-Matching for Low- Sampling-Rate GPS Trajectories.” In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic