Geographical and Topological Analysis of Internet Routing

(1)

UNIVERSITÁ DIPISA

DOTTORATO DI RICERCA ININGEGNERIA DELL’INFORMAZIONE

G

EOGRAPHICAL AND

T

OPOLOGICAL

A

NALYSIS OF

INTERNET

ROUTING

DOCTORALTHESIS

Author

Massimo Candela

Tutors

Prof. Alessio Vecchio Prof. Giuseppe Di Battista

Reviewers

Randy Bush

Prof. Marco Chiesa Prof. Ioannis G. Tollis

The Coordinator of the PhD Program

Prof. Fulvio Gini

Pisa, March 2021 XXXIII

(2)

(3)

Summary

T

HEInternet is a critical infrastructure in our society. The collection of data about

its functioning and the enrichment, analysis, and visualization of such data are activities of paramount importance, from both a research and an operational point of view. The routing is what allows data to flow between devices connected to the Internet. At any time, the protocols that make the Internet work can establish different network paths to follow in order to deliver such data. Revealing routing behaviors allows operators to design, maintain, and diagnose the Internet and enables researchers to better understand it in order to improve it.

In this thesis, we present several approaches for the analysis and visual representa-tion of Internet routing data.

We initially focus on the geolocation of routers and servers that make the Internet work. We describe and evaluate an active geolocation service, called RIPE IPmap. This system is able to estimate the location of a connected device by performing latency measurements towards it. We then study several factors influencing active geolocation processes, and provide an estimation of the maximum theoretical accuracy achievable worldwide by a generic active IP geolocation system, establishing in this way a baseline for future research on the topic.

Further, we tackle the problem of supporting network operators and researchers per-forming routing data analysis by designing and testing new visualization metaphors that can automatically represent such data. In particular, we introduce Radian, a tool implementing a new visual metaphor for the representation of evolving network topolo-gies. We develop new algorithms for drawing dynamic clustered graphs evolving over time, able to support important cognitive aspects during network operations. We then focus on the inter-domain routing case and present Upstream Visibility, a visualization supporting the monitoring of network reachability. While doing so, we define and ex-periment with several heuristics, contributing to the generic problem of automatically generating readable stacked area chart diagrams.

Finally, we apply the methodologies devised to analyze some Internet behaviors at scale. First, we study phenomena of periodic topology changes in network mea-surement data and identify the causes behind some of them. Second, we analyze the

(4)

routing situation in the Middle East, characterized by high latencies caused by neigh-boring countries not connecting with each other. Third, we perform the first large-scale evaluation of the impact of the COVID-19 pandemic on Internet latency, showing how the increased amount of online activities affected Internet performance.

(5)

List of publications

International Journals

1. Candela, M., Luconi, V., Vecchio, A. (2020). Impact of the COVID-19 pandemic on the Internet latency: A large-scale study. Computer Networks, 182, 107495. 2. Du, B., Candela, M., Huffaker, B., Snoeren, A. C., Claffy, K. (2020). RIPE IPmap

active geolocation: mechanism and performance evaluation. ACM SIGCOMM Computer Communication Review, 50(2), 3–10.

3. Candela, M., Di Battista, G., Marzialetti, L. (2020). Multi-View Routing Visu-alization for the Identification of BGP Issues. Journal of Computer Languages, 100966.

4. Iodice, M., Candela, M., Di Battista, G. (2019). Periodic path changes in RIPE atlas. IEEE Access, 7, 65518–65526.

5. Candela, M., Gregori, E., Luconi, V., Vecchio, A. (2019). Using RIPE Atlas for geolocating IP infrastructure. IEEE Access, 7, 48816–48829.

6. Candela, M., Di Bartolomeo, M., Di Battista, G., Squarcella, C. (2018). Radian: Visual exploration of traceroutes. IEEE Transactions on Visualization and Com-puter Graphics, 24(7), 2194–2208.

International Conferences/Workshops with Peer Review

1. Ariemma, L., Liotta, S., Candela, M., Di Battista, G. (2021). Long-lasting Se-quences of BGP Updates. In Proceedings of the International Conference on Pas-sive and Active Network Measurement (pp. 213–229).

2. Candela, M., Gregori, E., Luconi, V., Vecchio, A. (2019). Dissecting the Speed-of-Internet of Middle East. In IEEE INFOCOM 2019-IEEE Conference on Com-puter Communications Workshops (pp. 720–725).

3. Marzialetti, L., Candela, M., Di Battista, G. (2018). Upstream Visibility: A Multi-View Routing Visualization. In Proceedings of the 11th International Symposium on Visual Information Communication and Interaction (pp. 80–87).

(6)

(7)

List of Abbreviations

A

AFRINIC African Network Information Centre. 9

AM Anchor Measurement. 114

AMD Anchor Measurement Derived (dataset).

114

API Application Programming Interface. 10

APNIC Asia Pacific Network Information Centre. 9

ARIN American Registry for Internet Numbers. 9

AS Autonomous System. 5

B

BGP Border Gateway Protocol. 5

C

CAIDA Center for Applied Internet Data Analysis. 10

CBG Constraint-Based Geolocation. 14

CDF Cumulative Distribution Function. 59

CDN Content Delivery Network. 9

CNR Consiglio Nazionale delle Ricerche. 11 COVID-19 Coronavirus Disease 2019. 111

CP Collector Peer. 11

CRLB Cramér–Rao Lower Bound. 33

D

DDoS Distributed Denial-of-Service. 83

DFS Depth First Search. 67

DNA Deoxyribonucleic Acid. 93

DNS Domain Name System. 9

(8)

List of Abbreviations

E

EC/CAHP Enterprise Customers and

Content/Ac-cess/Hosting Providers. 24 F

FIB Forwarding Information Base. 5

FIM Fisher Information Matrix. 39

G

GCD Great-Circle Distance. 23

GPS Global Positioning System. 7

H

HTTP Hypertext Transfer Protocol. 6

I

ICMP Internet Control Message Protocol. 6 IIJ Internet Initiative Japan. 9

IP Internet Protocol. 5

ISO International Organization for Standardiza-tion. 13

ISP Internet Service Provider. 5

ITDK Macroscopic Internet Topology Data Kit.

11

ITU International Telecommunication Union.

115

IXP Internet eXchange Point. 9

L

LACNIC Latin America and Caribbean Network In-formation Centre. 9

LAN Local Area Network. 106

LTP Large Transit Providers. 24

M

MANIC Measurement and ANalysis of Internet

Congestion. 11

MPLS Multiprotocol Label Switching. 93

MSE Mean Square Error. 34

N

NLNOG Netherlands Network Operator Group. 11

NOC Network Operations Center. 73

NP-hard Non-deterministic Polynomial-time hard-ness. 16

(9)

List of Abbreviations

O

OECD Organization for Economic Co-operation

and Development. 113

OWD One-Way Delay. 7

R

REST Representational State Transfer. 19

RFC Request for Comments. 13

RIPE Réseaux IP Européens. 9

RIPE NCC Réseaux IP Européens Network Coordina-tion Centre. 8

RIR Regional Internet Registry. 8

RIS Routing Information Service. 10

RMSE Square Root of the Minimum Mean Square

Error. 34

RRC Remote Route Collector. 10

RTT Round-Trip Time. 6

S

SCC Strongly Connected Component. 60

SOI Speed of Internet. 8

STP Small Transit Providers. 24

T

TBG Traceroute-Based Geolocation. 14

TCP Transmission Control Protocol. 6

TSLP Time-Sequence Latency Probing. 11

TTL Time to Live. 7

U

UDM User-Defined Measurement. 10

UDMD User-Defined Measurement Derived

(dataset). 115

UTC Universal Time Coordinated. 123

V

(10)

CHAPTER

1

Introduction

The Internet is an essential asset for the private and working lives of most people, as well as the administrative and strategic activities of a country, including health care and education. This makes it a part of the critical infrastructure, together with the networks of electricity, gas, and water distribution.

The Internet is composed of networks managed by different players called Inter-net Service Providers (ISPs). A heterogeneous set of protocols, operating at different abstraction levels, are in place to allow communication between customers and ser-vices located in different parts of the world and in different networks. The connections among ISPs change over time for technical and business reasons. At any time, the pro-tocols can establish a different network path to follow in order to deliver data between endpoints on the Internet. This process of selecting a path, both in a network and across networks, is called routing.

In order to scale the network while providing an acceptable quality of service, ISPs need to constantly monitor the behavior and performance of the protocols involved in the functioning of the Internet. Such activity is essential; without it the network and its routing would be a partial black box. Monitoring the network corresponds to at least three mandatory steps: (i) data collection and enrichment (which may or may not include the active execution of measurements); (ii) automated analysis of the data; and (iii) presentation of the data in a way that is easily interpreted. These steps allow ISP operators to prevent network issues and to troubleshoot when such issues arise. Additionally, they enable decision makers to plan for the expansion of both network coverage and offered services.

While collecting data is an enabling factor for the other two steps, it is not sufficient on its own to give network operators a grasp on network behavior or to add any value for day-to-day operations. If we consider “a network”, monitored in isolation from the rest

(15)

Chapter 1. Introduction

of the other networks which make up the Internet, we could easily collect huge amounts of data about that network only. The challenge would be cleaning, interpreting, and summarizing such data in order to be useful for actual decisions. In reality, the situation is extremely more challenging because analyzing a network in isolation is rarely useful. ISPs have to observe how their network interacts with the other networks in order to have end-to-end visibility about the services they offer.

In this thesis, we present several approaches for the analysis and visual representa-tion of Internet routing.

First, we design and evaluate mechanisms enabling the geographic analysis of rout-ing data. The geographic aspect of Internet routrout-ing is fundamental for the interpretation of some indicators of performance and reliability. For example, the latency measured between two devices cannot be easily characterized if we do not know the physical space between such devices. The same can be extended to the topological data (the sequence of network devices traversed by data exchanged by two network endpoints), which is both difficult to interpret and of limited use when troubleshooting network degradations in the absence of geographical data. Such information is so important that some network operators have adopted various strategies for annotating the positions of the network devices they manage. The geographical information is also commonly used: (i) to deliver content in an optimized way, by serving such content from loca-tions closer to the customer; (ii) to optimize the routing between two endpoints, by reducing physical distances and latencies involved in the communication; (iii) to re-spect country-specific regulations about traffic locality or copyright; and (iv) to analyze coverage data and plan network improvements and expansions in specific geographic areas. However, the task of geolocating a network device is far from easy. Many of the approaches adopted until now rely on humans manually providing such information. Moreover, due to the dynamic and decentralized nature of the Internet, geographical data quickly becomes stale and difficult to validate; hence the need for automatic olocation calculation and enrichment of network data. In this thesis, we focus on the ge-olocation of routers and servers that allow the Internet to work. We design and evaluate an active geolocation service able to automatically estimate the location of a connected device by performing latency measurements towards it. We then study several factors influencing active geolocation processes, and provide an estimation of the maximum theoretical accuracy achievable worldwide by a generic active IP geolocation system, thereby establishing a baseline for future research on the topic.

Further, we tackle the problem of supporting operators in their daily monitoring and troubleshooting by designing and testing new visualization metaphors able to auto-matically condense routing data. Troubleshooting means understanding and resolving events that alter the normal operation of the network. For example, any of the devices involved in the network could stop forwarding data due to overload, a bug, or mal-function. This can disrupt or degrade the services offered. In general, troubleshooting starts with the analysis of network topology data, often enriched with other metrics (e.g., latency, geolocation, or path length). For example, the use of a wrongly located server to deliver content to the user may cause suboptimal paths and latencies. Con-versely, an increase in latency on a path segment may justify changes in the routing between multiple endpoints sharing that segment. However, analyzing and visualizing network data is not only about troubleshooting. Network expansion and improvements

(16)

are also part of the daily tasks of a network operator. New devices and links can be added to the network, and new configurations can be deployed in order to balance the load, increase redundancy, or achieve different levels of service quality. Following a new configuration, the operators need to inspect the network and compare it to a pre-vious state in order to verify that what they deployed is achieving the desired effects. In conclusion, operators need tools to maintain and scale the network. In particular, such tools are essential to quickly identify whether and where a failure occurred and to evaluate the extent of its impact. User interfaces able to condense such information are therefore essential. Various protocols operate at different abstraction levels to support the functioning of the network; all those abstraction levels are involved in routing and, therefore, in potential network issues. As such, each of the abstraction levels needs to be monitored. In this thesis, we introduce a new visual metaphor for the representation of evolving network topologies. We develop new algorithms to produce dynamic clus-tered graphs that evolve over time and are able to support important cognitive aspects during network operations. We then focus on the inter-domain routing case and present a visualization to support network reachability monitoring. In doing so, we define and experiment with several heuristics, contributing to the generic problem of automatically generating readable stacked area chart diagrams.

Finally, we demonstrate several applications of the devised techniques in the anal-ysis of Internet behaviors at scale. The development and evaluation of methodologies for the analysis and visualization of network data is scientifically important as well as having operational implications: the developed techniques may enable further research which, in turn, can bring more innovation. Such techniques can be used to better charac-terize the Internet or to study its behavior and evolution over time or during unexpected historical events, such as a pandemic.

The thesis is organized as follows.

Chapter 2 introduces notions and terms used throughout the rest of the thesis. First, we formalize the concept of Internet routing and explain how different protocols con-tribute to it. Other concepts related to network measurements and graph theory follow. In Chapter 3, we present the RIPE IPmap platform, which is a system we developed for the geolocation of IP addresses. This platform is able to estimate the location of a connected device by means of latency measurements. We use it for enriching network measurements with geographic data across the entire thesis. In this chapter, we first introduce the methodology behind RIPE IPmap. We then evaluate its accuracy and coverage, and compare it to other geolocation solutions. RIPE IPmap is currently a service offered by the RIPE Network Coordination Centre.

In Chapter 4, we study the maximum theoretical accuracy that can be achieved worldwide by a geolocation approach aimed at geolocating Internet infrastructure using latency measurements. We study how the position of landmarks (the devices used as the measurement sources) and the strategy adopted for their enrollment affects localiza-tion accuracy. We compare the two most common approaches: a more centralized and controlled one, which uses well-connected machines belonging to the infrastructure as landmarks; and a more distributed and scalable one, based on landmarks positioned at the edge of the network by volunteers. The study is based on an extensive dataset composed of 23 million measurements performed at both global and regional levels, including regions that were scarcely or never observed in previous works.

(17)

Chapter 1. Introduction

Chapter 5 introduces the topic of the topological representation of network data. In this chapter, we present Radian, a visualization that represents complex network topolo-gies at different abstraction levels and animates their evolution over time. We propose a novel visual metaphor and novel domain-specific algorithms for the representation of evolving network topologies by means of dynamic clustered graphs.

In Chapter 6, we continue the theme of network data representation, but we move to a higher level of abstraction. In particular, we introduce the topic of inter-domain routing and propose, prototype, and evaluate Upstream Visibility, a visual interface to display such data. We define new metrics for the evaluation of the readability of stacked area charts, which constitute the heart of our interface. Then we exploit such metrics for creating novel heuristics able to automatically generate readable stacked area charts. Chapter 7 contains further research on the analysis of Internet behaviors. The tech-niques described in the previous chapters are instrumental in enabling the three studies reported in this chapter. The first study presented sheds light on a phenomenon never before identified. Large-scale datasets of Internet topology measurements are com-monly used by researchers and operators to investigate Internet performance or to tackle network issues. Investigating the quality of such datasets is of paramount importance. Looking at topology data over time using a visualization tool like the one described in Chapter 5, we commonly observe paths that change over time, almost periodically. In this analysis, we are interested in verifying whether such changes are indeed the result of a periodic phenomenon and, if so, in determining whether such a phenomenon is an artifact of the dataset used or is the result of real topological changes in the network. The second analysis we report is a more in-depth study of an anomaly we isolated in the Middle East while investigating the maximum theoretical accuracy of active IP ge-olocation worldwide (Chapter 4). The process of geolocating Internet hosts from their IP addresses using latency measurements highly depends on correctly converting such latencies into physical distances. This conversion is done with a delay-distance co-efficient, which can differ in different areas of the world due to the circuitousness of Internet paths. In the Middle East, we discovered that this coefficient is much smaller than in other regions. For this reason, we analyze the topological and geographical properties of Internet paths in the Middle East and compare them to Europe. Finally, we conclude Chapter 7 with an unplanned research topic: an analysis of the impact of COVID-19 on the Internet. The COVID-19 pandemic, ongoing at the time of writing, drastically changed billions of people’s way of life. In this study, we evaluate the im-pact on Internet latency caused by the increased amount of online activity following the lockdowns that various countries adopted to limit exposure to the virus. The study fo-cuses on Italy, which was considerably impacted by the virus, but also includes results from Spain, France, Sweden, Germany, and the whole of Europe.

The thesis comes to a close in Chapter 8 with conclusions and a list of references to datasets and source code for the reproduction of the studies carried out during the doctorate.

(18)

CHAPTER

2

Definitions and Reference Scenario

2.1

Internet Routing

The Internet is often defined as a network of networks that all cooperate together with the goal of moving data among devices. The task of moving data across the Internet is performed by devices called routers. Essentially, various routers are connected to each other and when one router receives data for a specific destination, it forwards it to a neighboring router (a “hop”) in order to push the data closer its final destination. The selection of the outbound interface and next hop is made by the forwarding plane, also known as the data plane, based on the Forwarding Information Base (FIB), a data structure kept in memory by the router.

Routing, performed by the control plane, is the exchange of reachability information between routers. A router uses the reachability information to build and maintain the FIB based on routing processes. Such processes decide which of the possible calculated routes to actually insert in the FIB. In addition, the operator often influences these decisions by configuring local policies.

Each device in a network (e.g., computers, routers, servers) is identified by one or more IP addresses (or simply IPs) [1]. The data transferred on the network is split into packetslabeled with the IP address of the destination.

The different networks which collectively form the Internet are organized in Au-tonomous Systems(ASes). Each AS represents one or more networks operated by a single administrative entity or domain, e.g., an Internet Service Provider (ISP), a broad term referring to a company operating a portion of the Internet infrastructure at what-ever scale. Routing happens both inside an AS and across different ASes; these cases are commonly specified by the terms intra-domain routing and inter-domain routing, respectively.

(19)

reach-Chapter 2. Definitions and Reference Scenario

ability information among ASes, enabling inter-domain routing on the Internet. The reachability information is used to build a partial graph of AS connectivity, which en-ables routers to avoid loops or enforce specific routing policies defined by the operators. Such reachability information is exchanged on TCP sessions established between pairs of routers, called peers. Routers on the boundary of an AS exchanging information with another AS are called border routers (or border gateways). A router can use BGP to announcean IP prefix (a range of IPs) to neighboring routers. The AS containing such a router becomes an origin of the announced IP prefix. When a border router shares information about the reachability of a prefix with a neighboring router in a different AS, it prepends its AS number in front of the AS path, an attribute of the BGP message. Hence, from the point of view of the router receiving the BGP message, the AS path is the list of all the ASes to traverse in order to reach the origin AS of the prefix contained in the message.

Some other protocols deserve to be mentioned to ease the reading of this work. The Internet Control Message Protocol (ICMP) [3] is a protocol used for diagnostic purposes or to generate responses in case of errors during IP forwarding. The Trans-mission Control Protocol(TCP) [4] is the protocol used in the vast majority of Internet activities; for example, the Hypertext Transfer Protocol (HTTP) [5], used for accessing web content, is based on TCP.

2.2

Network Measurement Terminology

Network measurement techniques can be divided in two main categories: passive and active.

Passive measurementsare based on the analysis of data obtained without creating or altering the data flow already happening on the network. Possible sources of such data are logs, router administrative information, and traffic inspection.

Active measurementsare based on injecting new traffic on the network and observing its behavior. In this work, we develop techniques based on both active and passive measurements.

What follows is a list of concepts related to network measurements, which are es-sential for understanding the presented work.

A path is an ordered sequence of hops traversed by a packet from a source device to a destination device (usually called target). The term path is generic and, depending on the context in which is used, can indicate a sequence of IP addresses, a sequence of routers, or a sequence of ASes. The physical route between the source and the target is not straight. From a geographic point of view, Internet paths are composed of multiple segments between adjacent hops that do not take place in a straight line. Such a property is called circuitousness of the path. Each path is characterized by a different amount of circuitousness.

The Round-Trip Time (RTT), or latency, is the delay (in milliseconds) it takes for a packet to go from a source device to another device and back to the source. Let δ be the delay of a given network path. It can be approximated as δ = δtra+ δpro+ δque, where

δtra is the transmission delay, δpro is the propagation delay, and δque is the processing

delay introduced by intermediate routers.

Ping is a utility which performs an active measurement to calculate the RTT to a device on an IP network. Pre-installed in virtually all operative systems, it is possibly

(20)

2.2. Network Measurement Terminology

the most used network measurement utility. The basic mechanism ping uses to collect the RTT is to send an ICMP echo request message to the target and wait for an ICMP echo reply. Packet loss and errors are also reported. Ping is supported by all major network measurement platforms and we use it throughout this thesis.

Internet routing changes frequently as the routing protocols update their decisions. For example, paths can change due to the disconnection of a link or due to load bal-ancing. To approximate the current followed path between a source and a target, the most used tool is traceroute. It works by sending a sequence of packets with progres-sively increased time-to-live (TTL) values to the specified target. The TTL signals the intermediate routers when the packet has to be discarded. When a packet is discarded, a time exceeded message is generated by the router discarding it and sent back to the source. By using incremental TTL values, the source is able to collect the sequence of IP addresses of the routers sending back time exceeded messages (or “*” if no response was received) between the source and the destination. For each response, the traceroute also calculates the RTT. One or more traceroute results form an IP-level topology (or simply traceroute topology). If we group together all the IP addresses belonging to the same router, we obtain a router-level topology. Moreover, by grouping together all the routers belonging to the same AS in a single entity, we can observe the AS-level topol-ogy, which is yet another abstraction level of Internet topologies. We use traceroutes and the topologies derived from them heavily throughout this thesis.

The terms routing and forwarding have different meanings, as described in the previ-ous section. However, in an Internet measurements context, these terms are sometimes used interchangeably. We prefer to clarify our usage of these terms. In the traceroute context, we use the term routing as in the set of routing decisions deduced by observ-ing the forwardobserv-ing of measurement packets. We explicitly use the term inter-domain routing when we refer to BGP data.

IP to AS lookupis the process needed for converting traceroute topologies to AS-level topologies. This process consists of identifying the origin AS of the IP addresses of the various hops in the traceroutes. It can be performed with BGP data, such as the one pro-vided by RIPE RIS (see Section 2.4). Unfortunately, links between two ASes (inter-AS links) have IP addresses belonging to the pool of one of the two, which makes a simple IP to AS lookup (based on BGP data) unreliable. There are methodologies to iden-tify such links and mitigate the issue [6–8]. In particular, the authors of [6] released a tool able to perform the identification of inter-AS links on generic traceroute datasets. In this thesis, we use a combination of RIPE RIS and [6] as sources for the IP to AS lookup.

The One-Way Delay (OWD) is the delay measured between the source of a measure-ment and the target, with communication in only one direction. It differs from the RTT in that the latter measures the time needed for a data unit to reach the target and come back. The OWD is commonly used in IP geolocation. Precise OWD values between two devices in an IP network can be measured through the use of synchronized clocks, which usually involves having GPS on the devices [9]. However, this method suffers from several limitations, such as requiring ad hoc cooperation between both devices and the need for special hardware to synchronize clocks. Hence, it is not suitable for generic devices connected to the Internet. For this reason, the most commonly adopted approach during IP geolocation is to derive an approximation of the OWD by simply

(21)

Chapter 2. Definitions and Reference Scenario

calculating RT T₂ . In this thesis, we use OWD to refer to such an approximation. This approximation introduces an error, since there is no way to be sure that the paths used in the two directions contribute equally to the RTT value. In particular, the paths can be asymmetric, involving different routers, different amounts of hops, or even different ASes. The first work to appear on the topic is the one in [10], which studies routing pathologies and defines the problem of routing asymmetry. They perform 40 000 tracer-outes between 37 hosts and observe that asymmetries involve at least one different AS in 30% of the cases. More recently, the study in [11] includes a full-mesh of traceroute measurements between 350 hosts. They note that at the AS level the routing asymmetry involves 65% of the pairs when those are in commercial networks (compared to 14% in academic networks). In [12], the authors identify how often and where asymmetries happen in the RIPE Atlas dataset. In particular, they identify that, in general, paths are asymmetric by either 1 or 2 hops and that those are mostly located in large ISPs. The study in [13] analyzes the topic of the OWD asymmetry induced by routing asymme-try. The authors confirm that, from the delay point of view, the majority of asymmetries are introduced by commercial networks and that this is exacerbated by paths subject to high RTTs. They also highlight how router-level asymmetry does not imply OWD asymmetry, while OWD asymmetry usually implies router-level asymmetry. The use of the OWD in the process of IP geolocation is briefly introduced in Section 2.5 and later explained in Chapter 3. Errors involving IP geolocation are studied in Chapter 4.

The Speed of Internet (SOI) is the conversion factor describing the linear relation-ship between the distance and the OWD. In simple terms, it is the number of kilometers a packet travels (in a straight line) toward the measured target in a millisecond. The SOI is crucial for the accuracy of some IP geolocation processes. We better explain the importance of the SOI during IP geolocation in Chapter 4.

The term probe, in this document, refers to a device which can be remotely in-structed to perform active network measurements. Network measurement platforms are distributed networks of probes. This nomenclature is mostly derived from the one defined by the RIPE Atlas platform (presented in Section 2.4), which is heavily adopted to collect network measurements in this work.

We define landmark as a device of known location used as spatial (geographical) reference during the process of active IP geolocation. More precisely, we use the word landmark to refer to a probe used as a source for the latency measurements needed in such a process.

2.3

Stakeholders

Network data analysis is of major importance for any player operating on the Internet. Different players, or stakeholders, operate on the network in different roles. In this section, we describe some of the main Internet stakeholder organizations or groups which will be mentioned during the rest of the thesis. Some of these have a strong interest in the techniques defined in this work, and played a major role in guiding the development and the production of some of the techniques.

Founded in 1992, the Réseaux IP Européens Network Coordination Centre (RIPE NCC) is the oldest of the five Regional Internet Registries (RIRs) which administer resource assignment (IP addresses and Autonomous System numbers) and support

(22)

In-2.3. Stakeholders

ternet coordination worldwide. It is a not-for-profit membership organization which counts as members more than 24 000 ISPs and other stakeholders [14]. The main func-tion of the RIPE NCC is to provide Internet resource allocafunc-tions in Europe, the Middle East, and parts of Central Asia. In addition to its registry role, the RIPE NCC has a key role in supporting the development and administration of the Internet worldwide by participating in Internet governance processes and maintaining several technical el-ements vital to the Internet’s infrastructure. For example, the RIPE NCC maintains K-root, one of the world’s 13 root Domain Name System (DNS) name servers, and DNSMON [15], the service monitoring the worldwide core DNS infrastructure, in-cluding root and top-level domains. The RIPE NCC developed some of the largest data collection platforms constantly monitoring the Internet worldwide. Some of the de-rived datasets receive hundreds of millions of requests per day [16], as they are widely used by researchers, network operators, and law enforcement. These datasets, better described in the following section, were instrumental for the research carried out in this thesis. The other four RIRs are: (i) LACNIC, covering Latin America and parts of the Caribbean; (ii) AFRINIC, covering Africa; (iii) ARIN, covering North America and parts of the Caribbean; and (iv) APNIC, covering the Asia-Pacific region.

Other stakeholders also play key roles in the Internet’s infrastructure.

Tier-1 carriers are large ISPs covering massive areas. The most generic definition is that they are able to reach the rest of the Internet via peering and without paying any other provider for transit. Tier-1 carriers built important parts of the Internet infrastruc-ture, such as sea cables, and they are in charge of most of the traffic exchanged between continents and countries. Sometimes they are referred as backbone Internet providers.

Internet Exchange Points (IXPs) are physical infrastructures through which ISPs connect and exchange traffic directly.

Content Providers are organizations that handle the distribution of content on the Internet. The content distributed can be of any kind, from websites to movies. In the path definition introduced in the previous section, a content provider would be one of the endpoints of a path (however, some content providers also own part of the network in between).

Content Delivery Networks (CDNs) are geographically distributed networks and data centers offering services to content providers to increase the availability and per-formance of content distribution. In general, this is achieved by promoting data local-ity: geographically distributed servers proxy content in a way that such content can be served from a location closer to the end user.

The list above is not meant to be exhaustive, but to report stakeholders that are often named in this thesis. Other players with key roles in the Internet’s infrastructure also exist which are not listed above (e.g., middle-tier ISPs).

Finally, some research institutions play an important role in the production and anal-ysis of network data.

IIJ Innovation Institute is a research institute founded by Internet Initiative Japan Inc. to promote research on communication technologies, in particular in the field of network measurements and security. They produce the Internet Health Report [17], which combines many publicly available datasets in a dashboard to monitor global In-ternet performance. They also maintain, together with rg.net, a set of border routers pe-riodically announcing and withdrawing prefixes, called BGP beacons, which are useful

(23)

for experiments on timing and anomalies of inter-domain routing [18].

Cooperative Association for Internet Data Analysis (CAIDA) is a research center based at the San Diego Supercomputer Center, which investigates both practical and theoretical aspects of the Internet, with particular focus on analyzing behavior, usage, and evolution of the Internet.

2.4

Measurement Platforms and Datasets

Some of the organizations listed in the previous section provide the open datasets adopted in this work.

RIPE Atlas [19] is a large-scale distributed system provided by the RIPE NCC, aimed at measuring Internet reachability and performance. Its ultimate goal is to pro-vide a detailed understanding of the state of the Internet on a global scale. RIPE Atlas is a network of devices belonging to two categories: probes and anchors. Probes are small low-end hardware devices, mostly hosted in domestic environments or in office networks. Anchors are more powerful devices, mostly deployed as servers hosted in professional environments, such as IXPs, ISPs, and data centers. Hosting either a probe or an anchor is done on a voluntary basis, but with some important differences. Probes are freely given to interested users with almost no requirements for the host. Anchors are instead deployed according to more strict requirements (e.g., most of the anchors must support both IPv4 and IPv6 and have dedicated bandwidth), as they should al-ways be online and running. The difference between these two types of devices allows to observe the Internet from different perspectives, which is leveraged in this work. Among the open platforms able to perform Internet measurements, RIPE Atlas is the one with the greatest number of vantage points. At the time of writing, there are more than 11 000 probes and 650 anchors involved in Internet measurements. While it has a bias towards Europe, where many devices are located, the large number of vantage points allows, in any case, excellent and unprecedented world-wide coverage.

RIPE Atlas automatically carries out predefined measurements, called built-in mea-surements, towards important parts of the Internet’s infrastructure and among all the vantage points of the platform itself. Additionally, RIPE Atlas allows its users to schedule User-Defined Measurements (UDMs) towards arbitrary targets. Scheduling measurements can be done with an API, thus allowing users to create measurements and to collect results automatically (i.e., by means of scripts), enabling the creation of large-scale measurement campaigns, like the ones used in this thesis. Among the built-in measurements produced autonomously by the platform, we have: (i) the Anchor Measurements, in which several probes and anchors perform measurements towards anchors; and (ii) the Anchor Mesh Measurements, in which all the anchors measure each other. Both datasets are composed of ping, traceroute, and HTTP measurements. Additionally, built-ins produce DNS measurements towards the root and main top-level domains, and HTTP measurements towards strategic web pages. The results produced by RIPE Atlas have been extensively used for both research (e.g., more than 800 results on Google Scholar at the time of writing), and operational purposes.

Another public dataset offered by the RIPE NCC is the one produced by the Rout-ing Information Service (RIS) [20]. RIS is a network of globally distributed Remote Route Collectors(RRCs), typically located at IXPs, that collect and store inter-domain routing data. Volunteering ISPs peer with the RRCs using BGP. Commonly, such peers

(24)

2.4. Measurement Platforms and Datasets

are called collector peers (CPs). RRCs act similarly to border routers, except that the routes they collect are not propagated: RRCs simply store all the BGP updates received by their peers. RIPE RIS was created in 1999, and today it collects around 60 gigabytes of routing data daily from 1 310 CPs. In total, the CPs provide more than 190 million routing changes daily. The RIPE NCC stores this data and provides open access to more than 15 years of BGP routing data collected worldwide. A number of other or-ganizations created similar projects, with the most prominent (in terms of number of observation points) being: (i) RouteViews [21] of Oregon University; (ii) BGPmon [22] of Colorado State University; and (iii) Isolario [23] of the National Research Council (CNR) of Italy.

Finally, the RIPE NCC also offers RIPEstat [24], a set of data APIs and visualiza-tion tools providing easy access to Internet-related data. This service is essential for obtaining information such as which AS number is announcing or managing a specific IP, or which organization is behind a specific AS number. RIPEstat, with more than 100 million requests per day [16], is one of the most used resources by Internet opera-tors and researchers. Among the most used visualization tools offered by RIPEstat, is BGPlay [25, 26], a tool for the visualization of evolving inter-domain topology data we developed prior to the work presented in this thesis and which is prerequisite knowledge for the work presented in Chapters 5, and 6. Some of the new techniques developed in this work are now also deployed in production in RIPEstat.

Archipelago(Ark) [27] is a measurement platform deployed by CAIDA. At the mo-ment of writing, the platform is composed of 158 hardware nodes. The Ark platform is used to produce various datasets, including the following. (i) The Routed IPv4 /24 Topology Traceroute [28] is based on traceroutes from Ark nodes to randomly cho-sen destinations in all the routed /24 IPv4 BGP prefixes reachable by the platform. (ii) The Macroscopic Internet Topology Data Kit (ITDK) [29] is derived from the pre-vious dataset and used to infer router-level topologies. Routers have a different IP address for each interface they use to forward data. Converting a traceroute topology into a router-level topology requires identifying which addresses belong to the same router. This process is called IP alias resolution and can be performed in various ways (described in [30, 31]). ITDK provides such data in an easy-to-use format, which can be used to convert different IPs belonging to the same router into a single one. (iii) The Measurement and ANalysis of Internet Congestion (MANIC) reports inter-AS links and their congestion state. The dataset is collected using Bdrmap [8] and the Time-Se-quence Latency Probing (TSLP) [32] method from the Ark platform. As a side effect of this probing, this dataset lists a large number of connected IP addresses, able to receive and answer ICMP requests, which can be used as targets for further network measurements.

NLNOG RING is a platform maintained by the Network Operator Group associa-tion of the Netherlands. It is composed of 567 nodes distributed in 56 countries. The platform was born as a collection of machines with shared access to support network operators in troubleshooting. In recent years, the platform has also started to be used more and more for research purposes.

Measurement Lab(M-Lab), founded in 2009 by New America’s Open Technology Institute, the PlanetLab Consortium, and Google, provides an open Internet measure-ment platform and hosts tools to help researchers understand certain aspects of the

(25)

Internet performance. The measurement platform offered by M-Lab is composed of devices called “pods”. Each M-Lab pod consists of 3-4 servers and a switch, connected directly in Tier-1 data centers. At the moment there are 155 pods connected in 35 countries.

One of the key properties of the measurement platforms listed above is that, in ad-dition to the possibility of using them as a source during network measurements, they provide an invaluable set of targets to measure, since they publish accurate metadata for each of their nodes. This information is fundamental to the understanding of their geographical location, their connectivity type, and other aspects that can be exploited during network research.

2.5

IP Geolocation

The geographical location of probes is one piece of information provided by the plat-forms described in the previous section. This is useful in order to know where the source of the measurement is (and in some cases, the target, if a probe of the platform is used as a target). For all the other devices involved in the measurement, such infor-mation is not available. For example, if we consider a traceroute between two RIPE Atlas probes, we would know the location of the endpoints but not the location of all the hops in between. This information is extremely important during day-to-day opera-tions, for example in troubleshooting, to identify the geographic location of the culprit of a network issue, or simply to verify that the routing policies deployed are behaving as planned.

Additionally, such information is fundamental in research to study the evolution of the network, or to investigate path circuitousness and traffic locality. For example, the geolocation of nodes belonging to the infrastructure was used to verify the location of the servers of some proxy services [33], which showed that one third of such proxies are geolocated in countries different from what was advertised. In [34] and [35], in-frastructure geolocation was used to infer whether measurements between endpoints passed through IXPs and if they remained in the same country. Similarly, servers run-ning advertisement and web tracking services were located to estimate the amount of traffic that crosses data protection borders [36]. Thanks to the system that we describe in Chapter 3, the authors concluded that 90% of European user tracking is carried out inside EU’s borders.

More generally, estimating the geographic location of a device connected to the Internet is fundamental in many other situations, such as offering geographically tuned digital content, language selection, and regulation and tax compliance.

We can define a geolocation method as a function that, given an IP address, returns the geographical location (e.g., the city) in which the device connected with the IP is located. Similarly to the Internet measurements classification, existing IP geolocation methods also fall into two main categories: passive and active.

2.5.1 Passive Geolocation

Passive geolocation uses information that is available without the need to perform mea-surements towards the target IP address. For example, DNS hostname-based geoloca-tionis a geolocation method which converts IPs to fully qualified domain names

(26)

(host-2.5. IP Geolocation

names) by using reverse DNS lookups, after it tries to detect geographic information (such as airport codes and country codes) placed by network operators in such host-names [37, 38]. Unfortunately, this technique has some weaknesses: (i) the approach works only with IP addresses with a reverse DNS set; (ii) the operators do not adhere to a standard in annotating such domains, and in many cases no standard codes (such as ISO codes) are used to identify locations; and (iii) the location has to be updated manually in the case of changes.

RFC1876 [39] instead proposes an experimental dedicated DNS record “LOC” to be used in place of exploiting hostnames. This never gained momentum, and it is rarely set, possibly also due to the complicated geolocation format adopted. In this case, the location also has to be updated manually.

The RIR databases of the allocated resources are publicly accessible and often used as a source of geolocation information. These databases contain information such as the location of the company to which an IP prefix is registered. There is no real correla-tion between the company’s physical presence and the deployment of an IP address; for example, a network could span more than a city/country, the company’s administrative headquarters could be in a different location than where the network is deployed, or the IP could simply be assigned to a customer with another geographical location. How-ever, such information is commonly used by many geolocation providers in absence of a better source. Some RIRs provide a country field where assigned prefixes can be annotated with their geolocation. Among the RIR databases, the RIPE and APNIC databases provide an additional optional “geoloc” field, which can be used to annotate the latitude and longitude of an assigned prefix.

Geofeed[40] is a textual format based on csv, which allows ISPs to self-publish IP geolocation feeds by simply making available a textual document somewhere on their websites. The format prescribed allows the specification of the geolocation of single IPs or of entire IP prefixes. The geolocation can be expressed with flexibility, from the country down to the city. Geofeed is a promising format, but currently has one main weakness: the geofeed files published by the ISPs are not indexed anywhere, which prevents automatic discovery and importing of such data.

Finally, Commercial geolocation datasets use proprietary methodologies relying on a combination of public datasets, e-commerce logs, updates manually crowdsourced by network operators and more. They often start with RIR data and improve such data with e-commerce logs, which provide mapping of users’ IP addresses to the shipping addresses they provided during a purchase.

2.5.2 Active Geolocation

Active geolocation is a series of methodologies aimed at solving the main problem afflicting passive geolocation: the lack of automation, which contributes to unverified and stale data. Active geolocation methodologies use active measurements to locate a device connected with a specific IP address.

Figure 2.1 depicts the basic three steps common to active geolocation approaches: 1) one or more landmarks issue latency measurements towards the target IP address; 2) the measured RTT is converted into an approximation of the OWD;

(27)

RT T 2 ∗ SOI

Figure 2.1: Depiction of the basic approach for active IP geolocation.

3) the OWD is used to estimate a distance by using the distance-delay coefficient (which in the best-case scenario is the SOI between the landmark and the target) and this distance then defines an area around the landmark in which the target device must be located.

At this point, a geolocation method can locate the target in the same place as the land-mark reporting the lowest latency value; in this case, the set of possible locations is discrete and restricted to the location of the landmarks [41]. Trying to overcome the limitation of the discrete answer space, another set of geolocation methods, defined Constraint-Based Geolocation (CBG), adopts multilateration [42]. With multilater-ation, the target IP is located in the intersection of the areas calculated by multiple landmarks, thus allowing for a continuous space of answers [42–45]. Anyway, these methods are also strongly affected by the landmark proximity, since accurate CBG re-quires landmarks within sufficiently small RTT from targets [46].

Other geolocation methods called Traceroute-Based Geolocation (TBG) adopt tracer-oute measurements to exploit the information obtained about intermediate hops (e.g., tar-get proximity to hops of known position and reverse DNS) [47–52]. Furthermore, some of these approaches also introduce nodes of known position as passive land-marks (e.g., web servers) to improve geolocation accuracy [44, 47, 53–55]. For ex-ample, in [56], the ISP’s points of presence (PoPs) are detected with path exploration, associated to their location, and collected in a database. This information is then used to guess an end user’s city-level location based on the PoP traversed by active measure-ments towards the user.

Finally, statistical methods are used to define the probability density of the position of the target or to rate the quality of an estimated position [48, 49, 57–59].

(28)

2.6. Graph Drawing

2.6

Graph Drawing

In this thesis we extend and devise algorithms to automatically produce graphical rep-resentations of networks. In particular, we use graphs and graph-derived metaphors.

In mathematics, a graph G = (V, E) is a data structure composed of a set V of vertices (commonly called nodes), and a set E of edges connecting pairs of vertices. An edge e ∈ E connecting two vertices v1and v2is noted as e = (v1, v2), we say that e

is incident to v1 and v2, and that v1 and v2 are incident to e. Two vertices are adjacent

if they are incident to the same edge. Two edges incident to the same vertex are also called adjacent. The degree of a vertex is the number of edges incident to it. A graph is directedif an order exists across the edges in E; otherwise, the graph is undirected. A pathin a graph is a sequence of adjacent edges. A cycle is a non-empty path in which the first and last vertices are coincident. A maximal connected acyclic graph is called a tree. In a tree, a vertex is designed as the root of the tree, while vertices with a degree of 1 are called leaves. We say that a graph is complete (or a clique) if there is an edge for each pair of vertices in the graph, while we say that a graph is connected if there is a path between every pair of vertices (and it is otherwise disconnected). A loop is an edge (v, v). Multiple edges are sets of edges incident to the same pair of vertices. A simplegraph is an undirected graph without loops or multiple edges.

A basic drawing for a G = (V, E) graph is a diagram in which the vertices v ∈ V are represented as circles and the edges are represented by lines or curves. These types of drawings are commonly used for representing networks, in which the vertices are the routers and the links between them are the edges. Also, vertices can be visually grouped together in clusters, which allows for routers to be grouped based on their properties (e.g., administrative domains). There are various definitions for graphs which allow nodes to be clustered. A compound graph is one in which clusters of nodes can be nested, and edges can exist between nodes and clusters. A clustered graph, which we heavily use in network visualization, is a compound graph in which clusters cannot have edges (only nodes can). A clustered graph in which the clusters cannot be nested is called flat. A graph is defined as planar if it is drawn in such a way that the edges do not cross each other. Density is a property of a graph defined as |E|/|V |. A graph can be dense, with a number of edges close to the maximum possible, or sparse, with only a few edges. Such a definition is vague and usually specified in the context in which it is used. We consider a graph to be sparse when |E| is lower than the number of edges a planar graph with |V | vertices would have.

There are several algorithms for the automatic drawing of graphs. However, for the understanding of this thesis, two are of interest: force-directed and layered drawings.

In force-directed algorithms, the graph is seen as a set of physical objects interacting with each other. Of these algorithms, the most used is possibly the spring embedder, in which nodes are modeled as objects having a repulsive charge and edges are springs try-ing to keep those nodes close. This family of algorithms has some important strengths. First, the forces involved in the physical system can be easily tuned based on the do-main of the data, or simply based on some trial-and-error cycles. Second, there is no particular requirement for the graph’s input. Finally, it scales well to large graphs. An example of a network visualization tool adopting a spring embedder algorithm is BG-Play. However, the main drawback of force-directed algorithms is the unpredictability

(29)

of the aesthetic of the final drawing. In particular, these algorithms’ aesthetic results are strongly impacted by the initial position of the nodes, and by the total number of components involved.

Layereddrawings (or Sugiyama drawings) are those in which the edges of the graph are organized into hierarchies. Initially, each node is assigned to a specific layer based on its distance from a root node. Then, the layers are ordered in such a way as to minimize the edge crossings. If clusters exist, the layer ordering must take into account that nodes in the same cluster have to be in consecutive layers. Finally, the position of the nodes is computed with geometric operations enforcing the layer ordering and the distance between nodes. Two major drawbacks afflict these types of drawings: the layer ordering is NP-hard (so heuristics are used), and they work only on acyclic graphs.

(30)

CHAPTER

3

Geolocation of Internet Infrastructure

Our efforts in the field of IP geolocation are focused on deriving the location of nodes deployed in the Internet’s infrastructure. In this chapter, we describe a geolocation solution we developed that is used in many of the following chapters. We published some of the content presented in this chapter in [60].

3.1

Infrastructure Geolocation

IP geolocation databases predominantly focus on finding the location of end users’ devices, such as computers and mobile phones. These databases are typically inaccurate for the geolocation of IP addresses that belong to the Internet’s infrastructure, such as routers and servers. The geolocation of the infrastructure is instrumental in both research and operations. For example, it helps to study the Internet’s evolution, to understand data locality, to troubleshoot accidents, and to visualize the effects of natural disasters.

Various commercial services offer APIs that can be used to retrieve the location associated to a given IP address. Currently, this information is mostly used to person-alize the experience of a website based on the user’s location. For this reason, they are focused on the geolocation of end users, even if they are also often adopted as a geolocation source for infrastructural components. As described in Subsection 2.5.1, these services are often based on passive datasets. RIR data is a considerable compo-nent of their geolocation data. Such data is based on administrative information about the company to which an IP address has been assigned [61]. As a consequence, the reported position can be rather far from the actual location in which the IP is deployed in a connection, especially for large organizations. Various geolocation databases have been evaluated, including commercial ones, and found to be inaccurate [61–65].

(31)

Chapter 3. Geolocation of Internet Infrastructure

Active IP geolocation methods offer an alternative to passive datasets (such as RIR data and DNS hostname-based geolocation). These methods use latency measurements to determine the position of a target IP address. A set of devices of known position, called landmarks, are used to measure the network latency between themselves and the target IP. The resulting latency is converted into distances and used to estimate the position of the target IP. In such methods, the availability and geographical distribution of the landmarks strongly affects the accuracy of the results [46].

In Section 2.5, we introduced some active and passive geolocation approaches de-scribed in the literature. Of the mentioned works, only [37, 38, 50, 66] focus on ge-olocating IP infrastructure instead of end users. In particular, the methods described in [37] and [38] use DNS hostname-based geolocation to geolocate routers. However, this approach can be used only when reverse DNS is set, and it suffers from a lack of uniform naming rules. In [50], OWD measurements collected with synchronized high-resolution clocks are used to estimate the location of routers along the paths between landmarks. This approach is not applicable on common devices, since they do not have the required special-purpose hardware. In [66], a tool for geolocating routers using ac-tive measurements is presented. In particular, this tool uses the DNS hostname-based geolocation described in [37] to produce hints about the target’s location. The hints are later used to reduce the number of measurements required. However, results show that the use of such hints is detrimental to the overall accuracy and coverage of the platform [67].

Active IP geolocation methods have usually been studied on a small scale, based on platforms with a few dozen landmarks deployed (some exceptions have a few hundred), and usually in few developed regions. As a result, many of these systems have almost no operational use.

With more than 11 000 probes distributed across 179 countries [68], the RIPE At-las [19] platform provides unprecedented conditions for active IP geolocation tech-niques. For this reason, at the RIPE75 meeting [69] in 2017 in Dubai, we announced the RIPE NCC’s latest effort on the topic of geolocating IP infrastructure, the RIPE IPmap platform [70]. This platform aims to combine various geolocation methods, in-cluding active geolocation performed with RIPE Atlas. A large part of the platform was designed and improved as part of the research activity carried out during this doctorate. Our platform received great interest in the last two years, in particular from researchers who published several papers based on geolocation data obtained with RIPE IPmap, e.g., [36, 67, 71–77]. The platform’s accuracy is praised in some of these papers. How-ever, we believe a formalization of such accuracy is needed and present this in the following sections together with other factors, such as coverage and consistency. We estimate that 80.3% of the platform’s results have city-level accuracy for our ground-truth dataset, and 87.0% have city-level consistency when geolocating different inter-faces on the same routers. Additionally, the platform provides geolocation inferences for 78.5% of 26 559 infrastructure IP addresses from our coverage evaluation dataset.

3.2

The RIPE IPmap Platform

RIPE IPmap is a platform focused on the geolocation of IP infrastructure that aims to combine active and passive geolocation methods in a single system. In developing the

(32)

3.2. The RIPE IPmap Platform

Figure 3.1: RIPE IPmap high level schema. Source [78].

architecture, a main goal was to ensure that the system would be able to easily accom-modate various geolocation approaches, including possible approaches created by third parties. Each geolocation approach is encapsulated in a self-contained microservice called geolocation engine. Figure 3.1 depicts a high-level schema of this architecture. At the beginning of the IP geolocation process, each engine of RIPE IPmap receives as input the IP address to geolocate. The various engines share the same database [79] of cities and countries (including latitude and longitude coordinates). All engines, in parallel and in isolation from each other, compute a list of possible locations for the queried IP address. Each location is marked with a rank from 1 to 10, which indicates the engine’s confidence in the estimated location. The engines are also ranked based on the overall accuracy of their answers, and this rank is used as a multiplication factor for the ranks of the locations they produce as output. Subsequently, all the estimations are collected by a component called the reducer, which combines the locations received along with their ranks, and returns the result to the user.

Queries to RIPE IPmap can be issued through a REST API [80], which we use pervasively in the research presented in the next chapters. The API allows to query for the geolocation of one or more IP addresses at once. Additionally, it allows viewing the partial answers produced by the individual engines separately. Metadata about the geolocation process is also provided.

Currently, RIPE IPmap offers four engines:

1) probeslocation, which returns the known location of IP addresses in use by RIPE Atlas probes;

2) crowdsourced, which returns crowdsourced geolocations provided by resource holders and operators;

3) simple-anycast, which infers if an address is anycast and provides anycast in-stances by means of active measurements; and

4) single-radius, an active geolocation approach based on pings issued from RIPE Atlas probes toward the target IP.

In this work, we focus on single-radius since it is the most innovative part of this plat-form. In single-radius, we implemented active IP geolocation with unprecedented land-mark coverage and computational power.

(33)

Chapter 3. Geolocation of Internet Infrastructure

3.3

Single-Radius Methodology

Mostly, active geolocation techniques have been designed in theory only, or imple-mented in such a way that they are applicable only to a small geographical area, or re-quire either significant computation or pre-processing. We instead designed the single-radius engine of RIPE IPmap as a close-to-real-time production service, therefore choos-ing computational efficiency over more complex measurement approaches.

Single-radius performs four steps for each geolocation request:

1) The target IP is mapped to the AS announcing its containing prefix using the data provided by RIPE RIS [20]. Based on the AS, RIPE Atlas probes that are topologically close to the target are selected (see Subsection 3.3.1). Such probes are used as landmarks from which ping measurements towards the target IP are scheduled.

2) All RTTs below 10 ms are collected and converted to OWD values (as RT T₂ ). This initial RTT filtering assumes that geolocation using distant landmarks (e.g., on another continent) is ineffective [46]. We call this threshold ξ.

3) The landmark λ reporting the minimum OWD is selected and its OWD converted to distance ˆd using a SOI of 2₃c (where c is the speed of light in a vacuum) [81]. 4) The location of λ is used as the center of circle η having radius ˆd. The 100 closest

cities to λ are selected using the RIPE Worlds dataset [79]. Only cities inside η are selected; hence, lower OWDs yield fewer cities. Finally, cities are ranked (see Subsection 3.3.2) and returned to the user.

3.3.1 Initial Landmark Selection

One of the main limitations of most of the active geolocation theories available until now is that they focus on the localization process but do not impart anything about the landmark selection. Usually, all the available landmarks are used for the measurements, as this is the best-case scenario for the localization process. Single-radius, instead, limits the number of landmarks it dedicates to a single IP geolocation.

While using all landmarks provides the maximum possible coverage, it does not scale in practice. See Chapter 4 for more details. Using all landmarks in RIPE Atlas would require 11 000 ping measurements for each geolocation attempt. This would drastically increase the usage of the RIPE Atlas platform, and the time needed for the measurements to be scheduled and retrieved. It would also cause ICMP rate limiting close to the probes (see analysis in Subsection 7.1.4.1), and trigger anti-flood and DDoS alerts [82].

One of our design goals is to use as few measurements as possible. Hence, ideally we would like the set of landmarks used as measurement sources to be small and to contain the landmark λ with the lowest RTT γ to the target τ . Single-radius tries to achieve this by selecting landmarks that are topologically near τ by selecting probes either in AS(τ ), the AS of the target IP, or in AS-level neighbors of AS(τ ). Single-radius creates a list of cities lc and a list of ASes la from which to select the landmarks. The lists are created in the following steps:

Geographical and Topological Analysis of Internet Routing

G

EOGRAPHICAL AND

T

OPOLOGICAL

A

NALYSIS OF

INTERNET

ROUTING

Summary

T

List of publications

List of Abbreviations

Contents

CHAPTER

1

Introduction

CHAPTER

2

Definitions and Reference Scenario

2.1

2.2

2.3

2.4

2.5

2.6

CHAPTER

3

Geolocation of Internet Infrastructure

3.1

3.2

3.3