Master’s Degree in Computer Science
RT-Mongo: Differentiated Real-Time
Performance on a NoSQL Database
Author:
Remo Andreoli
Supervisors:
Dino Pedreschi
University of Pisa
Tommaso Cucinotta
Scuola Superiore Sant'Anna
ACADEMIC YEAR 2019/2020
The advent of Cloud Computing and Big Data brought several changes and innovations to the landscape of database management systems. Nowadays, a cloud-friendly storage system is required to reliably support data that is in continuous motion and of previously unthinkable magnitude, while guaranteeing high availability and optimal performance to thousands of customers. In particular, NoSQL database services are gaining momentum as a key technology thanks to their relaxed requirements with respect to their relational counterparts, which are not designed to scale massively on distributed systems. Most research papers on the performance of cloud storage systems propose solutions that aim to achieve high read and write throughput, while neglecting the problem of controlling latencies for specific users or queries. The latter research topic is particularly important for real-time applications, where task completion is bounded by precise timing constraints. In this thesis a set of modifications is proposed to the popular MongoDB NoSQL database software that reduces temporal interference among competing requests on a per-client/request basis. Extensive experimentation with synthetic stress workloads demonstrates that the proposed solution is able to ensure that requests with higher priorities achieve reduced and significantly more stable response times with respect to lower-priority ones.
1 Introduction
  1.1 Achievements
  1.2 Outline
2 Background
  2.1 Distributed databases
  2.2 SQL vs NoSQL
  2.3 MongoDB
  2.4 Scheduling in Linux
  2.5 Thread affinity
3 MongoDB internals
  3.1 MongoDB Architecture
  3.2 WiredTiger
  3.3 The mongod process
    3.3.1 Execution models
    3.3.2 Life cycle of an incoming connection
    3.3.3 Replication process
4 Implementation
  4.1 Better response times through niceness
  4.2 A soft checkpoint system
5 Experiments
  5.1 Testing framework
  5.2 Test environment
  5.3 Pre-assessment
  5.4 Assessment
    5.4.1 Single-core database, weak data durability
    5.4.2 Single-core database, strong data durability
    5.4.3 Multi-core database
    5.4.4 Performance differentiated per priority level
Introduction
The importance and disruptive capability of Cloud Computing has become increasingly evident throughout the last decade for the development of modern web applications. Quoting the definition given by the NIST [25], Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. The pooled computing resources offered by cloud providers have been proven to offer tremendous benefits to business organizations, relieving them from the burden of investing in dedicated data centers. One of the main techniques used to save on operational costs is server consolidation [4], which maximizes resource utilization by consolidating multiple under-utilized machines into Virtual Machines running on a single one. Unsurprisingly, virtualization is the single most effective way for Cloud services to reduce IT expenses while boosting efficiency and agility for businesses of all sizes.
However, providing a reliable service through the Cloud is not easy. Traditionally, reliability focused on single-instance hardware improvements: a classic Oracle server or Sun Workstation would be provided with hot-swappable and hot-pluggable components (disks, CPUs, network interfaces, ...) that could be easily replaced in case of failure. In a cloud environment, reliability is achieved by replicating on different Availability Zones, namely multiple servers spread across isolated locations within a region, a nation or even a continent. In this context a key component of cloud architectures is represented by database systems offering replicated and infinitely scalable data storage services. The idea is simple: cascading failures on different data centers with different components and independent power systems are virtually impossible. In short, a reliable Cloud infrastructure has to be deployed as a big distributed system.
The next requirement is to ensure high performance: services must scale resources up and down to handle high and dynamic loads that depend on current customer needs. Most of the works in the literature [47, 38, 10] [37, 34, 43] address the performance problem for general-purpose applications and focus on maximizing overall throughput, neglecting the issue of providing customizable and reduced response times to individual users/customers. It is natural that the response time for an "average" task tends to decrease as the overall throughput increases, but the trade-off between throughput and response time becomes evident when the service has to balance the ongoing need for high throughput with an immediate need to perform a real-time activity: the more resources are applied to the latter, the larger the impact on throughput. Conversely, the fewer resources are granted to the real-time task, the longer it takes to complete. However, a real-time task must complete within a specified time window, the deadline, to be considered valid, and thus requires predictable performance guarantees.
While a real-time application with hard constraints, such as a military control system, is deployed on a closed environment by design, a soft-constrained application, such as online video streaming or gaming, would benefit significantly from the Cloud due to its highly dynamic workload. However, as shown in [13], Cloud Computing presents performance instabilities due to the noisy neighbour, or temporal interference, problem: the platform is not able to comply with the real-time requirements of the service because it is executed on heavily loaded servers that handle concurrent activities from several independent users. This performance inconsistency is a huge problem, because it makes the Cloud unpredictable. There are four ways to tackle this issue [22]:
• Compensation: Add more computing resources to the single server (scale vertically) or balance the workload between multiple servers (scale horizontally).
• Physical resource separation at host level: Provide a single-tenant cloud architecture with dedicated instances per customer.
• Physical resource separation at resource level: Keep the classic multi-tenant cloud architecture, but offer dedicated computing resources: dedicated CPUs, dedicated DRAM memory controllers, dedicated Network Interface Cards with SR-IOV [11], and so forth.
• Software mechanisms: Exploit the operating system by providing QoS-aware schedulers for CPU, networking and disk access. This is the cheapest solution and has been found to be quite effective.
Only a few industrial cloud systems provide a solution to this problem. The most popular example is DynamoDB [36], a proprietary fully-managed NoSQL database offered by AWS to provide a fast and reliable cloud storage system [46] with predictable performance at any scale. In particular, the customers can specify the throughput requirements (read and write capacity) they require for a given table, and DynamoDB will allocate sufficient resources to the table to predictably achieve real-time performance and stable latencies under 10 ms. Thus the problem is tackled by compensation and physical resource separation at the infrastructure level. There are several optimizations performed by DynamoDB in order to provide a performance guarantee: re-configuration of the Virtual Machine topology or of the network routes to the customers, capacity planning on single SSDs, and so forth.
In this thesis, the problem of achieving differentiated performance on a storage system is tackled by using the prioritized access principle, a trade-off widely adopted in the design of real-time systems: higher priority activities are allowed to take more resources than lower priority ones, and are sometimes even able to preempt them by revoking their resource access in the middle of a task, or to starve them for arbitrarily long time windows. While the priority-based access technique by itself may not be enough to achieve predictable performance, when coupled with strong real-time design principles it is possible to ensure correct operation and sufficient resources to all hosted real-time activities [3]. Since the goal of the proposal is to provide a Quality-of-Service feature that reduces response times for prioritized users/requests, not a brand-new data store, the modifications have been performed on the MongoDB open-source NoSQL database. Developed by MongoDB Inc., it gained popularity as a flexible, cheap and easy-to-use solution that is able to provide all the capabilities needed to meet the complex requirements of modern applications: indeed, it is the most wanted database among developers (Figure 1.1) and the "ace in the hole" of the NoSQL paradigm in most comparative studies found in the literature [28, 31, 15, 16].
Figure 1.1: Top databases that developers would love to use (MongoDB 19.4%, PostgreSQL 15.6%, Elasticsearch 12.2%, Redis 12.2%, Firebase 9.2%, MySQL 9%).
As of 2020, MongoDB comprises more than 2 million lines of code, is maintained by several thousand employees, has over 18,000 paying customers, including big ones like Google and Facebook, and has been downloaded 110 million times worldwide.
1.1 Achievements
The result of this work is a custom version of the base software, named RT-Mongo, which comprises a set of modifications that fully integrate with the internal architecture of the database and provision it with a differentiated performance mechanism that creates an on-demand prioritized access channel with reduced response times. This QoS feature exploits two design choices regarding the connection pooling execution model and the concurrency control mechanism, which have been identified thanks to an in-depth study of the almost one million lines of code that make up the core module. When disabled, this mechanism does not affect the expected performance and reliability capabilities of the base software, allowing RT-Mongo to be used as-is for general-purpose applications, but also for applications with soft real-time requirements, provided that the underlying infrastructure complies with real-time design principles.
1.2 Outline
The content of this thesis is organized as follows:
• Chapter 2 provides an introduction to the topics relevant to this thesis and the reasoning behind some implementation and experimentation choices. The first two sections briefly describe the current landscape of database management systems, the third section presents an introduction to the MongoDB software, and the last two sections describe the tools provided by the Linux operating system that have been used to implement and assess the proposal.
• Chapter 3 is the result of a direct exploration of the database source code and provides a technical introduction to the core components involved in the design of the proposal.
• Chapter 4 describes how the proposed modifications have been integrated into the internals of MongoDB and how they work.
• Chapter 5 describes a subset of the experiments performed to assess the correctness and efficiency of the proposed modified version of the MongoDB software.
• Chapter 6 summarizes the work done and the achieved results, and addresses future work.
Background
2.1 Distributed databases
In a traditional database system the data and the database management system (DBMS) reside in a single location. A centralised database is easy to back up, manage and keep secure. However, it provides minimal to no data redundancy, is highly dependent on network connectivity, and is a dangerous single point of failure if there is no fault-tolerance mechanism in place. As a result, a centralised architecture is not suitable for the high reliability, scalability, and availability requirements of Cloud computing [29], on which most modern business applications depend.
According to the definition given by [20], a distributed database is a collection of multiple, logically interrelated databases located at the nodes of a distributed system: in short, it is a collection of autonomous databases that are interconnected by a computer network and share a common structure and logical access interface to a set of data. A distributed database is bundled with a distributed DBMS that manages the access so that the distributed database appears as one single entity to users. The key concepts [41] for disaggregating a centralised database into distributable components are data fragmentation and data replication:
• Data Fragmentation: Break a single relation or table into partitions, without any loss of information, and distribute them on multiple nodes. There are two ways to do this: split the table into subsets of rows (horizontal fragmentation) or into subsets of columns (vertical fragmentation). The first approach allows the data set to be expanded arbitrarily and is suitable for the "infinitely scalable storage" that most cloud platforms advocate. The second one allows fast access to a few attributes and is mostly used by wide column data stores (see Section 2.2) for analytical applications. Note that the two approaches can be combined, as proposed in [26, 33]. Two simple examples of data fragmentation are shown in Figures 2.1 and 2.2.
• Data Replication: Duplicate existing data on multiple nodes to enable redundancy and maximize availability. This technique requires a consensus algorithm [19, 27] to synchronize the nodes (also called replicas) in order to avoid inconsistencies in the data copies.
Figure 2.1: Horizontal scaling row-by-row
Figure 2.2: Vertical scaling column-by-column
Without data fragmentation, a distributed database does not scale; without data replication, a single failure may render the data set unusable. The advantages of a distributed database system can be summarized in four points:
1. Transparency: The added complexity of managing fragmented and replicated data is resolved by the distributed DBMS software transparently to the user, who keeps querying as in a centralised database.
2. Reliability: Through the support of distributed transactions and thanks to replication, a distributed system is able to eliminate single points of failure.
3. Performance: The parallel nature of a distributed database enables the parallel execution of multiple queries via concurrent transactions. Moreover, data fragmentation enables data locality (storing data in close proximity to its point of use), while data replication reduces access delays and the CPU and I/O contention on a single node.
4. Scalability: A system expansion is easy to accommodate by simply adding new processing elements. This is the reason why distributed databases gained much interest in contexts where horizontal scaling is key, such as in cloud computing.
However, there are also as many disadvantages:
1. Complexity: Data transparency and coordination across several sites requires complex (and expensive) software.
2. CAP theorem: Data redundancy ultimately requires determining a trade-off between consistency and availability due to Brewer’s CAP theorem [12].
3. Processing overhead: Even simple operations may require a large number of communications to provide consistency in data across the sites.
4. Distribution overhead: A linear performance improvement may not be possible, due to the necessary communication overheads of the distribution [7].
2.2 SQL vs NoSQL
The Structured Query Language (SQL) is one of the most versatile and widely used query languages, adopted as the standard way to define and manipulate data in relational databases, which is why they are often simply called SQL databases. A relational database organizes the data using Codd's relational model [6], where it is represented in the form of one or more tabular schemas that can be linked based on attributes common to each table. Traditionally, relational databases were designed to run on a single large server as general-purpose data stores.
A contrast to this "one size fits all" philosophy is the Not Only SQL (NoSQL) approach, which refers to any alternative to the traditional row-and-column data model. This methodology highlights the need for specialized data models to satisfy the requirements of web 2.0 and cloud data management: horizontal scalability, fault-tolerance and availability. This is achieved by relaxing the strict requirements of relational databases. The main NoSQL data store categories are:
• Key-value store: A flexible and scalable schemaless model that represents data as key-value pairs, where the key uniquely identifies the value. A popular non-relational key-value database is the DynamoDB [36] fully-managed 24/7 commercial database service offered by AWS.
• Document store: An "enhanced" key-value store where the keys map to values of document type, such as JSON or XML. Since documents are self-describing (namely, they store both data and metadata), they allow for more advanced queries based on the content. Documents can be embedded in order to represent complex hierarchical relationships with a single record. A popular document-oriented database is the open-source MongoDB [32] software, which is introduced in Section 2.3 and described more in depth in Chapter 3.
• Wide Column store: A mixture of relational database and key-value store properties: data is presented as schemaless tables of wide columns and rows, where "wide" means that a column does not necessarily contain only atomic values, but possibly multiple key-value pairs. Moreover, the columns' name and format may vary from row to row. The first working prototype of this paradigm is the fully-managed Bigtable [5] commercial database offered by Google.
• Graph store: Data is represented and stored as graphs, which allows for fast processing of graph-like queries through a graph query language. It is a popular paradigm among web-based applications that model a complex network, such as social networks, because it overcomes the overhead of joining multiple tables. An example of a graph-based database is Neo4j [44].
A visual representation of each system is shown in order in Figure 2.3, but notice that there are also hybrid solutions such as NewSQL databases [35], which combine the scalability of NoSQL systems with the ACID guarantees (namely, strong consistency and durability) typical of relational databases.
Figure 2.3: Bird’s eye view of the NoSQL data store categories
2.3 MongoDB
MongoDB [32] is an open-source NoSQL document-oriented database developed by MongoDB Inc. Previously known as 10gen, the company began developing the database in 2007 as free, open-source software with paid commercial support and training services [30]. The name MongoDB derives from humongous, which can be literally translated to "extraordinarily large", to highlight the fact that the database's strength is to store and serve large amounts of data. The ability to handle large-scale traffic, which is a typical use case for content-serving applications such as web sites, is something that relational database technologies are not able to easily achieve: for instance, MongoDB's access speed is 10 times that of MySQL when the data exceeds 50 GB [31].
MongoDB uses BSON as its document format, a binary-encoded serialization of JSON designed to be efficient both in storage space and scan speed. A BSON document contains multiple named fields and an identifier generated automatically by MongoDB to uniquely identify it. A field comprises three components: a name, a data type and a value. BSON supports complex JSON data types, such as arrays and nested documents, but also additional types like binary, integer, floating point or datetime. An example of a BSON document is shown in Listing 2.1.
Documents are stored in collections, which are analogous to tables in relational databases, but since MongoDB is schemaless, any type of document can be stored in a collection (although similarity is recommended for index efficiency). In turn, collections are stored in databases.

Listing 2.1: Simple BSON document

{
    _id   : 535485,                          // generated by MongoDB
    name  : "Remo",
    exams : [
        { year : new Date("2019-12-20"), course : "ASE", grade : 29 },
        { year : new Date("2019-11-07"), course : "PDS", grade : 30 }
    ]
}

The full terminology comparison between relational databases and MongoDB is shown in Table 2.1 below.

RDBMS               MongoDB
Database            Database
Table               Collection
Tuple/Row           Document
Attribute/Column    Field
Table Join          Embedded Documents

Table 2.1: Terminology comparison.
To interrogate the database, users are supplied with a query language expressed in JSON that supports CRUD operations (Create, Read, Update, and Delete), but also more complex ones, such as indexing and aggregation: thus a query may return documents, specific fields, or complex aggregations of values within multiple documents. The general form of a query to a MongoDB database is:
db.collection.function(JSON expression)
where db is a global variable referring to the database we want to query, function is the database operation and JSON expression is a document stating the criteria to select data, thus providing a unified approach to both storing and manipulating the data. A user can interact with a MongoDB system via API libraries, called drivers, that are used to connect and interact with the database. They are easy to integrate within user applications and are available in 13 programming languages, including C/C++, Java and Python (a short driver example is sketched after Figure 2.4). Alternatively, MongoDB provides an interactive shell called mongo. A MongoDB database system is designed as a configurable client-server architecture. A simple deployment, which does not require redundancy or scalability, is made of:
• A mongod instance, the core database service that handles requests, manages data access, and performs background management activities.
• Multiple instances of a user application interacting with the mongod service through one of the provided drivers.
A stylized depiction of the scenario is shown in Figure 2.4.
Figure 2.4: A simple MongoDB system.
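As an illustration of the driver-based interaction described above, the following minimal Python sketch uses the official pymongo driver against a standalone mongod; the host, database and collection names are arbitrary examples.

from pymongo import MongoClient

# Connect to a standalone mongod (assumed to listen on the default port).
client = MongoClient("mongodb://localhost:27017")
db = client["university"]

# Driver-side equivalent of db.students.insertOne({...}) in the mongo shell.
db.students.insert_one({"name": "Remo",
                        "exams": [{"course": "ASE", "grade": 29}]})

# Driver-side equivalent of db.students.find({name: "Remo"}): a JSON-like
# document states the selection criteria.
for doc in db.students.find({"name": "Remo"}):
    print(doc)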
Since a key requirement of today's web applications is a system able to continuously operate without long downtimes, MongoDB supports redundancy and high availability through the deployment of a replica set, a group of mongod instances that maintain the same data set. A replica set requires the following components:
• A mongod instance named primary node, whose task is to receive all write operations and keep track of the changes to the data set in the operation log, or simply oplog. The latter is actually stored within the instance as a normal collection with a fixed size (capped, in MongoDB's terminology).
• A set of mongod instances named secondary nodes, whose task is to replicate the primary’s oplog to their local one in order to reflect the changes to the data set.
• An optional mongod instance named arbiter node, whose task is to participate in the primary election without providing data redundancy, i.e. it does not hold data.
The members of the replica set communicate frequently via heartbeat messages to detect a topology change and react to it: for instance, if the primary node becomes unavailable, the replica set goes through an election process to determine a new one. Both the primary election and the oplog replication process are built upon a variation of the RAFT consensus algorithm [27]. An example of a 3-member replica set deployment is shown in Figure 2.5 below.
Figure 2.5: A MongoDB system with failover capabilities.
By default, read operations are also directed to the primary node, but it is possible to change the read preference to increase data availability. However, a secondary node may return stale data because the oplog replication is an asynchronous process. A user can specify the level of data durability through the write concern option, which defines the number of replica set members that must acknowledge a write operation before the operation returns as successful. The default value of 1 provides a "weak" write concern (or data durability) requirement, because only the primary node acknowledges the write operation before the result is sent back to the user. A value greater than 1 provides a "strong" write concern (or data durability) requirement, because the primary node and a number of secondaries have to acknowledge the write operation. A high value deteriorates the throughput, while a low one leads to a higher risk of data loss in case of failure. An analogous feature is available for read operations, the read concern, which in turn is used to control data consistency. Both options should be tuned according to the specific use case.
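Both options can be set on a per-collection or per-operation basis through the drivers. The sketch below, assuming a replica set named rs0 and the same example collection as before, shows how a pymongo client could request a write concern of 2 and tolerate stale reads from a secondary.

from pymongo import MongoClient, ReadPreference
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client["university"]

# "Strong" durability: the insert returns only after two replica set members
# (the primary plus one secondary) have acknowledged it.
durable = db.get_collection("students", write_concern=WriteConcern(w=2))
durable.insert_one({"name": "Remo"})

# Reads may be served by a secondary, at the risk of returning stale data.
relaxed = db.get_collection("students",
                            read_preference=ReadPreference.SECONDARY_PREFERRED)
doc = relaxed.find_one({"name": "Remo"})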
Modern web-based applications can drain the capabilities of a single machine or replica set due to very large data sets and high-throughput operations. Two methods to avoid bottlenecks are vertical and horizontal scalability: while the former consists of a simple increase of the server's physical capabilities, the latter consists of deploying a distributed database to exploit the advantages already discussed in Section 2.1 by partitioning the data set and load over multiple servers. MongoDB supports horizontal scalability through sharding (namely, horizontal fragmentation): in particular, the users can create a cluster of many machines, called a sharded cluster, and break up a collection into subsets of data to be distributed across it according to a user-defined subset of fields in the documents, called the shard key. The MongoDB software automatically manages load and data balancing, document redistribution and routing of read and write operations to the correct machine by introducing the following components (a driver-side sketch of the corresponding setup commands follows the list):
• shard, a mongod instance or replica set containing a subset of the sharded data.
• mongos, a read and write operation router that sits in between the user application and the sharded cluster. Its goal is to keep a "table of contents" that states which shard contains which data, hiding from the user application the fact that the cluster is made of multiple shards. Indeed, from the perspective of the application, a mongos instance behaves identically to a single mongod service or a replica set.
• config server, a specialized mongod instance or replica set that stores metadata and configuration settings in order to organize and keep track of the state of all data and components within the sharded cluster. It is used by the mongos instances to route the operations to the correct shards. For example, the metadata includes the list of chunks on every shard and the ranges that delimit the chunks. An example of a sharded cluster made of one mongos router, one config server and multiple shards is given in Figure 2.6.
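From the application side, enabling sharding boils down to a couple of administrative commands issued through a mongos router. The following pymongo sketch is illustrative only: the database/collection names and the shard key are arbitrary examples.

from pymongo import MongoClient

# The client must be connected to a mongos router, not to a plain mongod.
client = MongoClient("mongodb://mongos-host:27017")

# Allow the "university" database to distribute its collections across shards.
client.admin.command("enableSharding", "university")

# Shard the "students" collection using "student_id" as the shard key.
client.admin.command("shardCollection", "university.students",
                     key={"student_id": 1})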
To conclude this section, a summary of the key features of MongoDB:
1. It is open-source software released for free under the GNU AGPL license.
2. It offers the benefits of a non-relational database, but it also supports complex features that are considered standard in relational databases, such as transactions and indexing.
3. The BSON data format provides efficient support to complex data types.
4. It supports durability/availability and easy horizontal scaling through replication and sharding.
5. It provides a powerful query language that allows simple operations even in distributed scenarios.
6. It is capable of high-speed access to huge data sets.
2.4 Scheduling in Linux
A modern Operating System manages the CPU resources through a scheduler, which is in charge of granting CPU time to the tasks. It chooses the next task to be executed depending on a scheduling policy. For instance, the Linux multitasking Operating System provides a scheduling framework that comprises three categories, each suitable for specific use cases:
• sched_fair: fair scheduling, where each task receives a "fairly weighted" amount of CPU time.
• sched_rt: fixed-priority real-time scheduling [14].
• sched_deadline: reservation-based scheduling, where a task receives a guaranteed run-time within a given deadline [18].
The various implementations are schematically shown, in order from lowest priority, in Figure 2.7 [23]. The two latter categories are used in specific scenarios where the total real-time workload is defined and put in relation with the needs of the other tasks. Failing to properly analyse the requirements would cause starvation, or stalls in the worst case, that would compromise the functioning of the entire system. Nevertheless, there are studies in the literature that explore these scheduling strategies to deploy highly time-sensitive applications on the Cloud [8, 9].
Figure 2.7: Linux scheduling classes.
Since MongoDB is a general-purpose software designed to run in typical scenarios with no real-time requirements, the proposal has been tested with the default implementation of the sched_fair scheduling category of the Linux Operating System, which is briefly described below. The "Completely Fair Scheduler" [45, 40] tries to eliminate the unfairness of a real CPU by emulating an ideal, precise multi-tasking CPU [45] in software: namely, a CPU where each task/thread runs in parallel and is given an equal share of "power" (see Figure 2.8). Naturally, this is an ideal scenario that highly depends on the number of available physical cores and tasks to run.
The CFS scheduler tackles this problem by giving CPU access to the thread that has waited the most. In particular, the scheduling order is defined according to the lowest vruntime, a per-thread multi-factor parameter that measures the amount of "virtual" time spent on the CPU. There are several parameters taken into account when computing this value, one of which is the nice level. This level alters the scheduling order by weighting the vruntime value: a numerically large nice value increases the willingness of a thread to give precedence to others. A valid nice level ranges from -20 (highest priority) to 19 (lowest priority). It is possible to manipulate the scheduling priority of a process with the nice command, or at run time with the setpriority system call. Chapter 4 will describe how these levels have been used to achieve differentiated performance.
Figure 2.8: Ideal precise multi-tasking CPU.
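As a small illustration of the nice mechanism described above, the Python snippet below reads and changes the nice value of the calling process through the getpriority/setpriority system calls; it is roughly equivalent to launching a program with nice -n <value>. Lowering the value below the current one requires root privileges or the CAP_SYS_NICE capability.

import os

# Read the current nice value of this process (0 is the default).
print(os.getpriority(os.PRIO_PROCESS, 0))

# Become "nicer": lower the scheduling priority to the minimum.
os.setpriority(os.PRIO_PROCESS, 0, 19)

# Raise the priority to the maximum; this call fails without
# root privileges or CAP_SYS_NICE.
os.setpriority(os.PRIO_PROCESS, 0, -20)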
2.5 Thread affinity
A key feature that was used to assess the correctness of the proposal is thread affinity. This term refers to the possibility of controlling on which core(s) a given thread can be executed by the OS. The chapter related to the experiments (see Chapter 5) exploits this mechanism in two ways:
• The conventional way: to improve thread/process performance by reducing cache misses. Indeed, the remnants of a thread that was run on a given processor may remain in that processor's state (for example, the data in cache memory) after a context switch, thus reducing the overheads if that thread gets re-scheduled on the same one. To exploit the full potential of thread affinity, the number of pinned threads should not exceed the number of physical cores.
• The unconventional way: to simulate a high interference scenario by pinning the threads to the same core(s).
On Linux systems, a simple way of controlling process/thread affinity is through the taskset command. Quoting the related man page: "taskset is used to set or retrieve the CPU affinity of a running process given its PID or to launch a new COMMAND with a given CPU affinity. CPU affinity is a scheduler property that 'bonds' a process to a given set of CPUs on the system. The Linux scheduler will honor the given CPU affinity and the process will not run on any other CPUs". It is also possible to exploit the POSIX Threads library to change the thread affinity at run time using pthread_setaffinity_np.
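The same affinity mask can also be manipulated from Python through the sched_setaffinity wrappers, which is handy in test scripts; the core numbers below are arbitrary examples.

import os

# CPUs on which the calling thread is currently allowed to run.
print(os.sched_getaffinity(0))

# Pin the calling thread to cores 0 and 1; threads created afterwards inherit
# this mask. Roughly equivalent to: taskset -cp 0,1 <PID>
os.sched_setaffinity(0, {0, 1})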
For reference, Figure 2.9 illustrates the basic architecture of a modern CPU: it comprises a number of physical cores that, through hyperthreading [24], can simultaneously handle several contexts. There are multiple studies in the literature [39] about the possible performance gain or degradation of such a technique. For instance, the example CPU is made of 16 physical cores for a total of 32 logical cores (2 contexts for each physical core).
Figure 2.9: A modern CPU architecture.
In NUMA architectures, the contention on the memory controller when accessing the main memory on cache misses is mitigated by the use of an architecture that connects, for example, a physical CPU socket with a close-by memory controller (the ensemble being known as a NUMA node). With a proper configuration of the platform and memory allocator (through cpusets in Linux), one can realize a partitioned system where applications deployed on different NUMA nodes cannot interfere with one another. The experiments analyzed in Chapter 5 exploit this resource isolation mechanism by performing all the tests on two NUMA machines.
MongoDB internals
This chapter expands on Section 2.3 by giving a brief overview of the MongoDB architecture, then describes the internal components and mechanisms involved in the development of the proposal. The intent is to give a partial view of the technical details needed to understand the mongod service, which is the component that has undergone most changes. The reference version is MongoDB 4.2, which can be found on GitHub.
Note that the information below has been gathered by looking at the source files, the technical documentation, the MongoDB developer newsgroup, code comments and an email exchange with a developer: what is stated in this chapter is not to be considered unquestionably complete nor correct, but as a starting point for the development of core features in MongoDB.
Figure 3.1: MongoDB Architecture.
3.1 MongoDB Architecture
A MongoDB database is designed as a multi-layer application (see Figure 3.1):
• Storage Layer: It comprises the software modules to create, read and update data in a database. It is made of several pluggable storage engines with different storage strategies. As of MongoDB version 4.2, the default implementation is WiredTiger, which is suitable for general-purpose workloads.
• Security Layer: It comprises various key features, such as authentication, access control and encryption, to secure a MongoDB deployment at every layer.
• Management Layer: It comprises various free or enterprise services to manage and monitor a MongoDB deployment.
Section 2.3 already described the data model and query language layers.
3.2 WiredTiger
WiredTiger is the default storage engine and is suitable for general-purpose applications. It is capable of sustaining a high write throughput thanks to the Multi-Version Concurrency Control (MVCC) [2, 1] mechanism applied at document level, which allows multiple concurrent clients to insert different documents into the same collection without incurring locks. When a document is updated, WiredTiger stores the changes in an idempotent format inside a tree structure residing in cache, so the document itself is not overwritten. Consistency is maintained by showing the connected users a version of the database at a particular instant of time, thus any changes made by a writer will not be seen by other users until the operation has completed without conflicts. Moreover, WiredTiger is actually able to avoid locking even the single document, thanks to an optimistic approach to the resolution of write conflicts: if there is a conflict between two write operations (that is, they simultaneously update the same document), one will raise an error, causing MongoDB to transparently retry it. The gain in performance over the previous default storage engine, MMAPv1, is noticeable, because the latter updates the documents in place and requires a collection-level locking mechanism to avoid inconsistencies. On the other hand, MMAPv1 handles high read loads better, because it does not have to traverse the tree structure to retrieve the correct document state.
3.3 The mongod process
The mongod process is the module in charge of performing all the core database activities for a single machine: it handles the requests, applies changes to the storage unit and performs management and monitoring activities. For this reason it will also be called server or simply database throughout the chapter.
A mongod instance is a multi-threaded program written in C++, usually run as a daemon process. In order to keep track of the information and procedures needed by its threads to carry out their duties, mongod maintains a set of contexts organised hierarchically. The root context is the ServiceContext, which is created during the service startup and is in charge of:
• The creation and deletion of Client objects (see Section 3.3.2), which represent the server-side view of a client session/connection.
• The creation of OperationContexts, the set of information required to handle the operation currently issued by the clients.
• The initialization of the ServiceExecutor (see Section 3.3.1), the set of client worker threads used to support connection pooling. MongoDB provides several thread-pool based implementations of this component.
• The initialization of the ServiceEntryPoint, the component that prepares the server-side worker thread that handles an established user connection. In short, it is the entry point from the transport layer into the mongod instance.
3.3.1 Execution models
MongoDB provides a standard interface, called ServiceExecutor, used to implement the threading and execution models supported by mongod to execute client requests. There are three implementations:
• ServiceExecutorSynchronous: It is a thread pool managed on a per-connection basis. Each incoming connection is given its own dedicated worker thread that runs the corresponding ServiceStateMachine. As of MongoDB version 4.2, it is the default execution model.
• ServiceExecutorReserved: Similar to the previous implementation, but it also reserves threads ahead of time in case the system reaches its limits on thread creation. Used only in specific scenarios that usually do not affect normal customers.
• ServiceExecutorAdaptive: It is a thread pool managed on a per-request basis. It uses asynchronous networking to assign a thread to a connection only from when a new request is received until the response is sent. However, as of MongoDB version 4.2, it has been removed due to bad performance.
Since the latter implementation has been discontinued, and there is no significant difference between the first two, the proposal has been designed to run on the default implementation only and has not been tested on the others. Thus, for the rest of this writing, it is assumed that a ServiceExecutor worker thread corresponds to a single client connection throughout its "life".
3.3.2 Life cycle of an incoming connection
A client communicates with the database server through a regular TCP/IP socket using a simple socket-based, request-response style protocol called the MongoDB Wire Protocol. In particular, every MongoDB deployment is given a public IP address and a port through which the clients establish a connection with the database.
When an incoming connection is detected, mongod instantiates a new session and creates the corresponding server-side representation of the client, namely the Client object, which keeps track of the information required by the client-server interaction until the connection is closed. Note that a client does not necessarily correspond to a user application connected remotely to the database: it could also be a secondary node asking for the latest oplog entries, an internal thread monitoring the read performance, a config server trying to balance the number of chunks across a sharded cluster, and so forth. Therefore, a Client object corresponds to any entity that wishes to interact with the storage engine. For the sake of simplicity, remote or user client will refer to the client-side representation of the interaction, while client will refer to the server-side representation of an incoming connection, regardless of the entity. Based on the context, the term internal client will be used to refer to any entity that is not an "end user".
The state of a single client connection is modelled as a finite-state machine called the ServiceStateMachine: each time a new connection to the database is established, the ServiceEntryPoint creates a new state machine to be executed by the ServiceExecutor. Note that, since this proposal has been designed under the assumption of a synchronous execution model, it is safe to assume that a given ServiceStateMachine instance corresponds to one and only one Client object, and thus a given incoming connection corresponds to a given ServiceExecutor worker thread only. The life cycle of a client comprises seven possible states, each of which corresponds to a set of activities performed by the underlying client worker thread:
• Created: A connection between a client and the database (namely, a session) has been made, but no operations have been performed yet. It does not correspond to any particular activity other than the creation of a Client object.
• Source: Request a new message from the remote client.
• SourceWait: Wait for the new message to arrive from the client.
• Process: Process the request enveloped in the message through the database. It comprises all the operations needed to comply with the user and storage engine requirements: query validation and optimization, waiting for replication in case of strong data durability, and so forth.
• SinkWait: Wait for the database result and, eventually, send it back to the client.
• EndSession: Start of the session closing procedure, which can be caused by a voluntary disconnection or an error. Any other action is now invalid.
• Ended: End of the session closing procedure, where all the cleanup activities have been completed. Any following transition is illegal.

Figure 3.2 below shows a graphical representation of the valid state transitions of a ServiceStateMachine. Every state, except the Created and Ended states, converges to EndSession in case of error: this is because these two states simply act as starting and ending points of the client's session.
There are three possible valid paths:

• Standard Remote Procedure Call (RPC): Standard client-server interaction: the client sends a request and waits for a response; the server waits for a request, processes it and then sends the response back. This is the typical transition path of an external client.
  ... → Source → SourceWait → Process → SinkWait → Source → ...
• Fire-and-Forget: Similar to the previous interaction, but the client does not wait for an acknowledgement by the server.
  ... → Source → SourceWait → Process → Source → ...
• Exhaust: Specific client-server interaction that automatically exhausts a cursor, the object used to iterate over query results. An example of usage is the find request: it returns a cursor that stores an initial batch of at most 101 documents and up to 16 megabytes total size. To request the next batches, the client has to issue one or more getMore commands. To avoid the additional communication overhead, this particular interaction allows the client to ask the server to automatically create "synthetic" getMore requests. This is the typical transition path of an internal client representing a secondary node connection.
  ... → Source → SourceWait → Process → SinkWait → Process → SinkWait → ...
Figure 3.2: Client’s lifecycle state machine
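The following Python sketch (not MongoDB source code) condenses the states and valid transitions just described, including the EndSession error edge available from every intermediate state; it is purely illustrative.

from enum import Enum, auto

class State(Enum):
    CREATED = auto()
    SOURCE = auto()
    SOURCE_WAIT = auto()
    PROCESS = auto()
    SINK_WAIT = auto()
    END_SESSION = auto()
    ENDED = auto()

# Valid transitions; every intermediate state may also fall into END_SESSION
# on error, while CREATED and ENDED are only the start and end points.
VALID = {
    State.CREATED:     {State.SOURCE},
    State.SOURCE:      {State.SOURCE_WAIT, State.END_SESSION},
    State.SOURCE_WAIT: {State.PROCESS, State.END_SESSION},
    State.PROCESS:     {State.SINK_WAIT,      # standard RPC
                        State.SOURCE,         # fire-and-forget
                        State.END_SESSION},
    State.SINK_WAIT:   {State.SOURCE,         # standard RPC
                        State.PROCESS,        # exhaust ("synthetic" getMore)
                        State.END_SESSION},
    State.END_SESSION: {State.ENDED},
    State.ENDED:       set(),
}

def transition(current: State, nxt: State) -> State:
    if nxt not in VALID[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt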
The set of modifications proposed by this thesis mainly revolves around manipulating the business logic behind the ServiceStateMachine states.
3.3.3 Replication process
As already briefly discussed in Section 2.3, MongoDB provides a level of fault tolerance against single-server failures by storing multiple copies of data on different servers. A group of independent mongod instances that maintain the same data set is called a replica set: the primary node handles all write operations and records all changes to its data set in the oplog collection; the secondary nodes replicate the primary's oplog and apply the operations to their data sets asynchronously.
This replication log is a capped collection stored in the mongod instance itself, and thus the records are normal documents, but with a fixed structure. These oplog entries describe the changes to the data set in an idempotent format and are uniquely identified by the opTime field, a tuple consisting of two values: a timestamp and a node-specific term that identifies the primary node that serviced the write operation. Therefore this field denotes the order in which the operations have been carried out. Note that without such design choices, WiredTiger would not be able to efficiently keep track of the different versions of a document.
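To give a concrete feel for this structure, the dictionary below sketches how an oplog entry for an insert might look when read through a driver; the exact set of fields varies across MongoDB versions, and the values shown are invented.

from bson.timestamp import Timestamp

# Illustrative oplog entry for an insert; "ts" and "t" together form the opTime.
oplog_entry = {
    "ts": Timestamp(1600000000, 1),        # timestamp component of the opTime
    "t": 3,                                # term of the primary that serviced the write
    "op": "i",                             # operation type ("i" = insert)
    "ns": "university.students",           # namespace: database.collection
    "o": {"_id": 535485, "name": "Remo"},  # the change itself, in idempotent form
}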
A secondary mongod process performs the replication process using the following components:
• OplogFetcher: It fetches the oplog entries from the primary. It connects to the same endpoint as a user connection and issues a number of find and getMore commands. The entries are returned in batches that are then stored in a buffer, called the OplogBuffer.
• OplogBatcher: It pulls the entries from the OplogBuffer and creates the next batch to be applied to the local replica of the data set.
• OplogApplier: It applies the batches created by the OplogBatcher to the local oplog and storage unit. In particular, it manages a thread pool of writer threads that, for the sake of performance, may not respect the chronological order of the operations by applying them in parallel (thus some operations require singleton batches, such as the drop operation).
In theory, the optimization performed by the OplogApplier is a problem, because every member of the replica set would end up with a different representation of the data set, which is equivalent to data loss or corruption. However, the WiredTiger storage engine is able to solve this inconsistency problem thanks to its Multi-Version Concurrency Control mechanism: by using the timestamp stated by the opTime field, it is able to return to a read operation the correct version of the data depending on the instant of time.
Whenever a secondary node finishes replicating a batch, it notifies the primary of the opTime of the last applied entry. This is crucial in scenarios where the users request a certain degree of data durability: the primary node will wait for a number of such notifications before sending a positive response back to a write operation, in order to ensure that the change was replicated to a sufficient number of nodes.
As is to be expected, waiting for the replication process significantly reduces the throughput of the database: an ideal secondary node would have close to no replication lag, namely the time difference between an operation performed on the primary and the same operation replicated on the secondary. The higher the gap, the slower the replication process. Since the oplog is a capped collection, its entries will eventually be overwritten, and thus if the secondary nodes are too slow, the primary node is forced to block new requests to safeguard against data inconsistencies. Two simple solutions to the replication lag problem are: move the lagging secondary node to a better provisioned machine and/or increase the bandwidth between the primary and the secondary node.
Implementation
This chapter describes the set of modifications designed for the MongoDB database system in order to support differentiated performance for specific users/requests. The proposed modifications comprise:
1. A new database command called setClientPriority that allows the user to prioritize its session.
2. A new field for the database command runCommand that allows the user to prioritize a single request.
Both commands act on the same mechanism and carry out the same activities, but the priority declared with the latter command persists for a single standard RPC transition only (see Section 3.3.2). Therefore a high-priority request issued with runCommand corresponds to a temporary high-priority session.
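From a driver's point of view, both entry points are ordinary database commands. The following pymongo sketch is hypothetical: the exact argument names are not fixed by this chapter, so the "priority" field used below is an assumption for illustration only.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["university"]

# Prioritize the whole session: every following request on this connection
# would be serviced by a high-priority worker thread (argument name assumed).
db.command({"setClientPriority": 1, "priority": "high"})

# Prioritize a single request: the extra field makes the priority persist
# for one standard RPC transition only (field name assumed).
db.command({"insert": "students",
            "documents": [{"name": "Remo"}],
            "priority": "high"})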
Differentiated performance is achieved by lowering the niceness of the client worker threads that execute the ServiceStateMachine of the high-priority sessions/requests (see Section 2.4) and by integrating into the client life cycle a soft checkpoint system that temporarily revokes CPU access from the worker threads managing lower-priority requests/users.
4.1 Better response times through niceness
The setClientPriority and runCommand commands allow the user to have direct control over the nice value of the underlying client worker thread that services its requests. In this way, it is possible to alter the thread scheduling order according to the user needs: for instance, a real-time task querying the database system could increase the priority of its session by setting a lower nice value, thus reducing the response times even in the presence of long-lasting queries. Thanks to the default execution model of MongoDB (see Section 3.3.1), it is trivial to identify the target thread from which to invoke the system call setpriority (see Section 2.4), because throughout the whole life cycle of the session, the underlying worker thread instantiated to handle its connection will always be the same. The proposal currently does not support the full range of valid nice level values, but reduces it to the following three: -20 for high priority, 0 for normal priority and 19 for low priority. By default, every worker thread starts with a nice level of 0.
Note that using a different execution model, one that does not manage the thread pool on a per-connection basis, should not invalidate this modification, because the priority value stated by the user is kept in the Client object throughout the whole session. However, it is safe to assume that the current proposal would incur a slight performance decay due to the frequent setpriority calls that would be required to adapt the thread niceness to the correct value.
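The essence of this mechanism can be summarized by the small Python sketch below (the real implementation lives in the C++ worker thread): on Linux, niceness is a per-thread property, so passing the kernel thread id to setpriority renices only the calling worker thread. The level names and helper function are illustrative, not taken from the RT-Mongo code.

import os
import threading

# The three priority levels supported by the proposal, mapped to nice values.
NICE_LEVELS = {"high": -20, "normal": 0, "low": 19}

def apply_client_priority(level: str) -> None:
    # Kernel thread id of the calling worker thread (Python 3.8+).
    tid = threading.get_native_id()
    # On Linux, PRIO_PROCESS with a thread id affects only that thread.
    os.setpriority(os.PRIO_PROCESS, tid, NICE_LEVELS[level])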
4.2 A soft checkpoint system
The modification described in the previous section is quite effective at reducing response times in use cases where it is not necessary to guarantee data durability, but it proved to be irrelevant in scenarios where a strong write concern is required. The key issue here is that the primary mongod node must wait for a number of replica nodes to finish replicating the change before acknowledging it to the requesting user. However, the replica nodes are unaware of the declared nice levels, because they are deployed on different physical machines, and thus perform an unbiased replication. The following sequence diagram of the communication that takes place between a MongoDB driver and a 3-member replica set should better clarify the problem (Figure 4.1):
1. A user issues a write operation with nice level -20 (high priority) and write concern 2.
2. The primary node prioritizes the corresponding worker thread, applies the change and then awaits at least one response from the secondaries.
3. The secondary nodes are not aware of the declared priority, hence they will replicate without satisfying the differentiated performance requirement.
Figure 4.1: A high-priority user sending a write operation with a write concern of 2 on a 3-member replica set has to wait for the acknowledgement of at least one secondary.
This is particularly noticeable in high-stress scenarios where several concurrent users require strong data durability. To circumvent this problem, the finite-state machine modeling the session life cycle has been integrated with a soft checkpoint system to stop low-priority and normal-priority worker threads from running when a high-priority request is issued. Therefore, the temporal interference problem is solved by temporarily enforcing a prioritized access channel for the time window required to complete the high-priority request, by putting the lower-priority worker threads to sleep. The latter are then notified when the task is completed and can resume. The adjective soft refers to the fact that the proposed mechanism follows these three principles:
1. Do not interfere with the execution of remote clients wishing to declare a higher priority or with the already prioritized ones.
2. Do not interfere with the execution of internal clients (i.e. the incoming secondary node connections or any service activity).
3. Do not block worker threads if the machine is capable of handling the load. This is the case when the number of physical cores is higher than the number of expected high-priority users.
The second point is crucial, because revoking CPU access to an internal client might cause slowdowns or even a deadlock: in fact, blocking a secondary node implies stopping its replication process, since it is unable to retrieve the oplog entries, which in turn may indefinitely block a remote client that requires a strong write concern. For instance, a write concern of 2 would guarantee a deadlock in a 2-member replica set.
As already stated, the ServiceStateMachine representing an external client walks along the Standard RPC transition path until the session is closed by the user itself or by a database error (see Section 3.3.2). The checkpoint system is integrated into this flow in order to dynamically keep track of the number of prioritized sessions and manipulate the underlying worker threads accordingly. In short, each worker thread declares its niceness to the checkpoint system at specific moments of the life cycle in order to (eventually) stop competing for CPU access and give precedence to high-priority sessions. The latter will then be in charge of notifying the stopped worker threads when they have completed their request.
In order to do so, the checkpoint system comprises two primitives to define the entry and exit point to the CPU:
• checkIn: the worker thread declares its nice value, which corresponds to the client/request priority. If it has a higher nice value with respect to other currently running worker threads, it is put into a waiting queue.
• checkOut: the worker thread has completed the activities of the Standard RPC transition path or the client session has been closed, thus it does not need the CPU until the next request arrives and the waiting worker threads can gain CPU access again.
Note that the business logic of the latter method is designed to avoid waking up all the blocked threads if other prioritized sessions are still in progress: in particular, the checkpoint system assumes that each high-priority client session needs its own physical core in order to achieve maximum performance, and thus the number of woken-up threads is tuned accordingly.
A high-level view of this mechanism is depicted in Figure 4.2 together with the modified version of the ServiceStateMachine, highlighting the points where it is invoked (a minimal sketch of the same logic follows the figure). The check-in also takes place during the processing phase, to give precedence to any high-priority requests issued in the meantime. The actual implementation of the checkpoint system comprises a condition variable, a mutex, and three atomic counters to store the number of high-priority, normal-priority and low-priority ServiceStateMachines currently running, thus requiring no locking mechanism on hardware architectures that support atomic operations.
Figure 4.2: Checkpoint system: general view (left), ServiceStateMachine integration (right)
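The sketch below condenses the checkIn/checkOut idea in Python (the real implementation is C++ inside mongod): lower-priority worker threads are parked while high-priority sessions occupy the available cores, and the check-out wakes only as many threads as the spare cores allow. The counter handling and the cores parameter are simplifying assumptions, not the actual RT-Mongo code.

import threading

class Checkpoint:
    def __init__(self, cores: int) -> None:
        self._cv = threading.Condition()
        self._running = {"high": 0, "normal": 0, "low": 0}
        self._cores = cores

    def check_in(self, level: str) -> None:
        # Entry point to the CPU: high-priority threads never wait; the others
        # wait while every core is needed by a high-priority session.
        with self._cv:
            if level != "high":
                self._cv.wait_for(lambda: self._running["high"] < self._cores)
            self._running[level] += 1

    def check_out(self, level: str) -> None:
        # Exit point: the request is done (or the session closed); wake parked
        # threads only for the cores not claimed by high-priority sessions.
        with self._cv:
            self._running[level] -= 1
            spare = self._cores - self._running["high"]
            if spare > 0:
                self._cv.notify(spare)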
Note that a deadlock between two external clients writing on the same collection, or even on the same document, cannot occur thanks to the optimistic concurrency control mechanism embraced by WiredTiger (see Section 3.2) and to the synchronous execution model of the default ServiceExecutor. However, the checkpoint system is not unfeasible even if the latter two conditions are not met: it is possible to support a different concurrency control mechanism by providing a more stateful system that keeps track of lock ownership, and it is possible to avoid indefinite waits in an asynchronous execution model if the worker thread pool is capable of identifying that some threads have been stalled for a long time.
Experiments
This chapter describes results from extensive experimentation on MongoDB, modified as described in Chapter 4. The goal of the proposed Quality-of-Service feature is to enrich MongoDB's capabilities without compromising the expected level of performance. In order to guarantee no hiccups or side-effects caused by the modifications, each test has also been performed on the unmodified version of the database to define a “baseline”. The database stability has been assessed using a testing framework, built in conjunction with the proposal, that tests the functionalities and measures the performance using different synthetic workloads and deployment configurations. In what follows, a brief description of the testing framework, experimental test-bed and set-up is provided, followed by a discussion of some of the most representative results obtained on a real Linux-based deployment.
5.1 Testing framework
The proposal has been bundled with a carefully designed testing framework able to capture the differentiated performance measurements with minimal interference. The framework is designed as a collection of Python and Bash scripts unified under a single entry point, from which the tester can configure the deployment and testing parameters. The deployment parameters are used to customize the IP address, port and data path of each mongod instance. Note that the framework is able to identify and configure multiple instances into a replica set. The testing parameters are described in the following table (Table 5.1):
Parameter   Description
nuser       The number of concurrent users
p           The number of high-priority users (out of nuser)
ninsert     The number of insert operations for each user
w           The write concern of each insert operation
cores       The taskset interval of the mongod instance
nice        The nice value of the mongod instance

Table 5.1: Testing framework parameters.

Each test goes through the following phases:
• Deployment phase: Any previous deployment is cleared and the components of the MongoDB system are re-initialized with the deployment parameters provided by the tester. It is possible to skip this step and re-use the previous configuration.
• Execution phase: The database is tested with a number of insert operations on a single collection concurrently requested by several users. In particular, the execution phase is made of the following substeps:
1. Database preparation phase: If it already exists, the testing collection is dropped and re-built with dummy documents, to avoid the construction overhead of the first insertion.
2. User preparation phase: The users perform some pre-insertions to fill the caches.
3. Insertion phase: The users declare their priority, namely high, normal or low, and finally start the test in unison. The stated priority value will remain until the next execution phase.
The response times, the timestamps of each insert operation and the total execution time are measured using CLOCK_MONOTONIC_RAW, a monotonic clock that is not affected by discontinuous jumps in the system time nor subject to NTP adjustments. Listing 5.1 shows the structure of the insertion phase and how the measurements have been performed.
• Data processing phase: To avoid interference, the measurements are kept in memory until the test is finished; they are then collected and processed to produce different statistics.
Listing 5.1: Insertion phase pseudo-code

set_write_concern()
set_priority_level()
barrier()
t2 = gettime()
for document in document_list:
    t1 = gettime()
    insert()
    timestamp = t1
    resp_time = gettime() - t1
execution_time = gettime() - t2
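For reference, the gettime() primitive used above can be implemented on Linux through clock_gettime(2) with the CLOCK_MONOTONIC_RAW clock. The following is a minimal C++ sketch of how a single insert can be timed; insert_document() is just a placeholder for the actual driver call, not a function of the framework.

#include <time.h>   // clock_gettime, CLOCK_MONOTONIC_RAW, timespec

void insert_document();   // placeholder for the actual insert operation

// Current time in microseconds, read from the raw monotonic clock, which is
// neither affected by discontinuous jumps in the system time nor by NTP
// adjustments.
static long long now_usec() {
    timespec ts;
    clock_gettime(CLOCK_MONOTONIC_RAW, &ts);
    return static_cast<long long>(ts.tv_sec) * 1000000LL + ts.tv_nsec / 1000;
}

// Response time of a single insert operation, in microseconds.
long long timed_insert() {
    const long long t1 = now_usec();
    insert_document();
    return now_usec() - t1;
}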
5.2 Test environment
All the experiments have been performed with different combinations of testing parameters on two twin multi-core NUMA machines provided by the ReTiS Lab of Scuola Superiore Sant'Anna: treebeard and cplex. The two machines have been employed in conjunction for every experiment, connected through a 1 GbE link so that the client-side connections do not interfere with the mongod instances: in particular, the first machine hosted the MongoDB deployment, while the second one hosted the concurrent users. Note that, in order to emulate a distributed replica set, the mongod instances have been deployed on different NUMA nodes (See Section 2.5). The characteristics of both machines are reported in Table 5.2 and a representation of the testing environment is shown in Figure 5.1.
Figure 5.1: Testing environment
CPU specification
Architecture          x86_64
Logical cores         40
Thread(s) per core    2
Core(s) per socket    10
Sockets               2
Model name            Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
CPU MHz               2400
CPU max MHz           3400
CPU min MHz           1200

Table 5.2: CPU specification of both machines
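In all the restricted configurations discussed below, the CPU restriction is applied with the taskset command. For completeness, the same effect could be obtained programmatically through the sched_setaffinity(2) system call; a minimal C++ sketch follows, where the 0-3 CPU range is only an example and not one of the configurations actually tested.

#include <sched.h>    // cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity

// Restrict the calling process to CPUs 0-3, similarly to `taskset -c 0-3`.
// Returns 0 on success, -1 on error (errno is set by the kernel).
int restrict_to_four_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu = 0; cpu < 4; ++cpu) {
        CPU_SET(cpu, &set);
    }
    return sched_setaffinity(0 /* calling process */, sizeof(set), &set);
}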
5.3 Pre-assessment
In order to identify the limits of a given MongoDB deployment, and thus establish a baseline to quantify the stability of the database in stress scenarios, a pre-analysis of the run-time information of the mongod process has been performed.
The number of concurrent “service” tasks run by a mongod process depends on the deployment’s configuration:
• A deployment with no additional features (as shown in Figure 2.4) is made up of a single mongod process running 32 concurrent tasks, excluding those instantiated by the ServiceExecutor (See Subsection 3.3.1) to handle the incoming connections. Without going into too much detail, the tasks can be grouped into two categories: those that handle the Storage Engine and those that monitor resource utilization and database performance. Finally, a single task listens for incoming connections on the corresponding socket. For simplicity, this type of configuration will be called stand-alone or single-instance.
• A deployment with redundancy capabilities (as shown in Figure 2.5) is made of multiple mongod instances arranged in a replica set. In this case, a single mongod process requires an additional set of tasks to handle the replication process and one task per replica set member to handle the communication, doubling the total number of concurrent tasks compared to the stand-alone configuration.
Under normal circumstances, where MongoDB has full access to every physical resource of a large multi-core machine such as treebeard, both deployments handle high-throughput write operations without issues. Thanks to the concurrency control mechanism embraced by WiredTiger, the database is capable of efficiently handling a huge number of subsequent insertion requests from a single user, and since the testing framework supports only insert operations, there are also no write conflicts to solve. Moreover, the CFS scheduler offers an overall equal CPU share to all concurrent user connections.
Identifying the limits under these circumstances requires an extensive and time-consuming search for the right workload. It is easier to approach the problem the other way around: reduce the number of resources accessible by the database until it is no longer able to perform efficiently. In this work, only CPU restriction through the taskset command has been tested (See Section 2.5). Figure 5.2 shows how the average response time changes for 16 users, each issuing 500 insert operations, under different CPU deployments. It is clear that the same workload takes about 4 times longer when only one core is available, as depicted in the right subfigure.
Figure 5.2: Boxplot comparison of the response times (on the Y axis) experienced by different clients (on the X axis), in two configurations: regular deployment on 20 CPUs (←), restricted deployment on 1 CPU only (→)

The last experiment of the pre-assessment phase presents the scenario where only a single user interacts with the database (See Figure 5.3). The idea is to highlight the limit on the optimal response time achievable in the proposed testing environment. The ideal results are shown for the following CPU deployments: single-core, dual-core, quad-core and 20-core (namely, the physical machine has not been restricted with taskset).
Figure 5.3: The optimal response time achievable by a single user without interference on multiple CPU deployments: weak (←) and strong (→) data durability
5.4 Assessment
The proposal has been assessed by running a set of synthetic stress scenarios through core pinning and using combinations of the parameters provided by the testing framework already discussed. Note that the core restriction is applied to the primary node only, in order to reduce replication lag, namely the event where a secondary node cannot keep up with the number of changes made to the data set, and thus the primary has to throttle the users even if they do not require strong data durability. This foresight improves response times for everyone without compromising the “virtual” stress scenario: in fact, it is the number of resources allocated to the primary node that makes the real difference.
The results discussed in the next subsections refer to the improvements observed for the replica set deployment only, because the stand-alone mode does not support data durability and, when this requirement is not enforced, the differentiated performance achieved in the two deployments is comparable. The only natural difference is the total execution time of each user, which essentially doubles in replica set mode, because the primary node performs double the number of concurrent activities. This is clearly shown in the two examples below: Figure 5.4 compares the execution times, in milliseconds, of 16 clients in both deployment modes with no priority; Figure 5.5 shows the same comparison but between 15 low-priority clients and 1 high-priority client.
Figure 5.4: Total execution time of 16 low-priority users, in milliseconds: stand-alone (←) and replica set (→) deployment
Figure 5.5: Total execution time of 1 high-priority and 15 low-priority users, in milliseconds: stand-alone (←) and replica set (→) deployment
The reason why a high-priority client in a replica set is capable of achieving the same performance as in stand-alone mode, despite the doubled number of service activities, lies in the UNIX nice levels: the number of concurrent activities does not matter, because the worker threads of the high-priority client will be favoured by the scheduler over any number of lower-priority ones. It is therefore safe to exclude the stand-alone deployment from the rest of the analysis.

The following two subsections show the timelapse of a single test run and the distribution of the response times averaged over 10 runs, using boxplots. Both charts measure the response time in microseconds. Note that the average distribution analysis is performed during the priority window only, namely the execution period of the high-priority user, which is denoted by the two vertical dashed lines in the example timelapse plot: the interest is in assessing the improvements in response times due to the reduced temporal interference. In order to highlight the difference with the unmodified version of the software, an indicative boxplot is also used to represent the response time distribution of a “no-priority” user. In this way it is easier to trace the median baseline value of a given test scenario and assess the improvements.
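For context, on Linux the nice value is a per-thread attribute, so a worker thread serving a high-priority session can be given a lower nice value (and therefore a larger CPU share under CFS) than the threads serving the other sessions, without touching the rest of the process. A minimal sketch of how such a per-thread assignment could be performed with setpriority(2) follows; the helper name is hypothetical and this is not the code from Section 4.1.

#include <sys/resource.h>   // setpriority, PRIO_PROCESS
#include <sys/syscall.h>    // SYS_gettid
#include <unistd.h>         // syscall, pid_t

// Assign a nice value to the calling thread only: on Linux, PRIO_PROCESS
// together with a thread id affects that single thread, not the whole process.
// Lowering the nice value below the default usually requires CAP_SYS_NICE.
// Returns 0 on success, -1 on error.
int set_thread_niceness(int nice_value) {
    const pid_t tid = static_cast<pid_t>(syscall(SYS_gettid));
    return setpriority(PRIO_PROCESS, tid, nice_value);
}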
5.4.1 Single-core database, weak data durability
This subsection shows the results achieved in a stress scenario where the primary node is pinned to a single core and data durability is not enforced, therefore the response is sent back to the remote clients without waiting for the replication process.

Figures 5.6, 5.7 and 5.8 present response time statistics related to scenarios where a single high-priority user competes for the ownership of the core with 3, 7 and 15 normal-priority users. The left subfigure shows an example timelapse of the response times for each user, highlighting the high-priority one (light grey); the right subfigure shows the distribution of the response times during the time period in which the high-priority user performs its insert operations. Note that, even when not mentioned, the mandatory service activities of the mongod instance are also running on the same core. The separation between the high-priority user and the normal-priority ones is quite visible in the timelapse. The number of outliers in the high-priority user distribution may seem high, but there are two considerations to observe:

1. Write operations with weak write concern requirements are extremely fast to complete, because they do not wait for the replication process; thus there may be a slight overhead due to the frequent check-in/check-out from the checkpoint system.
2. The service thread and secondary node worker threads cannot be blocked, as already discussed in Section 4.2, and they must complete their periodic tasks in order to keep the node alive and up to date with the rest of the topology. Therefore, they are also competing for the same CPU.
For each subsequent scenario, the latencies increase remarkably for the normal-priority users, while the high-priority one is always timely serviced. This is particularly evident in the most “crowded” scenario: the normal-priority users lose almost a millisecond with respect to the “no-priority” case and also experience higher variation, while the high-priority user is able to write as fast as in the less-crowded 8-user scenario. This is the natural course of events, because the normal-priority users have to give up CPU access frequently during the priority window. Therefore, the response time experienced by the high-priority user is lower than in the no-priority scenario, because the number of competing users is reduced, thus achieving differentiated performance. In particular, based on 10 re-runs, the high-priority user is serviced as fast as physically possible, with variations of no more than 120 microseconds.