
Enhanced discovery of Docker images


Academic year: 2021


Enhanced discovery of Docker images

Davide Neri

Supervisors: Antonio Brogi

Jacopo Soldani

A thesis submitted for the degree of

MSc in Computer Science and Networking

University of Pisa and SSSUP Sant’Anna


I would like to express my profound gratitude to my supervisors Antonio Brogi and Jacopo Soldani for the useful comments, remarks and engagement through the learning process of this master thesis.


Docker is a platform for developing, shipping and running applications using container-based virtualisation technologies.

Docker can pack any application in an isolated container with all the dependencies that the application needs to run. A Docker image is a template used to create containers. Docker images are stored in Docker registries where users can pull, push, or search images.

The currently available support for searching images in registries is, however, limited: available registries (such as Docker Hub) only permit searching images by specifying a term occurring in the image name, in the image description or in the user name. As a consequence, users cannot search images by specifying other attributes, such as the image size, or the software that is installed and supported by an image.

The objective of this thesis is precisely to overcome the aforementioned limitation by designing and prototyping a tool (Docker Finder) that permits searching Docker images by specifying other attributes, such as the image size or the software packages that are installed and supported by an image.


List of figures vii

List of tables ix

1 Introduction 1

1.1 Context and motivation . . . 1

1.2 Our contribution . . . 2

1.3 Thesis structure . . . 4

2 Background: Docker 5

2.1 What is Docker? . . . 5

2.2 Docker concepts and terminology . . . 6

2.2.1 Getting started with Docker . . . 8

2.3 Docker fundamentals . . . 9

2.3.1 Build an image . . . 10

2.3.2 Managing containers and images . . . 12

2.3.3 Docker volumes . . . 13

2.3.4 Docker networking . . . 14

2.3.5 Docker compose . . . 15

2.3.6 Distributing images . . . 16

2.4 Concluding remarks . . . 18

3 Docker Finder: design 19

3.1 Extending the search capabilities of Docker . . . 19

3.2 Docker Finder: high level overview . . . 21

3.2.1 Docker Finder: design choices . . . 23

3.3 Docker Finder architecture . . . 26

3.3.1 Analysis part . . . 26


3.3.3 Discovery part . . . 32

4 Docker Finder: Implementation 35

4.1 A concrete registry of Docker images . . . 35

4.2 Implementation of the analysis part . . . 38

4.2.1 Crawler and scanner . . . 39

4.2.2 RabbitMQ . . . 45

4.3 Implementation of the storage part . . . 46

4.3.1 Images Service . . . 46

4.3.2 Images Database . . . 49

4.4 Implementation of the discovery part . . . 50

4.4.1 Software service and software database . . . 50

4.4.2 Search API . . . 53

4.4.3 Graphical User Interface (GUI) . . . 54

4.5 Docker Finder deployment with Docker . . . 55

4.5.1 Deployment of the Analysis part with Docker . . . 58

4.5.2 Deployment of the Storage part with Docker . . . 62

4.5.3 Deployment of the Discovery part with Docker . . . 64

5 Conclusions 66

5.1 Summary . . . 66

5.2 Related work . . . 67

5.3 Possible extensions . . . 68

References 71


2.1 Containers vs Virtual machine . . . 6

2.2 Docker engine architecture. . . . 7

2.3 Docker image layering . . . 10

2.4 Docker volume . . . 14

2.5 Linking of Docker containers . . . 15

3.1 Web-based search interface of Docker Hub . . . 22

3.2 The high-level overview of the architecture of Docker Finder . . 23

3.3 Description of an image in Docker Finder . . . 25

3.4 Complete Docker Finder architecture . . . 27

3.5 Analysis part architecture . . . 27

3.6 Storage part architecture . . . 30

3.7 Storage part architecture with two distinct services . . . . 31

3.8 Discovery part architecture . . . 32

4.1 Docker Hub, the concrete registry of Docker images . . . 36

4.2 Implementation of the Analysis part . . . 39

4.3 Implementation of the crawler. . . . 39

4.4 Implementation of the scanner. . . . 41

4.5 Implementation of RabbitMQ. . . 45

4.6 Implementation of the storage part. . . 46

4.7 Implementation of the discovery part. . . 50

4.8 GUI implementation . . . 54

4.9 Dashboard web page of the GUI . . . 56

4.10 Web page results of the GUI . . . 57

4.11 Docker Finder deployment in Docker. . . 58

4.12 Time required to scan 100 images with respect to the number of scanners . . . 62


1 Source code statistics of Docker Finder. . . 74

2 Source code statistics of the analysis part. . . 74

3 Source code statistics of the storage part. . . 75


4.1 Summary of the HTTP methods available on the /api/images interface . . . 46

4.2 GET parameters of the /api/images interface . . . 47

4.3 GET parameters of the /search interface . . . 49


1 Introduction

1.1 Context and motivation

DevOps [14, 22] is a conceptual framework that aims at easing the collaboration and communication between software developers (Dev) and operators (Ops). With the advent of DevOps, the design, development, and production of software have experienced a paradigm shift. Developers are expected not only to develop their projects, but also to follow the whole lifecycle of such projects (including deployment, installation, monitoring, reconfiguration, and termination), which was once managed by operators’ teams. The requirements newly imposed by the DevOps conceptual framework have led to the development of new technologies, whose objective is essentially to enable a repeatable process for building, testing, deploying and managing software.

A notable example among the newly proposed platforms and technologies is Docker [3]. Docker is a platform that allows packaging any application, together with its dependencies, into isolated virtual containers. Docker containers are lightweight and portable, as they can be quickly deployed and executed on any host where Docker is installed. It is worth observing that Docker containers are particularly suited for both cloud-based and microservices-based applications:

• On the one hand, Docker (and, more generally, container-based technologies) provides a portable way to package and automatically manage applications, which is of paramount importance for (PaaS) cloud platforms. Each Docker container can hold packaged, self-contained, ready-to-deploy applications, together with the middleware and business logic (in binaries and libraries) they need to run [20].


• On the other hand, the emerging microservices-based architectural style [19] aims at decomposing monolithic applications into much smaller functional units called microservices. Each microservice must be lightweight and independently manageable, and different microservices can interact with each other only through lightweight communication mechanisms. Container-based technologies (such as Docker) can accelerate the widespread adoption of the microservices-based architectural style, as (Docker) containers provide a natural way of packaging and independently managing the microservices building up an application [17].

The main notions behind Docker [2] are images, containers, and registries. A Docker image is a read-only template used to create containers. To ease the distribution of containers, Docker permits distributing images of Docker containers through so-called Docker registries, where users can remotely download or upload images of containers by exploiting a fixed set of simple commands. However, the support for searching images in registries is currently limited. For instance, Docker Hub [4] — the official registry for storing and retrieving Docker images — offers a web-based graphical user interface and a command line tool to remotely look for images. Docker Hub permits specifying a term, which is then exploited to return all images where such term occurs in the image name, in the image description or in the user name. As a consequence, users cannot search images by specifying other attributes, such as the image size, or the software that is installed and supported by an image. Similar considerations apply to all other Docker registries, which only permit looking for (images of) containers “by name”.

1.2 Our contribution

The objective of this thesis is precisely to try to overcome the aforementioned limitations, by designing and prototyping a tool (Docker Finder) that permits searching Docker images by specifying other attributes, such as the image size or the software packages that are installed and supported by an image.

Docker Finder crawls all images publicly available in the Docker Hub, and generates a description for each of them. The description contains all metadata already available in the Docker Hub (e.g., size of image, number of downloads per image, popularity of an image), and a list of the software packages installed


within the image. The former is obtained by crawling images’ information directly from the Docker Hub, while the latter is obtained by running a container from each image and extracting the software packages that are installed within it. The descriptions of the images are stored in a local database, which is made accessible to end-users through a graphical user interface and a RESTful API. Docker Finder permits searching images by querying all the metadata contained in the description of an image, thus extending the discovery of images.
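To make such attribute-based queries concrete, the sketch below filters a small set of image descriptions by size and installed software. This is only an illustrative model: the field names (name, size, softwares) and the search_images function are assumptions for the example, not the actual schema or API of Docker Finder.

```python
# Illustrative sketch of an attribute-based query over image
# descriptions. Field names are assumptions, not the real schema.

def search_images(descriptions, max_size=None, software=None):
    """Return the image descriptions matching the given attributes."""
    results = []
    for desc in descriptions:
        if max_size is not None and desc["size"] > max_size:
            continue
        if software is not None and software not in desc["softwares"]:
            continue
        results.append(desc)
    return results

catalog = [
    {"name": "alpine-python", "size": 60, "softwares": ["python", "pip"]},
    {"name": "big-java", "size": 700, "softwares": ["java", "bash"]},
]

# Images smaller than 100 MB that ship python:
print([d["name"] for d in search_images(catalog, max_size=100, software="python")])
```

Such a query cannot be expressed with a term-based search over names and descriptions, which is precisely the gap Docker Finder addresses.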

Docker Finder is developed by adopting the microservices-based architectural style [19], which means that its components are lightweight, independent, and that they communicate through lightweight mechanisms. The microservices-based architecture of Docker Finder can be deployed as a multi-container Docker application.

Docker Finder is not intended to be a substitute for Docker registries, but rather a support for discovering Docker images in existing registries, and for easing their retrieval by end-users. If Docker directly included the discovery capabilities of Docker Finder into registries (e.g., in the Docker Hub), there would be no need for an external tool like the one we are proposing, as the enhanced search would be provided directly by Docker itself. This is one of the reasons why Docker Finder is implemented as a composition of independent microservices, which could be easily integrated with Docker to extend its search capabilities.

Docker Finder has to produce a description for each image that is available in the Docker Hub, which currently contains hundreds of thousands of public images (and the number of images increases daily). This is another reason why Docker Finder is implemented as a suite of loosely coupled microservices, which can be scaled independently to increase the performance of Docker Finder.

In summary, the main contribution of this thesis is the design and prototyping of Docker Finder, a tool supporting an enhanced, automated discovery of Docker images having certain features – a necessary step towards the ultimate goal of automating the deployment and management of microservices-based applications over heterogeneous containers. Beyond that, other contributions worth mentioning are (i) the way descriptions of images are generated (not only by retrieving metadata already available “outside” of an image, but also by analysing the “inside” of — a container run from — such image), and (ii) the fact that Docker Finder is itself an experience report of integrating different and heterogeneous software components in a microservices-based architecture.


1.3 Thesis structure

The rest of the document is organised as follows.

Chapter 2 provides the necessary background on Docker.

Chapter 3 discusses the main design choices that drove the definition of the architecture of Docker Finder.

Chapter 4 describes the implementation of each component of Docker Finder, and how to deploy Docker Finder as a multi-container Docker application.

Chapter 5 discusses related work, illustrates future work and draws some conclusions.


2 Background: Docker

In this chapter we provide a concise introduction to Docker. After presenting the main concepts and terminology about Docker, we describe some of its components (such as the Docker engine, Docker images, Docker containers, Docker Hub, Docker networks, Docker volumes) which are fundamental to the work presented in this thesis. We conclude the chapter by mentioning some of the current limitations of Docker.

2.1 What is Docker?

Docker [2] is a platform for developing, shipping and running applications using container-based virtualisation technology. Container-based virtualisation uses the kernel of the host’s operating system to run multiple guest instances [21]. Each guest instance is called a container, and each container has its own root filesystem, processes, memory, devices and network stack. At its core, Docker provides a way to run almost any application securely isolated in a container. Docker uses the resource isolation features of the Linux kernel, such as cgroups and kernel namespaces, and a union-capable file system to allow independent containers to run within a single host.

Docker, as a platform, provides other mechanisms to manage the container infrastructure. The main objectives of Docker are to provide a way to:

• put an application into a Docker container,

• distribute and ship containers to other users,


• deploy and orchestrate the containers to a production environment.

Containers are more lightweight than virtual machines because they use fewer resources in terms of CPU, memory, and storage space [15]. As shown in Figure 2.1, virtual machines include applications, binaries, libraries, and the guest operating system necessary to run such applications. In addition, they require a hypervisor running on the host operating system. Docker, instead, exploits a Docker engine and containers that include the application and all of its dependencies. Containers share the kernel with other containers, running as isolated processes in user space directly on the host operating system.

Fig. 2.1 Docker containers vs. Virtual machines.

Since container-based virtualisation adds little or no overhead to the host machine, it has near-native performance. A more detailed performance comparison between virtual machines and Linux containers can be found in [15].

2.2 Docker concepts and terminology

At the core of the Docker platform is the Docker engine. The Docker engine is a client-server application that builds and runs Docker containers. The Docker daemon is the server that creates and manages Docker objects such as images, containers, networks, and data volumes, by exploiting the Linux


kernel features. The client is an external program that takes user inputs and sends them to the daemon in order to build, ship and run containers.

Fig. 2.2 The client-server architecture of the Docker engine.

As illustrated by Figure 2.2, the Docker engine exposes a REST API between the Docker daemon and the command line interface (CLI) client. The CLI exploits the Docker REST API to control and interact with the Docker daemon process. There also exist some graphical clients offering a user interface (e.g., Kitematic [8]). The Docker client and daemon can also communicate via sockets.

The fundamental concepts used in the Docker platform are:

Docker image An image is a read-only template used to create containers.

Each Docker image references a list of read-only layers that represent file system differences. Layers are stacked on top of each other to form a base for a container’s root file system. The Docker storage driver is responsible for stacking the layers and providing a single unified view. A user can build her own images or use images created by other Docker users. Docker images are the build component of Docker.

Docker container A container is a runtime instance of a Docker image. The technical difference between an image and a container is that the container adds a writable layer on top of the read-only layers of the image. The container adds or modifies existing data in this writable layer. Multiple containers can be started from the same image. A container can be run, started, stopped, moved, and deleted. Docker containers are the run component of Docker.

Docker registry A registry holds the images created by users. A user can install a local registry or use a public registry, such as Docker Hub [4]. In each registry, images are organized in repositories. A single repository can hold multiple images. An image is identified by a repository name and a tag. Docker Hub serves a large set of ready-to-use images that can be downloaded and used. There are also official repositories, certified by vendors, that contain supported, optimized, and safe-to-use images. Docker registries are the distribution component of Docker.

The Docker platform also provides orchestration tools to manage and control Docker containers on the same host or on remote hosts:

• Docker Machine provisions Docker hosts and installs Docker Engine on them.

• Docker Swarm is native clustering for Docker. It turns a pool of Docker hosts into a single, virtual Docker host.

• Docker Compose is the tool to create and manage multi-container applications on the same host.

2.2.1 Getting started with Docker

In this section we show the steps needed to run a first container in Docker. The CLI client is used, and it is assumed that the Docker daemon is installed and running on the host. The docker run command is used to start a new container from an image:

docker run [options] image [command] [args...]

The run command does two things:

3A comprehensive description of the Docker orchestration tools is outside the scope of this thesis. We only provide some additional details about Docker Compose in the next sections.


1. Docker creates the container starting from the image specified in the argument. If the image is not present locally, it pulls the image from the registry.

2. Docker starts the created container and runs the command with the arguments in the running container.

The name of the image is specified by the repository:tag syntax. The command is any software that can be executed in the container and the args are the arguments of the command. The example below starts a container from the ubuntu:16.04 image and runs the echo command inside.

$ docker run ubuntu:16.04 echo "hello world"

If the ubuntu:16.04 image is not present locally, Docker pulls the image from the Docker Hub. When the image is pulled, the Docker daemon starts the container and the echo command is executed inside the container. As a result the string “hello world” is printed on the screen.

A container runs as long as the process with PID 1 inside the container is running. When the process stops, the container is shut down. In the previous example this means that, as soon as the echo command has been executed by printing “hello world” on screen, the container is shut down.

Two possible options accepted by the run command are -i, --interactive, which keeps STDIN open even if not attached, and -t, --tty, which allocates a pseudo-TTY to the container. An example of using these options is:

$ docker run -it ubuntu:16.04 /bin/bash

In this case the bash shell is executed and, since the -it options are passed, an interactive pseudo-TTY is opened in the container.

2.3 Docker fundamentals

We now introduce the core components of the Docker platform that permit developing applications inside containers. We first describe how to build a Docker image. We then show the operations that permit managing Docker containers. Finally, after describing Docker volumes and Docker networking, we illustrate how to distribute Docker images.


2.3.1 Build an image

In this section we describe the building phase of Docker images. A user can use an existing Docker image (created by other users) or she can build her own images.

A Docker image is made up of file systems layered over each other, where each layer is just “another image”. Every layer is a file system mounted as a read-only layer. The image that is below another image is its parent image; the image at the bottom of the stack is the base image.

When a container is launched from an image, Docker adds a writable container layer on top of all the read-only layers. All the changes made in the container are stored in the writable layer. An image is built by committing the changes made in a container. Figure 2.3 shows an example of the layer structure of an image and the writable container layer added on top.

Fig. 2.3 Example of an ubuntu:15.04 image composed of read-only layers, with the writable container layer on top.

The docker commit command saves the changes performed in the writable container layer and builds the new image. The syntax of the command is the following:

docker commit [options] container [repository[:tag]]

4Source: https://docs.docker.com/engine/userguide/storagedriver/images/


The container is the name or the identifier of the container to be saved in a new image. The name of the image created is specified with the repository:tag syntax. With this command a user can create her own image starting from a container. The best solution to manage and create images in a documented and maintainable way is the Dockerfile.

Dockerfile

Another method for building an image is writing a Dockerfile. A Dockerfile is a configuration file that contains instructions for building a Docker image. It is a more convenient approach to build images compared with the docker commit command. With the commit command the user must start the container manually, perform the desired changes inside the container and build the image. With a Dockerfile the user writes the set of instructions in a file (e.g., install a program, mount a volume) and the image is built by reading that file.

A simple example of a Dockerfile with only three instructions is shown below.

# Dockerfile
FROM ubuntu:16.04
RUN apt-get install curl
CMD ping 127.0.0.1

FROM The FROM instruction specifies the base image to start from. In this case the new image is based on ubuntu:16.04.

RUN The RUN instruction is used to execute any kind of command in the container. In the example, the command installing the curl binary is executed. Every RUN instruction executes the command in a new layer on top of the writable layer and then commits the changes in a new image. This image is used for the next step in the Dockerfile5.

CMD The CMD instruction defines the default command to be executed whenever a container is started from the image (built with this Dockerfile). This instruction has no influence during the build phase; only at start-up of the container is the command of the CMD instruction run. Only

5If there are five RUN instructions in the Dockerfile, five commits are executed when the image is built from the Dockerfile. To avoid this, developers can aggregate multiple RUN instructions by using && between the commands.


one CMD command can be specified in a Dockerfile. The command can be overridden at run time when the container is started. In the Dockerfile example, the instruction specifies that the ping command is executed in the container when it is started.
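To make the layer-aggregation remark of the footnote concrete, the following sketch collapses what would otherwise be several RUN layers into a single one by chaining the commands with &&. This is an illustrative Dockerfile, not one taken from the thesis:

```dockerfile
FROM ubuntu:16.04
# One RUN instruction (hence one commit and one layer) instead of three:
RUN apt-get update && \
    apt-get install -y curl && \
    rm -rf /var/lib/apt/lists/*
CMD ping 127.0.0.1
```

Fewer layers generally mean smaller images and faster builds, at the cost of less granular layer caching.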

Starting from the Dockerfile, a user can build an image with the docker build command.

docker build [options] path

This command takes as input the path of the build context. The build context is the folder that is used to reference files during the building phase. During the build, the Docker client packs all the files stored in the build context into a .tar archive and sends it to the daemon. By default, the Docker daemon searches for a file called Dockerfile in the root of the build context and starts building the image by reading its instructions. The following command builds an image using the current directory as the build context.

$ docker build -t myuser/myapplication:latest .

The -t or --tag=[] option is used to assign a name to the image. In the example, the new image has myuser/myapplication as repository name and latest as tag. The build context is the current folder.

2.3.2 Managing containers and images

This section lists some of the main Docker commands for managing the lifecycle of containers and images. A complete list can be found in [3]. The docker start command is used to start an already created container:

docker start <container ID>

The docker stop command is used to stop a running container:

docker stop <container ID>

The docker exec command executes a command in a running container:

docker exec <container ID> <command>

A container that is not running can be deleted with the command docker rm specifying the container ID or the name of the container:


docker rm [options] container [container...]

A local image can be deleted with the docker rmi command specifying the image identifier or the repository name.

docker rmi [options] image [image...]

To display all the details about a container or an image, docker inspect is the command to use. The output is a JSON array with low-level information on the container or image.

docker inspect [options] container|image

It is possible to see the process output of the container with the docker logs command. The command shows whatever the process with PID 1 writes to stdout.

docker logs <container name>

2.3.3 Docker volumes

When a container is shut down, the data created inside it does not persist. A solution to persist data beyond the container’s lifecycle is to use data volumes. A data volume is a designated directory in one or more containers, which is designed to persist data independently of the container’s lifecycle. Docker never automatically deletes volumes when a container is removed, nor does it “garbage collect” volumes that are no longer referenced by a container. A volume in a container can be mapped to a host folder, and volumes can be shared between containers. Figure 2.4 shows a single Docker host running two containers. Each container exists inside of its own address space within the Docker host’s local storage area (/var/lib/docker/...). There is also a single shared data volume located in the /data folder on the Docker host, which is mounted directly into both containers.

A data volume is mounted using the -v flag with the docker create and docker run commands. There are two ways of mounting a volume:

$ docker run -v /myvolume ubuntu:16.04

$ docker run -v /host/path:/cont/path ubuntu:16.04

6Source: https://docs.docker.com/engine/userguide/storagedriver/images/shared-volume.


Fig. 2.4 An example of a volume shared among two containers.

The first command executes a new container and mounts the folder /myvolume into the file system of the container. The second command mounts the folder at the host path to a folder at the container path.

In a Dockerfile, mounting a volume can be done with the VOLUME instruction:

VOLUME /myvolume

This command mounts the data volume /myvolume in the container. The host directory is, by its nature, host-dependent. For this reason, it is not possible to mount a host directory from a Dockerfile, because built images should be portable: a host directory would not be available on all potential hosts. Mounting a host directory can be useful for testing, but it is generally not recommended in production.

2.3.4 Docker networking

With Docker it is possible to create virtual networks. By default, Docker installs three networks (bridge, none, host) on the host. Containers can connect to one of these networks and communicate with the containers connected to the same network. Networks are useful for better isolating the communication among containers.

A container can join more than one network and containers can only communicate within networks but not across networks. A container attached to two networks can communicate with member containers in either network.


When a container is connected to multiple networks, its external connectivity is provided via the first non-internal network, in lexical order.

Docker also has a linking system that allows you to link multiple containers together and send connection information from one to another. This allows the recipient to see selected data describing aspects of the source container. The container’s name is set by using the --name flag.

$ docker run -d --name db training/postgres

The command runs a new container with the name db from the training/postgres image.

Links allow containers to discover each other and securely transfer information about one container to another container. When you set up a link, you create a conduit between a source container and a recipient container. The following command creates a new web container and links it with the db container:

$ docker run --name web --link db:db training/webapp python app.py

The web container is now linked to the db container, as shown in Figure 2.5.

Fig. 2.5 Example of linking between two containers. The db container is the source and the web container is the recipient of the link.

2.3.5 Docker compose

Docker compose is a tool for creating and managing multi-container applications.


In a multi-container application, each service runs in a container and the containers are linked together. With Compose, the user can define all the containers in a single file called docker-compose.yml and, with a single command, spin up all the containers.

The configuration file (docker-compose.yml) is a YAML file that specifies the services that compose the application. Each service is described by a set of instructions that are used for building and running the container that contains the service. An example of a compose application is shown below:

javaclient:
  build: .
  command: java HelloWorld
  links:
    - redis
redis:
  image: redis

The docker-compose file specifies two services: javaclient and redis. The javaclient service is composed of three instructions: build, command and links. The redis service has only the image instruction.

The build instruction takes as input the path of the Dockerfile that will be used to build the image. The image instruction instead takes as input an already existing image. The links instruction takes as input the <service name>:<alias> to link with (if no alias is specified, the service name is used). The alias is used by the javaclient service to communicate with the redis service: the alias name is added to the hosts file in the container, so that the container is able to resolve the address.
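For completeness, the Dockerfile referenced by the build: . instruction of the javaclient service is not shown in the text; a plausible minimal sketch could look as follows, where the base image java:8 and the file name HelloWorld.java are assumptions for the example:

```dockerfile
FROM java:8
COPY HelloWorld.java /app/
WORKDIR /app
# Compile the class that the compose "command:" instruction will run
RUN javac HelloWorld.java
CMD java HelloWorld
```

Note that the command: instruction in docker-compose.yml overrides the CMD of this Dockerfile when the container is started.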

To run an application with Docker compose, the following command has to be issued:

$ docker-compose up

The up command builds the image of each service, creates and starts the containers, and connects the services to the same network.

2.3.6 Distributing images

A Docker image can be distributed by uploading it into a Docker registry. Other users can pull the same image to run a container or to build their own


image. A registry can be private or public. Docker Hub is a public registry of Docker images.

Docker Hub

Docker Hub is a public registry and contains a large number of ready-to-use images. At the time of writing, the number of public repositories on Docker Hub is greater than 343,000, and every repository can hold more than one image (i.e., there are multiple tags). Docker Hub is composed of repositories. A user can create her own repositories to which she can push her images. A repository on Docker Hub can be public (i.e., every user can access it) or private (i.e., only the owner of the repository can access it). Docker Hub provides the following major features:

• Image Repositories: Find, manage, push, and pull images from community, official, and private image libraries.

• Automated Builds: Automatically create new images when you make changes to a source code repository.

• Webhooks: A feature of Automated Builds, Webhooks let you trigger actions after a successful push to a repository.

• Organizations: Create work groups to manage access to image repositories.

• GitHub and BitBucket Integration: Add Docker Hub and your Docker images to your current workflows.

The docker push command is used to upload a local image into Docker Hub. To push the image the user must be logged into the Docker Hub. The syntax of the command is:

docker push [repository:tag]

The image to push is identified by the repository and the tag. The push command sends the image to the Docker registry layer by layer; if the Docker Hub registry already has a layer, it is not sent again.

A user can find public repositories on Docker Hub in two ways: she can search from the Docker Hub website, or she can use the docker search command. For example, if a user is looking for a java image, she can run the following command:

(27)

$ docker search java

The result is a (limited) list of Docker Hub repositories where the search term java appears in the repository name, in the description, or in the user name.

2.4 Concluding remarks

The Docker platform allows encapsulating an application in a container together with an operating system and all the dependencies that it needs to run. Containers are fast and can be launched in a few seconds, shortening the development, testing and deployment phases. In terms of scalability, container platforms easily spin up new containers if needed. An application in a container is portable because, once built in one environment, it can be shipped to another environment (e.g., cloud, VMs, physical servers), with the only requirement that the host has the Docker engine installed.

Microservices naturally lend themselves to containerisation [18], and Docker is certainly one of the best candidates for container-based virtualisation architectures.

A limitation of the Docker platform is the image search mechanism currently supported. Users can search Docker images only by specifying a term, and Docker returns all images where such term occurs in the image name, in the image description or in the user name. Currently users cannot search images by specifying other attributes, such as the image size, or the software that is installed and supported by an image.

The objective of this thesis is precisely to overcome the above limitation, by designing and implementing a tool to discover Docker images not only “by name“, but also based on what they feature.


Docker Finder: design

This chapter introduces Docker Finder, by first expanding the discussion about the current limitations of Docker that Docker Finder aims to resolve. The Docker Finder architecture is then presented by focusing on the design choices that guided the development of the project. The chapter follows a top-down approach: first, a high-level overview of Docker Finder is presented by dividing the components of the architecture into three groups (analysis, storage, and discovery); then, the design choices are elucidated for each group by discussing their advantages and drawbacks.

3.1 Extending the search capabilities of Docker

Docker Finder is a tool enabling a more powerful search of Docker images with respect to the tools currently provided by the Docker platform. Such tools permit searching images based only on the repository name, the user name, or the description of an image. With Docker Finder, it is instead possible to search Docker images based on additional information (like the version of the software installed or the size of the image). The main steps performed by Docker Finder are the following:

1. Download Docker images from the Docker registry,

2. generate and store a new description of the images into a local storage, and

3. resolve the search queries submitted by users over the stored descriptions.

As anticipated in the previous chapter, Docker Hub is a public registry of Docker images. In each registry, images are organized in repositories with a name and a tag. Some of the fields used by Docker Hub to describe an image in a repository are the following:

• repository name - the name of the repository,

• stars - the number of votes that an image received from other users,

• pulls - the number of times the image was downloaded,

• official - indicates whether the image is officially promoted by Docker,

• automated - indicates whether the repository is linked to GitHub or BitBucket and is automatically built,

• short description - a description of the repository in a few words,

• full description - a full description of the repository, and

• owner - the user owning the repository.

Each repository has a list of tags. Each tag is a particular version of an image. Each tag contains:

• name - given by the user when the image is created (the default value is latest),

• size - the size in megabytes of the image identified by the tag, and

• last updated - the date of the last update of the image.

Docker currently exploits the following fields to search or filter images: repository name, owner, stars, pulls, official, automated, and the descriptions. Users can search Docker images in the Docker Hub with two tools:

1. The docker search command line tool, and

2. the web search interface of Docker Hub.

Both methods take a term as input and return the public repositories on Docker Hub whose name, user name, or description matches the given term exactly or partially. The results can be filtered by the number of stars or pulls, or by the official or automated flags.


The docker search command line tool is the first method to search Docker images. The example below searches for the term java and returns the images that are automated and have at least ten stars.

$ docker search --stars=10 --automated java

The output of the command is the following:

NAME DESCRIPTION STARS OFFICIAL AUTOMATED

anapsix/alpine-java Oracle Java 8... 125 [OK]

develar/java 46 [OK]

isuper/java-oracle This repository... 39 [OK]

lwieske/java-8 Oracle Java 8 ... 27 [OK]

nimmis/java-centos This is docker ... 13 [OK]

Docker returns a (partial) list of the automated repositories with at least ten stars where the term java appears in the name, in the user name, or in the description fields. The docker search command can return only up to 25 results.

In the Docker Hub search interface a user can insert the term in an input box, order the results by the number of stars or pulls, and filter the images that are automated or official.

Figure 3.1 shows the results of a search with the Docker Hub search interface. The term java is inserted into the input box (at the top right corner of the interface), and the results are filtered by selecting the features in a dropdown menu. In this example the images are filtered by the number of stars. The total number of repositories found (6332) is displayed at the top left.

In summary, the search tools currently available in Docker have limited capabilities. For example, suppose that a user develops an application that requires both python 3.4 and java 1.8, that she wants to use a Docker image to ship her application, and that the size of the image must be as small as possible. With the search tools currently provided by Docker, there is no way to submit a query that returns the Docker images matching her requirements. Docker Finder, instead, is designed to support this type of search.
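To make the intended semantics concrete, such a query can be thought of as a filter over image descriptions. The following minimal Python sketch (with made-up sample descriptions, not real Docker Hub data, and helper names that are purely illustrative) shows how a multi-requirement, smallest-size query could be resolved:

```python
# Sketch: resolving a query such as "smallest image providing both
# python 3.4 and java 1.8" over a set of image descriptions.
# The description format follows the one used by Docker Finder (Chapter 3);
# the sample data below is illustrative.

def matches(description, requirements):
    """True if the image provides every (software, version) required."""
    found = {s["software"]: s["ver"] for s in description["softwares"]}
    return all(found.get(name, "").startswith(ver)
               for name, ver in requirements.items())

def smallest_matching(descriptions, requirements):
    """Return the smallest image satisfying all requirements, or None."""
    candidates = [d for d in descriptions if matches(d, requirements)]
    return min(candidates, key=lambda d: d["size"], default=None)

images = [
    {"repo_name": "a/jdk-py", "size": 300, "softwares": [
        {"software": "python", "ver": "3.4.2"},
        {"software": "java", "ver": "1.8.0"}]},
    {"repo_name": "b/big", "size": 900, "softwares": [
        {"software": "python", "ver": "3.4.1"},
        {"software": "java", "ver": "1.8.0"}]},
    {"repo_name": "c/py-only", "size": 100, "softwares": [
        {"software": "python", "ver": "3.4.0"}]},
]

best = smallest_matching(images, {"python": "3.4", "java": "1.8"})
print(best["repo_name"])  # a/jdk-py: the smallest image with both
```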

3.2 Docker Finder: high level overview

We first introduce a high-level overview of the architecture of Docker Finder. As illustrated in Figure 3.2, the components of Docker Finder can be grouped into analysis, storage, and discovery, which correspond to the three phases carried out by Docker Finder.

Fig. 3.2 High-level overview of the architecture of Docker Finder.

Analysis - The analysis part is the core of Docker Finder. It crawls Docker images from the Docker registry1, and analyses each image's features to generate a description. All generated descriptions are then passed to the storage part. The analysis phase runs continuously2 in the background, by downloading and analysing images from the Docker registry.

Storage - The storage part essentially consists of a local repository where to store the descriptions of the images produced by the analysis part.

Discovery - The discovery part exposes a RESTful API and a Graphical User Interface (GUI) to users, which permit them to look for Docker images and interact with Docker Finder. The discovery part interacts with the storage part for resolving the search queries submitted by the users.

3.2.1 Docker Finder: design choices

This section lists some of the main choices made in the design of Docker Finder. For each choice the advantages and drawbacks are underlined.

1 As we will see in Section 4.1, the current prototype of Docker Finder crawls from the official Docker Hub registry.

2 The images within the Docker registry are updated frequently by the users, therefore Docker Finder must analyse them continuously.

Microservices-based architecture - The microservices-based architectural style [13] permits structuring distributed applications as suites of microservices. Each microservice is a minimal independent process which interacts with the other microservices via lightweight messages.

Docker Finder is designed to be deployed as a microservices architecture: its modules are independent and interact via HTTP messages.

The microservices-based architecture has been adopted mainly to achieve scalability, but also to improve fault isolation, eliminate long-term commitment to a single technology stack, and make it easier to understand the functionality of each microservice.

Description of the image - Docker Finder produces a description for each crawled Docker image. Such description is the internal representation of an image used by Docker Finder to permit searching for images. The description is fundamental for the expressiveness of the search, because a detailed description of images helps to find the right image.

The two main sources of information exploited by Docker Finder for producing the image description are the following:

• Docker registry: Docker Finder takes all the information already available in the Docker registry, and

• Docker image: Docker Finder runs the image in a container and extracts the information by inspecting the running container.

Figure 3.3 shows the structure of the description of an image in Docker Finder. repo_name is the name of the repository and tag is the tag of the image; last_scan is the time of the last scan of the image; last_updated is the time of the last update of the image; full_size is the size in megabytes of the image; star_count is the number of stars; pull_count is the number of pulls; description is the description of the image; distro is the operating system distribution of the image; softwares is an array of objects, where each object has a software name and the version (ver) found in the image.

The description is stored as a JSON document. For example, the longshoreman/controller image currently available in the Docker Hub can be described as follows:


Fig. 3.3 Description fields used by Docker Finder for describing a Docker image.

{
    "_id": "57bbff4625edf81300046397",
    "repo_name": "longshoreman/controller",
    "tag": "latest",
    "last_scan": "2016-08-23T07:46:14.248Z",
    "last_updated": null,
    "stars": 0,
    "pulls": 65,
    "distro": "Debian GNU/Linux jessie",
    "description": "",
    "size": 322484591,
    "__v": 0,
    "softwares": [
        {"software": "python", "ver": "2.7.7"},
        {"software": "perl", "ver": "5.18.2"},
        {"software": "curl", "ver": "7.37.1"},
        {"software": "npm", "ver": "1.4.21"}
    ]
}

The example reports the description of the latest tag of the longshoreman/controller image. The last_scan of the image was performed on 23rd August 2016. The last_updated field is null (since the image was never updated in the Docker Hub). The image has 0 stars and it has been downloaded 65 times. The image is based on the Debian Linux distribution and its size is 322484591 bytes. Within the image, Docker Finder has found four software distributions: python 2.7.7, perl 5.18.2, curl 7.37.1, and npm 1.4.21.
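Given a description like the one above, a requirement such as "python at least 2.7" can be checked against the softwares array by comparing version tuples. A minimal sketch (the helper names are illustrative, not part of Docker Finder):

```python
# Sketch: checking a minimum-version requirement against the "softwares"
# array of a Docker Finder description. Helper names are illustrative.

def parse_version(ver):
    """Turn a dotted version string like "2.7.7" into a comparable tuple."""
    return tuple(int(part) for part in ver.split(".") if part.isdigit())

def provides_at_least(description, software, min_version):
    """True if the description lists `software` at version >= min_version."""
    for entry in description["softwares"]:
        if entry["software"] == software:
            return parse_version(entry["ver"]) >= parse_version(min_version)
    return False

description = {
    "repo_name": "longshoreman/controller",
    "softwares": [
        {"software": "python", "ver": "2.7.7"},
        {"software": "perl", "ver": "5.18.2"},
        {"software": "curl", "ver": "7.37.1"},
        {"software": "npm", "ver": "1.4.21"},
    ],
}

print(provides_at_least(description, "python", "2.7"))  # True
print(provides_at_least(description, "python", "3.0"))  # False
```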

Docker Finder-in-Docker - Docker Finder is designed to be deployed as a multi-container Docker application. Its microservices-based architecture, indeed, permits shipping each microservice in an independent container, and such containers can then be orchestrated by exploiting Docker Compose (see Section 4.5 for further details).

Notice that, by deploying Docker Finder as a multi-container Docker application, we ease and improve its (auto-)scaling capabilities [16].

3.3 Docker Finder architecture

The final architecture of Docker Finder is depicted in Figure 3.4. The analysis, storage, and discovery parts (as well as the modules they contain) are detailed in the following sections.

3.3.1 Analysis part

The analysis part is composed of three services: crawler, message broker, and scanner (Figure 3.5). The crawler selects the names of the Docker images from the Docker Hub and sends them to the message broker. The message broker enqueues the Docker image names and makes them available to the scanners. When a scanner receives the name of an image, it pulls the image from the


Fig. 3.4 Complete architecture of Docker Finder.

Docker registry, generates the description of the image (by downloading all metadata already available in the registry, and by “scanning” a running instance of the pulled image), and finally sends the obtained description to the storage part.

Each scanner also interacts with the discovery part, from which it receives the list of software distributions to be searched within an image (e.g., python, perl, curl, etc.).


Crawler

The crawler crawls from the Docker registry the names of the Docker images to be scanned. It does not download any image: it only sends the names to the message broker. The crawler can also implement some filtering policies on the images that it crawls. For example, the crawler can crawl only images with a particular tag, in order to reduce the workload of each scanner.

As we already mentioned, it is a task of the scanners to pull images (if necessary). This choice is motivated by efficiency reasons: if the crawler were pulling images, then the scanner would have either to reside on the same host of the crawler (to have pulled images available), or to pull each image again. In the former case, having crawler and scanner necessarily on the same host might have caused issues for scaling them independently. In the latter case, we would have paid the cost of pulling images twice.

Notice that Docker Finder is currently designed to crawl from a single registry (either Docker Hub or a private registry), but it can easily be extended to crawl from multiple Docker registries. In order to enable Docker Finder to crawl from multiple registries, a unique identifier must be assigned to each registry, because a scanner must know from which registry an image has to be pulled. Another interesting extension is to support parallel crawlers, which crawl from (possibly) different Docker registries.
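The multi-registry extension could, for instance, be realised by tagging every queued message with the identifier of its source registry, so that a scanner knows where to pull from. A hypothetical sketch of such a message format (not the one used by the current prototype):

```python
import json

# Hypothetical message format for a multi-registry Docker Finder: each
# crawled image name carries the endpoint of the registry it was found in.

def make_message(repo_name, registry="hub.docker.com"):
    """Serialize an image reference together with its source registry."""
    return json.dumps({"name": repo_name, "registry": registry})

def pull_reference(message):
    """On the scanner side: rebuild the full reference to pull."""
    body = json.loads(message)
    return "{}/{}".format(body["registry"], body["name"])

msg = make_message("nimmis/java", registry="registry.example.local:5000")
print(pull_reference(msg))  # registry.example.local:5000/nimmis/java
```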

Message broker

The message broker is an intermediary program module that translates a message from the formal messaging protocol of the sender to the formal messaging protocol of the receiver [23]. The purpose of the broker is to take incoming messages from an application and perform some action on them, such as routing the messages to one or more destinations, performing message aggregation, or decomposing messages into multiple messages.

In Docker Finder the message broker sits between the crawler (i.e., the sender) and the scanners (i.e., the receivers). The message broker receives the messages, stores them into a queue, and dispatches them to the scanners. The messages contain the names of the crawled images.

The main advantages of introducing the message broker are the following:

• Once the crawler has sent a message to the message broker, it can continue to do its work, being confident that the message broker retains the message until a scanner retrieves it,

• the crawler and the scanner become loosely coupled components,

• the message broker manages the workload queue for multiple scanners, providing reliable storage, and guaranteeing message delivery and transaction management.
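The decoupling described above can be illustrated with a minimal standard-library sketch, where a plain in-process queue stands in for the broker (the real implementation uses RabbitMQ, as discussed in Chapter 4):

```python
import queue
import threading

# Minimal sketch of the crawler/broker/scanner decoupling: the crawler
# enqueues image names and moves on; a scanner consumes them at its own
# pace. A stdlib queue stands in for the RabbitMQ broker.

broker = queue.Queue()          # stands in for the RabbitMQ queue
scanned = []

def crawler(images):
    for name in images:
        broker.put(name)        # fire-and-forget: the broker retains it
    broker.put(None)            # sentinel: no more work

def scanner():
    while True:
        name = broker.get()
        if name is None:
            break
        scanned.append(name)    # a real scanner would pull and scan here

t = threading.Thread(target=scanner)
t.start()
crawler(["nimmis/java", "longshoreman/controller"])
t.join()
print(scanned)
```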

Scanner

The scanner builds the description of each Docker image. The main steps performed by a scanner are the following:

• It asks the message broker for a Docker image's name,

• it checks if the image has to be scanned (i.e., whether its description is out-of-date, or if it is a newly added image),

• it scans the image and builds its description (by adding the metadata that it can retrieve from the Docker registry, and by looking for the software distributions3 installed in the image), and

• it sends the description to the storage part.

The scanner is the core service of the analysis phase. Since the time needed to crawl the image names from the remote registry is expected to be lower than the time needed to pull and scan the images, the number of scanners can be scaled to control the size of the queue of the message broker. The scanner is designed as an independent entity that can be launched within the Docker Finder architecture, so that the number of running scanners can be increased to raise the number of images analysed in a period of time.

As we will see in Section 4.5.1, in the current prototype of Docker Finder the number of scanners can be scaled manually. The system starts with a given number of scanners and the system administrator can launch any number of additional scanners to scan the images present in the queue of the message broker. A possible improvement is to equip Docker Finder with an auto-scaling

3 As we anticipated, the list of software distributions searched in an image is taken from the software service.

feature so that the number of scanners is increased or decreased automatically and dynamically (if needed).

The scanner, before starting to scan an image, checks whether the image must be scanned or not. An image must be scanned if it is a new image or if the last update of the image in the registry was performed after the last scan time. The last scan time is updated in the description every time the image is scanned.
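This check can be sketched as a simple comparison of timestamps (the helper below is illustrative; the field names follow the description format):

```python
from datetime import datetime

# Sketch of the scanner's "must this image be scanned?" check: scan if the
# image is new (no previous scan) or was updated after the last scan time
# recorded in its description.

def must_scan(last_updated, last_scan):
    """Both arguments are ISO timestamps, or None (never updated/scanned)."""
    if last_scan is None:          # new image, never scanned before
        return True
    if last_updated is None:       # never updated since it was pushed
        return False
    return datetime.fromisoformat(last_updated) > datetime.fromisoformat(last_scan)

print(must_scan(None, None))                                    # True
print(must_scan("2016-09-20T05:11:14", "2016-08-23T07:46:14"))  # True
print(must_scan("2016-08-01T00:00:00", "2016-08-23T07:46:14"))  # False
```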

3.3.2 Storage part

The role of the storage part is to manage the descriptions of images produced by the scanners. In this perspective, it is composed of the images service and the images database (Figure 3.6). The images service exposes the RESTful APIs and is in charge of searching, adding, deleting, and updating image descriptions in the images database. The images database stores the descriptions of the Docker images.

Fig. 3.6 Microservices composing the storage part.

Images service

The tasks of the images service are the following:

1. To manage the Docker Finder image descriptions (i.e., to get, add, delete, or update an image description), and

2. to search for images by looking into their descriptions.

Each task is accessible through a RESTful API, but the search endpoint is not exposed to the external environment: a user cannot submit a search query directly to the images service. The only way a user can submit a search operation is by calling the search API module or by using the GUI in the discovery part. The choice of not making the images service publicly available is for controlling the operations submitted by the users: a user should only search the images, and not add, delete or update an image description. With this solution there is no need for an authentication method for the users.

The images service is the core module of the storage part. An implementation of the images service must take into account the following issues:

• workload: Both the scanners and the search API requests are managed by this service.

• robustness: If the service goes down, the whole system can get stuck, because the scanners cannot upload new image descriptions and the search API cannot work without the images service.

A solution for the robustness problem is to divide the service in two modules, as shown in Figure 3.7. The imagesService2analysis manages the descriptions (add, update, and delete) in the database, while the searchAPI2discovery provides the search operations. With this approach, even if one service is not available to satisfy a request, the other one can still work without breaking down the entire system.

Fig. 3.7 Storage part architecture: how to improve its robustness.

As we will see in Section 4.3, Docker Finder currently implements the solution with a single Images service that provides both the management operations and the search API of the Docker Finder description. A future improvement is to implement the solution proposed (Figure 3.7) in Docker Finder.


Images database

The images database stores the descriptions of the Docker images. Since the descriptions are by default in JSON format, the most appropriate databases are those natively supporting JSON documents (e.g., NoSQL databases such as MongoDB and Redis).

3.3.3 Discovery part

The discovery part (Figure 3.8) includes the GUI, the Search API, the Software service, and the software database. The discovery part exposes to users the interfaces to interact with Docker Finder.

Fig. 3.8 Microservices composing the discovery part.

Software service

The most important type of information extracted by Docker Finder is the software installed inside an image. The Software service stores the list of software names used by the scanners to search for the software installed in an image.

The software service is accessed both by the scanners and by external users. The scanners interact with this service for obtaining the list of software to search for in the Docker images. An external user, instead, is able to add a software name to the collection by calling the exposed RESTful API.


The advantages of this choice are the following:

• The list of software is “open” and every user can add a software to search within the images, and

• the list of software can be easily updated because it is offered as a service (instead of being hard coded in the system).

On the other hand, the potential drawbacks of this choice may be the following:

• Malicious users may send wrong software names. A possible improvement is to add an authentication method for the users.

• Communication delay is introduced, as each scanner requests the list of software to search at every scan. An improvement is to design the scanners in such a way that the list of software is downloaded only if it has been updated with new software.
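The improvement mentioned in the last bullet could, for instance, rely on a version counter attached to the software list (an ETag-like scheme; the class and method names below are hypothetical, not the actual Docker Finder API):

```python
# Sketch of the proposed improvement: scanners re-download the software
# list only when the software service reports a newer version (an
# ETag-like version counter). All names here are illustrative.

class SoftwareService:
    def __init__(self, softwares):
        self.softwares = list(softwares)
        self.version = 1

    def add(self, name):
        self.softwares.append(name)
        self.version += 1           # every change bumps the version

class ScannerCache:
    def __init__(self, service):
        self.service = service
        self.cached = None
        self.cached_version = 0

    def software_list(self):
        # Fetch only if the cached copy is stale.
        if self.cached_version != self.service.version:
            self.cached = list(self.service.softwares)
            self.cached_version = self.service.version
        return self.cached

service = SoftwareService(["python", "perl", "curl"])
cache = ScannerCache(service)
print(cache.software_list())   # first call: the list is downloaded
print(cache.software_list())   # served from the cache, no fetch
service.add("npm")
print(cache.software_list())   # version changed: fetched again, with npm
```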

Search API

The module exposes the Docker Finder search API to external applications or users. The Search API does not require any authentication, because all the exposed operations are safe for users. The API has no rate limits per user or per application.

The search API module is only a proxy: it does not execute the search query on the database, but forwards the requests to the Images service, which searches the database and returns the response to the search API.

The search API has the advantage of exposing only the permitted operations to external users, by filtering the requests; at the same time, it introduces latency, because each request crosses multiple hops (i.e., from the user to the SearchAPI, and from the SearchAPI to the Images service). The API is used both by external users and by the GUI.

Graphical User Interface (GUI)

The GUI is a web-based application that allows users to interact with Docker Finder. More precisely, it permits users to build search queries in a friendlier graphical environment, to submit such queries to Docker Finder (and, more precisely, to its search API), and to obtain a visual representation of the results of each submitted query.


Docker Finder: Implementation

This chapter focuses on the implementation details of Docker Finder (whose microservices-based architecture has been illustrated in Chapter 3). More precisely, after showing how Docker Hub can serve as the remote registry of Docker images, we illustrate how we concretely implemented the analysis, storage, and discovery parts of Docker Finder. Then, we show how Docker Finder can be deployed as a multi-container Docker application. All the source code of Docker Finder is publicly available on GitHub1.

4.1 A concrete registry of Docker images

As discussed in Section 3.2, Docker Finder has to crawl images from a remote Docker registry. In its current version, the concrete registry accessed by Docker Finder is the Docker Hub [4]. Docker Hub is a cloud-based registry service which allows users to build and test images, to store manually pushed images, and to link to code repositories (e.g., GitHub or Bitbucket). Docker Hub stores Docker images in repositories, where users can find, manage, and push images from community, official, and private image libraries.

Docker Finder accesses the Docker Hub registry (Figure 4.1), more precisely the hub.docker.com endpoint, for retrieving the images to scan. The hub.docker.com endpoint provides methods for pushing, pulling, deleting, and searching Docker images. The main methods of the hub.docker.com endpoint invoked by Docker Finder are listed below.


Fig. 4.1 Docker Hub, the concrete registry of Docker images.

Searching images - Docker Finder searches Docker images in Docker Hub. All the images stored within Docker Hub are returned by performing the following request.

GET hub.docker.com/v2/search/repositories/?query=*

The result of the previous request is a JSON object containing the (partial) list of the images contained in the registry.

{ " c o u n t ": 372246 ,

" next ": " h t t p s :// hub . d o c k e r . com / v2 / s e a r c h / r e p o s i t o r i e s /? q u e r y =%2 A & page =2" , " p r e v i o u s ": null , " r e s u l t s ": [{" s t a r _ c o u n t ": 0 , " p u l l _ c o u n t ": 0 , " r e p o _ o w n e r ": null , " s h o r t _ d e s c r i p t i o n ": " C o m p l e t e the c o n f i g u r a t i o n work " " i s _ a u t o m a t e d ": false , " i s _ o f f i c i a l ": false , " r e p o _ n a m e ": " m e m o r y 2 0 1 6 / c o n f i g u r a t i o n "} , {" s t a r _ c o u n t ": 0 , " p u l l _ c o u n t ": 0 , " r e p o _ o w n e r ": null , " s h o r t _ d e s c r i p t i o n ": " ubuntu python -t o o l s " , " i s _ a u t o m a t e d ": false , " i s _ o f f i c i a l ": false , " r e p o _ n a m e ": " x u c h u / x i n 7 c "} ,

(46)

... ]}

The total number of repositories in Docker Hub is 372246 (i.e., the count field). results contains the list of the returned images. By default, the endpoint returns only the first ten images within the registry: next is the request for getting the next (ten) images, while previous contains the request for getting the previous (ten) images.
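Since the endpoint is paginated, a crawler typically follows the next link until it is null. The following sketch illustrates the pattern with a fake in-memory client (the page payloads are illustrative, not real Docker Hub data):

```python
# Sketch of paginated crawling: follow the "next" link returned by each
# response until it is null. `fetch` fakes the hub.docker.com endpoint
# with three illustrative pages.

PAGES = {
    "page1": {"results": ["a/one", "b/two"], "next": "page2"},
    "page2": {"results": ["c/three"], "next": "page3"},
    "page3": {"results": ["d/four"], "next": None},
}

def fetch(url):
    # In the real crawler this would be an HTTP GET on the "next" URL.
    return PAGES[url]

def crawl_all(start_url):
    """Yield every repo name, page after page, until next is null."""
    url = start_url
    while url is not None:
        page = fetch(url)
        for repo in page["results"]:
            yield repo
        url = page["next"]

print(list(crawl_all("page1")))  # ['a/one', 'b/two', 'c/three', 'd/four']
```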

Retrieving information about repositories - The information associated with a single repository is retrieved by submitting the following request.

GET hub.docker.com/v2/repositories/:reponame

The reponame is the name of the repository to be retrieved. For example, the information associated with the memory2016/configuration repository can be obtained by performing the following request.

GET hub.docker.com/v2/repositories/memory2016/configuration

The response is a JSON object with all the available information about the memory2016/configuration repository.

{
    "user": "memory2016",
    "name": "configuration",
    "namespace": "memory2016",
    "status": 0,
    "description": "Complete the configuration work",
    "is_private": false,
    "is_automated": false,
    "can_edit": false,
    "star_count": 0,
    "pull_count": 1,
    "last_updated": null,
    "has_starred": false,
    "full_description": "",
    "permissions": {"read": true, "write": false, "admin": false}
}

Docker Finder exploits this information to construct part of the descriptions of images. For example, the fields star_count and pull_count are included in a description to identify the stars and the pulls associated with the corresponding repository.

Retrieving information about images - A single repository contains multiple images, and each image is identified by a tag. The information about the image identified by a tag is retrieved by the following request.

GET hub.docker.com/v2/repositories/:reponame/tags/:tag

For example, the information about the nimmis/java image with the latest tag is retrieved by the request:

https://hub.docker.com/v2/repositories/nimmis/java/tags/latest/

The result is a JSON object containing information concerning the specified image.

{
    "name": "latest",
    "full_size": 182339994,
    "id": 76781,
    "repository": 94799,
    "creator": 162525,
    "last_updater": 162525,
    "last_updated": "2016-09-20T05:11:14.136500Z",
    "image_id": null,
    "v2": true,
    "platforms": [5]
}

The information taken by Docker Finder is: the size of the image in bytes (i.e., full_size), and the time of the last update (i.e., last_updated).
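Extracting those fields from the response is straightforward; a small sketch (the payload reproduces a subset of the response shown above):

```python
import json

# Sketch: picking the fields Docker Finder needs out of the tag endpoint's
# JSON response (the sample payload is a subset of the one shown above).

response = json.loads("""
{
  "name": "latest",
  "full_size": 182339994,
  "last_updated": "2016-09-20T05:11:14.136500Z",
  "v2": true
}
""")

size_bytes = response["full_size"]
size_mb = size_bytes / (1024 * 1024)
last_updated = response["last_updated"]

print("{:.1f} MB, last updated {}".format(size_mb, last_updated))
```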

4.2 Implementation of the analysis part

Figure 4.2 zooms on the components of the analysis part by highlighting the interactions among internal and external services of the architecture. The Crawler and the Scanner are written in Python, while the message broker is an instance of RabbitMQ [10]. The complete source code of the analysis part is publicly available on GitHub2.


Fig. 4.2 Implementation of the analysis part.

4.2.1 Crawler and scanner

The crawler and the scanner are two Python classes, both contained in the module pyfinder3. As shown in Figure 4.2, both the scanner and the crawler communicate with the Docker Hub API and the RabbitMQ server. Each scanner also communicates with the ImagesService through the Images API, and with the Software Service through the Software API.

Crawler

Figure 4.3 shows the modules that compose the crawler class: ClientDockerHub is used to interact with the Docker Hub API, and PublisherRabbitMQ is the Pika client4 used to communicate with the RabbitMQ server.

Fig. 4.3 Details of the Crawler implementation.

The main methods of the crawler class are the following:

3 The module pyfinder contains all the source code needed by the scanner and the crawler.

4 Pika is a pure-Python implementation of the AMQP 0-9-1 protocol [9].

• crawl(): a generator function5 that crawls all the existing image names from the Docker Hub by exploiting the ClientDockerHub module.

• run(): starts the PublisherRabbitMQ module, which sends the images crawled by the crawl() function to the RabbitMQ server.

A snippet of the code of the crawl() function is shown below. The method is a generator function that yields the names of the crawled Docker images as JSON strings.

def crawl(self, from_page=1, page_size=10, max_images=100):
    sent_images = 0
    for list_images in self.client_hub.crawl_images(
            page=from_page, page_size=page_size, max_images=max_images,
            filter_images=self.filter_tag_latest):
        for image in list_images:
            repo_name = image['repo_name']
            sent_images += 1
            yield json.dumps({"name": repo_name})

The method crawls the images from Docker Hub by calling the crawl_images() method of the ClientDockerHub module (self.client_hub.crawl_images()). The filter_images parameter of crawl_images() is a function that the ClientDockerHub calls to filter the images6. The filter function takes the name of an image as input and returns True if the image must be downloaded, or False if the image must be discarded. Any filter function can be passed to filter the images. In Docker Finder, the filter_tag_latest() function keeps only the images that have the latest7 tag. The code of the filter_tag_latest() function is shown below:

def filter_tag_latest(self, repo_name):

5 Generator functions allow declaring a function that behaves like an iterator, i.e., it can be used in a for loop.

6 The crawler filters the images sent to the RabbitMQ server. In this way, the scanners receive only the images that must be scanned, which reduces their workload.

7 Docker Finder is a prototype and scans only the latest tag of each repository. This does not rule out scanning all the tags of an image: it suffices to remove the code that keeps only the images with the latest tag.


    list_tags = self.client_hub.get_all_tags(repo_name)
    if list_tags and 'latest' in list_tags:
        return True
    else:
        return False

The filter_tag_latest() function downloads from Docker Hub the list of tags associated with the image (repo_name). It returns True if the list contains the latest tag, and False otherwise.
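Since filter_images accepts any predicate on image names, the filtering logic can be tested in isolation. A small sketch, with a hypothetical StubClientHub replacing the real Docker Hub call behind get_all_tags():

```python
class StubClientHub:
    """Hypothetical stub: maps repository names to their tag lists,
    instead of querying the Docker Hub API."""
    def __init__(self, tags_by_repo):
        self.tags_by_repo = tags_by_repo

    def get_all_tags(self, repo_name):
        return self.tags_by_repo.get(repo_name, [])

class Filters:
    def __init__(self, client_hub):
        self.client_hub = client_hub

    def filter_tag_latest(self, repo_name):
        # Same logic as in Docker Finder: keep only images with a 'latest' tag.
        list_tags = self.client_hub.get_all_tags(repo_name)
        if list_tags and 'latest' in list_tags:
            return True
        else:
            return False

hub = StubClientHub({'library/redis': ['latest', '3.2'],
                     'user/old': ['1.0']})
f = Filters(hub)
```

With this stub, f.filter_tag_latest('library/redis') evaluates to True, while f.filter_tag_latest('user/old') evaluates to False, as does any repository with no tags at all.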

Scanner

The scanner class is the core of the analysis part. Figure 4.4 shows the five modules exploited by the scanner to interact with the external components:

Fig. 4.4 Details of the Scanner implementation.

ClientDockerHub The scanner exploits this module to interact with the Docker Hub API (to get the metadata of a Docker image — e.g., last update, size, owner, description).

ConsumerRabbitMQ The scanner is the consumer of the RabbitMQ server. ConsumerRabbitMQ implements the consumer client that asynchronously receives the names of the images to scan. ConsumerRabbitMQ exploits a Pika client to interact with the RabbitMQ server.
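Such an asynchronous consumer is typically structured with Pika as sketched below. The queue name images and the callback body are assumptions for illustration, not taken from the actual Docker Finder code; the message callback only parses the JSON payload, while connecting to the broker requires a running RabbitMQ server and is therefore kept in a separate function (using the Pika 1.x API).

```python
import json

def on_message(channel, method, properties, body):
    """Callback invoked by Pika for each received message: extracts the
    name of the image to be scanned from the JSON payload."""
    image = json.loads(body)
    return image["name"]

def start_consuming(host="localhost", queue="images"):
    # Sketch only: requires a running RabbitMQ server and the pika package.
    import pika
    connection = pika.BlockingConnection(pika.ConnectionParameters(host))
    channel = connection.channel()
    channel.queue_declare(queue=queue)
    channel.basic_consume(queue=queue,
                          on_message_callback=on_message,
                          auto_ack=True)
    channel.start_consuming()

# The callback itself can be exercised without a broker:
name = on_message(None, None, None, '{"name": "library/ubuntu"}')
```

In the real scanner, the callback would trigger the scan of the received image rather than merely returning its name.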
