
3.1.4 Evolution and further developments

One of the key aspects of our strategy is to enable the exploration of massive data sets (MDS) coming out of the ground segments of large telescopes or space mission facilities without the need to physically move huge amounts of data over the network from their original repositories. First of all, because restrictive data access policies could prevent such an action; second, because it can simply be impossible, given the amount of data to be moved with respect to the available network bandwidth. In these cases, the problem is rather to move the data mining toolsets directly to the locations hosting the data centers.

Current solution strategies under investigation in some communities, such as the web-based protocol for application interoperability (named Web SAMP Connector) being designed within the VO in the astronomical context, are only partial, because they still require data to be exchanged over the web between application sites.

In other words, none of these interoperability scenarios, based on data communication between all possible combinations of desktop and web applications, is able to process MDS in a realistic and affordable way.

Our approach is completely different: we do not move data between sites. The strategy is based on the concept of a standardized web application repository cloud, named “Lernaean Hydra” and described below, after the ancient serpent-like water beast with reptilian traits that possessed many independent but equal heads.

Other foreseen improvements are:

• Parallelization: introduction of MPI (Message Passing Interface) technology in the Framework and DBMS components of DAME, by investigating its deployment on multi-core platforms based on the GPU+CUDA computing paradigm (Maier, 2009). This could improve the computing efficiency of data mining models, such as Genetic Algorithms, which are naturally implementable in a parallel way (a minimal sketch of such a parallel evaluation is given after this list).

Figure 3.3: According to Nielsen’s Law, network bandwidth doubles every 21 months; instead, data in astrophysics are growing exponentially, with a doubling time of 1.5 years (courtesy G. S. Djorgovski).

• Scripting and workflows: a workflow is a sequence of “atomic” tasks, which may include loops, conditional branches (“case of”) and so on. At the moment this can only be done by executing all the tasks manually through the GUI; as soon as possible it will be feasible to write a script using our API (a sketch of a scripted workflow is also given after this list).

• Statistical and visualization tools: our models already produce some graphics, but at the moment the user cannot generate them independently of the execution of a job. We are working to implement statistical and visualization tools that can be run as standalone tasks.

• More editing tools: we want to enlarge the set of dataset editing tools for pre- and post-processing.
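To make the parallelization item more concrete, the following is a minimal, self-contained sketch (not DAME code) of how the fitness evaluation of a Genetic Algorithm population could be distributed over MPI processes using the mpi4py bindings; the toy fitness function and population are placeholders for a real data mining model.

```python
# Minimal sketch of MPI-parallel fitness evaluation for a Genetic Algorithm.
# Illustrative only, not part of the DAME framework; run with e.g.
#   mpirun -n 4 python ga_fitness_mpi.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def fitness(individual):
    # Placeholder fitness: sum of squares, standing in for a real model error.
    return float(np.sum(individual ** 2))

if rank == 0:
    population = np.random.rand(40, 5)          # toy population of 40 individuals
    chunks = np.array_split(population, size)   # one chunk per MPI process
else:
    chunks = None

local = comm.scatter(chunks, root=0)            # distribute individuals
local_scores = [fitness(ind) for ind in local]  # evaluate the local chunk
scores = comm.gather(local_scores, root=0)      # collect scores on the master

if rank == 0:
    best = min(s for chunk in scores for s in chunk)
    print("best fitness in this generation:", best)
```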
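The scripting and workflow item can likewise be illustrated with a small, self-contained Python sketch; the task names and the workflow structure below are purely hypothetical and do not correspond to the actual DAME API, but they show the intended replacement of manual GUI steps with scripted, atomic tasks and conditional branches.

```python
# Self-contained sketch of a scripted workflow: each "atomic" task is a
# function, the workflow is the ordered list of tasks, and a conditional
# branch ("case of") decides what happens next. All values are toy stand-ins.
def load_dataset(state):
    state["rows"] = list(range(100))            # stand-in for a real table
    return state

def split_dataset(state):
    n = int(len(state["rows"]) * 0.8)
    state["train"], state["test"] = state["rows"][:n], state["rows"][n:]
    return state

def train_model(state):
    state["validation_error"] = 0.05            # stand-in for a real training run
    return state

workflow = [load_dataset, split_dataset, train_model]

state = {}
for task in workflow:                           # sequential execution of atomic tasks
    state = task(state)

if state["validation_error"] < 0.1:             # conditional branch on the outcome
    print("model accepted, running on test set of", len(state["test"]), "rows")
```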

Lernaean Hydra

There are at least two reasons for not moving data over the network from their original repositories to the user’s computing infrastructure. First, the transfer could be impossible due to the available network bandwidth; second, there could be restrictive data access policies.

In these cases, the problem is to move the data mining toolsets to the data centers.

Current strategies, under investigation in some communities such as the VObs, are based on implementing web-based protocols for application interoperability (Derriere & Boch 2010; Goodman et al. 2012). Possible interoperability scenarios are:

1. DA1 ⇔ DA2 (data + application bi-directional flow)

(a) Full interoperability between DA (Desktop Applications);

(b) Local user desktop fully involved (requires computing power);

2. DA ⇔ WA (Web Applications) (data + application bi-directional flow)

(a) Full WA ⇒ DA interoperability;

(b) Partial DA ⇒ WA interoperability (such as remote file storing);

(c) MDS must be moved between local and remote apps;

(d) User desktop partially involved (requires minor computing and storage power);

3. WA ⇔ WA (data + application bi-directional flow)

(a) Except for URI exchange, no interoperability and different accounting policies;

(b) MDS must be moved between remote apps (but larger bandwidth);

(c) No local computing power required;

All of these mechanisms are only partial solutions, since they still require data to be exchanged over the web between application sites. At the International Virtual Observatory Alliance (IVOA) Interop Meeting, held in Naples in 2011, we proposed a different approach3:

4. WA ⇔ WA (plugin bi-directional exchange)

(a) All DAs must become WAs;

(b) Unique accounting policy (Google/Microsoft like);

(c) To overcome the MDS flow, applications must be plug & play (e.g. any WAx feature should be pluggable into WAy on demand);

The plugin exchange mechanism foresees a standardized web application repository cloud named “Lernaean Hydra” (from the name of the ancient snake-like monster with many independent but equal heads). A HEAD (Hosting Environment Application Dock) cloud is in practice a set of software containers of data mining model and tool packages, to be installed and deployed in a pre-existing data warehouse. In such a cloud, the HEADs can in principle differ in terms of number and type of available models, originally provided at installation time.

This is essentially because any hosting data center could require specific kinds of data mining and analysis tools, strictly related to specific data and knowledge search types. However, all such HEADs would be practically identical in terms of internal structure and I/O interfaces, being based on a pre-designed set of standards which

3 http://www.ivoa.net/cgi-bin/twiki/bin/view/IVOA/InterOpMay2011KDD

completely describe their interaction with the external environment as well as the application plugin and execution procedures. If two generic data warehouses host a HEAD on their sites, they are able to engage the mining application interoperability mechanism by exchanging algorithms and tool packages on demand. On a specific request, the mechanism starts a very simple automatic procedure which moves applications, organized in the form of small packages (a few MB in the worst case), through the Web from a source HEAD to a destination HEAD, installs them and makes the receiving HEAD able to execute the imported model on local data.
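As an illustration of this procedure, the following sketch (using only the Python standard library) shows how a destination HEAD could fetch a model package from a source HEAD and unpack it locally; the URL layout and the zip package format are assumptions made for the example, not an existing DAME interface.

```python
# Hypothetical HEAD-to-HEAD plugin transfer: fetch a model package published
# by a source HEAD as a zip archive and unpack it into the local plugin area.
import io
import urllib.request
import zipfile
from pathlib import Path

def import_model(source_head_url: str, model_name: str, plugin_dir: str = "plugins"):
    """Fetch a data mining model package from a source HEAD and install it locally."""
    url = f"{source_head_url}/plugins/{model_name}.zip"   # assumed repository layout
    with urllib.request.urlopen(url) as response:
        archive = zipfile.ZipFile(io.BytesIO(response.read()))
    target = Path(plugin_dir) / model_name
    target.mkdir(parents=True, exist_ok=True)
    archive.extractall(target)                            # "install" the package
    return target                                         # ready to run on local data

# e.g. import_model("https://head-a.example.org", "genetic_algorithm")
```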

Further refinements of the above mechanism could be introduced at the design phase, such as, for instance, having each HEAD expose a public list of available models, in order to inform other sites about the services that could be imported. Of course, such a strategy requires a well-defined design approach, in order to provide a suitable set of standards and common rules to build and codify the internal structure of HEADs and data mining applications (for example, rules like PMML, the Predictive Model Markup Language). These standards can in principle be designed to maintain and preserve compliance with data representation rules and protocols already defined and currently operative in a particular scientific community (such as the VO in Astronomy). Considering also that HPC (High Performance Computing) resources are no longer prohibitive and are becoming more easily affordable by data warehouses, our scenario seems to be really feasible. Any data center can provide a suitable computing infrastructure hosting a HEAD and become a sort of mirror site of the data mining application repository. The first step towards the realization of a prototype of such a strategy is the currently available resource named DMPlugin. It consists of a software package that allows any third-party routine/program to be imported into the DAMEWARE platform, according to the standard definition of a plugin in Web terminology. By following the steps of a graphical wizard, the user is able to configure and define all the I/O requirements needed to wrap his own package, ready to be automatically embedded into the Web Application REsource (a possible shape of such a descriptor is sketched below).
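The kind of information collected by such a wizard can be pictured with a simple plugin descriptor, as in the following sketch; the class and field names are illustrative assumptions and do not correspond to the actual DMPlugin format.

```python
# Hypothetical I/O descriptor for a wrapped third-party routine, in the spirit
# of what the DMPlugin wizard collects. Field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class PluginDescriptor:
    name: str
    executable: str                                     # command launching the routine
    input_files: list = field(default_factory=list)     # expected input table(s)
    output_files: list = field(default_factory=list)    # products to publish back
    parameters: dict = field(default_factory=dict)      # user-tunable options

descriptor = PluginDescriptor(
    name="my_clustering_code",
    executable="./cluster --config config.ini",
    input_files=["catalog.fits"],
    output_files=["labels.csv"],
    parameters={"n_clusters": 10},
)
```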

In this scheme, the Lernaean Hydra cloud consists of Web Application Repositories (WARs) of data mining model and tool packages, to be installed and deployed in a generic data warehouse. Different WARs may differ in terms of available models, since any hosting data center might require specific kinds of data mining and analysis tools. If the WARs are structured around a pre-designed set of standards which completely describe their interaction with the external environment, as well as the application plugin and execution procedures, two generic data warehouses can exchange algorithms and tool packages on demand.

On a specific request, the mechanism starts a very simple automatic procedure which moves applications, organized in the form of small packages (a few MB in the worst case), through the Web from a source WAR to a destination WAR, installs them and makes the receiving WAR able to execute the imported model on local data. Further refinements of the above mechanism can be introduced at the design phase, such as, for instance, having each WAR expose a public list of available models, in order to inform other sites about the services that could be imported (a minimal sketch is given below). Such a strategy requires a standardized design approach, in order to provide a suitable set of standards and common rules to build and codify the internal structure of WARs

Figure 3.4: The main steps of the application plugin exchange mechanism among data mining web warehouses.

and of the data mining applications themselves, such as, for example, rules like the Predictive Model Markup Language (PMML; Guazzelli et al. 2009).
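Coming back to the idea of exposing a public list of available models, a possible realization is a trivial HTTP endpoint returning a JSON inventory, as in the following standard-library sketch; the endpoint path and the example model inventory are assumptions for illustration only.

```python
# Hypothetical WAR endpoint publishing the locally available models, so that
# other sites know which services can be imported on demand.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

AVAILABLE_MODELS = {"MLPQNA": "1.2", "GAME": "0.9", "K-Means": "2.0"}  # example inventory

class ModelListHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/models":                      # assumed endpoint path
            body = json.dumps(AVAILABLE_MODELS).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ModelListHandler).serve_forever()
```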

These standards should be designed to maintain and preserve compliance with data representation rules and protocols already defined and currently operative in a particular scientific community (such as the VObs in Astronomy). In the case of scheme 4, no local computing power is required: even smartphones can run DM applications. From this, the following series of requirements follows:

• Standard accounting system;

• No MDS moved over the web any more; only applications are moved, structured as plugin repositories and execution environments;

• Standard modeling of WAs and components to obtain the maximum level of granularity;

• Evolution of existing architectures to extend web interoperability (in particular for the migration of the plugins);

After a certain number of such iterations the scenario will become:

• No different WAs, but simply one WA with several sites (possibly with different GUIs and computing environments);

• Any WA site can become a mirror site of all the others;

• The synchronization of plugin releases between WAs is performed at request time;

• Minimization of the data exchange flow (just a few plugins in the case of synchronization between mirrors).

Any data center could implement a suitable computing infrastructure hosting the WAR and thus become a sort of mirror site of a world-wide cross-sharing network

Figure 3.5: The scenario after several cycles of the application plugin exchange mechanism. At the end, all web repositories are equivalent mirrors, able to perform data mining experiments on local massive data archives.

of data mining application repositories, in which a virtuous mechanism of distributed, multi-disciplinary data mining could be engaged, able to deal with heterogeneous or specialized exploration of MDS. Such an approach seems the only effective way to preserve data ownership and privacy policies, to enhance e-science community interoperability and to overcome the problems posed by the present and future tsunami of data. By following this approach, the DAMEWARE web application, described in the previous chapters, represents a first prototype towards a WAR mechanism.