BIG DATA FOR LABOUR MARKET INTELLIGENCE

Paper written by Mario Mezzanzanica and Fabio Mercorio for the European Training Foundation.

The contents of this paper are the sole responsibility of the authors and do not necessarily reflect the views of the ETF or the EU institutions.

© European Training Foundation, 2019


PREFACE

This introductory guide addresses key conceptual, methodological and organisational aspects in using Big Data for labour market intelligence (LMI). The target readers and users are statisticians, researchers, policy analysts and decision-makers in the European Training Foundation’s (ETF) partner countries who are confronted with the challenges of anticipation and dissemination of insights on the dynamics of demand for jobs, skills and qualifications.

Big Data is all around us, but its potential and the ways it can be used in social research remain a novelty for many state institutions and stakeholders in ETF partner countries and beyond.

This introductory guide clarifies how Big Data can be used to go beyond the frontiers of conventional approaches to labour market information systems (LMIS) and to add value to established statistics. Traditional approaches to providing LMI, based essentially on surveys, have important limitations: cost, timeliness, accuracy, usage, integration and coverage. These challenges can be addressed, but doing so will require the attention of governments, stakeholders and their donor partners.

Big Data sources and analysis supplement and enrich established statistics. Big Data analytics can be used to map skills by occupation, to identify discrepancies in skills, to identify obsolete skills, to do predictive analysis of demand for new occupations and new skills – in quasi real time. Big Data analytics allow more refined (granular), space-related insights in real time, as well as predictive analysis.

The volume, variety and velocity of Big Data will continue to increase. Vast amounts of digital data are generated by people, organisations, smart sensors, satellites, surveillance cameras, the internet and countless other devices. The endeavour to make sense out of that data brings about exciting opportunities. Creating knowledge out of the data is the major goal of Big Data analysis. In other words: it is about value.

Big Data is associated with non-negligible challenges and issues, in particular veracity. This refers to the quality of the data, which can vary greatly and requires adequate approaches, rules and techniques. There are also issues related to data protection and privacy requiring safeguards. But before diving into the techniques of Big Data analytics, an interested organisation or group of stakeholders needs to start by asking: What is the problem at large in our domain? How do we see ourselves solving it? Who needs and who will use the insights we will deliver? What will be the scope, granularity and visualisation of the insights? Who will make sense of the data-driven insights?

The application domains of Big Data analysis are wide; fortunately, the phenomena and dynamics of the jobs and skills markets can be screened and analysed using Big Data. However, a number of important themes might not be captured as yet through Big Data analytics, for example the features and trends of informal employment, which in many countries is very large.


The ETF thanks the team of experts who authored this introductory guide – Mario Mezzanzanica and Fabio Mercorio – for demonstrating flexibility throughout the drafting process to adapt the information to the needs of the target users and for sharing their experience and insights, taking account of their own research (CRISP, University of Milano-Bicocca) and other relevant projects across the world used as examples in this guide.

The ETF is grateful to all organisations that contributed to this paper with examples and cases that helped to illustrate the key messages. ETF expert Eduarda Castel-Branco coordinated the work.


CONTENTS

PREFACE

EXECUTIVE SUMMARY

1. BIG DATA AND LABOUR MARKET INFORMATION: HOW TO ENHANCE LMI IN THE DIGITAL ERA – OVERVIEW, STATE OF PLAY, POTENTIAL AND LIMITATIONS
1.1 Background and definitions
1.2 Big Data meets LMI
1.3 Literature on Big Data for LMI
1.4 Big Data for LMI in operation

2. INCORPORATING BIG DATA ANALYTICS IN LMI: SYSTEMATIC STEPS
2.1 Big Data architecture components
2.2 State-of-the-art architectures, technologies and tools
2.3 The role of AI for LMI: algorithms and frameworks to reason with LM data

3. USE OF BIG DATA ANALYTICS FOR LMIS: A SELECTION OF CASES TO BE USED AS PRACTICAL REFERENCE
3.1 CyberSeek.org – United States of America
3.2 WheretheWorkIs.org – United Kingdom
3.3 Bizkaia Basque Talent Observatory – Spain
3.4 The data-driven skills taxonomy – United Kingdom
3.5 Technical, entrepreneurial and vocational education and training – Malawi
3.6 Transfer Occupations and Tensions Indicators projects – The Netherlands
3.7 Real-time labour market information on skill requirements – All EU Member States

4. CONCLUSIONS AND RECOMMENDATIONS
4.1 Summary of recommendations and steps for the ETF and its partner countries
4.2 Ideas for pilot projects

ACRONYMS


EXECUTIVE SUMMARY

In the past few decades, significant forces and factors have dramatically changed the nature and characteristics of the labour market in both advanced and developing countries. On the one side, technical progress, globalisation and the re-organisation of the production process – with outsourcing and offshoring – have radically altered the demand for certain skills: several jobs are disappearing while new jobs are emerging. Of these, some are simply a variant of existing jobs while others are genuinely new jobs that were non-existent until a few years ago. Notably, population ageing in advanced economies intensifies the need for continued training, and is likely to affect the structural demand for certain competences: the quantity and quality of the demand for skills and qualifications associated with the new labour market has changed substantially. Not only are new skills needed to perform new jobs but also the skills requirements of existing jobs have changed considerably. On the other side, in recent years, the amount of labour market information (LMI) conveyed through specialised internet portals and services has grown exponentially, encouraging and supporting the realisation of many internet services and tools related to the labour market, such as job matching services, advertising of job positions, services for sharing curricula, and the creation of a network of professionals who share and exchange labour market opportunities freely.

In such a dynamic scenario, some questions arise relating to observing, understanding and analysing the internet-based labour market phenomenon and its dynamics properly, such as: Which occupations will grow in the future and where? What skills will be most in demand for companies and firms in the next few years? What skills should one acquire during their lifelong learning path? Which jobs are really new, and which ones are just an evolution of old existing jobs that require new or technological skills? What is the impact of digitalisation within professions? What is the role that soft skills play within existing jobs and which are the most significant soft skills to gain?

Those are just a few of the questions at the forefront of the policy debate among economists, policymakers and labour market experts in general. Nowadays, these questions need to be addressed through data-driven paradigms that allow us to observe and monitor a phenomenon (i) in a timely way, (ii) inductively, i.e. data are used to generate and confirm hypotheses rather than the other way round, and (iii) at a very fine-grained level.

Indeed, internet LMI provides a great opportunity for real-time labour market monitoring: it helps us better understand labour market dynamics and capture skills needs and trends along different dimensions (e.g. territory, sector) at a detailed level – in short, Big Data related to LM intelligence (Big Data 4 LMI). Not surprisingly, there has been growing interest in designing and implementing real LMI applications for internet labour market data to support policy design and evaluation through evidence-based decision-making. That is the aim of LMI, a field that is becoming increasingly relevant to European Union (EU) labour market policy design and evaluation.

In 2016, the European Commission highlighted the importance of vocational and educational activities, as they are ‘valued for fostering job-specific and transversal skills, facilitating the transition into employment and maintaining and updating the skills of the workforce according to sectorial, regional, and local needs’. In 2016, the EU and Eurostat launched the ESSnet Big Data project, involving 22 EU Member States, with the aim of ‘integrating Big Data in the regular production of official statistics, through pilots exploring the potential of selected Big Data sources and building concrete applications’. In the same year, the European Centre for the Development of Vocational Training (Cedefop) launched an initiative to develop a system or tool to analyse vacancies and emerging skills requirements across all EU Member States, realising a fully-fledged multilanguage (32 languages) system that collects vacancies, extracts skills and performs real-time monitoring across all 28 EU Member States to support decision-making.

Though these initiatives differ, the common baseline relies on recognising the huge informative power behind internet LMI. This informative power can be exploited by putting together computer scientists, statisticians, economists and labour market experts to derive useful labour market knowledge from raw data, to understand internet labour market dynamics and trends, and to move towards a data-driven decision-making process through LMI.


1. BIG DATA AND LABOUR MARKET INFORMATION: HOW TO ENHANCE LMI IN THE DIGITAL ERA – OVERVIEW, STATE OF PLAY, POTENTIAL AND LIMITATIONS

1.1 Background and definitions

This section briefly introduces some terms and background notions related to labour market (LM) data to help the reading of this document.

Labour market information/intelligence

These two terms – often used interchangeably – refer to data related to LM phenomena and dynamics that are useful for supporting decision-making, policy design and evaluation. However, it is not clear from the use of LMI whether I stands for Information or Intelligence.

Specifically, I as Information describes all kinds of data and information used to support operational activities related to the LM (no analytics) as well as any information related to LM demand and supply. Examples include job vacancy postings, skills, occupations and job applicants’ CVs.

By contrast, I as Intelligence is an emerging concept in the whole LM community, especially in the European Union (EU). Although there is no unified definition of LM intelligence, it can be described as the design and use of Artificial Intelligence (AI) algorithms and frameworks to analyse data related to the LM (aka Labour Market Information) for supporting policy and decision-making (see, e.g. [1], [2], [3]).

Q&A

When does I as Information become Intelligence?

Roughly, we would say that the use of raw or aggregated data, including data monitored over time, to support operational activities is still Information. It becomes Intelligence when an automated algorithm (today mainly exploiting AI) processes the data to generate insights useful for the purposes of analytics (e.g. forecasting for decision-making activities, machine learning for classification, or information extraction for skills on CVs). Notably, the ability to handle and analyse masses of data in real time enables the knowledge obtained from the Intelligence process to be employed in systems usually devoted to supporting operational activities.


Q&A

Can information and intelligence work together within a system (or a framework) to support decision-making activities?

Yes, this is the way LM information and intelligence should interact, namely within a system (back-end) that collects LM information and uses AI to create LM intelligence. The result, LM knowledge, is then provided to a set of stakeholders according to their needs and abilities in understanding labour market dynamics. This process describes how a labour market information system (LMIS) should work.

Labour market information system

In the previous section, we clarified the difference between LM information (i.e. raw data normally used to exchange information inside operative service processes relating to the LM), and LM intelligence (tools, algorithms and procedures to manipulate LM information). These two concepts participate in the realisation of an LMIS, where an information system is commonly defined as a set of interconnected components (technological and architectural) that work together to collect, retrieve, process, store and distribute information to facilitate activities such as planning, control, coordination, analysis and decision-making in business organisations. Hence, the value of information accessed through an information system is twofold: one, it supports operational processes, and two, it helps decision-makers achieve their analysis objectives.

LMIS (intuition)

An LMIS can be seen as an instance of a classical information system that employs LM information and intelligence to support both operational and decision-making activities.

In essence, the LMIS concept can be described as a set of tools able to extract, analyse and disseminate LM-related information. Nevertheless, there is no unified definition of what an LMIS should be, and no one-size-fits-all practical advice on developing an LMIS, as its architecture, data and methods depend on analysis needs, which are context-dependent (e.g. country, institution, important issues, policy priorities and data infrastructure). Practical and different examples of LMIS appear in [4], [5], [6], [2], to cite just a few recent works. Some of these will be discussed later in the document. In this respect, the availability of data on the internet (see Chapter 2) sheds light on the importance of upgrading and evolving LMIS in order to include internet data and to exploit AI algorithms to derive useful insights and formulate predictions on LM dynamics and trends (as recently argued by Johnson in [7] and shown by Frey and Osborne [8] to predict the risk of robotisation). These reasons led analysts and LM experts to include the internet as an additional source of LM data and information in their own work, to better describe and understand the LM as a whole.

Data sources for LMI


Administrative data. In essence, a reliable definition of administrative data is ‘data sets collected by government institutions or agencies for tax, benefit or public administration purposes’ [9]. This means that these data also refer to information collected from (or about) individuals, who may need to take action to become part of a system that uses administrative data (e.g. registration of farmers in the tax and social security system) or not (e.g. Italian labour law states that a system has to automatically monitor the start/end of every employment contract; see [10]).

Statistical data. Statistical data (also known as survey data) are collected to fit a specific and predefined statistical purpose to ensure a given coverage of population, definitions, methodology, quality and time in order to meet the stakeholder’s analytical needs (see, e.g. [11]). Clearly, the use of administrative data for statistical purposes is far from straightforward, as it involves challenging issues such as identification of the population, the target population and the size of the sample, and the difficulty of selecting the model variable to sample the population.

Although statistical and administrative data differ in terms of purposes, they share some interesting characteristics, as shown in Table 1.1.

TABLE 1.1 MAIN CHARACTERISTICS FOR LM DATA SOURCES

| Characteristic | Statistical | Administrative | Web |
| Data type1 | Structured | Structured or semi-structured | Structured, semi-structured or unstructured |
| Generation rate | Periodically | Periodically | Near-real time or real time |
| Data model paradigm | Relational | Relational | Relational and non-relational (NoSQL) |
| Quality | Owner’s responsibility | Owner’s responsibility | User’s responsibility |
| Coverage | Owner’s responsibility | Owner’s and user’s responsibility | User’s responsibility |
| Analysis paradigm | Top-down and model based | Top-down and model based | Bottom-up and data driven |
| Believability | Owner’s responsibility | Owner’s and user’s responsibility | User’s responsibility |
| Value | Intrinsic | Intrinsic | Extrinsic |

Statistical data are often structured data (e.g. tables of numbers with a well-defined structure and type), while administrative data may also include semi-structured data, where the structure is only partially defined and free text can appear. Nonetheless, these data can be easily stored using classical relational paradigms (e.g. Structured Query Language (SQL)). The guarantee that statistical data are of good quality is the responsibility of the data producer or data owner, who also designed the data collection/study. This might not be true for administrative data, whose quality might be considered sufficient by the data owner but poor by the data consumer. This is not surprising, as data quality is defined as ‘fitness for use’, thus quality satisfaction may vary as the user varies. Administrative data are collected to monitor a phenomenon rather than for analytical purposes (see, e.g. [12]). This also means that the believability of statistical data – meaning ‘the extent to which data are accepted or regarded as true, real and credible’ [13] – depends on the trustworthiness of the data producer/owner, and this might also be true for administrative data. As both these kinds of data are collected from a system (administrative) or designed for a specific analysis goal (statistical), their value is intrinsic.

1 Structured data refers to clearly defined data types whose structure and recurrent pattern make them easily searchable.


In other words, the data are inherently of value. Clearly, this value can be improved by analysing and linking data, but it still remains.

This scenario changes when dealing with internet data, which simply refers to all data coming from internet sources. As one might imagine, these data can have any kind of structure: they may be structured (e.g. tables collected from the internet), semi-structured (e.g. XML2, such as tweets) or completely unstructured (everything else). These data are generated continuously from one or more internet sources over which the data user has no control, and this forces the user to continuously monitor and collect data. As the structure of internet data can vary in an unpredictable way, relational paradigms (which require a fixed and defined data structure) cannot be used to store internet data as they flow from the web. NoSQL3 paradigms have been developed to deal with this issue. Furthermore, quality depends on the user’s ability to identify issues within the data (duplications, missing data, typos, synonyms, etc.) as well as on coverage, which has to be estimated and measured by the data user, often by combining multiple internet data sources. Consequently, believability depends on the trustworthiness of the data user rather than of the data owner. Finally, internet data do not have intrinsic value; their value depends on their ability to describe and explain a phenomenon. In other words, internet data are raw, and their value has to be uncovered/discovered by the user.
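To make the storage point concrete, the sketch below stores a few heterogeneous postings in a document store, where no fixed schema has to be declared. It is a minimal illustration assuming a local MongoDB instance and the pymongo driver; the fields and values are invented and do not refer to any particular LMIS.

```python
# Minimal sketch: storing schema-less job postings in a document store.
# Assumes a MongoDB instance on localhost and the pymongo package.
from pymongo import MongoClient

postings = [
    # Postings collected from different sources rarely share the same fields.
    {"title": "Data analyst", "skills": ["sql", "python"], "region": "Lombardia"},
    {"title": "Software developer", "description": "Free-text body only, no skill list."},
    {"title": "Nurse", "contract": "permanent", "source": "public_portal"},
]

client = MongoClient("mongodb://localhost:27017/")
collection = client["lmi"]["postings"]

collection.insert_many(postings)                    # no table schema is declared upfront
for doc in collection.find({"skills": "python"}):   # the query matches inside the array field
    print(doc["title"])
```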

Internet data could be likened to a block of granite that needs to be worked by the artist, who may decide to use a milling machine or a chisel, to shape one figure rather than another.

These differences, mainly between internet data as opposed to statistical and administrative data, also force the user to shift their approach to the analysis: from a model-based approach using a top-down process, to a data-driven approach requiring a bottom-up method, as shown in Figure 1.1.

FIGURE 1.1 A PARADIGM SHIFT – FROM TOP-DOWN TO BOTTOM-UP

2 Extensible Markup Language (XML) refers to a markup language used to define a set of rules for encoding documents in a format that is both human-readable and machine-readable.

3 NoSQL (not only SQL) refers to a growing movement to support the storage and querying of unstructured data.


Q&A

Due to the massive presence of unstructured data/texts, it seems that the use of Big Data makes it impossible to perform data quality tasks, which are well defined for structured data. Is this the case?

The application of data quality (and cleaning) tasks on internet data is still an open debate. Some people think internet data should be managed like classical structured data, while others say that the ‘garbage in, garbage out’ principle does not apply to Big Data, since the volume will act as a denoising factor. In our experience, the quality of Big Data depends primarily on the reliability of the sources used to collect the data. Thus, ranking internet sources is crucial. Any data quality techniques can be applied: rule-based (if a data model can be identified) or statistical (to identify outliers and denoise data).
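As a small illustration of the statistical option mentioned in the answer, the sketch below flags implausible values of a numeric field (here, advertised salaries) with a median-absolute-deviation rule; the field, figures and threshold are invented, not a recommended standard.

```python
# Minimal sketch: statistical denoising of a numeric field (advertised salary)
# using the median absolute deviation (MAD). Field and threshold are illustrative.
from statistics import median

def flag_outliers(values, k=3.5):
    """Return indices of values lying more than k robust deviations from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9   # avoid division by zero
    return [i for i, v in enumerate(values) if abs(v - med) / mad > k]

salaries = [28000, 30000, 29500, 31000, 2900000, 27500]  # one implausible entry
print(flag_outliers(salaries))                           # -> [4]
```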

General Data Protection Regulation issues related to the LM

The General Data Protection Regulation (GDPR) came into effect in May 2018 in all EU Member States. It represents a first step towards regulation of personal data manipulation and processing. If data do not contain personal information, the GDPR does not apply. Otherwise, if data contain personal information related to a subject (e.g. CV data, personal preferences, historical career paths or personal skills), then the LMIS that uses the data must be compliant with the GDPR.

In essence, the GDPR aims to guarantee the data subject’s fundamental rights, and to increase the accountability of companies that control and process personal data. The GDPR establishes a number of restrictions and constraints on personal data usage.

■ First, the data subject’s right to access information collected about him/her places restrictions on automated decision-making by companies and organisations using these data.

■ Second, entities designated to process personal data (i.e. data processors) must notify the data subjects about the data collected (Articles 13–15).

■ Third, transparency assumes a key role, forcing the data processor to handle data in a transparent manner (Article 5, §1a), through transparent data processing (Article 13, §2 and 14, §2), and to notify the data subject if an automated decision-making process is applied to their personal data (Article 22). Furthermore, Articles 13 and 14 state that, when profiling takes place, a data subject has the right to ‘meaningful information about the logic involved’.

From a technical perspective, this also applies to all extraction, transformation and loading (ETL)4 processes and routines, which extract data, transform them from one format into another, and finally load processed and aggregated data into data warehouses for analytical purposes. In this way, when a key performance indicator or a general business indicator is computed, it is not possible to identify the information source or which data related to a data subject have been used. This also applies to (personal) data related to the LM. Roughly, this means that the data processor is responsible for guaranteeing, among other things: (i) that the subject to whom data refer cannot be identified either directly or indirectly, where the subject’s identifiers are their name, an identification number, location data, an online identifier or one or more characteristic elements of their physical, physiological, genetic, psychological, economic, cultural or social identity (Article 4); (ii) that data are processed lawfully, fairly and in a transparent manner in relation to the data subject (Article 5); and (iii) that data are collected for specified, explicit and legitimate purposes and not further processed in a manner that is incompatible with those purposes (Article 5).

4 ETL is an approach supporting data pre-processing and transformation tasks in the knowledge discovery in databases (KDD) process.
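As one concrete technical safeguard among many, an ETL routine can pseudonymise direct identifiers before records reach the analytical store, so that aggregated indicators cannot be traced back to a named person. The sketch below illustrates the principle with a keyed hash; field names are assumptions and this is not legal guidance.

```python
# Minimal sketch: pseudonymising direct identifiers during an ETL step,
# so the analytical store keeps no names or e-mail addresses in clear text.
# This illustrates one technical safeguard only; it is not legal advice.
import hashlib
import hmac

SECRET_SALT = b"rotate-and-store-this-outside-the-warehouse"   # illustrative key

def pseudonymise(value: str) -> str:
    """Keyed hash of an identifier: stable for record linkage, not reversible without the key."""
    return hmac.new(SECRET_SALT, value.encode("utf-8"), hashlib.sha256).hexdigest()

raw_record = {"name": "Jane Doe", "email": "jane@example.org", "skill": "welding"}
clean_record = {
    "person_id": pseudonymise(raw_record["email"]),   # keeps linkage across records
    "skill": raw_record["skill"],                     # analytical content is retained
}
print(clean_record)
```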

1.2 Big Data meets LMI

The growing interest in the manipulation and use of Big Data lies in the fact that it enables managers to measure their businesses more effectively, hence creating knowledge that can improve their decision-making and performance (see, e.g. [14]). This is a very general statement that also applies to the LM. However, what Big Data actually is, what makes data big and what does not, as well as the challenges and opportunities involved in dealing with Big Data, are all questions that are still open to debate.

FIGURE 1.2 KEY ELEMENTS DRIVING THE RISE OF BIG DATA

In recent years, the community has tried to answer these questions using a variety of Big Data ‘models’, based on the dimensions (or five ‘Vs’) that a Big Data application/approach should possess. Although several models have been proposed, here we suggest a five Vs model adapted for the LM domain, which characterises Big Data with respect to five fundamental dimensions: volume, velocity, variety, veracity and value.

Volume. In 2017, there were about 4 billion internet users worldwide, and this number is growing at increasing speed: the first billion was reached in 2005, the second billion in 2010 and the third billion in 2014. Around 40% of the population has access to an internet connection. In 2018, there were about 2 billion active websites (excluding the ‘deep web’, i.e. webpages that cannot be indexed by search engines) and more than 3.5 billion Google searches are run every day. More data cross the internet every second than were stored in the entire internet just 20 years ago. This gives companies a unique opportunity to access and collect these data in order to make better decisions and improve their businesses. It is estimated, for example, that Walmart is able to collect about 2.5 petabytes (i.e. 2.5 quadrillion bytes) of data every hour from its customer transactions. Although the classic Big Data approach measures volume in terms of bytes, which works well for system-generated user data (e.g. logs and transactions), this unit of measure does not apply to LM information, as the scale changes considerably. In the LM domain, it might be of interest to measure the number of records or items collected that relate to LM demand or supply, or the number of LM sources considered.


Velocity. This dimension refers to the rate at which data are generated or, in the case of LMI, collected. These data are collected from third-party sources, which may decide to allow data to be collected autonomously through (i) application programming interfaces (APIs)6, (ii) batch procedures executed periodically, or (iii) near real-time crawling or scraping7 of the source, obtaining data at close, fixed intervals. Clearly, the lower the frequency of data collection, the higher the volume of data collected in each run, and the greater the need for computing power and large storage resources.
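The three collection modes differ mainly in who controls the cadence. The sketch below polls a hypothetical vacancies API at a fixed interval; the endpoint, parameters and response format are invented for illustration, and a batch or scraping set-up would replace the inner call while keeping the same scheduling logic.

```python
# Minimal sketch: periodic collection from a (hypothetical) vacancies API.
# The endpoint, parameters and response format are illustrative assumptions.
import time
import requests

API_URL = "https://example.org/api/vacancies"    # placeholder endpoint

def collect_once(since: str) -> list[dict]:
    response = requests.get(API_URL, params={"published_after": since}, timeout=30)
    response.raise_for_status()
    return response.json()          # assume the API returns a JSON list of postings

def collect_forever(interval_seconds: int = 3600) -> None:
    last_run = "1970-01-01T00:00:00Z"
    while True:
        batch = collect_once(last_run)
        print(f"collected {len(batch)} postings")   # in practice: store them (see the NoSQL sketch above)
        last_run = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        time.sleep(interval_seconds)                # the shorter the interval, the closer to real time
```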

Variety. This dimension refers to the variety of data types within Big Data sources, as discussed in Table 1.1. The source may be structured, semi-structured or completely unstructured. As a consequence of using large volumes of unstructured content, the lexicons employed within each source vary, and large-scale use of natural language means the heterogeneity of the data is considerable.

Veracity. Data veracity indicates how accurate or truthful a data set may be. As discussed above, the quality of internet data cannot be controlled at the source, but has to be assessed when data are collected and stored, through ad hoc analyses, procedures and tools. Bias, abnormalities or inconsistencies, duplication and volatility are some of the aspects that need to be removed to improve the accuracy of Big Data. As one might imagine, for a given data source, the higher the variety, the greater the veracity challenges. Indeed, the use of natural language brings into a text a great deal of noise containing no information (e.g. prepositions, terms not related to the topic of interest, conjunctions, and acronyms that have to be expanded). All these issues have to be properly addressed to enable unstructured data to produce knowledge in the knowledge discovery in databases (KDD) steps.

Value. Finally, data have to be of value to a specific domain or purpose. In other words, as discussed above, internet data have no intrinsic value. Their value is the knowledge the user extracts from the data to explain a phenomenon, or to support decision-making through analyses and recommendations. It is worth mentioning the importance of analysing stakeholder needs, which should identify which portion of knowledge is of interest to a given stakeholder and which is not. In the case of job vacancies advertised on the internet, a user looking for a new job would be interested in performing a gap analysis with respect to his/her skills, while an LM analyst might be interested in observing the internet LM as a whole at a specific regional level. The same knowledge might have different points of access depending on stakeholder needs.
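The jobseeker example above boils down to a simple comparison of skill sets. The sketch below contrasts a candidate's skills with those most frequently requested in postings for a target occupation; the skill lists are invented for illustration.

```python
# Minimal sketch: skill gap analysis between a candidate profile and the skills
# most frequently requested in postings for a target occupation (toy data).
from collections import Counter

candidate_skills = {"sql", "excel", "communication"}

postings_for_occupation = [
    {"sql", "python", "power bi"},
    {"sql", "excel", "python"},
    {"python", "communication", "statistics"},
]

demand = Counter(skill for posting in postings_for_occupation for skill in posting)
gap = {skill: count for skill, count in demand.most_common() if skill not in candidate_skills}
print(gap)   # e.g. {'python': 3, 'power bi': 1, 'statistics': 1}
```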

6 In the web context, API refers to a set of procedures that a user employs to communicate with the server (e.g. for collection of data). Consequently, communication between user and server is regulated, well defined and monitored.

7 Crawling collects data from websites as they are, while scraping identifies certain parts of a website to be extracted.


FIGURE 1.3 FIVE Vs BIG DATA MODEL, ADAPTED FOR LMI APPLICATION

Turning Big Data into LM information

We now turn to how knowledge can be extracted from (big) LM data. To this end, Figure 1.4 summarises the main challenges involved in dealing with internet LMI, as discussed in [15].

FIGURE 1.4 MAIN ELEMENTS OF INTERNET LM SCENARIO, STAKEHOLDER NEEDS AND ACTIONS PROPOSED


FIGURE 1.5 KDD PROCESS AND BIG DATA Vs INVOLVED IN EACH STEP

Source: based on the figure in [16].

Selection. Selection of data sources is the first step. Each internet source has to be evaluated and ranked in terms of the reliability of the information. For example, this phase should take into account the vacancy publication date, the website’s update frequency, the presence of structured data, and any downloading restrictions. At the end of this phase, a ranking of reliable web sources is produced. This step involves all five Big Data Vs, including also veracity, i.e. the biases, noise and abnormalities in the data. Some key questions posed by the selection phase for LM experts include:

1. [Statistical] How do we identify criteria to be included in the source model, and how do we extract these criteria (i.e. variables) from sources? How do we rank sources?

2. [Technical] How do we identify a data model paradigm (e.g. relational, document, key value, graph) to store huge volumes of data at scale? How do we collect data automatically? Do we need API access, or do we need to develop a web scraper/crawler? How do we schedule automatic data collection processes?

3. [LM domain expert] How do we select the right sources? Have we selected the right sources?

Pre-processing. This step includes data cleaning to remove noise or inappropriate outliers (if any) from the data, deciding how to handle missing data, and identifying a function to detect and remove duplicated entries (e.g. duplicated vacancies or vacancies with missing values); a minimal duplicate-detection sketch follows the questions below. Data quality and cleaning are essential tasks in any data-driven decision-making approach, to guarantee the believability of the overall process, i.e. ‘the extent to which data is accepted or regarded as true, real and credible’ (see, e.g. [12], [13], [17]).

Identification of duplicated job vacancies is far from straightforward. Job vacancies are usually posted on multiple websites, and this is a duplication, whereas re-use of the same text to advertise a similar position is not. Identification of appropriate features for correct recognition of duplicates is crucial in the internet LM domain. The pre-processing step reduces the complexity of the Big Data scenario, mitigating the impact of the veracity dimension through data quality and cleaning. Key questions raised by step 2 for LM experts include:

1. [Statistical] How do we evaluate data consistency? How do we measure data accuracy? How do we estimate data significance?

2. [Technical] How do we identify duplicate data records? How do we identify missing values?

3. [LM domain expert] How do we identify LM domain synonyms that help improve data accuracy?

How do we identify criteria that characterise missing values and duplicates?
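As referenced above, here is a minimal sketch of rule-based duplicate detection: two postings are treated as duplicates when they share company and location and have near-identical normalised titles. The field names and the similarity threshold are illustrative assumptions, not a recommended rule.

```python
# Minimal sketch: rule-based near-duplicate detection for job postings.
# Field names and the similarity threshold are illustrative assumptions.
from difflib import SequenceMatcher

def normalise(text: str) -> str:
    return " ".join(text.lower().split())

def is_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Same company and location, and near-identical titles."""
    if (a["company"], a["location"]) != (b["company"], b["location"]):
        return False
    ratio = SequenceMatcher(None, normalise(a["title"]), normalise(b["title"])).ratio()
    return ratio >= threshold

p1 = {"title": "Senior Data  Analyst", "company": "Acme", "location": "Milan"}
p2 = {"title": "senior data analyst",  "company": "Acme", "location": "Milan"}
print(is_duplicate(p1, p2))   # True: the same position advertised twice
```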


Transformation. This step involves the use of dimensionality reduction or transformation methods to reduce the effective number of variables or to find invariant representations for the data. Like step 2, the transformation step reduces the complexity of the data set by addressing the variety dimension. It is usually performed by means of ETL techniques, which support the data pre-processing and transformation phases in the KDD process. Roughly speaking, through ETL, the data extracted from a source system undergo a series of transformation routines that analyse, manipulate and then clean the data before loading them into a knowledge base. By the end of this step, the outcome of which is a clean, well-defined data model, the Big Data variety issue should be resolved (a minimal transformation sketch follows the questions below). Key questions raised by the transformation phase for LM experts include:

1. [Statistical] How do we measure the completeness of the target model identified? Does the target model still maintain data significance at the end of the ETL process?

2. [Technical] How do we develop Big Data procedures to transform raw data into a target model in a scalable manner?

3. [LM domain expert] How do we identify the destination data format and taxonomy8?
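As referenced above, the sketch below gives the flavour of such a transformation routine: raw skill strings scraped from postings are cleaned and mapped onto a small, invented target taxonomy. A real pipeline would map onto ESCO or another standard classification and use far more robust matching.

```python
# Minimal sketch of an ETL-style transformation: raw skill strings are cleaned
# and mapped onto a small target taxonomy. The taxonomy here is invented;
# a real pipeline would map onto ESCO or another standard classification.
RAW_TO_TAXONOMY = {
    "ms excel": "spreadsheet software",
    "excel": "spreadsheet software",
    "python3": "python (programming language)",
    "python": "python (programming language)",
    "team player": "working in teams",
}

def transform(raw_skills: list[str]) -> list[str]:
    cleaned = [s.strip().lower() for s in raw_skills]
    mapped = {RAW_TO_TAXONOMY.get(s) for s in cleaned}
    mapped.discard(None)                 # unmapped terms would be logged for review
    return sorted(mapped)

print(transform([" Python3", "MS Excel", "Team player", "Excel"]))
# -> ['python (programming language)', 'spreadsheet software', 'working in teams']
```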

Data mining and machine learning. The aim of this step is to identify appropriate AI algorithms (e.g. classification, prediction, regression, clustering, information filtering) by searching for patterns of interest in a particular representational form, based on the purpose of the analysis. More specifically, in the context of LMI, it usually requires the use of text classification algorithms (e.g. ontology-based or machine learning based) to build a classification function for mapping data items into one of several predefined classes (a minimal classification sketch is given at the end of this subsection). This step is crucial as it is mainly devoted to the extraction of knowledge from the data. Key questions raised by the data mining and machine learning phase for LM experts include:

1. [Statistical and technical] How do we select the best algorithm? How do we tune its parameters? How do we evaluate algorithm effectiveness? How do we implement it at scale?

2. [LM domain expert] Which knowledge should be selected and which should be discarded? What is the LM significance of the knowledge obtained? Which novel insights have been discovered through LMI? How do we explain the results of the mining process from an LM perspective?

Interpretation/evaluation. This final step employs visual paradigms to represent the knowledge obtained, depending on the user’s objectives. In the LMI context, it means taking into account the user’s ability to understand the data and their main goal in the LMI field. For example, government agencies might be interested in identifying the most sought-after occupations in their local area; companies might focus on monitoring skills trends and identifying new skills for certain occupations so that they can design training paths for their employees. In the last few years, a lot of work has focused on producing off-the-shelf visual libraries that implement a variety of narrative and visual paradigms. A powerful example is D3.js [18], a data-driven responsive library for producing dynamic, interactive data visualisations even in the Big Data context (see, e.g. [19]). Key questions raised by the interpretation/evaluation phase for LM experts include:

1. [Statistical and technical] How do we select the visualisation paradigm? How do we select an appropriate visualisation model for the knowledge we want to visualise?

8 The LM is characterised by several standard taxonomies, such as ISCO/O*NET/SOC for occupations and ESCO for skills and competences.


2. [LM domain expert] How do we deliver appropriate knowledge according to stakeholder needs? How do we identify visual navigation paths for each stakeholder? How do we retrieve feedback (if any) from LM users? How do we put LM knowledge into business?
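To give a concrete flavour of the data mining step referenced above, the sketch below trains a toy text classifier that maps vacancy titles onto occupation codes using TF-IDF features and logistic regression; the training examples and labels are invented, and a real system would train on large labelled corpora against ISCO/ESCO.

```python
# Minimal sketch of the data mining step: a toy text classifier mapping vacancy
# titles to occupation codes. Training data and labels are invented; a real
# system would use large labelled corpora and a standard taxonomy (e.g. ISCO).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = [
    "senior software developer",
    "java backend developer",
    "registered nurse, night shifts",
    "paediatric nurse",
    "data analyst with sql",
    "business data analyst",
]
occupation_codes = ["2512", "2512", "2221", "2221", "2511", "2511"]  # toy ISCO-like labels

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(titles, occupation_codes)

print(model.predict(["junior nurse", "python developer"]))   # e.g. ['2221' '2512']
```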

As one might observe, the number of technical and statistical issues decreases as the KDD process advances, while the number of issues and challenges facing the LM expert increases. In fact, while technical specialists have the main responsibility for dealing with four of the Big Data Vs, it is up to the LM expert to address the fifth Big Data V, Value.

To clarify this point, in Table 1.2 we indicate some LM items for possible analysis, along with the information sources and the main advantages/issues involved in the analysis of the sources.

TABLE 1.2 LM ANALYSIS OPTIONS

LM item: Occupations and skills (demand and supply)
■ Source: job postings; curricula
■ Benefits: fine-grained information; focus on relevant information only; specific lexicon; focus on terms used by the market rather than taxonomies; identification of similar occupations on the basis of skills requirements
■ Issues: collecting data; distinguishing between terms/concepts and noise; understanding lexicon and domain-dependent terms; use of standard classification systems to reduce complexity and enable cross-border comparison; multilanguage management

LM item: Future skills and new emerging occupations
■ Source: job postings; curricula
■ Benefits: focus on LM expectations; identification of future/ongoing trends and dynamics; gap analysis between candidates and new professions/skills to design learning paths; identification of similar occupations on the basis of skills requirements
■ Issues: training an algorithm to understand when a term is a skill; formalisation of the meaning of a new occupation (how many times, and for how long, does a job opportunity have to be posted to become a new profession?)

LM item: Rate of digital skills by occupation
■ Source: job postings
■ Benefits: understanding the pervasiveness of digital skills within occupations; planning of policies and learning paths to fill the gap with LM expectations
■ Issues: training an algorithm to understand what a digital skill is; computing skill rates (including soft/hard non-digital skills)

LM item: Transversal (also known as soft) skills, valid and required for many occupations
■ Source: job postings; curricula
■ Benefits: understanding the impact of transversal skills within the LM; designing and planning learning paths to acquire soft skills
■ Issues: formalisation of the meaning of transversal skills; training algorithms to recognise transversal skills in all their forms, which may vary significantly in natural language (e.g. problem-solving, ability to solve problems, troubleshooting)

LM item: Skills mismatch
■ Source: job postings; curricula; surveys
■ Benefits: ability to identify overqualification/underqualification and skills obsolescence; improvement of LM policies in accordance with LM trends and dynamics; support for the design of educational and vocational paths
■ Issues: data collection; data cleaning and integration; design and selection of an integrated analysis model


Q&A – Predictive power of Big Data

How can Big Data contribute to skills anticipation (in the short and medium term)? Which methodological aspects need to be considered, and what should be the scope and depth of analysis? Any examples of how skills anticipation has been improved by adding insights from Big Data?

Big Data is crucial for skills anticipation for two reasons. First, Big Data, such as that derived from vacancies advertised on the internet, is the only source of detailed information about skills other than skills surveys, which can assess only a limited set of skills. Second, the demand for skills may vary across occupations, sectors and regions. In order to track these variations, it is necessary to have very granular and detailed data, which can only be obtained through Big Data.

Methodologically, there are several challenges that need to be addressed. First, the data have to be linked to the occupation/sector/region. Second, the skills need to be classified into a meaningful taxonomy that can be used for the analysis. Third, a consistent time series is required to project skills changes over time.

There are several examples of the use of Big Data for skills anticipation. In the United States (US), skills extracted from vacancies advertised on the internet are used by universities to anticipate LM trends and tailor educational programmes to the needs of the LM.
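As a toy illustration of the time-series point above, the sketch below fits a linear trend to monthly counts of postings mentioning a given skill and extrapolates three months ahead; the counts are invented, and real anticipation work would require longer, consistent series and proper seasonal and structural adjustments.

```python
# Minimal sketch: projecting a skill's demand from monthly posting counts with a
# linear trend. Counts are invented; real anticipation work needs longer,
# consistent series and proper seasonal/structural adjustments.
import numpy as np

monthly_counts = np.array([120, 135, 150, 170, 160, 185, 200, 210])  # postings mentioning the skill
months = np.arange(len(monthly_counts))

slope, intercept = np.polyfit(months, monthly_counts, deg=1)
future_months = np.arange(len(monthly_counts), len(monthly_counts) + 3)
projection = slope * future_months + intercept
print(np.round(projection))   # naive three-month-ahead projection
```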

Q&A

How can one successfully cooperate with statistical and state bodies that own data/registers? How can people overcome their scepticism and concerns regarding the novelties of Big Data? Are there any good examples of such cooperation?

In our experience, the use of Big Data does not diminish the relevance or importance of official statistics and surveys in analysing LM dynamics and trends for decision-making. To give an example, one of the major criticisms of Big Data is its lack of statistical significance; for this reason, Big Data should be used jointly with, and matched against, external statistical sources (e.g. labour force surveys), with the goal of improving the ability to observe LM dynamics across a wider spectrum.

Furthermore, one should consider that the population of the internet is variable and only partially observable by construction, which makes it difficult to identify a stable sample that is representative of the whole internet population. Hence, one might use official statistics to estimate the relevance of each class in the internet data collected, so that overrepresented (or underrepresented) classes can be weighted accordingly.


Q&A – Data quality

We know data quality is a large and important part of data science. However, what are the main aspects and dimensions that need to be considered and taken into serious account for a Big Data project? What are the fundamental features of data quality in Big Data?

As discussed in Chapter 1, data quality evaluation and cleaning are crucial and mandatory tasks for guaranteeing the believability of any data-driven decision-making process. Not surprisingly, the veracity dimension of Big Data refers precisely to the quality issues that remain significant even in the presence of a huge amount of data. A formal and reliable list of the data quality issues that matter most for a Big Data application does not exist yet, and this is still an open debate in academia. Indeed, some researchers think the high volume of data is enough to stem the effects (if any) of poor data quality. Others think that a strict and formal data quality approach (estimating dimensions such as consistency, completeness, accuracy and soundness) has to be applied even to huge data sets. The former approach ignores the fact that poor-quality data are prevalent both in large databases and on the web; the latter does not take into account the costs (in terms of both computational time and power) of evaluating quality and cleansing huge data sets.

That said, we think both approaches should be applied for analysis purposes. To give an example, focusing on the collection of vacancies from internet sources, a strict and rigorous approach should be used to rank sources on the basis of source features or variables, thus evaluating consistency, accuracy and completeness (at least) for web source variables. These might include, but are not limited to:

■ typology: refers to the typology of the source, which might be recruitment agencies, national newspapers, specialised websites, or public, sectoral or company websites;

■ size: refers to the number of vacancies published on the website at the moment of the analysis;

■ update time: refers to the frequency with which the web source owner provides fresh and updated data, along with the presence of timestamps to mark exactly when a job vacancy has been published;

■ quality of description: identifies how standardised and complete the detailed vacancy page is.

By contrast, the identification of ‘active’ job vacancies also relates to a data quality dimension (accuracy, in this case). However, given a vacancy published weeks or months ago, there is no field within the vacancy that guarantees it is still valid (i.e. that the job has not been filled yet). In such a case, a classic approach would not work, as this information cannot be accessed in an automated and scalable manner (millions of items have to be processed at a time). On the other hand, including a vacancy in the analysis regardless of its validity might have unpredictable effects on the believability of the analyses. For these reasons, one should build a statistical model to infer the validity date of such a vacancy on the basis of historical data related to similar vacancies (e.g. taking into account similar sources, similar positions advertised, and the company advertising the vacancy). Such a model would guarantee that validity dates are estimated according to the characteristics of the data set.
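The ranking exercise described above can be pictured as a weighted scoring of each source on those variables. The sketch below shows the shape of such a scoring function; the sources, scores and weights are all invented for illustration.

```python
# Minimal sketch: ranking web sources by weighted scores on the variables listed
# above (typology, size, update time, quality of description). Sources, scores
# on a 0-1 scale and weights are all invented for illustration.
WEIGHTS = {"typology": 0.2, "size": 0.3, "update_time": 0.3, "description_quality": 0.2}

sources = {
    "specialised_job_board": {"typology": 0.9, "size": 0.8, "update_time": 0.9, "description_quality": 0.8},
    "generic_classifieds":   {"typology": 0.4, "size": 0.9, "update_time": 0.6, "description_quality": 0.3},
    "company_website":       {"typology": 0.8, "size": 0.2, "update_time": 0.4, "description_quality": 0.9},
}

def score(features: dict) -> float:
    return sum(WEIGHTS[name] * value for name, value in features.items())

for name, features in sorted(sources.items(), key=lambda kv: score(kv[1]), reverse=True):
    print(f"{name}: {score(features):.2f}")
```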

We refer the reader to Saha, Barna and Divesh Srivastava, ‘Data quality: The other face of Big Data’, 2014 IEEE 30th International Conference on Data Engineering (ICDE), IEEE, 2014; and to Sadiq, Shazia and Paolo Papotti, ‘Big Data quality: Whose problem is it?’, 2016 IEEE 32nd International Conference on Data Engineering (ICDE), IEEE, 2016 – two recent papers that discuss and propose approaches to data quality in the Big Data context.


1.3 Literature on Big Data for LMI

In the last few years, several forces and factors have dramatically changed the nature and characteristics of the LM, in both advanced and developing countries. Technical progress, globalisation and the re-organisation of production processes have radically modified demand for certain skills, intensifying the need for continuous training, especially in jobs that are highly specialised or heavily affected by digitalisation. At the same time, the availability of data is growing thanks to digitalisation, the spread of web services (including services related to the LM) and the availability of a huge number of technical solutions for dealing with Big Data. In this scenario, the number of calls for projects related to the LM is growing apace. Below we report on some examples of initiatives (European and non-European) related to Big Data and LM information/intelligence.

The EU panorama

There is growing interest in designing and implementing real LMI applications for internet LM data in order to support policy design and evaluation through evidence-based decision-making. In 2010, the European Commission published A new impetus for European cooperation in vocational education and training to support the Europe 2020 strategy [20], aimed at promoting education systems in general, and vocational education and training in particular. In 2016, the European Commission highlighted the importance of vocational and educational activities, as they are ‘valued for fostering job-specific and transversal skills, facilitating the transition into employment and maintaining and updating the skills of the workforce according to sectorial, regional, and local needs’ [21]. In 2016, the EU and Eurostat launched the ESSnet Big Data project [22], involving 22 EU Member States, with the aim of ‘integrating Big Data in the regular production of official statistics, through pilots exploring the potential of selected Big Data sources and building concrete applications’. Furthermore, in 2014 the European Centre for the Development of Vocational Training (Cedefop) agency – set up to support the development of European vocational education and training – launched a call for tenders for a feasibility study and the development of a working prototype able to collect and classify job vacancies on the internet from five EU countries [23]. The rationale behind the project is to turn data extracted from job vacancies on the internet into knowledge (and thus value) for policy development and evaluation through fact-based decision-making. Given the success of the prototype, a further call for tenders was launched for the creation of an internet-based LM monitor for the whole EU, covering all 28 EU Member States and all 24 languages of the Union [24]. Also worthy of mention is the LMI4All project [25], an online data portal that connects and standardises existing sources of high-quality, reliable LMI for the purpose of informing careers decisions. The data are made freely available via APIs for use in websites and third-party applications.

Beyond the EU: non-European projects


300 Ontarians through interviews, public consultations and an online survey [27]. The study by [28] for the World Economic Forum estimated the impact on the LM of automation and technological advances, using both occupations and jobs as data.

By contrast, companies need to automate human resource department activities; therefore, a growing number of commercial skill-matching products have been developed in the last few years in the EU and beyond, such as Burning Glass, Workday, Pluralsight, EmployInsight, Textkernel and Janzz. Worthy of mention is Google Job Search API, a pay-as-you-go service announced in 2016, which classifies job vacancies through the Google Machine Learning service over O*NET, the US standard occupation taxonomy.

Project/feature matrix – A comparative model to clarify which projects addressed a specific concern/challenge related to Big Data for LMI

Table 1.3 summarises a number of LM information and intelligence projects and their features. It also provides a reference for each project.

TABLE 1.3 AN LMI PROJECT/FEATURE MATRIX

| Reference | Data source(s) | Goal | Countries involved | Type | Languages handled |
| [23] | Administrative and focus groups | Estimate probability of automation | US | Study | EN |
| [24] | Web data (job postings) | Evaluate the effectiveness of internet LMI for real-time monitoring | Italy, Germany, UK, Czech Republic, Greece | Prototype and research study | EN, GR, IT, CZ, DE |
| [25] | Web data (job postings) | Development of a real-time LM monitor for the EU | All 28 EU countries | System and research study | 32 languages |
| [23] | Survey and web data (job postings) | Inclusion of Big Data in official statistics | EU | Project and research study | N/A |
| [27] | Administrative (PIAAC9) | Estimate risk of automation due to robotisation and AI | Estimate on US, applied to 21 OECD countries | Study | EN |
| [29] | Web data (job postings) | Estimate risk of automation due to robotisation and AI | Italy | Study | EN |
| [30] | Administrative (by government) | Use of government data sets to achieve a fuller understanding of LM flows | UK | Project | EN |

1.4 Big Data for LMI in operation

This section examines some working projects that use the KDD approach described above to put LMI into practice for different purposes. On the one side, we present the WollyBI approach, which has been successfully rolled out in Italy and in Spain (Bizkaia project).

9 PIAAC – Programme for the International Assessment of Adult Competencies. The PIAAC data are a unique data source containing micro-level indicators on socio-economic characteristics, skills and job-related information.


On the other side, we present the ESSnet Big Data project, an EU initiative to include LM Big Data in official LM statistics. These two applications shed light on the importance of exploiting Big Data to analyse LM dynamics and trends for a large range of stakeholders.

Other projects and initiatives will be discussed in greater depth in Chapter 3.

Real-time LM monitor: the cases of Italy and Spain

The WollyBI10 project started in early 2013 as a Software as a Service (SaaS) tool11 for collecting and classifying job vacancies advertised on the internet against the International Standard Classification of Occupations/European Skills, Competences, Qualifications and Occupations (ISCO/ESCO) standard taxonomies, and for extracting the most requested skills from job descriptions. It was designed to provide five distinct entry points for the user depending on their analysis purposes, namely:

■ geographical area – to find the most frequently searched occupations on the web and related skills, at a very detailed geographical level;

■ skill – to input a selection of skills and to find the most frequently searched occupations that include those skills (i.e. profile gap analysis);

■ firm – to obtain a ranking of occupations for vacancies that specify a particular industry sector;

■ occupation – to navigate through the ISCO/ESCO classifications and to exploit the details related to each occupation;

■ free queries (i.e. customised) – for free classic, drill-down and roll-up operations over the OLAP cubes12.
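The drill-down and roll-up operations mentioned in the last entry point can be pictured as successive aggregations over the same classified data set. The pandas sketch below imitates them on a toy table of classified vacancies; the data are invented and this illustrates the idea only, not the WollyBI implementation.

```python
# Minimal sketch: drill-down / roll-up style aggregation over classified
# vacancies, imitating OLAP-cube operations on a toy pandas table.
# Data are invented; this is not the WollyBI implementation.
import pandas as pd

vacancies = pd.DataFrame({
    "region":     ["Lombardia", "Lombardia", "Lazio", "Lazio", "Lazio"],
    "isco_group": ["25", "25", "25", "22", "22"],
    "count":      [120, 80, 60, 40, 30],
})

rollup = vacancies.groupby("region")["count"].sum()                     # roll-up: totals by region
drilldown = vacancies.groupby(["region", "isco_group"])["count"].sum()  # drill-down: region x occupation
print(rollup)
print(drilldown)
```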

The implementation of WollyBI closely follows the KDD approach, as illustrated in [15]. Here, we focus on the navigation paths. These vary for each stakeholder, but each entry point has been designed on the basis of the three-click rule, so that results are available to the user with no more than three ‘next’ clicks. Figure 1.6 shows a sequence of snapshots from WollyBI starting from the occupation entry point. Firstly, the user has to select the occupations group, from the first ISCO level down to the third, and some other parameters, such as the geographical area and the time horizon of the analysis. The larger the size of the bullets, the greater the number of occupations in the particular group. A first report showing the results for the chosen occupations group is returned. Once the user selects an occupation for deep exploration, a range of information is returned, including the name of the occupation, its ISCO code, its definition according to the ISCO taxonomy, the experience required, the contract typology, the top five sub-sectors and a list of the top five skills (both hard and soft). A demo video of WollyBI in action on the occupation dimension is available at: https://youtu.be/zBNsAS5L04g.

10www.wollybi.com, powered by TabulaeX, accredited spin-off of University of Milano-Bicocca, Italy.

11 As a web service, accessible any time by anyone with a valid account, SaaS avoids the need to download and install tools.

12 An OLAP cube is a multidimensional database optimised for data warehouse and online analytical processing (OLAP) applications.


FIGURE 1.6 WOLLYBI – A SEQUENCE OF SNAPSHOTS FROM A THREE-STEP ANALYSIS OF THE OCCUPATION DIMENSION

Figure 1.7 shows a sequence of snapshots from WollyBI on the geographical area dimension. Firstly, the user has to select the geographical area of interest, the time horizon and a specific ISCO level (optional). A list of ISCO fourth-level occupations is returned, along with the number of vacancies classified on each occupation code, the percentage over the total number and a checkmark indicating the presence of cross-sector occupations. Once the user selects a job occupation, a range of information about that occupation is returned.


FIGURE 1.7 WOLLYBI – A SEQUENCE OF SNAPSHOTS FROM A THREE-STEP ANALYSIS OF THE GEOGRAPHICAL AREA DIMENSION

The Bizkaia Talent project

WollyBI was used as the baseline – in terms of both technological and methodological aspects – for deployment of the LM monitor in the Basque Country, Spain. Basque Talent Observatory is the world’s first public access platform for monitoring the region’s highly qualified LM, and was released in 2017. The initiative was developed by Bizkaia Talent together with TabulaeX, a spin-off of the University of Milano-Bicocca, in order to manage knowledge transfer. It is based on a tool that investigates the Basque LM with the focus on highly qualified professionals, by analysing Big Data from multiple online sources properly selected and ranked, within the Basque Region.

The goal of the project – supported by the Biscay Economic and Territorial Ministry – is to contribute to the competitiveness of the Biscay region and to enable real-time collection of information about the LM in the geographical area of Biscay, Gipuzkoa and Álava, using both international and local online sources.


The LM knowledge base has been organised over two distinct entry points, namely:

1. for citizens: a dashboard to analyse and browse information on the internet LM based on the vacancies active in the last 12 months;

2. for analysts: a dashboard that visualises data from the last 12 months, with selection of the period of interest: last month, last three months, last six months, last year. Data are collected daily and updated monthly.

The tool has been made publicly available and can be browsed by anyone at the following web address: https://basquetalentobservatory.bizkaiatalent.eus/visual/public/index#.

Further details about this project can be found in Chapter 3.

The ESSnet Big Data project

Focusing on the statistical perspective of Big Data analysis for LMI, Eurostat – the official statistics office of the EU – has launched a project named ESSnet Big Data, aimed at integrating Big Data on LM information into the regular production of official statistics, using pilots to explore the potential of selected Big Data sources and to build concrete applications [23].


Q&A

Who are the users of these platforms?

The users have to be correctly identified at the beginning of the system design so that the knowledge can be organised according to user needs and ability to understand the data. For example, WollyBI uses the same knowledge for separate types of stakeholders: employment agencies, business associations and unions, government employment agencies, schools and training organisations. Bizkaia Talent is designed for both citizens and analysts, while the ESSnet project is intended for LM specialists and analysts as a whole.

What is the update frequency of these LM tools?

Update frequency should be decided in relation to three elements: (i) updating of sources, (ii) the cost in terms of the computing power needed to perform the update, and (iii) stakeholder needs. Usually, weekly updates are appropriate for analysis purposes.

To what extent is the KDD process deployed by these platforms automated (AI 100%, AI 70%, AI 40% ...)?

Human effort decreases as the KDD process proceeds. Significant effort is required in source identification and ranking, as well as in the identification of business needs and the selection of suitable AI algorithms and their tuning parameters. Once these activities have been completed, the remaining steps of the process can run in a largely automated way.


2. INCORPORATING BIG DATA ANALYTICS IN LMI: SYSTEMATIC STEPS

In this chapter we introduce the main building blocks needed to develop a Big Data architecture for LMI, and examine some important necessary (pre-)conditions: where to start, what to consider, and the phases and main steps, using the KDD approach discussed earlier as a guideline.

2.1 Big Data architecture components

The KDD approach (Figure 1.5) represents a baseline for implementing any data management architecture designed to extract knowledge from data, even before the Big Data era, as shown in Figure 2.1. Each KDD step discussed in Chapter 1 is implemented to (i) collect data from several structured data sources; and (ii) transform data from multiple and distinct data models and formats into a single unified data model, using ETL techniques and tools. The data are then usually stored in a data warehouse suitable for optimised queries and analyses. Finally, a dashboard allows analysts to run queries and use the knowledge returned in the form of reports and charts to support decision-making. This is a classic Business Intelligence (BI) architecture, which works well on structured data.

FIGURE 2.1 A CLASSIC BI ARCHITECTURE BEFORE THE ADVENT OF BIG DATA
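The ETL step at the heart of this classic workflow can be illustrated with a minimal sketch, in which a structured source is extracted, transformed to a target schema and loaded into a warehouse-like table; the file content, field names and target schema are all invented for the example.

```python
# Toy ETL step: extract rows from a structured source, transform them to a
# target schema, and load them into a warehouse-like table. All names invented.
import csv, io, sqlite3

source = io.StringIO("Contract;Region;Vacancies\npermanent;Lombardy;120\ntemporary;Lazio;45\n")

# Extract: read the delimited source into dictionaries.
rows = list(csv.DictReader(source, delimiter=";"))

# Transform: rename fields, enforce types and standardise values.
records = [(r["Contract"].strip().upper(), r["Region"].strip(), int(r["Vacancies"])) for r in rows]

# Load: write the transformed records into the target table.
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE fact_vacancies (contract TEXT, region TEXT, n INTEGER)")
dw.executemany("INSERT INTO fact_vacancies VALUES (?, ?, ?)", records)

print(dw.execute("SELECT contract, SUM(n) FROM fact_vacancies GROUP BY contract").fetchall())
# e.g. [('PERMANENT', 120), ('TEMPORARY', 45)]
```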

Although these types of architecture are very successful in many real-life contexts, they suffer from limitations that prevent their use in Big Data scenarios, as Table 2.1 shows.

TABLE 2.1 MOST SIGNIFICANT LIMITATIONS OF CLASSIC BI ARCHITECTURES IN A BIG DATA SCENARIO (ISSUE – CAUSE – CONCEPTUAL BLOCKS OF BIG DATA ARCHITECTURE THAT ADDRESS IT)

■ Schema-free data are out: only structured data sources can be manipulated. Roughly, this means that only data that obey a rigid, well-defined data model can be handled, to the exclusion of all ‘unstructured’ data, such as free text, comments and web content in general. Cause: variety. Conceptual blocks: data ingestion; NoSQL models.

■ No adaptability to change: the addition of a new source requires the whole process to change, and this makes it difficult to scale the architecture over multiple (albeit structured) sources. Cause: variety, velocity. Conceptual block: data lake.

■ Rigid ETL: the procedures that transform content from source formats to target formats have to be written precisely to fit the desired data structure (e.g. data warehouse). Cause: variety. Conceptual blocks: schema-free, data-driven approach (bottom-up rather than top-down).

■ Time-consuming: the larger the volume of data to be processed, the greater the time needed to complete the process. ETL procedures usually consume large amounts of time and memory, as they need to ‘scan’ all the data sources each time source data are transformed. Cause: volume, variety, …


Table 2.1 shows some of the key issues that prevent use of classic BI architectures in the Big Data scenario, mainly relating to one or more Big Data dimensions (the Vs of the Big Data model discussed in Chapter 1). Great efforts have been made in recent years by both academia and practitioners to identify paradigms, models and tools suitable for a Big Data scenario.

Here, we introduce some conceptual blocks that the reader should know for a better understanding of how a Big Data architecture works: data ingestion, NoSQL models, the data lake, and scale-out. We then discuss how these blocks can work together to realise a Big Data architecture for LMI. Clearly, there is a plethora of solutions and tools, as the Big Data ecosystem continues to expand and evolve.

Data ingestion. This term refers to the process of collecting data from several sources in an automated way. It can be run in real time (each data item is collected as it is emitted by the source) or through a batch process (data items are collected in discrete chunks at regular intervals). Furthermore, data from a single source may be collected using three distinct approaches: API, crawling or scraping.

■ API – as noted in Chapter 1 – stands for application programming interface, a software component exposed by the source owner to any programmer to enable data collection (e.g. Twitter, Facebook). The collection process is controlled by the data owner, who also decides what data can be emitted, as well as the data structure.

■ Crawling is a software activity that automatically indexes all the contents of a web page and adds them to a database. It iteratively follows all the hyperlinks included in the page and adds the linked content to the database as well (including images, tables and style sheets). A classic example of crawling is the indexing activity performed by Google’s search engine.

■ Scraping is a two-step software activity. First, it automatically requests a web page, then it collects only a limited amount of information from the page, leaving out the remaining data. This means that a scraper (partially) knows the website structure so that it can identify only the content of interest for the analysis. For example, a web crawler might download all the products listed on an e-commerce website, while a scraper might collect only the product names and prices, leaving out links to banners, comments and metadata related to the page layout.
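The scraping approach can be illustrated with a short sketch that keeps only the job title and location of each advert and discards the rest of the page. The HTML snippet stands in for a downloaded listing page (in practice it would be fetched with an HTTP client), and the example assumes the beautifulsoup4 package is available.

```python
# Minimal scraping sketch: extract only job titles and locations from a page,
# ignoring banners and layout. The HTML below stands in for a downloaded page.
from bs4 import BeautifulSoup  # assumes the beautifulsoup4 package is installed

html = """
<html><body>
  <div class="ad"><h2 class="title">Java developer</h2><span class="loc">Milan</span></div>
  <div class="ad"><h2 class="title">Data analyst</h2><span class="loc">Turin</span></div>
  <div class="banner">Subscribe to our newsletter!</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
for ad in soup.select("div.ad"):                      # keep only the vacancy blocks
    title = ad.select_one("h2.title").get_text(strip=True)
    location = ad.select_one("span.loc").get_text(strip=True)
    print(title, "-", location)                       # the banner is simply ignored
```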

NoSQL models. In recent years the NoSQL movement has brought to the fore new data model paradigms that differ significantly from the classic relational model at the basis of any BI architecture. Four NoSQL data store paradigms have ultimately emerged: key-value stores, document databases, column-oriented databases and graph databases. All these paradigms share some interesting features compared with the classic relational model (see, e.g. [31]), such as a flexible schema that can always evolve to fit the data, the ability to scale horizontally with ease, and native support for sharding. For these reasons, they have become a common backbone of any Big Data architecture, as they allow data to be stored in their native form. Intuitively, a relational database can be considered as a set of tables that can be navigated by identifiers (aka IDs), where the number of columns in a given table is fixed and defined a priori. In NoSQL data stores, by contrast, the number of columns (or fields) can vary from one record to another, enabling any kind of data item to be stored, irrespective of its structure, as the schema is free to evolve with the data.
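The practical consequence of a flexible schema can be seen in the toy sketch below, where two vacancies with different fields coexist in the same ‘collection’; a plain Python list of JSON-like documents stands in for a real document database such as MongoDB, and all field names are invented.

```python
# Toy illustration of the document-store idea: two vacancies with different
# fields live in the same collection, because there is no fixed schema.
import json

collection = [
    {"_id": 1, "title": "Java developer", "skills": ["java", "sql"],
     "contact": "hr@acme.example"},
    {"_id": 2, "title": "Data analyst", "location": "Turin",
     "salary": "negotiable"},          # entirely different fields, same collection
]

def by_skill(docs, skill):
    """'Query' the collection for documents mentioning a skill, tolerating missing keys."""
    return [d for d in docs if skill in d.get("skills", [])]

print(json.dumps(by_skill(collection, "java"), indent=2))
```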


Data lake. Job vacancies collected from the web show why it is hard to fix a schema in advance: one vacancy may provide contact addresses, or locations, or skills that another does not, and this makes it difficult to identify a ‘common schema’ (or data model) suitable for any vacancy (including vacancies that still have to be collected). A solution proposed to deal with this issue is the introduction of a ‘data lake’. In simple terms, a data lake is a data repository that leaves all the collected data in their native format, assigning a unique identifier to each data item. This allows the item to be retrieved when it is needed to perform a given action (e.g. analyse vacancies from source X, or from country Y).
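A minimal sketch of the data-lake principle is given below: each collected item is written in its native form, keyed by a unique identifier plus a little metadata, so that it can later be retrieved by source or date; the paths, payloads and metadata fields are invented for the example.

```python
# Sketch of a data lake: store each item untouched, with a unique identifier
# and minimal metadata, and retrieve it later by source. All names invented.
import json, tempfile, uuid
from datetime import date
from pathlib import Path

lake = Path(tempfile.mkdtemp()) / "raw"
lake.mkdir(parents=True)

def ingest(payload: str, source: str) -> str:
    """Write the raw payload in its native form and return the identifier assigned to it."""
    item_id = str(uuid.uuid4())
    (lake / f"{item_id}.raw").write_text(payload, encoding="utf-8")
    meta = {"id": item_id, "source": source, "collected": date.today().isoformat()}
    (lake / f"{item_id}.meta.json").write_text(json.dumps(meta), encoding="utf-8")
    return item_id

ingest("<html>...vacancy page from source X...</html>", source="source_X")
ingest('{"title": "Data analyst", "region": "Biscay"}', source="source_Y")

# Later: select only the items collected from source X for analysis.
for meta_file in lake.glob("*.meta.json"):
    meta = json.loads(meta_file.read_text(encoding="utf-8"))
    if meta["source"] == "source_X":
        raw = (lake / f"{meta['id']}.raw").read_text(encoding="utf-8")
        print(meta["id"], "->", raw[:40])
```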

Scale-out. Roughly, scalability means that the architecture is able to continue working in spite of the workload. Clearly, no architecture can scale infinitely, so ‘in spite of the workload’ should be understood in terms of the specific scalability requirements that are deemed important. For example, a Big Data architecture that collects and processes web documents should scale so as to guarantee low latency and high throughput. Latency is the time needed to perform an action (or to produce a result), measured in units of time (minutes for real-time processing). Throughput is the number of actions executed or results produced per unit of time. Low latency ensures that users will not wait indefinitely for results to be produced, while throughput indicates the capacity of the architecture to process data. Thanks to the introduction of NoSQL data stores, distributed file systems13 and distributed computing14, Big Data architectures are able to scale out, i.e. to provide horizontal scalability by adding machines rather than upgrading a single one. The simplest way to understand how a system can really scale out, reducing latency and increasing throughput, is to understand the MapReduce operating principle, the core engine of any Big Data architecture (see box below).

MapReduce – a working example

Imagine having millions of documents arriving from different web sources in real time. Your architecture has collected all these documents in their native form, thanks to the use of NoSQL data stores. The goal is to process all these documents by producing a wordcount table that summarises all the words appearing in all the documents along with their occurrences, as shown in Figure 2.2.

The MapReduce operating principle (introduced by the authors of [32] while working at Google) distributes the workload to several machines so that each machine has to perform a very simple task through the ‘map’ function. First, the input document is partitioned (line by line) so that each line can go to a different mapper (i.e. a machine). Each machine executes a very simple function, or ‘map’, which emits a pair containing the word (the key) and a counter (the value) starting at 1. In the next stage, each pair (or ‘tuple’) is shuffled (by key) into a logical order (alphabetical order in this case). Intuitively, the aim of shuffling is to have all the tuples with the same key allocated to the same reducer, thus saving time. Finally, each reducer receives, for each key, all the values found in the tuples. The goal of that reducer is now relatively simple: to add up all the values for its key and to emit, through the ‘reduce’ function, a resulting tuple consisting of the key and the sum, which represents the number of occurrences of that key.

As we can see in Figure 2.2, thanks to the MapReduce paradigm, four machines have been employed to perform a job, reducing the overall time needed (latency), and increasing the throughput of the architecture in processing millions of documents.
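The same wordcount logic can be imitated on a single machine in a few lines; the sketch below reproduces the map, shuffle and reduce stages as ordinary functions (in a real cluster a framework such as Hadoop or Spark would distribute them over many machines), and the input lines are invented.

```python
# Single-machine imitation of the MapReduce wordcount described in the box.
# In a real cluster the map and reduce calls would run on different machines.
from collections import defaultdict

documents = [
    "python sql teamwork",
    "java python problem solving",
    "sql python",
]

# Map: each line is turned into (word, 1) pairs, independently of the others.
def map_line(line):
    return [(word, 1) for word in line.split()]

mapped = [pair for line in documents for pair in map_line(line)]

# Shuffle: group all pairs sharing the same key so that one reducer sees them all.
shuffled = defaultdict(list)
for word, count in mapped:
    shuffled[word].append(count)

# Reduce: each key is summed independently, so reducers could also run in parallel.
wordcount = {word: sum(counts) for word, counts in shuffled.items()}
print(wordcount)
# {'python': 3, 'sql': 2, 'teamwork': 1, 'java': 1, 'problem': 1, 'solving': 1}
```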

13 Distributed file systems can be considered a disk structure residing on multiple machines.

14 Distributed computing refers to the ability of a machine to divide its workload over several machines, thus reducing the time needed to complete the work.


FIGURE 2.2 EXAMPLE OF A MAPREDUCE ALGORITHM PERFORMING A ‘SKILLS COUNT’ TASK

Thanks to these conceptual blocks introduced by the Big Data ecosystem, the classic BI approach shown in Figure 2.1 can evolve towards a Big Data approach, as shown in Figure 2.3. Notably, the classic BI workflow shown in Figure 2.1 still exists, as relational and structured data are still present in many real-life domains, including web data. However, the data processing workflow now follows two distinct paths: the classic approach handles structured data, while a Big Data approach, as discussed above, is used to handle unstructured data.
