Segregation aware data mining

(1)

Universit`a degli Studi di Pisa

Dipartimento di Informatica

Dottorato di Ricerca in Informatica

Ph.D. Thesis

Segregation aware data mining

Alessandro Baroni

Supervisor Prof. Salvatore Ruggieri

Chair

Prof. Pierpaolo Degano

(2)

(3)

List of Figures

1.1 Racial spatial segregation in New York City, based on Census 2000 data [75]. One dot for each 500 residents. Red dots are Whites, blue dots are Blacks, green dots are Asian, orange dots are Hispanic, and

yellow dots are other ethnicities. . . 14

3.1 Input table (left) and segregation data cube with the D index (right). 36 3.2 A dynamic environment inspired by [113]. . . 41

4.1 Attributed graph (left) and its edge distribution table (right) . . . 45

4.2 An example of clustering using only topology (top), or both topology and semantics (bottom). . . 47

4.3 A sample attributed graph. Attributes include: sex (categorical) and spatial coordinate x, y (quantitative). Edges are colored by topologi-cal distance between the connected nodes. Topologitopologi-cal distance dT() considers 1-neighborhoods. . . 48

4.4 Cumulative distributions of dT(), for varying l-neighborhoods, and of dS() for datasets DIRECTORS (left) and DBLP (right). . . 52

4.5 Computing the l value for αT = 0.3 and τ = 0.6 on the DBLP dataset. 55 4.6 Size distribution of clusters found by SToC, GBAGC and Inc-C for some of the tests in Tables 4.2, 4.3, 4.4. . . 61

5.1 KDD process for segregation discovery supported by SCube. . . 64

5.2 SCube architecture. . . 65

5.3 SCube wizard and segregation data analysis in Microsoft Excel. . . . 66

5.4 Sequence diagram of SCube. . . 67

5.5 The SCube extention for segregation discovery in dynamics network. . 74

5.6 Graph Builder space/time performances. . . 76

5.7 GraphClustering space/time performances. . . 76

5.8 Table Builder space/time performances. . . 77

5.9 SegregationDataCube Builder elapsed time (Estonian dataset). . . 78

5.10 SegregationDataCube Builder memory occupancy (Estonian dataset). 79 5.11 SegregationDataCube Builder elapsed time (Italian dataset). . . 80

5.12 SegregationDataCube Builder memory occupancy (Italian dataset). . . 81

(6)

6

5.14 SCube web interface on the SoBigDataLab VRE. . . 82 5.15 SCube web interface, on the SoBigDataLab VRE, during an execution. 82 5.16 SCube web interface, on the SoBigDataLab VRE, exhibits the output

after the execution. . . 83 6.1 Sample social network of companies. . . 86 6.2 Director distributions: by gender/age (left), proportion of female

di-rectors by residence province (right). . . 87 6.3 Distributions: BoD size (left), director presence (right). . . 88 6.4 Distribution of size of CCs before (left, without the giant component)

and after (right) splitting the giant component by removing the edges with weight ≤ 3. . . 90 6.5 Dissimilarity (left) and Isolation (right) index for Italian provinces’

population and for female director minority group. . . 92 6.6 Segregation indexes over the 19 company sectors for minority groups

of: female directors (left), foreign directors (center), and age-band directors (right, only dissimilarity index). . . 92 6.7 Director distribution by gender and age. . . 95 6.8 Distributions: BoD size in the 1998 (left), BoD size in the 2015 (right). 96 6.9 Distributions: Director presence in the 1998 (left), Director presence

in the 2015 (right). . . 96 6.10 Distributions: Distribution of size of CCs in the 1998 (left),

Distribu-tion of size of CCs in the 2015 (right) splitting the giant component by removing the edges with weight ≤ 3. . . 97 6.11 The proportion P of directors over 20 years by gender. . . 97 6.12 Segregation indexes over time for the minority group of female directors. 98 6.13 Segregation indexes over time for the minority group of male directors. 98 6.14 Segregation indexes over time for the minority group of foreign directors. 98 6.15 Proportion of foreign, female, and male directors in the year 2016 by

industry sector. . . 99 6.16 Segregation indexes over the 19 company sectors in the time point

January 2016. . . 100 6.17 Dissimilarity index in January of 2011, 2013, and 2016 across industry

sectors for minority groups of directors in a specific age-band. . . 101 7.1 A sample dataset of social network. . . 105

(7)

List of Tables

2.1 An instance of the Modifiable Areal Unit Problem. . . 28

3.1 A sample transaction database (left) and some example of cover and support (right). . . 33

3.2 Function fs() and gs(). Here, E(p) = −p · log p − (1 − p) · log (1 − p). 39 4.1 Summary of experimental datasets. . . 56

4.2 SToC results (mean values over 10 runs, StDev in brackets). . . 58

4.3 Inc-C results for the DBLP dataset. . . 59

4.4 GBAGC results. . . 60

4.5 SToC, SC and ToC on the DBLP dataset. . . 62

4.6 Results of SToC on the DBLP dataset with discretized attributes (averages of 10 runs, StDev in brackets). . . 62

5.1 Input tables of Table Builder module. . . 72

5.2 finalTable.csv for inputs from Table 5.1, using (5.7). . . 73

5.3 Size of the Italian dataset and samplings. . . 75

5.4 Size of the Estonian dataset and samplings. . . 75

6.1 Sample input for segregation discovery. . . 89

(8)

(9)

Acknowledgments

“

Dicebat Bernardus Carnotensis nos esse quasi nanos gigantium humeris insidentes, ut possimus plura eis et remotiora videre, non utique proprii visus acumine, aut eminentia corporis, sed quia in altum subvehimur et extollimur magnitudine gigantea

”

Iohannes Saresberiensis, Metalogicon: III, 4, 1159

(10)

(11)

Abstract

This thesis tackles the social segregation problem from a data science perspective by proposing a segregation-aware data mining framework for the discovery of segrega-tion from relasegrega-tional data and from attributed graphs. The approach is implemented in an efficient system and experimented on two challenging case studies in the do-main of occupational segregation in company boards.

The framework builds on quantitative measures of segregation, called segregation indexes, proposed in the social science literature. The segregation discovery problem is first introduced for relational data. It consists of searching sub-groups of popu-lation and minorities for which a segregation index is above a minimum threshold. A search algorithm is devised that solves the segregation problem by computing a multi-dimensional data cube that can be explored by the analyst. The machinery underlying the search algorithm relies on frequent itemset mining tools.

The approach is then extended to graph data consisting of bipartite attribute graphs, which model real networks by enriching their nodes with attribute values. Segregation indexes assume a partition of the population into organizational units (e.g., schools, neighborhoods, etc.), which are not obvious for graphs. We propose a fast and scalable algorithm for partitioning large attributed graphs. The approach does not require the user to guess in advance the number of clusters. Experimental results demonstrate its ability to efficiently compute high-quality partitions.

Our implementation of the framework, called SCube, supports an analyst in dis-covering context of social segregation. Users of the system include social scientists, policy decision makers in socially sensitive fields (urban development, public trans-portation and services, medical and health managers, etc.), and control authorities. The system is developed in Java 8, hence portable, and thanks to state-of-the-art libraries achieve good performances on large datasets.

We demonstrate the applicability of the proposed methodology and tools in a complex scenario, reflecting the risks of modern segregation in occupational so-cial networks. The scenario considers glass-ceiling barriers for women in accessing boards of company directors. Two case studies are presented, one considering Ital-ian companies and the other EstonItal-ian companies. The latter case incluse temporal information, thus allowing for temporal analysis of segregation.

(12)

(13)

1. Introduction

“

During the 1970s and 1980s a word disappeared from the American vocabulary. . . It was not articulated by civil rights leaders speaking out against the persistence of racial inequality; and it was nowhere to be found in the thousands of pages written by social scientists on the urban underclass. The word was segregation.

”

Massey and Denton, American apartheid: segregation and the Making of Underclass, 1998

1.1 Motivation

The term social segregation refers to the “separation of socially defined groups” [125]. People are partitioned into two or more groups on the grounds of personal or cul-tural traits that can foster discrimination, such as gender, age, ethnicity, income, skin color, language, religion, political opinion, membership of a national minority, etc. [160]. Contact, communication, or interaction among groups are limited by their physical, working or socio-economic distance. Members of a group are often observed to cluster together when dissecting the society into organizational units (neighborhoods, schools, job types).

Early studies on residential segregation trace back to 1930’s [67]. In this context, social groups are set apart in neighborhoods where they live in, in schools they attend to, or in companies they work at. As sharply pointed out in Figure 1.1, racial segregation (a.k.a. residential segregation on the grounds of “so-called race”) very often emerges in most cities characterized by ethnic diversity. Schelling’s segregation model [50, 166] shows that there is a natural tendency to spatial segregation, as a collective phenomenon, even if each individual is relatively tolerant – in his famous abstract simulation model, Nobel laureate Schelling assumed that a person changes residence only if less than 30% of the neighbors are of his/her own ethnicity.

Recently, [127] argued that segregation is shifting from ancient forms on the grounds of racial, ethnic and gender traits to modern socio-economic and cultural segregation on the basis of income, job position, and political-religious opinions.

(14)

14 CHAPTER 1. INTRODUCTION

Figure 1.1: Racial spatial segregation in New York City, based on Census 2000 data [75]. One dot for each 500 residents. Red dots are Whites, blue dots are Blacks, green dots are Asian, orange dots are Hispanic, and yellow dots are other ethnicities.

With its ubiquitous presence and pervasiveness, ICT is assuming increasing influ-ence and relevance in a broad range of techno-socio-economic domains. This raises challenging research questions on how to analyze social phenomena starting from big data of human interactions and decisions [87, 106], e.g., from friendship relations in Facebook, from posts in Twitter, from GPS/GSM data on human mobility, etc. An earlier comparison of ideological segregation of the American electorate online and offline is offered in [85]. The paper found that segregation in news consumption is higher online than offline, but significantly lower than the segregation of face-to-face interactions with neighbors, co-workers, or family members. More recently, it has been warned that the filter bubble generated by personalization of online social net-works may foster segregation [76], opinion polarization [123], and lack of consensus between different social groups. People are only reinforced in what they already believe and lack exposure to alternative viewpoints and information [18, 148]. Po-larization in social media may also lead to unfriending peers who expressed different opinions [94]. Consequently, online social network users are sometimes led to self-censorship acts [64] and to isolation [97] and audience segregation [181] for fear of public opinion on personal thoughts. This is an instance of the spiral of silence phenomenon studied in social science [144]. A recent paper [128] points out that web user navigation reflects a racially segregated traffic pattern, despite the hetero-geneity of website racial contents.

From a data analysis perspective, the key problem of assessing the presence, extent, nature, and trends of social segregation has been investigated so far by hy-pothesis testing, i.e., by formulating one or more possible contexts of segregation

(15)

1.2. CONTRIBUTION 15

against a certain social group, and then in empirically testing such hypotheses – see, e.g., [140]. For instance, a suspect case of segregation of female students in high schools from NYC is studied first by collecting data on gender of high school students in NYC (reference population), and then by computing and analysing seg-regation indexes over female students (minority group). A segseg-regation index is a measure of the degree of segregation of minorities in organizational units. The for-mulation of the hypothesis, however, is not straightforward, and it is potentially biased by the expectations of the data analyst of finding segregation in a certain context. In this process, one may overlook cases where segregation is present but undetected. This may occur due to multiple causes. First, the definition of the possibly segregated group is stated as part of the hypothesis, hence biased by the expectations of the analyst. Segregation may occur for some sub-groups of minori-ties (a form of intersectionality [49, 80, 169]) that is overlooked by the analyst. We rather aim at discoverying cases where the segregated group is a-priori unknown. Nevertheless, such cases should be representative in terms of the size of the segre-gated group. Second, segregation can result undetected when the analyst targets an actually segregated minority but considering a reference population that is too small/large (property (P4) discussed in Section 2.2). Third, segregation is unde-tected if analysed at a wrong granularity level (property (P5)), e.g., at city level when segregation is at neighborhoods level. However, an analyst does not typically know a-priori which granularity is the most appropriate one. Even worse is the case when the organizational units to base the segregation analysis on are not known a-priori. In fact, traditional contexts, such as residential segregation, start from background knowledge on the organizational units such as neighborhoods, schools, jobs, and so on [67, 110, 120, 152]. Such a background knowledge is widely agreed and accurately designed, e.g., because used in population census or given by natu-ral/administrative boundaries. The fourth case of undetected segregation is when the data under analysis has no clear or agreed partition into organizational units. This is the case, for instance, of online social networks, of the web, or, as we will discuss in our case studies, of company networks.

1.2 Contribution

In this thesis, we will consider the social segregation problem from a data science perspective by proposing a segregation-aware data mining approach for the discovery of segregation from relational data and from attributed graphs. The approach is implemented in an efficient system, called SCube, and experimented on two challenging case studies in the domain of occupational segregation in company boards.

Human decisions and inter-actions are increasingly being sensed, stored and anal-ysed at extreme detail and resolution, at individual, group and society levels. The analysis of those digital traces can create new comprehensive pictures of individual

(16)

16 CHAPTER 1. INTRODUCTION

and group behaviour, through the discovery of patterns and models, with the po-tential to transform the understanding of our lives, organizations, and societies. We propose a data-driven approach, which complements hypothesis testing, by driving the search (the “discovery”) of contexts and social groups where a-priori unknown segregation factors are quantitatively prominent. Recall the previous example on school segregation. The analyst has to collect data on gender and other possible segregation attributes such as age and ethnicity of students, and on location, school type, annual fees and other context attributes that may distinguish conditions of segregation. Although no segregation may be apparent in the overall data, it may turn out that for a specific combination of context attributes (e.g., high schools located in a particular area), a specific minority group identified by a combination of segregation attributes (e.g., black female students) is at risk of segregation. We quantify such a risk through a reference segregation index from the social science literature, and assume that a value of the index above a given threshold denotes a situation worth for further scrutiny. We call the problem of discovering a-priori unknown minority groups and reference populations for which segregation indexes are above a given threshold, the segregation discovery problem.

We tackle the segregation discovery problem for relational and graph data. In the relational case, we assume in input a dataset which records the characteristics of a population of individuals, including minority groups, distributed over a num-ber of organizational units. We devise an algorithm that builds a multidimensional data cube (called segregation data cube) filled with segregation index values at the variation of the characteristics of the minority and of the context attributes. The algorithm relies on theory and tools from frequent itemset mining. Its output can readily be explored using traditional multidimensional reporting system, such as pivot tables and charts. This favors its wide applicability by smoothing the learning curve for an analyst. In the case of graph data, we assume as input an attributed graph, which is a graph where nodes are assigned a tuple of values. First, we mo-tivate the inapplicability of homophyly measures as measures of segregation. Then, we propose to reduce the problem to the relational case by applying a clustering al-gorithm, and design an attributed graph clustering algorithm for large graphs called SToC.

The segregation discovery algorithms and a few graph clustering algorithms, in-cluding SToC, have been implemented in a complex software system called SCube that also includes modules for pre-processing the data of the two case studies that we will consider. SCube includes a number of state-of-the-art libraries and tools for managing compressed bitmaps, for storing large graphs, for extracting closed fre-quent itemsets, for connecting to Excel sheets. We present the detailed architecture of SCube and experimental results demonstrating its computational efficiency on dataset that are at least two orders of magnitude larger than the ones considered so far in the literature.

We demonstrate the applicability of the proposed methodology and tools in a complex scenario, reflecting the risks of modern segregation in occupational

(17)

so-1.3. PLAN OF THE THESIS 17

cial networks. The scenario considers glass-ceiling barriers for women in accessing boards of company directors. Two case studies are presented. One regards the com-pany network of Italian companies, and the other the comcom-pany network of Estonian companies. The latter dataset is smaller but it includes time information on the membership of a director of a company board, thus allowing for temporal analysis of segregation. The case studies provide confidence that a deeper understanding of segregation phenomena through the design of analytical processes can proactively support policy makers and control authorities in discovering and in anticipating potential segregation problems.

1.3 Plan of the thesis

The thesis is organized as follows. Chapter 2 introduces segregation indexes used in social sciences, and critically discuss some mathematical properties exploited later on by the segregation discovery algorithm. The segregation discovery problem from relational data is formally introduced in Chapter 3, characterized using frequent itemset mining theory, and solved through the computation of a segregation data cube. We also discuss extensions that cover multi-valued attributes and temporal data. Chapter 4 deals with the case of attributed graph data by proposing an approach that clusters nodes and then reduces the problem to the relational case. The SToC algorithms for large graph partitioning is presented and applied it to two case studies. In Chapter 5 we present the architecture of the SCube system for segregation discovery from both relational data and attributed graphs. It includes experimental results on the efficiency of the computationally-demaning modules of the system. Chapter 6 discusses in detail the data analysis process of the two case studies. Finally, Chapter 7 summarizes the contributions of the thesis and directions open for future work.

(18)

(19)

2. Segregation analysis

“

Segregation is that which is forced upon an inferior by a superior. Separation is done voluntarily by two equals.

”

Malcom X, The Race Problem, 1963

Since segregation is a multifaceted phenomenon that can be analysed with re-spect to several perre-spectives, several indexes have been proposed in the social science literature in order to model such different perspectives [32, 71, 173]. These indexes are extensively adopted in empirical social studies. They are at the basis of the ex-post measurement of the effectiveness of policies to combat segregation (deseg-regation) [167]. In this chapter, we first review segregation indexes, and critically discuss some mathematical properties they (should) have. Finally, we survey related approaches based on statistics and social science simulation.

2.1 Segregation indexes

A segregation index provides a quantitative measure of the degree of segregation of social groups (e.g., Blacks, Whites, Hispanics, etc.) among units of social or-ganization (e.g., schools, neighborhoods, jobs, etc.). In this thesis, we restrict to consider binary indexes, which assume a partitioning of the population into two groups1_{, say majority and minority (but could be men/women, native/immigrant,}

White/NonWhite, etc.). The surveys [71, 104] represent the earliest attempts to cat-egorize them. Afterward, [126] provided a shared classification with reference to five key dimensions: evenness, exposure, concentration, centralization, and clustering. We will follow here such a categorization.

Let T be size of the total population, 0 < M < T be the size of the minority group, and P = M/T be the overall fraction of the minority group. Assume that there are n organizational units (or simply, units), and that for i ∈ [1, n], ti is the

size of the population in unit i, mi is the size of the minority group in unit i, and

pi = mi/ti is the fraction of the minority population in unit i.

(20)

20 CHAPTER 2. SEGREGATION ANALYSIS

2.1.1 Evenness indexes

Evenness indexes measure the difference in the distributions of social groups across organizational units. The indexes mostly used in the social science literature include dissimilarity, information index, Gini, and Atkinson.

The dissimilarity index D is the weighted mean absolute deviation of every unit’s minority proportion from the global minority proportion:

D= 1 2 · P · (1 − P ) n X i=1 ti T · |pi− P |

The factor 2 · P · (1 − P ) normalizes the index in the range [0, 1]. Since D mea-sures dispersion of minorities over the units, higher values of the index mean higher segregation. Dissimilarity is minimum when for all i ∈ [1, n], pi = P , namely the

distribution of the minority group is uniform over units. It is maximum when for all i ∈ [1, n], either pi = 1 or pi = 0, namely every unit includes members of only

one group (complete segregation).

Example 2.1.1. With basic algebra, it is readily checked that an equivalent defini-tion of dissimilarity is:

D= 1 2 n X i=1 mi M − ti− mi T − M

Dmeasures how different are the distributions of percentages of total minorities and of total majorities in the units. When n = 2, the formula boils down to:

D= m1 M − t1− m1 T − M (2.1)

The second widely adopted index is the information index, also known as the Theil index in social sciences [138] and normalized mutual information in machine learning [132]. Let the population entropy be E = −P · log P − (1 − P ) · log (1 − P ), and the entropy of unit i be Ei = −pi·log pi−(1−pi)·log (1 − pi). The information

index is the weighted mean fractional deviation of every unit’s entropy from the population entropy: H = n X i=1 ti T · (E − Ei) E

Information index ranges in [0, 1]. Since it denotes a relative reduction in uncertainty in the distribution of groups after considering units, higher values mean higher segregation of groups over the units. Information index reaches the minimum when all the units respect the global entropy (full integration), and the maximum when every unit contains only one group (complete segregation).

The third evenness measure is the Gini index, defined as the mean absolute differ-ence between minority proportions weighted across all pairs of units, and normalized

(21)

2.1. SEGREGATION INDEXES 21

to the maximum weighted mean difference. In formula:

G= 1 2 · T2_{· P ·}(1 − P ) · n X i=1 n X j=1 ti· tj · |pi − pj| (2.2) Here Pn i=1 Pn

j=1ti· tj · |pi− pj| is the weighted mean absolute difference. The

nor-malization factor is obtained by maximizing such a value. The definition of the Gini index stems from econometrics, where it is used as a measure of the inequality of income distribution [83].2 _{In our context, it measures the inequality of the majority}

group distribution among units. The Gini index ranges in [0, 1] with higher values denoting higher segregation. The maximum and minimum values are reached in the same cases of the dissimilarity index.

An equivalent formulation of the Gini index (see [71, 186]) can be stated under the assumption that p1, . . . , pn are in descending order:

G= 1 M ·(T − M) · n X i=1 (Xi−1· Yi− Xi · Yi−1) (2.3) where, for i ∈ [1, n]: Xi = i X j=1 mi Yi = i X j=1 (ti− mi)

the Xi’s (resp., Yi’s) are the cumulative sums of minority (resp., majority) population

in units 1, . . . , i. Formulation (2.3) easily derives from the geometric interpretation of the Gini index (see footnote 2). From a computational perspective, it allows for computing G in O(n · log n), whilst formula (2.2) requires O(n2_{). This will be}

particularly relevant in experimental case studies, where the number n of units will be in the order of millions.

The last evenness measure is the Atkinson index [15], which is aimed at a better sensitivity of inequality scenarios. The contribution of units with a high level of seg-regation are weighted differently from the one of units with low level of segseg-regation. This is achieved by means of a a parameter b ∈ [0, 1]. The Atkinson index is defined as follows: Atk(b) = 1 − _P 1 − P · 1 P · T · n X i=1 h (1 − pi)1−b· pbi · ti i (_1−b1 ) (2.4)

2_{The geometric interpretation of the Gini index is provided in the space [0, 1] × [0, 1]. The}

Lorenz curve plots the cumulative fraction of minority against the cumulative fraction of majority. Formally, assume that p1, . . . , pn are in descending order. The Lorenz curve f () is the piece-wise linear function such that f (0) = 0, f (1) = 1, and, for i ∈ [1, n], f ( ˆXi) = ˆYi where ˆXi is the cumulative fraction of the minority group up to unit i, and ˆYi is the cumulative fraction of the majority group up to unit i. The diagonal represents the perfect equality of distribution of majority vs minority population. The Gini index is twice the area between the Lorenz curve and the diagonal. See [71, 186] for details.

(22)

It ranges over [0, 1] with higher values denoting higher segregation. Notice the impact of parameter b ∈ [0, 1]. When b ≤ [0, 0.5] the units where minorities are under-represented (pi <0.5) contribute more to the segregation index (because the

weight (1 − pi)1−b· pbi is higher). When b ≤ [0.5, 1], we have the opposite behavior,

namely pi >0.5 have higher weights.

2.1.2 Exposure indexes

Exposure indexes measure the degree of potential contact, or possibility of interac-tion, between members of social groups. The most used measure of exposure is the isolation index [29], defined as the likelihood that a member of the minority group is exposed to another member of the same group in a unit. For a unit i, this can be estimated as the product of the likelihood that a member of the minority group is in the unit (mi/M) by the likelihood that she is exposed to another minority

member in the unit (mi/ti, or pi) – assuming that the two events are independent.

In formula: I = 1 M · n X i=1 mi· pi

The right hand-side formula can be read as the minority-weighted average of minority proportions in units. The isolation index ranges over [P, 1], with higher values denoting higher segregation. The minimum value is reached when for i ∈ [1, n], pi = P , namely the distribution of the minority group is uniform over the

units. The maximum value is reached when there is only one k ∈ [1, n] such that mk = tk = M, namely there is a unit containing all minority members and no

majority member.

Another measure derived by the isolation index is the modified index of isolation (MII) [63] (also called Eta2 in [126]):

MII = _{1 − P}1 · n X i=1 m_i M · pi− P ≡ I − P 1 − P

It ranges in [0, 1] as the isolation index, and, as shown in the formula, it is basically a normalization of the isolation index.

A dual measure is the interaction index, which is the likelihood that a member of the minority group is exposed to a member of the majority group in a unit. By reasoning as above, this leads to the formula:

Int = 1 M · n X i=1 mi·(1 − pi)

It clearly holds that I + Int = 1. Hence, lower values denote higher segregation. A more general definition of interaction index occurs when more than two groups are considered in the analysis, so that the exposure of the minority group to one of the other groups is worth to be considered [126].

(23)

2.1. SEGREGATION INDEXES 23

2.1.3 Concentration indexes

Concentration indexes measure the relative amount of physical space occupied by social groups in an urban area.

The Location Quotient of a unit i is defined as: LQi =

pi

P

Location Quotient varies in the range [0, ∞] and it expresses the relative concen-tration of the minority within an unit i (for this reason local). It is local index of dispersion from the global proportion of the minority. LQi = 1 means the minority

respects the perfect integration concept expressed in [79], namely in the unit i the minority local presence reflects the global minority proportion. If LQi < 1 the

mi-nority is under-represented, while LQi >1 indicates a local over-representation. [38]

states that LQ should be classified in a hybrid dimension concentration-evenness. Another measure of concentration is called G∗

i [86] and it aims to quantify local

agglomerations of individuals of an unit i, i.e., clusters that significantly differ from the average spatial distribution. Its formulation is:

G∗_i = PN j wij ·(pj − P) ·PNj wij S · r N · PN j w 2 ij−( PN j wij) 2 N −1 where S = r PN j p 2 j

N − P2. Weights wij ∈ {0, 1} model the spatial distance between

units i and j – where 1 is the weight of adjacent units, and 0 the weight of non-adjacent ones.

2.1.4 Centralization indexes

Centralization indexes measure the degree to which a group is spatially located near the center of an urban area. A first such index is Proportion in Central City (PCC):

P CC = Mcp M

where Mcprepresents the minority within the bounds of the central point (the

bound-aries are extracted from the interest area). PCC ranges in [0,1] and a high value in an area of interest (e.g. a train station) means a high concentration of the minority in the area.

Another measure called Relative CEntralization index (RCE) is: RCE = n X i Xi−1· Yi− n X i Xi· Yi−1

where the units are ordered by distance from the central point of interest, and Xi = 1/T ·Pij=1mj and Yi = 1/T ·Pij=1(ti − mj) are respectively the cumulative

(24)

proportions of minority and majority. RCE ranges in [−1, 1] and it reaches the minimum when the minority group is concentrated far away from the central point of interest, while it reaches the maximum when the minority is concentrated around the central point of interest. When RCE = 0 the minority has the same distance from the central point of the majority. A variant of the index where the minority distribution is correlated by its own -spatial- distribution (and not with the majority) is called Absolute CEntralization index (ACE):

ACE = n X i Xi−1· Ai− n X i Xi· Ai−1

where the units are ordered as in RCE, aj represents the area of unit j, AT is the

total area, and Ai = 1/AT ·Pij=1aj represents the cumulative spatial proportion.

The interpretation is similar to RCE: negative values indicate the minority uses to live on the boundaries of the city while positive values indicate they use to live in the center of the city, 0 indicates the minority is uniformly distributed trought the space.

2.1.5 Clustering indexes

Clustering indexes measure the degree to which group members live disproportion-ately in contiguous areas. This class of indexes is particularly relevant in modelling spatial segregation3_.

Moran’s M-I index, derived by the original paper [139], is a measure of spatial autocorrelation that quantifies the spatial clustering of a minority group among the n units: M-I = N W · P i Pn j wij(pi− P)(pj− P) Pn i(pi− P)2 where W = P i P

jwij and wi is the spatial distance (see Sec. 2.1.3). M-I index

varies in the range [−1, 1], M-I< 0 means negative spatial autocorrelation. M-I> 0 means positive spatial autocorrelation.

A related index is the Geary’s C index [84]:

C = N −1 2W · P i P jwij(pi− pj)2 P i(pi− P)2

ranging in [0, 2]. C = 1 means no spatial autocorrelation. C < 1 means positive spatial autocorrelation. C > 1 means negative spatial autocorrelation.

3_{Tobler’s first law of geography states: “Everything is related to everything else, but near things}

(25)

2.2. PROPERTIES OF SEGREGATION INDEXES 25

2.1.6 An extension of segregation indexes for overlapping

units

In the literature on segregation measurement, it is typically assumed that individu-als are partitioned among the units of analysis. Each individual belongs to one and only one unit. The size of the overall population is then the sum of the population in each unit, and similarly for the size of the minority population. Such an assumption readily holds e.g., in cases of residential segregation. The experimental case studies that we will present later on, however, break such an assumption, since individu-als (company directors) may belong to more than one unit (group of companies). This situation resembles the case of segregation analysis in e.g., sports club where players may be associated with more than one team at a time [81, 122] or in movie productions, where actors may play in movies of more than one producer [170]. We extend then the previously introduced definitions of segregation indexes by consid-ering every instance of an individual in an unit as a distinct one. In practice, this turns out to revise the definition of T and M as follows:

T = n X i=1 ti M = n X i=1 mi

namely, the size of the total population is by definition the sum of the sizes of the unit populations, and similarly for the minority population.

2.2 Properties of segregation indexes

There is a long standing debate in social sciences about which mathematical prop-erties segregation indexes are expected to have. Despite pros and cons of adopting a specific index, strong correlation among evenness indexes has been observed in practice by empirical analyses [126].

2.2.1 Properties of evenness and exposure indexes

The following are some useful mathematical properties and differences among the evenness and exposure indexes introduced earlier. Such properties will be crucial in our approach for segregation discovery.

(P1) All indexes are insensitive to units i with ti = 0, i.e., by adding such “empty”

units the value of an index does not change.

(P2) I and Int are insensitive to units i with mi = 0, whilst D, H, and G are not.

By adding units with only majority members, the likelihood of interaction among minority members does not change. The distributions of populations over units, instead, do change.

(26)

(P3) D, H and G are symmetric, i.e., by inverting the minority and majority groups the index remains unchanged, whilst I and Int are not. Intuitively, distance between minority and majority distributions is a symmetric concept, whilst likelihood of contact depends on the groups being considered.

(P4) I is anti-monotonic w.r.t. the addition of a majority member, i.e., by adding one majority member to a unit the value of I decreases. Int is monotonic, and D, H, Atk, and G are neither monotonic nor anti-monotonic.

(P5) D, H and G are subject to the Simpson’s paradox. E.g., for the dissimilarity index, there may exist datasets X and Y such that:

DX∪ Y > DX and DX∪ Y > DY

where DX, DY and DX∪ Y are the dissimilarity indexes for datasets X, Y and

X ∪ Y respectively.

(P6) also called homogeneity [101], all the indexes are insensitive to scalar multi-plication of the number of minority and majority groups in each unit.

(P7) also called symmetry, all indexes are insensitive to a permutation of the orga-nizational units,

(P8) also called principle of transfers, H, G, and Atk decrease if an individual of the minority group moves from unit i to unit j such that pi > pj. D does not

satisfy such a principle [104], but a variant of it can be devised that does. Let us discuss the two key properties (P4) and (P5) in detail. For the former, anti-monotonicity is typically a desired property since it enables computational op-timizations in the calculation of measures, as it happens for instance in frequent itemset mining [98]. Regarding the latter, Simpson’s paradox makes it unclear what is the right level of aggregation at which indexes should be considered. Let consider an example for (P4).

Example 2.2.1. Assume n = 2, with m1 = m2 = 1, t1 = 2, and t2 = 4. We have

M = 2, T = 6. Using (2.1), it turns out D = 1/2 − 1/4 = 0.25.

Consider adding one majority member to unit 1 (the most segregated because p1 = 0.5, p2 = 0.25 and P = 0.33). We have T = 7 then D = 1/2 − 2/5 =

0.1. Thus, the dissimilarity index has decreased. Consider now instead adding the majority member to unit 2 (the most integrated). Again, we have T = 7. But now D= 1/2 − 1/5 = 0.3. The dissimilarity index has now increased.

Simpson’s paradox is a well-known case in presence of ratios and differences of distributions, in which a trend appears in different groups of data but disappears or reverses when these groups are combined [150]. In our context, this occurs when seg-regation appears in combined dataset X ∪ Y , but disappear when looking separately at X and Y , or vice-versa.

(27)

2.2. PROPERTIES OF SEGREGATION INDEXES 27

Example 2.2.2. Assume two university departments (n = 2). Faculty (X) and administrative staff (Y) are employed in each department. Assume the numbers on the left hand side of the following table:

m1 t1 m2 t2 M T D H G

X 99 100 1 2 100 102 0.490 0.2903 0.490 Y 1 10 9 100 10 110 0.010 0.0002 0.010 X ∪ Y 100 110 10 102 110 212 0.811 0.8354 0.811

Using formula (2.1.1), DX∪ Y = 100/110 − 10/102 = 0.811, i.e., segregation at

university level appears to be high. However, when considering separately faculty and administrative staff, we have DX = 99/100 − 1/2 = 0.49 and DY = 1/10 − 9/100 =

0.01, i.e., segregation is much lower in both sub-groups. Similar conclusions can be drawn for H and G, as shown in the right hand side of the table above.

The previous example is a contrived one. Actually, the reversed effect (segrega-tion in combined dataset X ∪ Y lower than in X and Y separately) is more likely to be observed, as we will show later on.

2.2.2 Secondary factors

James [104] presented an appendix titled “mathematical analysis of index behavior” that inspired this section. We discuss here some factors that, in extreme cases, may affect the meaningfulness of segregation indexes: the small size problem, described in [9, 46] and the Modifiable Areal Unit Problem (MAUP) [102].

The small size problem is a potential disturbing factor that occurs when the minority group is (evenly) distributed in a large number n of small units. Consider for instance the dissimilarity index, with M = 10, T = 100 and with n = 11 units such that t1 = . . . = t10 = 2 and t11 = 80. If the M individuals are evenly assigned

to units 1, . . . , 10, we would have D = 1/(2 · 0.1 · 0.9 · 100) · (20 · |1/2 − 1/10| + 80 · |0 − 1/10|) = 16/18, which denotes an over-estimation of the segregation degree. Assume, in fact, that the M individuals are instead all (unevenly) assigned to unit 11. We would have D = 1/(2·0.1·0.9·100)·(10·|0−1/10|+80·|10/80−1/10|) = 12/180. Some works (e.g., [158]) argued that the former is a case of actual segregation, yet an extreme one. However, having one giant unit with a large fraction of majority and minority populations, as in the latter, prevents a fine-grained segregation analysis. As a limit case, if the whole population would be in a single unit, there is no possibility of segregation analysis. We will see in Chapter 6, that having giant units typically occurs in our cases studies, and then we will devise methods to break the giant unit into smaller ones.

The Modifiable Areal Unit Problem (MAUP) is related to the temporal/spatial dynamics of the organizational units. It was first observed empirically in [146]. In Figure 2.1, taken from [107], we show how the spatial aggregation of units from (a)

(28)

28 CHAPTER 2. SEGREGATION ANALYSIS a) b) c) 2 4 3 6 6 1 3 5 1 5 5 4 4 2 5 4 3 3.5 4.5 4 3 4.5 3 4.5 3.75 3.75 3.75 3.75 ¯ m= 3.75 m¯ = 3.75 m¯ = 3.75 σ2_{= 2.60} _σ2_{= 0.50} _σ2_{= 0} d) e) f) 2.5 5 4.5 3 3 4.5 4.5 3 2.5 4.75 4.5 3 4 3.67 1 4 ¯ m= 3.75 m¯ = 3.75 m¯ = 3.17 σ2_{= 0.93} _σ2_{= 1.04} _σ2_{= 2.11}

Table 2.1: An instance of the Modifiable Areal Unit Problem.

by merging adjacent cells of the same row (b) and cells in the same quadrant (c). Numbers in (b) and (c) report the number of individuals per areal unit. The average ( ¯m) of all such averages is constant at different levels of spatial aggregation, but the variance (σ2_{) is very different. The granularity of aggregation, called zoning, is}

then a critical problem, which has no clear answer from a sociological point of view. Some sociologists consider contiguous areas delimited by streets. Others aggregates more with the rationale that points of interests, rather than streets, are aggregation drivers. Different zoning may critically affect the resulting distribution, as shown in Figure 2.1 (c,d,e). The MAUP is relevant in our approach for discrimination discovery, because we will propose a system that analyze segregation indexes at different levels of aggregation. Since the approach builds all possible combinations of aggregation (using logical and’s), this mitigates the risk of missing an aggregation level that unveils high segregation.

2.3 Related approaches

In this thesis, we consider segregation from a data analysis perspective and with reference to human population. However, segregation has been also tackled from a simulation perspective and with reference to non-human agents. Let us briefly

(29)

2.3. RELATED APPROACHES 29

mention some related approach in these two fields. Next, we discuss sociological motivation behind considering segregation a social problem, and finally we relate segregation to the value-sensitive field of discrimination in data science.

Segregation and Biology. [178] is one of the first studies to deal with segrega-tion in the biological field. The authors re-adopted a well-know measure in spa-tial segregation analysis, calling it neural complexity, with the aim to explore the network structure and dynamics the brain neurons. This topic is receiving an in-creasing attention [136, 176], with reference to the whole-brain functional connec-tivity [48, 121, 131, 134, 182], or to the smaller scale of protein-protein network [19]. The objective of such studies is to discover correlation between structural parts of the network and neuro-developed disorders (autism, Alzheimer, Parkinson) located in such parts [14, 92, 131, 190]. For example, [134] observed that autism is strongly correlated with mirror neuron defect. [82] focused the attention on Parkinson dis-ease. Finally, we mention a few studies on spatial segregation patterns of animal species [13, 91], which found similarities with the humans’ behaviour, and a study on the link between residential segregation and the spreading of epidemiological diseases [2].

Simulating segregation. Segregation has been considered also from a simulation perspective of human dynamics. The precursor Schelling’s segregation model [50, 166] paved the way to a quantitative approach in sociology. [100] has deepened the model and characterized mixed patterns along borderline areas (where areas with concentration of different groups are in contact). [30] modelled the behavioral dynamics of social network users with first-order discrete Hidden Markov Models (HMMs), using some visible features (e.g., actions like “retweet”, “like”, etc.) to model the invisible ones (e.g. topic opinion, implicit bias [171]). [115] proposed a new methodology, inherent to behavioral science, using bots for experiments on ethical issues. The paper describe how to use automated programs to simulate the human dynamics in some very peculiar synthetic scenarios. [174] designed an agent model for simulating the spreading of fake news and the successive fact checking on a social network, and for analysing the results of the simulation at the variance of the segregation of the agents in the network. Their goal was to check to what extent the structure of a network can be a factor of the misinformation diffusion. Simulations rely on psychological and sociological models of segregation phenomena in society.

Sociological theories and legal acquis. The contact hypothesis theory [10, 153] states that under appropriate conditions interpersonal contact is one of the most ef-fective ways to reduce prejudice and distance between majority and minority group members. Examples of such conditions require to get in contact people with sim-ilar background (equal status), with common goals, with agreed authorities. The conflict theory [172] states, instead, that intergroup contact raises the perception of threat, driven by prejudice. Minimizing intergroup contacts leads to increased in-tragroup social capital [54] and, as a consequence, it relaxes intergroup tension. An intermediate approach [1] argues that both theories have merits, but they apply in

(30)

different scenarios, and it proposes to enhance trust between groups by making them aware of the problem. The legal scenario considers segregation unlawful not per se but as a cause affecting equality of opportunities and discrimination [58, Article 2 comma g,f]. The vicious cycle of discrimination [142] starts from a situation where prejudice causes a protected group to be socially disadvantaged. This is interpreted as evidence that the group is inferior, which, in turn, creates renewed prejudice by increasing social distance and segregation, by reinforcing negative stereotypes, and by legitimating negative feelings.

Discrimination and fairness. Social discrimination refers to an unjustified dif-ference in treatment of individuals based on any physical or cultural trait, such as sex, ethnic origin, religion or political opinions. Fairness goes beyond the legal obligations of non-discrimination towards embedding ethical values, and accounts for bias-free data. A subtle case is indirect discrimination, also called disparate impact, where apparently neutral rules lead to discrimination effects. An example is redlining, which is the practice of banks denying credit based on the residence of the applicant. Since residence is a proxy for race, especially in highly segregated urban areas, such a practice actually hides discrimination. Discrimination discovery consists of finding out cases of direct and indirect discrimination and explaining under which specific conditions they occurred in decisions made by a black-box sys-tem. Existing approaches provide explanations of discriminatory decisions using association rule mining, k-Nearest Neighbours Bayesian networks, statistical tests, causation approaches. We refer to the survey of economics and data mining methods [160], of measures of discrimination [180], and of legal implications [20].

(31)

3. Segregation discovery

“

[...] there may be recognition that a small set of summary mea-sures doesn’t suffice, and that a more elaborate multiple-indicator approach is required. [...] No single measure can capture all of the analytically distinguishable elements of the exposure, contact, and segregation. The statistical and conceptual properties of an index that make it especially suitable for one purpose may make it less suitable for another.

”

A.F. Taeuber and K. Taeuber, Measures of racial segregation: some problems., 1988

Traditional data analysis approaches from social sciences typically rely on formu-lating an hypothesis, i.e., a possible context of segregation against a certain social group, and then in empirically testing such an hypothesis – see, e.g., [140]. For instance, a suspect case of segregation of female students in high schools from NYC is studied first by collecting data on gender of high school students in NYC (ref-erence population), and then by computing and analysing segregation indexes over female students (minority group). The formulation of the hypothesis, however, is not straightforward, and it is potentially biased by the expectations of the data an-alyst of finding segregation in a certain context. In this process, one may overlook cases where segregation is present but undetected.

Below we use the notation (P1), (P2), (P3), (P4), (P5) for the properties described in Sec. 2.2.

Example 3.0.1. By property (P4), segregation can result undetected when the an-alyst targets an actually segregated minority but considering a reference population that is too small/large. Similarly, by property (P5), segregation is undetected if analysed at a wrong granularity level, as shown in Example 2.2.2. However, an an-alyst does not typically know a-priori which granularity is the most appropriate one for the context at hand.

We propose a data-driven approach, which complements hypothesis testing, by driving the search (the “discovery”) of contexts and social groups where a-priori

(32)

32 CHAPTER 3. SEGREGATION DISCOVERY

unknown segregation factors are quantitatively prominent. Recall the previous ex-ample on school segregation. The analyst has to collect data on gender and other possible segregation attributes such as age and ethnicity of students, and on location, school type, annual fees and other context attributes that may distinguish condi-tions of segregation. Although no segregation may be apparent in the overall data, it may turn out that for a specific combination of context attributes (e.g., high schools located in a particular area), a specific minority group identified by a combination1

of segregation attributes (e.g., black female students) is at risk of segregation. We quantify such a risk through a reference segregation index, and assume that a value of the index above a given threshold denotes a situation worth for further scrutiny – what legal scholars call a prima-facie evidence. We call the problem of discovering a-priori unknown minority groups and reference populations for which segregation indexes are above a given threshold, the segregation discovery problem.

3.1 Notation and itemset mining

Let us recall first notation and concepts from itemset mining [98, 162], which will serve to define the search space of segregation discovery.

Let I = {i1, i2, ..., in} be a set items where n ≥ 1. A transaction t is a pair

(tid, T ), where tid is an ID and T ⊆ I is a subset of items. A transactional database Dis a set of transactions. An example is shown in Table 3.1 (left). An itemset X is a set of items, namely X ⊆ I. As usual in the literature, we write X, Y for X ∪ Y. Itemsets are an abstraction of a set of transactions. A transaction t = (tid, T ) supports X if X ⊆ T , namely the transaction includes all items in X. The cover of X is a set of transactions that support X:

cover(X) = {tid|(tid, T ) ∈ D, X ⊆ T }.

The (absolute) support of an itemset X is the number of transactions in its cover: support(X) = |cover(X)|.

X is a frequent itemset if supp(X) ≥ minsupp, where minsupp is a given minimum support threshold. The problem of frequent itemset mining consists of computing all frequent itemsets for a given threshold minsupp. Frequent itemset mining is a well-established research area with solid theory [98] and efficient tools [89].

The number of frequent itemsets can be so large that no efficient algorithm exists to enumerate all of them. This case occurs, for instance, for very low minimum support thresholds and for highly correlated data. A concise representation [43, 124] of frequent itemsets is a lossless representation that addresses the problem. The set of frequent closed itemsets is a concise representation [149]. An itemset X

1_{In the context of social discrimination, this combination is referred to as intersectionality or}

(33)

3.1. NOTATION AND ITEMSET MINING 33

tid transaction

1 sex=M, age=21-30, residence=north

2 sex=F, age=31-40, residence=north

3 sex=M, age=51-60, residence=north

itemset supp cover

{sex=M} 2 {1,3}

{sex=F} 1 {2}

{residence=north} 3 {1,2,3}

Table 3.1: A sample transaction database (left) and some example of cover and support (right).

is closed if there is no Y ⊃ X with cover(Y) = cover(X) or, equivalently, with supp(Y) = supp(X). A closed itemset is a representative member of the class of equivalence of itemsets with a same cover [25]. In particular, X is θ-equivalent to Y if cover(X) = cover(Y). Closed itemsets are the maximal elements of the classes of θ-equivalence. The maximal elements are unique, so there is a bijection between distinct covers of itemsets and closed itemsets.

Transactions can readily model set of items in, e.g., a market basket. Moreover, there is a natural mapping of relational data to transactions which then allows to provide an interpretation of tabular data in the context of frequent itemset. Consider a relational table R, or, simply, a table or a dataset. In our context, tuples σ in R will denote individuals, and attribute values will denote information about individuals and organizational units they belong to. We assume that every attribute A has a discrete domain dom(A) of values. Continuous attributes can be considered after discretization into bins [118]. We denote by σ(A) the value of the tuple σ on attribute A, as in e.g., σ(sex) =female.

An A-item is a term A = v, where v ∈ dom(A). The set of items I consists then of the set of all A-items, where A is an attribute of R. A tuple σ can then be mapped into the transaction t = (tid, T ), where tid is the tuple RID and:

T = ∪A{A = σ(A)}.

Based on this mapping, we will use tuples and transactions interchangeably. In particular, it turns out that a tuple σ supports an itemset X if for every A = v in X, we have v ∈ σ[A]. If an itemset is seen as a conjunction of logical conditions, supporting means verifying that the condition holds for a given tuple. Moreover, the cover of an itemset is the set of tuples that verify the condition of the itemset. In our context, the cover will then denote groups of individuals sharing the characteristics stated by an itemset. Since closed itemsets have distinct covers, using closed itemsets (instead of itemsets) means considering each distinct group (whose size is at least the minimum support) exactly once.

Example 3.1.1. Consider the database in Table 3.1. The cover of the itemset sex=F is the set of young women in the dataset, which consists of only the transaction with tid=2. Its support is then 1. The itemset is not closed, since the itemset {sex=female, age=31-40, region=north} has the same cover/support.

(34)

3.2 The segregation discovery problem

We introduce here the segregation discovery problem. Let R be an input relational table. We assume that attributes are partitioned into three groups. First, segregation attributes (SA), such as sex, age, and ethnicity, which denote minority groups potentially exposed to segregation. Second, context attributes (CA), such as region and job type, which denote contexts where segregation may appear. Notice that the the classification of an attribute as SA or CA is arbitrary and related to the objectives of the segregation analysis at hand. Third, an attribute unitID, which is an ID of the unit the tuple/individual belongs to. We write A, B to denote an itemset where A includes only SA-items, and B includes only CA-items. We call A an SA-itemset, and B a CA-itemset. Since SA and CA attributes are disjoint, there is no ambiguity in distinguishing for a given itemset the A part from the B part. Example 3.2.1(Ctd.). For the itemset sex=female, age=young, region=north, it turns out A =sex=female, age=young and B =region=north. In this exam-ple, the minority group is the set of young women, and the majority population is all the rest, i.e., men or middle aged or elder. This is one specific pair of population and minority group. Other pairs could be considered in the segregation analysis. Actually, one would like to consider all possible such pairs.

We are now in the position to extend the notation of segregation indexes to an itemset A, B where the reference population is the cover of B, and the reference mi-nority group is the cover of A, B. Recall that 0 < M < T is assumed by segregation index definitions.

Definition Let s() be a segregation index. For an itemset A, B such that 0 < supp(A, B) < supp(B), we denote by s(A, B) the segregation index calculated for the population in cover(B) considering as minority population those in cover(A, B). For instance, D(A, B) is the dissimilarity index for minority A and population B. Notice that supp(A, B) < supp(B) implies that A cannot be the empty itemset (otherwise, A, B = B and then supp(A, B) = supp(B)).

Example 3.2.2 (Ctd.). D(sex=female, age=young, region=north) is then the dissimilarity index of segregation of young women among the units in north re-gion. With reference to the dataset in Figure 3.1 (left), we have T = 10 (the support of region=north) and M = 1 (the support of sex=female, age=young, region=north). The population of units is as follows: t1 = 2, t2 = 3, t3 = 2,

t4 = 3, t5 = 0. It turns out that t5 = 0 because there is no individual of the

ref-erence population region=north in it. Minority distribution in the units is m1 =

m3 = m4 = m5 = 0 and m2 = 1. By definition of dissimilarity, D(sex=female,

age=young, region=north) = 102/(2 · 9) · (2/10 · 1/10 + 3/10 · (1/3 − 1/10) + 2/10 · 1/10 + 3/10 · 1/10) = 7/9 ≈ 0.78.

(35)

3.3. A DATA CUBE FOR EXPLORATORY ANALYSIS OF SEGREGATION 35

Notice that all parameters needed to compute D can be defined using the itemset notation. In particular, T = supp(B), M = supp(A, B), and, for a unit i: ti =

supp(B, unit=i), and mi = supp(A, B, unit=i).

Let us introduce now the problem of segregation discovery. Definition Let s() be a segregation index, and α a fixed threshold.

Let A, B be an itemset such that 0 < supp(A, B) < supp(B).

We say that A, B is α-integrative w.r.t. s() if s(A, B) ≤ α. Otherwise, A, B is α-segregative. The problem of segregation discovery consists of computing the set of α-segregative itemsets.

Intuitively, we are interested in searching the space of itemsets for A, B denoting a minority sub-group (A) and a context (B) where the segregation index s(A, B) is above the α threshold. Notice that we assume that higher values of s() denote higher segregation, which is the case for all introduced indexes except for Int. For such an index, the bound in Def. 3.2 becomes s(A, B) ≥ α.

3.3 A data cube for exploratory analysis of

seg-regation

The problem of segregation discovery can be readily formulated as the one of com-puting iceberg multi-dimensional data cubes [99]. Let A1, . . . , Ak be the collection

of segregation and context attributes. They can be considered dimensions of a multi-dimensional array. The ith _{dimension is named as the attribute A}

i, and it

takes values in dom(Ai) ∪ {?}. The coordinate v ∈ dom(Ai) denotes the itemset

Ai = v. The coordinate ? denotes the empty itemset (absence of an Ai-item). A

multi-dimensional data cube is an array mapping dimension coordinates to values of a measure, which, in our case, is a segregation index s():

d[A1 = v1, . . . , Ak = vk] = s(∪ki=1{Ai = vi | vi 6= ?})

With a little abuse of notation, we write items instead of coordinate values in the index positions of the array d[]. For an itemset A, B whose support is non-zero and is lower than the support of B, if d[A, B] = s(A, B) > α then A, B is α-segregative. The subset of cells in a data cube whose value is higher than a minimum threshold is called an iceberg data cube. Thus, the problem of segregation discovery is equivalent to computing the iceberg data cube of d[]. However, since the number of cells in a data cube grows exponentially with the number of dimensions, a practical additional requirement is to impose also a minimum support threshold. Thus, we aim at computing s(A, B) only if A, B is frequent, namely supp(A, B) ≥ minsupp. Finally, we also aim at considering itemsets of the form A, B that denote no duplicate pair of reference population cover(B) and of minority group cover(A, B). Distinct

(36)

SA CA _unitID

sex age region

male young north 1

male middle south 2

male middle north 2

female middle north 2

female young north 2

male middle north 3

female middle north 3

male elder south 3

male middle south 3

female elder south 4

female middle south 4

male elder north 4

male young north 4

male elder north 4

male middle south 5

male young south 5

female middle south 5 young middleage (SA)

elder ∗

region (CA)

female sex (SA) male

∗ ∗ north south 0.78 0.63 - 0.71 0.71 0.63 0.88 0.71 0.50 0.83 - -- 0.43 0.86 0.75 0.75 0.50 0.88 0.75 - 0.35 0.67 -0.83 0.22 0.76 0.30 0.62 0.57 0.56 0.30 0.46 0.59 0.66

-Figure 3.1: Input table (left) and segregation data cube with the D index (right).

reference populations can be considered by enumerating B’s that are frequent and closed (in R). For a fixed B, then the distinct minority groups can be achieved by enumerating A’s that are frequent and closed in the dataset cover(B).

Example 3.3.1. Reconsider Ex. 3.1.1. The itemset A1, B1 = sex=female, age=young

and the itemset A2, B2 = sex=female, age=young, region=north have the same

cover, i.e., they denote the same minority population. However, the former considers as reference population the whole dataset (B1 is empty), while the latter considers

people from the north region (B2 is region=north). By property (P4) stated in

Sect. 2.2, the dissimilarity index of A1, B1 can be lower, equal or higher than the

one of A2, B2. In Ex. 3.2.2, we have seen that D(A2, B2) ≈ 0.78. By doing the

calculations, it turns out that D(A2, B2) ≈ 0.83.

We introduce next the definition of segregation data cube.

Definition Let s() be a segregation index, and minsupp > 0 a fixed threshold. A segregation data cube is a multi-dimensional data cube such that:

d[A, B] =     

s(A, B) if minsupp ≤ supp(A, B) < supp(B) and B is closed and A is closed in cover(B) “ − ” otherwise

Recalling that M = supp(A, B) and T = supp(B), we have that the value of d[A, B] is undefined, which is written as “-”, if:

Segregation aware data mining

Dipartimento di Informatica

Dottorato di Ricerca in Informatica

Ph.D. Thesis

Segregation aware data mining

Alessandro Baroni

Contents

List of Figures

List of Tables

Acknowledgments

“

”

Abstract

1. Introduction

“

”

1.1

Motivation

1.2

Contribution

1.3

Plan of the thesis

2. Segregation analysis

“

”

2.1

Segregation indexes

2.1.1

Evenness indexes

2.1.2

Exposure indexes

2.1.3

Concentration indexes

2.1.4

Centralization indexes

2.1.5

Clustering indexes

2.1.6

An extension of segregation indexes for overlapping

units

2.2

Properties of segregation indexes

2.2.1

Properties of evenness and exposure indexes

2.2.2

Secondary factors

2.3

Related approaches

3. Segregation discovery

“

”

3.1

Notation and itemset mining

3.2

The segregation discovery problem

3.3

A data cube for exploratory analysis of

seg-regation