The Pictures we Like are our Image: Continuous Mapping of Favorite Pictures into Self-Assessed and Attributed Personality Traits

Cristina Segalin¹, Student Member, IEEE, Alessandro Perina², Marco Cristani¹, Member, IEEE, and Alessandro Vinciarelli³, Member, IEEE
Abstract—Flickr allows its users to tag the pictures they like as "favorite". As a result, many users of the popular photo-sharing platform produce galleries of favorite pictures. This article proposes new approaches, based on Computational Aesthetics, capable of inferring the personality traits of Flickr users from such galleries. In particular, the approaches map low-level features extracted from the pictures into numerical scores corresponding to the Big-Five Traits, both self-assessed and attributed. The experiments were performed over 60,000 pictures tagged as favorite by 300 users (the PsychoFlickr Corpus). The results show that it is possible to predict both self-assessed and attributed traits beyond chance. In line with the state-of-the-art of Personality Computing, the latter are predicted with higher effectiveness (correlation up to 0.68 between actual and predicted traits).
Index Terms—Computational Aesthetics, Personality Computing, Big Five Personality Traits, Automatic Personality Perception, Automatic Personality Recognition.
I. INTRODUCTION
Is a picture worth a thousand words? It seems to be so when it comes to mobile technologies and social networking platforms: taking pictures is the action most commonly performed with mobile phones (82% of the American users), followed by exchanging text messages (80% of the users) and accessing the Internet (56% of the users) [1]. Furthermore, 56% of the American Internet users either post online original pictures and videos (46% of the total Internet users) or share and redistribute similar material posted by others (41% of the total Internet users). In other words, "photos and videos have become key social currencies online" [2].

Picture streams are "often seen as a substitute for more direct forms of interaction like email" [3] and interacting with connected individuals appears to be one of the main motivations behind the use of online photo-sharing platforms [4]. The pervasive use of liking mechanisms, online actions allowing users to publicly express appreciation for a given online item, further confirms the adoption of pictures as a social glue [5]: on Flickr, likes between connected users are roughly 10^5 times more frequent than those between non-connected ones [6].
Besides helping to maintain connections and express affiliation, liking mechanisms are a powerful means of possibly involuntary self-disclosure: statistical approaches can infer personality traits and hidden, privacy sensitive information
¹University of Verona (Italy), ²Italian Institute of Technology (Italy), ³University of Glasgow (UK).
(e.g., political views, sexual orientation, alcohol consumption habits, etc.) from likes and Facebook profiles [7]. Furthermore, online expressions of aesthetic preferences convey an impression in terms of characteristics like prestige, differentiation or authenticity [8]. For this reason, this article proposes new approaches, based on Computational Aesthetics, capable of inferring the personality traits of Flickr users, both self-assessed and attributed by others, from the pictures they tag as favorite. In other words, this article proposes an approach aimed at mapping the pictures people like into personality traits.
Regression analysis appears to be the most suitable computational framework for the problem. This applies in particular to Multiple Instance Regression (MIR) [9], [10] because Flickr users typically tag several pictures as favorite. Therefore, there are multiple instances (the favorite pictures) in a bag (the user that tags the pictures as favorite) associated with a single value (a personality trait of the user). Previous approaches show that it is possible to infer the aesthetic preferences of people through an appropriate weighting of their favorite pictures [11]. This work extends such a principle to the inference of personality traits and addresses the problem with a multiple instance strategy. In particular, this article proposes a set of novel methods that build an intermediate representation of the pictures - using topic models - and then perform regression in the resulting space, thus improving the performance of standard MIR approaches operating on the raw features extracted from the pictures.
The experiments have been performed over PsychoFlickr, a corpus of 60,000 pictures tagged as favorite by 300 Flickr Pro users¹ (200 randomly selected favorite pictures per user). For each user, the corpus includes two personality assessments. The first has been obtained by asking the users to self-assess their own Big-Five Traits, while the second has been obtained by asking 12 independent assessors to rate the Big-Five Traits of the users (see Section III for details). In this way, according to the terminology introduced in [12], it is possible to perform both Automatic Personality Recognition (APR) and Automatic Personality Perception (APP), i.e. the prediction of self-assessed and attributed traits, respectively.
The reason for addressing both APR and APP is that self-assessed and attributed traits tend to relate differently to different aspects of an individual [13] and, therefore, both
¹At the moment of the data collection, Pro users were individuals paying for a premium Flickr account.
need to be investigated. In the case of this work, the results suggest that favorite pictures account only to a limited extent for self-assessed traits (the best APR result is a correlation of 0.26 between actual and predicted traits) while they have a major impact on attributed ones (the best APP result is a correlation of 0.68 between actual and predicted traits). To the best of our knowledge, this is one of the few works where APP and APR have been compared over the same data (see [12] for an extensive survey). This is an important advantage because it allows one to assess the effectiveness of a given type of behavioural evidence (the favorite pictures in this case) in conveying information about personality.
The results above show that the proposed approach is more effective in the case of the attributed traits, i.e. in the case of APP. While not necessarily corresponding to the actual traits of people, attributed traits are still predictive of important aspects of social life [13]. In particular, attributed traits determine, to a significant extent, the way others behave towards a given individual, especially in the earliest stages of an interaction [14]. Furthermore, sociologists have observed that the social identity of an individual does not result only from her actual characteristics, but also from the characteristics attributed by others: “We need to recognise that identification is often most consequential as the categorisation of others, rather than as self-identification” [15]. For this reason, the literature proposes approaches aimed at predicting both self-assessed and attributed traits [12] and this work addresses both problems.
The rest of this paper is organised as follows: Section II surveys previous work, Section III presents the PsychoFlickr Corpus, Section IV introduces the low-level features extracted from the pictures, Section V describes the new MIR approaches developed for this work, Section VI reports on experiments and results and the final Section VII draws some conclusions.
II. PREVIOUS WORK
The experiments of this work lie at the crossroads between Computational Aesthetics and Personality Computing (see [12] and [16] for extensive surveys). To the best of our knowledge, no other works have addressed the problem of mapping favorite pictures into personality traits. However, several works have addressed separately the inference of aesthetic preferences from pictures and the inference of personality traits (both attributed and self-assessed) from social media material. The rest of this section proposes a survey of the main works presented in both areas.
A. Computational Aesthetics
Computational Aesthetics (CA) targets "[...] computational methods that can make applicable aesthetic decisions in a similar fashion as humans can" [17]. In the particular case of pictures, the goal of CA is typically to predict automatically whether human observers like a given picture or not (for a more general discussion on the relation between technology and aesthetics, see [18]). In most cases, the task corresponds to a binary classification, i.e. to predict automatically whether a picture has been rated high or low in terms of visual pleasantness [19], [20], [21], [22], [23]. Unlike Implicit Tagging [24], CA does not try to measure or detect the reaction of people to get an indication of what the content can be (e.g., by tagging as "funny" an image when a person laughs at it). The sole target of CA is to identify image properties that discriminate between appealing pictures and the others.
The experiments of [19] aim at predicting whether a picture has been rated as visually pleasant or not (the two classes correspond to top and bottom pleasantness ratings assigned by human observers, respectively). The experiments are performed over a set of 1,664 images downloaded from the web. The features extracted from the images account for the properties of color, composition and texture. The classification accuracy achieved with Support Vector Machines is higher than 70%. Similar experiments are proposed in [20] over 6,000 images (3,000 per class). The features account for composition, lighting, focus control and color. The main difference with respect to the other approaches presented in this section is that the processing focuses on the subject region and on its difference with respect to the background. The classification accuracy is higher than 90%.
In the case of [21], the experiments are performed over digital images of paintings and the task is the discrimination between high and low quality paintings. The features include color distribution, brightness properties (accounting for the use of light), use of blurring, edge distribution, shape of picture segments, color properties of segments, contrast between segments, and focus region. In this case as well, the task is a binary classification and the experiments are performed over 100 images. The best error rate is around 35%. In a similar vein, the approach proposed in [22] detects the subject of a picture first and then extracts features that account for the difference between foreground and background. The features account for sharpness, contrast and exposure, and the experiments are performed over a subset of the pictures used in [19]. Like in the other works presented so far, the task is a binary classification and the accuracy is 78.5%. The work in [23] adopts an alternative approach for what concerns the features. Rather than using features inspired by good practices in photography, like the other works presented so far, it uses features like SIFT and descriptors like the Bag of Words or the Fisher Vector. While being general purpose, these are expected to encode the properties that distinguish between pleasant and non-pleasant images. The experiments are performed over 12,000 pictures and the accuracy in a binary classification task (6,000 pictures per class) is close to 90%.
B. Personality and Social Media
The literature proposes several works investigating the interplay between the traces that people leave on social media (posts, pictures, profiles, likes, etc.) and personality traits, both self-assessed [25], [26], [32], [27], [28], [29] and attributed [29], [30], [31]. Table I contains a synopsis of the works presented in this section².

Ref.   Subj.   Samples                     Features                                    Task   Ext.       Agr.       Con.       Neu.       Ope.       Other
[25]   167     167 Facebook profiles       profile info., egocentric networks, LIWC   R      0.12 MAE   0.10 MAE   0.10 MAE   0.11 MAE   0.10 MAE
[26]   209     209 RenRen profiles         profile info., usage statistics,           C(2)   83.8 F     69.7 F     82.4 F     74.9 F     81.1 F
                                           emotional states                           C(3)   71.7 F     72.3 F     70.1 F     71.0 F     69.5 F
[27]   156     473 posts on FriendFeed     some LIWC categories                       U      average accuracy 63.1
[28]   10000   10000 blog posts            LIWC                                       C(2)   80.0 ACC
[29]   300     60,000 favorite pictures    visual patterns, aesthetic preferences     R      0.19 ρ     0.17 ρ     0.22 ρ     0.12 ρ     0.17 ρ
[30]   440     440 pictures                photo content, appearance                  CA     see text
[31]   5216    5216 social media profiles  presence of personal information           CA     see text

TABLE I: The table reports, from left to right, the number of subjects involved in the experiments, number and type of behavioral samples, main cues, type of task and performance over different traits. LIWC stands for Linguistic Inquiry Word Count (a psychologically oriented text analysis approach). The column "Other" refers to works using models different from the Big-Five. R stands for regression, U stands for unsupervised classification, C(n) for classification with n classes, and CA for correlational analysis. The performance is reported in terms of Mean Absolute Error (MAE), F-Measure (F), accuracy (ACC) and correlation (ρ). The performances are not reported for comparison purposes (the results have been obtained over different data), but to provide full information about the works described.
The approach proposed in [25] infers the self-assessed personality traits of 167 Facebook users from the absence or presence of certain items (e.g., political orientation, religion, etc.) in the profile. The results, obtained with regression approaches based on Gaussian Processes and the M5 algorithm, correspond to a mean absolute error lower than 0.15. Given that personality scores are defined along a 5-point scale, such an average error can be considered low, but it is unclear whether it is roughly the same for all subjects or whether it tends to be low for people at the extremes and high for those in the middle of the scales. Furthermore, the low number of training items (roughly 150) and the high number of features might have led to overfitting. In a similar way, the experiments presented in [26] predict whether 209 users of RenRen (a popular Chinese social networking platform) are in the lowest, middle or highest third of the observed personality scores. The features adopted in such a work include usage measures such as the post frequency, the number of uploads, etc. The results show an F-measure up to 72%, depending on the trait. However, the performance seems to be higher for those traits where one of the three classes is more represented than the others, so the improvement is small with respect to a basic approach that always outputs the most represented class. APR on Facebook profiles was the subject of an international benchmarking campaign³, the results of which appear in [32]. The main indication of this initiative is that selection techniques applied to large sets of initial features lead to the highest performances. However, the experimental setup adopted for the challenge (participants had all the data at their disposal from the beginning) cannot exclude overfitting. In particular, it is unclear whether feature selection techniques have been applied only to the training set or to the entire corpus (if this is the case, features have been selected using information from the test set), thus overestimating the performance.

²Section III-B provides an introduction to personality and its measurement. Should the reader be unfamiliar with these concepts, reading Section III-B can help to better understand Section II-B and the content of Table I.

³http://mypersonality.org/wiki/doku.php?id=wcpr13
Given the difficulty of collecting large amounts of self-assessments, two approaches propose to use measures like the number of connections or the lexical choices in posts as a criterion to assign personality scores to social media users [27], [28]. In the case of [27], the proposed methodology first measures whether features like the use of punctuation (e.g., exclamation marks) or emoticons are stable for a given user, then it uses the most stable features to assign personality traits. The results show an accuracy of 63% in predicting the actual self-assessed traits of 156 users of FriendFeed, an Italian social network. The most interesting novelty of this work is the attempt to avoid the collection of assessments, an expensive and time-consuming process, through unsupervised approaches trying to assign similar personality profiles to users similar in terms of online activities. However, the performances are not sufficient to actually replace the collection of self-assessed traits. Similar considerations apply to the case of [28], where the authors simply label the users as extravert or introvert depending on how many connections they have. The resulting labels are then predicted automatically using lexical choices, i.e. calculating how frequently people use words falling in the different categories of the Linguistic Inquiry Word Count.
While the works presented so far address the APR problem, the experiments presented in [30], [31] target APP. In particular, both works investigate the agreement between the self-assessed traits of social media users and the traits that observers attribute to them after the users post material online. In the case of [30], the focus is on profile pictures and the experiments show that the agreement is higher when the picture subjects smile and do not wear hats. The other work [31] performs a similar analysis over profiles showing personal information and the results show that the agreement improves when people post information about their spiritual and religious beliefs, their most important sources of satisfaction, and material they consider to be funny. However, in both these works the ratings were made by one assessor only and, therefore, it is unclear whether they can account for the average impression a subject conveys.

Fig. 1: The figure shows the distribution of self-assessed and attributed traits (BFI-10 scores from -4 to 4, one panel per trait: Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism).
To the best of our knowledge, the work in [29] is the only one that addresses both APR and APP using the same data. The experiments are performed over the PsychoFlickr Corpus, the same dataset used in the experiments of this article (see Section III). The results show that the agreement between self-assessed and attributed traits ranges between 0.32 and 0.55 depending on the trait. Furthermore, experiments based on Counting Grid models and Lasso regression approaches lead to a correlation up to 0.62 for the attributed traits and up to 0.22 for the self-assessed ones. Compared to the work in [29], the experiments of this article propose entirely different approaches, include the comparison between seven different methodologies based on Multiple Instance Regression, present a more extensive correlational analysis and provide a full description of the features adopted. In other words, this work includes substantial novelties and differences with respect to the results of [29].
Overall, the best performances tend to be observed for Extraversion and Conscientiousness (see Table I). This is not surprising because personality psychologists have observed that these are the traits that human observers tend to perceive more clearly as well [33]. However, there are specific contexts where traits typically difficult to observe become more available, i.e. more accessible to human observers and, therefore, easier to predict [13]. This is the case, e.g., of the higher performances obtained for Openness in [29], [31].
III. PSYCHOFLICKR: PICTURES AND PERSONALITY

The experiments of this work are performed over PsychoFlickr⁴, a corpus designed to investigate the interplay between aesthetic preferences and personality traits. The corpus includes pictures that 300 Flickr users have tagged as
⁴The corpus is available at http://vips.sci.univr.it/dataset/psychoflickr/PsychoFlickr.rar
favorite (200 pictures per user for a total of 60,000 samples). Furthermore, for each user, the corpus includes both self-assessed and attributed traits (see Section III-B). Therefore, PsychoFlickr allows one to perform both APP and APR experiments.
A. The Subjects
The subjects included in the corpus were recruited through a word-of-mouth process. A few Flickr Pro users were contacted personally and asked to involve other Pro users in the experiment (typically through the social networking facilities available on Flickr). The process was stopped once the first 300 individuals answered positively. The resulting pool of users includes 214 men (71.3% of the total) and 86 women (28.7% of the total). The age at the moment of the data collection is available only for 44 subjects (14.7% of the total)⁵. These participants are between 20 and 62 years old and the average age is 39. However, it is not possible to know whether this is representative of the entire pool. The nationality is available for 288 users (96.0% of the total) that come from 37 different countries. The most represented ones are Italy (153 subjects, 51% of the total), United Kingdom (31 subjects, 10.3% of the total), United States (28 subjects, 9.3% of the total), and France (13 participants, 4.3% of the total).
B. Personality and its Measurement
Personality is the latent construct that accounts for "individuals' characteristic patterns of thought, emotion, and behavior together with the psychological mechanisms - hidden or not - behind those patterns" [34]. The literature proposes a large number of personality models and PsychoFlickr adopts the Big-Five (BF) Traits, five broad dimensions that have been shown to capture most individual differences [35]. The reason behind this choice is twofold: On the one hand the BF is the model most commonly applied in both personality computing [12] and personality science [13]. On the other hand, the BF model represents personality in terms of five numerical scores, a form particularly suitable for computer processing.
The scores of the BF model account for how well the behavior of an individual fits the tendencies associated to the BF Traits, i.e. Openness (tendency to be intellectually open, curious and have wide interests), Conscientiousness (tendency to be responsible, reliable and trustworthy), Extraversion (tendency to interact and spend time with others), Agreeableness (tendency to be kind, generous, etc.) and Neuroticism (tendency to experience the negative aspects of life, to be anxious, sensitive, etc.).
In the BF framework, assessing the personality of an individual means calculating the five scores corresponding to the traits above. The literature proposes several questionnaires designed for such a task (see [12] for a list of the most important ones). The personality assessments of PsychoFlickr have been obtained with the BFI-10 [36], a list of ten items
⁵Personal information is extracted from the Flickr profiles, where the users are free to decide what to disclose.
Self-Assessment                     Attribution                                 Trait
I am reserved                       The user is reserved                        Ext
I am generally trusting             The user is generally trusting              Agr
I tend to be lazy                   The user tends to be lazy                   Con
I am relaxed, handle stress well    The user is relaxed, handles stress well    Neu
I have few artistic interests       The user has few artistic interests         Ope
I am outgoing, sociable             The user is outgoing, sociable              Ext
I tend to find fault with others    The user tends to find fault with others    Agr
I do a thorough job                 The user does a thorough job                Con
I get nervous easily                The user gets nervous easily                Neu
I have an active imagination        The user has an active imagination          Ope

TABLE II: The BFI-10 [36] is the short version of the Big-Five Inventory. Each item is associated with a Likert scale whose answers are mapped into numbers from -2 ("Strongly disagree") to 2 ("Strongly agree") and contributes to the integer score (in the interval [−4, 4]) of a particular trait (see the third column). The table shows the questionnaire in both the self-assessment and the attribution version.
associated with 5-point Likert scales ranging from "Strongly Disagree" to "Strongly Agree". The main advantage of the BFI-10 is that it can be filled in less than one minute while still providing reliable results. Table II shows the BFI-10 for both assessment and attribution of personality traits. In the self-assessment case, people fill in the questionnaire about themselves and the result is a personality self-assessment (necessary to perform APR experiments). In the attribution case, people fill in the questionnaire about others and the result is a personality attribution (necessary to perform APP experiments).
The 300 Flickr users included in the corpus were asked to fill in the self-assessment version of the BFI-10 and were offered a short analysis of the outcome as a reward for their participation. The chart of Figure 1 shows the distribution of the scores for each trait. In line with the observations of the literature, the self-assessments tend to be biased towards socially desirable characteristics (e.g., high Conscientiousness and low Neuroticism) [37]. In the case of PsychoFlickr, this applies in particular to Openness, the trait of intellectual curiosity and artistic inclinations. A possible explanation is that the pool of subjects includes individuals that tend to consider photography as a form of artistic expression. However, no information is available about this aspect of the users.
In parallel, 12 independent judges were hired to attribute personality traits to the 300 subjects of the corpus. The judges are fully unacquainted with the users and they are all from the same country (Italy) to ensure cultural homogeneity. They were asked to watch the 200 pictures tagged as favorite by each user and, immediately after, to fill in the attribution version of the BFI-10 ("Attribution" column of Table II). Each judge has assessed all 300 users and the 12 assessments available
for each user were averaged to obtain the attributed traits. The judges were paid 95 Euros for their work. The chart of Figure 1 shows the resulting distribution of scores. Since the judges are fully unacquainted with the users and all they know about the people they assess are the favorite pictures, the ratings tend to peak around 0, the score associated to the expression “Neither agree nor disagree”.
Research on consensus in the perception of the Big-Five suggests measuring the agreement between raters in terms of the percentage of rating variance shared across judges [38], [39]. The percentage can be measured by performing a two-way Analysis of Variance of the ratings [40]. If $r_{ij}$ is the score that judge $j$ assigns to subject $i$ for a particular trait, then the grand mean $\bar{r}$ of the scores is
\bar{r} = \frac{1}{SR} \sum_{i=1}^{S} \sum_{j=1}^{R} r_{ij}, \quad (1)
where S is the total number of subjects and R is the total number of raters. Correspondingly, it is possible to define the mean squares for subjects, raters, cells and interaction between raters and subjects as follows [40]:
\mu_s^2 = \frac{R}{S-1} \sum_{i=1}^{S} \left( \frac{1}{R} \sum_{j=1}^{R} r_{ij} - \bar{r} \right)^2
\mu_r^2 = \frac{S}{R-1} \sum_{j=1}^{R} \left( \frac{1}{S} \sum_{i=1}^{S} r_{ij} - \bar{r} \right)^2
\mu_c^2 = \frac{1}{RS-1} \sum_{i=1}^{S} \sum_{j=1}^{R} (r_{ij} - \bar{r})^2
\mu_{s \times r}^2 = \mu_c^2 - \mu_r^2 - \mu_s^2 \quad (2)
The expressions above allow one to define the percentage α of variance shared across raters as follows:
\alpha = \frac{\sigma_s^2}{\sigma_s^2 + \sigma_r^2 + \sigma_{s \times r}^2}, \quad (3)

where $\sigma_s^2 = (\mu_s^2 - \mu_{s \times r}^2)/R$ is the estimate of the subject variance, $\sigma_r^2 = (\mu_r^2 - \mu_{s \times r}^2)/S$ is the estimate of the raters' variance and $\sigma_{s \times r}^2 = \mu_{s \times r}^2$ is the estimate of the variance of the interaction between raters and subjects. This latter is the component of the variance that cannot be associated only to raters or only to subjects and, hence, it is associated to the interaction between the two. In the case of the data used in this work, the α values are as follows: 0.08 for Openness, 0.14 for Conscientiousness, 0.28 for Extraversion, 0.19 for Agreeableness and 0.24 for Neuroticism.
The problem left open is whether the α values above can be considered acceptable or not. According to a study presented in [38] - to the best of our knowledge, the most extensive investigation of the problem so far - the median values of α over 9 articles on the perception of the Big-Five at zero acquaintance (the same situation as this article) are as follows: 0.07 for Openness, 0.13 for Conscientiousness, 0.32 for Extraversion, 0.03 for Agreeableness and 0.07 for Neuroticism. The comparison with the α values for this article (see above) confirms that the agreement level observed in this work is compatible with the agreement levels observed and accepted in the zero acquaintance personality perception literature [38], [39].
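As a worked illustration of Eqs. (1)-(3), the following is a minimal sketch that estimates α from an S × R matrix of ratings for one trait; the function name and the synthetic data are illustrative, not part of the corpus:

```python
import numpy as np

def shared_variance_alpha(r):
    """Percentage of rating variance shared across raters (Eqs. 1-3),
    via a two-way ANOVA decomposition of an S x R matrix r, where
    r[i, j] is the score that judge j assigns to subject i."""
    S, R = r.shape
    grand = r.mean()                                    # Eq. (1)
    mu2_s = R / (S - 1) * ((r.mean(axis=1) - grand) ** 2).sum()
    mu2_r = S / (R - 1) * ((r.mean(axis=0) - grand) ** 2).sum()
    mu2_c = ((r - grand) ** 2).sum() / (R * S - 1)
    mu2_sr = mu2_c - mu2_r - mu2_s                      # Eq. (2)
    sigma2_s = (mu2_s - mu2_sr) / R                     # subject variance
    sigma2_r = (mu2_r - mu2_sr) / S                     # rater variance
    return sigma2_s / (sigma2_s + sigma2_r + mu2_sr)    # Eq. (3)

# Illustrative call: 300 subjects rated by 12 judges on one trait.
ratings = np.random.uniform(-4, 4, size=(300, 12))
print(shared_variance_alpha(ratings))
```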
Category              Name                         d    Short Description
Color                 HSV statistics               5    Average of S channel and standard deviation of S, V channels [41]; circular variance in HSV color space [42]; use of light as the average pixel intensity of V channel [43]
                      Emotion-based                3    Measurement of valence, arousal, dominance [41], [44]
                      Color diversity              1    Distance w.r.t. a uniform color histogram, by Earth Mover's Distance (EMD) [43], [41]
                      Color name                   11   Amount of black, blue, brown, green, gray, orange, pink, purple, red, white, yellow [41]
Composition           Edge pixels                  1    Percentage of pixels classed as edge by the Canny detector [45]
                      Level of detail              1    Number of regions (after mean shift segmentation) [46], [47]
                      Average region size          1    Average size of the regions (after mean shift segmentation) [46]
                      Low depth of field (DOF)     3    Amount of focus sharpness in the inner part of the image w.r.t. the overall focus [43], [41]
                      Rule of thirds               2    Average of S, V channels over inner rectangle [43], [41]
                      Image size                   1    Size of the image [43], [45], [48]
Textural Properties   Gray distribution entropy    1    Image entropy [45]
                      Wavelet based textures       12   Level of spatial graininess measured with a three-level (L1, L2, L3) Daubechies wavelet transform on HSV channels [43]
                      Tamura                       3    Amount of coarseness, contrast, directionality [49]
                      GLCM features                12   Amount of contrast, correlation, energy, homogeneousness for each HSV channel [41]
                      GIST descriptors             24   Output of GIST filters for scene recognition [50]
Faces                 Faces                        1    Number of faces (extracted manually)

TABLE III: Synopsis of the features. Every image is represented with 82 features split in four major categories: Color, Composition, Textural Properties, and Faces. This latter is the only feature that takes into account the picture's content.
IV. FEATURE EXTRACTION
The goal of this work is to map the favorite pictures of Flickr users into personality traits. The features adopted in this work build on the contributions of Computational Aesthetics (see Section II) because these have been designed to account for the properties that make pictures visually appealing. The main assumption behind this choice is that pictures are often tagged as favorite for personal reasons (e.g., they show friends and relatives or are related to fond memories) [6], but the raters cannot access these motivations and can only access the appearance of the pictures. Popular feature extraction techniques such as SIFT and HOG have not been considered because they were originally conceived for other purposes, even if they have been shown to be effective in some tasks related to CA (see Section II). Furthermore, one of the main goals of this work is to show the very feasibility of a task like mapping favorite pictures into attributed traits. In this respect, the exploration of a wider spectrum of features can come at a later stage of the investigation.
A synopsis of the features adopted in this work is available in Table III. The features cover a wide, though not exhaustive, spectrum of visual characteristics and are grouped into three main categories: color, composition and textural properties. This follows the taxonomy proposed in [41], but it excludes the content category to make the process more robust with respect to the wide semantic variability of Flickr images. The only exception is a feature that counts the number of faces because these are ubiquitous in the images and, furthermore, the human brain is tuned to their detection in the environment [51].
A. Color
The feature extraction process represents colors with the HSV model, from the initials of Hue, Saturation and Value
(this latter is often referred to as Brightness). This section describes features related to colors and their use.
HSV statistics: These features account for the use of colors and are based on statistics collected over the H, S and V pixel values observed in a picture (see Figure 2). The H channel provides information about color diversity through its circular variance $R$ [42]:

A = \sum_{k=1}^{K} \sum_{l=1}^{L} \cos H_{kl}, \qquad B = \sum_{k=1}^{K} \sum_{l=1}^{L} \sin H_{kl}, \qquad R = 1 - \frac{1}{KL} \sqrt{A^2 + B^2}
where $H_{kl}$ is the Hue of pixel $(k, l)$, $K$ is the image height and $L$ is the image width. On the S and V channels we compute the average and standard deviation; in particular, the average Saturation indicates chromatic purity, while the average over the V channel is called use of light and corresponds to a fundamental observation of image aesthetics, i.e. that underexposed or overexposed pictures are usually not considered aesthetically appealing [43]. The average of the Hue was not calculated because it cannot be associated to an intensity attribute (low, high), being an angular measure. Figure 2 provides examples of how the pictures change according to HSV statistics.
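As an illustration, here is a minimal sketch of the HSV statistics under OpenCV's conventions (H in [0, 180), S and V in [0, 255]); the function name is illustrative:

```python
import cv2
import numpy as np

def hsv_statistics(path):
    """Sketch of the HSV statistics: hue circular variance R, mean and
    standard deviation of Saturation, standard deviation of Value and
    use of light (mean of the V channel)."""
    img = cv2.imread(path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float64)
    h = hsv[:, :, 0] * np.pi / 90.0        # [0, 180) -> [0, 2*pi) radians
    s = hsv[:, :, 1] / 255.0
    v = hsv[:, :, 2] / 255.0
    # R = 1 - sqrt(A^2 + B^2) / KL, with A and B as defined above
    circ_var = 1.0 - np.hypot(np.cos(h).sum(), np.sin(h).sum()) / h.size
    return {"hue_circular_variance": circ_var,
            "avg_saturation": s.mean(), "std_saturation": s.std(),
            "std_value": v.std(), "use_of_light": v.mean()}
```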
Emotion-based: Saturation and Brightness can elicit emotions according to the following equations, resulting from psychological studies (Valence, Arousal and Dominance are dimensions commonly adopted to represent emotions) [44]:

\text{Valence} = 0.69 \cdot \bar{V} + 0.22 \cdot \bar{S} \quad (4)
\text{Arousal} = -0.31 \cdot \bar{V} + 0.60 \cdot \bar{S} \quad (5)
\text{Dominance} = -0.76 \cdot \bar{V} + 0.32 \cdot \bar{S} \quad (6)

where $\bar{V}$ and $\bar{S}$ are the averages of V and S over an image, respectively. See Figure 2 for examples of pictures with different levels of Valence, Arousal and Dominance.

Fig. 2: The figure shows examples of how the visual properties of a picture change according to several color-related features (use of light, average saturation, valence, arousal, dominance, hue circular variance, color diversity).
Color diversity (Colorfulness): The feature distinguishes multi-colored images from monochromatic, sepia or low-contrast pictures. Following the approach in [43], the image under analysis is converted into the CIELUV color space and its color histogram is computed; this representation is compared, in terms of Earth Mover's Distance, with the histogram of an ideal image where the distribution over the colors is uniform, that is, a histogram where all the bins have the same value (Figure 2 shows examples of pictures with different colorfulness).
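A possible implementation of the colorfulness feature is sketched below; the 8-bins-per-channel LUV histogram and the use of bin indices as EMD ground positions are assumptions, since the exact discretization is not specified here:

```python
import cv2
import numpy as np

def colorfulness_emd(img_bgr, bins=8):
    """Sketch of color diversity: Earth Mover's Distance between the
    image's LUV histogram and the histogram of an ideal image whose
    color distribution is uniform (all bins equal)."""
    luv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LUV)
    hist = cv2.calcHist([luv], [0, 1, 2], None, [bins] * 3,
                        [0, 256] * 3).flatten()
    hist /= hist.sum()
    # Ground positions: the 3D index of each histogram bin.
    pos = np.stack(np.meshgrid(*[np.arange(bins)] * 3, indexing="ij"),
                   axis=-1).reshape(-1, 3).astype(np.float32)
    sig_img = np.hstack([hist[:, None], pos]).astype(np.float32)
    uniform = np.full((hist.size, 1), 1.0 / hist.size, dtype=np.float32)
    sig_uni = np.hstack([uniform, pos])
    emd, _, _ = cv2.EMD(sig_img, sig_uni, cv2.DIST_L2)
    return emd
```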
Color name: Every pixel of an image can be assigned to one of the following classes identified in [52]: black, blue, brown, grey, green, orange, pink, purple, red, white and yellow. The sum of pixels across the classes above accounts not only for how frequently the colors appear in an image, but also for the style of a photographer. The classification of the pixels is performed using the algorithm proposed in [53] and it mimics the way humans label chromatic information in an image. The fractions of pixels belonging to each of the classes above are used as features.
B. Composition
The term composition refers to the organization of visual elements across an image regardless of its actual content. The features described in this section aim at capturing such an aspect of the pictures.
Edge pixels: The structure of an image depends, to a significant extent, on the edges, i.e. on those points where the image brightness shows discontinuities. Therefore, the feature extraction process adopts the Canny detector [45] to identify
the edges and calculate the fraction of pixels in an image that lie on an edge (see Figure 3 for an example of edge extraction in an image).
Level of detail: images can be partitioned or segmented into multiple regions, i.e. sets of pixels that share common visual characteristics. The feature extraction process segments the images using the EDISON implementation [46] of the mean shift algorithm [47], and provides two features: i) the number of segments, accounting for the fragmentation of the image, and ii) the normalized average extension of the regions, that is, the mean area of the regions divided by the area of the whole image. On average, the more the details, the more the segments (see Figure 3).
Low depth of field (DOF) indicator: An image with low depth of field corresponds to a shot where the object of interest is sharper than the background, drawing the attention of the observer [41], [43] (see Figure 3). To detect low DOF, it is assumed that the object of interest is central; the image is thus decomposed into wavelet coefficients (see the next section), which measure the frequency content of a picture: in particular, high frequency coefficients (formally, level 3 in the notation of Eq. (10)) encode fine visual details. The low DOF indicator is the ratio between the high frequency wavelet coefficients of the inner part of the image and those of the whole image. Specifically, the image is divided into 16 equal rectangular blocks $M_1, \dots, M_{16}$, numbered in row-major order. Let $w_3 = \{w_3^{h}, w_3^{v}, w_3^{d}\}$ (with $h = HL$, $v = LH$, $d = HH$) denote the set of wavelet coefficients in the high frequency band of the hue image $I_H$. The low DOF indicator $DOF_H$ for hue is computed as follows:

DOF_H = \frac{\sum_{(k,l) \in M_6 \cup M_7 \cup M_{10} \cup M_{11}} w_3(k, l)}{\sum_{i=1}^{16} \sum_{(k,l) \in M_i} w_3(k, l)} \quad (7)

A high $DOF_H$ indicates an apparent low depth of field in the image. The low depth of field indicators for the Saturation and the Brightness channels are computed similarly on the corresponding image channels. Figure 3 shows the difference between images with different Depth of Field.
The rule of thirds: Any image can be ideally divided into nine blocks, arranged in a 3 × 3 grid, by two equally-spaced horizontal lines and two equally-spaced vertical lines. The rule of thirds is a photography composition guideline that suggests positioning the important visual elements of a picture along such lines or at their intersections. In other words, it suggests where the most salient objects should lie in the image. The rule of thirds feature simplifies the photographic technique by analyzing the central block of the image, keeping the average values of Saturation and Brightness [41], [43]:

f_S = \frac{9}{KL} \sum_{k=K/3}^{2K/3} \sum_{l=L/3}^{2L/3} S_{kl} \quad (8)

where $K$ is the image height, $L$ is the image width and $S_{kl}$ is the Saturation at pixel $(k, l)$. A similar feature $f_V$ can be calculated for the Brightness.
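A sketch of Eq. (8); since the central block of the 3 × 3 grid contains roughly KL/9 pixels, the feature reduces to the mean of the channel over that block (the same function applied to V gives f_V):

```python
import numpy as np

def rule_of_thirds(channel):
    """Sketch of Eq. (8): average of a channel (S or V, scaled to
    [0, 1]) over the central block of the 3x3 grid."""
    K, L = channel.shape
    inner = channel[K // 3: 2 * K // 3, L // 3: 2 * L // 3]
    return inner.mean()    # equals (9 / KL) * sum over the inner block
```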
Image size: the size of the image is calculated as the total number of pixels and used as feature.
Fig. 3: The figure shows the effect of the Canny algorithm (original vs. processed image) and examples of the visual properties associated to Level of Detail and Low Depth of Field.
C. Textural Properties
A texture is the spatial arrangement of intensity and colors in an image or in an image region. Textures capture perceptual aspects (e.g., they are more evident in sharp images than in blurred ones) and provide information about the subject of an image (e.g., textures tend to be more regular in pictures of artificial objects than in those of natural landscapes). The features described in this section aim at capturing textural properties.
Entropy: The entropy serves as a feature to measure the homogeneousness of an image. The image is first converted into gray levels; then, for each pixel, the distribution of the gray values in a neighborhood of 9 × 9 pixels is calculated (that is, the gray level histogram of the patch) and the entropy of the distribution is computed. Finally, all the entropy values are summed and divided by the size of the image. The more uniform the intensity across the image, the lower the entropy (see Figure 4 for the impact of Entropy on visual characteristics). In the expression below, $P_i$ is the probability that the difference between two adjacent pixels is equal to $i$, and $\log_2$ is the base 2 logarithm:

E = -\sum_{i} P_i \log_2 P_i \quad (9)
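A sketch of the entropy feature following the neighborhood-based description above, using scikit-image's local rank entropy as an approximation (window size 9 × 9 as stated; the helper name is illustrative):

```python
import numpy as np
from skimage import color, img_as_ubyte
from skimage.filters.rank import entropy
from skimage.morphology import square

def entropy_feature(img_rgb):
    """Sketch of the gray distribution entropy: per-pixel entropy of the
    gray-level histogram over a 9x9 neighborhood, averaged over the
    image (sum of entropies divided by the image size)."""
    gray = img_as_ubyte(color.rgb2gray(img_rgb))
    local_e = entropy(gray, square(9))    # per-pixel local entropy
    return local_e.mean()
```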
Wavelet textures: Daubechies wavelet transform can measure the spatial smoothness/graininess in images [43], [41]. The 2D Discrete Wavelet Transform (2D-DWT) of an image aims at analyzing its frequency content, where high frequency can be associated intuitively to high edge density. The output of a 2D-DWT can be visualized as a multilevel organization of square patches (see Figure 5). Each level corresponds to a given frequency analysis of the original image. In the first level of decomposition, the image is separated into four parts. Each of them has a quarter size of the original image, and a label. The upper left part is labeled LL (LowLow) and is a low-pass version of the original image. The vertical LH (LowHigh), horizontal HL (HighLow) and diagonal HH (HighHigh) parts can be assumed as images where vertical, horizontal and diagonal edges at the finest scale are highlighted. We can call them edge images at level 1. The subdivision can be further applied to find coarser edges, as the figure shows, performing again the wavelet transform to
Fig. 4: The figure shows examples of pictures where the value of the textural features (entropy; Tamura coarseness, contrast and directionality; GLCM contrast, correlation, energy and homogeneity) is high and low.
the coarser (LL coefficients) version at half the resolution, recursively, in order to further decorrelate neighboring pixels of the input image. The Daubechies wavelet transform is a particular kind of wavelet transform, explicitly suited for compression and denoising of images.
The feature extraction process computes a three-level wavelet transform on the H, S and V channels separately. At each level $i$, we have three parts which represent the edge images, called $w_i^h$, $w_i^v$ and $w_i^d$, where $i \in \{1, 2, 3\}$, $h = HL$, $v = LH$ and $d = HH$, to resemble the kind of edges that are highlighted (horizontal, vertical and diagonal, respectively). The wavelet features are defined as follows:

wf_i = \frac{\sum_{k,l} w_i^h(k, l) + \sum_{k,l} w_i^v(k, l) + \sum_{k,l} w_i^d(k, l)}{|w_i^h| + |w_i^v| + |w_i^d|}, \quad (10)

for a total of 9 features (three levels for each of the three channels). The values $k, l$ span the spatial domain of the single $w$ taken into account, and the operator $|\cdot|$ denotes the spatial area of the single $w$. In other words, for each color space channel and wavelet transform level, we average the values of the high frequency coefficients. We extracted three more features by computing, for each HSV channel, the sum of the average wavelet coefficients over all three frequency levels (see Figure 5).
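A sketch of the wavelet features of Eq. (10) using PyWavelets; the Daubechies order ('db1') and the use of coefficient magnitudes in the numerator are assumptions:

```python
import cv2
import numpy as np
import pywt

def wavelet_features(img_bgr):
    """Sketch of Eq. (10): for each HSV channel, a three-level
    Daubechies transform; at each level, the average magnitude of the
    LH, HL and HH coefficients, plus their sum over the three levels."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float64)
    feats = []
    for ch in range(3):                          # H, S, V channels
        coeffs = pywt.wavedec2(hsv[:, :, ch], "db1", level=3)
        per_level = []
        for cH, cV, cD in coeffs[1:]:            # detail coefficients
            num = np.abs(cH).sum() + np.abs(cV).sum() + np.abs(cD).sum()
            den = cH.size + cV.size + cD.size    # |w^h| + |w^v| + |w^d|
            per_level.append(num / den)
        feats.extend(per_level + [sum(per_level)])
    return np.array(feats)                       # 12 features in total
```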
Tamura: In [49], six texture features corresponding to human visual perception have been proposed: coarseness, contrast, directionality, line-likeness, regularity and roughness. The first three have been found particularly important, since they are tightly correlated with human perception, and have been considered in this work. They are extracted from gray level images.
Coarseness: The feature gives information about the size of texture elements. A coarse texture contains a small number of large texels, while a fine texture contains a large number of small texels (see Figure 4). The coarseness measure is computed as follows. Let $X$ be an $I \times J$ matrix of values $X(i, j)$ that can, for instance, be interpreted as gray values:

1) For every point $(i, j)$, calculate the average over neighbourhoods whose sizes are powers of two, e.g. 1 × 1, 2 × 2, 4 × 4, ..., 32 × 32:

A_k(i, j) = \frac{1}{2^{2k}} \sum_{n=1}^{2^k} \sum_{m=1}^{2^k} X(i - 2^{k-1} + n, j - 2^{k-1} + m) \quad (11)

2) For every point $(i, j)$, calculate the difference between the non-overlapping neighbourhoods on opposite sides of the point in the horizontal and vertical directions:

E_k^h(i, j) = |A_k(i + 2^{k-1}, j) - A_k(i - 2^{k-1}, j)| \quad (12)

and

E_k^v(i, j) = |A_k(i, j + 2^{k-1}) - A_k(i, j - 2^{k-1})| \quad (13)

3) At each point $(i, j)$, select the size leading to the highest difference value:

S(i, j) = \arg\max_{k=1,\dots,5} \max_{d=h,v} E_k^d(i, j) \quad (14)

4) Finally, take the average of $2^S$ as a coarseness measure for the image:

F_{crs} = \frac{1}{IJ} \sum_{i=1}^{I} \sum_{j=1}^{J} 2^{S(i,j)} \quad (15)
Contrast: It stands, roughly speaking, for texture quality. It is calculated as

F_{con} = \frac{\sigma}{\alpha_4^z}, \quad \text{with} \quad \alpha_4 = \frac{\mu_4}{\sigma^4} \quad (16)

where $\mu_4 = \frac{1}{KL} \sum_{k=1}^{K} \sum_{l=1}^{L} (X(k, l) - \mu)^4$ is the fourth moment about the mean $\mu$, $\sigma^2$ is the variance of the gray values of the image, and $z$ has experimentally been determined to be 1/4. In practice, contrast is influenced by two factors: the range of gray levels (large for high contrast) and the polarization of the distribution of black and white on the gray-level histogram (polarized for high contrast). For an example, see Figure 4.
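A sketch of the Tamura contrast of Eq. (16):

```python
import numpy as np

def tamura_contrast(gray):
    """Sketch of Eq. (16): standard deviation of the gray values
    normalized by the kurtosis alpha_4 = mu_4 / sigma^4, with z = 1/4."""
    g = gray.astype(np.float64)
    mu = g.mean()
    sigma2 = g.var()                      # variance of the gray values
    mu4 = ((g - mu) ** 4).mean()          # fourth moment about the mean
    alpha4 = mu4 / sigma2 ** 2
    return np.sqrt(sigma2) / alpha4 ** 0.25
```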
Directionality: It models how polarized the distribution of edge orientations is. High directionality indicates a texture where the edges are homogeneously oriented, and vice versa. Given the directions of all the edge pixels, the entropy $E$ of their distribution is calculated; the directionality is then $1/(E + 1)$. Textures with edges oriented along a single direction produce a single-peak distribution, thus $E = 0$ and maximal directionality (= 1). Conversely, pictures whose edge orientations are distributed in a uniform manner have low directionality (∼ 0). For an example, see Figure 4.
Gray-Level Co-occurrence Matrix (GLCM) features: The GLCM is a matrix where the element (i, j) is the probability p(i, j) of observing values i and j for a given channel (H, S or V) in the pixels of the same region W . In the feature extraction process, W includes a pixel and its right neighbor and, therefore, the GLCM includes the probabilities of observing one pixel where the value j is at the right of a pixel where the value is i. The GLCM serves as a basis for calculating several features, each obtained separately over the H, S and V channels [54]:
Contrast: It is the average value of $(i - j)^2$, the square difference of the values observed in neighboring pixels:

C = \sum_{i,j=0}^{L-1} (i - j)^2 p(i, j),

where $L$ is the number of possible values in a pixel. The value of $C$ ranges between 0 (uniform image) and $(L - 1)^2$ (see Figure 4 for examples of pictures with high and low contrast).
Correlation: It is the coefficient that measures the covariation between neighboring pixels:

\sum_{i,j=0}^{L-1} \frac{(i - \mu)(j - \mu) \, p(i, j)}{\sigma^2}, \quad (17)

where $\mu = \sum_{i,j=0}^{L-1} i \, p(i, j)$ and $\sigma^2 = \sum_{i,j=0}^{L-1} p(i, j)(i - \mu)^2 + \sum_{i,j=0}^{L-1} p(i, j)(j - \mu)^2$. The correlation ranges in the interval $[-1, 1]$ (see the lower part of Figure 4).
Energy: It is the sum of the squared values of the GLCM elements: $\sum_{i,j=0}^{L-1} p(i, j)^2$. If an image is uniform, the energy is 1 (see Figure 4).
Homogeneity: It is a measure of how frequently neighboring pixels have the same value:

H = \sum_{i,j=0}^{L-1} \frac{p(i, j)}{1 + |i - j|} \quad (18)

The feature tends to be higher when the elements on the diagonal of the GLCM are larger (see Figure 4).
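A sketch of the GLCM features for a single channel, using scikit-image; the number of quantization levels is an assumption:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(channel, levels=16):
    """Sketch of the GLCM features for one HSV channel: co-occurrence of
    each pixel with its right neighbor, then contrast, correlation,
    energy and homogeneity read off the normalized matrix."""
    q = (channel.astype(np.float64) / 256.0 * levels).astype(np.uint8)
    glcm = graycomatrix(q, distances=[1], angles=[0],
                        levels=levels, normed=True)
    return [float(graycoprops(glcm, prop)[0, 0])
            for prop in ("contrast", "correlation",
                         "energy", "homogeneity")]
```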
Spatial Envelope (GIST): It is a low-dimensional representation of a scene that relies on Gabor filters to capture a set of perceptual dimensions, namely naturalness, openness, roughness, expansion and ruggedness [50]. The outputs of the GIST filters are used as features.
D. Number of Faces
All the features presented so far are content independent, i.e. they do not take into account what the images show. This feature is the only exception because human faces are frequently portrayed in Flickr pictures and, furthermore, there are neural pathways that make the human brain particularly sensitive to faces [51]. In this work, the number of faces was counted manually for each of the 60,000 pictures of the dataset. Every visible face was counted, irrespective of its scale, pose, size and occlusion. Facial expressions were not taken into account. Automatic face detectors were avoided because they are not sufficiently robust to deal with the variability of favorite pictures: these often portray people in unusual poses, and a preliminary analysis showed that the Viola-Jones detector [55] identifies only 70% of the faces in the corpus, introducing noise that is difficult to model and quantify.
Fig. 5: The figure shows how the wavelet decomposition works: a) first-level decomposition of the original image into LL1, LH1, HL1 and HH1; b) recursive decomposition of LL1 into LL2, LH2, HL2 and HH2.
V. INFERENCE OF PERSONALITY TRAITS
This section presents the regression approaches adopted to map the features described in Section IV into personality traits. The regression is performed separately for each of the Big-Five traits because these result from the application of Factor Analysis to behavioural data [56] and, therefore, they are independent. The goal of the experiments is to infer the personality traits, both self-assessed and attributed, from the multiple pictures that a user tags as favorite. Hence, Multiple Instance Regression (MIR) [9], [10] appears to be the most suitable computational framework because it addresses problems where there are multiple instances (the favorite pictures of a Flickr user) in a bag (the Flickr user) associated with one value (the score of the Flickr user for a particular trait). Furthermore, MIR approaches can deal both with cases where only a subset of the bag instances actually accounts for the value to be predicted and with cases where all the bag instances have a role in its definition [57]. In this scenario the latter hypothesis appears the most reasonable, but no experiments have been carried out to actually pinpoint which image(s) is (are) more significant in determining the personality score.
Before applying the regression approaches (both the baselines and the proposed approaches), the features are discretized using Q = 6 quantization levels (Q values between 3 and 9 were tested, but no significant performance differences were observed; Q = 6 gave a slightly better performance). The intervals corresponding to the levels are obtained by splitting the range of each feature in the training set into Q uniform, non-overlapping intervals. In this way, the 82 features describing each picture (see Section IV) can be interpreted as counts. The main motivations for not clustering features individually are, on one hand, to remain dataset independent (different datasets result into different clusters) and, on the other hand, to limit computational costs when the number of features is high.
In the following, each Flickr user $u$ corresponds to a bag of favorite pictures $B_u$ ($u = 1, \dots, 300$) and five traits $y_p^u$ ($p = O, C, E, A, N$). The trait values predicted by the MIR approaches are denoted with $\hat{y}_p^u$. The notation does not distinguish between self-assessed and attributed traits because the two cases are treated independently of each other. The element $C_z^t$ of the feature matrix $C$ is the value of feature $z$ ($z = 1, \dots, 82$) for picture $t$ ($t = 1, \dots, 6 \times 10^4$). Finally, $\upsilon(t)$ is a function that takes as input a picture index $t$ and returns the corresponding user index $u$.
A. Baseline Approaches
The general MIR formulation is NP-hard [9] and this requires the adoption of simplifying assumptions. The most common one is to consider that each bag includes a primary instance that is sufficient to predict correctly the bag label. In the experiments of this work, this means that each bag $B_u$ includes only one picture $t$, with $\upsilon(t) = u$, that should be fed to the regressor to obtain as output the trait score $y_p^u$ (for a given $p$). However, the primary instance cannot be known a-priori for a test bag. Furthermore, the bags of this work include 200 pictures and, therefore, using only one of them means neglecting a large amount of information. For this reason, this work adopts different baseline MIR approaches, more suitable for the PsychoFlickr data. Essentially, these baselines assume that each instance plays a role in determining the value of the bag label, in line with the assumptions of [57]. Furthermore, the baseline approaches include a regressor that always predicts the average value of the traits as estimated over the training set.
Baseline: The simplest baseline approach consists in always predicting the average of a given trait in the training set. The reason for using the average is that this is the value that minimizes the Root Mean Square Error when a regressor always predicts the same value.
Naive-MIR [10]: The simplest approach consists in giving each picture of a bag $B_u$ as input to the regressor. As a result, there is a predicted score $\hat{y}_p^u(t)$ for each picture $t$ such that $\upsilon(t) = u$. The final trait score prediction $\hat{y}_p^u$ is the average of the $\hat{y}_p^u(t)$ values. The main assumption behind Naive-MIR is that all the pictures of a bag carry task-relevant information and, therefore, they must all influence the predicted score $\hat{y}_p^u$.
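A minimal sketch of Naive-MIR, assuming each training picture inherits the trait score of its bag and reusing the LASSO-regularized regressor of Section V-C (the bag layout and the alpha value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso

def naive_mir(train_bags, train_scores, test_bags, alpha=0.1):
    """Sketch of Naive-MIR: every picture of a training bag is paired
    with the trait score of the bag; at test time, the per-picture
    predictions of a bag are averaged to obtain the bag score.
    Bags are lists of (n_pictures, n_features) count arrays."""
    X = np.vstack(train_bags)
    y = np.concatenate([np.full(len(b), s)
                        for b, s in zip(train_bags, train_scores)])
    reg = Lasso(alpha=alpha).fit(X, y)
    return np.array([reg.predict(b).mean() for b in test_bags])
```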
cit-kNN [58]: Given a test bag $B_u$, this methodology adopts the minimal Hausdorff distance [59] to identify, among the training bags, both its $R$ nearest neighbors and its $C$-nearest citers (the training bags that have $B_u$ among their $C$ nearest neighbors). The predicted score $\hat{y}_p^u$ is then the average of the scores of both the $R$ nearest training bags and the $C$ nearest citer training bags. The approach does not include an actual regression step, but still maps a test bag $B_u$ into a continuous predicted score $\hat{y}_p^u$.
Clust-Reg [60]: The Clust-Reg MIR includes three main steps. The first consists in clustering all the pictures of the training bags using k-means, thus obtaining $C$ centroids $c_j$ in the feature space ($j = 1, \dots, C$). The second step considers all the images of a bag $B_u$ that belong to a cluster $c_k$ and averages them to obtain a prototype $k$. The same task is performed for all training bags $B_u$. In this way each training bag is represented by at most $C$ prototypes. The third step trains $C$ regressors $r_i$ ($i = 1, \dots, C$), each obtained by training the model in Section V-C over all the prototypes corresponding to one of the $C$ clusters, and identifies $r_k$, the one that performs best on a validation set. At test time, $r_k$ is applied to prototype $k$ of a test bag $B_u$ to obtain the predicted score $\hat{y}_p^u$. The regressor operates on a prototype $k$ that represents only the test-bag pictures surrounding the centroid $c_k$. Therefore, Clust-Reg implements the assumption that only a fraction of the pictures carries task-relevant information.
B. Latent representation-based methods
The baseline approaches presented in Section V-A operate in the feature space where the pictures are represented. Such a scheme is not suitable when the number of instances per bag is large, like in the experiments of this work, because it is not possible to know a-priori which samples carry information relevant to the task. In this respect, the main novelty and advantage of the approaches presented in this section consist in mapping the pictures of the training bags onto an intermediate latent space $Z$. This latter is expected to capture most of the information necessary to predict the trait scores. While being proposed for the experiments of this work, the new MIR approaches can be applied to any problem where the number of instances per bag is large.
Topic-Sum: The main assumption of this approach is that the pictures of a test bag Bu distribute over topics, i.e. over frequent associations of features that can be learnt from the images of the training bags. Such an approach is possible because the features extracted from the pictures have been quantized and the feature vectors can then be considered as vectors of counts or Bags of Features (see beginning of Section V). The topic model adopted in this work is the Latent Dirichlet Annotation (LDA) [61]. The LDA expresses the topics as probability distributions of features p(Czu|k), with k = 1, . . . , K and K << D (D is the dimension of the feature vectors). Once the topics are learnt from the training bags, a test bag Bu can be expressed as a mixture of topics:
p(Bu) = K X k=1
p(Czu|k)p(k|u), (19)
where $C^u_z$ is the feature matrix of the images belonging to Bu, and the coefficients p(k|u), called topic proportions, measure how frequently the topics appear in test bag Bu. The regressor of Section V-C is trained over vectors whose components correspond to the topic proportions of the training bags. At test time, the topic proportions of a test bag Bu are fed to the resulting regressor to obtain $\hat{y}^u_p$.
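The pipeline could be sketched as follows, assuming each bag is stored as a matrix of quantized feature counts; pooling the counts before inferring p(k|u) is one plausible reading of the method, not a detail stated in the text:

```python
# Hedged sketch of Topic-Sum; variable names and pooling are assumptions.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def topic_sum_representation(train_bags, test_bag, K=90):
    """Bags are (n_pictures, D) matrices of quantized feature counts."""
    lda = LatentDirichletAllocation(n_components=K)
    lda.fit(np.vstack(train_bags))  # topics learnt from the training pictures
    # Pool each bag's counts and infer its topic proportions p(k|u).
    train_props = lda.transform(np.stack([b.sum(axis=0) for b in train_bags]))
    test_props = lda.transform(test_bag.sum(axis=0, keepdims=True))
    return train_props, test_props  # inputs for the regressor of Sec. V-C
```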
Gen-LDA: This approach learns an LDA model [61] from the pictures of each training bag and then fits a Dirichlet distribution p(·; αu) on the resulting topic proportions (see the description of Topic-Sum above). The parameter vectors αu are then used to train the regressor of Section V-C. At the test stage, the αu parameters of the Dirichlet distribution corresponding to the topic proportions in test bag Bu are given as input to the regressor to predict the trait scores.
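The distinctive step of Gen-LDA is the Dirichlet fit; a hedged sketch using a simple moment-matching estimator (a stand-in for full maximum likelihood, which the text does not detail) could be:

```python
# Hedged sketch of the Gen-LDA bag descriptor: per-picture topic proportions
# are summarised by a Dirichlet whose parameter vector alpha_u becomes the
# regression input. Moment matching is an assumed simplification here.
import numpy as np

def dirichlet_moment_match(P, eps=1e-8):
    """P: (n_pictures, K) rows of topic proportions for one bag."""
    m = P.mean(axis=0)       # mean of each topic proportion
    v = P.var(axis=0) + eps  # variance of each topic proportion
    # Precision implied by the first component's mean/variance relationship:
    # Var(X_i) = m_i (1 - m_i) / (s + 1), hence s = m_i (1 - m_i) / v_i - 1.
    s = m[0] * (1 - m[0]) / v[0] - 1
    return m * max(s, eps)   # alpha_u = mean * precision

# The alpha_u vectors, one per bag, are stacked and fed to the
# regressor of Sec. V-C.
```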
Gen-MoG: This approach learns a Mixture of Gaussians (with diagonal covariances) with C components from the pictures of the training bags, then it considers all the images of a test bag Bu to estimate the following for c = 1, . . . , C:
$Z_u(c) = \sum_{t:\,\upsilon(t)=u} p(c \mid t) \qquad (20)$
where p(c|t) is the a-posteriori probability of component c in the mixture given picture t. The values Zu(c) are given as input to the regressor to predict the trait scores. Compared to the Multiple Instance Cluster Regression [60], a similar methodology, the main difference of Gen-MoG is that an instance is softly attributed to all the components of the Mixture of Gaussians through the probabilities p(c|t).
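A compact sketch of Eq. (20) using scikit-learn's diagonal-covariance Gaussian mixture; the container names and the value of C are illustrative:

```python
# Sketch of the Gen-MoG descriptor: a diagonal-covariance Gaussian mixture is
# fitted to the training pictures and each test bag is summarised by the
# summed component posteriors of Eq. (20). Names and C are assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

def gen_mog_features(train_pictures, bag, C=50):
    gmm = GaussianMixture(n_components=C, covariance_type='diag')
    gmm.fit(train_pictures)  # all pictures of the training bags
    # Z_u(c) = sum over the bag's pictures t of p(c | t): a soft assignment,
    # so every picture contributes to every component.
    return gmm.predict_proba(bag).sum(axis=0)
```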
CG: This approach is based on the Counting Grid (CG) [62], a recent generative model that embeds BoF representations like those used in this work into d-dimensional manifolds. The CG allows one to map each picture t of the training bags onto a 2-dimensional grid lying on a smooth manifold, i.e. a manifold where close positions correspond to close images in the original feature space. The grid has E1 rows and E2 columns and, typically, E1 × E2 << N, where N is the number of pictures in the bag. After such a training step, every test bag Bu is projected onto the same manifold and becomes a set of locations Lu = {ℓt} on the 2-dimensional grid, i.e. a distribution of the test bag pictures over the grid. The distribution - an E1 × E2-dimensional vector - is first smoothed by averaging over a 5 × 5 window and then given as input to the regressor of Section V-C to predict the trait scores.
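Training a Counting Grid is beyond the scope of a short sketch; assuming each test picture has already been mapped to a grid location ℓt, the construction of the smoothed bag descriptor could look as follows (the toroidal boundary handling is an assumption):

```python
# Hedged sketch of the CG bag descriptor: build the bag's distribution over
# the E1 x E2 grid and apply the 5 x 5 averaging window described in the
# text. Grid sizes and the wrap-around boundary are assumptions.
import numpy as np
from scipy.ndimage import uniform_filter

def cg_bag_descriptor(locations, E1=40, E2=40):
    grid = np.zeros((E1, E2))
    for r, c in locations:  # one grid location per picture in the bag
        grid[r, c] += 1
    grid /= grid.sum()      # distribution of the bag's pictures over the grid
    smoothed = uniform_filter(grid, size=5, mode='wrap')  # 5x5 averaging
    return smoothed.ravel()  # E1*E2-dimensional regressor input
```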
C. Regression
All the methods above, except cit-kNN, require a regressor to predict the trait scores. The one adopted in the experiments of this work has the following form:
$\hat{y}^u_p = \sum_{k=1}^{K} \beta_k x^u_k, \qquad (21)$
where β = (β1, . . . , βK) are the regressor parameters and the values $x^u_k$ are the components that, according to the different methods presented above, represent a test bag Bu (e.g., the Dirichlet distribution parameters in Gen-LDA). The βk are estimated by minimising the Mean Square Error E(β):
$E(\beta) = \sum_{u=1}^{U} \Big( y^u_p - \sum_{k=1}^{K} \beta_k x^u_k \Big)^2, \qquad (22)$
where U is the number of training bags. The problem was regularized using LASSO [63], an approach that constrains the L1 norm of the least-squares solution and thus acts as a model selection method by enforcing sparsity on the coefficients β. The regularizer in the LASSO estimate is simply expressed as a threshold on the L1 norm of the weight vector β:

$\sum_{k} |\beta_k| \leq t.$
Fig. 6: The plot shows the Spearman Correlation Coefficients ρ between features and traits, both self-assessed (upper part) and attributed (lower part). The features on the horizontal axis are grouped into Color, Composition, Textural Properties and Faces. The bubbles are blue when the correlation is positive and red when it is negative. The largest bubbles correspond to |ρ| = 0.55 while the smallest ones correspond to |ρ| ≈ 0.00. Correlations for which |ρ| ≥ 0.12 are statistically significant at level 0.05.
This term acts as a constraint that has to be taken into account when minimizing the error function. It has been proved that, depending on the parameter t, many of the coefficients βk become exactly zero [63]. This is particularly relevant to the problem of this work, because it makes it possible to model the fact that not all the features are correlated with a given trait.
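As a concrete illustration, the sketch below implements the regression step of Eqs. (21)-(22) with scikit-learn, whose Lasso solves the penalised (Lagrangian) form of the L1-constrained least squares; the penalty weight alpha, which plays the role of the threshold t, is an illustrative value rather than the one used in the experiments:

```python
# A minimal sketch of the Lasso-regularised regressor of Section V-C;
# the penalty weight is an assumption, not the value used in the paper.
import numpy as np
from sklearn.linear_model import Lasso

def fit_trait_regressor(X_train, y_train, alpha=0.05):
    """X_train: (U, K) bag representations; y_train: (U,) trait scores."""
    reg = Lasso(alpha=alpha, fit_intercept=False)  # Eq. (21) has no intercept
    reg.fit(X_train, y_train)
    # Depending on alpha, many coefficients beta_k are driven exactly to
    # zero, implementing the feature selection discussed in the text.
    return reg

# Example: predicted = fit_trait_regressor(X, y).predict(X_test)
```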
VI. EXPERIMENTS AND RESULTS
This section first presents a correlational analysis aimed at showing the relationship between features and traits, and then the regression experiments performed in this work.
A. Correlational Analysis
Given the vectors $\vec{x}^{(u)}_i$ (i = 1, . . . , 200) extracted from the 200 images favored by a user u, it is possible to use their average $\vec{x}^{(u)}$ as a representative of the corresponding bag. Figure 6 shows the covariation, measured with the Spearman Coefficient ρ, between the components of $\vec{x}^{(u)}$ and the traits, both self-assessed and attributed (blue and red bubbles account for positive and negative values of ρ, respectively). The size of the bubbles, proportional to the absolute value of ρ, represents the strength of the relationship between a given feature and a trait: “[...] the sign of the correlation coefficient has no meaning other than to denote the direction of the relationship. Correlations of 0.75 and −0.75 signify exactly the same degree of relationship. It is only the direction of that relationship that is different” [40].
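A sketch of this analysis, assuming the bags are stored as arrays of per-picture feature vectors (all names are illustrative):

```python
# Hedged sketch of the correlational analysis: each user's bag is reduced
# to the mean of its picture vectors and Spearman's rho is computed feature
# by feature against one trait; containers are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr

def feature_trait_correlations(bags, trait_scores):
    """bags: list of (200, D) arrays; trait_scores: (n_users,) for one trait."""
    X = np.stack([bag.mean(axis=0) for bag in bags])  # one mean vector per user
    results = [spearmanr(X[:, d], trait_scores) for d in range(X.shape[1])]
    rho = np.array([r.correlation for r in results])
    pval = np.array([r.pvalue for r in results])
    return rho, pval  # here |rho| >= 0.12 corresponds to significance at 0.05
```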
The covariation is high for the attributed traits, but limited for the self-assessments. In particular, ρ is statistically significant (at level 0.05) for 48.5% of the features in the case of the attributed traits and only for 8.3% of the features in the case of the self-assessments. This suggests that the visual properties of the images covary with the impression that the judges develop about the Flickr users, but do not account for the self-assessments that the users provide. For this reason, the rest of this section focuses on the attributed traits.
Color properties (see Section IV-A) covary to a significant extent with all traits and, in particular, with Agreeableness and Neuroticism. However, the properties that correlate positively with one trait tend to correlate negatively with the other, and vice versa. In other words, Agreeableness and Neuroticism seem to be perceived as complementary with respect to color characteristics. This applies, e.g., to average saturation (ρ = 0.40 for Agreeableness and ρ = −0.55 for Neuroticism), percentage of orange (ρ = 0.45 and ρ = −0.56), blue (ρ = 0.36 and ρ = −0.52) and red (ρ = 0.30 and ρ = −0.40) pixels, arousal (ρ = 0.38 and ρ = −0.52) and valence (ρ = 0.27 and ρ = −0.40). Overall, the judges appear to assess as high in Agreeableness users that like images eliciting pleasant emotions and showing pure colors. Conversely, the judges consider high in Neuroticism people that like images stimulating intense, unpleasant emotions and containing colors with low saturation.
Complementary assessments can be observed for Openness and Conscientiousness as well when it comes to the relationship with compositional properties (see Section IV-B). The features of this category that covary most with the two traits are rule of thirds (ρ = −0.21 for Openness and ρ = 0.22 for Conscientiousness) and level of detail (ρ = −0.30 and ρ = 0.19). Therefore, unconventional compositions displaying few details tend to be associated with high Openness (the trait of creativity and artistic inclinations), while conventional compositions with many details tend to be associated with high Conscientiousness (the trait of reliability and thoroughness).

Fig. 7: The figure shows APP (upper charts) and APR (lower charts) performances in terms of the Spearman Correlation Coefficient between actual and predicted traits, the Root Mean Square Error (RMSE) and the R2 metric. The methods compared are Baseline, Naive MIR, cit-kNN, Clust-Reg, Topic-Sum, Gen-MoG, Gen-LDA and CG. In the case of the correlations, missing bars correspond to non statistically significant values.
Textural features (see Section IV-C) appear to covary with the perception of most traits, especially when it comes to the properties of the Gray-Level Co-occurrence Matrix (see Section IV-C). In the case of Openness, the highest correlations are observed for exposure (measured in terms of brightness energy) and image homogeneousness (measured in terms of gray distribution entropy). The covariation is positive for the former (ρ = 0.35) and negative for the latter (ρ = −0.27). Therefore, people that like pictures with homogeneous illumination and uniform textural properties tend to be perceived as higher in Openness. For Conscientiousness, the covariation is significant only for the Tamura directionality (ρ = 0.23); hence, there seems to be no substantial relationship between the trait and textural properties. In contrast, several textural properties covary with the attribution of Extraversion. High contrast in hue, meaning large color differences between neighboring pixels, and in saturation, meaning chromatic purity, are associated with high Extraversion scores (ρ = 0.26 and ρ = 0.21, respectively). The same applies to the Tamura contrast (ρ = 0.25) and directionality (ρ = −0.33) of the images.
The value of ρ for the number of faces, the only content-related feature considered in this work, is statistically significant at the 0.01 confidence level for all traits except Openness. The ρ value is negative for Conscientiousness (ρ = −0.2) and Agreeableness (ρ = −0.17) and positive for the other traits. Not surprisingly, the absolute covariation is particularly high for Extraversion (ρ = 0.53), the trait of sociability and interest in others, and for Neuroticism (ρ = −0.28), the trait associated with difficulties in dealing with social interactions.
According to personality psychologists, the traits that people tend to perceive more clearly are Extraversion and Conscientiousness [33]. However, different data can make different traits more or less available, i.e. more or less accessible to human observers [13]. The correlational analysis shows that favorite pictures convey impressions more effectively for Agreeableness and Neuroticism than for the other traits. This seems to suggest that the raters develop an impression in terms of whether a person is overall nice (a typical characteristic of people high in Agreeableness) or not (a typical characteristic of people high in Neuroticism). This appears to be confirmed by the fact that the correlations for the two traits often have opposite signs, meaning that a person perceived to be neurotic is not perceived to be agreeable, and vice versa.
B. Experimental Setup
All the experiments of this work have been performed with a Leave-One-User-Out approach: the models are trained over all the pictures of PsychoFlickr except those tagged as favorite by one of the Flickr users included in the corpus (see Section III). The traits of the excluded user are then predicted using that user's pictures as test set. The process is iterated and, at each iteration, a different user is left out. The hyper-parameters of the methods introduced in Section V-A and Section V-B have been set through cross-validation: all parameter values in a search range were tested over a subset of the training set and the configurations leading to the highest performance were retained for the test. The main advantage of this setup is that it allows the use of the entire corpus to measure the performance of the inference approaches while still preserving a rigorous separation between training and test set.
Hyper-parameters and search ranges for the methods described above are as follows: for cit-kNN, the number of nearest citers C and the number of nearest neighbors R were searched in the ranges [4, 10] and [2, 8], respectively; for Clust-Reg and Gen-MoG, the number C of clusters was searched in the set {5, 10, 20, . . . , 100}; for Topic-Sum and Gen-LDA, the number of topics K was searched in the set {50, 70, 90, 110, 130, 150}; for CG, the grid sizes were searched in the set {20 × 20, 25 × 25, . . . , 65 × 65}. Whenever there was no risk of confusion, the same symbol has been used for different hyper-parameters in different models. These ranges have been set by considering common works on object recognition for what concerns the number of topics, the number of clusters and the grid size [62], [61], and the original paper on cit-kNN [58] for what concerns the number of citers and neighbors. The sketch below illustrates how the protocol and the hyper-parameter search fit together.
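A minimal sketch of the Leave-One-User-Out protocol with an inner validation split, assuming a generic `build_model` factory that stands in for any of the methods of Sections V-A and V-B (all names and the 80/20 inner split are placeholders):

```python
# Hedged sketch of Leave-One-User-Out with inner hyper-parameter selection;
# `build_model`, the bag containers and the inner split are assumptions.
import numpy as np

def leave_one_user_out(bags, scores, build_model, param_grid):
    predictions = np.zeros(len(bags))
    for held_out in range(len(bags)):
        train_idx = [i for i in range(len(bags)) if i != held_out]
        # Inner split: hold back part of the training users for validation.
        n_val = max(1, len(train_idx) // 5)
        val_idx, fit_idx = train_idx[:n_val], train_idx[n_val:]
        best_err, best_model = np.inf, None
        for params in param_grid:  # e.g. K in {50, 70, ..., 150} for Gen-LDA
            model = build_model(params)
            model.fit([bags[i] for i in fit_idx], [scores[i] for i in fit_idx])
            err = np.mean([(model.predict(bags[i]) - scores[i]) ** 2
                           for i in val_idx])
            if err < best_err:
                best_err, best_model = err, model
        predictions[held_out] = best_model.predict(bags[held_out])
    return predictions  # one predicted trait score per user
```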
C. Prediction Results
Figure 7 reports the results obtained with the regression methods described in Section V for both self-assessed and attributed traits. The performance is assessed with three different metrics, namely the Spearman correlation coefficient between scores predicted automatically and scores resulting from the BFI-10 questionnaire (see Section III), the Root Mean Square Error (RMSE) and R2. The reason is that different performance metrics account for different aspects and only the combination of multiple metrics can provide a complete description of the results.
In line with the state-of-the-art of Personality Computing [12], APP results tend to be more satisfactory than APR ones. In the case of this work, the reason is that the judges are unacquainted with the users. Therefore, the pictures dominate the personality impressions that the judges develop and, as a result, the correlation between visual features and trait scores is higher (see Figure 6). Furthermore, the consensus across the judges is statistically significant (see Section III-B). These two conditions help the regression approaches to achieve higher performances. When the users self-assess their personality, they take into account information that is not available in the favorite pictures, e.g., personal history, inner state, education, etc. Therefore, the correlation between visual features and trait scores is low, which prevents the regression approaches from achieving high performances.
APP and APR performances are similar in terms of RMSE, but correlation and R2 are better for APP than for APR. The probable reason is that, in the case of APP, the regressor tends to maintain the mutual relationships between personality scores, i.e., the regressor tends to predict higher scores for those subjects that tend to be rated higher by the assessors. This explains why the correlations are statistically significant (actual and predicted traits covary to a statistically significant extent) and more satisfactory than the R2 and RMSE results. According to personality psychology, “[...] a compelling argument can be made for emphasizing comparisons among individuals, which we do in everyday life [...] and which is useful for practical purposes” [64]. This means that what matters is not to predict the actual personality scores that individuals have been attributed, but to ensure that the subjects that have been attributed higher scores by the raters tend to be assigned higher scores by the regressor as well. In this respect, the Spearman Correlation Coefficient appears to be the performance metric that best fits the indications of personality psychology.
All approaches have been compared with a baseline that simply predicts the average of the trait values observed in the training set. The performance of the baseline is lower than the performance of all other approaches to a statistically significant extent (see Figure 7). The weakest approaches (cit-kNN and Clust-Reg) are those that make hard decisions to exclude part of the pictures in a test bag. This seems to suggest that all pictures carry task-relevant information and that the most effective strategy is to make soft decisions by combining complex generative models (e.g., LDA and CG) with sparsity-controlling regressors (see Section V-C). This is the case of the best-performing methods, namely Topic-Sum, Gen-MoG, CG and Gen-LDA (this latter has the best overall performance). The good performance of the Naive MIR further confirms that all images in a test bag contribute to the attributed traits and, hence, must be used for the regression. This observation has two possible explanations. The first is that all pictures influence the impression that each judge develops about the Flickr users. The second is that each judge is influenced by a different subset of pictures and the attributed traits - the average over the traits attributed individually by each judge - are therefore influenced by all pictures in a test bag.
For every trait, it is possible to split the range of the observed scores into quartiles. The performance of the regressors has been measured separately over subjects that fall in the top and bottom 25% of the observed scores and over the remaining subjects. Overall, the performance tends to be higher for