Hotel performance analysis based on Booking.com reviews


UNIVERSITY OF PISA Department of Computer Science

Corso di Laurea Magistrale in Informatica per l'economia e per l'azienda

(Business Informatics)

MASTER THESIS

Hotel performance analysis based on Booking.com reviews

SUPERVISOR: Prof. Mirco NANNI

CANDIDATE: Chenxi ZHU


Abstract

In this thesis, we analyze hotel reviews from the Booking.com platform for hotels located in the heart of six important European cities. Using the attributes of the hotels together with the scores and contents of the reviews, we cluster the hotels according to their monthly number of reviews and their monthly ratings. We aim to find the commonalities, characteristics, and weak and strong points of hotels in European cities, and to propose possible improvements based on the results of our analysis.


Introduction

With the development of the world economy and the acceleration of globalization, business contacts between and within countries have become increasingly frequent, and business tourism is developing rapidly worldwide. At the same time, economic development has laid a broad market foundation for tourism and opened the era of mass tourism. The booming economy has therefore created a huge demand for hotels. Hotels occupy an important position in business travel and are also an indispensable part of leisure travel. Nowadays, hotels are not only places to stay but also offer a variety of services, such as catering, entertainment, shopping, banquets and conferences. With the rapid development of Internet technology, almost every guest has a smartphone, and mobile check-in is increasingly common. A smartphone also allows guests to choose their room by themselves, and many applications are available to meet guests' travel needs.

People can log in from a mobile phone or computer, then search for and book hotel rooms. After checking out, guests can leave feedback, such as a score and a written evaluation of the hotel, based on their own experience. Guests are willing to use this method to choose the hotels that suit them best, and hotel merchants are equally willing to use such platforms to promote sales and attract more customers. The evaluations left by customers are therefore very important, because later guests use the reviews and ratings left by previous guests as their main reference when choosing a hotel.

In this way, the reviews and rating of a hotel on the website become an important indicator of the hotel's reputation. A higher score suggests better service and higher guest satisfaction, and it attracts more potential guests. The more topics the positive reviews cover, the easier it is for prospective guests to find a hotel that meets their requirements. Other things being equal, the higher the rating of a hotel and the richer its praise, the more likely it is to be booked. Hotel merchants can also adjust or improve their facilities according to the reviews and ratings they receive, so as to better satisfy customers and gain favorable comments and higher popularity. Hotel reviews and ratings clearly affect people's choices and decisions, and whether people choose to book a hotel directly affects the hotel's profits.

In order to improve their attractiveness, hotel merchants try to explore the feedback that customers provide through such platforms. This thesis pursues research in this direction, using methods such as data analysis and cluster analysis. The data used in our thesis come from the Booking.com platform, a website with the features described above. The dataset has about 515k rows, which represent the customer reviews and scores of 1494 luxury hotels across Europe in the period from August 2015 to August 2017. We conducted the data analysis based on the content of the hotel reviews and on other attributes of the hotels. We aim to find the commonalities, characteristics, and weak and strong points of hotels in European cities, and to propose possible improvements based on the results of our analysis. In this way, a hotel can, on the one hand, better understand what kind of service its customers want. On the other hand, it can better understand its own actual situation, remedy its shortcomings and adapt to customer demand, so as to provide higher-quality services, achieve higher popularity (a higher score), attract more guests and earn more profits. At the same time, the guest reviews can also indicate the future direction of development of the hotel.

Outline

This thesis is organized as follows:

Chapter 1 describes the background: the functions and applications of the Booking.com platform, the theory of the related data mining methods, and a brief explanation of the Python language.

Chapter 2 presents the data understanding and data preprocessing phases for the basic attributes of our data.

Chapter 3 presents the cluster analysis based on the number of reviews over time. It includes the analysis of the basic data of each cluster and of the content of the hotel reviews, and it also shows hotel density maps of the different clusters for the different cities.

Chapter 4 presents the cluster analysis based on the average reviewer score over time and the relationship between rating trends and review trends. It also includes the analysis of the basic data of each cluster, the content of the hotel reviews, and hotel density maps of the different clusters for the different cities.

Chapter 5 discusses the results summarized from the analysis of the previous chapters and gives suggestions for the future development of hotels.

Chapter 1 Background

This chapter first presents the functions and applications of the Booking.com website. We then introduce the data mining background relevant to our data analysis. The last part describes the Python language and the Jupyter Notebook, which we use for the data analysis.


1.1 Booking.com

Booking.com is a website that helps people book lodging and travel tickets online all over the world [Wikipedia 18a]. Its headquarters are in Amsterdam, the Netherlands. The website has more than 29 million listings in 140,210 destinations in 231 countries and territories worldwide [Booking.com 18b], and it is available in 43 languages. More than 1,550,000 rooms are booked through Booking.com per day. Its website and mobile app are popular with leisure and business travelers from all over the world.

1.1.1 Hotel merchants

Hotel merchants can register on the Booking.com website. After registering and passing the review, they obtain permission to use the hotel back office. They then update the calendar, room prices and other settings, and finally go online to accept bookings [Booking.com 18b]. When registering, a hotel must provide its name, address, and latitude and longitude to the website; the hotel data in our dataset are derived from this information. Hotel merchants can also increase their profits by using the business analytics provided by Booking.com.

1.1.2 User registration and booking

Users can register on the website via email or social network account. After successful registration, users can set personal information, credit card, payment settings, travel preferences, password and currency settings, social network, SMS push settings and account security in the user's settings.

Figure 1.1 Booking website

As shown in Figure 1.1, users can search for and book hotels by filling in the destination, the check-in time and the number of guests in the search bar on the website. Based on the user's search, the website displays a list of matching hotels. The user views the hotel information in the list, then selects and books a hotel. After the reservation succeeds, both the user and the hotel receive the order.

1.1.3 Check in and evaluation

Reviews are the personal views of guests about their experience after staying. On Booking.com, guests can leave a review only if they have booked through the website or app and/or stayed at the hotel. An invitation to write a review is sent to the guest within 48 hours of check-out, and the guest can then submit the review within 28 days [Booking.com 18c].

Reviews are the subjective opinions of guests. In order to remain as objective and fair as possible, Booking.com removes a comment only if it contains disrespectful language or is irrelevant. If the guest chooses to review anonymously, Booking.com is obligated to protect their privacy and therefore cannot share any information about that guest with the hotel merchant [Booking.com 18c].

In addition, a booker may book on behalf of friends, family or colleagues. In that case, the booker receives the review questionnaire about the accommodation and can forward it to the actual guest to complete [Booking.com 18c].

1.1.4 Calculation of accommodation ratings


The hotel's total review score is calculated from all individual scores displayed online. As shown in Figure 1.2, the review details page on the Booking.com website is straightforward and presents the hotel's basics through a 7-dimensional rating: Cleanliness, Comfort, Location, Facilities, Staff, Value for Money, and Free Wi-Fi. Once the hotel has received at least 5 reviews, its score is calculated and displayed on the website [Booking.com 18a]. In addition, guests can add tags to the hotel, and for each stay they can leave both positive and negative comments.

Whenever a hotel merchant receives or handles a review, the Booking.com system takes 48 hours to synchronize it across platforms. After the update is complete, the information can be displayed on the website. The time at which a review passes moderation therefore affects the time at which it is published. The score is calculated automatically by the system and cannot be modified manually by Booking.com.

How does Booking.com calculate the average score? Guests evaluate 6 categories using a scale of smiley and sad faces, where each step is worth 2.5 points. When Booking.com calculates the average score for a hotel, it adds the scores for room cleanliness, comfort, location, facilities and services, staff quality, and value for money, then divides by 6 (the total number of categories). A single category score therefore has only a limited impact on the accommodation score. If the guest does not rate a category, the score for that category defaults to 0 and is not included in the calculation of the average score [Booking.com 18a].
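The averaging described above can be illustrated with a short worked example. This is only a sketch under stated assumptions: the category keys and the function name `review_score` are hypothetical, and we assume that an unrated category is left out of both the sum and the divisor, which is one reading of "not included in the calculation".

```python
from typing import Dict, List, Optional

# Hypothetical keys for the six rated categories named in the text.
CATEGORIES: List[str] = [
    "cleanliness", "comfort", "location",
    "facilities", "staff", "value_for_money",
]


def review_score(category_scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Average the category scores of one review.

    Assumption: categories the guest did not rate (missing or None)
    are skipped, so they do not pull the average down.
    """
    rated = [
        score
        for score in (category_scores.get(name) for name in CATEGORIES)
        if score is not None
    ]
    if not rated:
        return None  # nothing was rated, so there is no score
    return sum(rated) / len(rated)


# Illustrative values only, on the 2.5-point scale.
full_review = {
    "cleanliness": 7.5, "comfort": 10.0, "location": 10.0,
    "facilities": 7.5, "staff": 5.0, "value_for_money": 7.5,
}
partial_review = {"cleanliness": 10.0, "comfort": 5.0}

print(review_score(full_review))     # sum of the six scores divided by 6
print(review_score(partial_review))  # average of the two rated categories
```

With all six categories rated, the result is the plain sum divided by 6, as in the text.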

A review automatically expires after 24 months, and the system needs 48 hours to recalculate the average score and post it to the web page. Reviews older than 24 months are deleted automatically, which ensures that guests see the latest and most relevant reviews. As a result, the review score of a hotel changes over time.

1.2 Data mining background

1.2.1 Data Mining

As shown in Figure 1.3, data mining is the core step of Knowledge Discovery in Databases (KDD). Data mining is the process of using algorithms to search for and extract hidden information from a large amount of incomplete, noisy, fuzzy and random application data [KDDProcess 18]. This information is usually previously unknown and potentially useful. Data mining is closely associated with computer science and achieves these goals through statistics, online analytical processing, information retrieval, machine learning, expert systems and pattern recognition. KDD as a whole is a non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in a data set [KDDProcess 18].

Figure 1.3 KDD process [KDD Process 18]

The KDD process has four main steps. The first step, data consolidation, creates the data set; here we have to determine what kind of data can be applied to our project. After the data have been collected, the second step is to preprocess them, for example by eliminating errors and handling missing information, and to convert them to the format required by the data mining tools and algorithms. The third step is to search for patterns of interest, which is the main task of data mining. The final step is to interpret and evaluate the results of data mining [Pedreschi 18].

The main analysis tasks and methods of data mining are classification, estimation, prediction, association rules, clustering, time-series patterns and deviation analysis. In our thesis, we mainly use cluster analysis. Clustering groups data into several categories according to similarity: data in the same class are similar to each other, and data in different classes are dissimilar. Cluster analysis can establish macroscopic concepts and discover patterns in the data distribution and possible interrelationships between data attributes [Pedreschi 18].

1.2.2 Cluster analysis

Cluster analysis is the process of grouping data into different clusters. Data objects in the same cluster are similar to each other, whereas data objects in different clusters are dissimilar. The distances between objects in the same cluster, called intra-cluster distances, are minimized, and the distances between clusters, called inter-cluster distances, are maximized. The purpose of cluster analysis is to group the research objects according to their characteristics and thereby reduce the number of objects to be studied.

A set of clusters makes up a clustering. There are many types of clustering: hierarchical or partitional, exclusive, overlapping or fuzzy, and complete or partial. Figure 1.4 uses two-dimensional points as data objects to illustrate the differences between types of clusters.

Figure 1.4 Different types of clusters [Tan 18a]

In general, clustering relies on a measure of how close two individuals (or two variables) are. There are two ways to measure this. One is to use indicators of proximity such as a distance: the smaller the distance, the more similar the objects. The other is to use indicators of similarity such as a correlation coefficient: the larger the correlation coefficient, the more similar the variables. There are many distance measures for clustering, for example the Euclidean distance, the squared Euclidean distance and the Manhattan distance, while the Pearson correlation coefficient is an indicator of similarity. The Euclidean distance is given by equation (1.1):

$$dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2} \qquad (1.1)$$

where $n$ is the number of dimensions (attributes), and $p_k$ and $q_k$ are the $k$-th attributes (components) of the data objects $p$ and $q$, respectively [Pedreschi 18].
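To make the two kinds of indicator concrete, the following minimal sketch computes the Euclidean distance of equation (1.1) and the Pearson correlation coefficient for two small vectors. The function names and the sample vectors are illustrative assumptions, not part of the thesis code.

```python
import math
from typing import Sequence


def euclidean_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Equation (1.1): square root of the summed squared differences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


def pearson_correlation(p: Sequence[float], q: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long vectors."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    covariance = sum((pk - mean_p) * (qk - mean_q) for pk, qk in zip(p, q))
    spread_p = sum((pk - mean_p) ** 2 for pk in p)
    spread_q = sum((qk - mean_q) ** 2 for qk in q)
    return covariance / math.sqrt(spread_p * spread_q)


# Illustrative vectors: q is exactly twice p.
p = [1.0, 2.0, 3.0]
q = [2.0, 4.0, 6.0]

print(euclidean_distance(p, q))   # not zero: the points are apart
print(pearson_correlation(p, q))  # perfectly correlated despite the distance
```

The example shows why the two indicators can disagree: the vectors are far apart in distance but have the same shape, so their correlation is maximal.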

The clustering algorithms we consider are hierarchical clustering, density-based clustering, and K-means and its variants. Hierarchical clustering produces a series of nested clusters, built by merging or by decomposition and usually displayed as a tree diagram. Density-based clustering, such as DBSCAN, separates the dense regions of data objects, which form the clusters, from noise. In this thesis, we use the K-means clustering algorithm. K-means is a partitional clustering approach that divides the data into K clusters, each represented by the mean of its points. It is described in detail in the next section.

1.2.3 K-means algorithm

The K-means algorithm is the most commonly used partitional clustering method. Partitional clustering simply divides the data objects into non-overlapping subsets. The algorithm has low computational complexity and low memory requirements, and is therefore fast, which makes it very suitable for cluster analysis of large samples. However, its scope of application is limited: the number of clusters must be specified in advance, only observations (samples) can be clustered, and the clustering variables must be continuous.

The basic K-means algorithm consists of the following steps: 1, initialization: select (or manually specify) K records as the initial cohesive points; 2, assign each remaining record to the nearest cohesive point, according to the principle of proximity; 3, compute the center position (mean) of each resulting cluster and use it as the new cohesive point; 4, repeat steps 2 and 3 until the positions of the cohesive points converge [Pedreschi 18]. The cohesive point, usually called the centroid, is the center of the cluster and is in general the mean of the points in the cluster.
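The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the thesis: the function name `kmeans`, the fixed random seed and the toy points are assumptions, and the Euclidean distance is used as the proximity measure.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def kmeans(points: List[Point], k: int, max_iter: int = 100,
           seed: int = 0) -> Tuple[List[Point], List[int]]:
    """Basic K-means; returns the centroids and one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialization: K records as centroids
    labels: List[int] = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Update: each centroid moves to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
        if new_labels == labels:  # no reassignment: centroids have converged
            break
        labels = new_labels
    return centroids, labels


# Toy data: two well-separated groups of two-dimensional points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On such clearly separated data the result does not depend on the initial centroids; on real data it can, which is the issue discussed next.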

The initial centroids are usually selected at random. As shown in Figure 1.5, different initial centroids can converge to different results, so the choice of the initial centroids is very important.

Figure 1.5 the importance of selecting the initial centroid

K-means converges for common similarity measures such as Euclidean distance, cosine similarity and correlation, and most of the convergence happens in the first few iterations.

In order to address the problem of selecting the initial centroids, we can take several measures: 1, run the algorithm multiple times; 2, sample the data and use hierarchical clustering to determine the initial centroids; 3, select more than K initial centroids and then choose among them; 4, post-processing; 5, bisecting K-means [Pedreschi 18]. In each case, we look for the cluster centroids that minimize the SSE of the clustering. The SSE (Sum of Squared Errors) is the measure used to evaluate a K-means clustering and is given by equation (1.2).

$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x) \qquad (1.2)$$

where $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point (centroid) of cluster $C_i$. The smaller the SSE, the better the clustering. Increasing the number of clusters K generally reduces the SSE.
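As a worked instance of equation (1.2), the sketch below computes the SSE of a clustering given as a list of clusters, taking the mean of each cluster as its representative point. The function names and the four sample points are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def centroid(cluster: Sequence[Point]) -> Point:
    """Mean of the points in one cluster (the representative point m_i)."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def sse(clusters: List[Sequence[Point]]) -> float:
    """Equation (1.2): squared distances of all points to their own centroid."""
    total = 0.0
    for cluster in clusters:
        m = centroid(cluster)
        for x in cluster:
            total += sum((xk - mk) ** 2 for xk, mk in zip(x, m))
    return total


# The same four points, grouped once into two clusters and once into one.
two_clusters = [[(1.0, 1.0), (3.0, 1.0)], [(10.0, 10.0), (10.0, 14.0)]]
one_cluster = [[(1.0, 1.0), (3.0, 1.0), (10.0, 10.0), (10.0, 14.0)]]

print(sse(two_clusters))
print(sse(one_cluster))
```

Splitting the points into two clusters gives a much smaller SSE than keeping them in one, which is why the SSE alone cannot be used to choose K.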

Figure 1.6 Elbow Method

In order to find the best K, we use the Elbow Method shown in Figure 1.6. We run the K-means algorithm for a range of values of K and record the SSE of each run. We then plot a line chart of the SSE against K and select the K at the elbow point, where the decrease in SSE slows markedly. In the example above, the best value is 10.
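In the thesis the elbow is read from the chart. As a rough numerical counterpart, the sketch below picks the K where the drop in SSE slows down the most, measured by the largest second difference. This heuristic, the function name `elbow_k` and the SSE values are all assumptions made for illustration; the values are made up and are not results from our data.

```python
from typing import Dict


def elbow_k(sse_by_k: Dict[int, float]) -> int:
    """Return the K with the sharpest bend in the SSE curve.

    Assumption: the keys are consecutive values of K. The bend at K is
    the drop in SSE before K minus the drop in SSE after K.
    """
    ks = sorted(sse_by_k)
    if len(ks) < 3:
        raise ValueError("at least three values of K are needed")
    best_k = ks[1]
    best_bend = float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        drop_before = sse_by_k[prev_k] - sse_by_k[k]
        drop_after = sse_by_k[k] - sse_by_k[next_k]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


# Made-up SSE values for K = 1..6, for illustration only.
example_sse = {1: 1000.0, 2: 600.0, 3: 300.0, 4: 250.0, 5: 220.0, 6: 200.0}
print(elbow_k(example_sse))
```

A visual check of the plotted curve remains advisable, since real SSE curves often have no single sharp bend.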

1.3 Python language

Python is a well-organized and powerful object-oriented programming language, similar to Java, mainly used in system programming, graphics processing, mathematical processing, text processing, database programming, network programming, Web programming, multimedia applications [IPython 17].

In our project, we mainly use the Python language for data mining and analysis in the Jupyter Notebook. Jupyter Notebook is a web application in which code can be written and run directly in a web page, with the results shown on the same page. We mainly used the Python 3 libraries NumPy, Pandas and Matplotlib.

NumPy (Numerical Python) is the most basic of these packages. It is designed for numerical processing and provides a lot of useful functionality for n-dimensional arrays and matrix operations, as well as many advanced numerical programming tools, such as matrix data types, vector processing and sophisticated arithmetic libraries.

Pandas is a tool built on NumPy that was created for data analysis tasks. It provides a number of functions and methods that enable us to process data quickly and easily, and it is also used for data manipulation, aggregation and visualization. Its data structures include the Series, which can hold different data types and can be time-indexed, the two-dimensional tabular structure named DataFrame, and the three-dimensional array named Panel.

Matplotlib is a Python 2D plotting library. It produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms [Matplotlib 18]. With Matplotlib, we can generate plots, histograms, power spectra, bar charts, error charts and scatter plots with little code.

Chapter 2 Data Understanding and Preprocessing

In this chapter, we describe the data understanding and data preprocessing phases and extract information from summary statistics to gain insight into the data. Data preparation uses this information to select more suitable attributes, reduce the dimension of the dataset, deal with missing values and outliers, normalize and transform the data, and improve data quality. For this work we use analysis tools available in the Microsoft suite and in the Jupyter Notebook.

2.1 Dataset description

The data were scraped from Booking.com [Liu 18]. All data in the file were already publicly available, and the data are originally owned by Booking.com. The dataset has about 515k rows, which represent the customer reviews and scores of 1494 luxury hotels across Europe in the period from August 2015 to August 2017. Each row has 17 attributes, which are explained in Table 2.1:

| Name | Type | Description |
|---|---|---|
| Hotel_Address | categorical | Address of the hotel |
| Additional_Number_of_Scoring | numerical | Number of customers who left only a rating rather than a review |
| Review_Date | categorical | Date when the reviewer left the review |
| Average_Score | categorical | Average score of the hotel |
| Hotel_Name | categorical | Name of the hotel |
| Reviewer_Nationality | categorical | Nationality of the reviewer |
| Negative_Review | text | Negative review the reviewer gave to the hotel. If the reviewer did not give a negative review, the value is 'No Negative' |
| Review_Total_Negative_Word_Counts | numerical | Total number of words in the negative review |
| Total_Number_of_Reviews | numerical | Total number of valid reviews the hotel has |
| Positive_Review | text | Positive review the reviewer gave to the hotel. If the reviewer did not give a positive review, the value is 'No Positive' |
| Review_Total_Positive_Word_Counts | numerical | Total number of words in the positive review |
| Total_Number_of_Reviews_Reviewer_Has_Given | numerical | Number of reviews the reviewer has given in the past |
| Reviewer_Score | numerical | Rating the reviewer gave to the hotel based on his/her experience |
| Tags | categorical | Tags the reviewer gave to the hotel |
| days_since_review | categorical | Duration between the review date and the scrape date |
| lat | discrete | Latitude of the hotel |
| lng | discrete | Longitude of the hotel |

Table 2.1: Dataset description

2.2 Data cleaning and Data understanding

In this phase, we treat missing values, find and exclude outliers, check the distribution of the data and perform a correlation analysis. In addition, we exclude some irrelevant attributes in order to reduce the number of variables.

After the preliminary examination, carried out with a view to the later statistical and correlation analysis, the dataset contains 515,738 rows, with missing values in the latitude and longitude attributes for 17 hotels, as shown in Figure 2.1.

Figure 2.1 Visualization of missing values in data

After removing duplicates, the dataset contains 515,212 customer reviews and scores of 1494 luxury hotels across Europe. We find about 3,268 NaN (missing) values, which come from 17 hotels whose latitude and longitude are not available in the dataset. This means that the location is missing for only about 1.1% of the hotels (17 of 1494). We use the Jupyter Notebook with Python to fill these missing values and generate a new, complete dataset. In addition, the attribute 'days_since_review' is the duration between the review date and the scrape date. We decided to delete it because the time of scraping has no bearing on the reviews themselves.
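The cleaning steps described above could look roughly like the following Pandas sketch. It runs on a tiny invented sample, not on the real dataset, and the way the coordinates are filled is an assumption: the text does not say where the missing values were taken from, so the lookup dictionary `known_coordinates` and its values are purely hypothetical.

```python
import pandas as pd

# Tiny invented sample using the column names of Table 2.1.
reviews = pd.DataFrame({
    "Hotel_Name": ["Hotel A", "Hotel A", "Hotel B", "Hotel B", "Hotel C"],
    "Reviewer_Score": [8.8, 8.8, 9.6, 7.5, 10.0],
    "lat": [48.86, 48.86, None, None, 51.50],
    "lng": [2.35, 2.35, None, None, -0.12],
    "days_since_review": ["3 days", "3 days", "10 days", "12 days", "5 days"],
})

# Step 1: remove rows that are exact duplicates of an earlier row.
deduplicated = reviews.drop_duplicates().reset_index(drop=True)

# Step 2: count the hotels whose coordinates are missing.
missing_rows = deduplicated["lat"].isna()
hotels_missing = int(deduplicated.loc[missing_rows, "Hotel_Name"].nunique())

# Step 3: fill the gaps from a hypothetical lookup table (assumed values).
known_coordinates = {"Hotel B": (41.39, 2.17)}
for hotel, (lat, lng) in known_coordinates.items():
    rows = deduplicated["Hotel_Name"] == hotel
    deduplicated.loc[rows, "lat"] = lat
    deduplicated.loc[rows, "lng"] = lng

# Step 4: drop the attribute that only reflects the scraping time.
cleaned = deduplicated.drop(columns=["days_since_review"])

print(len(reviews), len(deduplicated), hotels_missing)
print(cleaned)
```

Counting missing hotels rather than missing rows matters here, because one hotel without coordinates affects every one of its reviews.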

Table 2.2 (parts 1 and 2) shows the summary statistics per attribute for the dataset.

| | Additional_Number_of_Scoring | Average_Score | Reviewer_Score | Total_Number_of_Reviews |
|---|---|---|---|---|
| count | 515212 | 515212 | 515212 | 515212 |
| mean | 498.416021 | 8.397767 | 8.395532 | 2744.698889 |
| std | 500.668595 | 0.547952 | 1.637467 | 2318.090821 |
| min | 1 | 5.2 | 2.5 | 43 |
| 25% | 169 | 8.1 | 7.5 | 1161 |
| 50% | 342 | 8.4 | 8.8 | 2134 |
| 75% | 660 | 8.8 | 9.6 | 3633 |
| max | 2682 | 9.8 | 10 | 16670 |

Table 2.2 the summary statistics per attribute for the dataset (part 1)

| | Review_Total_Negative_Word_Counts | Review_Total_Positive_Word_Counts | Total_Number_of_Reviews_Reviewer_Has_Given |
|---|---|---|---|
| count | 515212 | 515212 | 515212 |
| mean | 18.540822 | 17.778256 | 7.164895 |
| std | 29.693991 | 21.804541 | 11.039354 |
| min | 0 | 0 | 1 |
| 25% | 2 | 5 | 1 |
| 50% | 9 | 11 | 3 |
| 75% | 23 | 22 | 8 |
| max | 408 | 395 | 355 |

Table 2.2 the summary statistics per attribute for the dataset (part 2)

In total, we have 515,212 rows of data. The maximum hotel average score is 9.8 and the minimum is 5.2. For 75% of the rows, the hotel average score is at most 8.8, and the mean value is about 8.40. For the reviewer score, the maximum is 10 and the minimum is 2.5. The range of the reviewer scores is wider than that of the hotel average scores, because a hotel's rating is the average of the scores given by its customers.

Review_Total_Negative_Word_Counts and Review_Total_Positive_Word_Counts have a minimum value of zero because some reviewers did not give a negative or positive review; in that case the dataset shows 'No Negative' or 'No Positive' and there are no words to count. As can be seen from Table 2.2, negative reviews are on average slightly longer than positive reviews (a mean of 18.5 versus 17.8 words).

At least 25% of the reviews are the reviewer's first review, but we cannot be sure whether these reviewers were using Booking.com for the first time or had used it many times before writing a review. The most active reviewer has given 355 reviews, so we think he or she may be a business traveler or a travel commentator. The minimum total number of reviews of a hotel is 43 and the maximum is 16,670. When the total number of reviews of a hotel is small, there are two possibilities: either the hotel has only recently joined the Booking.com platform, or the hotel is really unpopular. Conversely, when the total number of reviews is large, it may be because the hotel has been using the Booking.com platform for a long time, or because it is very popular.

Figure 2.2 shows the correlation matrix of the different attributes in the dataset. We can easily notice an intuitive positive correlation between the additional number of scoring and the total number of reviews. Apart from this pair, we have no more meaningful findings.


Figure 2.2 the correlation matrix of the different attributes in the dataset
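Summary statistics like those of Table 2.2 and a correlation matrix like that of Figure 2.2 can be obtained with the Pandas methods `describe` and `corr`. The sketch below applies them to a four-row invented sample, so the numbers it prints are not those of the thesis.

```python
import pandas as pd

# Four invented rows using attribute names from Table 2.1.
sample = pd.DataFrame({
    "Average_Score": [8.1, 8.4, 8.8, 9.1],
    "Reviewer_Score": [7.5, 8.8, 9.6, 10.0],
    "Additional_Number_of_Scoring": [169, 342, 660, 1200],
    "Total_Number_of_Reviews": [1161, 2134, 3633, 6000],
})

# count, mean, std, min, quartiles and max for every numerical column.
summary = sample.describe()

# Pairwise Pearson correlation between the numerical columns.
correlations = sample.corr()

print(summary)
print(correlations.round(2))
```

In this sample, as in the real data, the two review-volume attributes move together, which shows up as a correlation close to 1.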

2.2.1 Hotels and reviews

In the dataset, we have 1494 luxury hotels across Europe in total. As we can see in Figure 2.3, 458 hotels (31%) are in France, 400 (27%) in the UK, 211 (14%) in Spain, 162 (11%) in Italy, 158 (11%) in Austria and 105 (7%) in the Netherlands. In the next chapter, we further analyze the location of the hotels (latitude and longitude) and find that the hotels in the data are mainly located in the central areas of six cities (Paris, London, Barcelona, Milan, Vienna and Amsterdam), one in each of the six countries. In our thesis, therefore, hotels in France usually means hotels in Paris.

Figure 2.3 the number of hotels in each country in the data

Figure 2.4 shows the percentage of reviews per country, based on the number of hotel reviews. More than half of the reviews (51%) are for hotels in the UK, while the other countries have similar numbers of reviews: Spain 12%, France 11%, the Netherlands 11%, Austria 8% and Italy 7%. The most reviewed hotel is Britannia International Hotel Canary Wharf, with 4789 reviews.

Figure 2.4 Percentage of reviews per country in the data

2.2.2 Reviewers’ Nationality

As shown in Figure 2.5, over 245,000 reviewers come from the United Kingdom. More than 10,000 reviewers each come from the United States, Australia, Ireland and the United Arab Emirates. It seems that the United Kingdom is the country that contributes the largest number of tourists in Europe.

Figure 2.5 Top 10 Reviewer's Country

Figure 2.5 data (number of reviewers): United Kingdom 245,110; United States of America 35,349; Australia 21,648; Ireland 14,814; United Arab Emirates 10,229; Saudi Arabia 8,940; Netherlands 8,757; Switzerland 8,669; Germany 7,929; Canada 7,883.

As shown in Figure 2.6, when the reviewers' nationalities are grouped by continent, most of the travelers are from Europe, and Asia is second. This means that most of the tourists visiting these European countries are themselves Europeans.

Figure 2.6 Reviewers' Continent Pie Chart

Figure 2.7 shows, for the reviewers of each continent, the share of their reviews left for hotels in each country. In our data, the main destination for people from all over the world is the United Kingdom.

Figure 2.7 the number of reviews left in each country hotel by reviewers of each continent

(Figure 2.7 axes: reviewer continent (Africa, Asia, Europe, North America, Oceania, Other, South America) against share of reviews, 0% to 100%.)

2.2.3 Average Score

Figure 2.8 Distribution of average score

As shown in Figure 2.8, the average score ranges from 5.2 to 9.8, and most hotels have an average score between 8.1 and 9.1.

Figure 2.9 the top 5 hotels based on average scores

Figure 2.9 shows the top 5 hotels by average score: Ritz Paris (France, 9.8), 41 (United Kingdom, 9.6), Hôtel de La Tamise Esprit de France (France, 9.6), H10 Casa Mimosa 4 Sup (Spain, 9.6) and Haymarket Hotel (United Kingdom, 9.6). Ritz Paris is the top hotel, with an average score of 9.8.

2.2.4 Reviewer Score

Figure 2.10 Distribution of Reviewer score

The reviewer score is the score that a reviewer has given to a hotel based on his or her experience there. As shown in Figure 2.10, the reviewer scores are more widely spread than the average scores. There are 115,758 reviews with a score of 10. The others are mainly concentrated at 9.6 (71,110 reviewers), 9.2, 8.8 (58,526 reviewers) and 8.3 (41,090 reviewers).

(Figure 2.10 axes: reviewer score, from 2.5 to 10, against number of reviewers, from 0 to 140,000.)

Figure 2.11 Top 10 hotels based on reviewer's score

As shown in Figure 2.11, the top 10 hotels based on the reviewer‟s score are mainly from the United Kingdom, France and Spain. The top first hotel is the same as the one observed in the previous Figure 2.9. The reviewer's actual rating is more than the hotel's average score.

Figure 2.12 the top 10 Reviewed Countries of reviewers' nationality

We filter based on the average of the number of nationalities of customers, then select countries which more than the average number of nationalities. As shown in Figure 2.12, the top 10 reviewed countries of reviewers' nationality. The score left by the Americans is the highest. 9.73 9.72 9.71 9.69 9.67 9.66 9.65 9.62 9.60 9.60 9.50 9.55 9.60 9.65 9.70 9.75 France Spain United Kingdom France France Spain France Spain United Kingdom United Kingdom Ritz _Paris H o te _l Cas a Cam per 41 H t el d e L a Tami se Esp ri t d e Fran ce Le N ar ci ss e Blan _{c Sp} a H 10 Cas a Mim osa 4 Su p H o te _l Eif fe l Blom et H o te l T h e Se rra s 45 _Park _Lane _Dorc _hest er Col le ctio n Th e Soh o H o te _l Reviewer score 8.79 8.69 8.66 8.59 8.55 8.49 8.46 8.44 8.37 8.31 8 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.8 8.9 U n ite d S ta te s o f Am eric a Isra el N ew Z eala n d Au stra lia Can ad a U n ite d K in gd o m Ire lan d Chin a Sou th Afric a Au stria reviewer score


2.2.5 Total number of reviews reviewer has given

Figure 2.13 Top 10 of total number of reviews reviewer has given

As shown in Figure 2.13, about 29.99% (154,506 / 515,212) of the reviews were left by reviewers writing their first review. However, we cannot tell whether these reviewers were using Booking.com for the first time or had used it many times before leaving their first review. Hotels in the United Kingdom still account for the largest share.
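The 29.99% figure can be checked with a few lines of Python. The per-country counts below are read from the data labels of Figure 2.13 (reviews by reviewers who have given exactly one review); the variable names are illustrative.

```python
# Reviews by first-time reviewers, per hotel country (data labels of Figure 2.13).
first_time = {
    "United Kingdom": 88424,
    "Spain": 17033,
    "Netherlands": 18565,
    "Italy": 6669,
    "France": 16363,
    "Austria": 7452,
}
total_reviews = 515212  # total number of reviews in the dataset

first_time_total = sum(first_time.values())
share = first_time_total / total_reviews
uk_share = first_time["United Kingdom"] / first_time_total

print(f"{first_time_total} / {total_reviews} = {share:.2%}")
print(f"UK share of first-time reviews: {uk_share:.1%}")
```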

2.2.6 Total Number of Reviews

The total number of valid reviews is the number of reviews the hotel has received on Booking.com in total, which is larger than the number of reviews we actually have in our data. It depends on when the hotel opened and when it began to use Booking.com. As shown in Figure 2.14, the first is Hotel Da Vinci in Italy, which has 16,670 valid reviews in total.

Data of Figure 2.13 (number of reviews by hotel country and by the total number of reviews the reviewer has given, 1 to 10):

Hotel country     1      2      3      4      5      6      7     8     9     10
United Kingdom    88424  36661  24673  17581  13545  10944  8703  7473  6179  5290
Spain             17033  8000   5600   4275   3284   2684   2215  1957  1559  1363
Netherlands       18565  7684   5040   3763   2916   2367   1943  1634  1404  1232
Italy             6669   3605   2994   2734   2213   1968   1696  1547  1276  1196
France            16363  7375   5357   4046   3377   2759   2251  1996  1776  1412
Austria           7452   3686   3138   2592   2269   1871   1791  1519  1332  1209

Figure 2.14 the top 10 hotels based on total number of valid reviews

[Figure 2.14: bar chart of the total number of valid reviews. Hotels shown, each with its average score: Hotel Da Vinci 8.1, Park Plaza Westminster Bridge London 8.7, Hotel degli Arcimboldi 8.3, Strand Palace Hotel 8.1, Britannia International Hotel Canary…, Best Western Premier Hotel Couture 8.7, The Student Hotel Amsterdam City 8.7, Golden Tulip Amsterdam West 8.5, DoubleTree by Hilton Hotel London…, Glam Milano 8.8]

2.2.7 The reviews given dates

As shown in Figure 2.15, the number of reviews declined slowly from August 2015. There was a slight increase from the end of 2015 to January 2016. From February 2016 the number of reviews rose month by month, reaching a first peak in May. After a slight decline in June, it went up significantly again and peaked in August. After that, the number of reviews declined, started to recover in November 2016 and reached a small peak in January 2017. The number of reviews began to rise again in February 2017 and peaked in May and July. The data end in early August 2017.

Figure 2.15 the trend in the number of reviews

We find it interesting that the number of reviews was low in both November 2015 and November 2016. From February to May of 2016 and from February to May of 2017, the number of reviews increased steadily. In May 2016 and May 2017, the number of reviews reached a similar peak, and there is a valley in both June 2016 and June 2017. This pattern seems to be seasonal.
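A monthly trend such as the one in Figure 2.15 can be obtained by counting reviews per calendar month. The sketch below is only illustrative: the function name and the month/day/year date format are assumptions (the format matches the week labels shown in this chapter), and the dates are toy values.

```python
from collections import Counter
from datetime import datetime

def monthly_review_counts(dates, fmt="%m/%d/%Y"):
    """Count reviews per (year, month), sorted chronologically."""
    counts = Counter()
    for text in dates:
        day = datetime.strptime(text, fmt)
        counts[(day.year, day.month)] += 1
    return dict(sorted(counts.items()))

# Toy review dates (hypothetical)
toy_dates = ["8/4/2015", "8/20/2015", "9/1/2015", "8/3/2017"]
monthly = monthly_review_counts(toy_dates)
```

Comparing the same month across years (for example November 2015 against November 2016) is then a lookup on the `(year, month)` keys.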

The previous analysis shows that the UK is the destination chosen by most tourists, so we use the United Kingdom as the main target for further analysis. As shown in Figure 2.16 and Figure 2.17, autumn in Europe is often rainy, and rainy days are not suitable for travel, which may be the reason why fewer visitors left reviews in November. Winter (December to February) is the coldest season in the UK, with very low temperatures, frequent ice on the roads and occasional snow.

Figure 2.16 the average monthly rainfall in the UK

Figure 2.17 the average monthly temperatures in the UK

Figure 2.17 shows the average monthly temperatures in the UK. After February, as the temperature gradually rises, the number of tourists gradually increases. The lack of public holidays in June is one reason why the number of reviews around June is so low: people have little time for travelling in June.

Tourists are mainly from Europe. August is a holiday period in Europe and the holidays are long, so there should be more visitors in August than at any other time of year. These reasons led to the largest number of reviews in July and August. In addition, the increase in the number of reviews from the end of each year to the beginning of the next is due to the holiday effect (Christmas and New Year's Day).

Figure 2.18 Monthly reviewer score trend

Comparing the trend in the number of reviews with the monthly reviewer score trend (Figure 2.18), we find some contrasts. When the number of reviews is at its highest, the reviewer score is not high (August 2016). When the number of reviews fluctuates very little, the reviewer scores still vary (from late 2015 to early 2016).

In addition, the reviewer scores peaked in both February 2016 and February 2017. The overall trend is a month-by-month decline after the February peak. There is a slight uptick in August, but the overall trend remains unchanged.

Figure 2.19 the trend in reviewer scores by week

As shown in Figure 2.19, we changed the time resolution to weeks (with Sunday as the first day of the week). The weekly scores fluctuate strongly. There was a sharp drop between the week of 8/28/2016-9/3/2016 and the week of 9/11/2016-9/17/2016: the average reviewer score is about 8.50 in the first of these three weeks but about 8.24 in the third.
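Grouping reviews into weeks that start on Sunday can be sketched as below. The helper names are illustrative and the scores are toy values; only the week boundaries (8/28/2016 to 9/3/2016, and so on) come from the text.

```python
from datetime import date, timedelta

def week_start_sunday(day):
    """Return the Sunday that starts the week containing `day`."""
    # date.weekday() gives Monday=0 ... Sunday=6, so the number of days
    # since the most recent Sunday is (weekday + 1) % 7.
    return day - timedelta(days=(day.weekday() + 1) % 7)

def weekly_mean_scores(reviews):
    """reviews: iterable of (date, score). Returns {week start: mean score}."""
    totals = {}
    for day, score in reviews:
        start = week_start_sunday(day)
        running_sum, count = totals.get(start, (0.0, 0))
        totals[start] = (running_sum + score, count + 1)
    return {start: s / n for start, (s, n) in sorted(totals.items())}

# Toy reviews (hypothetical scores)
toy = [(date(2016, 8, 28), 8.0), (date(2016, 9, 3), 9.0), (date(2016, 9, 4), 7.0)]
weekly = weekly_mean_scores(toy)
```

Saturday 9/3/2016 is assigned to the week starting on Sunday 8/28/2016, which matches the week labels used in Figure 2.19.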

Figure 2.20 the statistic number of reviews for each hotel country during these three weeks

Hotel country     8/28/2016 - 9/3/2016   9/4/2016 - 9/10/2016   9/11/2016 - 9/17/2016
Austria           504                    379                    419
France            776                    617                    573
Italy             472                    410                    375
Netherlands       815                    553                    349
Spain             848                    712                    680
United Kingdom    3237                   2091                   2339

As shown in Figure 2.20, the total number of reviews over all countries decreased week by week. The number of reviews in each country, apart from the UK and Austria, kept decreasing over the three weeks (in the UK and Austria it recovered slightly in the third week). In particular, the number of reviews for hotels in the Netherlands fell by 57% in the third week compared with the first week. The UK still has the largest number of reviews.

Figure 2.21 average review score of hotels in each country during these three weeks

Hotel country     8/28/2016 - 9/3/2016   9/4/2016 - 9/10/2016   9/11/2016 - 9/17/2016
Austria           8.66                   8.39                   8.53
France            8.66                   8.43                   8.34
Italy             8.60                   8.25                   8.38
Netherlands       8.51                   8.63                   8.30
Spain             8.55                   8.45                   8.32
United Kingdom    8.42                   8.36                   8.11

As shown in Figure 2.21, the reviewer scores in all countries except the Netherlands went down in the second week. France, Spain and the UK continued to decline in the third week. In France and the UK in particular, the scores dropped by about 0.31 over the three weeks. Hotels in the Netherlands have their highest score in the second week, while hotels in the other countries have theirs in the first week. Austria and Italy have their lowest score in the second week, and the other countries in the third week.

Figure 2.22 Average score of hotels in each country during these three weeks

Hotel country     8/28/2016 - 9/3/2016   9/4/2016 - 9/10/2016   9/11/2016 - 9/17/2016
Austria           8.56                   8.54                   8.55
France            8.44                   8.40                   8.43
Italy             8.48                   8.41                   8.44
Netherlands       8.40                   8.42                   8.45
Spain             8.57                   8.50                   8.56
United Kingdom    8.34                   8.35                   8.29

As shown in Figure 2.22, the average score of hotels in the UK is the lowest in all three weeks, while Austria and Spain have the highest. The reviewer scores and the hotel average scores differ somewhat and are sometimes not consistent. During this period, Europeans reach the end of the vacation period, and most people are on their way back from vacation or have already ended their vacation. During these three weeks, the decline in the number of visitors, the declining reviewer scores and the declining average scores of the hotels all led to a significant decline in the number of reviews.
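The figures quoted for these three weeks can be cross-checked from the per-country values. The sketch below uses the review counts and reviewer scores read from the data labels of Figures 2.20 and 2.21; because those labels are rounded, the weighted means only approximate the weekly averages of about 8.50 and 8.24 quoted earlier.

```python
countries = ["Austria", "France", "Italy", "Netherlands", "Spain", "United Kingdom"]
weeks = ["8/28/2016 - 9/3/2016", "9/4/2016 - 9/10/2016", "9/11/2016 - 9/17/2016"]

# Number of reviews per country and week (Figure 2.20).
reviews = {
    weeks[0]: [504, 776, 472, 815, 848, 3237],
    weeks[1]: [379, 617, 410, 553, 712, 2091],
    weeks[2]: [419, 573, 375, 349, 680, 2339],
}
# Average reviewer score per country and week (Figure 2.21).
scores = {
    weeks[0]: [8.66, 8.66, 8.60, 8.51, 8.55, 8.42],
    weeks[1]: [8.39, 8.43, 8.25, 8.63, 8.45, 8.36],
    weeks[2]: [8.53, 8.34, 8.38, 8.30, 8.32, 8.11],
}

# Total reviews per week, and review-weighted mean score per week.
totals = {w: sum(reviews[w]) for w in weeks}
weighted = {
    w: sum(n * s for n, s in zip(reviews[w], scores[w])) / totals[w]
    for w in weeks
}

# Relative drop in Dutch reviews between the first and third week.
nl = countries.index("Netherlands")
nl_drop = 1 - reviews[weeks[2]][nl] / reviews[weeks[0]][nl]

for w in weeks:
    print(w, totals[w], round(weighted[w], 2))
print(f"Netherlands drop: {nl_drop:.0%}")
```

The weighted mean is dominated by the UK, which contributes roughly half of the reviews in each week.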

2.2.8 Positive Review and Negative Review

If a review has “No Positive” in the positive review column, we count it only as a negative review, and vice versa. After preliminary statistics, we have a total of 479,308 positive reviews and 387,455 negative reviews. Figure 2.23 and Figure 2.24 show the top 10 hotels by number of positive reviews and by number of negative reviews, respectively. The top ten hotels are the same in both, but the order is slightly different. All of them are in the UK.
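The counting rule can be sketched as follows. The text only names the “No Positive” marker; the symmetric “No Negative” marker, the function name and the toy reviews are assumptions made for illustration.

```python
def count_review_parts(reviews, no_positive="No Positive", no_negative="No Negative"):
    """reviews: iterable of (positive_text, negative_text).

    A review counts as positive if its positive part is not the
    "No Positive" marker, and as negative if its negative part is not
    the (assumed) "No Negative" marker. One review can count as both.
    """
    reviews = list(reviews)
    positive = sum(1 for pos, _ in reviews if pos.strip() != no_positive)
    negative = sum(1 for _, neg in reviews if neg.strip() != no_negative)
    return positive, negative

# Toy reviews (hypothetical)
toy = [
    ("Great location", "No Negative"),
    ("No Positive", "Noisy room"),
    ("Friendly staff", "Small bathroom"),
]
pos_count, neg_count = count_review_parts(toy)
```

Because a review with both parts filled in is counted twice, the two totals can add up to more than the number of reviews, as they do in the data (479,308 + 387,455 > 515,212).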

Figure 2.23 Top 10 hotels with the number of positive reviews

Hotel                                          The number of positive reviews
Britannia International Hotel Canary Wharf     4099
Strand Palace Hotel                            3902
Park Plaza Westminster Bridge London           3874
Copthorne Tara Hotel London Kensington         3238
DoubleTree by Hilton Hotel London Tower of…    2994
Grand Royale London Hyde Park                  2658
Holiday Inn London Kensington                  2433
Intercontinental London The O2                 2428
Hilton London Metropole                        2292
Millennium Gloucester Hotel London             2284

Figure 2.24 Top 10 hotels with the number of negative reviews

Hotel                                          The number of negative reviews
Britannia International Hotel Canary Wharf     4262
Strand Palace Hotel                            3429
Park Plaza Westminster Bridge London           3207
Copthorne Tara Hotel London Kensington         2881
Grand Royale London Hyde Park                  2488
Holiday Inn London Kensington                  2428
DoubleTree by Hilton Hotel London Tower of…    2395
Hilton London Metropole                        2269
Millennium Gloucester Hotel London             2141
Intercontinental London The O2                 1839

As shown in Figure 2.25, the positive-negative ratio is consistent with the star rating. We combine the number of positive reviews (pos) and the number of negative reviews (neg) into the ratio defined in Equation 2.1.

pos_neg = (pos − neg ) ÷ (pos + neg) (2.1)

Figure 2.25 the Positive Negative ratio is coherent with the star rating.

Hotels with a higher average score or reviewer score tend to have a higher positive-negative ratio. This is only a hypothesis, but we will use this indicator to try to find out which hotels are better and which are worse. The scores given by others may affect the score a later customer gives to the hotel, because customers tend to give scores that are not too far from the average score. This is why we try this metric: positive and negative review texts are based on personal experience and description, so they may be less affected by earlier scores.
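Equation 2.1 is straightforward to compute. The sketch below applies it to the overall totals given in this section and to the counts of one hotel read from Figures 2.23 and 2.24; the function name is illustrative.

```python
def pos_neg_ratio(pos, neg):
    """Equation 2.1: (pos - neg) / (pos + neg), a value between -1 and 1."""
    total = pos + neg
    if total == 0:
        raise ValueError("hotel has no positive or negative reviews")
    return (pos - neg) / total

# Whole dataset: 479,308 positive and 387,455 negative reviews.
overall = pos_neg_ratio(479308, 387455)

# Britannia International Hotel Canary Wharf: 4099 positive, 4262 negative.
britannia = pos_neg_ratio(4099, 4262)

print(round(overall, 3), round(britannia, 3))
```

A value of 1 means only positive reviews, -1 only negative reviews, and 0 an equal number of both. The hotel with the most reviews of both kinds ends up slightly below zero, while the dataset as a whole is above zero.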

2.2.9 Review Total Positive Word Counts vs. Review Total Negative Word Counts

We analyze the number of words in each review using the same method as for the number of reviews. As shown in Figure 2.26, hotels with a higher star rating or reviewer score tend to have a higher ratio.

Figure 2.26 the ratio of the hotels with higher star rating and reviewers' score

As shown in Figure 2.27, there is a clear positive correlation between the number of negative reviews and the number of positive reviews of each hotel, and also between the number of words in the negative reviews and the number of words in the positive reviews of each hotel.

Figure 2.27 correlations between the number of negative and positive reviews vs. Correlations between the number of words of negative and positive reviews
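As a small illustration of such a correlation, the Pearson coefficient can be computed for the ten hotels of Figures 2.23 and 2.24. This covers only those ten hotels, not the full dataset behind Figure 2.27, and the hand-written `pearson` helper is an illustrative choice rather than the method used in the thesis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Same hotel order in both lists: Britannia, Strand Palace, Park Plaza,
# Copthorne Tara, DoubleTree, Grand Royale, Holiday Inn, Intercontinental,
# Hilton Metropole, Millennium Gloucester.
positive = [4099, 3902, 3874, 3238, 2994, 2658, 2433, 2428, 2292, 2284]
negative = [4262, 3429, 3207, 2881, 2395, 2488, 2428, 1839, 2269, 2141]

r = pearson(positive, negative)
print(round(r, 2))
```

A strong positive value is expected here, since hotels with many reviews tend to collect many reviews of both kinds.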

Chapter 3 Cluster analysis based on the number of reviews

In this chapter, building on the preceding basic analysis, we group the hotels according to how their number of reviews changes over time. The chapter covers the basic data of each cluster and the key words of the hotel review texts. It also shows hotel density maps of the different clusters for different cities.

We have a total of 515,212 reviews. We group the hotels according to the trend of their monthly number of reviews. Some hotels did not receive reviews in every month of our data period; for these months we set the value to 0. If a hotel has no reviews at all in the first 6 months of our data, we assume that it was not yet open at that time and delete it as an outlier. We finally selected 1,372 hotels with sufficient data for analysis.
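The preparation of the monthly series can be sketched as below: one row per hotel, one column per month, missing months filled with 0, and hotels without any review in the first six months dropped. The function name, the `warmup` parameter and the toy data are illustrative assumptions.

```python
from collections import defaultdict

def build_monthly_matrix(reviews, months, warmup=6):
    """reviews: iterable of (hotel, (year, month)).
    months: chronologically ordered list of (year, month) columns.
    Returns {hotel: [count per month]}, with months without reviews set to 0
    and hotels without reviews in the first `warmup` months removed."""
    column = {month: i for i, month in enumerate(months)}
    rows = defaultdict(lambda: [0] * len(months))
    for hotel, month in reviews:
        rows[hotel][column[month]] += 1
    return {hotel: row for hotel, row in rows.items() if sum(row[:warmup]) > 0}

# Toy example (hypothetical): eight months, August 2015 to March 2016.
months = [(2015, m) for m in range(8, 13)] + [(2016, m) for m in range(1, 4)]
toy_reviews = [
    ("A", (2015, 8)), ("A", (2015, 8)), ("A", (2016, 3)),
    ("B", (2016, 3)),  # first review only in month 8: treated as not yet open
]
matrix = build_monthly_matrix(toy_reviews, months)
```

Each remaining row is the time series that is later passed to the clustering step.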

As shown in Figure 3.1, to identify the best value of k we try values of k from 2 to 50, computing the SSE (sum of squared errors) each time. The elbow method indicates that the optimal k is 7.
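The SSE curve behind the elbow method can be sketched with a small pure-Python k-means (Lloyd's algorithm). This is an illustration under stated assumptions, not the implementation used in the thesis: the random initialisation, the number of restarts and the toy series are all assumptions, and the elbow itself is still read off the curve by eye.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_sse(points, k, iterations=50, seed=0):
    """Run Lloyd's k-means once and return the final SSE."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[i])  # empty cluster: keep old centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

def elbow_curve(points, k_values, restarts=5):
    """Best SSE over a few random restarts, for each k."""
    return {k: min(kmeans_sse(points, k, seed=s) for s in range(restarts))
            for k in k_values}

# Toy monthly series (hypothetical): three groups of similar hotels.
points = [[0, 0, 0], [1, 0, 0], [0, 1, 0],
          [10, 10, 10], [11, 10, 10], [10, 11, 10],
          [0, 10, 0], [1, 10, 0], [0, 11, 0]]
curve = elbow_curve(points, range(1, len(points) + 1))
for k, sse in curve.items():
    print(k, round(sse, 2))
```

The SSE falls quickly while k is below the number of natural groups and only slowly afterwards; the k at this bend is the one the elbow method selects.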

Figure 3.1 Elbow method of k with selected hotels clustering

Figure 3.2 shows the clusters based on the number of reviews over time. We find seven different hotel categories in our data period, and we analyze each cluster further. The collection of reviews began on August 4, 2015, and ended on August 3, 2017. Figure 3.2 shows a huge drop in the number of reviews in August 2017. This is because the data for that month cover only the first 3 days.

3.1 Cluster 0

Figure 3.3 the trend in the number of reviews of Cluster0

As shown in Figure 3.3, cluster0 shows a gradual upward trend. There are 89 hotels and 11,403 reviews in cluster0. Initially, the number of reviews fell gradually after a small peak in the fall of 2015 and hit a trough at the end of 2015. It gradually increased in early 2016. It fell slightly between August and September 2016 but still rose overall. It reached a trough in November 2016 and then rose again in early 2017. After February 2017, the number of reviews increased gradually until the end of the data period.

Figure 3.4 the basic attributes in cluster0

Hotel country     Hotels   Reviewers   Average score   Average reviewer score   Average occupancy per hotel
Austria           5        547         8.53            8.52                     109
France            42       3465        8.3             8.24                     83
Italy             21       2678        7.92            7.79                     128
Netherlands       3        1613        7.63            7.63                     538
Spain             11       1537        8.25            8.31                     140
United Kingdom    7        1563        8.24            8.24                     223

In cluster 0, France has the largest number of hotels and reviews. Hotels in Austria have the lowest number of reviews, but the highest average score and the highest average reviewer score. This suggests that Austrian hotels in this cluster may receive more positive reviews. The average occupancy per hotel (here, the number of reviews divided by the number of hotels) is highest in the Netherlands, as there are only three Dutch hotels in cluster 0 and their number of reviews is not low. However, the average score and the average reviewer score of hotels in the Netherlands are the lowest, which may be because most customers were not satisfied with their accommodation, leading to more negative reviews and thus lower scores.

Figure 3.5 the number of reviews for each hotel country based on reviewers locate continents

Hotel country     Africa   Asia   Europe   North America   Oceania   Other   South America
Austria           5        86     372      48              23        6       7
France            76       601    1879     557             254       52      46
Italy             52       490    1691     191             179       48      27
Netherlands       44       220    1112     147             47        21      22
Spain             38       168    1049     176             68        19      19
United Kingdom    16       213    1115     107             69        30      13

As shown in Figure 3.5, reviewers from every continent mainly comment on hotels in France and Italy, followed by the Netherlands. Most of the reviews come from European tourists.
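The values of Figures 3.4 and 3.5 can be cross-checked against each other. Summing the continent counts of Figure 3.5 gives the number of reviews per country, and dividing by the number of hotels reproduces the "average occupancy per hotel" values shown in Figure 3.4. The sketch below uses the data labels of both figures; reading occupancy as reviews per hotel is an inference from these numbers.

```python
# Reviews per hotel country and reviewer continent in cluster0 (Figure 3.5).
# Column order: Africa, Asia, Europe, North America, Oceania, Other, South America.
reviews_by_continent = {
    "Austria":        [5, 86, 372, 48, 23, 6, 7],
    "France":         [76, 601, 1879, 557, 254, 52, 46],
    "Italy":          [52, 490, 1691, 191, 179, 48, 27],
    "Netherlands":    [44, 220, 1112, 147, 47, 21, 22],
    "Spain":          [38, 168, 1049, 176, 68, 19, 19],
    "United Kingdom": [16, 213, 1115, 107, 69, 30, 13],
}
# Number of hotels per country in cluster0 (Figure 3.4).
hotels = {"Austria": 5, "France": 42, "Italy": 21,
          "Netherlands": 3, "Spain": 11, "United Kingdom": 7}

def round_half_up(numerator, denominator):
    """Integer division rounded half up (Python's round() rounds halves to even)."""
    return (2 * numerator + denominator) // (2 * denominator)

reviews = {country: sum(counts) for country, counts in reviews_by_continent.items()}
occupancy = {country: round_half_up(reviews[country], hotels[country])
             for country in hotels}

print(sum(hotels.values()), sum(reviews.values()))
print(occupancy)
```

The explicit half-up rounding matters for France, where 3465 / 42 is exactly 82.5 and the figure shows 83.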

Figure 3.6 ranking the total number of historical reviews per hotel in all clusters

We use a hotel's total number of historical reviews to determine whether the hotel is a new or an old merchant on Booking.com. As shown in Figure 3.6, the total number of historical reviews of most hotels in all clusters is below 2,000. The average number of historical reviews per hotel is 1,357. In cluster0, the total number of historical reviews is 67,136 and the average per hotel is 754, which is less than 1,357. Therefore, most hotels in cluster0 are new merchants.

We project the latitude and longitude of the hotels onto a map for analysis. The hotels in our data are mainly concentrated in the major cities of the six countries: London, Paris, Milan, Vienna, Amsterdam and Barcelona.

Figure 3.7 Hotel density map for cluster0 vs. Hotel density map for all clusters in Paris

As shown in Figure 3.7, the hotels in France in cluster 0 are mainly located near attractions in central Paris. Several hotels are far from the city center, close to train stations and subway stations.

Figure 3.8 Word Cloud for Positive reviews of hotels in Paris in Cluster0

Figure 3.8 shows the word cloud for the positive reviews of hotels in Paris in cluster0. Guests mainly choose these hotels for the following reasons: room (clean), location (good), staff (friendly and helpful) and breakfast (great).

3.2 Cluster 1

As shown in Figure 3.9, cluster1 generally shows a downward trend. The maximum number of reviews occurs at the beginning of the data period. Although the number of reviews rose slightly over the summer, it continued to decline gradually over time. In cluster1, the historical reviews total 125,540, more than in cluster0. The average number of historical reviews per hotel in cluster1 is 878, lower than the average over all hotels. So most hotels in this cluster are new merchants.

As shown in Figure 3.10, there are 143 hotels in cluster1. France has the most hotels, far more than any other country. The number of reviews in France is slightly higher than in the UK, but because there are only 18 hotels in the UK, the average occupancy per hotel is higher in the UK. Hotels in Spain have the highest average score and the highest average reviewer score, and hotels in Italy have the lowest on both.

Data of Figure 3.10 (the values of the "number of reviewers" panel are not recoverable):

Hotel country     Hotels   Average score   Average reviewer score   Average occupancy per hotel
Austria           7        8.63            8.73                     152
France            79       8.31            8.3                      137
Italy             19       7.94            7.83                     99
Netherlands       6        8.42            8.41                     294
Spain             14       8.89            8.88                     310
United Kingdom    18       8.22            8.22                     462

Figure 3.11 the number of reviews for each hotel country based on reviewers locate continents

As shown in Figure 3.11, reviewers from every continent mainly comment on hotels in France. Most of the reviews come from European, Asian and North American tourists.

Figure 3.12 Hotel density map for cluster1 vs. Hotel density map all clusters in Paris

Compared with the density distribution of all clusters in Paris, cluster1 has a smaller number of hotels, a lower density and more scattered hotel locations, as shown in Figure 3.12. Within cluster 1, Paris still has a large number of hotels. They are mainly gathered around the Arc de Triomphe and the Champs-Élysées, extending to the Place de la Concorde and the Louvre. There are also some hotels near the Paris Opera House and the railway station to the north, as well as near Notre-Dame and the Luxembourg to the south.

We carried out the following investigation to explore the reasons for the hotel distribution in Paris in cluster 1. Since the Charlie Hebdo attack in France in early 2015, terrorist attacks in Europe have continued. Since 2015, there have been 21 terrorist attacks of varying severity in Europe. France has been involved in the largest number of them, 7 in total. On November 13, 2015, a series of terrorist attacks occurred in Paris: on that evening there were 6 shooting incidents, 3 explosions and a hostage-taking. On July 14, 2016, the French National Day, a large attack occurred in Nice, on the Côte d'Azur, when a truck drove into the crowd at the National Day fireworks show. These terrorist incidents caused major casualties. The Champs-Élysées is normally very crowded, but after these incidents many businesses closed, and customers in the clothing stores that stayed open were very rare. This may explain why the number of hotel reviews in Paris in cluster 1 is generally declining while the hotel density is concentrated near the Champs-Élysées.

Although the number of reviews has declined, people still choose to stay in these hotels. This may have something to do with itineraries that had been fixed earlier and could not be changed. Figure 3.13 shows the word cloud for the positive reviews of hotels in France in cluster1. Guests choose to stay for the following reasons: good location, friendly staff, comfortable beds, quiet hotels, delicious breakfast and so on.

Figure 3.13 Word Cloud for Positive reviews of hotels in France in Cluster1

3.3 Cluster 2


As shown in Figure 3.14, the number of hotel reviews in cluster2 follows a wave-like trend. The number of reviews peaked in August 2015. After the summer, it gradually declined until the winter trough. It then increased again in the spring of 2016, peaked in the summer of 2016 and gradually declined after the summer. Reviews again hit a low in the winter. We think these hotels are mainly influenced by season and climate. In cluster2, the total number of historical reviews is 97,580 and the average per hotel is 904, which is less than the average over all hotels but higher than in cluster0 and cluster1.

As shown in Figure 3.15, the average hotel score and the average reviewer score are higher than 8 in every country in cluster2. While France has the highest number of hotels and reviews, the average occupancy per hotel in the UK is much higher than in other countries. This is mainly because the UK has only four hotels in the cluster. No hotel in cluster2 is in the Netherlands.

Data of Figure 3.15:

Hotel country     Reviews   Hotels   Average score   Average reviewer score   Average occupancy per hotel
Austria           2660      15       8.56            8.53                     177
France            8260      52       8.32            8.28                     159
Italy             1663      12       8.21            8.03                     139
Spain             3235      25       8.43            8.45                     129
United Kingdom    2970      4        8.5             8.5                      743

Figure 3.16 the number of comments for each hotel countries and continents where each reviewer

Hotel country     Africa   Asia   Europe   North America   Oceania   Other   South America
Austria           36       496    1623     313             130       36      26
France            267      1803   4026     1229            695       165     75
Italy             23       391    921      152             137       23      16
Spain             82       537    1962     387             180       53      34
United Kingdom    34       147    2378     225             133       43      10

As shown in Figure 3.16, reviewers from every continent mainly comment on hotels in France. Most of the reviews come from European and Asian tourists.

Figure 3.17 Hotel density map for cluster2 vs. Hotel density map all clusters in Paris

As shown in Figure 3.17, the hotels are located near the Avenue des Champs-Élysées in the 8th arrondissement in the center of Paris, close to the sights of the 1st arrondissement. Others are mainly near the southern railway station (Gare Montparnasse) and the southeastern railway station (Gare de Paris-Austerlitz). In addition, there are some hotels near subway stations far away from the city center. In cluster2, hotels in these places are more popular during the summer than in other months.

Figure 3.18 Word Cloud for Positive reviews of hotels in Paris in Cluster2

Figure 3.18 shows the word cloud for the positive reviews of hotels in Paris in cluster2. The hotels in these areas are praised for their good location and rooms, friendly staff and helpful concierge. The guests' reviews frequently use words such as great, excellent, amazing, beautiful and fantastic.

3.4 Cluster 3

As shown in Figure 3.19, the overall change is not significant. Similar to the trend of cluster2, the number of reviews gradually increased before the summer and declined after the summer. The difference is that in cluster3 there were very few reviews in November 2016 and February 2017, while the number of reviews in December 2016 and January 2017 formed a small peak. The total number of historical reviews of the hotels in cluster 3 is 375,486, higher than in the previous clusters. The average number of historical reviews per hotel is 1,380, indicating that the hotels in cluster 3 are already mature merchants on Booking.com.

As shown in Figure 3.20, there are 272 hotels in cluster3. The total number of reviews in cluster3 is larger than in the first three clusters. Apart from the UK, whose scores are below 8, hotels in the other countries score above 8.3. The Netherlands has the highest average occupancy per hotel. The UK has the highest number of hotel reviews and France has the highest number of hotels.

Data of Figure 3.20 (the values of the "number of reviews" and "average occupancy per hotel" panels are not recoverable):

Hotel country     Hotels   Average score   Average reviewer score
Austria           44       8.59            8.61
France            75       8.33            8.36
Italy             37       8.56            8.51
Netherlands       16       8.43            8.49
Spain             49       8.49            8.52
United Kingdom    51       7.99            7.92

Figure 3.21 the number of comments for each hotel countries and continents where each reviewer

In cluster3, African tourists mainly stay in hotels in France and Spain, as shown in Figure 3.21. Asian tourists mainly visit Italy and Spain. European tourists are mainly in Spain and the United Kingdom. North American tourists are mainly in France and Spain. Oceanian tourists are mainly in France. South American tourists are mainly in Italy, Spain and Austria. There are not many tourists to the Netherlands, but the Netherlands has the highest average occupancy per hotel. This suggests that most visitors to the Netherlands leave reviews.

Figure 3.22 Hotel density map for cluster3 vs. Hotel density map all clusters in Paris

As shown in Figure 3.22, the density distribution of hotels in cluster 3 is similar to the density distribution of hotels in all clusters in Paris. In some places, the density is sparse because of the small number of hotels.

Figure 3.23 Hotel density map for cluster3 vs. Hotel density map all clusters in Vienna

As shown in Figure 3.23, the hotel density distribution of Vienna in cluster 3 is similar to that of all clusters in Vienna. The sparse density in some places is due to the small number of hotels.

Figure 3.24 Hotel density map for cluster3 vs. Hotel density map all clusters in Milan

As shown in Figure 3.24, Milan's hotels are mainly concentrated in the historical center of Milan, near the main attractions and the train stations.

To compare positive and negative evaluations clearly, we use a "thumbs-up" picture for the word cloud of the positive reviews and the opposite gesture for the negative reviews. The positive reviews of hotels in cluster 3 mainly mention the good hotel and location, friendly staff, clean rooms and breakfast. Because these hotels are near stations and attractions, the positive words often include the train station and the metro, which the guests found convenient, as shown in Figure 3.25.

Figure 3.25 Word Cloud for Positive reviews vs. Word Cloud for Negative reviews of Milano in Cluster3

As shown in Figure 3.25, the key words "street", "noisy", "room" and "small" appear in the negative reviews, very likely because most of the hotels are concentrated near the train station. The guests are also dissatisfied with the hotels' bathrooms and network. Hotels can make improvements based on the key themes of the negative reviews.

3.5 Cluster 4

In cluster4, the number of reviews alternately rises and falls every month or every 2 months. However, the overall trend is downward, as shown in Figure 3.26. The total number of historical reviews is 548,354 and the average number of historical reviews per hotel is 1,642. The total number of historical reviews is large, so these hotels are more mature merchants on Booking.com.
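The new-versus-mature comparison used for clusters 0 to 4 can be reproduced from the totals quoted in this chapter. In the sketch below, the hotel count of cluster2 (108) is obtained by summing the per-country hotel counts of its figure rather than taken from the text, and the labels "mostly new" and "mature" are illustrative shorthand for the comparison with the overall average of 1,357.

```python
OVERALL_MEAN = 1357  # average number of historical reviews per hotel, all clusters

# (total historical reviews, number of hotels) for each cluster.
clusters = {
    "cluster0": (67136, 89),
    "cluster1": (125540, 143),
    "cluster2": (97580, 108),   # hotel count summed from the cluster2 figure
    "cluster3": (375486, 272),
    "cluster4": (548354, 334),
}

summary = {}
for name, (historical_reviews, hotels) in clusters.items():
    mean = historical_reviews / hotels
    label = "mostly new" if mean < OVERALL_MEAN else "mature"
    summary[name] = (round(mean), label)

for name, (mean, label) in summary.items():
    print(f"{name}: {mean} historical reviews per hotel ({label})")
```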


As shown in Figure 3.27, there are 334 hotels in cluster4. The average score and the average reviewer score of the hotels are both quite high. For hotels in Italy in particular, the average reviewer score is much higher than the average score. The UK hotels, however, are the most prominent in the data, with 181 of the 334 hotels.

Data of Figure 3.27 (the values of the "number of reviews" panel are not recoverable):

Hotel country     Hotels   Average score   Average reviewer score
Austria           37       8.61            8.65
France            53       8.34            8.3
Italy             7        8.19            8.37
Netherlands       35       8.56            8.45
Spain             21       8.63            8.55
United Kingdom    181      8.34            8.35

Figure 3.28 the number of comments for each hotel countries and continents where each reviewer
