HotelReview Corpus

The HotelReview Corpus is a collection of 1000 reviews extracted from booking.com, where each review contains the following information:

  • The city where the hotel is located, the reviewer's nationality, the date the review was written, and the type of reviewer, chosen from a set of 7 categories such as solo traveler, young couple, and group.
  • A score from 0 to 10 describing the reviewer's overall opinion. This score is not given by the reviewer; booking.com calculates it automatically from the ratings the reviewer assigns to 6 aspects: Hotel staff, Services/facilities, Cleanliness of your room, Comfort, Value for money, and Location. Unfortunately, these disaggregated scores are not available in the reviews.
  • A brief free-text description, given separately, of what the reviewer liked and disliked during the stay at the hotel.

In many reviews, the score assigned by booking.com bears no relation to the text describing the user's opinion. For this reason, each review has been manually tagged with a 5-class category from the set [Excellent, Good, Fair, Poor, Very poor] and with a 3-class category from the set [Good, Fair, Poor]. The corpus is suited to Polarity Classification (using the positive opinion and the negative opinion of each review) or to Rating Inference (using the 1000 reviews and the two sets of categories). Also, because each review includes considerable information about the reviewer, it is possible to study the effect of such data on the opinions expressed.
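
The two uses described above can be sketched in Python as follows. This is a minimal illustration, not the authors' pipeline: the Review class and its field names (liked, disliked, label5, label3) are hypothetical stand-ins for whatever the corpus actually provides. For Polarity Classification, each review contributes up to two examples (the liked text as positive, the disliked text as negative); for Rating Inference, each review contributes one example labelled with one of the two manual category sets.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Review:
    # Hypothetical field names; the real corpus schema may differ.
    liked: str      # free text: what the reviewer liked
    disliked: str   # free text: what the reviewer disliked
    label5: str     # manual tag: Excellent, Good, Fair, Poor, Very poor
    label3: str     # manual tag: Good, Fair, Poor


def polarity_examples(reviews: List[Review]) -> List[Tuple[str, str]]:
    """Turn each review into up to two (text, polarity) examples."""
    examples = []
    for r in reviews:
        if r.liked.strip():
            examples.append((r.liked.strip(), "positive"))
        if r.disliked.strip():
            examples.append((r.disliked.strip(), "negative"))
    return examples


def rating_examples(reviews: List[Review], scheme: str = "5") -> List[Tuple[str, str]]:
    """Turn each review into one (text, category) example.

    scheme selects the 5-class ("5") or 3-class ("3") manual tags.
    """
    if scheme not in ("3", "5"):
        raise ValueError("scheme must be '3' or '5'")
    examples = []
    for r in reviews:
        # Join the non-empty liked/disliked texts into a single review text.
        parts = [p for p in (r.liked.strip(), r.disliked.strip()) if p]
        label = r.label5 if scheme == "5" else r.label3
        examples.append((" ".join(parts), label))
    return examples
```

With 1000 reviews, polarity_examples yields at most 2000 examples (fewer if some reviews leave one of the two texts empty), while rating_examples always yields exactly one example per review.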

If you use the corpus, please cite:

  • Jorge Carrillo de Albornoz, Laura Plaza, Pablo Gervás, Alberto Díaz. 2011. A Joint Model of Feature Mining and Sentiment Analysis for Product Review Rating. In Proceedings of the 33rd European Conference on Information Retrieval (ECIR 2011).

The HotelReview Corpus is available for research purposes in XML format.
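
Since the corpus is distributed as XML, it can be read with Python's standard xml.etree.ElementTree module. The sketch below is only an assumption about how such a file might be organised: the element names (corpus, review, city, nationality, reviewer_type, score, liked, disliked) and the sample record are invented for illustration and are not taken from the corpus, so they should be adapted to the actual file.

```python
import xml.etree.ElementTree as ET

# Invented sample in a hypothetical schema; NOT taken from the corpus.
SAMPLE_XML = """
<corpus>
  <review>
    <city>Madrid</city>
    <nationality>Spain</nationality>
    <reviewer_type>Solo traveler</reviewer_type>
    <score>7.5</score>
    <liked>Friendly staff.</liked>
    <disliked>Small room.</disliked>
  </review>
</corpus>
"""


def parse_reviews(xml_text):
    """Parse review elements into a list of dicts keyed by child tag name."""
    root = ET.fromstring(xml_text.strip())
    reviews = []
    for node in root.iter("review"):
        # Map every child element to its stripped text content.
        record = {child.tag: (child.text or "").strip() for child in node}
        # Convert the 0-10 score to a number when it is present.
        if record.get("score"):
            record["score"] = float(record["score"])
        reviews.append(record)
    return reviews


if __name__ == "__main__":
    for review in parse_reviews(SAMPLE_XML):
        print(review["city"], review["score"], review["liked"])
```

To read a file from disk instead of a string, ET.parse(path).getroot() can replace the ET.fromstring call.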