Hotel Review Corpus


The HotelReview Corpus is a corpus of 1000 hotel reviews, each of which contains the following information:

  • The city where the hotel is located, the reviewer's nationality, the date when the review was written, and the type of reviewer, drawn from a set of 7 categories such as solo traveler, young couple and group.
  • A score in the range 0-10 describing the overall opinion of the reviewer. This score is not given by the reviewer directly; it is calculated automatically from the ratings the reviewer assigned to 6 aspects: Hotel staff, Services/facilities, Cleanliness of your room, Comfort, Value for money and Location. Unfortunately, these disaggregated ratings are not available in the reviews.
  • A brief free-text description of what the reviewer liked and, separately, what the reviewer disliked during the stay in the hotel.

Because in many reviews the score assigned to the review bears no relation to the text describing the user's opinion, each review has been manually tagged with a 5-class category from the set [Excellent, Good, Fair, Poor, Very poor] and with a 3-class category from the set [Good, Fair, Poor]. The corpus is suitable for Polarity Classification (using the positive opinion and the negative opinion of each review) and for Rating Inference (using the 1000 reviews and the two sets of categories). Also, as each review includes a good deal of information about the reviewer, it is possible to study the effect of such data on the opinions expressed.
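The two uses described above can be sketched in Python. This is only an illustration under assumptions: the `Review` class, its field names, and the helper functions are hypothetical and do not reflect the tag names used in the corpus XML, and the sample review is invented rather than taken from the corpus.

```python
from dataclasses import dataclass

# Label sets as listed in the corpus description.
FIVE_CLASS = ["Excellent", "Good", "Fair", "Poor", "Very poor"]
THREE_CLASS = ["Good", "Fair", "Poor"]


@dataclass
class Review:
    # Hypothetical field names; the corpus XML may name these differently.
    city: str
    nationality: str
    date: str
    reviewer_type: str
    score: float      # automatically calculated score, 0-10
    liked: str        # free text: what the reviewer liked
    disliked: str     # free text: what the reviewer disliked
    label5: str       # manual tag, one of FIVE_CLASS
    label3: str       # manual tag, one of THREE_CLASS


def polarity_examples(reviews):
    """Yield (text, polarity) pairs, one per non-empty liked/disliked text."""
    for r in reviews:
        if r.liked.strip():
            yield r.liked.strip(), "positive"
        if r.disliked.strip():
            yield r.disliked.strip(), "negative"


def rating_examples(reviews, scheme="5"):
    """Yield (text, label) pairs for rating inference with 3 or 5 classes."""
    if scheme not in ("3", "5"):
        raise ValueError("scheme must be '3' or '5'")
    for r in reviews:
        # Join the liked and disliked texts into one document per review.
        parts = [p.strip() for p in (r.liked, r.disliked) if p.strip()]
        label = r.label5 if scheme == "5" else r.label3
        yield " ".join(parts), label


# Invented sample record, for illustration only (not from the corpus).
sample = Review(
    city="Madrid",
    nationality="United Kingdom",
    date="2010-05-01",
    reviewer_type="solo traveler",
    score=6.5,
    liked="Friendly staff.",
    disliked="Noisy room.",
    label5="Fair",
    label3="Fair",
)

polarity = list(polarity_examples([sample]))
rating5 = list(rating_examples([sample], scheme="5"))
```

In this sketch each review contributes up to two instances to Polarity Classification but exactly one instance to Rating Inference, where the manual tag, not the 0-10 score, serves as the target label.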

If you use the corpus, please cite:

(2011). A Joint Model of Feature Mining and Sentiment Analysis for Product Review Rating. Advances in Information Retrieval - 33rd European Conference on IR Research, ECIR 2011, Dublin, Ireland, April 18-21, 2011. Proceedings.


The HotelReview Corpus is available for research purposes in XML format:

Download Hotel Review Corpus

Jorge Carrillo-de-Albornoz
Associate Professor and Researcher in Language Technologies

My research interests include Natural Language Processing, Machine Learning and Systems Evaluation.