This project also comes from one of my Business Analytics courses. The datasets my teammate and I used are from the Yelp website, and they are also available on Kaggle. We selected only the Business and Reviews datasets. My teammate built a prediction model using the Business dataset, while I did sentiment analysis using the Reviews dataset, so the work in this post is my own. The Business dataset is interesting in its own right, and the data preprocessing alone took my teammate a lot of time. I use R as the programming language for this project.
There are two main steps to building a model: data preparation and model selection. Overall, I use tidymodels for this analysis. For data preparation, I use textrecipes for the text normalization steps, such as tokenizing the text into 1-grams and 2-grams, removing stop words, removing special characters and symbols, imposing minimum frequency constraints, and so forth. For the imbalanced classes, I use the SMOTE algorithm to resample the data. I also use the workflow function to bundle the preprocessing (recipe) and the model together, which helps me manage modeling pipelines more easily. After all the preprocessing is done, I train three machine learning models (Logistic Regression, Lasso Regression, and Random Forest) on the preprocessed data and compare their effectiveness at predicting sentiment.
Data Preparation The raw dataset contains approximately seven million reviews of more than 200,000 businesses. I join the Reviews dataset with the Business dataset to extract only the reviews for restaurants. Due to limited computational power, we selected 30,000 reviews for our dataset. It has four columns: Business_id, Stars, Text_review, and a newly added column, Rating. Rating is derived from Stars, which ranges from 1 to 5: negative (1 to 2 stars) and positive (3 to 5 stars). Therefore, this is a binary classification problem.
Distribution First, we need to understand our data; for example, how the ratings are distributed.

Fig.1 (left) shows the distribution of ratings in our original dataset. Fig.1 (right) shows the distribution of ratings after grouping them into two classes. By calculation, there are nearly four times as many positive reviews as negative ones. Both graphs show that a significant number of customers give positive reviews.
Evaluation Metric We use Accuracy, Precision, Recall, and F-score as the evaluation metrics for measuring and comparing model performance:

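For reference, the standard definitions of these metrics, in terms of the confusion-matrix counts (TP, TN, FP, FN), are:

```latex
\text{Accuracy}  = \frac{TP + TN}{TP + TN + FP + FN} \qquad
\text{Precision} = \frac{TP}{TP + FP} \qquad
\text{Recall}    = \frac{TP}{TP + FN} \qquad
F = 2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}
```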
From the rating distribution graphs above, we can see that our class ratio (positive/negative) is about 75/25. This is a slightly imbalanced-classes problem, but not a severe one. Later, we will examine whether a completely balanced dataset improves model performance. For the imbalanced training set, we will use SMOTE to oversample the rare class (negative).
Data Preprocessing
For every model, we perform the same data preprocessing. First, we use tidymodels to split our dataset into training (75%) and testing (25%) sets. We then use textrecipes for the following recipe steps:
Remove special characters and symbols, such as \n, \, and all other punctuation
Tokenize the text into 1-gram or 2-gram tokens
Remove 175 English stop words, such as "the", "and", "to"
Impose a minimum frequency of 100 and a maximum frequency of 1,000, since rare words are not very useful for prediction
Weight the tokens by term frequency-inverse document frequency (tf-idf)
Normalize (center and scale) all the newly created tf-idf values, because the models I am using are sensitive to feature scale
Create a new column, Rating, based on the original Stars column: Rating is "Positive" when Stars is 3 to 5 and "Negative" when Stars is 1 to 2
Resample the training set with SMOTE; I also train without resampling in order to compare performance
The training set ends up with 22,501 data points.
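The split and recipe steps above can be sketched with tidymodels and textrecipes as follows. The data frame name `reviews` and the seed are illustrative, and the exact step arguments are assumptions rather than the project's actual code:

```r
library(tidymodels)
library(textrecipes)  # text preprocessing steps for recipes
library(themis)       # step_smote() for oversampling

set.seed(123)
# Derive the binary Rating label from Stars, as described above
reviews <- reviews %>%
  mutate(Rating = factor(if_else(Stars >= 3, "Positive", "Negative")))

review_split <- initial_split(reviews, prop = 0.75, strata = Rating)
train <- training(review_split)
test  <- testing(review_split)

review_rec <- recipe(Rating ~ Text_review, data = train) %>%
  step_tokenize(Text_review) %>%                                   # 1-gram tokens
  step_stopwords(Text_review) %>%                                  # drop English stop words
  step_tokenfilter(Text_review, min_times = 100, max_times = 1000) %>%
  step_tfidf(Text_review) %>%                                      # tf-idf weighting
  step_normalize(all_predictors()) %>%                             # center and scale
  step_smote(Rating)                                               # oversample the rare class
```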
Methodology After the standardization work, I now fit the models to the dataset. I compare the performance of Lasso Regression and Random Forest. Lasso Regression here is a penalized logistic regression, which imposes a penalty on the coefficients that is useful when the dataset is noisy. Another benefit of Lasso Regression is that it automatically selects the most contributive predictor variables for us, forcing the coefficients of the less contributive variables to exactly zero and thereby keeping only the most significant variables in the model (Kassambara, 2018). Random Forest, on the other hand, is an ensemble of decision trees, usually trained with the "bagging" method; the general idea of bagging is that a combination of learning models improves the overall result (Donges, 2019). Logistic Regression and Random Forest are two of the most popular machine learning models for classification and regression problems, and both are easy to use and understand.
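The Lasso specification and its workflow bundle can be sketched as below; the glmnet engine and the object names (`review_rec` for the recipe built earlier) are assumptions:

```r
library(tidymodels)

# mixture = 1 gives the pure lasso (L1) penalty in glmnet
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# Bundle preprocessing and model into one workflow object
lasso_wf <- workflow() %>%
  add_recipe(review_rec) %>%
  add_model(lasso_spec)
```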
1. Lasso Regression Fitting the model without choosing the best penalty or resampling, with only cross-validation, I get an overall accuracy of 83%, which is not bad. However, the predictions for negative sentiment are poor and the false positive rate is high. This is not the expected result.

2. Lasso Regression---model improvement: Fitting the model with the best penalty (0.00234), resampling, and cross-validation, I get an overall accuracy of 88%, a five-percentage-point improvement over the previous model. Moreover, this model predicts negative sentiment much better and has a lower false positive rate. This is the expected result for the analysis.
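Selecting the best penalty with 10-fold cross-validation can be sketched as below; the grid range and object names (`lasso_wf`, `train`, `review_split` from the earlier sketches) are assumptions:

```r
library(tidymodels)

set.seed(123)
folds <- vfold_cv(train, v = 10, strata = Rating)
grid  <- grid_regular(penalty(range = c(-4, 0)), levels = 30)  # log10 scale

lasso_tuned <- tune_grid(lasso_wf, resamples = folds, grid = grid,
                         metrics = metric_set(accuracy, recall, precision, f_meas))

best_penalty <- select_best(lasso_tuned, metric = "accuracy")

# Refit on the full training set and evaluate once on the test set
final_fit <- last_fit(finalize_workflow(lasso_wf, best_penalty), review_split)
collect_metrics(final_fit)
```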


3. Random Forest
I now fit a random forest model to our dataset to compare its performance with Lasso Regression. I tune to get the best parameters, resample the training set with SMOTE, and use 10-fold cross-validation. The overall accuracy of the random forest model is 87%, which is good, but this model also has a higher false positive rate and weaker predictions for negative sentiment. Therefore, the improved Lasso Regression model is the better performer.
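A comparable random forest workflow can be sketched as below; the ranger engine, the number of trees, and the tuning setup are assumptions:

```r
library(tidymodels)

rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_recipe(review_rec) %>%   # same preprocessing recipe as before
  add_model(rf_spec)

set.seed(123)
folds <- vfold_cv(train, v = 10, strata = Rating)
rf_tuned <- tune_grid(rf_wf, resamples = folds, grid = 10,
                      metrics = metric_set(accuracy, f_meas))
show_best(rf_tuned, metric = "accuracy")
```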

Another reason Random Forest is not chosen for this analysis is that its overall training and testing time is about three times longer than Lasso Regression's. Since Lasso Regression with the improvement methods is the preferable model, I use it for the further examinations below.
Balanced Classes Method I set the two classes (positive and negative) in the training set to a 50/50 ratio, then train the best Lasso Regression model (Lasso Regression---model improvement) and predict on the test set, which keeps its imbalanced classes. Surprisingly, the overall performance measures such as accuracy, recall, precision, and F-score are no better than those from the imbalanced training set. With the SMOTE resampling strategy and 10-fold cross-validation, the imbalanced training set beats the overall performance of the balanced one.

Fig. 7 shows that a balanced training set does not give us higher overall accuracy or F-score than an imbalanced training set. More importantly, with the balanced training set I get a higher false positive rate and a lower true negative rate, which I do not want for the purpose of this analysis. I want better predictions for negative sentiment and a lower false positive rate; only then do restaurants have a good chance to improve the quality of their food, service, and so forth. Hence, I forgo the balanced training set method.
Number of N-Grams Method I use the best Lasso Regression model (Lasso Regression---model improvement) along with the n-grams method to see whether using n-grams can improve performance. As an example of what n-grams means, take the text "the service is bad":
Four 1-gram tokens: "the", "service", "is", "bad"
Three 2-gram tokens: "the service", "service is", "is bad"
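In textrecipes, the 2-gram variant only changes the tokenization part of the recipe; a sketch, with argument values assumed:

```r
library(tidymodels)
library(textrecipes)

rec_2gram <- recipe(Rating ~ Text_review, data = train) %>%
  step_tokenize(Text_review) %>%
  step_stopwords(Text_review) %>%
  step_ngram(Text_review, num_tokens = 2, min_num_tokens = 2) %>%  # build 2-grams from the tokens
  step_tokenfilter(Text_review, min_times = 100, max_times = 1000) %>%
  step_tfidf(Text_review)
```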
For my experiment, I compare the performance of the best Lasso Regression model with 1-grams and with 2-grams. It turns out that the model with 2-grams performs worse overall than the model with 1-grams: it has lower overall accuracy and a higher false positive rate. Therefore, I stick with the 1-gram version of the best Lasso Regression model.

Best Model Summary---Lasso Regression After evaluation, I end up using Methodology 2 (Lasso Regression with the best penalty) as the final model. This model achieves 88% accuracy, 89% recall, 95% precision, and a 92% F-score.

From this model, I can identify which words predict positive and negative sentiment. For example, the graph below shows the top 50 positive and negative words, respectively.

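These word lists can be recovered from the fitted glmnet coefficients; a sketch, assuming the `final_fit` object produced when tuning the penalty (note that which sign corresponds to "positive" depends on the factor-level ordering of Rating):

```r
library(tidymodels)

word_coefs <- final_fit %>%
  extract_fit_parsnip() %>%            # the underlying glmnet model
  tidy() %>%                           # one row per term with its coefficient
  filter(term != "(Intercept)") %>%
  arrange(desc(estimate))

head(word_coefs, 50)   # words pushing predictions toward one class
tail(word_coefs, 50)   # words pushing predictions toward the other
```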
The performance of the model looks decent; however, I could improve it with many other methods. One suggestion for future work is to try other classifiers, such as neural networks or XGBoost models. Furthermore, I could predict negative and positive words by restaurant type (Italian, Asian, Mexican, and so forth) to provide a more insightful analysis, if the dataset contained sufficient reviews for each type of restaurant. In this dataset, the Category variable, which represents restaurant type, is not well formatted: many restaurants set their type as, say, Chinese&Japanese&Korean together, which is a bit confusing.
Recommendation Sentiment analysis can give restaurants a brief understanding of overall sentiment based on the predicted negative and positive words. With this model, restaurants can identify customers' opinions; for example, perhaps great reviews are primarily due to good food quality, while unsatisfied reviews are caused by overpriced items. From the top 50 negative words, I can see that food, price, order waiting time, staff service, and the indoor environment are what customers care about most, so restaurants can pay attention to these areas for improvement. Moreover, restaurants can perform their own predictions with this model and dig into more detail of how customers comment by extracting more words.