Building Sentiment-Aware Word Embeddings from IMDb Reviews: A Step-by-Step Guide

Introduction

In natural language processing, word vectors that capture sentiment information are invaluable for tasks like opinion mining, product analysis, and customer feedback classification. Traditional word embeddings (e.g., Word2Vec, GloVe) learn semantic relationships from large corpora but often ignore sentiment polarity. This article demonstrates how to create sentiment-aware word representations by combining unsupervised semantic learning with supervised star rating signals from IMDb movie reviews, then using a linear SVM for classification.

Conceptual Overview

The core idea is to augment standard word embedding training with a sentiment objective. Instead of learning vectors purely from co‑occurrence statistics, we incorporate the star rating (1–5) associated with each review as a weak label. The resulting vectors are not only semantically meaningful but also encode positive/negative sentiment direction.

Dataset Preparation

IMDb Reviews and Star Ratings

We use the IMDb movie review dataset, which contains 50,000 reviews, each paired with a numeric score from 1 (worst) to 10 (best). We map these onto a 5‑star scale for simplicity: ratings 1–2 → 1 star, 3–4 → 2 stars, 5–6 → 3 stars, 7–8 → 4 stars, and 9–10 → 5 stars. (In the standard release of the dataset, neutral scores of 5–6 are excluded, so the middle bucket is empty in practice.) This provides a finer‑grained sentiment signal than a binary label.
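
A minimal helper makes the bucketing explicit (a sketch; the original post may implement the mapping differently):

```python
def to_five_star(rating: int) -> int:
    """Map an IMDb score (1-10) onto a 1-5 star scale.

    1-2 -> 1, 3-4 -> 2, 5-6 -> 3, 7-8 -> 4, 9-10 -> 5.
    """
    return (rating + 1) // 2
```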

Text Preprocessing

Reviews are tokenized, lowercased, and stripped of punctuation. Common stop words are retained because they carry sentiment cues (e.g., not, very). We also filter out words that appear fewer than 5 times to reduce noise.
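
Here is one plausible implementation of that preprocessing, assuming a simple regex tokenizer (the original post's exact tokenizer may differ):

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase and keep runs of letters/apostrophes, dropping punctuation.
    Stop words are deliberately kept ("not", "very" carry sentiment)."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocab(reviews: list[str], min_count: int = 5) -> set[str]:
    """Keep only words that appear at least `min_count` times in the corpus."""
    counts = Counter(tok for review in reviews for tok in tokenize(review))
    return {word for word, n in counts.items() if n >= min_count}
```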

Learning Sentiment‑Aware Word Vectors

Combining Semantic Learning with Star Ratings

We adopt a modified Word2Vec skip‑gram model that jointly learns word embeddings and a sentiment projection. For each word w in a review, the model predicts both its context words and its review’s star rating. The loss function is a weighted sum of the negative sampling loss (for semantic context) and a cross‑entropy loss (for rating prediction). The hyperparameter λ balances the two objectives.
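
The combined objective can be illustrated with a minimal PyTorch sketch; the tensors below are dummy stand‑ins for one batch, and the λ value is an assumption:

```python
import torch
import torch.nn.functional as F

batch = 32
pos_score = torch.randn(batch)         # dot(center, observed context word)
neg_score = torch.randn(batch, 5)      # dot(center, 5 sampled negatives)
rating_logits = torch.randn(batch, 5)  # rating-head output over 5 star classes
stars = torch.randint(0, 5, (batch,))  # 0-indexed star label per review

lam = 0.5  # λ: relative weight of the sentiment objective (assumed value)

# Negative-sampling loss: pull true pairs together, push negatives apart.
semantic_loss = -(F.logsigmoid(pos_score)
                  + F.logsigmoid(-neg_score).sum(dim=1)).mean()
# Cross-entropy on the review's star rating.
sentiment_loss = F.cross_entropy(rating_logits, stars)

loss = semantic_loss + lam * sentiment_loss
```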

Training Procedure

We base the implementation on Gensim's Word2Vec, adding the rating‑prediction branch manually. Training runs for 10 epochs with a window size of 5, an embedding dimension of 100, and 5 negative samples per positive pair. The rating‑prediction head is a linear layer followed by a softmax over the 5 star classes. After training, the final word vectors are taken from the embedding layer.
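
Since Gensim does not expose a hook for an auxiliary prediction head, one way to realize the joint model is a small PyTorch re‑implementation. The sketch below is hypothetical (class and argument names are ours, not the original repository's):

```python
import torch
import torch.nn as nn

class SentimentSkipGram(nn.Module):
    """Skip-gram with negative sampling plus a 5-class rating head."""

    def __init__(self, vocab_size: int, dim: int = 100, n_stars: int = 5):
        super().__init__()
        self.in_embed = nn.Embedding(vocab_size, dim)   # center vectors (kept)
        self.out_embed = nn.Embedding(vocab_size, dim)  # context vectors
        self.rating_head = nn.Linear(dim, n_stars)      # softmax via CE loss

    def forward(self, center, context, negatives, stars, lam=0.5):
        v = self.in_embed(center)                       # (B, dim)
        pos = (v * self.out_embed(context)).sum(-1)     # (B,)
        neg = torch.bmm(self.out_embed(negatives),      # (B, k, dim)
                        v.unsqueeze(-1)).squeeze(-1)    # (B, k)
        semantic = -(nn.functional.logsigmoid(pos)
                     + nn.functional.logsigmoid(-neg).sum(-1)).mean()
        sentiment = nn.functional.cross_entropy(self.rating_head(v), stars)
        return semantic + lam * sentiment

# After training, the word vectors live in the input embedding:
# vectors = model.in_embed.weight.detach().numpy()
```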

Classification with Linear SVM

To evaluate the quality of the learned vectors, we use them as features for a binary sentiment classification task (positive vs. negative). We average the vectors of all words in each review to obtain a document embedding, then train a linear support vector machine (SVM) on these averaged vectors.
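
A sketch of the averaging step, assuming a gensim KeyedVectors‑style lookup named kv:

```python
import numpy as np

def review_embedding(tokens: list[str], kv) -> np.ndarray:
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)
```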

Why Linear SVM?

Linear SVMs are fast, interpretable, and perform well in high‑dimensional feature spaces. When the word vectors already encode polarity, a linear decision boundary is sufficient to separate positive from negative reviews.

Implementation Details

We split the IMDb dataset into 40,000 training reviews and 10,000 test reviews. Using scikit‑learn’s LinearSVC, we train with default parameters. The model achieves an accuracy of approximately 87% on the test set, outperforming standard Word2Vec (which scores around 82%) by a clear margin. This improvement confirms that sentiment‑aware vectors capture task‑relevant information.
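
The evaluation pipeline can be sketched as follows; the arrays here are random stand‑ins with the article's shapes, so the printed accuracy will be near chance rather than the reported 87%:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Stand-in data with the article's split: 40,000 train / 10,000 test
# reviews, each a 100-dimensional averaged embedding (random here).
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(40_000, 100)), rng.integers(0, 2, 40_000)
X_test, y_test = rng.normal(size=(10_000, 100)), rng.integers(0, 2, 10_000)

clf = LinearSVC()  # scikit-learn defaults, as in the article
clf.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```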

Results and Discussion

Qualitative Analysis

Examining the nearest neighbors of sentiment‑laden words reveals the effect: the vector for “excellent” is close to “brilliant” and “outstanding,” while “terrible” is near “awful” and “dreadful.” Moreover, the embedding space shows a clear positive‑negative axis, which standard embeddings lack.
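
Such neighbor queries can be run by loading the trained vectors into gensim's KeyedVectors. Here, vocab (a list of words) and vectors (an n_words × 100 array) are assumed outputs of the training step above:

```python
from gensim.models import KeyedVectors

kv = KeyedVectors(vector_size=100)
kv.add_vectors(vocab, vectors)

for probe in ("excellent", "terrible"):
    print(probe, "->", [w for w, _ in kv.most_similar(probe, topn=3)])
```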

Comparison to Baselines

We compare against three baselines:

  • Random vectors: 50% accuracy
  • Standard Word2Vec: 82% accuracy
  • GloVe (trained on web data): 79% accuracy

The sentiment‑aware method consistently outperforms these, demonstrating the value of injecting star rating supervision during embedding learning.

How to Reproduce This Project

To replicate the experiments, follow these steps:

  1. Download the IMDb Large Movie Review Dataset from the Stanford AI Lab (a download sketch follows this list).
  2. Preprocess and tokenize all reviews.
  3. Train sentiment‑aware word vectors using the modified Word2Vec code (available in the original repository).
  4. Average vectors per review and train a linear SVM via scikit‑learn.
  5. Evaluate on the held‑out test set.
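
For step 1, a minimal download‑and‑extract sketch (the URL below is the dataset's well‑known location at the time of writing):

```python
import tarfile
import urllib.request

URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

urllib.request.urlretrieve(URL, "aclImdb_v1.tar.gz")
with tarfile.open("aclImdb_v1.tar.gz") as tar:
    tar.extractall(".")  # unpacks into ./aclImdb/{train,test}/{pos,neg}
```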

Full Python reproduction code is linked in the original Towards Data Science post.

Conclusion

Combining unsupervised semantic learning with supervised star ratings produces word vectors that are both semantically rich and sentiment‑aware. Using linear SVM classification, we demonstrate a significant improvement over standard embeddings on IMDb review polarity detection. This approach is simple, scalable, and can be adapted to other domains where weak rating signals exist.

This article is a rewrite and expansion of the original post “Learning Word Vectors for Sentiment Analysis: A Python Reproduction” published on Towards Data Science.
