Introduction
In this article, we discuss a large movie review dataset for binary sentiment classification. It was collected from IMDb (https://www.imdb.com/) and is hosted by Stanford University. The dataset contains 50,000 movie reviews labeled as positive or negative according to the sentiment of the review, divided evenly into training and test sets of 25,000 reviews each. It also includes an additional 50,000 unlabeled documents for unsupervised learning.
Dataset Information
Positive Reviews
There are 25,000 reviews labeled as positive, split evenly between the training and test sets. A positive review has a score >= 7 out of 10. File names have the form [id]_[rating].txt, so in the aclImdb\train\pos directory the numeric suffix of every file name is >= 7. For example, the file 10_8.txt holds a review that received a star rating of 8/10 on IMDb.
Negative Reviews
There are 25,000 reviews labeled as negative, split evenly between the training and test sets. A negative review has a score <= 4 out of 10. In the aclImdb\train\neg directory, the numeric suffix of every file name is <= 4. For example, the file 10_3.txt holds a review that received a star rating of 3/10 on IMDb.
The reviews with more neutral ratings are not included in the train/test sets.
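The file-naming convention and the rating thresholds described above can be sketched in a few lines of Python. This is only an illustration: the function names parse_review_filename and label_from_rating are hypothetical helpers, not part of the dataset or of any library.

```python
from pathlib import Path


def parse_review_filename(path):
    """Split a file name such as '10_8.txt' into (review_id, rating)."""
    stem = Path(path).stem  # '10_8'
    review_id, rating = stem.split("_")
    return int(review_id), int(rating)


def label_from_rating(rating):
    """Map a star rating to a label using the thresholds in the text."""
    if rating >= 7:
        return "pos"
    if 1 <= rating <= 4:
        return "neg"
    # Any other value falls outside both labeled ranges.
    return None


review_id, rating = parse_review_filename("10_8.txt")
label = label_from_rating(rating)
```

Deriving the label from the rating in this way is redundant for files already sorted into pos and neg directories, but it is a convenient sanity check when loading the data.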
Unlabeled Reviews
The unlabeled set contains 50,000 reviews that are not labeled as positive or negative. These reviews are intended for unsupervised learning and are stored in the aclImdb\train\unsup directory. Because no score is given for them, the numeric suffix of every file name is 0, for example 10_0.txt.
Test Dataset Reviews
The test set contains 25,000 reviews, which are used to evaluate the performance of a model. The aclImdb\test directory contains pos and neg directories with the same structure as those in the train directory.
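Since the train and test directories share the same pos/neg layout, one loader can serve both. The following is a minimal sketch, assuming the archive has already been extracted; load_split is a hypothetical helper name, and the 1/0 label encoding is a choice made here for illustration.

```python
from pathlib import Path


def load_split(split_dir):
    """Return (texts, labels) read from the pos and neg subdirectories."""
    texts, labels = [], []
    for subdir, label in (("pos", 1), ("neg", 0)):
        # sorted() gives a deterministic (lexicographic) file order.
        for path in sorted((Path(split_dir) / subdir).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels


# Example usage, assuming the dataset was extracted into the current directory:
# train_texts, train_labels = load_split("aclImdb/train")
# test_texts, test_labels = load_split("aclImdb/test")
```

Only the pos and neg subdirectories are read, so the unlabeled reviews under the train directory are ignored by this loader.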
Tokenized BoW
In addition to the review text files, the dataset includes already-tokenized bag-of-words (BoW) features that were used in the Stanford experiments. These are stored in .feat files in the train and test directories. Each .feat file is in LIBSVM format, an ASCII sparse-vector format for labeled data. The feature indices in these files start from 0, and the token corresponding to each feature index is found in imdb.vocab. So an entry 0:7 on a line of a .feat file means that the first word in imdb.vocab appears 7 times in that review.
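To make the index:count notation concrete, here is a small sketch that parses one LIBSVM-style line and maps the feature indices back to tokens. The helper names are hypothetical, the three-word vocabulary is a toy stand-in for the lines of imdb.vocab, and the sample line is invented for illustration. The first field of the line is treated as an integer label, which is an assumption about the files rather than something stated above.

```python
def parse_feat_line(line):
    """Parse one LIBSVM-style line into (label, {feature_index: count})."""
    label, *pairs = line.split()
    counts = {}
    for pair in pairs:
        index, value = pair.split(":")
        counts[int(index)] = int(value)
    return int(label), counts


def to_word_counts(counts, vocab):
    """Replace each 0-based feature index with the token at that position."""
    return {vocab[index]: count for index, count in counts.items()}


# Toy vocabulary standing in for imdb.vocab (one token per line in the real file).
vocab = ["the", "and", "a"]

# Invented sample line: label 9, feature 0 seen 7 times, feature 2 seen once.
label, counts = parse_feat_line("9 0:7 2:1")
word_counts = to_word_counts(counts, vocab)
```

With the real data, the vocabulary list could be built by reading imdb.vocab line by line, so that list position matches feature index.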
Download the Dataset
The complete dataset can be downloaded from the following link: https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz. Once you download and extract the archive, you can see the main dataset structure as shown below:
aclImdb
    imdb.vocab
    train
        pos
        neg
        unsup
    test
        pos
        neg
Each .txt file contains the text of a single movie review.
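The download-and-extract step can also be scripted. The sketch below uses only the Python standard library; the function names download and extract and the local file name are choices made here for illustration, and the calls that would trigger the actual download are left as comments.

```python
import tarfile
import urllib.request
from pathlib import Path

DATASET_URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"


def download(url, archive_path):
    """Download the archive unless a file with that name already exists."""
    archive_path = Path(archive_path)
    if not archive_path.exists():
        urllib.request.urlretrieve(url, str(archive_path))
    return archive_path


def extract(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return that directory."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(dest_dir)
    return dest_dir


# Example usage (performs a large download):
# archive = download(DATASET_URL, "aclImdb_v1.tar.gz")
# extract(archive, ".")
```

Skipping the download when the archive is already present makes the script safe to rerun.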