A binary sentiment classification benchmark introduced in Maas et al., ACL 2011. Also known as aclImdb or simply IMDB.
This dataset contains 50,000 highly polarized movie reviews scraped from IMDB (25,000 for training and 25,000 for testing), plus 50,000 additional unlabeled reviews for unsupervised or semi-supervised learning. Reviews are labeled only when clearly positive (score ≥ 7/10) or clearly negative (score ≤ 4/10); neutral reviews are excluded, so the task is challenging but the labels are unambiguous. Both raw text and pre-processed bag-of-words formats are provided.
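The labeling rule above can be sketched as a small function (the name `label_for_rating` is illustrative, not part of the dataset's tooling):

```python
def label_for_rating(rating):
    """Map a 1-10 IMDB star rating to 'pos', 'neg', or None (unlabeled).

    Per the dataset's labeling rule: only clearly polarized reviews
    receive a sentiment label; neutral scores (5-6) are excluded.
    """
    if rating >= 7:
        return "pos"
    if rating <= 4:
        return "neg"
    return None
```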
Since its release, the dataset has become one of the standard text-classification benchmarks in NLP, used to evaluate landmark models including ULMFiT, ELMo, BERT, RoBERTa, XLNet, ALBERT, and DistilBERT. It ships in Hugging Face Datasets, TensorFlow Datasets, Keras, and PyTorch-NLP.
Large Movie Review Dataset v1.0 — raw text reviews plus pre-tokenized bag-of-words features.
See the included README for full file layout, class balance, and a description of the bag-of-words encoding.
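As a rough sketch of the pre-tokenized format: the bag-of-words files use a LIBSVM-style line encoding, one review per line, where the leading integer is the star rating and each `index:count` pair indexes into the vocabulary file. The parser below is a minimal illustration under that assumption; consult the included README for the authoritative description.

```python
def parse_bow_line(line):
    """Parse one LIBSVM-style bag-of-words line: '<rating> <idx>:<count> ...'.

    Returns (rating, {word_index: count}); word indices refer to
    positions in the dataset's vocabulary file.
    """
    rating_str, *feats = line.split()
    counts = {}
    for feat in feats:
        idx, count = feat.split(":")
        counts[int(idx)] = int(count)
    return int(rating_str), counts

# Example: a review rated 9/10 that uses word 0 twice and word 5 once:
# parse_bow_line("9 0:2 5:1") -> (9, {0: 2, 5: 1})
```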
| Split | Reviews | Labels |
|---|---|---|
| Train | 25,000 | balanced pos / neg |
| Test | 25,000 | balanced pos / neg |
| Unsupervised | 50,000 | unlabeled |
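A minimal sketch of loading the raw-text splits from the extracted archive, assuming the standard `aclImdb/` layout with one `.txt` file per review under `train/pos`, `train/neg`, `test/pos`, and `test/neg` (the function name `load_split` is illustrative):

```python
from pathlib import Path


def load_split(root, split):
    """Return (texts, labels) for the 'train' or 'test' split.

    Labels are 1 for positive and 0 for negative; reviews are read
    from <root>/<split>/pos and <root>/<split>/neg respectively.
    """
    texts, labels = [], []
    for label_name, label in (("pos", 1), ("neg", 0)):
        for path in sorted(Path(root, split, label_name).glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels


# Usage (assumes the archive was extracted to ./aclImdb):
# train_texts, train_labels = load_split("aclImdb", "train")
# Each labeled split holds 25,000 reviews, balanced 12,500 / 12,500.
```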
Many landmark papers and models (including those listed above) use this dataset to evaluate text-classification or representation-learning quality. Modern transformer models reach 95–97% test accuracy on the binary task; the original 2011 paper reported 88.89%.
If you use this dataset, please cite the ACL 2011 paper:
```bibtex
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and
               Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for
               Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}
```
Questions or comments about the dataset? Email amaas [at] cs.stanford.edu.