Stanford University · Computer Science

Large Movie Review Dataset

A binary sentiment classification benchmark introduced in Maas et al., ACL 2011. Also known as aclImdb or simply IMDB.

This dataset contains 50,000 polarized movie reviews scraped from IMDB — 25,000 for training and 25,000 for testing — plus 50,000 additional unlabeled reviews for unsupervised or semi-supervised learning. Reviews are labeled only when strongly positive (score ≥ 7/10) or strongly negative (score ≤ 4/10), so the resulting classification task is challenging but unambiguous. Both raw text and pre-processed bag-of-words formats are provided.
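The labeling rule above (positive at ≥ 7/10, negative at ≤ 4/10, neutral reviews excluded) can be sketched as a small helper. The function name is illustrative; the distributed dataset already ships with labels assigned:

```python
def rating_to_label(score: int):
    """Map a 1-10 IMDB star rating to a binary sentiment label.

    Returns 1 (positive) for score >= 7, 0 (negative) for score <= 4,
    and None for neutral reviews (5-6), which the dataset excludes
    from the labeled splits.
    """
    if score >= 7:
        return 1
    if score <= 4:
        return 0
    return None
```

Excluding the neutral middle band is what makes the task "polarized": every labeled review carries a clearly positive or clearly negative rating.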

In the years since release the dataset has become one of the standard text-classification benchmarks in NLP, used to evaluate landmark models including ULMFiT, ELMo, BERT, RoBERTa, XLNet, ALBERT, and DistilBERT. It ships in Hugging Face Datasets, TensorFlow Datasets, Keras, and PyTorch-NLP.

50,000
Labeled reviews
7,920
Citations of the source paper
179K
Hugging Face downloads / month
1,547+
Models on Hugging Face

Download

aclImdb_v1.tar.gz ~84 MB compressed · ~133 MB extracted

Large Movie Review Dataset v1.0 — raw text reviews plus pre-tokenized bag-of-words features.

See the included README for full file layout, class balance, and a description of the bag-of-words encoding.
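As a sketch of how the raw-text layout is typically consumed, assuming the standard directory structure described in the README (`aclImdb/<split>/pos/*.txt` and `aclImdb/<split>/neg/*.txt`; the function name is illustrative):

```python
import os

def load_split(root, split="train"):
    """Yield (text, label) pairs from an extracted aclImdb directory.

    Assumes the layout described in the dataset README:
    <root>/<split>/pos/*.txt and <root>/<split>/neg/*.txt,
    with label 1 for "pos" and 0 for "neg".
    """
    for label_name, label in (("pos", 1), ("neg", 0)):
        folder = os.path.join(root, split, label_name)
        for fname in sorted(os.listdir(folder)):
            if not fname.endswith(".txt"):
                continue
            with open(os.path.join(folder, fname), encoding="utf-8") as f:
                yield f.read(), label
```

The unsupervised reviews live in a separate `train/unsup` directory and would need their own pass, since they carry no label.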

Splits

Split          Reviews   Labels
Train          25,000    balanced pos / neg
Test           25,000    balanced pos / neg
Unsupervised   50,000    unlabeled

Fields

text
raw movie review (string)
label
0 = negative, 1 = positive

Mirrors & library integrations

Available via Hugging Face Datasets, TensorFlow Datasets, Keras, and PyTorch-NLP.

Used as a benchmark by

A non-exhaustive list of landmark papers and models that use this dataset to evaluate text-classification or representation-learning quality. Modern transformer models reach 95–97% test accuracy on the binary task; the original 2011 paper reported 88.89%.

BERT · Devlin et al., NAACL 2019 · pretraining benchmark
ULMFiT · Howard & Ruder, ACL 2018 · transfer-learning showcase
ELMo · Peters et al., NAACL 2018 · contextualized embeddings
RoBERTa · Liu et al., 2019 · replication study of BERT
XLNet · Yang et al., NeurIPS 2019 · permutation-based pretraining
ALBERT · Lan et al., ICLR 2020 · parameter-efficient BERT
DistilBERT · Sanh et al., 2019 · distillation benchmark
Sentence-BERT · Reimers & Gurevych, EMNLP 2019 · sentence embeddings
Paragraph Vectors · Le & Mikolov, ICML 2014 · document representations
NLP-progress · Sebastian Ruder · sentiment-analysis SOTA tracking

Citation

If you use this dataset, please cite the ACL 2011 paper:

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.
               and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for
               Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}

Download .bib file · Read the paper (PDF)

Contact

Questions or comments about the dataset? Email amaas [at] cs.stanford.edu.