OCR dataset

This dataset contains handwritten words collected by Rob Kassel at the MIT Spoken Language Systems Group. I selected a "clean" subset of the words, then rasterized and normalized the image of each letter. Since the first letter of each word was capitalized and the rest were lowercase, I removed the first letter and used only the lowercase letters. The tab-delimited data file (letter.data.gz) contains one line per letter, with its label, pixel values, and several additional fields listed in the letter.names file.


  1. id: each letter is assigned a unique integer id
  2. letter: a-z
  3. next_id: id for next letter in the word, -1 if last letter
  4. word_id: each word is assigned a unique integer id (not used)
  5. position: position of letter in the word (not used)
  6. fold: 0-9 -- cross-validation fold
  7. p_i_j: 0/1 -- value of pixel in row i, column j
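
As a rough illustration of the format, the following Python sketch reads the file and follows the next_id links to rebuild a word. It assumes the columns appear in the order listed above, with all pixel values after the fold column. The names Letter, parse_line, read_letters, and spell_word are hypothetical helpers invented for this example and are not part of the dataset. The image dimensions are not assumed; every remaining field on a line is treated as a pixel.

```python
import gzip
from typing import Dict, Iterator, List, NamedTuple


class Letter(NamedTuple):
    """One line of the data file (hypothetical container for this sketch)."""
    id: int
    letter: str
    next_id: int
    word_id: int
    position: int
    fold: int
    pixels: List[int]


def parse_line(line: str) -> Letter:
    """Split one tab-delimited line into its fields."""
    fields = line.rstrip("\n").split("\t")
    # Assumption: a line may end with a trailing tab, leaving empty fields.
    while fields and fields[-1] == "":
        fields.pop()
    return Letter(
        id=int(fields[0]),
        letter=fields[1],
        next_id=int(fields[2]),
        word_id=int(fields[3]),
        position=int(fields[4]),
        fold=int(fields[5]),
        pixels=[int(value) for value in fields[6:]],  # p_i_j values, 0 or 1
    )


def read_letters(path: str = "letter.data.gz") -> Iterator[Letter]:
    """Yield one Letter per non-empty line of the gzipped data file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_line(line)


def spell_word(letters: Dict[int, Letter], first_id: int) -> str:
    """Follow next_id links from first_id until the -1 end marker."""
    chars = []
    current = first_id
    while current != -1:
        record = letters[current]
        chars.append(record.letter)
        current = record.next_id
    return "".join(chars)
```

A possible use, once the file has been downloaded, is to build an index with `letters = {rec.id: rec for rec in read_letters()}` and then call `spell_word(letters, some_id)` with the id of a word's first stored letter. Because the capitalized first letter of each word was removed, the rebuilt string is the word without its initial letter.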