OCR dataset

This dataset contains handwritten words collected by Rob Kassel at the MIT Spoken Language Systems Group. I selected a "clean" subset of the words, then rasterized and normalized the image of each letter. Since the first letter of each word was capitalized and the rest were lowercase, I removed the first letter and used only the lowercase letters. The tab-delimited data file (letter.data.gz) contains one line per letter, with its label, pixel values, and several additional fields listed in the letter.names file.
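As a rough illustration of reading a gzip-compressed, tab-delimited file such as letter.data.gz, the following Python sketch yields each line as a list of string fields. The function name read_rows is hypothetical, and dropping empty trailing fields is a defensive assumption in case lines end with a tab. It does not interpret the fields.

```python
import gzip


def read_rows(path):
    """Yield each non-empty line of a gzip-compressed, tab-delimited file
    as a list of strings. (Hypothetical helper, not part of the dataset.)"""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            fields = line.split("\t")
            # Assumption: a trailing tab may leave empty fields at the end.
            while fields and fields[-1] == "":
                fields.pop()
            yield fields
```

A call such as `rows = list(read_rows("letter.data.gz"))` would load every letter into memory, assuming the file sits in the working directory.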

Fields

id: each letter is assigned a unique integer id
letter: a-z
next_id: id for next letter in the word, -1 if last letter
word_id: each word is assigned a unique integer id (not used)
position: position of letter in the word (not used)
fold: 0-9 -- cross-validation fold
p_i_j: 0/1 -- value of pixel in row i, column j
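
The fields above could be parsed along the following lines. This is a minimal sketch that assumes the columns appear in the order listed here, with the six metadata fields first and all remaining columns being pixels. The letter.names file is the authoritative source for the order. The names META_FIELDS, parse_row, and spell_words are hypothetical. Pixels are kept as a flat list because the image dimensions are not stated in this description.

```python
# Assumed column order, taken from the field listing; check letter.names.
META_FIELDS = ("id", "letter", "next_id", "word_id", "position", "fold")


def parse_row(fields):
    """Convert one row of string fields into a dict with typed values."""
    record = {}
    for name, value in zip(META_FIELDS, fields):
        # Every metadata field except the letter label is an integer.
        record[name] = value if name == "letter" else int(value)
    # Everything after the metadata is treated as a 0/1 pixel value.
    record["pixels"] = [int(v) for v in fields[len(META_FIELDS):]]
    return record


def spell_words(records):
    """Rebuild words by following next_id links from each starting letter."""
    by_id = {r["id"]: r for r in records}
    # A letter starts a word if no other letter points to it.
    pointed_to = {r["next_id"] for r in records if r["next_id"] != -1}
    words = []
    for r in records:
        if r["id"] in pointed_to:
            continue
        letters = []
        current = r
        while current is not None:
            letters.append(current["letter"])
            # next_id of -1 marks the last letter; no record has that id.
            current = by_id.get(current["next_id"])
        words.append("".join(letters))
    return words
```

Since the capitalized first letter of each word was removed, strings rebuilt this way lack their initial letter. The fold field can be used in the same way to split records into training and test sets, for example by holding out all records with one fold value.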