Translation to text files:
==========================

- The following is a description of the preprocessing applied to the
  NIPS 2000-2003 papers.

- The online pdf files from 2003 and 2002 were downloaded, together with
  the ps.gz files from 2001 and 2000. See exceptions below.

- Text was extracted from pdf files using pdftotext version 2.02pl1
  (Copyright 1996-2003 Glyph & Cog, LLC), and from ps files using
  pstotext (version 1.8g of 25 January 2000).

- The file NIPS2003_.pdf was renamed NIPS2003_CN13.pdf.

- The following pdf files were replaced by online versions from the
  internet, because the pdf file in the NIPS database could not be
  converted to text.
    NIPS2001: VS02 AP08 IM02
    NIPS2002: AA11 AA12 AA53 NS11 CS10 NS22 IM10
    NIPS2003: CS03

- The following 2002 and 2003 files had their text extracted from the ps
  version, using pstotext (version 1.8g of 25 January 2000), because
  pdftotext did not produce readable text.
    NIPS2003: AA45
    NIPS2002: CS02 CS07 AA13 AA14 AA19 AA21 AA22 AA27 AA43 AA50 AA54
              AA57 AA63 VS05 VS17 LT07 LT09 LT14 LT18 NS09 NS17 NS20
              CN06 CN09 SP* AP*

- The papers
    NIPS2000/EdelmanIntrator
    NIPS2000/Brown
    NIPS2000/CadezSmith
    NIPS2000/ZengerKoch
    NIPS2001/AA09
    NIPS2001/IM06
    NIPS2001/NS13
    NIPS2002/AA10
    NIPS2002/LT15
    NIPS2002/SP01
    NIPS2003/NS08
  were excluded from the data because translation to text failed and no
  readable version was found online.

- NIPS2001/IM01 was excluded from the data, since the version in the
  NIPS repository could not be translated and extraction of text from
  the online version was blocked.

- Continuation lines were unified (words hyphenated across a line break
  were rejoined), and ligature characters were fixed using:

    foreach file(`ls *.txt`)
      cat $file \
      | perl -pe 's/-\n//g' \
      | perl -pe 's/\\214/fi/g' \
      | perl -pe 's/\\256/fi/g' \
      | perl -pe 's/\\257/fl/g' \
      > tmp
      mv tmp $file;
    end

Calculating counts
==================

- Word counts were extracted using the bow package
  (http://www-2.cs.cmu.edu/~mccallum/bow/), with the processing used by
  Sam Roweis in processing the NIPS 1-12 data.
    rainbow -d $NIPSDIR/Rainbow -i $NIPSDIR/NIPS200?/Texts -D5 --use-stoplist --shortest-word=3
    rainbow -d $NIPSDIR/Rainbow -i $NIPSDIR/NIPS200?/Texts -D5 --use-stoplist --shortest-word=3 -B=sin | sort +0 -1 > counts
    rainbow -d $NIPSDIR/Rainbow -i $NIPSDIR/NIPS200?/Texts -D5 --use-stoplist --shortest-word=3 -B=a | head -1 > wordlist

Counts and author names were unified with the NIPS 1-12 data prepared by
Roweis. Author names were checked manually for overlaps and multiple
spellings.
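
For readers without csh or perl at hand, the text cleanup step from the
first section (joining hyphenated line breaks and fixing ligatures) can be
sketched in Python. This is an illustration only, not the script that was
used to build the data. The names clean_text, clean_directory and LIGATURES
are hypothetical, and the sketch assumes, as the perl patterns suggest, that
the ligatures appear in the extracted text as literal backslash escapes such
as \214. The latin-1 file encoding is also an assumption.

```python
from pathlib import Path

# Assumption: ligatures show up as literal backslash-plus-octal escapes
# in the extracted text, which is what the perl patterns match.
LIGATURES = {
    r"\214": "fi",
    r"\256": "fi",
    r"\257": "fl",
}


def clean_text(text: str) -> str:
    """Rejoin words hyphenated across line breaks and expand ligature escapes."""
    # Counterpart of s/-\n//g: drop a hyphen that is immediately followed
    # by a newline, which glues the two halves of the word together.
    text = text.replace("-\n", "")
    for escape, letters in LIGATURES.items():
        text = text.replace(escape, letters)
    return text


def clean_directory(directory: str) -> None:
    """Apply clean_text in place to every *.txt file in a directory."""
    for path in sorted(Path(directory).glob("*.txt")):
        # latin-1 is an assumed encoding; it maps every byte to a character,
        # so reading and writing back cannot fail on unusual bytes.
        raw = path.read_text(encoding="latin-1")
        path.write_text(clean_text(raw), encoding="latin-1")
```

As with the perl version, a genuinely hyphenated compound that happens to
break at the end of a line is also joined into one word; the sketch does not
try to tell the two cases apart.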