People Playing Musical Instrument (PPMI)
A dataset of human and object interaction activities

The PPMI dataset contains images of humans interacting with twelve different musical instruments. They are: bassoon, cello, clarinet, erhu, flute, French horn, guitar, harp, recorder, saxophone, trumpet, and violin.

  • Images of bassoon, erhu, flute, French horn, guitar, saxophone, and violin were collected by Bangpeng Yao and published in [Yao and Fei-Fei, 2010].
  • Images of cello, clarinet, harp, recorder, and trumpet were collected by Aditya Khosla and released in September 2010.
  • More annotations of this data set and some baseline results will be coming soon ...
     
Download
  The 7-class dataset: [Download original images, 392MB]  [Download normalized images, 58MB]
  The 12-class dataset: [Download original images, 833MB]  [Download normalized images, 100MB]


Dataset Reference
  Bangpeng Yao and Li Fei-Fei. Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.


What's new in PPMI
  • Different interactions with the same object. For each instrument, there are images that contain a person playing the instrument (PPMI+), as well as images that contain a person holding the instrument without playing it (PPMI-).
     [Figure: example images for PPMI+ and PPMI-]

  • Original and cropped & normalized images. For each image, we also crop the neighborhood of the face(s) of the "target person(s)", then rescale the crop so that each human face is 32-by-32 pixels.
     [Figure: two example pairs, each showing an original image next to its normalized image]

  • Real-world images with background clutter. All images were downloaded from the Internet. Sources include the image search engines Google, Yahoo, Baidu, and Bing, and the photo hosting websites Flickr, Picasa, and Photobucket.
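
The face-based normalization described in the list above can be sketched as a small geometry computation. This is only an illustration: the page states the target face size (32-by-32 pixels) but not how large the cropped neighborhood is, so the `margin` parameter, the `Box` type, and the function name below are all hypothetical.

```python
from dataclasses import dataclass

TARGET_FACE_SIZE = 32  # pixels; the face size stated in the dataset description


@dataclass
class Box:
    x: float       # left edge, in pixels
    y: float       # top edge, in pixels
    width: float
    height: float


def normalization_plan(face: Box, margin: float = 3.0):
    """Return (crop, scale, output_side) for one target face.

    ASSUMPTION: `margin` is a hypothetical parameter. The crop is a square
    that reaches `margin` face sizes from the face centre in every direction.
    The neighborhood size actually used for PPMI is not given on this page.
    """
    face_side = max(face.width, face.height)
    scale = TARGET_FACE_SIZE / face_side      # resize factor that makes the face 32 px
    cx = face.x + face.width / 2              # face centre
    cy = face.y + face.height / 2
    half = margin * face_side
    # The crop may extend past the image border (negative coordinates);
    # a real pipeline would need to clip or pad it.
    crop = Box(cx - half, cy - half, 2 * half, 2 * half)
    output_side = round(crop.width * scale)   # side of the normalized image
    return crop, scale, output_side
```

With a 64-by-64 face, for example, the scale factor is 0.5, so the crop is shrunk to half its size and the face ends up 32 pixels wide.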
     
Image statistics
 Original images:
   Instrument      PPMI+ train   PPMI+ test  PPMI- train   PPMI- test
   bassoon                  85           87           83           81
   cello                    94           93           89           89
   clarinet                 88           91           86           79
   erhu                     96           95           71           77
   flute                    89           88           69           64
   French horn              91           88           78           71
   guitar                  100          100           98           90
   harp                    100           99           95           98
   recorder                 85           87           71           75
   saxophone                99           99           83           86
   trumpet                  97           95           91           87
   violin                   89           96           83           84

 Normalized images:
   Instrument      PPMI+ train   PPMI+ test  PPMI- train   PPMI- test
   bassoon                 100          100          100          100
   cello                   100          100          100          100
   clarinet                100          100          100          100
   erhu                    100          100          100          100
   flute                   100          100          100          100
   French horn             100          100          100          100
   guitar                  100          100          100          100
   harp                    100          100          100          100
   recorder                100          100          100          100
   saxophone               100          100          100          100
   trumpet                 100          100          100          100
   violin                  100          100          100          100
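
As a quick consistency check, the per-class counts of original images listed above can be tallied into split totals. The sketch below simply re-enters the numbers from the statistics table; the dictionary layout and function name are illustrative choices, not part of the dataset release.

```python
# Per instrument: (PPMI+ train, PPMI+ test, PPMI- train, PPMI- test),
# copied from the original-image statistics above.
ORIGINAL_COUNTS = {
    "bassoon": (85, 87, 83, 81),
    "cello": (94, 93, 89, 89),
    "clarinet": (88, 91, 86, 79),
    "erhu": (96, 95, 71, 77),
    "flute": (89, 88, 69, 64),
    "French horn": (91, 88, 78, 71),
    "guitar": (100, 100, 98, 90),
    "harp": (100, 99, 95, 98),
    "recorder": (85, 87, 71, 75),
    "saxophone": (99, 99, 83, 86),
    "trumpet": (97, 95, 91, 87),
    "violin": (89, 96, 83, 84),
}

# Every normalized subset holds exactly 100 images.
NORMALIZED_COUNTS = {name: (100, 100, 100, 100) for name in ORIGINAL_COUNTS}


def split_totals(counts):
    """Return (train_total, test_total) summed over PPMI+ and PPMI-."""
    train = sum(c[0] + c[2] for c in counts.values())
    test = sum(c[1] + c[3] for c in counts.values())
    return train, test


if __name__ == "__main__":
    print("original:", split_totals(ORIGINAL_COUNTS))
    print("normalized:", split_totals(NORMALIZED_COUNTS))
```

For the counts above this gives 2110 training and 2099 test images (4209 in total) for the original set, and 2400 training and 2400 test images for the normalized set.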



Baseline Results
This section contains baseline results on two tasks:
  • 24-class Classification Task
    All PPMI+ and PPMI- images from the 12 different musical instruments forming 24 classes
  • 12 Binary Classification Tasks
    PPMI+ vs PPMI- for each of the 12 different musical instruments
The experimental setting is described below.
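
The two task definitions above can be written down as label mappings. The following is a minimal sketch: the class ordering, the `(image_id, instrument, interaction)` tuple layout, and the function names are assumptions made for illustration and are not taken from the released data.

```python
INSTRUMENTS = [
    "bassoon", "cello", "clarinet", "erhu", "flute", "French horn",
    "guitar", "harp", "recorder", "saxophone", "trumpet", "violin",
]
INTERACTIONS = ["PPMI+", "PPMI-"]  # playing vs. holding without playing


def multiclass_label(instrument: str, interaction: str) -> int:
    """Label in 0..23 for the 24-class task (this ordering is an assumption)."""
    return 2 * INSTRUMENTS.index(instrument) + INTERACTIONS.index(interaction)


def binary_label(interaction: str) -> int:
    """Label for a per-instrument binary task: 1 = PPMI+, 0 = PPMI-."""
    return 1 if interaction == "PPMI+" else 0


def binary_task(samples, instrument: str):
    """Keep one instrument's samples and relabel them for its binary task.

    `samples` is an iterable of (image_id, instrument, interaction) tuples;
    this tuple layout is hypothetical.
    """
    return [
        (image_id, binary_label(interaction))
        for image_id, inst, interaction in samples
        if inst == instrument
    ]
```

Calling `binary_task` once per instrument yields the 12 binary problems, while `multiclass_label` covers the single 24-class problem.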

Accuracy & Mean Average Precision (mAP) on the 24-class Classification Task:
                      BoW                SPM [1]            LLC [2]            Grouplets [3]
                      Accuracy  mAP      Accuracy  mAP      Accuracy  mAP      Accuracy  mAP
Single Patch Size     -         18.3%    -         35.6%    -         39.8%    -         36.7%
Multiple Patch Sizes  25.8%     22.7%    41.8%     39.1%    44.2%     41.8%    -         -

Mean Average Precision (mAP) on the 12 Binary Classification Tasks using a Single Patch Size:
Instrument    BoW      SPM [1]  LLC [2]  Grouplets [3]
Bassoon       65.4%    77.1%    77.9%    78.5%
Erhu          79.5%    84.3%    90.6%    87.6%
Flute         83.8%    93.6%    95.9%    95.7%
French horn   73.8%    88.3%    82.0%    84.0%
Guitar        75.4%    87.5%    89.7%    87.7%
Saxophone     75.3%    88.0%    84.7%    87.7%
Violin        77.7%    91.0%    89.0%    93.0%
Trumpet       70.4%    74.9%    76.8%    76.3%
Cello         66.9%    81.3%    81.1%    84.6%
Clarinet      68.7%    81.1%    75.8%    82.3%
Harp          69.9%    85.3%    83.5%    87.1%
Recorder      69.8%    72.1%    70.3%    76.5%
Average       73.1%    83.7%    83.9%    85.1%

Accuracy & Mean Average Precision (mAP) on the 12 Binary Classification Tasks using Multiple Patch Sizes:
              BoW                SPM [1]            LLC [2]
Instrument    Accuracy  mAP      Accuracy  mAP      Accuracy  mAP
Bassoon       62.0%     73.6%    76.0%     84.6%    77.0%     85.0%
Erhu          78.0%     82.2%    81.0%     88.0%    82.0%     89.5%
Flute         76.0%     86.3%    88.5%     95.3%    92.0%     97.3%
French horn   71.5%     79.0%    87.0%     93.2%    85.0%     93.6%
Guitar        75.0%     85.1%    86.0%     93.7%    83.0%     92.4%
Saxophone     72.5%     84.4%    81.0%     89.5%    80.5%     88.2%
Violin        73.0%     80.6%    86.5%     93.4%    89.0%     96.3%
Trumpet       63.5%     69.3%    71.5%     82.5%    74.0%     86.7%
Cello         71.0%     77.3%    77.0%     85.7%    77.0%     83.3%
Clarinet      65.0%     70.5%    73.0%     82.7%    77.5%     84.8%
Harp          67.5%     75.0%    82.5%     92.1%    85.5%     93.9%
Recorder      65.0%     73.0%    70.5%     78.0%    70.0%     79.1%
Average       70.0%     78.0%    80.0%     88.2%    81.0%     89.2%
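
For readers who want to compute comparable numbers, here is a minimal sketch of mean average precision. It uses the non-interpolated definition of average precision (the mean of the precision values at the rank of each positive sample). This page does not say which AP variant produced the tables above, so treat that choice as an assumption; function names and the input layout are also illustrative.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    `scores` are classifier confidences, `labels` are truthy for positives.
    """
    # Rank samples from the highest to the lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive's rank
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_rows, true_classes, num_classes):
    """Average the per-class AP over all classes (one-vs-rest).

    `score_rows[n][c]` is the score of sample n for class c (assumed layout).
    """
    aps = []
    for c in range(num_classes):
        scores = [row[c] for row in score_rows]
        labels = [t == c for t in true_classes]
        aps.append(average_precision(scores, labels))
    return sum(aps) / len(aps)
```

For a binary task, `average_precision` can be applied directly to the PPMI+ scores; for the 24-class task, `mean_average_precision` averages the 24 one-vs-rest AP values.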


Experimental Setting
All of the experiments use the normalized PPMI images for both training and testing. Each task is evaluated under two settings: (1) a single patch size and (2) multiple patch sizes. The setting determines how many patch sizes are used when extracting SIFT descriptors at each location in the grid.

The remaining parameters are set to the following values:
  • Type of SIFT descriptors: Grayscale
  • SIFT patch sizes: 8, 10, 14, 18, 22, 26, 30
  • SIFT grid spacing: 4 pixels
  • Pyramid Levels: 1+2+4+8 (4 levels)
  • Dictionary Size: 1024
  • Kernel: linear for LLC [2], and histogram intersection kernel for SPM [1]
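
Two of the parameters above lend themselves to a short illustration: the length of the pyramid feature vector and the histogram intersection kernel. The sketch below assumes that "1+2+4+8" means grids of 1x1, 2x2, 4x4, and 8x8 cells, with one 1024-bin histogram per cell; that reading, and all function names, are assumptions rather than details confirmed by this page.

```python
PYRAMID_GRIDS = (1, 2, 4, 8)   # cells per side at each level (assumed reading of "1+2+4+8")
DICTIONARY_SIZE = 1024         # codebook size from the parameter list


def pyramid_feature_length(grids=PYRAMID_GRIDS, dictionary_size=DICTIONARY_SIZE):
    """Length of the concatenated per-cell histograms over all pyramid levels."""
    cells = sum(g * g for g in grids)   # 1 + 4 + 16 + 64 cells under the assumption
    return cells * dictionary_size


def histogram_intersection(x, y):
    """Histogram intersection kernel value: sum over bins of min(x_d, y_d)."""
    if len(x) != len(y):
        raise ValueError("histograms must have the same length")
    return sum(min(a, b) for a, b in zip(x, y))


def gram_matrix(rows_a, rows_b):
    """Kernel matrix K[i][j] between two lists of histograms."""
    return [[histogram_intersection(a, b) for b in rows_b] for a in rows_a]
```

Under the stated assumption the feature vector has 85 x 1024 = 87040 entries. A precomputed matrix from `gram_matrix` is the usual input to a kernel SVM; this pure-Python version is meant for clarity, not speed.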

References
[1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.

[2] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

[3] B. Yao and L. Fei-Fei. Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.



Contact: bangpeng@cs.stanford.edu
                 aditya86@stanford.edu