The PPMI dataset contains images of humans interacting with twelve different musical instruments. They are: bassoon, cello, clarinet, erhu, flute, French horn, guitar, harp, recorder, saxophone, trumpet, and violin.
The 7-class dataset: [Download original images, 392MB] [Download normalized images, 58MB]
The 12-class dataset: [Download original images, 833MB] [Download normalized images, 100MB]

Dataset Reference
Bangpeng Yao and Li Fei-Fei. Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010. [pdf] [full version] [BibTex]

What's new in PPMI
Image statistics
Original images:
Instrument | PPMI+ train | PPMI+ test | PPMI- train | PPMI- test |
bassoon | 85 | 87 | 83 | 81 |
cello | 94 | 93 | 89 | 89 |
clarinet | 88 | 91 | 86 | 79 |
erhu | 96 | 95 | 71 | 77 |
flute | 89 | 88 | 69 | 64 |
French horn | 91 | 88 | 78 | 71 |
guitar | 100 | 100 | 98 | 90 |
harp | 100 | 99 | 95 | 98 |
recorder | 85 | 87 | 71 | 75 |
saxophone | 99 | 99 | 83 | 86 |
trumpet | 97 | 95 | 91 | 87 |
violin | 89 | 96 | 83 | 84 |
Normalized images:
Instrument | PPMI+ train | PPMI+ test | PPMI- train | PPMI- test |
bassoon | 100 | 100 | 100 | 100 |
cello | 100 | 100 | 100 | 100 |
clarinet | 100 | 100 | 100 | 100 |
erhu | 100 | 100 | 100 | 100 |
flute | 100 | 100 | 100 | 100 |
French horn | 100 | 100 | 100 | 100 |
guitar | 100 | 100 | 100 | 100 |
harp | 100 | 100 | 100 | 100 |
recorder | 100 | 100 | 100 | 100 |
saxophone | 100 | 100 | 100 | 100 |
trumpet | 100 | 100 | 100 | 100 |
violin | 100 | 100 | 100 | 100 |
Baseline Results
This section contains baseline results on two tasks: the 24-class classification task and the 12 binary classification tasks.
Accuracy & Mean Average Precision (mAP) on the 24-class Classification Task:
Setting | BoW Accuracy | BoW mAP | SPM [1] Accuracy | SPM [1] mAP | LLC [2] Accuracy | LLC [2] mAP | Grouplets [3] Accuracy | Grouplets [3] mAP |
Single Patch Size | - | 18.3% | - | 35.6% | - | 39.8% | - | 36.7% |
Multiple Patch Sizes | 25.8% | 22.7% | 41.8% | 39.1% | 44.2% | 41.8% | - | - |
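
The 24 classes presumably correspond to the 12 instruments crossed with the two labels PPMI+ and PPMI-, which matches the per-instrument split in the image statistics. The following is a minimal sketch of one way to index them under that assumption. The class ordering and the helper names `class_index` and `binary_label` are hypothetical and are not taken from the dataset release.

```python
# Instruments as listed in the dataset description.
INSTRUMENTS = [
    "bassoon", "cello", "clarinet", "erhu", "flute", "French horn",
    "guitar", "harp", "recorder", "saxophone", "trumpet", "violin",
]
INTERACTIONS = ["PPMI+", "PPMI-"]

# Assumption: 12 instruments x 2 interaction labels = 24 classes.
NUM_CLASSES = len(INSTRUMENTS) * len(INTERACTIONS)


def class_index(instrument: str, interaction: str) -> int:
    """Map an (instrument, interaction) pair to a label in 0..23.

    The ordering is an illustrative choice, not the dataset's own.
    """
    i = INSTRUMENTS.index(instrument)
    j = INTERACTIONS.index(interaction)
    return i * len(INTERACTIONS) + j


def binary_label(interaction: str) -> int:
    """Label for a per-instrument binary task: 1 for PPMI+, 0 for PPMI-."""
    return 1 if interaction == "PPMI+" else 0
```

Under this indexing, each of the 12 binary tasks uses only the images of a single instrument and separates its PPMI+ images from its PPMI- images.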
Mean Average Precision (mAP) on the 12 Binary Classification Tasks using a Single Patch Size:
Instrument | BoW | SPM [1] | LLC [2] | Grouplets [3] |
Bassoon | 65.4% | 77.1% | 77.9% | 78.5% |
Erhu | 79.5% | 84.3% | 90.6% | 87.6% |
Flute | 83.8% | 93.6% | 95.9% | 95.7% |
French Horn | 73.8% | 88.3% | 82.0% | 84.0% |
Guitar | 75.4% | 87.5% | 89.7% | 87.7% |
Saxophone | 75.3% | 88.0% | 84.7% | 87.7% |
Violin | 77.7% | 91.0% | 89.0% | 93.0% |
Trumpet | 70.4% | 74.9% | 76.8% | 76.3% |
Cello | 66.9% | 81.3% | 81.1% | 84.6% |
Clarinet | 68.7% | 81.1% | 75.8% | 82.3% |
Harp | 69.9% | 85.3% | 83.5% | 87.1% |
Recorder | 69.8% | 72.1% | 70.3% | 76.5% |
Average | 73.1% | 83.7% | 83.9% | 85.1% |
Accuracy & Mean Average Precision (mAP) on the 12 Binary Classification Tasks using Multiple Patch Sizes:
Instrument | BoW Accuracy | BoW mAP | SPM [1] Accuracy | SPM [1] mAP | LLC [2] Accuracy | LLC [2] mAP |
Bassoon | 62.0% | 73.6% | 76.0% | 84.6% | 77.0% | 85.0% |
Erhu | 78.0% | 82.2% | 81.0% | 88.0% | 82.0% | 89.5% |
Flute | 76.0% | 86.3% | 88.5% | 95.3% | 92.0% | 97.3% |
French Horn | 71.5% | 79.0% | 87.0% | 93.2% | 85.0% | 93.6% |
Guitar | 75.0% | 85.1% | 86.0% | 93.7% | 83.0% | 92.4% |
Saxophone | 72.5% | 84.4% | 81.0% | 89.5% | 80.5% | 88.2% |
Violin | 73.0% | 80.6% | 86.5% | 93.4% | 89.0% | 96.3% |
Trumpet | 63.5% | 69.3% | 71.5% | 82.5% | 74.0% | 86.7% |
Cello | 71.0% | 77.3% | 77.0% | 85.7% | 77.0% | 83.3% |
Clarinet | 65.0% | 70.5% | 73.0% | 82.7% | 77.5% | 84.8% |
Harp | 67.5% | 75.0% | 82.5% | 92.1% | 85.5% | 93.9% |
Recorder | 65.0% | 73.0% | 70.5% | 78.0% | 70.0% | 79.1% |
Average | 70.0% | 78.0% | 80.0% | 88.2% | 81.0% | 89.2% |
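
The tables above report accuracy and mean average precision (mAP). The page does not say which average precision variant was used (for example, an interpolated variant), so the following is only a sketch under the assumption of non-interpolated average precision, where precision is averaged over the ranks of the positive examples. The function names are illustrative.

```python
from typing import Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Non-interpolated AP for one binary task (labels are 0 or 1).

    Examples are ranked by descending score; precision is recorded at the
    rank of every positive example and then averaged.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(
    tasks: Sequence[Tuple[Sequence[int], Sequence[float]]]
) -> float:
    """Mean of the per-task AP values; each task is (labels, scores)."""
    aps = [average_precision(y, s) for y, s in tasks]
    return sum(aps) / len(aps)
```

For the binary tasks, each instrument would contribute one (labels, scores) pair, and the "Average" row corresponds to averaging the per-instrument values.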
Experimental Setting
All of the experiments use the normalized PPMI images for both training and testing. Each task is evaluated under two settings: (1) a single patch size; (2) multiple patch sizes. The setting refers to the number of patch sizes used when extracting SIFT descriptors at each location in the grid. The remaining parameters are set to the following values:
References
[1] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2006.
[2] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.
[3] B. Yao and L. Fei-Fei. Grouplet: A Structured Image Representation for Recognizing Human and Object Interactions. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

Contact: bangpeng@cs.stanford.edu, aditya86@stanford.edu