The Stanford AI Lab Blog
http://ai.stanford.edu/blog/
The Stanford AI Lab (SAIL) Blog is a place for SAIL students, faculty, and researchers to share our work with the general public.
Mon, 03 May 2021 12:37:23 -0700
Stanford AI Lab Papers and Talks at ICLR 2021
/blog/iclr-2021/
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/logo.png" /></p>
<p>The <a href="https://iclr.cc">International Conference on Learning Representations</a> (ICLR) 2021 is being hosted virtually from May 3rd - May 7th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!</p>
<h2 id="list-of-accepted-papers">List of Accepted Papers</h2>
<h4 id="adaptive-procedural-task-generation-for-hard-exploration-problems"><a href="https://arxiv.org/pdf/2007.00350.pdf">Adaptive Procedural Task Generation for Hard-Exploration Problems</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img10" />
<strong>Authors</strong>: Kuan Fang, Yuke Zhu, Silvio Savarese, Li Fei-Fei
<br /><strong>Contact</strong>: kuanfang@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2007.00350.pdf">Paper</a> | <a href="https://www.youtube.com/watch?v=JrkW8XFJPcE">Video</a> | <a href="https://kuanfang.github.io/apt-gen/">Website</a>
<br /><strong>Keywords</strong>: curriculum learning, reinforcement learning, procedural generation</p>
<hr />
<h4 id="anytime-sampling-for-autoregressive-models-via-ordered-autoencoding"><a href="https://openreview.net/forum?id=TSRTzJnuEBS">Anytime Sampling for Autoregressive Models via Ordered Autoencoding</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img1" />
<strong>Authors</strong>: Yilun Xu, Yang Song, Sahaj Garg, Linyuan Gong, Rui Shu, Aditya Grover, and Stefano Ermon
<br /><strong>Contact</strong>: ylxu@mit.edu
<br /><strong>Links:</strong> <a href="https://openreview.net/forum?id=TSRTzJnuEBS">Paper</a>
<br /><strong>Keywords</strong>: autoregressive models, anytime algorithm, sampling</p>
<hr />
<h4 id="improved-autoregressive-modeling-with-distribution-smoothing"><a href="https://arxiv.org/abs/2103.15089">Improved Autoregressive Modeling with Distribution Smoothing</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img25" />
<strong>Authors</strong>: Chenlin Meng, Jiaming Song, Yang Song, Shengjia Zhao and Stefano Ermon
<br /><strong>Contact</strong>: chenlin@cs.stanford.edu
<br /><strong>Award nominations:</strong> Oral presentation
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2103.15089">Paper</a> | <a href="https://cs.stanford.edu/~chenlin/smoothing/">Website</a>
<br /><strong>Keywords</strong>: generative models, autoregressive models</p>
<hr />
<h4 id="concept-learners-for-few-shot-learning"><a href="https://openreview.net/pdf?id=eJIJF3-LoZO">Concept Learners for Few-Shot Learning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img6" />
<strong>Authors</strong>: Kaidi Cao*, Maria Brbić*, Jure Leskovec
<br /><strong>Contact</strong>: kaidicao@cs.stanford.edu, mbrbic@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://openreview.net/pdf?id=eJIJF3-LoZO">Paper</a> | <a href="https://snap.stanford.edu/comet/">Website</a>
<br /><strong>Keywords</strong>: few-shot learning, meta learning</p>
<hr />
<h4 id="conditional-negative-sampling-for-contrastive-learning-of-visual-representations"><a href="https://arxiv.org/abs/2010.02037">Conditional Negative Sampling for Contrastive Learning of Visual Representations</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img9" />
<strong>Authors</strong>: Mike Wu, Milan Mosse, Chengxu Zhuang, Daniel Yamins, Noah Goodman
<br /><strong>Contact</strong>: wumike@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.02037">Paper</a>
<br /><strong>Keywords</strong>: contrastive learning, negative samples, mutual information</p>
<hr />
<h4 id="cut-out-the-annotator-keep-the-cutout-better-segmentation-with-weak-supervision"><a href="https://openreview.net/pdf?id=bjkX6Kzb5H">Cut out the annotator, keep the cutout: better segmentation with weak supervision</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img15" />
<strong>Authors</strong>: Sarah Hooper, Michael Wornow, Ying Hang Seah, Peter Kellman, Hui Xue, Frederic Sala, Curtis Langlotz, Christopher Ré
<br /><strong>Contact</strong>: smhooper@stanford.edu
<br /><strong>Links:</strong> <a href="https://openreview.net/pdf?id=bjkX6Kzb5H">Paper</a>
<br /><strong>Keywords</strong>: medical imaging, segmentation, weak supervision</p>
<hr />
<h4 id="evaluating-the-disentanglement-of-deep-generative-models-through-manifold-topology"><a href="https://openreview.net/forum?id=djwS0m4Ft_A">Evaluating the Disentanglement of Deep Generative Models through Manifold Topology</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img3" />
<strong>Authors</strong>: Sharon Zhou, Eric Zelikman, Fred Lu, Andrew Y. Ng, Gunnar E. Carlsson, Stefano Ermon
<br /><strong>Contact</strong>: sharonz@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://openreview.net/forum?id=djwS0m4Ft_A">Paper</a> | <a href="https://github.com/stanfordmlgroup/disentanglement">Website</a>
<br /><strong>Keywords</strong>: generative models, evaluation, disentanglement</p>
<hr />
<h4 id="heteroskedastic-and-imbalanced-deep-learning-with-adaptive-regularization"><a href="https://openreview.net/pdf?id=mEdwVCRJuX4">Heteroskedastic and Imbalanced Deep Learning with Adaptive Regularization</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img5" />
<strong>Authors</strong>: Kaidi Cao, Yining Chen, Junwei Lu, Nikos Arechiga, Adrien Gaidon, Tengyu Ma
<br /><strong>Contact</strong>: kaidicao@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://openreview.net/pdf?id=mEdwVCRJuX4">Paper</a>
<br /><strong>Keywords</strong>: deep learning, noise robust learning, imbalanced learning</p>
<hr />
<h4 id="how-does-mixup-help-with-robustness-and-generalization"><a href="https://openreview.net/pdf?id=8yKEo06dKNo">How Does Mixup Help With Robustness and Generalization?</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img12" />
<strong>Authors</strong>: Linjun Zhang, Zhun Deng, Kenji Kawaguchi, Amirata Ghorbani, James Zou
<br /><strong>Contact</strong>: jamesz@stanford.edu
<br /><strong>Links:</strong> <a href="https://openreview.net/pdf?id=8yKEo06dKNo">Paper</a>
<br /><strong>Keywords</strong>: mixup, adversarial robustness, generalization</p>
<hr />
<h4 id="denoising-diffusion-implicit-models"><a href="https://arxiv.org/abs/2010.02502">Denoising Diffusion Implicit Models</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img23" />
<strong>Authors</strong>: Jiaming Song, Chenlin Meng, Stefano Ermon
<br /><strong>Contact</strong>: tsong@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.02502">Paper</a>
<br /><strong>Keywords</strong>: generative models</p>
<hr />
<h4 id="in-n-out-pre-training-and-self-training-using-auxiliary-information-for-out-of-distribution-robustness"><a href="https://arxiv.org/abs/2012.04550">In-N-Out: Pre-Training and Self-Training using Auxiliary Information for Out-of-Distribution Robustness</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img8" />
<strong>Authors</strong>: Sang Michael Xie*, Ananya Kumar*, Robbie Jones*, Fereshte Khani, Tengyu Ma, Percy Liang
<br /><strong>Contact</strong>: xie@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2012.04550">Paper</a> | <a href="https://github.com/p-lambda/in-n-out">Website</a>
<br /><strong>Keywords</strong>: pre-training, self-training theory, robustness, out-of-distribution, unlabeled data, auxiliary information, multi-task learning theory, distribution shift</p>
<hr />
<h4 id="interactive-weak-supervision-learning-useful-heuristics-for-data-labeling"><a href="https://arxiv.org/abs/2012.06046">Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img4" />
<strong>Authors</strong>: Benedikt Boecking, Willie Neiswanger, Eric Xing, Artur Dubrawski
<br /><strong>Contact</strong>: neiswanger@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2012.06046">Paper</a> | <a href="https://github.com/benbo/interactive-weak-supervision">Website</a>
<br /><strong>Keywords</strong>: weak supervision, active learning, interactive learning, data programming, level set estimation</p>
<hr />
<h4 id="negative-data-augmentation"><a href="https://arxiv.org/abs/2102.05113">Negative Data Augmentation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img24" />
<strong>Authors</strong>: Abhishek Sinha*, Ayush Kumar*, Jiaming Song*, Burak Uzkent, Hongxia Jin, Stefano Ermon
<br /><strong>Contact</strong>: tsong@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2102.05113">Paper</a>
<br /><strong>Keywords</strong>: negative data augmentation, generative model, representation learning</p>
<hr />
<h4 id="language-agnostic-representation-learning-of-source-code-from-structure-and-context"><a href="https://arxiv.org/abs/2103.11318">Language-Agnostic Representation Learning of Source Code from Structure and Context</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img11" />
<strong>Authors</strong>: Daniel Zügner, Tobias Kirschstein, Michele Catasta, Jure Leskovec, Stephan Günnemann
<br /><strong>Contact</strong>: pirroh@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2103.11318">Paper</a> | <a href="https://twitter.com/DanielZuegner/status/1376989614867636224?s=20">Blog Post</a> | <a href="https://www.code-transformer.org">Website</a>
<br /><strong>Keywords</strong>: transformer, source code, ml4code</p>
<hr />
<h4 id="learning-energy-based-models-by-diffusion-recovery-likelihood"><a href="https://arxiv.org/abs/2012.08125">Learning Energy-Based Models by Diffusion Recovery Likelihood</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img2" />
<strong>Authors</strong>: Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, and Diederik P. Kingma
<br /><strong>Contact</strong>: ruiqigao@ucla.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2012.08125">Paper</a>
<br /><strong>Keywords</strong>: energy-based models, diffusion score models, generative modeling</p>
<hr />
<h4 id="learning-from-protein-structure-with-geometric-vector-perceptrons"><a href="https://openreview.net/forum?id=1YLJDvSx6J4">Learning from Protein Structure with Geometric Vector Perceptrons</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img21" />
<strong>Authors</strong>: Bowen Jing, Stephan Eismann, Patricia Suriana, Raphael John Lamarre Townshend, Ron Dror
<br /><strong>Contact</strong>: bjing@cs.stanford.edu, seismann@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://openreview.net/forum?id=1YLJDvSx6J4">Paper</a> | <a href="https://github.com/drorlab/gvp-pytorch">Website</a>
<br /><strong>Keywords</strong>: structural biology, graph neural networks, proteins, geometric deep learning</p>
<hr />
<h4 id="mongoose-a-learnable-lsh-framework-for-efficient-neural-network-training-"><a href="https://openreview.net/forum?id=wWK7yXkULyh">MONGOOSE: A Learnable LSH Framework for Efficient Neural Network Training</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img20" />
<strong>Authors</strong>: Beidi Chen, Zichang Liu, Binghui Peng, Zhaozhuo Xu, Jonathan Lingjie Li, Tri Dao, Zhao Song, Anshumali Shrivastava, Christopher Ré
<br /><strong>Contact</strong>: beidic@stanford.edu
<br /><strong>Award nominations:</strong> Oral
<br /><strong>Links:</strong> <a href="https://openreview.net/forum?id=wWK7yXkULyh">Paper</a> | <a href="https://youtu.be/aTpI4ba2lPY">Video</a> | <a href="https://github.com/HazyResearch/mongoose">Website</a>
<br /><strong>Keywords</strong>: efficient training, locality sensitive hashing, nearest-neighbor search</p>
<hr />
<h4 id="model-patching-closing-the-subgroup-performance-gap-with-data-augmentation"><a href="https://arxiv.org/pdf/2008.06775.pdf">Model Patching: Closing the Subgroup Performance Gap with Data Augmentation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img7" />
<strong>Authors</strong>: Karan Goel*, Albert Gu*, Sharon Li, Christopher Ré
<br /><strong>Contact</strong>: kgoel@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2008.06775.pdf">Paper</a> | <a href="https://hazyresearch.stanford.edu/data-aug-part-4">Blog Post</a> | <a href="https://www.youtube.com/watch?v=IqRh-SVNl-c">Video</a> | <a href="https://github.com/HazyResearch/model-patching">Website</a>
<br /><strong>Keywords</strong>: data augmentation, robustness, consistency training</p>
<hr />
<h4 id="nearest-neighbor-machine-translation"><a href="https://openreview.net/pdf?id=7wCBOfJ8hJM">Nearest Neighbor Machine Translation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img22" />
<strong>Authors</strong>: Urvashi Khandelwal, Angela Fan, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis
<br /><strong>Contact</strong>: urvashik@stanford.edu
<br /><strong>Links:</strong> <a href="https://openreview.net/pdf?id=7wCBOfJ8hJM">Paper</a>
<br /><strong>Keywords</strong>: nearest neighbors, machine translation</p>
<hr />
<h4 id="on-the-critical-role-of-conventions-in-adaptive-human-ai-collaboration"><a href="https://arxiv.org/abs/2104.02871">On the Critical Role of Conventions in Adaptive Human-AI Collaboration</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img19" />
<strong>Authors</strong>: Andy Shih, Arjun Sawhney, Jovana Kondic, Stefano Ermon, Dorsa Sadigh
<br /><strong>Contact</strong>: andyshih@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2104.02871">Paper</a> | <a href="https://ai.stanford.edu/blog/conventions/">Blog Post</a> | <a href="https://github.com/Stanford-ILIAD/Conventions-ModularPolicy">Website</a>
<br /><strong>Keywords</strong>: multi-agent systems, human-robot interaction</p>
<hr />
<h4 id="pmi-masking-principled-masking-of-correlated-spans"><a href="https://openreview.net/pdf?id=3Aoft6NWFej">PMI-Masking: Principled masking of correlated spans</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img14" />
<strong>Authors</strong>: Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown, Moshe Tennenholtz, Yoav Shoham
<br /><strong>Contact</strong>: shoham@cs.stanford.edu
<br /><strong>Award nominations:</strong> Spotlight selection
<br /><strong>Links:</strong> <a href="https://openreview.net/pdf?id=3Aoft6NWFej">Paper</a>
<br /><strong>Keywords</strong>: masked language models, pointwise mutual information (pmi)</p>
<hr />
<h4 id="score-based-generative-modeling-through-stochastic-differential-equations"><a href="https://arxiv.org/abs/2011.13456">Score-Based Generative Modeling through Stochastic Differential Equations</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img0" />
<strong>Authors</strong>: Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole
<br /><strong>Contact</strong>: yangsong@cs.stanford.edu
<br /><strong>Award nominations:</strong> Outstanding Paper Award
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2011.13456">Paper</a> | Blog Post (coming soon) | <a href="https://github.com/yang-song/score_sde">Website</a>
<br /><strong>Keywords</strong>: generative modeling, stochastic differential equations, score matching, inverse problems, likelihood</p>
<hr />
<h4 id="selective-classification-can-magnify-disparities-across-groups"><a href="https://arxiv.org/abs/2010.14134">Selective Classification Can Magnify Disparities Across Groups</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img13" />
<strong>Authors</strong>: Erik Jones*, Shiori Sagawa*, Pang Wei Koh*, Ananya Kumar, Percy Liang
<br /><strong>Contact</strong>: erjones@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.14134">Paper</a>
<br /><strong>Keywords</strong>: selective classification, group disparities, log-concavity, robustness</p>
<hr />
<h4 id="theoretical-analysis-of-self-training-with-deep-networks-on-unlabeled-data"><a href="https://openreview.net/forum?id=rC8sJ4i6kaH">Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img18" />
<strong>Authors</strong>: Colin Wei, Kendrick Shen, Yining Chen, Tengyu Ma
<br /><strong>Contact</strong>: colinwei@stanford.edu
<br /><strong>Links:</strong> <a href="https://openreview.net/forum?id=rC8sJ4i6kaH">Paper</a>
<br /><strong>Keywords</strong>: deep learning theory, domain adaptation theory, unsupervised learning theory, semi-supervised learning theory</p>
<hr />
<h4 id="viewmaker-networks-learning-views-for-unsupervised-representation-learning"><a href="https://arxiv.org/abs/2010.07432">Viewmaker Networks: Learning Views for Unsupervised Representation Learning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img16" />
<strong>Authors</strong>: Alex Tamkin, Mike Wu, Noah Goodman
<br /><strong>Contact</strong>: atamkin@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.07432">Paper</a> | <a href="https://ai.stanford.edu/blog/viewmaker/">Blog Post</a>
<br /><strong>Keywords</strong>: contrastive learning, domain-agnostic, pretraining, self-supervised, representation learning</p>
<hr />
<h4 id="practical-deepfake-detection-vulnerabilities-in-global-contexts">Practical Deepfake Detection: Vulnerabilities in Global Contexts</h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-05-03-iclr-2021/img17" />
<strong>Authors</strong>: Yang Andrew Chuming, Daniel Jeffrey Wu, Ken Hong
<br /><strong>Contact</strong>: ycm@stanford.edu
<br /><strong>Award nominations:</strong> Spotlight talk at the ICLR-21 Workshop on Responsible AI
<br /><strong>Keywords</strong>: deepfake, deepfakes, robustness, corruption, low-bandwidth, faceforensics</p>
<hr />
<p>We look forward to seeing you at ICLR 2021!</p>
Mon, 03 May 2021 00:00:00 -0700
Conventions in Multi-Agent Collaboration
/blog/conventions/
<p>Humans are good at collaborating with each other (e.g., when playing team sports), in part because we adapt to our teammates over multiple repeated interactions. Through these interactions, teammates build a shared understanding of the collaboration strategy, which we refer to as conventions. For example, when playing basketball, teams formulate conventions for signaling when to pass the ball, which offensive formation to take, which players on the opposing team to guard, and more. This ability to build conventions is critical to a team’s success.</p>
<p>The notion of conventions as applied to collaborative tasks has been well-studied, especially in the linguistics literature <sup id="fnref:Hawkins17"><a href="#fn:Hawkins17" class="footnote">1</a></sup><sup id="fnref:Hawkins21"><a href="#fn:Hawkins21" class="footnote">2</a></sup>, where people have been shown to reduce their speech length when referring to the same objects with the same partners over repeated interactions. Even outside of linguistics, there are many cultural conventions (e.g., in the U.S., driving on the right side of an unmarked road) or personal conventions (e.g., personalized handshakes with friends) that we use.</p>
<p>We would like to apply the idea of conventions to human-AI collaboration, for example in assistive robotics. But before deploying robots and artificial agents into people’s homes (for cooking, cleaning, assembling furniture), we must be sure they can identify and learn such conventions in order to collaborate seamlessly with human partners.</p>
<p>In particular, collaboration in multi-agent tasks (e.g., team basketball) often involves two types of skills:</p>
<ul>
<li>Task-specific: fundamental skills relevant to the task (e.g., dribbling/shooting basketball)</li>
<li>Partner-specific: shared strategy developed with the partner (e.g., when to pass the ball)</li>
</ul>
<p>Task-specific skills, such as knowing the rules of the game, are useful no matter who the partner is. Partner-specific skills, on the other hand, are the shared strategies developed with a particular partner, i.e., conventions.</p>
<h3 id="breaking-symmetry">Breaking Symmetry</h3>
<p>One challenge in collaborating with others is symmetry. When there is only one optimal strategy, players can unambiguously choose it. However, when many strategies are equally optimal, players might choose different ones, resulting in a joint action that is suboptimal. Players break symmetry by developing conventions: a mechanism for breaking ties among a set of equally optimal strategies.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-28-conventions/image5.jpg" /></p>
</div></figure>
<p>For example, in the image above, Toronto Raptors point guard Kyle Lowry is giving a signal to coordinate an offensive play. To us, this signal could refer to almost anything, and there’s no way a priori to break symmetry between any of its possible meanings (e.g., does he want pick-and-roll, isolation, or something else?). Fortunately, his teammates can understand these signals based on conventions they’ve built through practice. As we can see, conventions are important in multi-agent collaboration since they solve the problem of breaking symmetry.</p>
<p>More concretely, consider a game of friendly Rock<script type="math/tex">(R)</script>-Paper<script type="math/tex">(P)</script>-Scissors<script type="math/tex">(S)</script>, where the goal is for two friends to throw the same hand. The joint action space of the two friends is <script type="math/tex">\{R,P,S\} \times \{R,P,S\}</script>, and the joint actions <script type="math/tex">(R,R), (P,P), (S,S)</script> are all optimal.</p>
<p>This problem seems trivial, but the two friends must choose their actions independently, without communicating and without prior knowledge of the other person’s strategy. That is, their joint policy is factored: <script type="math/tex">p(a_1, a_2) = p(a_1) p(a_2)</script>. Even though any of the 3 optimal joint actions is good, on their first attempt there is no way to know which of the 3 to pick! This is the symmetry breaking problem: there may be many optimal joint actions, but the players must still collectively decide on the same one.</p>
<p>Fortunately, by trial-and-error and building a history of repeated interactions, we can eventually converge on a convention (always pick Rock) with our partners and break symmetry with this shared strategy.</p>
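<p>To make the symmetry breaking problem concrete, here is a minimal sketch (our own illustration, not code from the paper) that computes how often two friends throw the same hand under a factored joint policy. The function name <code>match_probability</code> and the example policies are hypothetical.</p>

```python
from itertools import product

ACTIONS = ("R", "P", "S")


def match_probability(p1, p2):
    """Probability that two independent players throw the same hand.

    p1 and p2 map each action to its probability. The joint policy is
    factored: p(a1, a2) = p1(a1) * p2(a2).
    """
    return sum(
        p1[a1] * p2[a2]
        for a1, a2 in product(ACTIONS, ACTIONS)
        if a1 == a2
    )


# Illustrative policies (assumptions, not taken from the paper).
uniform = {a: 1 / 3 for a in ACTIONS}
always_rock = {"R": 1.0, "P": 0.0, "S": 0.0}
always_paper = {"R": 0.0, "P": 1.0, "S": 0.0}

print(match_probability(uniform, uniform))           # no convention yet
print(match_probability(always_rock, always_rock))   # shared convention
print(match_probability(always_rock, always_paper))  # mismatched conventions
```

<p>With no convention, two uniform players match only a third of the time. A shared convention makes them always match, while two players who each settled on a different optimal action never match.</p>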
<h3 id="generalizing-to-new-partners-and-new-tasks">Generalizing to New Partners and New Tasks</h3>
<p>So far, we’ve described task-specific skills as important for learning about the fundamentals of the task, and partner-specific skills (i.e. conventions) as important for breaking symmetry. Why make this distinction between the two types of skills? The point is that if we can learn separate representations for tasks and partners, then we can perhaps transfer our knowledge over to new partners and new tasks!</p>
<p>Let’s look at a block placing game that reveals the interplay between task-specific and partner-specific skills. There is a 2x2 grid with a target Goal configuration that a red player (Bob) and a blue player (Alice) have to construct together. Only the red player sees the Goal configuration. The players start with an empty grid, and take turns. On each turn, a player can choose to move/place a block of their own color.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_50" src="/blog/assets/img/posts/2021-04-28-conventions/image3.png" /></p>
</div></figure>
<p>Suppose that Alice and Bob have been playing this game many times, and through trial-and-error have converged on a signaling strategy where on turn 1 Bob always places the red block horizontally opposite the blue block location. Below we see a possible progression of their game (with 4 turns, from left to right). On turn 1 Bob places the red block at the top-right corner. On turn 2 Alice does nothing. On turn 3 Bob places the red block in the correct bottom-right corner. On turn 4, based on signaling conventions that they’ve established, Alice correctly deduces that the blue block should be at the top-left corner.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-28-conventions/image1.gif" /></p>
</div></figure>
<p>From Bob’s perspective, the task-specific skill is moving the red block to the bottom-right to match the Goal configuration, whereas the partner-specific skill is signaling the correct blue block location to Alice using his action on turn 1. With this example, we can see how task-specific skills (placing the red block correctly) can be transferred to new partners, and how partner-specific signals can be transferred to tasks with similar symmetries (e.g., if the rules change such that the red block must end up at one of the positions that is empty in the Goal configuration, Bob can still re-use the same conventions to signal to Alice the location of the blue block).</p>
<h3 id="building-convention-aware-agents">Building Convention-Aware Agents</h3>
<p>Given the importance of building conventions, how can we build convention-aware artificial agents? In this work, we design an artificial agent to work well with new tasks and new partners based on the above intuition of separating task-specific and partner-specific representations. We consider two-player collaborative tasks with no external communication, where our agent plays as one of the players, and knows the identity of the task and the partner (e.g., a cooking robot might know if it is working in the same kitchen or with the same person that it has worked with before).</p>
<h5 id="modular-policy">Modular Policy</h5>
<p>We use a modular architecture that learns a task module for each task and a partner module for each partner. Given the task and the partner we are playing with, we use the corresponding modules to parameterize our policy. In our design, the task module first processes the input (the state observations of the task) and outputs 1) a latent representation <script type="math/tex">z</script> and 2) an action distribution <script type="math/tex">g^t</script>. The partner module then takes <script type="math/tex">z</script> as input and predicts another action distribution <script type="math/tex">g^p</script>, and the final policy is given by the product of the two action distributions: <script type="math/tex">\pi(a \vert s) = g^t(a \vert s) g^p(a \vert z)</script>.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2021-04-28-conventions/image4.gif" /></p>
</div></figure>
<p>The intuition behind this sequential setup is that the task module action distribution <script type="math/tex">g^t</script> assigns high probability to all the actions that are potentially good (roughly speaking, there exists a complementary partner action <script type="math/tex">a'</script> such that <script type="math/tex">Q(a, a')</script> is good). If there is only one such action, then <script type="math/tex">g^t</script> may be very sharp; if all actions are good then <script type="math/tex">g^t</script> may be uniform. Then, the partner module action distribution <script type="math/tex">g^p</script> outputs how to break the tie between the equally good actions, which we can interpret as the convention built with this partner.</p>
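<p>As a rough illustration of this intuition, the sketch below combines a task distribution and a partner distribution over a small discrete action set. It is a simplification under stated assumptions: the distributions are plain lookup tables rather than neural network outputs, the extra action <code>"X"</code> is a hypothetical bad action, and the renormalization step is our addition so that the product is a valid probability distribution.</p>

```python
def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have positive total mass")
    return {a: w / total for a, w in weights.items()}


def combine(g_task, g_partner):
    """Multiply the two action distributions and renormalize.

    The renormalization is an assumption of this sketch.
    """
    return normalize({a: g_task[a] * g_partner[a] for a in g_task})


# Task module output: R, P and S are all potentially good, "X" is not.
g_task = {"R": 0.33, "P": 0.33, "S": 0.33, "X": 0.01}
# Partner module output: the convention with this partner favors Rock.
g_partner = {"R": 0.7, "P": 0.1, "S": 0.1, "X": 0.1}

policy = combine(g_task, g_partner)
for action, prob in policy.items():
    print(action, round(prob, 3))
```

<p>The task distribution rules out the bad action, and the partner distribution breaks the tie among the remaining actions, so most of the probability mass ends up on Rock.</p>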
<p>Finally, to prevent the task module from being uninformative and pushing all the hard work to the partner module, we add a regularization term (Eq 1) that encourages the task module’s output distribution to match the marginal of the different partner module output distributions (that is, what we should do if we don’t know which partner we’re playing with).</p>
<figure class="figure"><div class="figure__main">
<table style="width:100%">
<colgroup>
<col span="1" style="width: 95%" />
<col span="1" style="width: 5%;" />
</colgroup>
<tr>
<td style="border:none"><img class="postimage_50" src="/blog/assets/img/posts/2021-04-28-conventions/image10.png" /></td>
<td style="border:none">(1)</td>
</tr>
</table>
</div></figure>
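<p>The idea behind the regularization term can be sketched as follows. This is not the exact form of Eq 1: the uniform weighting over partners, the use of KL divergence as the distance, and all function names are assumptions made for illustration only.</p>

```python
import math


def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def combine(g_task, g_partner):
    """Renormalized product of task and partner distributions (assumed)."""
    return normalize({a: g_task[a] * g_partner[a] for a in g_task})


def marginal_policy(g_task, partner_dists):
    """Average the combined policy over partners (uniform weights assumed)."""
    policies = [combine(g_task, g_p) for g_p in partner_dists]
    return {a: sum(p[a] for p in policies) / len(policies) for a in g_task}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared discrete action set."""
    return sum(
        p[a] * math.log((p[a] + eps) / (q[a] + eps)) for a in p if p[a] > 0
    )


def marginal_regularizer(g_task, partner_dists):
    """Penalty that is small when g_task matches the marginal policy."""
    return kl_divergence(marginal_policy(g_task, partner_dists), g_task)


# Three hypothetical partners, each with a different tie-breaking convention.
partners = [
    {"R": 0.8, "P": 0.1, "S": 0.1, "X": 0.0},
    {"R": 0.1, "P": 0.8, "S": 0.1, "X": 0.0},
    {"R": 0.1, "P": 0.1, "S": 0.8, "X": 0.0},
]
informative = {"R": 1 / 3, "P": 1 / 3, "S": 1 / 3, "X": 0.0}
uninformative = {"R": 0.25, "P": 0.25, "S": 0.25, "X": 0.25}

print(marginal_regularizer(informative, partners))
print(marginal_regularizer(uninformative, partners))
```

<p>In this toy setting, a task distribution that already excludes the bad action <code>"X"</code> incurs essentially no penalty, while an uninformative task distribution that leaves that work to the partner modules is penalized.</p>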
<p>When given a new partner, we train a new partner module with the same task module — this enables us to transfer over the marginal distribution of the good actions to make, and only worry about learning the tie-breaking preferences of the new partner.</p>
<p>When given a new task (with the same states/actions/dynamics, but different rewards), we train a new task module with the same partner module. This enables us to learn the complexities of the new task, while also recalling the preferences of the partner in terms of breaking ties between equally optimal actions.</p>
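<p>One way to picture how modules are re-used is a small registry keyed by task and partner, sketched below. The class <code>ModularAgent</code> and its methods are hypothetical, and each module is reduced to a state-independent table over actions. In the actual architecture the modules are learned networks and the partner module conditions on the latent <script type="math/tex">z</script>.</p>

```python
class ModularAgent:
    """Toy registry of per-task and per-partner modules (illustrative only)."""

    def __init__(self, actions):
        self.actions = tuple(actions)
        self.task_modules = {}
        self.partner_modules = {}

    def _fresh(self):
        # A new, untrained module starts as a uniform distribution.
        return {a: 1 / len(self.actions) for a in self.actions}

    def task_module(self, task_id):
        return self.task_modules.setdefault(task_id, self._fresh())

    def partner_module(self, partner_id):
        return self.partner_modules.setdefault(partner_id, self._fresh())

    def policy(self, task_id, partner_id):
        """Renormalized product of the selected task and partner modules."""
        g_t = self.task_module(task_id)
        g_p = self.partner_module(partner_id)
        weights = {a: g_t[a] * g_p[a] for a in self.actions}
        total = sum(weights.values())
        return {a: w / total for a, w in weights.items()}


agent = ModularAgent("RPSX")
# Pretend these were learned earlier on task "match" with partner "alice".
agent.task_modules["match"] = {"R": 0.33, "P": 0.33, "S": 0.33, "X": 0.01}
agent.partner_modules["alice"] = {"R": 0.7, "P": 0.1, "S": 0.1, "X": 0.1}

# New partner "bob": the task module is re-used and only a fresh partner
# module is created, so the starting policy equals the task distribution.
print(agent.policy("match", "bob"))
# New task "other" with known partner "alice": the partner module is
# re-used, so the starting policy already reflects her tie-breaking.
print(agent.policy("other", "alice"))
```

<p>In this simplified picture, adapting to a new partner only requires training the fresh partner entry, and adapting to a new task only requires training the fresh task entry.</p>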
<h3 id="experiments">Experiments</h3>
<p>We ran experiments on multi-armed bandits, the block placing task described above, and a simplified 2-player version of Hanabi.</p>
<figure class="figure"><div class="figure__main">
<table style="width:100%">
<colgroup>
<col span="1" style="width: 30%;" />
<col span="1" style="width: 30%;" />
<col span="1" style="width: 30%;" />
</colgroup>
<tr>
<th>Multi-armed Bandit</th>
<th>Block Placing</th>
<th>Hanabi</th>
</tr>
<tr>
<td><img class="postimage_100" src="/blog/assets/img/posts/2021-04-28-conventions/image9.gif" /></td>
<td><img class="postimage_100" src="/blog/assets/img/posts/2021-04-28-conventions/image1.gif" /></td>
<td><img class="postimage_100" src="/blog/assets/img/posts/2021-04-28-conventions/image6.gif" /></td>
</tr>
</table>
</div></figure>
<p>We won’t go into details about the Multi-armed Bandit and the Hanabi tasks in this blogpost, but check out our <a href="https://arxiv.org/abs/2104.02871">paper</a> for more details and results!</p>
<p>Here we show some plots from the block placing task for both transferring to new partners and new tasks. The max reward for the block placing task is 20. For transferring to new partners, we first train a single task module by playing with a pool of 6 partners, and then test with 6 new partners. Throughout, we use the same task module but use a different partner module for each partner. We compare against two baselines: BaselineAgg, which aggregates the gradients from all the training partners during training, and First-Order Model-Agnostic Meta-Learning (FOMAML). In contrast to these baselines, our modular setup allows us to reinitialize only the partner-specific representations while re-using the task-specific representations, and we see that this enables faster adaptation to new partners.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-28-conventions/image2.png" /></p>
</div></figure>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-28-conventions/image7.png" /></p>
</div></figure>
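<p>To make the partner-transfer setup concrete, here is a minimal sketch of the bookkeeping involved. It is not the architecture from the paper: the class name <code>ModularPolicy</code>, the stateless (bandit-style) setting, and the choice to combine task and partner logits by simple addition are all assumptions made for illustration. The point is only that a new partner gets a fresh partner module while the task module is left untouched.</p>

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


class ModularPolicy:
    """Hypothetical stateless policy: one shared task module, one module per partner."""

    def __init__(self, num_actions, seed=0):
        rng = random.Random(seed)
        self.num_actions = num_actions
        # Stand-in for a trained task module (random numbers, not learned values).
        self.task_logits = [rng.gauss(0.0, 1.0) for _ in range(num_actions)]
        self.partner_logits = {}

    def add_partner(self, partner_id):
        # A new partner gets a freshly initialized module; the task module is reused.
        self.partner_logits[partner_id] = [0.0] * self.num_actions

    def action_probs(self, partner_id):
        # Assumption: task and partner logits are combined by addition.
        partner = self.partner_logits[partner_id]
        combined = [t + p for t, p in zip(self.task_logits, partner)]
        return softmax(combined)


policy = ModularPolicy(num_actions=4)
policy.add_partner("new_partner")
print(policy.action_probs("new_partner"))
```

<p>In this sketch, adapting to a new partner would only update the entries of <code>partner_logits["new_partner"]</code>, which is the kind of reuse the modular setup is meant to allow.</p>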
<p>For transferring to a new task, we tweak the rule of the game such that the red player (Bob) must place the red block at one of the positions that is empty (white) in the Goal configuration. We train a task module for this tweaked task, and test if our modular architecture can directly generalize to the new task rules while remembering signalling conventions with old partners in a zero-shot manner. We compare with a baseline method that is similarly modular, but does not use marginal regularization (see Equation 1 above) to push the task module to learn the right representations. Our results suggest that the marginal regularization term is important for transferring to new tasks.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-28-conventions/image8.png" /></p>
</div></figure>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-28-conventions/image11.png" /></p>
</div></figure>
<h3 id="takeaways">Takeaways</h3>
<p>We studied the role of task-specific skills and partner-specific skills (i.e., conventions) in multi-agent collaborative tasks. We explored the use of a modular architecture to train agents that can separate task-specific and partner-specific representations. With the modular setup, we are able to piece together new combinations of modules to adapt more quickly to novel combinations of tasks and partners!</p>
<p>For more details check out our ICLR 2021 paper “On the Critical Role of Conventions in Adaptive Human-AI Collaboration”.</p>
<p><a href="https://arxiv.org/abs/2104.02871">Paper</a></p>
<p><a href="https://github.com/Stanford-ILIAD/Conventions-ModularPolicy">Code</a></p>
<h3 id="acknowledgments">Acknowledgments</h3>
<p>Thanks to Sidd Karamcheti and Jacob Schreiber for their helpful comments on this blogpost!</p>
<div class="footnotes">
<ol>
<li id="fn:Hawkins17">
<p>Robert D Hawkins, Mike Frank, and Noah D Goodman. <a href="https://cogsci.mindmodeling.org/2017/papers/0098/paper0098.pdf">Convention-formation in iterated reference games</a>. In Proceedings of the 39th Annual Meeting of the Cognitive Science Society, 2017. <a href="#fnref:Hawkins17" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Hawkins21">
<p>Robert D. Hawkins, Michael Franke, Michael C. Frank, Kenny Smith, Thomas L. Griffiths, Noah D. Goodman. <a href="https://arxiv.org/abs/2104.05857">From partners to populations: A hierarchical Bayesian account of coordination and convention</a>. 2021. <a href="#fnref:Hawkins21" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Wed, 28 Apr 2021 00:00:00 -0700Broadening the Reach of Contrastive Learning with Viewmaker Networks
/blog/viewmaker/
/blog/viewmaker/<h2 id="the-benefits-and-bounds-of-self-supervised-pretraining">The Benefits and Bounds of Self-Supervised Pretraining</h2>
<p>Deep learning is data hungry. Neural networks sometimes require millions of human-labeled data points to perform well, making it hard for the average person or company to train these models. This constraint keeps many important applications out of reach, including for rare diseases, low-resource languages, or even developers who want to train models on their own custom datasets.</p>
<p>Fortunately, self-supervised pretraining has recently come to the rescue. These algorithms teach models to learn from large amounts of raw data without requiring humans to label each data point. The resulting models need drastically fewer labeled examples to achieve the same performance on a particular task.</p>
<p>But currently, the pretraining methods used for different kinds of data are all distinct. Since new domains require new algorithms, pretraining is still underexplored in many high-impact domains, including healthcare, astronomy, and remote sensing, as well as multimodal settings that involve learning the relationships between different modalities, like language and vision.</p>
<h2 id="learning-views-for-contrastive-learning">Learning Views for Contrastive Learning</h2>
<p>In our <a href="https://arxiv.org/abs/2010.07432">ICLR 2021 paper</a>, we make progress on this problem by developing viewmaker networks, a single algorithm which enables competitive or superior pretraining performance on three diverse modalities: natural images, speech recordings, and wearable sensor data.</p>
<p>At its core, our method extends a number of view-based pretraining methods in computer vision. In this family of methods, depicted below, the network’s goal is to tell whether two distorted examples—known as <strong>views</strong>—were produced from the same original data point. These methods include contrastive learning methods such as <a href="https://arxiv.org/abs/2002.05709">SimCLR</a>, <a href="https://arxiv.org/abs/1911.05722">MoCo</a>, and <a href="https://arxiv.org/abs/1805.01978">InstDisc</a>, along with non-contrastive algorithms like <a href="https://arxiv.org/abs/2006.07733">BYOL</a> and <a href="https://arxiv.org/abs/2006.09882">SwAV</a>.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_90" src="/blog/assets/img/posts/2021-04-20-viewmaker/image7.png" /></p>
</div></figure>
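<p>As a rough illustration of the view-matching objective, here is a small, dependency-free sketch of an InfoNCE-style contrastive loss. It is a simplified, one-directional variant, not the exact loss of SimCLR or any other method cited above; the function names and the temperature value are assumptions. Each row of <code>views_a</code> should be most similar to the row of <code>views_b</code> at the same index, since both come from the same original data point.</p>

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(views_a, views_b, temperature=0.1):
    """Mean cross-entropy of picking the matching view among all candidates."""
    losses = []
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the true partner view.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (made up): two data points, two views each.
views_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.1], [0.1, 1.0]]
swapped = [[0.1, 1.0], [1.0, 0.1]]
print(info_nce(views_a, aligned) < info_nce(views_a, swapped))
```

<p>The loss is low when matching views are embedded close together and high when they are not, which is why the choice of views shapes what the encoder has to learn.</p>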
<p>A key challenge for these methods is determining <strong>what kinds of views</strong> to produce from an input—since this determines how hard the task will be for the network, along with what capabilities the network will need to learn as it solves it. In computer vision, for example, the views are carefully-chosen combinations of image-specific data augmentation functions—such as cropping, blurring, and changes in hue, saturation, brightness, and contrast. Selecting views is currently more an art than a science, and requires both domain expertise and trial and error.</p>
<p>In our work, we train a new generative model—called a viewmaker network—to learn good views, without extensive hand tuning or domain knowledge. Viewmaker networks enable pretraining on a wide range of different modalities, including ones where what makes a good view is still unknown. Remarkably, even without the benefits of domain knowledge or carefully-curated transformation functions, our method produces models with comparable or superior transfer learning accuracy to handcrafted views on the three diverse domains we consider! This suggests that viewmaker networks may be an important step towards general pretraining methods that work across modalities.</p>
<h2 id="a-stochastic-bounded-adversary">A Stochastic Bounded Adversary</h2>
<p>At its core, a viewmaker network is a stochastic bounded adversary. Let’s break these terms down, one at a time:</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_90" src="/blog/assets/img/posts/2021-04-20-viewmaker/image4.png" /></p>
</div></figure>
<p><strong>Stochastic</strong>: Viewmaker networks accept a training example and a random noise vector, and output a perturbed input. Stochasticity enables the network to learn an infinite number of different views to apply to the input during pretraining.</p>
<p><strong>Bounded</strong>: The perturbations applied to an input shouldn’t be too large in magnitude—otherwise the pretraining task would be impossible. Because of this, the viewmaker network perturbations are bounded in strength.</p>
<p>But how can we control the strength of a perturbation in a domain-agnostic way? We use a simple L1-norm bound on the perturbation. This gives the viewmaker the flexibility to make either strong changes to a small part of an input, or weaker changes to a larger part. In practice, we train the viewmaker to directly output a delta (the difference between the input and the eventual view), which is scaled to an L1 radius and then added to the input.</p>
<p>This radius, or “distortion budget,” is tuned as a hyperparameter, but we found a single setting to work well across the three different modalities we considered.</p>
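<p>The scaling step can be sketched in a few lines. This is an illustrative approximation rather than the implementation in our repo: the function name is hypothetical, inputs are flat lists of floats, and we assume the delta is rescaled so that its total L1 norm equals the budget exactly (the paper may normalize differently, for example per element).</p>

```python
def apply_bounded_perturbation(x, delta, budget):
    """Rescale delta to have L1 norm equal to budget, then add it to x."""
    l1_norm = sum(abs(d) for d in delta)
    if l1_norm == 0.0:
        # Nothing to scale; return the input unchanged.
        return list(x)
    scale = budget / l1_norm
    return [xi + scale * di for xi, di in zip(x, delta)]


x = [0.5, 0.5, 0.5, 0.5]
# A strong change to one coordinate and a weak change to all coordinates
# both end up with the same total L1 distortion.
concentrated = apply_bounded_perturbation(x, [4.0, 0.0, 0.0, 0.0], budget=0.2)
spread = apply_bounded_perturbation(x, [1.0, -1.0, 1.0, -1.0], budget=0.2)
print(concentrated)
print(spread)
```

<p>Both outputs differ from <code>x</code> by the same total amount, which illustrates the trade-off between a large local change and a small global one.</p>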
<p><strong>Adversary</strong>: What objective function should the viewmaker have? We train it adversarially—in other words, the viewmaker tries to increase the contrastive loss of the encoder network (e.g. SimCLR or InstDisc) as much as possible given the bounded constraint. Unlike GANs, which are known to suffer from training instability, we find viewmakers to be easier to train—perhaps because perturbing the data is a less challenging task than generating it.</p>
<h2 id="visualizing-the-learned-views">Visualizing the Learned Views</h2>
<p>Here are example perturbations for two of the different modalities we consider: natural images (left) and speech recordings (right). The center square shows a training example, while the outer images show the scaled perturbation generated by the viewmaker. The images show the diversity of views learned for a single input, as well as how the viewmaker network tailors the perturbations to the input image.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimagehalf" src="/blog/assets/img/posts/2021-04-20-viewmaker/image5.png" />
<img class="postimagehalf" src="/blog/assets/img/posts/2021-04-20-viewmaker/image3.png" /></p>
</div></figure>
<h2 id="performance-on-transfer-tasks">Performance on Transfer Tasks</h2>
<p>Remarkably, despite not requiring domain-specific assumptions, viewmaker networks match the performance of expert-tuned views used for images, as measured by performance on a range of transfer tasks. Furthermore, they outperform common views used for spectrograms (e.g. SpecAugment) when pretraining on speech and sensor data, improving transfer accuracy by 9 and 17 percentage points on average, respectively. This suggests viewmakers may be an important ingredient for developing pretraining methods that work across modalities. The tables below show linear evaluation accuracies on image, audio, and wearable sensor datasets (in that order). See <a href="https://arxiv.org/abs/2010.07432">the paper</a> for more details.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimagethird" src="/blog/assets/img/posts/2021-04-20-viewmaker/image6.png" />
<img class="postimagehalf" src="/blog/assets/img/posts/2021-04-20-viewmaker/image1.png" />
<img class="postimage_90" src="/blog/assets/img/posts/2021-04-20-viewmaker/image2.png" /></p>
</div></figure>
<h2 id="conclusion--future-work">Conclusion & Future Work</h2>
<p>Viewmaker networks untether contrastive learning from a particular set of domain-specific augmentations, resulting in a more general pretraining method. Our results show that viewmakers enable strong pretraining performance on three diverse modalities, without requiring handcrafted expertise or domain knowledge for each domain.</p>
<p>In terms of future work, we’re excited to see viewmakers applied to other domains, either by themselves or as a way to supplement existing handcrafted views. We’re also excited by potential applications of viewmakers to supervised learning and robustness research. Please check out our <a href="http://github.com/alextamkin/viewmaker">repo for code</a> and <a href="https://arxiv.org/abs/2010.07432">the paper</a> for more details!</p>
Tue, 20 Apr 2021 00:00:00 -0700Stanford AI Lab Papers and Talks at AISTATS 2021
/blog/aistats-2021/
/blog/aistats-2021/<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-13-aistats-2021/logo.png" /></p>
<p>The <a href="https://aistats.org/aistats2021/">International Conference on Artificial Intelligence and Statistics</a> (AISTATS) 2021 is being hosted virtually from April 13th - April 15th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!</p>
<h2 id="list-of-accepted-papers">List of Accepted Papers</h2>
<h4 id="active-online-learning-with-hidden-shifting-domains"><a href="https://arxiv.org/abs/2006.14481">Active Online Learning with Hidden Shifting Domains</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-13-aistats-2021/img1" />
<strong>Authors</strong>: Yining Chen, Haipeng Luo, Tengyu Ma, Chicheng Zhang
<br /><strong>Contact</strong>: cynnjjs@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.14481">Paper</a>
<br /><strong>Keywords</strong>: online learning, active learning, domain adaptation</p>
<hr />
<h4 id="a-constrained-risk-inequality-for-general-losses">A Constrained Risk Inequality for General Losses</h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-13-aistats-2021/img0" />
<strong>Authors</strong>: Feng Ruan
<br /><strong>Contact</strong>: fengruan@stanford.edu
<br /><strong>Keywords</strong>: constrained risk inequality, super-efficiency</p>
<hr />
<h4 id="comparing-the-value-of-labeled-and-unlabeled-data-in-method-of-moments-latent-variable-estimation"><a href="https://arxiv.org/abs/2103.02761">Comparing the Value of Labeled and Unlabeled Data in Method-of-Moments Latent Variable Estimation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-13-aistats-2021/img4" />
<strong>Authors</strong>: Mayee F. Chen, Benjamin Cohen-Wang, Stephen Mussmann, Frederic Sala, Christopher Ré
<br /><strong>Contact</strong>: mfchen@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2103.02761">Paper</a>
<br /><strong>Keywords</strong>: latent variable graphical model, method-of-moments, semi-supervised learning, model misspecification</p>
<hr />
<h4 id="efficient-computation-and-analysis-of-distributional-shapley-values"><a href="http://proceedings.mlr.press/v130/kwon21a.html">Efficient computation and analysis of distributional Shapley values</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-13-aistats-2021/img7" />
<strong>Authors</strong>: Yongchan Kwon, Manuel A. Rivas, James Zou
<br /><strong>Contact</strong>: yckwon@stanford.edu
<br /><strong>Links:</strong> <a href="http://proceedings.mlr.press/v130/kwon21a.html">Paper</a> | <a href="https://github.com/ykwon0407/fast_dist_shapley">Website</a>
<br /><strong>Keywords</strong>: data valuation, distributional shapley value</p>
<hr />
<h4 id="improving-adversarial-robustness-via-unlabeled-out-of-domain-data"><a href="http://proceedings.mlr.press/v130/deng21b.html">Improving Adversarial Robustness via Unlabeled Out-of-Domain Data</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-13-aistats-2021/img6" />
<strong>Authors</strong>: Zhun Deng, Linjun Zhang, Amirata Ghorbani, James Zou
<br /><strong>Contact</strong>: jamesz@stanford.edu
<br /><strong>Links:</strong> <a href="http://proceedings.mlr.press/v130/deng21b.html">Paper</a>
<br /><strong>Keywords</strong>: adversarial robustness, deep learning, out of domain data</p>
<hr />
<h4 id="misspecification-in-prediction-problems-and-robustness-via-improper-learning"><a href="http://proceedings.mlr.press/v130/marsden21a/marsden21a.pdf">Misspecification in Prediction Problems and Robustness via Improper Learning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-13-aistats-2021/img3" />
<strong>Authors</strong>: Annie Marsden, John Duchi, Gregory Valiant
<br /><strong>Contact</strong>: marsden@stanford.edu
<br /><strong>Award nominations:</strong> Oral Presentation
<br /><strong>Links:</strong> <a href="http://proceedings.mlr.press/v130/marsden21a/marsden21a.pdf">Paper</a>
<br /><strong>Keywords</strong>: machine learning, probabilistic forecasting, statistical learning theory</p>
<hr />
<h4 id="online-model-selection-for-reinforcement-learning-with-function-approximation"><a href="https://arxiv.org/abs/2011.09750">Online Model Selection for Reinforcement Learning with Function Approximation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-13-aistats-2021/img2" />
<strong>Authors</strong>: Jonathan Lee, Aldo Pacchiano, Vidya Muthukumar, Weihao Kong, Emma Brunskill
<br /><strong>Contact</strong>: jnl@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2011.09750">Paper</a>
<br /><strong>Keywords</strong>: reinforcement learning, model selection</p>
<hr />
<h4 id="right-decisions-from-wrong-predictions-a-mechanism-design-alternative-to-individual-calibration"><a href="http://proceedings.mlr.press/v130/zhao21a.html">Right Decisions from Wrong Predictions: A Mechanism Design Alternative to Individual Calibration</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-04-13-aistats-2021/img5" />
<strong>Authors</strong>: Shengjia Zhao, Stefano Ermon
<br /><strong>Contact</strong>: sjzhao@stanford.edu
<br /><strong>Award nominations:</strong> Oral
<br /><strong>Links:</strong> <a href="http://proceedings.mlr.press/v130/zhao21a.html">Paper</a> | <a href="https://ermongroup.github.io/blog/mechanism/">Blog Post</a>
<br /><strong>Keywords</strong>: uncertainty, trustworthiness, reliability</p>
<hr />
<p>We look forward to seeing you virtually at AISTATS!</p>
Tue, 13 Apr 2021 00:00:00 -0700Inside Chirpy Cardinal: Stanford's Open-Source Social Chatbot that Won 2nd place in the Alexa Prize
/blog/chirpy-cardinal/
/blog/chirpy-cardinal/<p style="font-size:1.3rem;">Last year, Stanford won 2nd place in the <a href="https://developer.amazon.com/alexaprize/challenges/past-challenges/challenge3">Alexa Prize Socialbot Grand Challenge 3</a> for social chatbots. In this post, we look into building a chatbot that combines the flexibility and naturalness of neural dialog generation with the reliability and practicality of scripted dialogue. We also announce an open-source version of our socialbot with the goal of enabling future research.</p>
<p style="font-size:1.3rem;">Our bot, Chirpy, is a modern social chatbot, tested and validated by real users, capable of discussing a broad range of topics. We can’t wait to introduce it to you!</p>
<h2 style="margin-top: 2rem" id="what-makes-chirpy-special">What makes Chirpy special?</h2>
<p>Social conversations – such as one you would have with a friend – challenge chatbots to demonstrate human traits: emotional intelligence and empathy, world knowledge, and conversational awareness. They also challenge us – researchers and programmers – to imagine novel solutions and build fast and scalable systems. As an <a href="https://developer.amazon.com/alexaprize/challenges/past-challenges/challenge3">Alexa Prize team</a>, we created Chirpy, a social chatbot that interacted with hundreds of thousands of people who rated the conversations and validated our approaches. We were able to successfully incorporate neural generation into Chirpy and here we explain what went into making that happen.</p>
<p><em>Recently, there has been enormous progress in building large neural net models that can produce fluent, coherent text by training them on a very large amount of text <sup id="fnref:BART"><a href="#fn:BART" class="footnote">1</a></sup><sup id="fnref:GPT-3"><a href="#fn:GPT-3" class="footnote">2</a></sup>. One might wonder, “Why not just extend these models and fine-tune them (or train them) on large dialogue corpora?”</em> In fact, Meena<sup id="fnref:Meena"><a href="#fn:Meena" class="footnote">3</a></sup>, BlenderBot<sup id="fnref:BlenderBot"><a href="#fn:BlenderBot" class="footnote">4</a></sup>, and DialoGPT<sup id="fnref:DialoGPT"><a href="#fn:DialoGPT" class="footnote">5</a></sup> attempt to do exactly this, yet they are not being used to chat with people in the real world – why? First, they lack controllability and can respond unpredictably to new or out-of-domain inputs. For example, they can generate toxic language and cause safety issues. Furthermore, they lack consistency over long conversations – forgetting, repeating, and contradicting themselves. Although neurally generated dialog is more flexible and natural than scripted dialog, as the number of neural turns increases, errors compound and so does the likelihood of producing inconsistent or illogical responses.</p>
<p><em>Since neural generation is unreliable, practical conversational agents are dominated by hand-written rules:</em> dialogue trees, templated responses, topic ontologies, and so on. Figure 1 shows an example of this type of conversation. The bot gives a series of scripted responses, choosing responses based on fixed criteria – for example, whether or not a song name is detected. This design has benefits: developers writing these rules can interpret the system’s choices, control the direction of the conversation, and ensure consistency. But these rules are brittle! An unanticipated user utterance that gets misclassified into a wrong branch can lead to absurd responses and lead down a path that is hard to recover from. Additionally, depending on a predetermined set of templated responses limits the range of possibilities and can make the bot sound unnatural.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" style="float:right; display:inline;" src="/blog/assets/img/posts/2021-04-09-chirpy-cardinal/figure1_600p.png" /></p>
<figcaption>
<strong>Figure 1</strong>: Example of a hand-written dialogue tree. Unanticipated or misclassified user responses can lead to absurd responses and paths that are hard to recover from.
</figcaption>
</div></figure>
<p>Both neural and scripted dialog have clear benefits and drawbacks. We designed Chirpy to take advantage of both, choosing a modular architecture that lets us fluidly combine elements of neural and scripted conversations. We refer to our bot’s modules as response generators, or RGs. Each response generator is designed for a specific type of conversation such as talking about music, exchanging opinions, or sharing factual information. Based on their roles, some response generators are scripted, some are entirely neural, and others use a combination of neural and scripted dialog. We track entities, topics, emotions, and opinions using rules that maintain consistency across response generators, without sacrificing the flexibility of our neural components. This modular design allows users to add new response generators without needing to alter large parts of the codebase every time they want to extend Chirpy’s coverage. In the following sections, we highlight a few response generators and how they fit into our broader system.</p>
<h2 id="response-generators">Response Generators</h2>
<p style="font-size:1.25rem;">Our response generators range from fully rule-based to fully neural. First, we’ll highlight three to demonstrate this range: the music response generator, which is entirely rule-based; the personal chat response generator, which relies entirely on a neural generative model; and the Wikipedia response generator, which combines rule-based and neurally generated content. In the next section, we’ll discuss how Chirpy decides which RG to use, so that it can leverage their different strengths.</p>
<h4 id="music-response-generator">Music Response Generator</h4>
<p>Chirpy’s music response generator uses rule-based, scripted dialog trees, like the example shown in Figure 1. It asks the user a series of questions about their musical preferences and, depending on their answers, selects a response. All possible responses are hand-written in advance, so this RG is highly effective at handling cases where the user responds as expected, but has more difficulty when they say something it doesn’t have a rule for.</p>
<h4 id="personal-chat-response-generator">Personal Chat Response Generator</h4>
<p>We wanted Chirpy to have the ability to discuss users’ personal experiences and emotions. Since these are highly varied, we used neural generation because of its flexibility when handling previously unseen utterances. As shown in Figure 2, neural generation models input a context and then generate the response word-by-word. We fine-tuned a GPT2-medium model <sup id="fnref:transfer-transfo"><a href="#fn:transfer-transfo" class="footnote">6</a></sup> on the EmpatheticDialogues dataset <sup id="fnref:empathetic-dialogues"><a href="#fn:empathetic-dialogues" class="footnote">7</a></sup>, which consists of conversations between a speaker describing an emotional personal experience and a listener who responds to the speaker. Since the conversations it contains are relatively brief, grounded in specific emotional situations, and focused on empathy, this dataset is well suited to the personal chat RG.</p>
<p>To keep neural conversations focused and effective, we begin each personal chat discussion by asking the user a scripted starter question, e.g., What do you like to do to relax? On each subsequent turn, we pass the conversation history as context to the GPT-2 model, and then sample 20 diverse responses. When selecting the final response, we prioritize generations that contain questions. However, if fewer than one third of the responses contain questions, we assume that the model no longer has a clear path forward, select a response without a question, and hand over to another response generator. Not continuing neurally generated conversation segments for too long is a simple but effective strategy for preventing the overall conversation quality from degrading.</p>
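<p>The selection rule above can be sketched as follows. This is a simplified illustration, not the code from our repo: the function name is hypothetical, a response is treated as containing a question if it contains a question mark, and among eligible candidates we simply take the first one (the real ranking is not described here).</p>

```python
def select_personal_chat_response(candidates):
    """Return (response, hand_over) from a list of sampled responses."""
    if not candidates:
        raise ValueError("need at least one sampled response")
    # Assumption: a '?' is a good enough signal that a response asks a question.
    with_question = [c for c in candidates if "?" in c]
    without_question = [c for c in candidates if "?" not in c]
    # Fewer than one third of the samples contain a question: wrap up and hand over.
    if 3 * len(with_question) < len(candidates):
        return without_question[0], True
    # Otherwise keep the conversation going with a question.
    return with_question[0], False


samples = [
    "That sounds relaxing.",
    "I like reading too. What are you reading right now?",
    "Nice!",
]
print(select_personal_chat_response(samples))
```

<p>With the three made-up samples above, exactly one third contain a question, so the sketch keeps the neural conversation going and does not hand over.</p>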
<h4 id="wiki-response-generator">Wiki Response Generator</h4>
<p>We wanted Chirpy to be able to discuss a broad range of topics in depth. One source of information for a broad range of topics is Wikipedia, which provides in-depth content for millions of entities. Chirpy tracks the entity under discussion, and if it is able to find a corresponding Wikipedia article, the Wiki RG searches for relevant sentences using TF-IDF, a standard technique used by search engines to find relevant documents based on text overlap with an underlying query. To encourage such overlap, we have our bot ask a handwritten open-ended question that is designed to evoke a meaningful response, e.g., in Figure 2: “I’m thinking about visiting the Trevi fountain. Do you have any thoughts about what I should do?”</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" style="float:right; display:inline;" src="/blog/assets/img/posts/2021-04-09-chirpy-cardinal/figure2_600p.png" /></p>
<figcaption>
<strong>Figure 2</strong>: In the first utterance, we have our bot ask a handwritten question. The user in response provides a meaningful answer which we can use to find related content from the Wikipedia page on the Trevi Fountain. The neural rephrasing model takes the two prior turns and the snippet as input to produce a response that weaves factual sentences into the conversational narrative.
</figcaption>
</div></figure>
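<p>For readers unfamiliar with TF-IDF retrieval, here is a small self-contained sketch of ranking sentences against a user utterance. It is not the retrieval code used in Chirpy: the tokenizer, the smoothed IDF formula, and the function names are assumptions, and the example sentences are made up for illustration rather than taken from Wikipedia.</p>

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_sentences(query, sentences):
    """Return (score, sentence) pairs sorted by TF-IDF cosine similarity."""
    docs = [Counter(tokenize(s)) for s in sentences]
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(doc.keys())

    def idf(term):
        # Smoothed inverse document frequency; rare terms weigh more.
        return math.log((1 + len(docs)) / (1 + doc_freq[term])) + 1.0

    def weigh(counts):
        return {term: count * idf(term) for term, count in counts.items()}

    def cosine(u, v):
        dot = sum(w * v.get(term, 0.0) for term, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    query_vec = weigh(Counter(tokenize(query)))
    scored = [(cosine(query_vec, weigh(doc)), s) for doc, s in zip(docs, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up sentences standing in for an article's content.
sentences = [
    "The fountain is a popular place to toss a coin.",
    "The city has many museums.",
    "Local trains run every hour.",
]
best_score, best_sentence = rank_sentences("I would love to toss a coin in the fountain", sentences)[0]
print(best_sentence)
```

<p>The more words the user’s answer shares with a sentence, the higher that sentence ranks, which is why an open-ended question that draws out a detailed answer helps retrieval.</p>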
<p>However, quoting sentences from Wikipedia is insufficient, due to its encyclopedic, stiff writing style. When people share information, they connect it to the conversation thus far and style it conversationally. Thus, we use a neural rephrasing model that takes the conversational history and the retrieved Wikipedia sentence as input and generates a conversationally phrased reply. This model is trained on a modified version of the Topical Chat Dataset <sup id="fnref:topical-chat"><a href="#fn:topical-chat" class="footnote">8</a></sup>, which contains conversations paired with factual sentences. Unfortunately, the model isn’t perfect and makes mistakes from time to time. We handle users’ confusion with a few handwritten rules.</p>
<h2 id="how-do-the-response-generators-fit-together">How do the response generators fit together?</h2>
<p>There are clear benefits to all three types of dialog – entirely scripted, partially scripted, and entirely neural. So on a given turn, how do we decide which kind of response to give?</p>
<p>Many other chatbots use a dialog management strategy where on a given turn, the bot gets a user utterance, decides which module can handle it best, and then returns the next response from this module. Our bot delays that decision until after generation, so that it can make a more informed choice. On each turn, every module within the bot generates a response and a self-assessed priority using module specific context. Once every response generator has produced some output, the bot will then use these priorities to select the highest priority response.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" style="float:right; display:inline;" src="/blog/assets/img/posts/2021-04-09-chirpy-cardinal/figure3_600p.png" /></p>
<figcaption>
<strong>Figure 3</strong>: Example conversation with our bot. In the second utterance, our bot uses a response from the Launch RG appended with a prompt from the Personal Chat RG. When the topic of conversation shifts to music, the Music RG takes over. In the last turn, when the user shows a lack of interest, the Music RG produces a response to acknowledge this and hands over to the Categories RG, which provides a prompt.
</figcaption>
</div></figure>
<p>Since a module may decide it has finished discussing a topic, we allow another module to append a prompt and take over on the same turn. The first module’s response acknowledges the user’s previous utterance, and the second module’s prompt gives the user direction for the next turn. For example, the user might receive a response “I also like avocados” from the opinions response generator, which is used for casual exchange of personal opinions, and then a prompt “Would you like to know more about the history of avocados?” from the Wikipedia response generator, which is used for sharing factual information. Figure 3 shows an example of the type of conversation our bot has, with different RGs handling their own specialized types of dialog.</p>
<p>The animations below show how the response and prompt selection works. The opinion, Wikipedia, and personal chat modules use the state and annotations to generate responses, the bot selects the best response and prompt, and then the bot updates its state based on this choice.</p>
<figure class="figure"><div class="figure__main">
<video muted="" controls="" playsinline="" class="postimage">
<source src="/blog/assets/img/posts/2021-04-09-chirpy-cardinal/chirpy_animation_1.mp4" type="video/mp4" />
</video>
<figcaption style="text-align:center;">
<strong>Step 1</strong>: Annotators run on the new user message, and their annotations are stored in Chirpy's state.
</figcaption>
</div></figure>
<figure class="figure"><div class="figure__main">
<video muted="" controls="" playsinline="" class="postimage">
<source src="/blog/assets/img/posts/2021-04-09-chirpy-cardinal/chirpy_animation_2_no_border.mp4" type="video/mp4" />
</video>
<figcaption style="text-align:center;">
<strong>Step 2</strong>: Response generators use annotations to produce responses and the response with the highest priority is selected.
</figcaption>
</div></figure>
<figure class="figure"><div class="figure__main">
<video muted="" controls="" playsinline="" class="postimage">
<source src="/blog/assets/img/posts/2021-04-09-chirpy-cardinal/chirpy_animation_3_no_border.mp4" type="video/mp4" />
</video>
<figcaption style="text-align:center;">
<strong>Step 3</strong>: Since the highest priority response did not include a prompt, response generators are also asked to produce prompts. The highest-priority prompt is chosen and combined with the previously selected response. This message is returned to the user.
</figcaption>
</div></figure>
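<p>The response and prompt selection walked through above can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the dialog manager from our repo: the <code>Candidate</code> class, the integer priorities, and the <code>needs_prompt</code> flag are hypothetical names, and we assume the prompt must come from a different response generator than the chosen response.</p>

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    rg_name: str
    text: str
    priority: int
    needs_prompt: bool = False  # True if the RG is done and wants a hand-over


def choose_turn(responses: List[Candidate], prompts: List[Candidate]) -> str:
    """Pick the highest-priority response and, if needed, append a prompt."""
    best_response = max(responses, key=lambda c: c.priority)
    if not best_response.needs_prompt:
        return best_response.text
    # Assumption: the prompt comes from a different RG than the response.
    eligible = [p for p in prompts if p.rg_name != best_response.rg_name]
    if not eligible:
        return best_response.text
    best_prompt = max(eligible, key=lambda c: c.priority)
    return f"{best_response.text} {best_prompt.text}"


responses = [
    Candidate("opinion", "I also like avocados.", priority=3, needs_prompt=True),
    Candidate("personal_chat", "Tell me more about your day.", priority=1),
]
prompts = [
    Candidate("wiki", "Would you like to know more about the history of avocados?", priority=2),
    Candidate("music", "Do you listen to music while cooking?", priority=1),
]
print(choose_turn(responses, prompts))
```

<p>Because every response generator proposes a candidate before the choice is made, the selection can compare actual outputs rather than guessing in advance which module will do best.</p>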
<h2 id="how-can-i-use-chirpy">How can I use Chirpy?</h2>
<p>We’re open-sourcing Chirpy Cardinal, so that others can expand on the existing socialbot, create their own, or simply have a social conversation. This release is unique for several reasons.</p>
<dl>
<dt>User-tested design</dt>
<dd>Our bot has already been tested over hundreds of thousands of conversations during the Alexa Prize Competition. Its strategies are verified and can appeal to a broad range of users with diverse interests.</dd>
<dt>Essential building blocks</dt>
<dd>We’ve implemented time-consuming but essential basics, such as entity-linking and dialog management, so that you won’t have to. This allows new developers to move chatbot design forward faster, focusing on higher-level research and design.</dd>
<dt>Customizable architecture</dt>
<dd>Our bot’s flexible architecture allows easier customization for your own unique use cases. You can introduce new areas of content by creating specialized response generators, and you can choose what your bot prioritizes by adjusting the settings of the dialog manager.</dd>
<dt>Experiment framework</dt>
<dd>Finally, our bot was designed to enable dialog research. Users can create their own experiments, which are stored as parameters of the state, determine how often these experiments should be triggered, and then use the collected data to compare different strategies or models.</dd>
</dl>
<p>To get started, you can try the <a href="https://stanfordnlp.github.io/chirpycardinal/live_demo/">live demo</a> of Chirpy yourself before diving into the code in our <a href="https://github.com/stanfordnlp/chirpycardinal">GitHub repo</a>.</p>
<p>You can find more details about our system in a 30-minute <a href="https://www.youtube.com/watch?v=2pmAvOJOmGg">overview presentation</a> or our <a href="https://arxiv.org/abs/2008.12348">technical paper</a>.</p>
<p>Our team continues to work on improving open-domain dialogue. You can find more about our current Alexa Prize team, publications and other updates at <a href="https://stanfordnlp.github.io/chirpycardinal/">https://stanfordnlp.github.io/chirpycardinal/</a></p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>We thank our colleagues at <a href="https://stanfordnlp.github.io/chirpycardinal/people/">Stanford’s Alexa Prize Team</a>: Abigail See (co-lead), Kathleen Kenealy, Haojun Li, Peng Qi, Kaushik Ram Sadagopan, Nguyet Minh Phu, Dilara Soylu, and Christopher D. Manning (faculty advisor).</p>
<p>We also thank Haojun Li and Dilara Soylu for helping us with open sourcing of the codebase.</p>
<p>Thanks to Siddharth Karamcheti, Megha Srivastava and the rest of the SAIL Blog Team for reviewing and publishing our blog post.</p>
<p>This research was supported in part by <a href="https://oval.cs.stanford.edu">Stanford Open Virtual Assistant Lab (OVAL)</a> (for Amelia Hardy) and Alexa Prize Stipend Award (for Haojun Li, Kaushik Ram Sadagopan, Nguyet Minh Phu, Dilara Soylu).</p>
<div class="footnotes">
<ol>
<li id="fn:BART">
<p>Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. (2019). Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. CoRR, abs/1910.13461. <a href="#fnref:BART" class="reversefootnote">↩</a></p>
</li>
<li id="fn:GPT-3">
<p>Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, … and Dario Amodei. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165 <a href="#fnref:GPT-3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:Meena">
<p>Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, … Quoc V. Le. (2020). Towards a human-like open-domain chatbot. arXiv preprint arXiv:2001.09977. <a href="#fnref:Meena" class="reversefootnote">↩</a></p>
</li>
<li id="fn:BlenderBot">
<p>Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, … Jason Weston. (2020). Recipes for building an open-domain chatbot. arXiv preprint arXiv:2004.13637. <a href="#fnref:BlenderBot" class="reversefootnote">↩</a></p>
</li>
<li id="fn:DialoGPT">
<p>Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, & Bill Dolan. (2020). DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. arXiv preprint arXiv:1911.00536. <a href="#fnref:DialoGPT" class="reversefootnote">↩</a></p>
</li>
<li id="fn:transfer-transfo">
<p>Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. (2019). Language Models are Unsupervised Multitask Learners. <a href="#fnref:transfer-transfo" class="reversefootnote">↩</a></p>
</li>
<li id="fn:empathetic-dialogues">
<p>Hannah Rashkin, Eric Michael Smith, Margaret Li and Y-Lan Boureau. (2019). Towards Empathetic Open-domain Conversation Models: a New Benchmark and Dataset. arXiv preprint arXiv:1811.00207. <a href="#fnref:empathetic-dialogues" class="reversefootnote">↩</a></p>
</li>
<li id="fn:topical-chat">
<p>Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, & Dilek Hakkani-Tür (2019). Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations. In Proc. Interspeech 2019 (pp. 1891–1895). <a href="#fnref:topical-chat" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fri, 09 Apr 2021 00:00:00 -0700Neural Mechanics: Symmetry and Broken Conservation Laws In Deep Learning Dynamics
/blog/neural-mechanics/
/blog/neural-mechanics/<p>Just like the fundamental laws of classical and quantum mechanics taught us how to control and optimize the physical world for engineering purposes, a better understanding of the laws governing neural network learning dynamics can have a profound impact on the optimization of artificial neural networks. This raises a foundational question: what, if anything, can we quantitatively understand about the learning dynamics of state-of-the-art deep learning models driven by real-world datasets?</p>
<p>In order to make headway on this extremely difficult question, existing works have made major simplifying assumptions on the architecture, such as restricting to a single hidden layer <sup id="fnref:saad1995dynamics"><a href="#fn:saad1995dynamics" class="footnote">1</a></sup>, linear activation functions <sup id="fnref:saxe2013exact"><a href="#fn:saxe2013exact" class="footnote">2</a></sup>, or infinite width layers <sup id="fnref:jacot2018neural"><a href="#fn:jacot2018neural" class="footnote">3</a></sup>. These works have also ignored the complexity introduced by the optimizer through stochastic and discrete updates. In the present work, rather than introducing unrealistic assumptions on the architecture or optimizer, we identify combinations of parameters with simpler dynamics (as shown in Fig. 1) that can be solved exactly!</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-02-25-neural-mechanics/image1.gif" /></p>
</div></figure>
<p><strong>Fig. 1.</strong> <em>We plot the per-parameter dynamics (left) and per-filter squared Euclidean norm dynamics (right) for the convolutional layers of a VGG-16 model (with batch normalization) trained on Tiny ImageNet with SGD with learning rate <script type="math/tex">\eta = 0.1</script>, weight decay <script type="math/tex">\lambda = 10^{-4}</script>, and batch size <script type="math/tex">S = 256</script>. Each color represents a different convolutional block. While the parameter dynamics are noisy and chaotic, the neuron dynamics are smooth and patterned.</em></p>
<h2 id="symmetries-in-the-loss-shape-gradient-and-hessian-geometry">Symmetries in the loss shape gradient and Hessian geometry</h2>
<p>While we commonly initialize neural networks with random weights, their gradients and Hessians at all points in training, no matter the loss or dataset, obey certain geometric constraints. Some of these constraints have been noticed previously as a form of implicit regularization, while others have been leveraged algorithmically in applications from network pruning to interpretability. Remarkably, all these geometric constraints can be understood as consequences of numerous symmetries in the loss introduced by neural network architectures.</p>
<p>A set of parameters observes a symmetry in the loss if the loss doesn’t change under a certain transformation of these parameters. This invariance introduces associated geometric constraints on the gradient and Hessian. We consider three families of symmetries (translation, scale, and rescale) that commonly appear in modern neural network architectures.</p>
<ul>
<li>Translation symmetry is defined by the transformation <script type="math/tex">\psi(\theta, \alpha) = \theta + \alpha\mathbb{1}_{\mathcal{A}}</script> where <script type="math/tex">\mathbb{1}_{\mathcal{A}}</script> is the indicator vector for some subset <script type="math/tex">\mathcal{A}</script> of the parameters <script type="math/tex">\{\theta_1, ..., \theta_m\}</script>. Any network using the softmax function gives rise to translation symmetry for the parameters immediately preceding the function.</li>
<li>Scale symmetry is defined by the transformation <script type="math/tex">\psi(\theta, \alpha) = \alpha_\mathcal{A} \odot \theta</script> where <script type="math/tex">\alpha_\mathcal{A} = \alpha \mathbb{1}_\mathcal{A} + \mathbb{1}_\mathcal{A^\mathsf{c}}</script>. Batch normalization leads to scale invariance for the parameters immediately preceding the function.</li>
<li>Rescale symmetry is defined by the transformation <script type="math/tex">\psi(\theta, \alpha) = \alpha_{\mathcal{A}_1} \odot \alpha^{-1}_{\mathcal{A}_2} \odot \theta</script> where <script type="math/tex">\mathcal{A}_1</script> and <script type="math/tex">\mathcal{A}_2</script> are two disjoint sets of parameters. For networks with continuous, homogeneous activation functions <script type="math/tex">\phi(z) = \phi'(z)z</script> (e.g. ReLU, Leaky ReLU, linear), this symmetry emerges at every hidden neuron by considering all incoming and outgoing parameters to the neuron.</li>
</ul>
<p>These symmetries enforce geometric constraints on the gradient of a neural network <script type="math/tex">g</script>,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\textbf{Translation:}&\quad\langle g, \mathbb{1}_\mathcal{A} \rangle = 0\\
\textbf{Scale:}&\quad\langle g, \theta_\mathcal{A} \rangle = 0\\
\textbf{Rescale:}&\quad\langle g, \theta_{\mathcal{A}_1} - \theta_{\mathcal{A}_2}\rangle = 0
\end{aligned} %]]></script>
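<p>These three constraints are easy to check numerically on toy losses. The sketch below is our own illustration, not code from the paper: it uses a softmax cross-entropy for translation symmetry, a loss that depends only on the normalized parameter vector as a stand-in for batch normalization, and the linear path loss for rescale symmetry. All function names and constants are hypothetical.</p>

```python
import numpy as np


def num_grad(f, theta, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad


def softmax_loss(theta):
    """Cross-entropy of softmax(theta) against class 0 (translation symmetric)."""
    z = theta - theta.max()
    return -(z[0] - np.log(np.exp(z).sum()))


def normalized_loss(theta):
    """Depends only on theta / |theta| (scale symmetric); stand-in for batch norm."""
    a = np.array([1.0, -2.0, 0.5])
    return float(a @ (theta / np.linalg.norm(theta)))


def linear_path_loss(theta, x=1.5, y=0.7):
    """Squared error of the linear path theta_2 * theta_1 * x (rescale symmetric)."""
    return (theta[1] * theta[0] * x - y) ** 2


rng = np.random.default_rng(0)
theta3 = rng.normal(size=3)
theta2 = rng.normal(size=2)

# Translation: <g, 1> = 0
translation = num_grad(softmax_loss, theta3) @ np.ones(3)
# Scale: <g, theta> = 0
scale = num_grad(normalized_loss, theta3) @ theta3
# Rescale: <g, theta_A1 - theta_A2> = g_1 * theta_1 - g_2 * theta_2 = 0
g = num_grad(linear_path_loss, theta2)
rescale = g[0] * theta2[0] - g[1] * theta2[1]
```

<p>All three quantities should vanish up to finite-difference error, whatever the random parameters are.</p>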
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-02-25-neural-mechanics/image2.jpg" /></p>
</div></figure>
<p><strong>Fig. 2.</strong> <em>We visualize the vector fields associated with simple network components that have translation, scale, and rescale symmetry. On the right we consider the vector field associated with a neuron <script type="math/tex">% <![CDATA[
\sigma\left(\begin{bmatrix}\theta_1 & \theta_2\end{bmatrix}^\intercal x\right) %]]></script> where <script type="math/tex">\sigma</script> is the softmax function. In the middle we consider the vector field associated with a neuron <script type="math/tex">% <![CDATA[
\text{BN}\left(\begin{bmatrix}\theta_1 & \theta_2\end{bmatrix}\begin{bmatrix}x_1 & x_2\end{bmatrix}^\intercal\right) %]]></script> where <script type="math/tex">\text{BN}</script> is the batch normalization function. On the left we consider the vector field associated with a linear path <script type="math/tex">\theta_2\theta_1 x</script>.</em></p>
<h2 id="symmetry-leads-to-conservation-laws-under-gradient-flow">Symmetry leads to conservation laws under gradient flow</h2>
<p>We now consider how geometric constraints on gradients and Hessians, arising as a consequence of symmetry, impact the learning dynamics given by stochastic gradient descent (SGD). We will consider a model parameterized by <script type="math/tex">\theta</script>, a training dataset <script type="math/tex">\{x_{1}, ..., x_{N}\}</script> of size <script type="math/tex">N</script>, and a training loss <script type="math/tex">\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^N\ell(\theta, x_i)</script> with corresponding gradient <script type="math/tex">g(\theta) = \frac{\partial \mathcal{L}}{\partial\theta}</script>. The gradient descent update with learning rate <script type="math/tex">\eta</script> is <script type="math/tex">\theta^{(n+1)} = \theta^{(n)} - \eta g(\theta^{(n)})</script>, which is a forward Euler discretization with step size <script type="math/tex">\eta</script> of the ordinary differential equation (ODE) <script type="math/tex">\frac{d\theta}{dt} = -g(\theta)</script>. In the limit as <script type="math/tex">\eta \to 0</script>, gradient descent exactly matches the dynamics of this ODE, which is commonly referred to as gradient flow. Equipped with a continuous model for the learning dynamics, we now ask: how do the dynamics interact with the geometric properties introduced by symmetries?</p>
<p>Strikingly similar to <a href="https://www.google.com/url?q=https://en.wikipedia.org/wiki/Noether%2527s_theorem&sa=D&source=editors&ust=1614205020229000&usg=AOvVaw1FkghDm15tT1bYlTSo-QKm">Noether’s theorem</a>, which describes a fundamental relationship between symmetry and conservation for physical systems governed by Lagrangian dynamics, every symmetry of a network architecture has a corresponding “conserved quantity” through training under gradient flow. Just as the total kinetic and potential energy is conserved for an idealized spring in harmonic motion, certain combinations of parameters are constant under gradient flow dynamics.</p>
<p>Consider some subset of the parameters <script type="math/tex">\mathcal{A}</script> that respects either a translation, scale, or rescale symmetry. As discussed earlier, the gradient of the loss <script type="math/tex">g(\theta)</script> is always perpendicular to the vector field that generates the symmetry <script type="math/tex">\partial_\alpha \psi</script>. Projecting the gradient flow learning dynamics onto the generator vector field yields a differential equation <script type="math/tex">\langle\frac{d\theta}{dt}, \partial_\alpha \psi\rangle = 0</script>. Integrating this equation through time results in the conservation laws,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\textbf{Translation:}&\quad\langle \theta_\mathcal{A}(t), \mathbb{1} \rangle = \langle \theta_\mathcal{A}(0), \mathbb{1} \rangle\\
\textbf{Scale:}&\quad|\theta_\mathcal{A}(t)|^2 = |\theta_\mathcal{A}(0)|^2\\
\textbf{Rescale:}&\quad|\theta_{\mathcal{A}_1}(t)|^2 - |\theta_{\mathcal{A}_2}(t)|^2 = |\theta_{\mathcal{A}_1}(0)|^2 - |\theta_{\mathcal{A}_2}(0)|^2
\end{aligned} %]]></script>
<p>Each of these equations defines a conserved constant through training, effectively restricting the possible trajectory the parameters take through learning. For parameters with translation symmetry, their sum is conserved, effectively constraining their dynamics to a hyperplane. For parameters with scale symmetry, their Euclidean norm is conserved, effectively constraining their dynamics to a sphere. For parameters with rescale symmetry, their difference in squared Euclidean norm is conserved, effectively constraining their dynamics to a hyperbola.</p>
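<p>As a rough numerical illustration of the rescale conservation law, the sketch below runs forward Euler steps on the linear path loss with two parameters. The toy loss, starting point, and step sizes are our own assumptions. With a tiny step the updates approximate gradient flow and the difference of squares barely moves; with a typical finite learning rate it drifts.</p>

```python
import numpy as np


def loss_grad(theta, x=1.0, y=2.0):
    """Gradient of (theta_2 * theta_1 * x - y)^2 with respect to (theta_1, theta_2)."""
    residual = theta[1] * theta[0] * x - y
    return np.array([2 * residual * theta[1] * x, 2 * residual * theta[0] * x])


def conserved_drift(step, total_time=2.0):
    """Change in theta_1^2 - theta_2^2 after integrating up to total_time."""
    theta = np.array([1.5, 0.5])  # arbitrary starting point (assumption)
    q0 = theta[0] ** 2 - theta[1] ** 2
    for _ in range(int(round(total_time / step))):
        theta = theta - step * loss_grad(theta)  # forward Euler / gradient descent
    return theta[0] ** 2 - theta[1] ** 2 - q0


drift_small = conserved_drift(1e-4)  # close to gradient flow
drift_large = conserved_drift(1e-1)  # finite learning rate
```

<p>For this toy loss each discrete step changes the conserved quantity by exactly the squared step size times the difference of squared gradient components, which is why the drift shrinks as the step size goes to zero.</p>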
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-02-25-neural-mechanics/image6.gif" /></p>
</div></figure>
<p><strong>Fig. 3.</strong> <em>Associated with each symmetry is a conserved quantity constraining the gradient flow dynamics to a surface. For translation symmetry (right) the flow is constrained to a hyperplane where the intercept is conserved. For scale symmetry (middle) the flow is constrained to a sphere where the radius is conserved. For rescale symmetry (left) the flow is constrained to a hyperbola where the axes are conserved. The color represents the value of the conserved quantity, where blue is positive and red is negative, and the black lines are level sets.</em>
</p>
<h2 id="a-realistic-continuous-model-for-stochastic-gradient-descent">A realistic continuous model for stochastic gradient descent</h2>
<p>While the conservation laws derived with gradient flow are quite striking, empirically we know they are broken, as demonstrated in Fig. 1. Gradient flow is too simple a continuous model for realistic SGD training: it fails to account for the effect of hyperparameters such as weight decay and momentum, the effect of stochasticity introduced by random batches of data, and the effect of discrete updates due to a finite learning rate. Here, we consider how to address these effects individually to construct more realistic continuous models of SGD.</p>
<p>Modeling weight decay. Explicit regularization through the addition of an <script type="math/tex">L_2</script> penalty on the parameters, with regularization constant <script type="math/tex">\lambda</script>, is a very common practice when training neural networks. Weight decay modifies the gradient flow trajectory pulling the network towards the origin in parameter space.</p>
<p>Modeling momentum. Momentum is a common extension to SGD that uses an exponential moving average of gradients to update parameters rather than a single gradient evaluation. The method introduces an additional hyperparameter <script type="math/tex">\beta</script>, which controls how past gradients are used in future updates, resulting in a form of “inertia” that accelerates the learning dynamics by rescaling time but leaves the gradient flow trajectory intact.</p>
<p>Modeling stochasticity. Stochastic gradients arise when we consider a batch <script type="math/tex">\mathcal{B}</script> of size <script type="math/tex">S</script> drawn uniformly from the indices <script type="math/tex">\{1,...,N\}</script> forming the unbiased gradient estimate <script type="math/tex">\hat{g}_{\mathcal{B}}(\theta) = \frac{1}{S}\sum_{i\in\mathcal{B}}\nabla\ell(\theta, x_i)</script>. We can model the batch gradient <script type="math/tex">\hat{g}_{\mathcal{B}}(\theta)</script> as a noisy version of the true gradient <script type="math/tex">g(\theta)</script>. However, because both the batch gradient and true gradient observe the same geometric properties introduced by symmetry, this noise has a special low-rank structure. In other words, stochasticity introduced by random batches does not affect the gradient flow dynamics in the directions associated with symmetry.</p>
<p>Modeling discretization. Gradient descent always moves in the direction of steepest descent on a loss function <script type="math/tex">\mathcal{L}</script> at each step. However, due to the finite nature of the learning rate, it fails to remain on the continuous steepest descent path given by gradient flow. In order to model this discrepancy, we borrow tools from the numerical analysis of partial differential equations. In particular, we use modified equation analysis <sup id="fnref:warming1974modified"><a href="#fn:warming1974modified" class="footnote">4</a></sup>, which determines how to model the numerical artifacts introduced by a discretization of a PDE. In our paper we present two methods based on modified equation analysis and recent works <sup id="fnref:barrett2020implicit"><a href="#fn:barrett2020implicit" class="footnote">5</a></sup>, <sup id="fnref:kovachki2019analysis"><a href="#fn:kovachki2019analysis" class="footnote">6</a></sup>, which modify gradient flow, with either higher order derivatives of the loss or higher order temporal derivatives of the parameters, to account for the effect of discretization on the learning dynamics.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-02-25-neural-mechanics/image4.gif" /></p>
</div></figure>
<p><strong>Fig. 4.</strong> <em>We visualize the trajectories of gradient descent with momentum (black dots), gradient flow (blue line), and the modified dynamics (red line) on the quadratic loss <script type="math/tex">% <![CDATA[
\mathcal{L}(w) = w^\intercal\begin{bmatrix}2.5 & -1.5\\ -1.5 & 2 \end{bmatrix}w %]]></script>. The modified continuous dynamics visually track the discrete dynamics much better than the original gradient flow dynamics.</em></p>
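<p>A simplified version of this comparison can be reproduced on the same quadratic loss. The sketch below is an assumption-laden illustration rather than the method from our paper: it uses plain gradient descent without momentum, and it takes the modified loss <code>L + (lr / 4) * |grad L|^2</code> from implicit gradient regularization (footnote 5) as the modified dynamics. The learning rate, step count, and starting point are arbitrary.</p>

```python
import numpy as np

A = np.array([[2.5, -1.5], [-1.5, 2.0]])
H = 2.0 * A                 # Hessian of L(w) = w^T A w, so grad L = H w
lr, n_steps = 0.05, 20      # arbitrary choices for illustration
w0 = np.array([1.0, 1.0])

# Discrete gradient descent.
w_discrete = w0.copy()
for _ in range(n_steps):
    w_discrete = w_discrete - lr * (H @ w_discrete)

# Linear ODEs dw/dt = -M w can be solved exactly in the eigenbasis of H.
evals, evecs = np.linalg.eigh(H)


def linear_flow(rates, t):
    """Solution at time t of the flow whose decay rate along each eigenvector is given."""
    return evecs @ (np.exp(-rates * t) * (evecs.T @ w0))


t = lr * n_steps
w_flow = linear_flow(evals, t)                              # gradient flow
w_modified = linear_flow(evals + 0.5 * lr * evals ** 2, t)  # flow on the modified loss

err_flow = np.linalg.norm(w_discrete - w_flow)
err_modified = np.linalg.norm(w_discrete - w_modified)
```

<p>On this example <code>err_modified</code> comes out smaller than <code>err_flow</code>, in line with the qualitative picture in Fig. 4.</p>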
<h2 id="combining-symmetry-and-modified-gradient-flow-to-derive-exact-learning-dynamics">Combining symmetry and modified gradient flow to derive exact learning dynamics</h2>
<p>We now study how weight decay, momentum, stochastic gradients, and finite learning rates all interact to break the conservation laws of gradient flow. Remarkably, even when using a more realistic continuous model for stochastic gradient descent, we can derive exact learning dynamics for the previously conserved quantities. To do this we (i) consider a realistic continuous model for SGD, (ii) project these learning dynamics onto the generator vector fields <script type="math/tex">\partial_\alpha \psi</script> associated with each symmetry, (iii) harness the geometric constraints introduced by symmetry to derive simplified ODEs, and (iv) solve these ODEs to obtain exact dynamics for the previously conserved quantities. We first consider the continuous model of SGD without momentum incorporating weight decay, stochasticity, and a finite learning rate. In this setting, the exact dynamics for the parameter combinations tied to the symmetries are,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\textbf{Translation:}&\quad\langle \theta_\mathcal{A}(t), \mathbb{1} \rangle = e^{-\lambda t} \langle \theta_\mathcal{A}(0), \mathbb{1} \rangle\\
\textbf{Scale:}&\quad|\theta_\mathcal{A}(t)|^2 = e^{- 2 \lambda t} |\theta_\mathcal{A}(0)|^2 + \eta \int_0^t e^{-2\lambda (t-\tau)} \left| g_\mathcal{A} \right|^2 d\tau\\
\textbf{Rescale:}&\quad|\theta_{\mathcal{A}_1} (t)|^2 - |\theta_{\mathcal{A}_2} (t)|^2 = \\
&\quad e^{- 2 \lambda t} (|\theta_{\mathcal{A}_1} (0)|^2 - |\theta_{\mathcal{A}_2} (0)|^2) + \eta \int_0^t e^{-2\lambda (t-\tau)} \left(\left| g_{\theta_{\mathcal{A}_1}} \right|^2 - \left| g_{\theta_{\mathcal{A}_2}} \right|^2\right)
d\tau
\end{aligned} %]]></script>
<p>Notice how these equations are equivalent to the conservation laws when <script type="math/tex">\eta = \lambda = 0</script>. Remarkably, even in typical hyperparameter settings (weight decay, stochastic batches, finite learning rates), these solutions match nearly perfectly with empirical results from modern neural networks (VGG-16) trained on real-world datasets (Tiny ImageNet), as shown in Fig. 5.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-02-25-neural-mechanics/image3.gif" /></p>
</div></figure>
<p><strong>Fig. 5.</strong> <em>We plot the column sum of the final linear layer (left) and the difference between squared channel norms of the fifth and fourth convolutional layer (right) of a VGG-16 model without batch normalization. We plot the squared channel norm of the second convolution layer (middle) of a VGG-16 model with batch normalization. Both models are trained on Tiny ImageNet with SGD with learning rate <script type="math/tex">\eta = 0.1</script>, weight decay <script type="math/tex">\lambda=0</script>, batch size <script type="math/tex">S = 256</script>, for <script type="math/tex">100</script> epochs. Colored lines are empirical and black dashed lines are the theoretical predictions.</em></p>
<p>Translation dynamics. For parameters with translation symmetry, this equation implies that the sum of these parameters decays exponentially to zero at a rate proportional to the weight decay. In particular, the dynamics do not directly depend on the learning rate <script type="math/tex">\eta</script> or on any information about the dataset, due to the lack of curvature in the gradient field for these parameters (as shown in Fig. 2).</p>
<p>Scale dynamics. For parameters with scale symmetry, this equation implies that the norm for these parameters is the sum of an exponentially decaying memory of the norm at initialization and an exponentially weighted integral of gradient norms accumulated through training. In contrast to the translation dynamics, the scale dynamics do depend on the data through the gradient norms accumulated throughout training.</p>
<p>Rescale dynamics. For parameters with rescale symmetry, this equation is the sum of an exponentially decaying memory of the difference in norms at initialization and an exponentially weighted integral of difference in gradient norms accumulated through training. Similar to the scale dynamics, the rescale dynamics do depend on the data through the gradient norms; however, unlike the scale dynamics, we have no guarantee that the integral term is always positive.</p>
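<p>The translation dynamics are simple enough to check on a toy problem. The sketch below is our own illustration with hypothetical hyperparameters: it runs SGD with weight decay on softmax logits with a random target class at every step, and compares the parameter sum against the predicted exponential decay, using continuous time equal to the learning rate times the step index.</p>

```python
import numpy as np


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def run_sgd(theta0, lr, weight_decay, n_steps, seed=0):
    """SGD on a softmax cross-entropy with an L2 penalty; returns the parameter sum per step."""
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    sums = [theta.sum()]
    for _ in range(n_steps):
        target = rng.integers(theta.size)  # random "data" at every step
        grad = softmax(theta)
        grad[target] -= 1.0                # cross-entropy gradient; its entries sum to zero
        theta -= lr * (grad + weight_decay * theta)
        sums.append(theta.sum())
    return np.array(sums)


lr, weight_decay, n_steps = 0.01, 0.5, 1000  # arbitrary values for illustration
theta0 = np.array([2.0, -1.0, 0.5, 1.5])
sums = run_sgd(theta0, lr, weight_decay, n_steps)

t = lr * np.arange(n_steps + 1)
predicted = np.exp(-weight_decay * t) * theta0.sum()
max_gap = np.max(np.abs(sums - predicted))
```

<p>Because the data-dependent gradient is orthogonal to the all-ones vector, the random targets have no effect on the sum. The small remaining gap comes only from comparing the discrete decay factor <code>(1 - lr * weight_decay)</code> per step with its continuous counterpart.</p>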
<h2 id="conclusion">Conclusion</h2>
<p>Despite being the central guiding principle in the exploration of the physical world, symmetry has been underutilized in understanding the mechanics of neural networks. In this paper, we constructed a unifying theoretical framework harnessing the geometric properties of symmetry and realistic continuous equations for SGD that model weight decay, momentum, stochasticity, and discretization. We use this framework to derive exact dynamics for meaningful combinations of parameters, which we experimentally verified on large scale neural networks and datasets. Overall, our work provides a first step towards understanding the mechanics of learning in neural networks without unrealistic simplifying assumptions.</p>
<p>For more details check out our ICLR <a href="https://openreview.net/forum?id=q8qLAbQBupm">paper</a> or this seminar <a href="http://www.physicsmeetsml.org/posts/sem_2021_02_24/">presentation</a>!</p>
<h3 id="acknowledgments">Acknowledgments</h3>
<p>We would like to thank our collaborator <a href="https://www.javiersagastuy.com/">Javier Sagastuy-Brena</a> and advisors <a href="https://profiles.stanford.edu/surya-ganguli">Surya Ganguli</a> and <a href="https://web.stanford.edu/~yamins/">Daniel Yamins</a>.
We would also like to thank <a href="https://web.stanford.edu/~meghas/">Megha Srivastava</a> for very helpful feedback on this post.</p>
<div class="footnotes">
<ol>
<li id="fn:saad1995dynamics">
<p>David Saad and Sara Solla. Dynamics of on-line gradient descent learning for multilayer neural networks. Advances in neural information processing systems, 8:302–308, 1995. <a href="#fnref:saad1995dynamics" class="reversefootnote">↩</a></p>
</li>
<li id="fn:saxe2013exact">
<p>Andrew M Saxe, James L McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proc. Natl. Acad. Sci. U. S. A., May 2019. <a href="#fnref:saxe2013exact" class="reversefootnote">↩</a></p>
</li>
<li id="fn:jacot2018neural">
<p>Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018. <a href="#fnref:jacot2018neural" class="reversefootnote">↩</a></p>
</li>
<li id="fn:warming1974modified">
<p>RF Warming and BJ Hyett. The modified equation approach to the stability and accuracy analysis of finite-difference methods. Journal of computational physics, 14(2):159–179, 1974. <a href="#fnref:warming1974modified" class="reversefootnote">↩</a></p>
</li>
<li id="fn:barrett2020implicit">
<p>David GT Barrett and Benoit Dherin. Implicit gradient regularization. arXiv preprint arXiv:2009.11162, 2020. <a href="#fnref:barrett2020implicit" class="reversefootnote">↩</a></p>
</li>
<li id="fn:kovachki2019analysis">
<p>Nikola B Kovachki and Andrew M Stuart. Analysis of momentum methods. arXiv preprint arXiv:1906.04285, 2019. <a href="#fnref:kovachki2019analysis" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Thu, 25 Feb 2021 00:00:00 -0800Do Language Models Know How Heavy an Elephant Is?
/blog/scalar-probing/
/blog/scalar-probing/<p>How heavy is an elephant? How expensive is a wedding ring?</p>
<p>Humans have a pretty good sense of <em>scale</em>, or reasonable ranges of these
<em>numeric attributes</em>, of different objects, but do pre-trained language
representations? Although pre-trained Language Models (LMs) like
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1810.04805&sa=D&source=editors&ust=1613552260369000&usg=AOvVaw2sJUKWCZGDMLa3LWoqOEZ7">BERT</a> have
shown a remarkable ability to learn all kinds of knowledge, including
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1909.01066&sa=D&source=editors&ust=1613552260369000&usg=AOvVaw27gyPje50D9HeU8ZaY_8VY">factual
knowledge</a>,
it remains unclear whether their representations can capture these types
of numeric attributes from text alone without explicit training data.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 500px" src="/blog/assets/img/posts/2021-02-17-scalar-probing/image1.png" /></p>
</div></figure>
<p>In our <a href="https://www.google.com/url?q=https://arxiv.org/abs/2010.05345&sa=D&source=editors&ust=1613552260370000&usg=AOvVaw2jns7eFEtJBLkPB-VDrx6F">recent
paper</a>,
we measure the amount of scale information that is captured in several
kinds of pre-trained text representations and show that, although
generally a <strong>significant amount</strong> of such information is captured, there is
still a <strong>large gap</strong> between their current performance and the theoretical
upper bound. We identify that specifically those text representations
that are <strong>contextual</strong> and <strong>good at numerical reasoning</strong> capture scale
better. We also come up with a <strong>new version of BERT</strong>, called <em>NumBERT</em>, with
improved numerical reasoning by <strong>replacing numbers in the pretraining
text corpus with their scientific notation</strong>, which more readily exposes
the magnitude to the model, and demonstrate that NumBERT representations
capture scale significantly better than all those previous text
representations.</p>
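<p>To give a feel for this preprocessing step, here is a minimal sketch that rewrites every number in a string in scientific notation. The output format (Python's <code>e</code> notation with a fixed number of digits) and the function name are assumptions made for illustration; this post does not specify the exact token format used to pretrain NumBERT.</p>

```python
import re

# Matches plain integers and decimals; thousands separators such as "1,200"
# are not handled in this sketch.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def to_scientific(text: str, digits: int = 2) -> str:
    """Replace each number in text with its scientific notation (hypothetical format)."""
    return NUMBER.sub(lambda m: f"{float(m.group()):.{digits}e}", text)


example = to_scientific("An elephant weighs 5400 kg")
```

<p>After the rewrite, the order of magnitude appears as an explicit exponent instead of being implied by the number of digits.</p>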
<h1 id="scalar-probing">Scalar Probing</h1>
<p>In order to understand to what extent pre-trained text representations, like
BERT representations, capture scale information, we propose a task
called <em>scalar probing</em>: probing the ability to predict a
<em>distribution</em> over values of a scalar attribute of an object. In this
work, we focus specifically on three kinds of scalar attributes: weight,
length, and price.</p>
<p>Here is the basic architecture of our scalar probing task:</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 900px" src="/blog/assets/img/posts/2021-02-17-scalar-probing/image2.png" /></p>
</div></figure>
<p>In this example, we are trying to see whether the representation of
“dog” extracted by a pre-trained encoder can be used to predict/recover
the distribution of the weight of a dog through a linear model. We probe
three baseline language representations:
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1301.3781&sa=D&source=editors&ust=1613552260374000&usg=AOvVaw08p9HhtI6FTvvcqpFd5NDn">Word2vec</a>,
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1802.05365&sa=D&source=editors&ust=1613552260374000&usg=AOvVaw1ngIpQf6a40ItFoq0MM78w">ELMo</a>,
and
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1810.04805&sa=D&source=editors&ust=1613552260374000&usg=AOvVaw1BGbokiyXp_QdBgvlV6B2J">BERT</a>.
Since the latter two are contextual representations that operate on
sentences instead of words, we feed in sentences constructed using fixed
templates. For example, for weight, we use the template “The X is
heavy”, where X is the object of interest.</p>
<p>We explore the kind of probe that predicts a <em>point estimate</em> of the value
and the kind that predicts the <em>full distribution</em>. For predicting a point
estimate, we use a standard linear <strong>R</strong>e<strong>GR</strong>ession (which we denote as “<strong>rgr</strong>”)
trained to predict the log of the median of all values for each object
for the scale attribute under consideration. We predict the log because,
again, we care about the general scale rather than the exact value. The
loss is calculated using the prediction and the log of the median of the
ground-truth distribution. For predicting the full distribution, we use
a linear softmax <strong>M</strong>ulti-<strong>C</strong>lass <strong>C</strong>lassifier (denoted “<strong>mcc</strong>”) that produces a
categorical distribution over the 12 orders of magnitude. The
categorical distribution predicted using the NumBERT (our improved
version of BERT; will be introduced <a href="#numbert">later</a>) representations is shown as
the orange histogram in the above example.</p>
<p>The ground-truth distributions we use come from the <a href="https://www.google.com/url?q=https://arxiv.org/abs/1906.01327&sa=D&source=editors&ust=1613552260377000&usg=AOvVaw3IFP_sUANrnAsBdvZRbBJV">Distributions over
Quantities</a> (DoQ)
dataset which consists of <em>empirical counts</em> of scalar attribute values
associated with >350K nouns, adjectives, and verbs over 10 different
attributes, <em>automatically extracted</em> from a large web text corpus. Note
that during the construction of the dataset, all units for a certain
attribute are first unified to a single one (e.g.
centimeter/meter/kilometer -> meter) and the numeric values are scaled
accordingly. We convert the collected counts for each object-attribute
pair in DoQ into a <em>categorical distribution over 12 orders of magnitude</em>.
In the above example of the weight of a dog, the ground-truth
distribution is shown as the grey histogram, which is concentrated
around 10-100kg.</p>
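<p>As a rough illustration of this conversion, the sketch below buckets raw counts by order of magnitude and normalizes them. The function name, the lowest exponent <code>min_exp</code>, and the example counts are assumptions made for illustration; the actual bucket range is not stated here.</p>

```python
import math

def magnitude_distribution(value_counts, min_exp=-3, n_buckets=12):
    """Turn {value: count} into a categorical distribution over powers of ten.

    Bucket k covers [10**(min_exp + k), 10**(min_exp + k + 1)).
    min_exp is a hypothetical choice, not taken from the post.
    """
    hist = [0.0] * n_buckets
    for value, count in value_counts.items():
        if value <= 0:
            continue  # log10 is undefined for non-positive values
        k = math.floor(math.log10(value)) - min_exp
        k = min(max(k, 0), n_buckets - 1)  # clamp values outside the range
        hist[k] += count
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Made-up counts of dog weights in kilograms, purely illustrative.
dog_weight_kg = {8: 2, 25: 5, 40: 3}
dist = magnitude_distribution(dog_weight_kg)
```

<p>With these made-up counts, most of the probability mass lands in the 10-100 bucket, mirroring the grey histogram described above.</p>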
<p><strong>The better the predictive performance is across all the object-attribute
pairs we are dealing with, the better the pre-trained representations
encode the corresponding scale information.</strong></p>
<h1 id="numbert"><a name="numbert"></a>NumBERT</h1>
<p>Before looking at the scalar probing results of these different language
representations, let’s also think about what kind of representations might
be good at capturing scale information and how to improve existing LMs
to capture scale better. All of these models are trained using large
online text corpora like Wikipedia, news, etc. How can their
representations pick up scale information from all this text?</p>
<p>Here is a piece of text from the first document I got when I searched on
Google “elephant weight”:</p>
<blockquote>
<p>“…African elephants can range from 5,000 pounds to more than 14,000 pounds (6,350 kilograms)…”</p>
</blockquote>
<p>So it is highly likely that <strong>the learning of scale is partly mediated by
the transfer of scale information from the numbers</strong> (here “5,000”,
“14,000”, etc.) <strong>to nouns</strong> (here “elephants”) and <strong>numeracy</strong>, i.e. the
ability to reason about numbers, <strong>is probably important for representing
scale</strong>!</p>
<p>However, <a href="https://www.google.com/url?q=https://www.aclweb.org/anthology/D19-1534/&sa=D&source=editors&ust=1613552260381000&usg=AOvVaw294kIi9L87A-KaO__fkYLk">previous
work</a> has
shown that existing pre-trained text representations, including BERT,
ELMo, and Word2Vec, are not good at reasoning over numbers. For example,
beyond the magnitude of ~500, they cannot even decode a number from its
word embedding, e.g. embedding(“710”) <script type="math/tex">\nrightarrow</script> 710. Thus, we propose to improve
the numerical reasoning abilities of these representations by replacing
every instance of a number in the LM training data with its <em>scientific
notation</em>, and re-pretraining BERT (which we call <em>NumBERT</em>). This enables
the model to more easily associate objects in the sentence directly with
the <em>magnitude</em> expressed in the <em>exponent</em>, ignoring the relatively
insignificant mantissa.
<!-- ![](images/image4.png) --></p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 900px" src="/blog/assets/img/posts/2021-02-17-scalar-probing/image4.png" /></p>
</div></figure>
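<p>The preprocessing idea can be sketched as a simple text rewrite. This is only an illustration: the regular expression, the function name, and the exact output format (two mantissa decimals, Python's <code>e</code> notation) are assumptions, not the format used to pre-train NumBERT.</p>

```python
import re

# Integers or decimals, optionally with thousands separators (e.g. "14,000").
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def to_scientific(text):
    """Rewrite every number in text in scientific notation."""
    def repl(match):
        value = float(match.group(0).replace(",", ""))
        return f"{value:.2e}"  # mantissa, then the exponent
    return NUMBER.sub(repl, text)

sentence = "African elephants can range from 5,000 pounds to more than 14,000 pounds"
converted = to_scientific(sentence)
print(converted)
```

<p>After the rewrite, the order of magnitude is visible as a separate exponent next to the noun, rather than hidden in the number of digits.</p>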
<h1 id="results">Results</h1>
<h3 id="scalar-probing-1">Scalar Probing</h3>
<!-- ![](images/image6.png) -->
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 500px" src="/blog/assets/img/posts/2021-02-17-scalar-probing/image6.png" /></p>
</div></figure>
<p>The above table shows the results of scalar probing on the DoQ data. We
use three evaluation metrics: <em>Accuracy</em>, <em>Mean Squared Error (MSE)</em>, and
<em>Earth Mover’s distance (EMD)</em>, and we do the experiments in four domains:
<em>Lengths</em>, <em>Masses</em>, <em>Prices</em> and <em>Animal Masses</em> (a subset of Masses). For MSE
and EMD, the best possible score is 0, while we compute a loose <em>upper
bound</em> of accuracy by sampling from the ground-truth distribution and
evaluating against the mode. This upper bound achieves accuracies of
0.570 for lengths, 0.537 for masses, and 0.476 for prices.</p>
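<p>For distributions over ordered buckets, Earth Mover’s distance has a simple closed form: the sum of absolute differences between the two cumulative distributions. The sketch below assumes unit distance between adjacent buckets and no further normalization; the paper’s exact convention may differ.</p>

```python
from itertools import accumulate

def emd_1d(p, q):
    """Earth Mover's distance between two distributions over the same ordered buckets."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of buckets")
    # Mass that has to cross each bucket boundary equals the CDF gap there.
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))

print(emd_1d([1, 0, 0], [0, 0, 1]))  # all mass moved across two buckets
print(emd_1d([0, 1, 0], [0, 0, 1]))  # all mass moved across one bucket
```

<p>Unlike accuracy, this metric gives partial credit to a prediction that is off by one order of magnitude compared with one that is off by several.</p>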
<p>For the <em>Aggregate</em> baseline, for each attribute, we compute the empirical
distribution over buckets across all objects in the training set, and
use that as the predicted distribution for all objects in the test set.
Compared with this baseline, we can see that the <strong>mcc</strong> probe over the best
text representations captures about <strong>half</strong> (as measured by accuracy) to <strong>a
third</strong> (by MSE and EMD) of the distance to the upper bound mentioned
above, suggesting that <strong>while a significant amount of scalar information
is available, there is a long way to go to support robust commonsense
reasoning</strong>.</p>
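<p>A minimal sketch of the <em>Aggregate</em> baseline follows. It assumes the aggregate is the plain average of the per-object bucket distributions in the training set; pooling raw counts instead would be an equally plausible reading, and the toy distributions are made up.</p>

```python
def aggregate_baseline(train_dists):
    """Average the per-object bucket distributions of the training set."""
    if not train_dists:
        raise ValueError("need at least one training distribution")
    n_buckets = len(train_dists[0])
    return [
        sum(dist[k] for dist in train_dists) / len(train_dists)
        for k in range(n_buckets)
    ]

# Two made-up training objects over three buckets.
train = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
prediction = aggregate_baseline(train)  # used unchanged for every test object
```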
<p>Specifically, <strong>NumBERT representations do consistently better than all
the others</strong> on <em>Earth Mover’s Distance</em> (EMD), which is the <em>most
robust</em> metric because of its <a href="https://www.google.com/url?q=https://ieeexplore.ieee.org/document/710701&sa=D&source=editors&ust=1613552260385000&usg=AOvVaw221Lk2TXvCNo_SHGAj7IN6">better convergence
properties</a> and
<a href="https://www.google.com/url?q=http://proceedings.mlr.press/v97/liu19b.html&sa=D&source=editors&ust=1613552260386000&usg=AOvVaw1Q1SF3K0mlfjt8HERQiUVj">robustness to adversarial perturbations of the data
distribution</a>. <strong>Word2Vec
performs significantly worse than the contextual representations</strong> – even
though the task is <em>noncontextual</em> (since we do not have different
ground-truths for an object occurring in different contexts in our
setting). Also, despite being weaker than BERT on downstream NLP tasks,
<strong>ELMo does better on scalar probing</strong>, consistent with it <a href="https://www.google.com/url?q=https://www.aclweb.org/anthology/D19-1534/&sa=D&source=editors&ust=1613552260387000&usg=AOvVaw366Vf1Or1N_arhzIwSF0a4">being better at
numeracy</a> due
to its <em>character-level tokenization</em>.</p>
<h3 id="zero-shot-transfer">Zero-shot transfer</h3>
<p>We note that DoQ is derived heuristically from web text and contains
noise. So we also evaluate probes trained on DoQ on 2 datasets
containing <em>ground truth labels</em> of scalar attributes:
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1706.03799&sa=D&source=editors&ust=1613552260388000&usg=AOvVaw1TjMr0Kp_kSo377e-Vl7KB">VerbPhysics</a> and
<a href="https://www.google.com/url?q=https://jmcauley.ucsd.edu/data/amazon/&sa=D&source=editors&ust=1613552260388000&usg=AOvVaw006j3ja6jmqXMQh2XejA0G">Amazon Price
Dataset</a>.
The first is a human-labeled dataset of relative comparisons, e.g.
(person, fox, weight, bigger). Predictions for this task are made by
comparing the point estimates for <strong>rgr</strong> and highest-scoring buckets for
<strong>mcc</strong>. The second is a dataset of empirical distributions of product
prices on Amazon. We retrained a probe on DoQ prices using 12 power-of-4
buckets to support finer-grained predictions.</p>
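<p>The comparison step for the relative task can be sketched as follows. The label names other than “bigger”, the <code>margin</code> tolerance, and the rule that equal top buckets count as “similar” are assumptions made for illustration.</p>

```python
def compare_point_estimates(log_a, log_b, margin=0.5):
    """rgr-style comparison of two predicted log-magnitudes.

    margin is a hypothetical tolerance for calling two objects similar.
    """
    if log_a - log_b > margin:
        return "bigger"
    if log_b - log_a > margin:
        return "smaller"
    return "similar"

def compare_top_buckets(dist_a, dist_b):
    """mcc-style comparison using each object's highest-scoring bucket."""
    top_a = max(range(len(dist_a)), key=dist_a.__getitem__)
    top_b = max(range(len(dist_b)), key=dist_b.__getitem__)
    if top_a > top_b:
        return "bigger"
    if top_a < top_b:
        return "smaller"
    return "similar"
```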
<!-- ![](images/image3.png)![](images/image5.png) -->
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 400px" src="/blog/assets/img/posts/2021-02-17-scalar-probing/image3.png" /></p>
<p><img class="postimage_unpadded" style="max-width: 400px" src="/blog/assets/img/posts/2021-02-17-scalar-probing/image5.png" /></p>
</div></figure>
<p>The results are shown in the tables above. On VerbPhysics (the table on
the top), <strong>rgr</strong>+NumBERT performed best, approaching the performance of
using DoQ as an oracle, though short of <a href="https://www.google.com/url?q=https://www.aclweb.org/anthology/P18-2102/&sa=D&source=editors&ust=1613552260389000&usg=AOvVaw1sQRwGz2TwHxKQUagdmqsf">specialized
models</a> for
this task. Scalar probes trained with <strong>mcc</strong> perform poorly, possibly
because a finer-grained model of predicted distribution is not useful for
the 3-class comparative task. On the Amazon Price Dataset (the table on
the bottom), which is a full distribution prediction task, <strong>mcc</strong>+NumBERT did
best on both distributional metrics. On both zero-shot transfer tasks,
<strong>NumBERT representations were the best</strong> across all configurations of
metrics/objectives, suggesting that manipulating numeric representations
of the text in the pre-training corpora can significantly improve
performance on scale prediction.</p>
<h1 id="moving-forward">Moving Forward</h1>
<p>In the work above, we introduce a new task called <em>scalar probing</em>, which
measures how much information about the numeric attributes of objects
pre-trained text representations have captured. We find that while
there is a <strong>significant amount of scale information</strong> in object
representations (half to a third of the way to the theoretical upper bound), these
models are <strong>far from achieving common sense scale understanding</strong>. We also
come up with an <strong>improved version of BERT</strong>, called <em>NumBERT</em>, whose
representations <strong>capture scale information significantly better</strong> than all
the previous ones.</p>
<p>Scalar probing opens up new exciting research directions to explore. For
example, lots of work has pre-trained large-scale <em>vision & language
models</em>, like
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1908.02265&sa=D&source=editors&ust=1613552260391000&usg=AOvVaw3-rig6UgNOniW4jV0cJEzz">ViLBERT</a> and
<a href="https://www.google.com/url?q=https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf&sa=D&source=editors&ust=1613552260392000&usg=AOvVaw0FSByZ1nvSs_nkiucRIZ4N">CLIP</a>.
Probing their representations to see how much scale information has been
captured and performing systematic comparisons between them and
representations learned by language-only models can be quite
interesting.</p>
<p>Also, models learning text representations that predict scale better can
have a <strong>great real-world impact</strong>. Consider a web query like:</p>
<blockquote>
<p>“How tall is the tallest building in the world?”</p>
</blockquote>
<p>With a common sense understanding of what a reasonable range of heights
for “building” is, we can detect errors in the current web QA system when there are mistakes in
retrieval or parsing, e.g. when a Wikipedia sentence about a building is
mistakenly parsed as being 19 miles high instead of meters.</p>
<p>Check out the paper <a href="https://www.google.com/url?q=https://arxiv.org/abs/2010.05345&sa=D&source=editors&ust=1613552260393000&usg=AOvVaw1QGVJEuhUKZ9jhfPl06j56">Do Language Embeddings Capture
Scales?</a> by
Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan
Roth.</p>
Wed, 17 Feb 2021 00:00:00 -0800Removing Spurious Features can Hurt Accuracy and Affect Groups Disproportionately
/blog/removing-spuriousfeature/
/blog/removing-spuriousfeature/<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS_CHTML"></script>
<p><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/feature.png" /></p>
<h1 id="introduction">Introduction</h1>
<p>Machine learning models are susceptible to learning irrelevant patterns.
In other words, they rely on some spurious features that we humans know
to avoid. For example, assume that you are training a model to predict
whether a comment is toxic on social media platforms. You would expect
your model to predict the same score for similar sentences with
different identity terms. For example, “some people are Muslim” and
“some people are Christian” should have the same toxicity score.
However, as shown in <sup id="fnref:dixon2018measuring"><a href="#fn:dixon2018measuring" class="footnote">1</a></sup>, training a convolutional
neural net leads to a model which assigns different toxicity scores to
the same sentences with different identity terms. Reliance on spurious
features is prevalent among many other machine learning models. For
instance, <sup id="fnref:xiao2020noise"><a href="#fn:xiao2020noise" class="footnote">2</a></sup> shows that state of the art models in object
recognition like Resnet-50 <sup id="fnref:resnet"><a href="#fn:resnet" class="footnote">3</a></sup> rely heavily on background, so
changing the background can also change their predictions.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimagehalf" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image10.png" />
<img class="postimagehalf" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image1.png" />
<em>(Left) Machine learning models assign different toxicity scores to the
same sentences with different identity terms.
(Right) Machine learning models make different predictions on the same
object against different backgrounds.</em></p>
</div></figure>
<blockquote>
<p>Machine learning models rely on spurious features such as background in an image or identity terms in a comment. Reliance on spurious features conflicts with fairness and robustness goals.</p>
</blockquote>
<p>Of course, we do not want our model to rely on such spurious features
due to fairness as well as robustness concerns. For example, a model’s
prediction should remain the same for different identity terms
(fairness); similarly its prediction should remain the same with
different backgrounds (robustness). The first instinct to remedy this
situation would be to try to remove such spurious features, for example,
by masking the identity terms in the comments or by removing the
backgrounds from the images. However, removing spurious features can
lead to drops in accuracy at test time <sup id="fnref:zemel2013learning"><a href="#fn:zemel2013learning" class="footnote">4</a></sup><sup id="fnref:wang2019balanced"><a href="#fn:wang2019balanced" class="footnote">5</a></sup>. In this
blog post, we explore the causes of such drops in accuracy.</p>
<p>There are two natural explanations for accuracy drops:</p>
<ol>
<li>Core (non-spurious) features can be noisy or not expressive enough
so that even an optimal model has to use spurious features to
achieve the best accuracy
<sup id="fnref:khani2020noise"><a href="#fn:khani2020noise" class="footnote">6</a></sup><sup id="fnref:kleinberg2019simplicity"><a href="#fn:kleinberg2019simplicity" class="footnote">7</a></sup><sup id="fnref:credit_blur"><a href="#fn:credit_blur" class="footnote">8</a></sup>.</li>
<li>Removing spurious features can corrupt the core features
<sup id="fnref:zhao2019inherent"><a href="#fn:zhao2019inherent" class="footnote">9</a></sup><sup id="fnref:credit_sport"><a href="#fn:credit_sport" class="footnote">10</a></sup>.</li>
</ol>
<p>One valid question to ask is whether removing spurious features leads to
a drop in accuracy even in the absence of these two reasons. We answer
this question affirmatively in our recent work published at the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) <sup id="fnref:paper"><a href="#fn:paper" class="footnote">11</a></sup>. Here, we explain our results.</p>
<blockquote>
<p>Removing spurious features can lead to a drop in accuracy even when spurious features are removed properly and core features exactly determine the target!</p>
</blockquote>
<figure class="figure"><div class="figure__main">
<p><img class="postimagehalf" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image14.png" />
<img class="postimagehalf" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image8.png" />
<em>(Left) When core features are not representative (blurred image), the
spurious feature (the background) provides extra information to identify
the object. (Right) Removing spurious features (gender
information) in the sport prediction task has corrupted other core
features (the weights and the bar).</em></p>
</div></figure>
<p>Before delving into our result, we note that understanding the reasons
behind the accuracy drop is crucial for mitigating such drops. Focusing
on the wrong mitigation method fails to address the accuracy drop.</p>
<blockquote>
<p>Before trying to mitigate the accuracy drop resulting from the removal of the spurious features, we must understand the reasons for the drop.</p>
</blockquote>
<table>
<thead>
<tr>
<th> </th>
<th>Previous work</th>
<th>Previous work</th>
<th>This work</th>
</tr>
</thead>
<tbody>
<tr>
<td> </td>
<td><img width="85%" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image18.png" /></td>
<td><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image19.png" /></td>
<td><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image20.png" /></td>
</tr>
<tr>
<td>Removing spurious features causes drops in accuracy because…</td>
<td>core features are noisy and not sufficiently expressive.</td>
<td>spurious features are not removed properly and thus corrupt core features.</td>
<td>a lack of training data causes spurious connections between some features and the target.</td>
</tr>
<tr>
<td>We can mitigate such drops by…</td>
<td>focusing on collecting more expressive features (e.g., high-resolution images)</td>
<td>focusing on more accurate methods for removing spurious features.</td>
<td>focusing on collecting more diverse training data. We show how to leverage unlabeled data to achieve such diversity.</td>
</tr>
</tbody>
</table>
<blockquote>
<p><img style="float: right;" width="30%" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/nut.png" /></p>
<h3 id="this-work-in-a-nutshell"><strong>This work in a nutshell:</strong></h3>
<ul>
<li>We study overparameterized models that fit training data perfectly.</li>
<li>We compare the “core model” that only uses core features (non-spurious) with the “full model” that uses both core features and spurious features.</li>
<li>Using the spurious feature, the full model can fit training data with a smaller norm.</li>
<li>In the overparameterized regime, since the number of training examples is less than the number of features, there are some directions of data variation that are not observed in the training data (unseen directions).</li>
<li>Though both models fit the training data perfectly, they have different “assumptions” for the unseen directions. This difference can lead to
<ul>
<li>Drop in accuracy</li>
<li>Affecting different test distributions (we also call them groups) disproportionately (increasing accuracy in some while decreasing accuracy in others).</li>
</ul>
</li>
</ul>
</blockquote>
<h1 id="noiseless-linear-regression">Noiseless Linear Regression</h1>
<p>Over the last few years, researchers have observed some surprising
phenomena about deep networks that conflict with classical machine
learning. For example, training models to zero training loss leads to
better generalization instead of overfitting <sup id="fnref:double_descent"><a href="#fn:double_descent" class="footnote">12</a></sup>. A line
of work <sup id="fnref:montanari"><a href="#fn:montanari" class="footnote">13</a></sup><sup id="fnref:aditi_michael"><a href="#fn:aditi_michael" class="footnote">14</a></sup> found that these unintuitive
results happen even for simple models such as linear regression if the
number of features is greater than the number of training examples, known
as the overparameterized regime.</p>
<p>Accuracy drops due to the removal of spurious features are also
unintuitive. Classical machine learning tells us that removing spurious
features should decrease generalization error (since these features are,
by definition, irrelevant for the task). Analogous to the mentioned
work, we will explain this unintuitive result in overparameterized
linear regression as well. </p>
<blockquote>
<p>Accuracy drop due to removal of the spurious feature can be explained in overparameterized linear regression.</p>
</blockquote>
<p>Let’s first formalize the noiseless linear regression setup. Recall
that we are going to study a setup in which the target is completely
determined by the core features, and the spurious feature is a single
feature that can be removed perfectly without affecting predictive
performance. Formally, we assume there are \(d\) core features
\(z \in \mathbb{R}^d\) that determine the target \(y \in
\mathbb{R}\) perfectly, i.e., \( y = {\theta^\star}^\top z\).
In addition, we assume there is a single spurious feature \(s\) that
can also be determined by the core features \(s =
{\beta^\star}^\top z\). Note that the spurious feature can have
information about features that determine the target or it can be
completely unrelated to the target (i.e., for all \(i\),
\(\beta^\star_i \theta^\star_i=0\)).</p>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image13.png" />
<em>We consider a setup where target (\(y\)) is a deterministic function
of core features (\(z\)). In addition, there is a spurious feature
(\(s\)) that can also be determined by the core feature. We compare
two models, the core model that only uses \(z\) to predict \(y\) and the full model which uses both \(z\) and \(s\) to predict
\(y\).</em></p>
<p>We consider two models:</p>
<ul>
<li>Core model that only uses the core features \(z\) to predict the
target \(y\), and it is parametrized by
\({\theta^\text{-s}}\). For a data point with core features
\(z\), its prediction is \(\hat y =
{\theta^\text{-s}}^\top z\).</li>
<li>Full model that uses the core features \(z\) and also uses the
spurious feature \(s\), and it is parametrized by
\({\theta^\text{+s}}\) and \(w\). For a data point with
core feature \(z\) and a spurious feature \(s\), its
prediction is \(\hat y = {\theta^\text{+s}}^\top z + ws\).</li>
</ul>
<p>In this setup, the two natural causes of an accuracy drop after
removing the spurious feature (depicted in the table
above) do not apply:</p>
<ol>
<li>The spurious feature \(s\) adds no information about the target
\(y\) beyond what already exists in the core features
\(z\) (reason 1),</li>
<li>Removing \(s\) does not corrupt \(z\) (reason 2).</li>
</ol>
<p>Motivated by recent work in deep learning, which speculates that
gradient descent converges to the minimum-norm solution that fits
training data perfectly <sup id="fnref:gunasekar2017implicit"><a href="#fn:gunasekar2017implicit" class="footnote">15</a></sup>, we consider the
minimum-norm solution. </p>
<ul>
<li>Training data: We assume we have \(n < d\) triples of
\((z_i, s_i, y_i)\)</li>
<li>Test data: We assume core features in the test data are from a
distribution with covariance matrix \(\Sigma =
\mathbb{E}[zz^\top]\) (we use group and test data distribution
interchangeably).</li>
</ul>
<p>In this simple setting, one might conjecture that removing the spurious
feature should only help accuracy. However, we show that this is not
always the case. We exactly characterize the test distributions that are
negatively affected by removing spurious features, as well as the ones
that are positively affected by it.</p>
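<p>The setup can be simulated in a few lines. The sketch below assumes NumPy, uses the pseudoinverse to obtain the minimum-norm interpolating solution, and draws random parameters and data purely for illustration; the function and variable names are not from the paper.</p>

```python
import numpy as np

def min_norm_fits(Z, s, y):
    """Return (theta_core, theta_full, w), the minimum-L2-norm interpolators.

    Z: (n, d) core features, s: (n,) spurious feature, y: (n,) targets.
    """
    theta_core = np.linalg.pinv(Z) @ y           # core model uses z only
    X_full = np.column_stack([Z, s])             # full model uses z and s
    full = np.linalg.pinv(X_full) @ y
    return theta_core, full[:-1], full[-1]

rng = np.random.default_rng(0)
n, d = 5, 20                                     # n < d: overparameterized
theta_star = rng.normal(size=d)                  # illustrative true parameters
beta_star = rng.normal(size=d)
Z = rng.normal(size=(n, d))
y = Z @ theta_star                               # noiseless targets
s = Z @ beta_star                                # spurious feature

theta_core, theta_full, w = min_norm_fits(Z, s, y)
core_norm = theta_core @ theta_core
full_norm = theta_full @ theta_full + w ** 2
```

<p>Both models fit the training data exactly, and the full model’s norm is never larger than the core model’s, because the core solution with \(w = 0\) is one of the candidates the full model can choose from.</p>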
<h1 id="example">Example</h1>
<p>Let’s first look at a simple example with only one training example and
three core features (\(z_1, z_2\) and \(z_3\)). Let the true
parameters \(\theta^\star =[2,2,2]^\top\) which results in
\(y=2\), and let the spurious feature parameter \({\beta^\star}
= [1,2,-2]^\top\) which results in \(s=1\).</p>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image11_1.png" /></p>
<p>First, note that the smallest L2-norm vector that can fit the training
data for the core model is \({\theta^\text{-s}}=[2,0,0]\). On
the other hand, in the presence of the spurious feature, the full model
can fit the training data perfectly with a smaller norm by assigning
weight \(1\) for the feature \(s\)
(\(\|{\theta^\text{-s}}\|_2^2 = 4\) while
\(\|{\theta^\text{+s}}\|_2^2 + w^2 = 2 < 4\)).</p>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image11_2.png" /></p>
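<p>These numbers can be checked numerically. The sketch assumes the single training point is \(z = [1, 0, 0]\), which is inferred from \(y = 2\), \(s = 1\) and the fact that only the first feature is observed; NumPy’s pseudoinverse returns the minimum-norm solution.</p>

```python
import numpy as np

theta_star = np.array([2.0, 2.0, 2.0])
beta_star = np.array([1.0, 2.0, -2.0])
Z = np.array([[1.0, 0.0, 0.0]])     # assumed single training point
y = Z @ theta_star                  # target of the training point
s = Z @ beta_star                   # spurious feature of the training point

# Core model: minimum-norm weights on z alone.
theta_core = np.linalg.pinv(Z) @ y

# Full model: minimum-norm weights on [z, s].
full = np.linalg.pinv(np.column_stack([Z, s])) @ y
theta_full, w = full[:3], full[3]

print(theta_core, theta_core @ theta_core)
print(theta_full, w, theta_full @ theta_full + w ** 2)
```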
<p>Generally, in the overparameterized regime, since the number of training
examples is less than the number of features, there are some directions
of data variation that are not observed in the training data. In this
example, we do not observe any information about the second and third
features. The core model assigns weight \(0\) to the unseen
directions (weight \(0\) for the second and third features in this
example). However, the non-zero weight for the spurious feature leads to
a different assumption for the unseen directions. In particular, the
full model does not assign weight \(0\) to the unseen directions.
Indeed, by substituting \(s\) with \({\beta^\star}^\top
z\), we can view the full model as not using \(s\) but
implicitly assigning weight \(\beta^\star_2=2\) to the second
feature and \(\beta^\star_3=-2\) to the third feature (unseen
directions at training).</p>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image11_3.png" /></p>
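<p>The substitution can be made concrete with a short sketch. It takes the two fitted models of the example as given, folds the weight on \(s\) into weights on \(z\), and evaluates the squared error of each model on test points that deviate along a single unseen coordinate. The helper name and the two test points are illustrative choices.</p>

```python
import numpy as np

theta_star = np.array([2.0, 2.0, 2.0])
beta_star = np.array([1.0, 2.0, -2.0])
theta_core = np.array([2.0, 0.0, 0.0])           # core model from the example
theta_full, w = np.array([1.0, 0.0, 0.0]), 1.0   # full model, weight 1 on s

# Substituting s = beta_star^T z turns the full model into weights on z only.
implicit_full = theta_full + w * beta_star

def squared_error(weights, z):
    """Squared prediction error on a noiseless test point z."""
    return float(((weights - theta_star) @ z) ** 2)

z_second = np.array([0.0, 1.0, 0.0])   # deviation along the second feature
z_third = np.array([0.0, 0.0, 1.0])    # deviation along the third feature
for z in (z_second, z_third):
    print(z, squared_error(theta_core, z), squared_error(implicit_full, z))
```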
<p>Let’s now look at different examples and the prediction of these two
models:</p>
<p><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image7.png" /></p>
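<p>In this example, removing \(s\) increases the error for a test
distribution with high deviations from zero on the second feature (the full model’s implicit weight \(2\) matches \(\theta^\star_2 = 2\), while the core model uses \(0\)),
whereas removing \(s\) reduces the error for a test distribution
with high deviations from zero on the third feature (the implicit weight \(-2\) is further from \(\theta^\star_3 = 2\) than \(0\) is).</p>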
<h1 id="main-result">Main result</h1>
<p>As we saw in the previous example, by using the spurious feature, the
full model incorporates \({\beta^\star}\) into its estimate. The
true target parameter (\(\theta^\star\)) and the true spurious
feature parameters (\({\beta^\star}\)) agree on some of the
unseen directions and do not agree on the others. Thus, depending on
which unseen directions are weighted heavily at test time, removing
\(s\) can increase or decrease the error.</p>
<p>More formally, the weight assigned to the spurious feature is
proportional to the projection of \(\theta^\star\) on
\({\beta^\star}\) on the seen directions. If this number is close
to the projection of \(\theta^\star\) on \({\beta^\star}\)
on the unseen directions (in comparison to 0), removing \(s\)
increases the error, and it decreases the error otherwise. Note that
since we are assuming noiseless linear regression and choose models that
fit training data, the model predicts perfectly in the seen directions
and only variations in unseen directions contribute to the error.</p>
<p><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image6.png" />
<em>(Left) The projection of \(\theta^\star\) on
\(\beta^\star\) is positive in the seen direction, but it is
negative in the unseen direction; thus, removing \(s\) decreases the
error. (Right) The projection of \(\theta^\star\) on
\(\beta^\star\) is similar in both seen and unseen directions;
thus, removing \(s\) increases the error.</em></p>
<blockquote>
<p>The drop in accuracy at test time depends on the relationship between the true target parameter (\(\theta^\star\)) and the true spurious feature parameters (\({\beta^\star}\)) in the seen and unseen directions.</p>
</blockquote>
<p>Let’s now formalize the conditions under which removing the spurious
feature (\(s\)) increases the error. Let \(\Pi =
Z^\top(ZZ^\top)^{-1}Z\) denote the projection onto the span of the training data (seen
directions), where the rows of \(Z \in \mathbb{R}^{n \times d}\) are the core features of the training examples; thus \(I-\Pi\) denotes the projection onto the null space of the training data
(unseen directions). The equation below determines when removing the
spurious feature decreases the error.</p>
<p><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image9.png" />
<em>The left side is the difference between the projection of \(\theta^\star\) on \(\beta^\star\) in the seen direction
with their projection in the unseen direction scaled by test time
covariance. The right side is the difference between 0 (i.e., not using
spurious features) and the projection of \(\theta^\star\) on
\(\beta^\star\) in the unseen direction scaled by test time
covariance. Removing \(s\) helps if the left side is greater than
the right side.</em></p>
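<p>The seen and unseen projections are easy to compute directly. The sketch below assumes \(Z\) is an \(n \times d\) matrix with the training points as rows and full row rank, so that the projector onto the seen directions is \(Z^\top(ZZ^\top)^{-1}Z\). It also evaluates the core model’s error, which follows from the minimum-norm core solution being \(\Pi\theta^\star\); the function names are illustrative.</p>

```python
import numpy as np

def seen_unseen_projectors(Z):
    """Projectors onto the span of the rows of Z and onto its null space."""
    Pi = Z.T @ np.linalg.inv(Z @ Z.T) @ Z
    return Pi, np.eye(Z.shape[1]) - Pi

def core_model_error(theta_star, Z, Sigma):
    """Expected squared error of the min-norm core model under test covariance Sigma."""
    _, unseen = seen_unseen_projectors(Z)
    gap = unseen @ theta_star        # part of theta_star the core model misses
    return float(gap @ Sigma @ gap)

# The one-point example from earlier in the post.
Z = np.array([[1.0, 0.0, 0.0]])
theta_star = np.array([2.0, 2.0, 2.0])
Pi, unseen = seen_unseen_projectors(Z)
print(Pi @ theta_star)                               # seen part of theta_star
print(core_model_error(theta_star, Z, np.eye(3)))    # isotropic test covariance
```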
<h1 id="experiments">Experiments</h1>
<p>While the theory applies only to linear models, we now show that in
non-linear models trained on real-world datasets, removing a spurious
feature reduces the accuracy and affects groups disproportionately.</p>
<p>Datasets. We are going to study the CelebA dataset <sup id="fnref:liu2015"><a href="#fn:liu2015" class="footnote">16</a></sup> which
contains photos of celebrities along with 40 different attributes.
(See our paper for the results on the
comment-toxicity-detection and MNIST datasets.) We choose wearing
lipstick (indicating if a celebrity is wearing lipstick) as the target
and wearing earrings (indicating if a celebrity is wearing earrings) as
the spurious feature. </p>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image5.png" /></p>
<p>Note that although wearing earrings is correlated with wearing lipstick,
we expect our model to not change its prediction if we tell the model
the person is wearing earrings.</p>
<p><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image3.png" /></p>
<p>In the CelebA dataset, wearing earrings is correlated with wearing
lipstick. If a celebrity wears earrings, they are almost
five times more likely to wear lipstick than not. Similarly, if a celebrity does not wear earrings, they are
almost twice as likely not to wear lipstick as to wear it.</p>
<p>Setup. We train a two-layer neural network with 128 hidden units. We
flatten the picture and concatenate the binary variable of wearing
earrings to it (we tuned a multiplier for it). We also want to know how
much each model relies on the spurious feature. In other words, we want
to know how much the model prediction changes as we change the wearing
earrings variable. We call this attacking the model (i.e., swapping the
value of the binary feature of wearing earrings). We run each experiment
50 times and report the average.</p>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image12.png" /></p>
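<p>The attack can be sketched independently of the network. The code below assumes the spurious feature is the last input column and takes the values 0 and <code>on_value</code> (the tuned multiplier); <code>predict</code> stands for any trained classifier. The names and the toy classifier are illustrative and are not the model used in the experiments.</p>

```python
import numpy as np

def attack_spurious(X, col=-1, on_value=1.0):
    """Return a copy of X with the binary spurious feature swapped (0 <-> on_value)."""
    X_attacked = X.copy()
    X_attacked[:, col] = on_value - X_attacked[:, col]
    return X_attacked

def accuracy(predict, X, y):
    """Fraction of examples on which predict(X) matches the labels y."""
    return float(np.mean(predict(X) == y))

def accuracy_drop_under_attack(predict, X, y, col=-1, on_value=1.0):
    """Clean accuracy minus accuracy after swapping the spurious feature."""
    attacked = attack_spurious(X, col, on_value)
    return accuracy(predict, X, y) - accuracy(predict, attacked, y)

# Toy classifier that relies entirely on the spurious column.
toy_predict = lambda X: (X[:, -1] > 0.5).astype(int)
X_toy = np.array([[0.2, 1.0], [0.7, 0.0], [0.1, 1.0]])
y_toy = np.array([1, 0, 1])
drop = accuracy_drop_under_attack(toy_predict, X_toy, y_toy)
```

<p>A model that ignores the spurious column, such as the core model, has a drop of zero by construction.</p>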
<p>Results. The diagram below shows the accuracy of the different models, and
their accuracies when they are attacked. Note that, because our attack
focuses on the spurious feature, the core model’s accuracy will remain
the same.</p>
<p><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image16.png" /></p>
<p>Removal of the wearing-earrings feature decreases the overall accuracy. The
change in accuracy is not uniform across groups. The
accuracy decreases in the group where people wear neither
lipstick nor earrings and in the group where they wear both lipstick and
earrings. On the other hand, accuracy increases for the groups that
wear only one of them.</p>
<p>Let’s break down the diagram and analyze each section.</p>
<table>
<tbody>
<tr>
<td><img width="2000" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image4.png" /></td>
<td>All celebrities together: the overall accuracy is a reasonable 82%. It drops by 1% when we remove the spurious feature (core model accuracy). The full model relies heavily on the spurious feature, so attacking the full model leads to a ~17% drop in overall accuracy.</td>
</tr>
<tr>
<td><img width="2000" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image2.png" /></td>
<td>The celebrities who follow the stereotype (people who wear neither earrings nor lipstick, and people who wear both) have good accuracy overall (both above 85%). The accuracy of both groups drops when we remove the wearing-earrings feature (i.e., core model accuracy). Using the spurious feature helps their accuracy, so attacking the full model leads to a ~30% drop in their accuracy.</td>
</tr>
<tr>
<td><img width="2000" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image15.png" /></td>
<td>The celebrities who do not follow the stereotype have very low accuracy; it is especially low for people who only wear earrings (33% accuracy compared with the average of 85%). Removing the wearing-earrings feature increases their accuracy substantially. Using the spurious feature does not help their accuracy, so attacking the full model does not change accuracy for these groups.</td>
</tr>
</tbody>
</table>
<blockquote>
<p> In non-linear models trained on real-world datasets, removing a spurious feature reduces the accuracy and affects groups disproportionately.</p>
</blockquote>
<h1 id="qa-other-results">Q&A (Other results):</h1>
<p><strong>I know about my problem setting, and I am certain that disjoint features
determine the target and the spurious feature (i.e., for all \(i\),
\(\theta^\star_i\beta^\star_i=0\)). Can I be sure that my
model will not rely on the spurious feature, and removing the spurious
feature definitely reduces the error?</strong> No! Actually, for any
\(\theta^\star\) and \({\beta^\star}\), we can construct a
training set and two test sets with \(\theta^\star\) and
\({\beta^\star}\) as the true parameters and the spurious feature
parameter, such that removing the spurious feature reduces the error in
one but increases the error in the other one (see Corollary 1 in our
paper).</p>
<p><strong>I am collecting a balanced dataset such that the spurious feature and
the target are completely independent (i.e., \(p[y,s]= p[y]p[s]\)).
Can I be sure that my model will not rely on the spurious feature, and
removing the spurious feature definitely reduces the error?</strong>
No! For any
\(S \in \mathbb{R}^n\) and \(Y \in \mathbb{R}^n\), we can
generate a training set and two test sets with \(S\) and \(Y\)
as their spurious feature and targets, respectively, such that removing
the spurious feature reduces the error in one but increases the error in
the other (see Corollary 2 in our paper).</p>
<p><strong>What happens when we have many spurious features?</strong> Good question! Let’s
say \(s_1\) and \(s_2\) are two spurious features. We show
that:</p>
<ol>
<li>Removing \(s_1\) makes the model more sensitive to
\(s_2\), and</li>
<li>If a group has high error because of the new assumption about unseen
directions enforced by using \(s_2\), then it will have an even
higher error after removing \(s_1\)
(see Proposition 3 in our paper).</li>
</ol>
<p><strong>Is it possible to have the same model (a model with the same assumptions
on unseen directions as the full model) without relying on the spurious
feature (i.e., be robust against the spurious feature)?</strong> Yes! You can
recover the same model as the full model without relying on the spurious
feature via robust self-training and unlabeled data (See Proposition 4).</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this work, we first showed that overparameterized models are
incentivized to use spurious features in order to fit the training data
with a smaller norm. Then we demonstrated how removing these spurious
features altered the model’s assumption on unseen directions.
Theoretically and empirically, we showed that this change could hurt the
overall accuracy and affect groups disproportionately. We also proved
that robustness against spurious features (or error reduction by
removing the spurious features) cannot be guaranteed under any condition
of the target and spurious feature. Consequently, balanced datasets do
not guarantee a robust model and practitioners should consider other
features as well. Studying the effect of removing noisy spurious
features is an interesting future direction.</p>
<h1 id="acknowledgement">Acknowledgement</h1>
<p>I would like to thank Percy Liang, Jacob Schreiber and Megha Srivastava for their useful comments. The images in the introduction are from <sup id="fnref:xiao2020noise2"><a href="#fn:xiao2020noise2" class="footnote">17</a></sup><sup id="fnref:credit_gay_straight"><a href="#fn:credit_gay_straight" class="footnote">18</a></sup> <sup id="fnref:credit_blur2"><a href="#fn:credit_blur2" class="footnote">19</a></sup><sup id="fnref:credit_sport2"><a href="#fn:credit_sport2" class="footnote">20</a></sup>.</p>
<div class="footnotes">
<ol>
<li id="fn:dixon2018measuring">
<p>Dixon, Lucas, et al. “Measuring and mitigating unintended bias in text classification.” Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. 2018. <a href="#fnref:dixon2018measuring" class="reversefootnote">↩</a></p>
</li>
<li id="fn:xiao2020noise">
<p>Xiao, Kai, et al. “Noise or signal: The role of image backgrounds in object recognition.” arXiv preprint arXiv:2006.09994 (2020). <a href="#fnref:xiao2020noise" class="reversefootnote">↩</a></p>
</li>
<li id="fn:resnet">
<p>He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. <a href="#fnref:resnet" class="reversefootnote">↩</a></p>
</li>
<li id="fn:zemel2013learning">
<p>Zemel, Rich, et al. “Learning fair representations.” International Conference on Machine Learning. 2013. <a href="#fnref:zemel2013learning" class="reversefootnote">↩</a></p>
</li>
<li id="fn:wang2019balanced">
<p>Wang, Tianlu, et al. “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. <a href="#fnref:wang2019balanced" class="reversefootnote">↩</a></p>
</li>
<li id="fn:khani2020noise">
<p>Khani, Fereshte, and Percy Liang. “Feature Noise Induces Loss Discrepancy Across Groups.” International Conference on Machine Learning. PMLR, 2020. <a href="#fnref:khani2020noise" class="reversefootnote">↩</a></p>
</li>
<li id="fn:kleinberg2019simplicity">
<p>Kleinberg, Jon, and Sendhil Mullainathan. “Simplicity creates inequity: implications for fairness, stereotypes, and interpretability.” Proceedings of the 2019 ACM Conference on Economics and Computation. 2019. <a href="#fnref:kleinberg2019simplicity" class="reversefootnote">↩</a></p>
</li>
<li id="fn:credit_blur">
<p>photo from Torralba, Antonio. “Contextual priming for object detection.” International journal of computer vision 53.2 (2003): 169-191. <a href="#fnref:credit_blur" class="reversefootnote">↩</a></p>
</li>
<li id="fn:zhao2019inherent">
<p>Zhao, Han, and Geoff Gordon. “Inherent tradeoffs in learning fair representations.” Advances in neural information processing systems. 2019. <a href="#fnref:zhao2019inherent" class="reversefootnote">↩</a></p>
</li>
<li id="fn:credit_sport">
<p>photo from Wang, Tianlu, et al. “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations.” Proceedings of the IEEE International Conference on Computer Vision. 2019. <a href="#fnref:credit_sport" class="reversefootnote">↩</a></p>
</li>
<li id="fn:paper">
<p>Khani, Fereshte, and Percy Liang. “Removing Spurious Features can Hurt Accuracy and Affect Groups Disproportionately.” arXiv preprint arXiv:2012.04104 (2020). <a href="#fnref:paper" class="reversefootnote">↩</a></p>
</li>
<li id="fn:double_descent">
<p>Nakkiran, Preetum, et al. “Deep double descent: Where bigger models and more data hurt.” arXiv preprint arXiv:1912.02292 (2019). <a href="#fnref:double_descent" class="reversefootnote">↩</a></p>
</li>
<li id="fn:montanari">
<p>Hastie, T., Montanari, A., Rosset, S., & Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560. <a href="#fnref:montanari" class="reversefootnote">↩</a></p>
</li>
<li id="fn:aditi_michael">
<p>Raghunathan, Aditi, et al. “Understanding and mitigating the tradeoff between robustness and accuracy.” arXiv preprint arXiv:2002.10716 (2020). <a href="#fnref:aditi_michael" class="reversefootnote">↩</a></p>
</li>
<li id="fn:gunasekar2017implicit">
<p>Gunasekar, Suriya, et al. “Implicit regularization in matrix factorization.” 2018 Information Theory and Applications Workshop (ITA). IEEE, 2018. <a href="#fnref:gunasekar2017implicit" class="reversefootnote">↩</a></p>
</li>
<li id="fn:liu2015">
<p>Liu, Ziwei, et al. “Deep learning face attributes in the wild.” Proceedings of the IEEE international conference on computer vision. 2015. <a href="#fnref:liu2015" class="reversefootnote">↩</a></p>
</li>
<li id="fn:xiao2020noise2">
<p>Xiao, Kai, et al. “Noise or signal: The role of image backgrounds in object recognition.” arXiv preprint arXiv:2006.09994 (2020). <a href="#fnref:xiao2020noise2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:credit_gay_straight">
<p>Garg, Sahaj, et al. “Counterfactual fairness in text classification through robustness.” Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. 2019. <a href="#fnref:credit_gay_straight" class="reversefootnote">↩</a></p>
</li>
<li id="fn:credit_blur2">
<p>photo from Torralba, Antonio. “Contextual priming for object detection.” International journal of computer vision 53.2 (2003): 169-191. <a href="#fnref:credit_blur2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:credit_sport2">
<p>photo from Wang, Tianlu, et al. “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations.” Proceedings of the IEEE International Conference on Computer Vision. 2019. <a href="#fnref:credit_sport2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 24 Jan 2021 00:00:00 -0800Blue People v. City of Ney
/blog/Bluepeoplevs.Neycity/
/blog/Bluepeoplevs.Neycity/<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS_CHTML"></script>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image8.jpg" /></p>
</div></figure>
<h1 id="introduction">Introduction</h1>
<p>Discriminatory behavior towards certain groups by machine learning (ML) models is especially concerning in critical applications such as hiring. This blog post explains one source of discrimination: the reliance of ML models on different groups’ data distributions. We will show that when ML models use noisy features (which are pervasive in the real world, e.g., exam scores), they’re incentivized to devalue a good candidate from a lower-performing group. This blog post is based on:</p>
<p><em>Fereshte Khani and Percy Liang, “Feature Noise Induces Loss Discrepancy
Across Groups.” International Conference on Machine Learning. PMLR, 2020</em></p>
<p>The findings are illustrated by reviewing the hiring process in the
fictitious city of Ney, where recently a group of people has accused the
government of discrimination.</p>
<h1 id="hiring-people-in-ney">Hiring people in Ney</h1>
<p>The government of Ney wants to hire qualified people. Each person in Ney has a skill level that is normally distributed with a mean \(\mu\) and a standard deviation
of \(\sigma_\text{skill}\). A person is qualified if their skill level is greater than 0 and non-qualified
otherwise. For example, Alice, with a skill level of 2, is
qualified, but Bob, with a skill level of -1, is not.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image13.png" />
<em>The skill level of the people in Ney is normally distributed with a mean of \(\mu\) and a standard deviation of \(\sigma_\text{skill}\).</em></p>
</div></figure>
<p>To assess people’s skills, the government created an exam. The exam score is a noisy indicator of the applicant’s skill since it cannot capture the true skill of a person (e.g., the same applicant would score differently on different versions of the SAT). In the city of Ney, exam noise is nice and simple: if an individual has skill \(z\), then their
score is distributed as \(\mathcal{N} (z,
\sigma_\text{noise}^2)\),
where \(\sigma_\text{noise}^2\) indicates the variance of noise
on the exam.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image11.png" />
<em>The exam score of an individual with a skill of \(z\) is a random variable normally distributed with a mean of \(z\) and a standard deviation of \(\sigma_\text{noise}\).</em></p>
</div></figure>
<p>The government wants to choose a threshold \(\tau\), and hire all
people whose exam scores are greater than \(\tau\). There are two
kinds of errors that the government can make:</p>
<ol>
<li>Not hiring a qualified person (\(z > 0 \land x \le \tau\))</li>
<li>Hiring a non-qualified person (\(z \le 0 \land x > \tau\))</li>
</ol>
<p>For simplicity, let’s assume the government cares about these two types
of errors equally and wants to minimize the overall error, i.e., the
number of non-qualified hired people plus the number of qualified
non-hired people.</p>
<script type="math/tex; mode=display">\begin{align}
\text{Error} = \mathbb{E}\left[[z>0] \neq [x > \tau]\right]\\
\end{align}</script>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image4.png" />
<em>The government’s goal is to find a cut-off threshold such that it minimizes the error.</em></p>
</div></figure>
<p>Given all exam scores and knowledge of the skill distribution of the people,
what cut-off threshold should the government use to minimize the error (the above equation)?
Is it a good strategy for the government to simply use 0 as the
threshold and hire all individuals with scores greater than zero?</p>
<p>Let’s consider an example where the skill distribution
is \(\mathcal{N}(-1,1)\), and the exam noise
has a standard deviation of \(\sigma_\text{noise}=1\). The following lines of code plot
the average error for various thresholds for this example. As
illustrated, 0 is not the best threshold to use. In fact, in this
example, a threshold of \(\tau=1\) leads to the minimum error.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image1.png" />
<em>A simple example with \(\mu=-1\) and \(\sigma_\text{skill}=\sigma_\text{noise}=1\). As shown on the right, accepting individuals with a score higher than \(0\) does not result in the minimum error.</em></p>
</div></figure>
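<p>The code in the figure is only available as an image, so here is a minimal Monte Carlo sketch of the same experiment. It is not the original code: the function names, the sample size, the seed, and the grid of thresholds are assumptions made for illustration.</p>

```python
import random

def simulate(mu, sigma_skill, sigma_noise, n, seed=0):
    """Draw n (skill, score) pairs: z ~ N(mu, sigma_skill^2), x | z ~ N(z, sigma_noise^2)."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        z = rng.gauss(mu, sigma_skill)
        x = rng.gauss(z, sigma_noise)
        people.append((z, x))
    return people

def error_rate(people, tau):
    """Fraction of people for whom 'qualified' (z > 0) and 'hired' (x > tau) disagree."""
    return sum((z > 0) != (x > tau) for z, x in people) / len(people)

people = simulate(mu=-1.0, sigma_skill=1.0, sigma_noise=1.0, n=200_000)
thresholds = [t / 2 for t in range(-4, 9)]  # -2.0, -1.5, ..., 4.0
errors = {tau: error_rate(people, tau) for tau in thresholds}
best_tau = min(errors, key=errors.get)

for tau in thresholds:
    print(f"tau={tau:+.1f}  error={errors[tau]:.4f}")
print("threshold with the lowest estimated error:", best_tau)
```

<p>On this coarse grid, the lowest estimated error is at \(\tau=1\) rather than at \(\tau=0\), in line with the figure.</p>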
<blockquote>
<blockquote>
<h4 id="the-government-wants-to-minimize-the-number-of-hired-people-with-negative-skill-levels--the-number-of-non-hired-people-with-positive-skill-levels-hiring-all-people-with-positive-exam-scores-a-noisy-indicator-of-the-skill-is-not-optimal">The government wants to minimize the number of hired people with negative skill levels + the number of non-hired people with positive skill levels. Hiring all people with positive exam scores (a noisy indicator of the skill) is not optimal.</h4>
</blockquote>
</blockquote>
<p>If 0 is not always the optimal threshold, then what is the optimal
threshold for minimizing error for different values of \(\mu,
\sigma_\text{skill}\) and \(\sigma_\text{noise}\)?
Generally, given a person’s exam score (\(x\)) and the skill level distribution (\(\mathbb{P}(z)\)), what can we infer
about their real skill (\(z\))? Here is where Bayesian inference
comes in.</p>
<h1 id="bayesian-inference-">Bayesian inference </h1>
<p>Let’s see what we can infer about a person’s skill given their exam score and knowing the skill level distribution
\(\mathbb{P} (z)\) (known as the <em>prior distribution</em> since it shows the prior over a person’s skill). Using Bayes’ rule, we can calculate \(\mathbb{P} (z|x)\) (known as the <em>posterior distribution</em> since it shows the distribution over a person’s skill after observing their score).</p>
<p>Let’s first consider two extreme cases:</p>
<ol>
<li>If the exam is completely precise
(i.e., \(\sigma_\text{noise}=0\)), then the exam score is
the exact indicator of a person’s skill (irrespective of the prior
distribution).</li>
<li>If the exam is pure noise (i.e., \(\sigma_\text{noise}
\rightarrow \infty\)), then the exam score is meaningless, and
the best estimate for a person’s skill is the average
skill \(\mu\) (irrespective of the exam score).</li>
</ol>
<p>Intuitively, when the noise variance has a value between \(0\) and \(\infty\), the best estimate of a person’s skill is a number
between their exam score (\(x\)) and the average skill
(\(\mu\)). The figure below shows the standard formulation of the
posterior distribution \(\mathbb{P} (z \mid x)\) after observing
an exam score (\(x_0\)). For more details on how to derive this
formula, see
<a href="https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf">this</a>.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image3.png" />
<em>Posterior distribution of a person’s skill after observing their exam score (\(x_0\)).</em></p>
</div></figure>
<p>Based on this formula (and as hypothesized), depending on the amount of noise, \(\mathbb{E} [z\mid x]\) is a number between \(x\) and \(\mu\).</p>
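<p>Since the formula itself appears only in the figure, the sketch below assumes the standard conjugate-Gaussian result: the posterior mean is a weighted average of \(\mu\) and \(x\) with weights \(\sigma_\text{noise}^2\) and \(\sigma_\text{skill}^2\), and the posterior variance is \(\sigma_\text{skill}^2\sigma_\text{noise}^2/(\sigma_\text{skill}^2+\sigma_\text{noise}^2)\). The function name <code>posterior</code> is hypothetical.</p>

```python
def posterior(x, mu, sigma_skill, sigma_noise):
    """Mean and variance of z | x when z ~ N(mu, sigma_skill^2) and x | z ~ N(z, sigma_noise^2)."""
    skill_var = sigma_skill ** 2
    noise_var = sigma_noise ** 2
    weight_on_score = skill_var / (skill_var + noise_var)  # 1 for a precise exam, near 0 for pure noise
    mean = (1 - weight_on_score) * mu + weight_on_score * x
    variance = skill_var * noise_var / (skill_var + noise_var)
    return mean, variance

# Running example: average skill -1, an observed score of 2, and increasing exam noise.
for sigma_noise in [0.0, 0.5, 1.0, 2.0, 10.0]:
    mean, variance = posterior(x=2.0, mu=-1.0, sigma_skill=1.0, sigma_noise=sigma_noise)
    print(f"sigma_noise={sigma_noise:5.1f}  posterior mean={mean:+.3f}  variance={variance:.3f}")
```

<p>With no noise the posterior mean equals the score, and as the noise grows the mean slides from the score toward the average skill, matching the two extreme cases above.</p>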
<blockquote>
<blockquote>
<h4 id="an-applicants-expected-skill-level-is-between-their-exam-score-and-the-average-skill-among-ney-people-if-the-exam-is-noisier-it-is-closer-to-the-average-skill-if-the-exam-is-more-precise-it-is-closer-to-the-exam-score">An applicant’s expected skill level is between their exam score and the average skill among Ney people. If the exam is noisier, it is closer to the average skill; if the exam is more precise, it is closer to the exam score.</h4>
</blockquote>
</blockquote>
<h1 id="optimal-threshold">Optimal threshold</h1>
<p>Now that we have exactly characterized the posterior distribution
(\(\mathbb{P} (z \mid x)\)), the government can find the optimal
threshold. For any exam score \(x\), if the government hires people
with score \(x\), it incurs \(\mathbb{P}(z \le 0 \mid x) \)
error (the probability of hiring a non-qualified person). On the other hand,
if it does not hire people with score \(x\), it
incurs \(\mathbb{P}(z > 0 \mid x)\) error (the probability of
not hiring a qualified person). Thus, in order to minimize the error, the
government should hire a person iff \(\mathbb{P} (z > 0 \mid x) >
\mathbb{P}(z \le 0 \mid x)\). Since the posterior distribution is
normal, and therefore symmetric around its mean, more than half of its
mass lies above 0 exactly when its mean is positive. The government
should therefore hire an applicant
iff \(\mathbb{E}[z \mid x] > 0\).</p>
<p>Using the formulation in the previous section, we have:</p>
<script type="math/tex; mode=display">\begin{align}\mu \frac{\sigma_\text{noise}^2}{\sigma_\text{noise}^2 +
\sigma_\text{skill}^2} + x
\frac{\sigma_\text{skill}^2}{\sigma_\text{skill}^2 +
\sigma_\text{noise}^2} > 0 \iff x > -\mu
\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}
\end{align}</script>
<p>Therefore, the optimal threshold is:</p>
<script type="math/tex; mode=display">\bbox[5px, border: 2px solid grey]{
\text{optimal threshold} = -\mu\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}
}</script>
<p>In our running example with average skill \(\mu=-1\)
and \(\sigma_\text{skill} = \sigma_\text{noise}=1\), the optimal threshold is 1.
The figure below shows how the optimal threshold varies according
to \(\mu\) and \(\sigma_\text{noise}\).
As \(\sigma_\text{noise}\) increases or \(\mu\) becomes more negative,
the optimal threshold moves farther away from \(0\).</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image5.png" />
<em>(left) The optimal threshold increases as the average of the prior distribution decreases (with a fixed exam noise \(\sigma_\text{noise} > 0\)). (right) The optimal threshold increases if the exam noise increases (with a fixed average skill \(\mu < 0\)). Note that, if exam scores are not noisy or the average skill is zero, then the optimal threshold is zero.</em></p>
</div></figure>
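<p>The boxed formula is easy to evaluate directly. The short sketch below (the function name <code>optimal_threshold</code> and the example values are illustrative choices) reproduces the running example and the two trends shown in the figure.</p>

```python
def optimal_threshold(mu, sigma_skill, sigma_noise):
    """Score cut-off that minimizes the error: -mu * sigma_noise^2 / sigma_skill^2."""
    return -mu * sigma_noise ** 2 / sigma_skill ** 2

# Running example from the text.
print(optimal_threshold(mu=-1.0, sigma_skill=1.0, sigma_noise=1.0))

# Fixed exam noise, decreasing average skill.
for mu in [0.0, -0.5, -1.0, -2.0]:
    print(f"mu={mu:+.1f}  threshold={optimal_threshold(mu, 1.0, 1.0):.2f}")

# Fixed negative average skill, increasing exam noise.
for sigma_noise in [0.0, 0.5, 1.0, 2.0]:
    print(f"sigma_noise={sigma_noise:.1f}  threshold={optimal_threshold(-1.0, 1.0, sigma_noise):.2f}")
```

<p>Because the noise enters as a variance, doubling \(\sigma_\text{noise}\) quadruples the distance between the threshold and 0.</p>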
<blockquote>
<blockquote>
<h4 id="as-exams-become-more-noisy-or-the-average-skill-becomes-more-negative-the-optimal-threshold-moves-further-away-from-0">As exams become more noisy or the average skill becomes more negative, the optimal threshold moves further away from 0.</h4>
</blockquote>
</blockquote>
<h1 id="what-does-machine-learning-have-to-do-with-all-of-this">What does machine learning have to do with all of this?</h1>
<p>So far, we precisely identified the optimal cut-off threshold given the
exact knowledge of \(\mu, \sigma_\text{skill}\),
and \(\sigma_\text{noise}\). But how can the government find the
optimal threshold using observational data? This is where machine
learning (ML) comes into the picture.
Let’s imagine very favorable conditions. Let’s assume everyone (an infinite number of them!) takes the exam, the government hires all of them and observes their true skills. Further, assume the modeling assumption is perfectly correct (i.e., both the true prior distribution and conditional distribution are normal). What would happen if the government trains a model with an infinite number of \((x,z)\)
pairs?</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_50" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image6.png" />
<em>The government has collected lots of data and now wants to use ML models to predict the best threshold that minimizes the error.</em></p>
</div></figure>
<p>Before delving into this, we would like to note that in real-world
scenarios, we do not have infinite data (finite data issues); the
government does not hire everyone (selection bias issues), and the true
skill is not perfectly observable (target noise/biases issues).
Furthermore, the modeling assumptions are often incorrect (model
misspecification issues). Each of these issues may affect the model
adversely; however, in this blog post our goal is to analyze the model
decisions when none of these issues exist. In the next section, we will show that discrimination occurs even under these ideal conditions.</p>
<p>Under these very favorable conditions and with the right loss function,
machine learning algorithms can perfectly predict \(\mathbb{E} [z
\mid x]\) from \(x\) and can therefore find the optimal threshold
that minimizes the error. The following few lines of Python code show
how linear regression and logistic regression fit the data. In this
example, we set \(\mu = -1,
\sigma_\text{skill}=\sigma_\text{noise}=1\), and as shown in
the figure on the right, the cut-off threshold predicted by the model is
one, which matches the optimal threshold we found previously.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image2.png" />
<em>A simple example along with the predicted cut-off
threshold for linear and logistic regression. The predicted cut-off
threshold results in the minimum error, as previously discussed.</em></p>
</div></figure>
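<p>The figure’s code is only available as an image, so the sketch below is a stand-in rather than the original. It covers only the linear-regression half, uses a large finite sample instead of infinite data, and fits ordinary least squares by hand with the standard library; the sample size, seed, and variable names are assumptions.</p>

```python
import random

rng = random.Random(0)
mu, sigma_skill, sigma_noise = -1.0, 1.0, 1.0
n = 200_000

# The government observes both the exam score x and the true skill z of everyone.
skills = [rng.gauss(mu, sigma_skill) for _ in range(n)]
scores = [rng.gauss(z, sigma_noise) for z in skills]

# Ordinary least squares for z ~ intercept + slope * x.
mean_x = sum(scores) / n
mean_z = sum(skills) / n
cov_xz = sum((x - mean_x) * (z - mean_z) for x, z in zip(scores, skills))
var_x = sum((x - mean_x) ** 2 for x in scores)
slope = cov_xz / var_x
intercept = mean_z - slope * mean_x

# The model hires when its predicted skill is positive, i.e. when x > -intercept / slope.
cutoff = -intercept / slope
print(f"slope={slope:.3f}  intercept={intercept:.3f}  predicted cut-off={cutoff:.3f}")
```

<p>The fitted line approximates \(\mathbb{E}[z \mid x]\), so the score at which it crosses zero lands close to the optimal threshold of 1 derived earlier.</p>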
<blockquote>
<blockquote>
<h4 id="under-very-favorable-conditions-machine-learning-models-find-the-optimal-threshold-which-is-a-function-of-average-skill-exam-noise-and-skill-variance-among-people">Under very favorable conditions, machine learning models find the optimal threshold, which is a function of average skill, exam noise, and skill variance among people.</h4>
</blockquote>
</blockquote>
<h1 id="optimal-thresholds-for-different-groups">Optimal thresholds for different groups</h1>
<p>So far, we have shown how to calculate the optimal threshold and
illustrated that ML models also recover this threshold. Let’s now
analyze the optimal threshold when different groups exist in the
population. There are two kinds of people in the city of Ney: blue and red. The
blue people’s skills are normally distributed centered
on \(\mu_\text{blue}\), and the red people’s skills are normally
distributed centered on \(\mu_\text{red}\). The standard deviation for
both groups is \(\sigma_\text{skill}\). There can be various
reasons for disparities between groups; for example, historically, blue
people might not have been allowed to attend school.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image9.png" />
<em>In Ney, people are divided into two groups: blue and red. The blue people have a lower average skill level than the red people.</em></p>
</div></figure>
<p>First of all, let’s see what happens if the exam is completely precise. As
previously discussed, in this case the optimal threshold is 0 for
both groups, independent of their distributions. Thus, both groups are
held to the same standard, and the error for the government is 0.</p>
<blockquote>
<blockquote>
<h4 id="if-there-is-no-noise-in-the-exam-then-zero-is-the-optimal-threshold-for-both-groups-and-leads-to-zero-error">If there is no noise in the exam, then zero is the optimal threshold for both groups and leads to zero error.</h4>
</blockquote>
</blockquote>
<p>Now let’s analyze the case where the exam is noisy
(\(\sigma_\text{noise} > 0\)). As discussed in the prior
sections, the optimal threshold depends on the average of the prior
distribution, thus the optimal threshold differs between blue and red
groups. Therefore, if the government knows the demographic information,
then it’s a better strategy for the government to classify different
groups separately (in order to minimize the error). In particular, the
government can calculate the optimal threshold for blue and red people
using Bayesian inference.</p>
<script type="math/tex; mode=display">\begin{align}
\text{Red Threshold} = -\mu_\text{red} \frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2} \quad \quad \text{Blue Threshold} = -\mu_\text{blue}\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}
\end{align}</script>
<blockquote>
<blockquote>
<h4 id="people-in-a-group-that-has-lower-average-skills-need-to-pass-a-higher-bar-for-hiring-not-only-do-blue-people-need-to-overcome-other-associated-effects-of-being-in-a-group-with-lower-average-skills-they-also-need-to-pass-a-higher-bar-to-get-hired---------">People in a group that has lower average skills need to pass a higher bar for hiring! Not only do blue people need to overcome other associated effects of being in a group with lower average skills, they also need to pass a higher bar to get hired. </h4>
</blockquote>
</blockquote>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image7.png" />
<em>The cut-off threshold for hiring is higher for blue people in comparison to the red people.</em></p>
</div></figure>
<p>As stated, the government uses a higher threshold for people in a group
with a lower average skill! Consider two individuals with the same skill
level but from different groups. The blue person is less likely to get
hired by the government than the red person. Surprisingly, blue people
who are already in a group with a lower average skill (which probably
affects their confidence and society’s view of them) need to also pass a
higher bar to get hired!</p>
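<p>To make this concrete, the sketch below computes the chance that a person with a given true skill clears their group’s threshold. The group means (\(\mu_\text{blue}=-1\), \(\mu_\text{red}=1\)), the skill level of 0.5, and the function name <code>hire_probability</code> are hypothetical values chosen for illustration; they do not come from the post.</p>

```python
from statistics import NormalDist

def hire_probability(z, group_mu, sigma_skill, sigma_noise):
    """P(score > group threshold | true skill z); requires sigma_noise > 0."""
    threshold = -group_mu * sigma_noise ** 2 / sigma_skill ** 2
    score_distribution = NormalDist(mu=z, sigma=sigma_noise)  # x | z ~ N(z, sigma_noise^2)
    return 1.0 - score_distribution.cdf(threshold)

# Two equally skilled (and qualified) applicants from different groups.
skill = 0.5
blue = hire_probability(skill, group_mu=-1.0, sigma_skill=1.0, sigma_noise=1.0)
red = hire_probability(skill, group_mu=1.0, sigma_skill=1.0, sigma_noise=1.0)
print(f"blue applicant hired with probability {blue:.3f}")
print(f"red applicant hired with probability {red:.3f}")
```

<p>Both applicants have the same skill and the same exam noise; the only difference is the group-specific threshold, which is what lowers the blue applicant’s chance of being hired.</p>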
<p>Finally, note that the gap between thresholds for the different groups
grows as the noise increases.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image12.png" />
<em>As the exam noise increases, the gap between the optimal thresholds among different groups widens. Blue people need to get a better score than red people on the exam to get hired.</em></p>
</div></figure>
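<p>The widening gap follows directly from the two threshold formulas above. Subtracting one from the other, using only the symbols already defined, gives:</p>
<script type="math/tex; mode=display">\begin{align}
\text{Blue Threshold} - \text{Red Threshold} = (\mu_\text{red} - \mu_\text{blue})\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}
\end{align}</script>
<p>Since \(\mu_\text{blue} < \mu_\text{red}\), the gap is positive and grows quadratically in \(\sigma_\text{noise}\); it vanishes when the exam is noise-free.</p>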
<blockquote>
<blockquote>
<h4 id="a-blue-person-has-a-lower-chance-of-getting-hired-in-comparison-with-a-red-person-with-the-same-skill">A blue person has a lower chance of getting hired in comparison with a red person with the same skill.</h4>
</blockquote>
</blockquote>
<h1 id="conclusion">Conclusion</h1>
<p>We examined the discriminatory effect of relying on noisy features. When ML models use noisy features, they’re naturally incentivized to devalue a good score when the candidate in question comes from an overall lower-performing group. Note that noisy features are prevalent in any real-world application (here, we assumed that noise is the same among all individuals, but it’s usually worse for disadvantaged groups). Ideally, we would like to improve the features to better reflect a candidate’s skill/potential or make the features more closely approximate the job requirements. If that’s not possible, it’s important to be conscious that the “optimal decision” is to discriminate, and we should adjust our process (e.g., hiring) in acknowledgment that group membership can shade an individual’s evaluation.</p>
<hr />
<h1 id="frequently-asked-questions">Frequently asked questions</h1>
<h5 id="can-we-just-remove-the-group-membership-information-so-the-model-treats-individuals-from-both-groups-similarly"><strong>Can we just remove the group membership information, so the model treats individuals from both groups similarly?</strong></h5>
<p>Unlike this example where group membership is a removable feature,
real-world datasets are more complex. Usually, datasets contain many
features such that the group membership can be predicted from them
(recall that ML models benefit from predicting group membership since it
lowers error). Thus, it is not obvious how to remove group membership in
these datasets. See
[<a href="http://proceedings.mlr.press/v28/zemel13.pdf">1</a>,<a href="https://arxiv.org/pdf/1707.00075.pdf">2</a>,<a href="https://arxiv.org/abs/1907.00020">3</a>]
for some efforts on removing group information.</p>
<h5 id="why-should-we-treat-these-two-groups-similarly-when-their-distributions-are-inherently-different-utilizing-group-membership-information-reduces-error-overall-and-for-both-groups"><strong>Why should we treat these two groups similarly when their distributions are inherently different? Utilizing group membership information reduces error overall and for both groups!</strong></h5>
<p>Fairness in machine learning usually studies the impact of ML algorithms
on groups according to protected attributes such as sex, sexual
orientation, race, etc. Usually, there has been some discrimination
towards these groups throughout history, which leads to huge disparities
among their distributions. For example, women (because of their sex)
were not allowed to go to universities. Thus, these disparities are not
inherent and could (and probably should!) change over time. For
instance, see women in the labor force
[<a href="https://www.dol.gov/agencies/wb/data/facts-over-time/women-in-the-labor-force%23civilian-labor-force-by-sex">4</a>].</p>
<p>Another reason to avoid relying on disparities among protected groups in
models is feedback loops. Feedback loops might exacerbate distributional
disparities among protected groups over time (e.g., few women get
accepted → self-doubt among women increases → women perform
worse on the exam → fewer women get accepted, and so on). For
instance, see
[<a href="https://arxiv.org/abs/1806.08010">5</a>]
and
[<a href="https://arxiv.org/abs/1706.09847">6</a>].</p>
<p>Finally, note that although the government objective may be to minimize the
error by weighting equally the costs of hiring non-qualified candidates and of
not hiring qualified candidates, it is not clear whether the group
objectives should be the same. For example, a group might be worse off
as a result of the government not hiring its qualified members than if
the government had hired its non-qualified members (for example, in
settings where the lack of minority role models in higher-level
positions leads to a lower perceived sense of belonging in other members
of a group). Thus, using group membership to minimize the error is not
necessarily the most beneficial outcome for a group; and depending on
the context we might need to minimize other objectives.</p>
<h5 id="what-about-other-notions-of-fairness-in-machine-learning"><strong>What about other notions of fairness in machine learning?</strong></h5>
<p>In this blog post, we studied the ML model’s prediction for two similar individuals (here, with the same z) but from different groups (blue vs. red). This is referred to as the counterfactual notion of fairness. There is another common notion of fairness known as the statistical notion of fairness, which looks at the groups as a whole and compares their incurred error (it is also common to compare the error incurred by qualified members of different groups, a criterion known as equal opportunity [<a href="https://arxiv.org/pdf/1610.02413.pdf">7</a>]). Statistical and counterfactual notions of fairness are independent of each other, and satisfying one does not guarantee satisfying the other. Feature noise can also cause a trade-off between these two notions of fairness, which is beyond the scope of this blog post. See our paper [<a href="https://arxiv.org/abs/1911.09876">8</a>] for critiques regarding these two notions and the effect of feature noise on statistical notions of fairness.</p>
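<p>As a rough illustration of how the two notions can be measured separately, here is a minimal Python sketch. The scoring rule <code>toy_model</code>, the decision threshold, and the four individuals are all hypothetical and chosen only for illustration; they are not taken from the paper.</p>

```python
def counterfactual_gap(model, z):
    """Difference in prediction for the same z under the two group labels."""
    return abs(model(z, "blue") - model(z, "red"))


def false_negative_rate(model, individuals, group, threshold=0.5):
    """Fraction of qualified members of `group` that the model rejects."""
    qualified = [p for p in individuals if p["group"] == group and p["qualified"]]
    rejected = [p for p in qualified if not model(p["z"], p["group"]) >= threshold]
    return len(rejected) / len(qualified)


def toy_model(z, group):
    # Hypothetical scoring rule that shifts the score by group membership.
    return z + (0.1 if group == "blue" else -0.1)


# Hypothetical individuals: same z values in both groups, all qualified.
individuals = [
    {"z": 0.55, "group": "blue", "qualified": True},
    {"z": 0.45, "group": "blue", "qualified": True},
    {"z": 0.55, "group": "red", "qualified": True},
    {"z": 0.45, "group": "red", "qualified": True},
]

# Counterfactual notion: same z, different group label.
gap = counterfactual_gap(toy_model, 0.5)

# Statistical notion (equal opportunity): error among qualified members per group.
fnr_blue = false_negative_rate(toy_model, individuals, "blue")
fnr_red = false_negative_rate(toy_model, individuals, "red")
equal_opportunity_gap = abs(fnr_blue - fnr_red)
```

<p>The two quantities are computed from different inputs (a single z versus a whole population), which is one way to see why satisfying one need not imply the other.</p>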
<h1 id="acknowledgement">Acknowledgement</h1>
<p>I would like to thank Percy Liang, Megha Srivastava, Frieda Rong, Rishi Bommasani, Yeganeh Alimohammadi, and Michelle Lee for their useful comments.</p>
Sun, 20 Dec 2020 00:00:00 -0800A Model-Based Approach Towards Identifying the Brain's Learning Algorithms
/blog/lr-identify/
/blog/lr-identify/<h3 id="introduction"><strong>Introduction</strong></h3>
<p>One of the tenets of modern neuroscience is that the brain modifies the
strengths of its synaptic connections (“weights”) during learning in
order to better adapt to its environment. However, the underlying
learning rules (“weight updates”) in the brain are currently unknown.
Many proposals have been suggested. These range from Hebbian-style
mechanisms, which seem biologically plausible but are not very effective
as learning algorithms because they prescribe purely local changes to
the weights between two neurons that increase only if they activate
together, to backpropagation, which is effective from a learning
perspective because it assigns credit to neurons along the entire downstream
path from outputs to inputs, but has numerous biologically implausible
elements.</p>
<p>A major long-term goal of computational neuroscience is to identify
which learning rules actually drive learning in the brain. A further
difficulty is that we do not even have strong ideas for what needs to be
measured in the brain to quantifiably assert that one learning rule is
more consistent with those measurements than another learning rule. So
how might we approach these issues? We take a simulation-based approach,
meaning that experiments are done on artificial neural networks rather
than real brains. We train over a thousand artificial neural networks
across a wide range of possible learning rule types (conceived of as
“optimizers”), system architectures, and tasks, where the ground truth
learning rule is known, and quantify the impact of these choices. Our
work suggests that recording activities from several hundred neurons,
measured semi-regularly during learning, may provide a good basis to
identify learning rules -- a testable hypothesis within reach of
current neuroscience tools!</p>
<h3 id="background-a-plethora-of-theories-and-a-paucity-of-evidence"><strong>Background: A Plethora of Theories and a Paucity of Evidence</strong></h3>
<p>The brain modifies the connections between neurons during learning to
improve behavior; however, the underlying rules that govern these
modifications are unknown. The most famous proposed learning rule is
“Hebbian learning”, also known by the mantra: “neurons that fire
together, wire together”. In this proposal, a synaptic connection
strengthens if one neuron (“pre-synaptic”) consistently sends a signal
to another neuron (“post-synaptic”). The changes prescribed by Hebbian
learning are “local” in that they do not take into account a synapse’s
influence further downstream in the network. This locality makes
learning rather slow even in the cases where additional issues, such as
the weight changes becoming arbitrarily large, are mitigated. Though
there have been many suggested theoretical strategies to deal with this
problem, commonly involving simulations with artificial neural networks
(ANNs), these strategies appear difficult to scale up to solve
large-scale tasks such as ImageNet categorization
[<a href="https://arxiv.org/abs/1807.04587">1</a>].</p>
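<p>To make the locality and the runaway-weight issue concrete, here is a minimal sketch of a plain Hebbian update for a single linear unit. The product form of the update, the learning rate, and the input values are illustrative assumptions, not the setup used in our experiments.</p>

```python
def hebbian_step(weights, pre, learning_rate=0.1):
    """One Hebbian update for a single linear post-synaptic unit."""
    # Post-synaptic activity is a weighted sum of the pre-synaptic inputs.
    post = sum(w * x for w, x in zip(weights, pre))
    # Local rule: each weight changes using only its own pre-synaptic
    # input and the post-synaptic activity, with no downstream error signal.
    return [w + learning_rate * post * x for w, x in zip(weights, pre)]


weights = [0.1, 0.1]
pre = [1.0, 0.5]  # hypothetical, repeatedly presented input pattern

norms = []
for _ in range(20):
    weights = hebbian_step(weights, pre)
    norms.append(sum(w * w for w in weights) ** 0.5)
```

<p>With a repeated input, the weight norm in <code>norms</code> keeps growing at every step, which is the unbounded-growth problem mentioned above.</p>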
<p>This property of local changes is in stark contrast to backpropagation,
the technique commonly used to optimize artificial neural networks. In
backpropagation, as the name might suggest, an error signal is
propagated backward along the entire downstream path from the outputs of
a model to the inputs of the model. This allows credit to be effectively
assigned to every neuron along the path.</p>
<p>Although backpropagation has long been a standard component of deep
learning, its plausibility as a <em>biological</em> learning rule (i.e. how the
brain modifies the strengths of its synaptic connections) is called into
question for several reasons. Chief among them is that backpropagation
requires perfect symmetry, whereby the backward error-propagating
weights are the transpose of the forward inference weights, for which
there is currently little biological support
[<a href="https://www.sciencedirect.com/science/article/pii/S0364021387800253">2</a>,
<a href="https://www.nature.com/articles/337129a0">3</a>].</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-09-lr-identify/weight_symmetry.gif" /></p>
<figcaption>
<b>Avoiding weight symmetry.</b> Backpropagation naturally couples the
forward and backward weights. This constraint can be relaxed by
uncoupling them, thereby generating a spectrum of learning rule
hypotheses about how the backward weights may be updated.
For more details, see our recent <a href="https://arxiv.org/abs/2003.01513">prior work</a>.
</figcaption>
</div></figure>
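<p>The coupling described in the caption can be sketched in a few lines of Python. This is a simplified illustration for one linear layer: it ignores nonlinearities, and the layer sizes, error values, and random backward matrix are hypothetical.</p>

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def hidden_error(output_error, backward_weights):
    """Send the output error back to the hidden layer."""
    return matvec(backward_weights, output_error)


rng = random.Random(0)
n_hidden, n_out = 4, 2

# Forward weights map hidden activity (4 units) to outputs (2 units).
forward = [[rng.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_out)]
output_error = [0.5, -1.0]

# Backpropagation: the backward weights are tied to the transpose
# of the forward weights (exact weight symmetry).
bp_signal = hidden_error(output_error, transpose(forward))

# Uncoupled alternative: a separate backward matrix, here simply
# drawn at random and independent of the forward weights.
backward = [[rng.gauss(0, 1) for _ in range(n_out)] for _ in range(n_hidden)]
uncoupled_signal = hidden_error(output_error, backward)
```

<p>Different learning rule hypotheses then correspond to different choices for how <code>backward</code> is set or updated over training.</p>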
<p>Recent approaches, from us and others
[<a href="https://arxiv.org/abs/1904.05391">4</a>,
<a href="https://arxiv.org/abs/2003.01513">5</a>], introduce approximate
backpropagation strategies that do not require this symmetry, and can
still succeed at large-scale learning as backpropagation does. However,
given the number of proposals, a natural question to ask is how
realistic they are. At the moment, our hypotheses are governed by domain
knowledge that specifies what “can” and “cannot” be biologically
plausible (e.g. “exact weight symmetry is likely not possible” or
“separate forward and backward passes during learning seem
implausible”), as well as characterizations of ANN task performance
under a given learning rule (which is not always directly measurable
from animal behavior). In order to be able to successfully answer this
question, we need to be able to empirically <em>refute</em> hypotheses. In
other words, we would ideally want to know what biological data to
collect in order to claim that one hypothesis is more likely than
another.</p>
<p>More concretely, we can ask: what specific measurements from the brain,
in the form of individual activation patterns over time, synaptic
strengths, or paired-neuron input-output relations, would allow one to
draw quantitative comparisons of whether the observations are more
consistent with one or another specific learning rule? For example,
suppose we record neural responses (“activation patterns”) while an
animal is learning a task. Would these data be sufficient to enable us
to broadly differentiate between learning rule hypotheses, e.g. by
reliably indicating that one learning rule’s changes over time more
closely match the changes measured from real data than those prescribed
by another learning rule?</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-09-lr-identify/neuron_schematic.gif" /></p>
<figcaption>
Some potential observables to measure on which to separate candidate
learning rule hypotheses. (Pyramidal neuron schematic adapted from Figure
4 of [<a href="https://www.nature.com/articles/s41583-020-0277-3">6</a>])
</figcaption>
</div></figure>
<p>Answering this question turns out to be a substantial challenge, because
it is difficult on purely theoretical grounds to identify which patterns
of neural changes arise from given learning rules, without also knowing
the overall network connectivity and reward target (if any) of the
learning system.</p>
<p>But, there may be a silver lining. While ANNs consist of units that are
highly simplified with respect to biological neurons, progress
within the past few years has shown that the internal representations that
emerge in trained deep ANNs often overlap strongly with representations
in the brain, and are in fact quantifiably similar to many
neurophysiological and behavioral observations in animals
[<a href="https://www.nature.com/articles/s41593-019-0520-2">7</a>]. For
instance, task-optimized, deep convolutional neural networks (CNNs) have
emerged as quantitatively accurate models of encoding in primate visual
cortex [<a href="https://www.pnas.org/content/111/23/8619">8</a>,
<a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003915">9</a>,
<a href="https://www.jneurosci.org/content/35/27/10005">10</a>]. This is
due to (1) their cortically-inspired architecture, a cascade of
spatially-tiled linear and nonlinear operations; and (2) their being
optimized to perform certain behaviors that animals must perform to
survive, such as object recognition
[<a href="https://www.nature.com/articles/nn.4244">11</a>]. CNNs trained
to recognize objects on ImageNet predict neural responses of primate
visual cortical neurons better than any other model class. Thus, these
models are, at the moment, some of our best algorithmic
“theories” of the brain -- a system that was ultimately not designed by
us, but rather the product of millions of years of evolution. On the
other hand, ANNs <em>are</em> designed by us -- so the ground truth learning
rule is known and every unit (artificial “neuron”) can be measured up to
machine precision.</p>
<p>Can we marry what we can measure in neuroscience with what we can
conclude from machine learning, in order to identify what experimentally
measurable observables may be most useful for inferring the underlying
learning rule? If we can’t do this in our models, then it seems very
unlikely to be able to do this in the real brain. But if we can do this
in principle, then we are in a position to generate predictions as to
what data to collect, and whether that is even within reach of current
experimental neuroscience tools.</p>
<h3 id="methods"><strong>Methods</strong></h3>
<p>We adopt a two-stage “virtual experimental” approach. In the first
stage, we train ANNs with different learning rules, across a variety of
architectures, tasks, and associated hyperparameters. These will serve
as our “model organisms” on which we will subsequently perform idealized
neuroscience measurements. In the second stage, we calculate aggregated
statistics (“measurements”) from each layer of the models as features
from which to train simple classifiers that classify the category that a
given learning rule belongs to (specified below). These classifiers
include the likes of a linear SVM, as well as simple non-linear ones
such as a Random Forest and a 1D convolutional two-layer perceptron.</p>
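<p>The second stage boils down to a standard train-then-evaluate-on-held-out-data loop. The sketch below shows that loop in plain Python with a nearest-centroid classifier, which is a deliberately simple stand-in and not one of the classifiers we used. The data are synthetic, and the labels <code>rule_a</code> and <code>rule_b</code> are placeholders.</p>

```python
import random


def centroid(rows):
    """Per-feature mean of a list of equal-length feature vectors."""
    return [sum(column) / len(column) for column in zip(*rows)]


def fit(features, labels):
    """Store one centroid per class label."""
    grouped = {}
    for row, label in zip(features, labels):
        grouped.setdefault(label, []).append(row)
    return {label: centroid(rows) for label, rows in grouped.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest to `row`."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


rng = random.Random(0)


def toy_trajectory(offset, length=5):
    # Synthetic stand-in for an observable statistic tracked over training.
    return [offset + rng.gauss(0, 0.1) for _ in range(length)]


data = [(toy_trajectory(0.0), "rule_a") for _ in range(20)]
data += [(toy_trajectory(1.0), "rule_b") for _ in range(20)]
rng.shuffle(data)

# Train on a subset of trajectories, evaluate on the held-out remainder.
train, test = data[:30], data[30:]
model = fit([x for x, _ in train], [y for _, y in train])
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
```

<p>Swapping in a stronger classifier only changes <code>fit</code> and <code>predict</code>; the held-out evaluation stays the same.</p>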
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/approach_schematic.png" /></p>
<figcaption>
<b>Overall approach.</b> Observable statistics are generated from each
neural network's layer, through the model training process for each
learning rule. We take a quantitative approach whereby a classifier is
cross-validated and trained on a subset of these trajectories and
evaluated on the remaining data.
</figcaption>
</div></figure>
<p>Generating a large-scale dataset is crucial to this endeavor, in order
to both emulate a variety of experimental neuroscience scenarios and be
able to derive robust conclusions from them. Thus, in the first stage,
we train ANNs on tasks and architectures that have been shown to explain
variance in neural responses from sensory (visual and auditory)
brain areas [<a href="https://www.pnas.org/content/111/23/8619">8</a>,
<a href="https://www.sciencedirect.com/science/article/pii/S0896627318302502?via%3Dihub">12</a>].
These include <em>supervised</em> tasks across vision and audition, as well as
<em>self-supervised</em> ones. We consider both shallow and deep feedforward
architectures on these tasks, that are of depth comparable to what is
considered reasonable from the standpoint of shallower non-primate (e.g.
mouse
[<a href="https://www.nature.com/articles/s41586-019-1716-z">13</a>]) and
deeper primate sensory systems
[<a href="https://www.pnas.org/content/111/23/8619">8</a>,
<a href="https://arxiv.org/abs/1807.00053">14</a>,
<a href="https://www.biorxiv.org/content/10.1101/407007v2.full">15</a>].</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/table.png" /></p>
<figcaption>
The learning rules, tasks, architectures, and hyperparameters from which
we generate data, comprising over a thousand training experiments in total.
</figcaption>
</div></figure>
<p>In the second stage, we train classifiers on the observable statistics from these ANNs to predict the learning rules (as specified in the table above) used to train them.
The four learning rules were chosen as they span the space of commonly
used variants of backpropagation (<a href="http://proceedings.mlr.press/v28/sutskever13.pdf">SGDM</a> and <a href="https://arxiv.org/abs/1412.6980">Adam</a>), as well as potentially
more biologically-plausible “local” learning rules (<a href="https://arxiv.org/abs/1411.0247">Feedback
Alignment (FA)</a> and <a href="https://arxiv.org/abs/2003.01513">Information Alignment (IA)</a>) that efficiently
train networks at scale to varying degrees of performance but avoid exact weight
symmetry.</p>
<p>Because the primary aim of this study is to determine the extent to which
different learning rules lead to different encodings within ANNs, we
begin by defining representative features that can be drawn from the
course of model training. For each layer in a model, we consider three
measurements: <em>weights</em> of the layer, <em>activations</em> from the layer, and
<em>layer-wise activity change</em> of a given layer’s outputs relative to its
inputs. We choose ANN weights to analogize to synaptic strengths in the
brain, activations to analogize to post-synaptic firing rates, and
layer-wise activity changes to analogize to paired measurements that
involve observing the change in post-synaptic activity with respect to
changes induced by pre-synaptic input.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/statistics.gif" /></p>
<figcaption>
Defining observable statistics.
</figcaption>
</div></figure>
<p>For each measure, we consider three functions applied to it: “identity”,
“absolute value”, and “square”. Finally, for each function of the
weights and activations, we consider seven statistics, and for the
layer-wise activity change observable, we only use the mean statistic
due to computational restrictions. This results in a total of 45
continuous valued observable statistics for each layer, though 24
observable statistics are ultimately used for training the classifiers,
since we remove any statistic that has a divergent value during the
course of model training. We also use a ternary indicator of layer
position in the model hierarchy: “early”, “middle”, or “deep”
(represented as a one-hot categorical variable).</p>
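<p>The count of 45 follows directly from the combinations described above. The short sketch below just spells out that arithmetic; the seven statistics are left unnamed here.</p>

```python
FUNCTIONS = ["identity", "absolute value", "square"]
NUM_STATISTICS = 7  # statistics applied to weights and activations

# Weights and activations: every function is paired with every statistic.
weights_count = len(FUNCTIONS) * NUM_STATISTICS
activations_count = len(FUNCTIONS) * NUM_STATISTICS

# Layer-wise activity change: only the mean is used for each function.
activity_change_count = len(FUNCTIONS) * 1

total = weights_count + activations_count + activity_change_count
```

<p>That is 21 + 21 + 3 = 45 statistics per layer before the divergent ones are removed.</p>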
<h3 id="we-can-separate-learning-rules-from-aggregate-statistics-of-the-weights-activations-or-layer-wise-activity-changes"><strong>We Can Separate Learning Rules from Aggregate Statistics of the Weights, Activations, or Layer-wise Activity Changes</strong></h3>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/example.png" /></p>
<figcaption>
Across tasks, different learning rules give rise to perceptible
differences in observable statistics.
</figcaption>
</div></figure>
<p>Already by eye, one can pick up distinctive differences across the
learning rules for each of the training trajectories of these metrics.
Of course, this is not systematic enough to clearly judge one set of
observables versus another, but provides some initial assurance that
these metrics seem to capture some inherent differences in learning
dynamics across rules.</p>
<p>So these initial observations seem promising, but we want to make this
approach more quantitative. Suppose for each layer we concatenate the
trajectories of each observable and the position in the model hierarchy
that this observable came from. Can we generalize well across held-out
examples?</p>
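<p>One way to picture the classifier input is the following sketch, which concatenates the trajectories for one layer and appends the one-hot layer position. The observable names, the ordering of the features, and the example values are assumptions made for illustration.</p>

```python
LAYER_POSITIONS = ["early", "middle", "deep"]


def one_hot_position(position):
    """Encode the layer position as a one-hot vector."""
    return [1.0 if position == name else 0.0 for name in LAYER_POSITIONS]


def build_feature_vector(trajectories, position):
    """Concatenate observable trajectories (sorted by name) with the position code."""
    features = []
    for name in sorted(trajectories):
        features.extend(trajectories[name])
    return features + one_hot_position(position)


# Hypothetical trajectories of two statistics, sampled at three training steps.
example = build_feature_vector(
    {
        "weights_mean": [0.1, 0.2, 0.3],
        "activations_mean": [1.0, 0.9, 0.8],
    },
    "middle",
)
```

<p>Each layer of each trained model contributes one such vector, labeled with the learning rule that produced it.</p>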
<p>It turns out that the answer is, in fact, yes. Across all classes of
observables, the Random Forest attains the highest test accuracy, and
all observable measures perform similarly under this classifier.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/conf_mats.png" /></p>
<figcaption>
<b>Test set confusion matrices.</b> Random Forest performs the best and differences in learning rate policy
(Adam vs. SGDM) are more difficult to distinguish.
</figcaption>
</div></figure>
<p>Looking at confusion matrices on the test set, we see that the Random
Forest rarely mistakes one learning rule for any of the others. And
when the classifiers do make mistakes, they generally tend to confuse
Adam vs. SGDM more so than IA vs. FA, suggesting that they are able to
pick up more on differences (reflected in the observable statistics) due
to the high-dimensional direction of the gradient tensor than on those due
to the magnitude of the gradient tensor (the latter being directly tied to
learning rate policy).</p>
<h3 id="adding-back-some-experimental-neuroscience-realism"><strong>Adding Back Some Experimental Neuroscience Realism</strong></h3>
<p>Up until this point, we have had access to all input types, the full learning trajectory, and noiseless access to all units when making our virtual measurements of ANN observable statistics.
But in a real experiment where someone were to
collect such data from a neural circuit, the situation would be far from
this ideal scenario. We therefore explore experimental realism in
several ways, in order to identify which observable measures are robust
across these scenarios.</p>
<h4 id="access-to-only-portions-of-the-learning-trajectory-subsampling-observable-trajectories"><strong><em>Access to only portions of the learning trajectory: subsampling observable trajectories</em></strong></h4>
<p>The results presented thus far were obtained with access to the entire
learning trajectory of each model. Often, however, an experimentalist
collects data throughout learning at regularly spaced intervals. We
capture this variability by randomly sampling a fixed number of points
at a fixed temporal spacing for each trajectory, which we refer to as a
“subsample period”.</p>
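<p>Below is a minimal sketch of one way to implement this kind of subsampling, assuming a random starting point followed by evenly spaced samples. The function name, the random-start choice, and the example trajectory are our illustrative assumptions; the released code may differ.</p>

```python
import random


def subsample_trajectory(trajectory, num_points, period, rng):
    """Pick `num_points` samples spaced `period` steps apart from a random start."""
    span = (num_points - 1) * period
    if span > len(trajectory) - 1:
        raise ValueError("trajectory too short for this many points and period")
    # Random start, chosen so the last sample still falls inside the trajectory.
    start = rng.randint(0, len(trajectory) - 1 - span)
    return [trajectory[start + i * period] for i in range(num_points)]


rng = random.Random(0)
trajectory = list(range(100))  # stand-in for a statistic recorded at 100 steps

short_period = subsample_trajectory(trajectory, num_points=5, period=2, rng=rng)
long_period = subsample_trajectory(trajectory, num_points=5, period=20, rng=rng)
```

<p>Both calls return five samples, but <code>long_period</code> covers a much wider stretch of training than <code>short_period</code>.</p>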
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/sparse_subsampling.png" /></p>
<figcaption>
Sparse subsampling across learning trajectory is most robust to
trajectory undersampling.
</figcaption>
</div></figure>
<p>We find across observable measures that robustness to undersampling of
the trajectory is largely dependent on the subsample period length. As
the subsample period length increases (in the middle and right-most
columns), the Random Forest classification performance increases
compared to the same number of sampled points for a smaller period
(depicted in the left-most column).</p>
<p>Taken together, these results suggest that data consisting of
measurements collected temporally further apart across the learning
trajectory is more robust to undersampling than data collected closer
together in training time. Furthermore, across individual observable
measures, the weights are overall the most robust to undersampling of
the trajectory, but with enough frequency of samples we can achieve
comparable performance with the activations.</p>
<h4 id="incomplete-and-noisy-measurements-subsampling-units-and-gaussian-noise-before-collecting-observables"><strong><em>Incomplete and noisy measurements: subsampling units and Gaussian noise before collecting observables</em></strong></h4>
<p>The aggregate statistics computed from the observable measures thus far
have operated under the idealistic assumption of noiseless access to
every unit in the model. However, in most datasets, there is a
significant amount of unit undersampling as well as non-zero measurement
noise. How do these two factors affect learning rule identification, and
in particular, how robust to noise and unit subsampling are particular
observable measures?</p>
<p>Addressing this question would provide insight into the types of
experimental neuroscience paradigms that may be most useful for
identifying learning rules, and predict how certain experimental tools
may fall short for given observables. For instance, optical imaging
techniques can use fluorescent indicators of electrical activities of
neurons to give us simultaneous access to thousands of neurons.
But these techniques can have lower temporal resolution and signal-to-noise than
electrophysiological recordings that more directly measure the
electrical activities of neurons, which in turn may lack the same
coverage.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/subsample_noise.png" /></p>
<figcaption>
<b>Activations are the most robust to measurement noise and unit
undersampling.</b> Reported here is Random Forest test set accuracy in
separating IA vs. FA, averaged over 10 train/test splits per random
sampling and simulated measurement noise seed.
</figcaption>
</div></figure>
<p>To account for these tradeoffs, we model measurement noise as an
additive white Gaussian noise process added to units of ResNet-18
trained on the ImageNet and self-supervised SimCLR tasks. We choose IA
vs. FA since the differences between them are conceptually stark: IA
imposes dynamics on the feedback error weights during learning, whereas
FA keeps them fixed. If there are scenarios of measurement noise and
unit subsampling where we are at chance accuracy for this problem (50%),
then it may establish a strong constraint on learning rule
separability more generally.</p>
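<p>The measurement model can be sketched as follows: pick a random subset of units, add zero-mean Gaussian noise to each, and only then compute the aggregate statistic. The function name, the unit values, and the noise level are hypothetical.</p>

```python
import random
import statistics


def measure_units(units, num_units, noise_std, rng):
    """Record a random subset of units with additive Gaussian measurement noise."""
    recorded = rng.sample(units, num_units)
    return [value + rng.gauss(0.0, noise_std) for value in recorded]


rng = random.Random(0)
units = [1.0, 2.0, 3.0, 4.0]  # stand-in for one layer's unit values

# Ideal case: every unit is recorded and there is no noise.
ideal = measure_units(units, num_units=4, noise_std=0.0, rng=rng)
ideal_mean = statistics.mean(ideal)

# More realistic case: half the units, with noisy readout.
noisy = measure_units(units, num_units=2, noise_std=0.5, rng=rng)
noisy_mean = statistics.mean(noisy)
```

<p>Statistics such as <code>noisy_mean</code> are then used as classifier inputs in place of their noiseless, fully sampled counterparts.</p>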
<p>Our results suggest that if one makes experimental measurements by
imaging synaptic strengths, it is still crucial that the optical imaging
readout not be very noisy, since even with the number of units typically
recorded currently (on the order of several hundred to several thousand
synapses), a noisy imaging strategy of synaptic strengths may be
rendered ineffective.</p>
<p>Instead, current electrophysiological techniques that measure the
activities from hundreds of units could form a good set of neural data
to separate learning rules. Recording more units with these techniques
can improve learning rule separability from the activities, but it does
not seem necessary, at least in this setting, to record a majority of
units to perform this separation effectively.</p>
<h3 id="conclusions"><strong>Conclusions</strong></h3>
<p>As experimental techniques in neuroscience continue to advance, we will
be able to record data from more neurons with higher temporal
resolution. But even if we had the perfect measurement tools, it is not
clear ahead of time what should be measured in order to identify the
learning rule(s) operative within a given neural circuit, or whether
this is even possible in principle. Our model-based approach
demonstrates that we can identify learning rules <em>solely</em> on the basis of
standard types of experimental neuroscience measurements from the
weights, activations, or layer-wise activity changes, without knowledge
of the architecture or loss target of the learning system.</p>
<p>Additionally, our results suggest the following prescription for the type of
experimental neuroscience data to be collected towards this goal:</p>
<p><strong>Electrophysiological recordings of post-synaptic activities
from a neural circuit on the order of several hundred units, frequently
measured at wider intervals during the course of learning, may provide a
good basis on which to identify learning rules.</strong></p>
<p>We have made our <a href="https://github.com/neuroailab/lr-identify">dataset, code, and interactive
tutorial</a> publicly
available so that others can analyze these properties without needing to
train neural networks themselves. Our dataset may also be of interest to
researchers theoretically or empirically investigating learning in deep
neural networks. For further details, check out our <a href="https://arxiv.org/abs/2010.11765">NeurIPS 2020
paper</a>.</p>
<h3 id="acknowledgements"><strong>Acknowledgements</strong></h3>
<p>I would like to thank my collaborator Sanjana Srivastava
and advisors Surya Ganguli and Daniel Yamins. I would also like to
thank Jacob Schreiber, Sidd Karamcheti, and Andrey Kurenkov for their
editorial suggestions on this post.</p>
Wed, 09 Dec 2020 00:00:00 -0800iGibson: A Simulation Environment to Train AI Agents in Large Realistic Scenes
/blog/igibson/
/blog/igibson/<h2 id="why-simulation-for-ai">Why simulation for AI?</h2>
<p>We are living in a Golden Age of simulation environments in AI and robotics. Looking back ten years, simulation environments were rare, with only a handful of available solutions, and were complex and used only by experts. Today, there are many available simulation environments, and most papers in AI and robotics at first-tier conferences such as NeurIPS, CoRL, or even ICRA and IROS make some use of them. What has changed?</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-08-igibson/sim_img.png" /></p>
</div></figure>
<p>This extensive use of simulation environments is the result of several trends:</p>
<ul>
<li>First, the increasing role of machine learning in robotics creates a demand for more data (for example, interactive experiences) than what can be generated in real time <sup id="fnref:dexterity"><a href="#fn:dexterity" class="footnote">1</a></sup><sup id="fnref:todorov"><a href="#fn:todorov" class="footnote">2</a></sup><sup id="fnref:peng"><a href="#fn:peng" class="footnote">3</a></sup><sup id="fnref:robosuite"><a href="#fn:robosuite" class="footnote">4</a></sup>. Also, the initial data collection process often involves random exploration that may be dangerous for physical robots or their surroundings.</li>
<li>Second, simulation environments have matured to be more robust, realistic (visually and physically), user friendly and accessible to all types of users, and the necessary computation to simulate complex physics is reasonably fast on most modern machines. Therefore, simulation environments have the potential to lower the barrier to entry in robotics, even for researchers without the funds to acquire expensive real robot platforms.</li>
<li>Finally, the increasing number of robotic solutions to tasks such as grasping, navigation or manipulation has brought more attention to a critical absence in our community: the lack of repeatable benchmarks. Mature sciences are based on experiments that can be easily and reliably replicated, so that different techniques, theories, and solutions can be compared in fair conditions. Simulation environments can help us to establish repeatable benchmarks, which are very difficult to achieve with real robots, and these benchmarks can in turn help us understand the status of our field.</li>
</ul>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-08-igibson/image9.png" /></p>
</div></figure>
<h2 id="why-igibson">Why iGibson?</h2>
<p>These ideas motivated us in the Stanford Vision and Learning Lab to develop a simulation environment that can serve as a “playground” to train and test interactive AI agents – an environment we call iGibson (*footnote on naming at bottom of post). What makes iGibson special? To understand this, let’s first define what a simulation environment is and how it is different from a physics simulator. A physics simulator is an engine capable of computing the physical effect of actions on an environment (e.g. motion of bodies when a force is applied, or flow of liquid particles when being poured). There are many existing physics simulation engines. The best known in robotics are Bullet and its Python extension, PyBullet, MuJoCo, Nvidia PhysX and Flex, UnrealEngine, DART, Unity, and ODE. Given a physical problem (objects, forces, particles, and physics parameters), these engines compute the temporal evolution of the system. On the other hand, a simulation environment is a framework that includes a physics simulator, a renderer of virtual signals, and a set of assets (i.e. models of scenes, objects, and robots) that can be used to create simulations of problems to study and develop solutions for different tasks. The decision on what physics engine to use is based on the type of physical process that dominates the problem, for example rigid body physics or motion of fluids. However, to decide on what simulation environment to use, researchers are guided by the application domain they are interested in, and the research questions they want to explore. With iGibson, we aim to support the study of interactive tasks in large realistic scenes, guided by high quality virtual visual signals.</p>
<h2 id="comparison-to-existing-simulators">Comparison to existing simulators</h2>
<p>No existing simulation environments support developing solutions for problems involving interactions in large scale scenes like full houses. There are several simulation environments for tasks with stationary arms, such as meta-world, RLBench, RoboSuite or DoorGym, but none of them include large realistic scenes like homes with multiple rooms for tasks that include navigation. For navigation, our previous version, Gibson (v1) and Habitat have proven to be great environments that allow researchers to study visual and language guided navigation. However, the included assets (scenes) are single meshes that cannot change when interactions are applied, like opening doors or moving objects.</p>
<p>Finally, a set of recent simulation environments allow for scene-level interactive tasks, such as Sapien, AI2Thor and ThreeDWorld (TDW). Sapien focuses on interaction with articulated objects (doors, cabinets, and drawers). TDW is a multi-modal simulator with audio, high quality visuals, and simulation of flexible materials and liquids via Nvidia Flex. But neither Sapien nor TDW includes fully interactive scenes aligned with real object distribution and layout as part of the environment. AI2Thor includes fully interactive scenes, but the interactions are scripted: interactable objects are annotated with the possible actions they can receive. When the agent is close enough to an object and the object is in the right state (precondition), the agent can select a predefined action, and the object is “transitioned” to the next state (postcondition). RoboThor, an alternative version of AI2Thor, enables continuous interactions but focuses on navigation. It provides limited sensory signals (only RGB-D images) to the agent, which is always embodied as a <a href="https://www.google.com/url?q=http://www.locobot.org/&sa=D&ust=1607413428167000&usg=AOvVaw1ZTY10cnxkvqoOZHiIr9Hw">locobot</a>, a low-cost platform with limited interaction capabilities. Here at SVL, we want to study complex, long-horizon mobile manipulation tasks such as tidying a house or searching for objects, which requires access to fully interactive realistic large-scale scenes.</p>
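<p>The scripted-interaction model described above can be sketched as a small state machine. This is an illustration of the precondition/postcondition idea only, not AI2Thor’s actual API; <code>ScriptedObject</code>, <code>apply_action</code>, and the <code>max_reach</code> threshold are hypothetical names and values.</p>

```python
from dataclasses import dataclass, field


@dataclass
class ScriptedObject:
    name: str
    state: str
    # Maps an action name to (required state, resulting state).
    transitions: dict = field(default_factory=dict)


def apply_action(obj: ScriptedObject, action: str, distance: float,
                 max_reach: float = 1.0) -> bool:
    """Apply a predefined action; return True only if the state changed."""
    if distance > max_reach:
        return False  # agent is not close enough to the object
    if action not in obj.transitions:
        return False  # object is not annotated with this action
    precondition, postcondition = obj.transitions[action]
    if obj.state != precondition:
        return False  # object is not in the right state
    obj.state = postcondition  # discrete jump, no physics involved
    return True


door = ScriptedObject(
    name="door",
    state="closed",
    transitions={"open": ("closed", "open"), "close": ("open", "closed")},
)
```

<p>Note that nothing in this model simulates the hand pushing the door: the state simply flips. Physics-based interaction, by contrast, requires the agent to produce the contact forces that move the object.</p>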
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-08-igibson/image10.png" /></p>
</div></figure>
<h2 id="igibsons-new-features">iGibson’s new features</h2>
<p>The main focus of iGibson is interactivity: enabling realistic interactions in large scenes. To that end, we have included several key features:</p>
<ul>
<li>Fifteen fully interactive visually realistic scenes representing real world homes with furniture and articulated object models annotated with materials and dynamics properties.</li>
<li>Capabilities to import models from CubiCasa5K <sup id="fnref:cubicasa"><a href="#fn:cubicasa" class="footnote">5</a></sup> and 3D-Front <sup id="fnref:3dfront"><a href="#fn:3dfront" class="footnote">6</a></sup>, giving access to more than 12000 additional interactive home scenes.</li>
<li>Realistic virtual sensor signals, including high quality RGB images from a physics-based renderer, depth maps, 1-beam and 16-beam virtual LiDAR signals, semantic/instance/material segmentation, optical and scene flow, and surface normals.</li>
<li>Domain randomization for visual texture, dynamics properties and object instances for endless variations of scenes.</li>
<li>Human-computer interface for humans to provide demonstrations of fully physical interactions with the scenes.</li>
<li>Integration with sampling-based motion planners to facilitate motion of robotic bases (navigation in 2D layout) and arms (interaction in 3D space).</li>
</ul>
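<p>As a rough illustration of the domain randomization feature listed above, the sketch below samples a new variation of a scene by drawing a visual texture, dynamics properties, and an object instance at random. It does not use iGibson’s API; the texture names, model ids, parameter ranges, and the <code>sample_scene_variation</code> function are all made up for the example.</p>

```python
import random

# Hypothetical asset names, for illustration only.
TEXTURES = ["oak", "marble", "carpet"]
CHAIR_MODELS = ["chair_a", "chair_b", "chair_c"]


def sample_scene_variation(rng: random.Random) -> dict:
    """Draw one random combination of texture, dynamics, and object instance."""
    return {
        "floor_texture": rng.choice(TEXTURES),    # visual texture
        "friction": rng.uniform(0.4, 1.0),        # dynamics property
        "mass_scale": rng.uniform(0.8, 1.2),      # dynamics property
        "chair_model": rng.choice(CHAIR_MODELS),  # object instance
    }


# A seeded generator keeps the sampled variations reproducible.
rng = random.Random(0)
variations = [sample_scene_variation(rng) for _ in range(5)]
```

<p>Training on many such variations is intended to keep an agent from overfitting to one particular appearance or set of physical parameters.</p>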
<figure class="figure"><div class="figure__main">
<p><img class="postimagehalf" src="/blog/assets/img/posts/2020-12-08-igibson/image5.gif" />
<img class="postimagehalf" src="/blog/assets/img/posts/2020-12-08-igibson/image1.gif" /></p>
</div></figure>
<figure class="figure"><div class="figure__main">
<p><img class="postimagehalf" src="/blog/assets/img/posts/2020-12-08-igibson/image3.gif" />
<img class="postimagehalf" src="/blog/assets/img/posts/2020-12-08-igibson/image8.gif" /></p>
</div></figure>
<h2 id="using-igibson-for-robot-learning">Using iGibson for robot learning</h2>
<p>These novel features in iGibson allow us to study and develop solutions for new interactive tasks in large environments. One of these new problems is Interactive Navigation, where the agents need to interact with the environment to change its configuration, for example, to open doors or push obstacles away. This is a common type of navigation in our homes and offices, but non-interactive simulation environments cannot be used to study it. In iGibson we have developed hierarchical reinforcement learning solutions for interactive navigation that decide explicitly what part of the body to use in the next phase of the task: the arm (for interactions), the base (for navigation) or the combination of both <sup id="fnref:hrl4in2"><a href="#fn:hrl4in2" class="footnote">7</a></sup>. We also propose a new learning solution for interactive navigation that integrates a motion planner: the learning algorithm decides on the next point to interact with, and the motion planner finds a collision-free path to that point of interaction <sup id="fnref:relmogen2"><a href="#fn:relmogen2" class="footnote">8</a></sup>. But these are just the tip of the iceberg: many of SVL’s projects are leveraging iGibson to study a wide variety of interactive tasks in large realistic scenes.</p>
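<p>The division of labor between a learned policy and a motion planner can be sketched as follows. This is a toy stand-in, not the method from the cited papers: the “policy” is a fixed score table rather than a trained network, and a breadth-first search on a small grid replaces the sampling-based planners that iGibson integrates. All names (<code>choose_subgoal</code>, <code>plan_path</code>, <code>GRID</code>) are hypothetical.</p>

```python
from collections import deque


def choose_subgoal(candidates, scores):
    """Stand-in for the learned policy: pick the highest-scoring point."""
    return max(candidates, key=lambda point: scores[point])


def plan_path(grid, start, goal):
    """Breadth-first search on a 4-connected grid (1 = obstacle, 0 = free).

    Returns a shortest collision-free path as a list of cells, or None.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back to the start
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] == 0 and nxt not in parents:
                parents[nxt] = cell
                queue.append(nxt)
    return None


GRID = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],  # a wall with a single opening
    [0, 0, 0, 0],
]
candidates = [(2, 0), (0, 3)]
scores = {(2, 0): 0.9, (0, 3): 0.2}  # made-up policy outputs

subgoal = choose_subgoal(candidates, scores)  # the policy decides where to go
path = plan_path(GRID, (0, 0), subgoal)       # the planner decides how to get there
```

<p>The appeal of this split is that the learning algorithm only has to reason about where to interact, while collision avoidance is delegated to a planner.</p>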
<figure class="figure"><div class="figure__main">
<p><img class="postimagehalf" src="/blog/assets/img/posts/2020-12-08-igibson/image11.gif" />
<img class="postimagehalf" src="/blog/assets/img/posts/2020-12-08-igibson/image6.gif" /></p>
</div></figure>
<h2 id="summary">Summary</h2>
<p>Simulation environments have the potential to support researchers in their study of robotics and embodied AI problems. With iGibson, SVL contributes to the community an open source, fully academically developed simulation environment for interactive tasks in large realistic scenes. If you want to start using it, visit <a href="http://svl.stanford.edu/igibson/">our website</a> and download it. Setup should be straightforward, and we’re happy to answer any questions about getting the simulator up and running for your research! You can also read <a href="https://arxiv.org/pdf/2012.02924.pdf">our preprint on arXiv</a>. We hope we can facilitate new avenues of research in robotics and AI.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-08-igibson/image7.png" /></p>
</div></figure>
<hr />
<p>* A note on Gibson - Our simulation environment takes its name from James J. Gibson [1904-1979]. Gibson was an influential psychologist and cognitive scientist whose ideas were disruptive at the time. He advanced a new view of perception as 1) an ecological process that cannot and should not be studied in isolation from the environment, and 2) an active process that needs agency and interactivity. This was in contrast to the predominant view of the time, which treated perception as a passive process where signals “arrive” and “are processed” by the brain. Instead, he argued that agents seek information, interacting with the environment to reveal it. He also coined the term “affordance” for the opportunity the environment offers to an agent to perform a task. This is a quote from a colleague summarizing his research that directly connects to the guiding principle behind our work in the iGibson team: “ask not what’s inside your head, but what your head is inside of”.</p>
<div class="footnotes">
<ol>
<li id="fn:dexterity">
<p>OpenAI: Andrychowicz, Marcin, et al. “Learning dexterous in-hand manipulation.” The International Journal of Robotics Research 39.1 (2020): 3-20. <a href="#fnref:dexterity" class="reversefootnote">↩</a></p>
</li>
<li id="fn:todorov">
<p>Rajeswaran, Aravind, et al. “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations.” Robotics: Science and Systems, 2017 <a href="#fnref:todorov" class="reversefootnote">↩</a></p>
</li>
<li id="fn:peng">
<p>Peng, Xue Bin, et al. “Sfv: Reinforcement learning of physical skills from videos.” ACM Transactions on Graphics (TOG) 37.6 (2018): 1-14. <a href="#fnref:peng" class="reversefootnote">↩</a></p>
</li>
<li id="fn:robosuite">
<p>Zhu, Yuke, et al. “robosuite: A modular simulation framework and benchmark for robot learning.” arXiv preprint arXiv:2009.12293 (2020). <a href="#fnref:robosuite" class="reversefootnote">↩</a></p>
</li>
<li id="fn:cubicasa">
<p>Kalervo, Ahti, et al. “Cubicasa5k: A dataset and an improved multi-task model for floorplan image analysis.” Scandinavian Conference on Image Analysis. Springer, Cham, 2019. <a href="#fnref:cubicasa" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3dfront">
<p>Fu, Huan, et al. “3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics.” arXiv preprint arXiv:2011.09127 (2020). <a href="#fnref:3dfront" class="reversefootnote">↩</a></p>
</li>
<li id="fn:hrl4in2">
<p>Li, Chengshu, et al. “Hrl4in: Hierarchical reinforcement learning for interactive navigation with mobile manipulators.” Conference on Robot Learning. PMLR, 2020. <a href="#fnref:hrl4in2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:relmogen2">
<p>Xia, Fei, et al. “Relmogen: Leveraging motion generation in reinforcement learning for mobile manipulation.” arXiv preprint arXiv:2008.07792 (2020). <a href="#fnref:relmogen2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Tue, 08 Dec 2020 00:00:00 -0800
Stanford AI Lab Papers and Talks at NeurIPS 2020
/blog/neurips-2020/
/blog/neurips-2020/<p><img class="postimage_75" src="https://ai.stanford.edu/blog/assets/img/posts/2020-12-06-neurips-2020/logo.png" /></p>
<p>The <a href="https://neurips.cc">Neural Information Processing Systems</a> (NeurIPS) 2020 conference is being hosted virtually from Dec 6th - Dec 12th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!</p>
<h2 id="list-of-accepted-papers">List of Accepted Papers</h2>
<hr />
<h4 id="provably-efficient-reward-agnostic-navigation-with-linear-value-iteration">Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration</h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img33" />
<strong>Authors</strong>: Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill
<br /><strong>Contact</strong>: zanette@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2008.07737.pdf">Paper</a>
<br /><strong>Keywords</strong>: reinforcement learning, function approximation, exploration</p>
<hr />
<h4 id="acceleration-with-a-ball-optimization-oracle"><a href="https://arxiv.org/abs/2003.08078">Acceleration with a Ball Optimization Oracle</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img29" />
<strong>Authors</strong>: Yair Carmon, Arun Jambulapati, Qijia Jiang, Yujia Jin, Yin Tat Lee, Aaron Sidford, Kevin Tian
<br /><strong>Contact</strong>: kjtian@stanford.edu
<br /><strong>Award nominations:</strong> Oral presentation
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2003.08078">Paper</a>
<br /><strong>Keywords</strong>: convex optimization, local search, trust region methods</p>
<hr />
<h4 id="banditpam-almost-linear-time-k-medoids-clustering-via-multi-armed-bandits"><a href="https://arxiv.org/abs/2006.06856">BanditPAM: Almost Linear Time k-Medoids Clustering via Multi-Armed Bandits</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img10" />
<strong>Authors</strong>: Mo Tiwari, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris Piech, Ilan Shomorony
<br /><strong>Contact</strong>: Motiwari@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.06856">Paper</a> | <a href="https://studio.slideslive.com/web_recorder/share/20201019T224008Z__NeurIPS_posters__17289__bandit-pam-almost-linear-time?s=c3456b98-724c-4903-b216-e4cd5810b6b8">Video</a>
<br /><strong>Keywords</strong>: clustering, k-means, k-medoids, multi-armed bandits</p>
<hr />
<h4 id="caspr-learning-canonical-spatiotemporal-point-cloud-representations"><a href="https://geometry.stanford.edu/projects/caspr/content/CaSPR_CR.pdf">CaSPR: Learning Canonical Spatiotemporal Point Cloud Representations</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img20" />
<strong>Authors</strong>: Davis Rempe, Tolga Birdal, Yongheng Zhao, Zan Gojcic, Srinath Sridhar, Leonidas J. Guibas
<br /><strong>Contact</strong>: drempe@stanford.edu
<br /><strong>Links:</strong> <a href="https://geometry.stanford.edu/projects/caspr/content/CaSPR_CR.pdf">Paper</a> | <a href="https://www.youtube.com/watch?v=1CrITE28DeM">Video</a> | <a href="https://geometry.stanford.edu/projects/caspr/">Website</a>
<br /><strong>Keywords</strong>: 3d vision, dynamic point clouds, representation learning</p>
<hr />
<h4 id="compositional-explanations-of-neurons"><a href="https://arxiv.org/abs/2006.14032">Compositional Explanations of Neurons</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img15" />
<strong>Authors</strong>: Jesse Mu, Jacob Andreas
<br /><strong>Contact</strong>: muj@stanford.edu
<br /><strong>Award nominations:</strong> Oral presentation
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.14032">Paper</a>
<br /><strong>Keywords</strong>: interpretability, explanation, deep learning, computer vision, natural language processing, adversarial examples</p>
<hr />
<h4 id="continuous-meta-learning-without-tasks"><a href="https://arxiv.org/abs/1912.08866">Continuous Meta-Learning without Tasks</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img35" />
<strong>Authors</strong>: James Harrison, Apoorva Sharma, Chelsea Finn, Marco Pavone
<br /><strong>Contact</strong>: jharrison@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/1912.08866">Paper</a>
<br /><strong>Keywords</strong>: meta-learning, continuous learning, changepoint detection</p>
<hr />
<h4 id="deep-learning-versus-kernel-learning-an-empirical-study-of-loss-landscape-geometry-and-the-time-evolution-of-the-neural-tangent-kernel"><a href="https://arxiv.org/abs/2010.15110">Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img16" />
<strong>Authors</strong>: Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, Surya Ganguli
<br /><strong>Contact</strong>: sfort1@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.15110">Paper</a>
<br /><strong>Keywords</strong>: loss landscape, neural tangent kernel, linearization, taylorization, basin, nonlinear advantage</p>
<hr />
<h4 id="diversity-can-be-transferred-output-diversification-for-white--and-black-box-attacks"><a href="https://arxiv.org/abs/2003.06878">Diversity can be Transferred: Output Diversification for White- and Black-box Attacks</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img8" />
<strong>Authors</strong>: Yusuke Tashiro, Yang Song, Stefano Ermon
<br /><strong>Contact</strong>: ytashiro@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2003.06878">Paper</a> | <a href="https://github.com/ermongroup/ODS">Website</a>
<br /><strong>Keywords</strong>: adversarial examples, deep learning, robustness</p>
<hr />
<h4 id="evidential-sparsification-of-multimodal-latent-spaces-in-conditional-variational-autoencoders"><a href="https://arxiv.org/abs/2010.09164">Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img1" />
<strong>Authors</strong>: Masha Itkina, Boris Ivanovic, Ransalu Senanayake, Mykel J. Kochenderfer, and Marco Pavone
<br /><strong>Contact</strong>: mitkina@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.09164">Paper</a> | <a href="https://github.com/sisl/EvidentialSparsification">Website</a>
<br /><strong>Keywords</strong>: sparse distributions, generative models, discrete latent spaces, behavior prediction, image generation</p>
<hr />
<h4 id="federated-accelerated-stochastic-gradient-descent"><a href="https://papers.nips.cc/paper/2020/hash/39d0a8908fbe6c18039ea8227f827023-Abstract.html">Federated Accelerated Stochastic Gradient Descent</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img18" />
<strong>Authors</strong>: Honglin Yuan, Tengyu Ma
<br /><strong>Contact</strong>: yuanhl@stanford.edu
<br /><strong>Award nominations:</strong> Best Paper Award of Federated Learning for User Privacy and Data Confidentiality in Conjunction with ICML 2020 (FL-ICML’20)
<br /><strong>Links:</strong> <a href="https://papers.nips.cc/paper/2020/hash/39d0a8908fbe6c18039ea8227f827023-Abstract.html">Paper</a> | <a href="https://github.com/hongliny/FedAc-NeurIPS20">Website</a>
<br /><strong>Keywords</strong>: federated learning, local sgd, acceleration, fedac</p>
<hr />
<h4 id="fourier-transform-based-attribution-priors-improve-the-interpretability-and-stability-of-deep-learning-models-for-genomics"><a href="https://proceedings.neurips.cc/paper/2020/hash/1487987e862c44b91a0296cf3866387e-Abstract.html">Fourier-transform-based attribution priors improve the interpretability and stability of deep learning models for genomics</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img11" />
<strong>Authors</strong>: Alex Michael Tseng, Avanti Shrikumar, Anshul Kundaje
<br /><strong>Contact</strong>: amtseng@stanford.edu
<br /><strong>Links:</strong> <a href="https://proceedings.neurips.cc/paper/2020/hash/1487987e862c44b91a0296cf3866387e-Abstract.html">Paper</a> | <a href="https://github.com/amtseng/fourier_attribution_priors">Website</a>
<br /><strong>Keywords</strong>: deep learning, interpretability, attribution prior, computational biology, genomics</p>
<hr />
<h4 id="from-trees-to-continuous-embeddings-and-back-hyperbolic-hierarchical-clustering"><a href="https://arxiv.org/abs/2010.00402">From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img5" />
<strong>Authors</strong>: Ines Chami, Albert Gu, Vaggos Chatziafratis, Christopher Ré
<br /><strong>Contact</strong>: chami@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.00402">Paper</a> | <a href="https://www.youtube.com/watch?v=11bIx4v_Mz4&feature=youtu.be&ab_channel=HazyResearch">Video</a> | <a href="https://github.com/HazyResearch/HypHC">Website</a>
<br /><strong>Keywords</strong>: hierarchical clustering, hyperbolic embeddings</p>
<hr />
<h4 id="frugalml-how-to-use-ml-prediction-apis-more-accurately-and-cheaply"><a href="https://papers.nips.cc/paper/2020/file/789ba2ae4d335e8a2ad283a3f7effced-Paper.pdf">FrugalML: How to Use ML Prediction APIs More Accurately and Cheaply</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img27" />
<strong>Authors</strong>: Lingjiao Chen, Matei Zaharia, James Zou
<br /><strong>Contact</strong>: lingjiao@stanford.edu
<br /><strong>Award nominations:</strong> Oral Presentation
<br /><strong>Links:</strong> <a href="https://papers.nips.cc/paper/2020/file/789ba2ae4d335e8a2ad283a3f7effced-Paper.pdf">Paper</a> | <a href="https://venturebeat.com/2020/07/21/frugalml-switches-between-apis-to-improve-image-classification-and-cut-costs/">Blog Post</a> | <a href="https://github.com/lchen001/FrugalML">Website</a>
<br /><strong>Keywords</strong>: machine learning as a service, ensemble learning, meta learning, systems for machine learning</p>
<hr />
<h4 id="generative-3d-part-assembly-via-dynamic-graph-learning"><a href="https://arxiv.org/abs/2006.07793">Generative 3D Part Assembly via Dynamic Graph Learning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img19" />
<strong>Authors</strong>: Jialei Huang, Guanqi Zhan, Qingnan Fan, Kaichun Mo, Lin Shao, Baoquan Chen, Leonidas Guibas, Hao Dong
<br /><strong>Contact</strong>: fqnchina@gmail.com
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.07793">Paper</a>
<br /><strong>Keywords</strong>: 3d part assembly, dynamic graph learning</p>
<hr />
<h4 id="generative-3d-part-assembly-via-dynamic-graph-learning-1"><a href="https://arxiv.org/abs/2006.07793">Generative 3D Part Assembly via Dynamic Graph Learning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img3" />
<strong>Authors</strong>: Jialei Huang*, Guanqi Zhan*, Qingnan Fan, Kaichun Mo, Lin Shao, Baoquan Chen, Leonidas J. Guibas, Hao Dong
<br /><strong>Contact</strong>: kaichun@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.07793">Paper</a> | <a href="https://hyperplane-lab.github.io/Generative-3D-Part-Assembly/">Website</a>
<br /><strong>Keywords</strong>: 3d part assembly, graph neural network</p>
<hr />
<h4 id="gradient-surgery-for-multi-task-learning"><a href="https://arxiv.org/pdf/2001.06782.pdf">Gradient Surgery for Multi-Task Learning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img7" />
<strong>Authors</strong>: Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, Chelsea Finn
<br /><strong>Contact</strong>: tianheyu@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2001.06782.pdf">Paper</a> | <a href="https://github.com/tianheyu927/PCGrad">Website</a>
<br /><strong>Keywords</strong>: multi-task learning, deep reinforcement learning</p>
<hr />
<h4 id="hippo-recurrent-memory-with-optimal-polynomial-projections"><a href="https://arxiv.org/abs/2008.07669">HiPPO: Recurrent Memory with Optimal Polynomial Projections</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img39" />
<strong>Authors</strong>: Albert Gu*, Tri Dao*, Stefano Ermon, Atri Rudra, Chris Ré
<br /><strong>Contact</strong>: albertgu@stanford.edu, trid@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2008.07669">Paper</a> | <a href="https://hazyresearch.stanford.edu/hippo">Blog Post</a>
<br /><strong>Keywords</strong>: representation learning, time series, recurrent neural networks, lstm, orthogonal polynomials</p>
<hr />
<h4 id="identifying-learning-rules-from-neural-network-observables"><a href="https://arxiv.org/abs/2010.11765">Identifying Learning Rules From Neural Network Observables</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img13" />
<strong>Authors</strong>: Aran Nayebi, Sanjana Srivastava, Surya Ganguli, Daniel L.K. Yamins
<br /><strong>Contact</strong>: anayebi@stanford.edu
<br /><strong>Award nominations:</strong> Spotlight Presentation
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.11765">Paper</a> | <a href="https://github.com/neuroailab/lr-identify">Website</a>
<br /><strong>Keywords</strong>: computational neuroscience, learning rule, deep networks</p>
<hr />
<h4 id="improved-techniques-for-training-score-based-generative-models"><a href="https://arxiv.org/pdf/2006.09011.pdf">Improved Techniques for Training Score-Based Generative Models</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img28" />
<strong>Authors</strong>: Yang Song, Stefano Ermon
<br /><strong>Contact</strong>: songyang@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2006.09011.pdf">Paper</a>
<br /><strong>Keywords</strong>: score-based generative modeling, score matching, deep generative models</p>
<hr />
<h4 id="language-through-a-prism-a-spectral-approach-for-multiscale-language-representations"><a href="https://arxiv.org/abs/2011.04823">Language Through a Prism: A Spectral Approach for Multiscale Language Representations</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img12" />
<strong>Authors</strong>: Alex Tamkin, Dan Jurafsky, Noah Goodman
<br /><strong>Contact</strong>: atamkin@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2011.04823">Paper</a>
<br /><strong>Keywords</strong>: bert, signal processing, self-supervised learning, interpretability, multiscale</p>
<hr />
<h4 id="large-scale-methods-for-distributionally-robust-optimization"><a href="https://arxiv.org/pdf/2010.05893.pdf">Large-Scale Methods for Distributionally Robust Optimization</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img14" />
<strong>Authors</strong>: Daniel Levy, Yair Carmon, John Duchi, Aaron Sidford
<br /><strong>Contact</strong>: danilevy@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2010.05893.pdf">Paper</a>
<br /><strong>Keywords</strong>: robustness, dro, optimization, large-scale, optimal</p>
<hr />
<h4 id="learning-physical-graph-representations-from-visual-scenes"><a href="https://proceedings.neurips.cc/paper/2020/hash/4324e8d0d37b110ee1a4f1633ac52df5-Abstract.html">Learning Physical Graph Representations from Visual Scenes</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img0" />
<strong>Authors</strong>: Daniel Bear, Chaofei Fan, Damian Mrowca, Yunzhu Li, Seth Alter, Aran Nayebi, Jeremy Schwartz, Li Fei-Fei, Jiajun Wu, Josh Tenenbaum, Daniel L. Yamins
<br /><strong>Contact</strong>: dbear@stanford.edu
<br /><strong>Links:</strong> <a href="https://proceedings.neurips.cc/paper/2020/hash/4324e8d0d37b110ee1a4f1633ac52df5-Abstract.html">Paper</a> | <a href="https://neuroailab.github.io/physical-scene-graphs/">Blog Post</a> | <a href="https://github.com/neuroailab/PSGNets">Website</a>
<br /><strong>Keywords</strong>: structure learning, graph learning, visual scene representations, unsupervised learning, unsupervised segmentation, object-centric representation, intuitive physics</p>
<hr />
<h4 id="mopo-model-based-offline-policy-optimization"><a href="https://arxiv.org/pdf/2005.13239.pdf">MOPO: Model-based Offline Policy Optimization</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img6" />
<strong>Authors</strong>: Tianhe Yu*, Garrett Thomas*, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, Tengyu Ma
<br /><strong>Contact</strong>: tianheyu@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2005.13239.pdf">Paper</a> | <a href="https://github.com/tianheyu927/mopo">Website</a>
<br /><strong>Keywords</strong>: offline reinforcement learning, model-based reinforcement learning</p>
<hr />
<h4 id="measuring-robustness-to-natural-distribution-shifts-in-image-classification"><a href="https://arxiv.org/abs/2007.00644">Measuring Robustness to Natural Distribution Shifts in Image Classification</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img4" />
<strong>Authors</strong>: Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, Ludwig Schmidt
<br /><strong>Contact</strong>: rtaori@stanford.edu
<br /><strong>Award nominations:</strong> Spotlight
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2007.00644">Paper</a> | <a href="https://modestyachts.github.io/imagenet-testbed/">Website</a>
<br /><strong>Keywords</strong>: machine learning, robustness, image classification</p>
<hr />
<h4 id="minibatch-stochastic-approximate-proximal-point-methods"><a href="https://proceedings.neurips.cc//paper_files/paper/2020/hash/fa2246fa0fdf0d3e270c86767b77ba1b-Abstract.html">Minibatch Stochastic Approximate Proximal Point Methods</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img36" />
<strong>Authors</strong>: Hilal Asi, Karan Chadha, Gary Cheng, John Duchi
<br /><strong>Contact</strong>: chenggar@stanford.edu
<br /><strong>Award nominations:</strong> Spotlight talk
<br /><strong>Links:</strong> <a href="https://proceedings.neurips.cc//paper_files/paper/2020/hash/fa2246fa0fdf0d3e270c86767b77ba1b-Abstract.html">Paper</a>
<br /><strong>Keywords</strong>: stochastic optimization, sgd, aprox</p>
<hr />
<h4 id="model-based-adversarial-meta-reinforcement-learning"><a href="https://proceedings.neurips.cc/paper/2020/file/73634c1dcbe056c1f7dcf5969da406c8-Paper.pdf">Model-based Adversarial Meta-Reinforcement Learning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img38" />
<strong>Authors</strong>: Zichuan Lin, Garrett Thomas, Guangwen Yang, Tengyu Ma
<br /><strong>Contact</strong>: lzcthu12@gmail.com, gwthomas@stanford.edu
<br /><strong>Links:</strong> <a href="https://proceedings.neurips.cc/paper/2020/file/73634c1dcbe056c1f7dcf5969da406c8-Paper.pdf">Paper</a>
<br /><strong>Keywords</strong>: model-based rl, meta-rl, minimax</p>
<hr />
<h4 id="multi-plane-program-induction-with-3d-box-priors"><a href="http://bpi.csail.mit.edu/data/paper/2020NeurIPS-BPI.pdf">Multi-Plane Program Induction with 3D Box Priors</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img9" />
<strong>Authors</strong>: Yikai Li, Jiayuan Mao, Xiuming Zhang, William T. Freeman, Joshua B. Tenenbaum, Noah Snavely, Jiajun Wu
<br /><strong>Contact</strong>: jiajunwu@cs.stanford.edu
<br /><strong>Links:</strong> <a href="http://bpi.csail.mit.edu/data/paper/2020NeurIPS-BPI.pdf">Paper</a> | <a href="http://bpi.csail.mit.edu/data/img/intro.mp4">Video</a> | <a href="http://bpi.csail.mit.edu/">Website</a>
<br /><strong>Keywords</strong>: visual program induction, 3d vision, image editing</p>
<hr />
<h4 id="multi-label-contrastive-predictive-coding"><a href="https://arxiv.org/abs/2007.09852">Multi-label Contrastive Predictive Coding</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img25" />
<strong>Authors</strong>: Jiaming Song, Stefano Ermon
<br /><strong>Contact</strong>: jiaming.tsong@gmail.com
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2007.09852">Paper</a>
<br /><strong>Keywords</strong>: representation learning, mutual information</p>
<hr />
<h4 id="neural-bridge-sampling-for-evaluating-safety-critical-autonomous-systems"><a href="https://arxiv.org/abs/2008.10581">Neural Bridge Sampling for Evaluating Safety-Critical Autonomous Systems</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img41" />
<strong>Authors</strong>: Aman Sinha, Matthew O’Kelly, Russ Tedrake, John Duchi
<br /><strong>Contact</strong>: amans@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2008.10581">Paper</a>
<br /><strong>Keywords</strong>: safety, probabilistic methods, autonomous systems</p>
<hr />
<h4 id="neuron-shapley-discovering-the-responsible-neurons"><a href="https://papers.nips.cc/paper/2020/file/41c542dfe6e4fc3deb251d64cf6ed2e4-Paper.pdf">Neuron Shapley: Discovering the Responsible Neurons</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img32" />
<strong>Authors</strong>: Amirata Ghorbani, James Zou
<br /><strong>Contact</strong>: amiratag@stanford.edu
<br /><strong>Links:</strong> <a href="https://papers.nips.cc/paper/2020/file/41c542dfe6e4fc3deb251d64cf6ed2e4-Paper.pdf">Paper</a>
<br /><strong>Keywords</strong>: interpretability, deep learning, shapley value</p>
<hr />
<h4 id="no-subclass-left-behind-fine-grained-robustness-in-coarse-grained-classification-problems"><a href="https://arxiv.org/abs/2011.12945">No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img23" />
<strong>Authors</strong>: Nimit Sharad Sohoni, Jared Alexander Dunnmon, Geoffrey Angus, Albert Gu, Christopher Ré
<br /><strong>Contact</strong>: nims@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2011.12945">Paper</a> | <a href="https://hazyresearch.stanford.edu/hidden-stratification">Blog Post</a> | <a href="https://youtu.be/dI6nByor3rY">Video</a>
<br /><strong>Keywords</strong>: classification, robustness, clustering, neural feature representations</p>
<hr />
<h4 id="off-policy-policy-evaluation-for-sequential-decisions-under-unobserved-confounding"><a href="https://papers.nips.cc/paper/2020/hash/da21bae82c02d1e2b8168d57cd3fbab7-Abstract.html">Off-policy Policy Evaluation For Sequential Decisions Under Unobserved Confounding</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img26" />
<strong>Authors</strong>: Hongseok Namkoong, Ramtin Keramati, Steve Yadlowsky, Emma Brunskill
<br /><strong>Contact</strong>: keramati@stanford.edu
<br /><strong>Links:</strong> <a href="https://papers.nips.cc/paper/2020/hash/da21bae82c02d1e2b8168d57cd3fbab7-Abstract.html">Paper</a>
<br /><strong>Keywords</strong>: off-policy policy evaluation, unobserved confounding, reinforcement learning</p>
<hr />
<h4 id="one-solution-is-not-all-you-need-few-shot-extrapolation-via-structured-maxent-rl"><a href="https://arxiv.org/abs/2010.14484">One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img21" />
<strong>Authors</strong>: Saurabh Kumar, Aviral Kumar, Sergey Levine, Chelsea Finn
<br /><strong>Contact</strong>: szk@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.14484">Paper</a>
<br /><strong>Keywords</strong>: robustness, diversity, reinforcement learning</p>
<hr />
<h4 id="point-process-models-for-sequence-detection-in-high-dimensional-neural-spike-trains"><a href="https://arxiv.org/abs/2010.04875">Point process models for sequence detection in high-dimensional neural spike trains</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img2" />
<strong>Authors</strong>: Alex H. Williams, Anthony Degleris, Yixin Wang, Scott W. Linderman
<br /><strong>Contact</strong>: ahwillia@stanford.edu
<br /><strong>Award nominations:</strong> Selected for Oral Presentation
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.04875">Paper</a> | <a href="https://github.com/lindermanlab/PPSeq.jl">Website</a>
<br /><strong>Keywords</strong>: bayesian nonparametrics, unsupervised learning</p>
<hr />
<h4 id="predictive-coding-in-balanced-neural-networks-with-noise-chaos-and-delays"><a href="https://papers.nips.cc/paper/2020/file/c236337b043acf93c7df397fdb9082b3-Paper.pdf">Predictive coding in balanced neural networks with noise, chaos and delays</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img24" />
<strong>Authors</strong>: Jonathan Kadmon, Jonathan Timcheck, Surya Ganguli
<br /><strong>Contact</strong>: kadmonj@stanford.edu
<br /><strong>Links:</strong> <a href="https://papers.nips.cc/paper/2020/file/c236337b043acf93c7df397fdb9082b3-Paper.pdf">Paper</a>
<br /><strong>Keywords</strong>: neuroscience, predictive coding, chaos</p>
<hr />
<h4 id="probabilistic-circuits-for-variational-inference-in-discrete-graphical-models"><a href="https://arxiv.org/abs/2010.11446">Probabilistic Circuits for Variational Inference in Discrete Graphical Models</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img34" />
<strong>Authors</strong>: Andy Shih, Stefano Ermon
<br /><strong>Contact</strong>: andyshih@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.11446">Paper</a>
<br /><strong>Keywords</strong>: variational inference, discrete, high-dimensions, sum product networks, probabilistic circuits, graphical models</p>
<hr />
<h4 id="provably-good-batch-off-policy-reinforcement-learning-without-great-exploration"><a href="https://proceedings.neurips.cc/paper/2020/file/0dc23b6a0e4abc39904388dd3ffadcd1-Paper.pdf">Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img30" />
<strong>Authors</strong>: Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill
<br /><strong>Contact</strong>: yaoliu@stanford.edu
<br /><strong>Links:</strong> <a href="https://proceedings.neurips.cc/paper/2020/file/0dc23b6a0e4abc39904388dd3ffadcd1-Paper.pdf">Paper</a>
<br /><strong>Keywords</strong>: reinforcement learning, off-policy, batch reinforcement learning</p>
<hr />
<h4 id="pruning-neural-networks-without-any-data-by-iteratively-conserving-synaptic-flow"><a href="https://papers.nips.cc/paper/2020/hash/46a4378f835dc8040c8057beb6a2da52-Abstract.html">Pruning neural networks without any data by iteratively conserving synaptic flow</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img22" />
<strong>Authors</strong>: Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, Surya Ganguli
<br /><strong>Contact</strong>: kunin@stanford.edu
<br /><strong>Links:</strong> <a href="https://papers.nips.cc/paper/2020/hash/46a4378f835dc8040c8057beb6a2da52-Abstract.html">Paper</a> | <a href="https://www.youtube.com/watch?v=8l-TDqpoUQs">Video</a> | <a href="https://github.com/ganguli-lab/Synaptic-Flow">Website</a>
<br /><strong>Keywords</strong>: network pruning, sparse initialization, lottery ticket</p>
<hr />
<h4 id="robust-sub-gaussian-principal-component-analysis-and-width-independent-schatten-packing"><a href="https://arxiv.org/abs/2006.06980">Robust Sub-Gaussian Principal Component Analysis and Width-Independent Schatten Packing</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img31" />
<strong>Authors</strong>: Arun Jambulapati, Jerry Li, Kevin Tian
<br /><strong>Contact</strong>: kjtian@stanford.edu
<br /><strong>Award nominations:</strong> Spotlight presentation
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.06980">Paper</a>
<br /><strong>Keywords</strong>: robust statistics, principal component analysis, positive semidefinite programming</p>
<hr />
<h4 id="self-training-avoids-using-spurious-features-under-domain-shift"><a href="https://arxiv.org/abs/2006.10032">Self-training Avoids Using Spurious Features Under Domain Shift</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img17" />
<strong>Authors</strong>: Yining Chen*, Colin Wei*, Ananya Kumar, Tengyu Ma (*equal contribution)
<br /><strong>Contact</strong>: cynnjjs@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.10032">Paper</a>
<br /><strong>Keywords</strong>: self-training, pseudo-labeling, domain shift, robustness</p>
<hr />
<h4 id="wasserstein-distances-for-stereo-disparity-estimation"><a href="https://arxiv.org/abs/2007.03085">Wasserstein Distances for Stereo Disparity Estimation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img40" />
<strong>Authors</strong>: Divyansh Garg, Yan Wang, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger, Wei-Lun Chao
<br /><strong>Contact</strong>: divgarg@stanford.edu
<br /><strong>Award nominations:</strong> Spotlight
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2007.03085">Paper</a> | <a href="https://slideslive.com/38937842">Video</a> | <a href="https://div99.github.io/W-Stereo-Disp/">Website</a>
<br /><strong>Keywords</strong>: depth estimation, disparity estimation, autonomous driving, 3d object detection, statistical learning</p>
<hr />
<p>We look forward to seeing you at NeurIPS 2020!</p>
Sun, 06 Dec 2020 00:00:00 -0800Learning from Language Explanations
/blog/learning-from-language/
/blog/learning-from-language/<p>Imagine you’re a machine learning practitioner and you want to solve some classification problem, like classifying groups of colored squares as being either 1s or 0s. Here’s what you would typically do: collect a large dataset of examples, label the data, and train a classifier:</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 700px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/examples.jpg" /></p>
</div></figure>
<p><em>But humans don’t learn like this</em>. We have a very powerful and intuitive mechanism for communicating information about the world - <strong>language</strong>!</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 500px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/language.jpg" /></p>
</div></figure>
<p>With just the phrase <em>at least 2 red squares</em>, we’ve summarized the entire dataset presented above in a much more efficient manner.</p>
<p><strong>Language is a crucial medium for human learning:</strong> we use it to <a href="https://www.npr.org/2010/01/18/122701268/i-have-a-dream-speech-in-its-entirety">convey beliefs</a> about the world, <a href="https://www.nature.com/articles/ncomms7029">teach others</a>, and describe things that are hard to <a href="https://en.wikipedia.org/wiki/Saturn">experience directly</a>. Thus, language ought to be a simple and effective way to supervise machine learning models. Yet past approaches to learning from language have struggled to scale up to the general tasks targeted by modern deep learning systems and the freeform language explanations used in these domains. In two short papers presented at ACL 2020 this year, we use deep neural models to learn from language explanations to help tackle a variety of challenging tasks in natural language processing (NLP) and computer vision.</p>
<ul>
<li><a href="https://arxiv.org/abs/2005.01932">ExpBERT: Representation Engineering with Natural Language Explanations</a></li>
<li><a href="https://arxiv.org/abs/1911.02683">Shaping Visual Representations with Language for Few-shot Classification</a></li>
</ul>
<h3 id="whats-the-challenge"><strong>What’s the challenge?</strong></h3>
<p>Given that language is such an intuitive interface for humans to teach others,
why is it so hard to use language for machine learning?</p>
<p>The principal challenge is the <a href="https://arxiv.org/html/cs/9906002">grounding
problem</a>: understanding language
explanations in the context of other inputs. Building models that can
understand rich and ambiguous language is tricky enough, but building models
that can relate language to the surrounding world is even more challenging. For
instance, given the explanation <em>at least two red squares</em>, a model must not
only understand the terms <em>red</em> and <em>square</em>, but also how they refer to
particular parts of (often complex) inputs.</p>
<p>Past work (<a href="https://www.aclweb.org/anthology/D17-1161">1</a>,
<a href="https://www.aclweb.org/anthology/P18-1029.pdf">2</a>,
<a href="https://arxiv.org/abs/1805.03818">3</a>) has relied on <a href="https://cs.stanford.edu/~pliang/papers/executable-cacm2016.pdf">semantic
parsers</a> which
convert natural language statements (e.g. <em>at least two red squares</em>) to formal
logical representations (e.g. <code class="highlighter-rouge">Count(Square AND Red) >= 2</code>). If we can easily
check whether explanations apply to our inputs by executing these logical
formulas, we can use our explanations as features to train our model.
However, semantic parsers only work on simple domains
where we can hand-engineer a logical grammar of explanations we might expect to
see. They struggle to handle richer and vaguer language or scale up to more
complex inputs, such as images.</p>
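<p>As a rough illustration of what "executing a logical formula" means here, the sketch below hand-codes the logical form for <em>at least two red squares</em> and runs it on a toy input to produce a binary feature. The <code class="highlighter-rouge">Shape</code> class and the function names are hypothetical and exist only for this example; a real semantic parser would produce the logical form from text rather than have it written by hand.</p>

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Shape:
    """Toy symbolic description of one object in an input (hypothetical)."""
    kind: str   # e.g. "square", "circle"
    color: str  # e.g. "red", "blue"


def count_red_squares(shapes: List[Shape]) -> int:
    """Count(Square AND Red): number of shapes that are both square and red."""
    return sum(1 for s in shapes if s.kind == "square" and s.color == "red")


def at_least_two_red_squares(shapes: List[Shape]) -> int:
    """Executable form of the explanation; returns a 0/1 indicator feature."""
    return int(count_red_squares(shapes) >= 2)


example = [Shape("square", "red"), Shape("circle", "red"), Shape("square", "red")]
feature = at_least_two_red_squares(example)  # 1: two red squares are present
```

<p>The indicator can then be appended to whatever other features the classifier uses. Note that this only works because the input is already symbolic, which is exactly the limitation discussed above.</p>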
<p>Fortunately, modern deep neural language models such as
<a href="https://arxiv.org/abs/1810.04805">BERT</a> are beginning to show promise at
solving many language understanding tasks. Our papers propose to alleviate the
grounding problem by using neural language models that are either trained to
ground language explanations in the domain of interest, or come pre-trained
with general-purpose “knowledge” that can be used to interpret explanations. We
will show that these neural models allow us to learn from richer and more
diverse language for more challenging settings.</p>
<h3 id="representation-engineering-with-natural-language-explanations"><strong>Representation Engineering with Natural Language Explanations</strong></h3>
<p>In our <a href="https://arxiv.org/abs/2005.01932">first paper</a>, we examine how to build text classifiers with language
explanations.
Consider the task of <em>relation extraction</em>, where we are given a
short paragraph and must identify whether two people mentioned in the
paragraph are <strong>married</strong>. While state-of-the-art NLP models can likely solve
this task from data alone, humans might use language to describe ways to tell
whether two people are married—for example, <em>people who go on honeymoons are
typically married</em>. Can such language explanations be used to train better
classifiers?</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 700px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/expbert_dataset.jpg" /></p>
</div></figure>
<p>In the same way that we might take an input <script type="math/tex">x</script>, and extract features (e.g.
the presence of certain words) to train a model, we can use explanations to
provide additional features. For example, knowing that honeymoons are relevant
for this task, if we can create a honeymoon feature that reliably activates
whenever the two people in a paragraph are described as going on a honeymoon,
this should be a useful signal for training a better model.</p>
<p>But creating such features requires some sort of explanation <strong>interpretation</strong>
mechanism that tells us whether an explanation is true for an input. Semantic
parsers are one such tool: given <em><script type="math/tex">A</script> and <script type="math/tex">B</script> went on honeymoon</em>, we could
parse this explanation into a logical form which, when run on an input,
produces 1 if the word <em>honeymoon</em> appears between <script type="math/tex">A</script> and <script type="math/tex">B</script>. But what about
a vaguer explanation like <em><script type="math/tex">A</script> and <script type="math/tex">B</script> are in love</em>? How can we parse this?</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 800px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/semantic_parsing_examples.jpg" /></p>
</div></figure>
<p>While semantic parsing is efficient and accurate in small domains, it can be
overly <em>brittle</em>, as it can only interpret explanations which adhere to a fixed
set of grammatical rules and functions that we must specify in advance (e.g.
<code class="highlighter-rouge">contains</code> and <code class="highlighter-rouge">extract_text</code>).
Instead, we turn to the soft reasoning
capabilities of <a href="https://arxiv.org/abs/1810.04805">BERT</a>, a neural language model. BERT is particularly effective
at the task of <em>textual entailment</em>: determining whether a sentence implies or
contradicts another sentence (e.g. does <em>She ate pizza</em> imply that <em>She ate
food?</em> Yes!). In our proposed <strong>ExpBERT</strong> model, we take a BERT model
trained for textual entailment, and instead ask it to identify whether a
paragraph in our task <em>entails</em> an explanation. The features produced by BERT
during this process replace the indicator features produced by the semantic
parser above.</p>
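<p>The feature-construction pattern can be sketched as follows: score the input paragraph against every explanation, and hand the resulting vector to a downstream classifier. To keep the example self-contained, the scorer here is a toy word-overlap function, which is an assumption made purely for illustration; in ExpBERT this role is played by a BERT model trained for textual entailment, which yields learned features rather than a single overlap score. All names below are hypothetical.</p>

```python
from typing import Callable, List


def toy_entailment_score(paragraph: str, explanation: str) -> float:
    """Stand-in for an entailment model (NOT BERT): the fraction of
    explanation words that also appear in the paragraph."""
    paragraph_words = set(paragraph.lower().split())
    explanation_words = explanation.lower().split()
    matches = sum(word in paragraph_words for word in explanation_words)
    return matches / len(explanation_words)


def explanation_features(
    paragraph: str,
    explanations: List[str],
    score_fn: Callable[[str, str], float] = toy_entailment_score,
) -> List[float]:
    """One feature per explanation, in the order the explanations are given."""
    return [score_fn(paragraph, explanation) for explanation in explanations]


explanations = ["they went on a honeymoon", "they are in love"]
paragraph = "Alice and Bob went on a honeymoon in Hawaii"
features = explanation_features(paragraph, explanations)
```

<p>Swapping <code class="highlighter-rouge">score_fn</code> for a stronger model changes the quality of the features but not the overall pipeline, which is the point of the design: explanations are never parsed into a fixed grammar.</p>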
<figure class="figure"><div class="figure__main">
<video class="postimage_unpadded" style="max-width: 800px" autoplay="" muted="" loop="" playsinline="">
<source src="/blog/assets/img/posts/2020-11-23-learning-from-language/expbert.webm" type="video/webm" />
<source src="/blog/assets/img/posts/2020-11-23-learning-from-language/expbert.mp4" type="video/mp4" />
<p>Your browser doesn't support HTML5 video. Here is a <a href="/blog/assets/img/posts/2020-11-23-learning-from-language/expbert.mp4">link to the video</a> instead, which you can download and run with a player like <a href="https://www.videolan.org/vlc/index.html">VLC</a></p>
</video>
</div></figure>
<p>Does the soft reasoning power of BERT improve over semantic parsing? On the
marriage identification task, we find that <strong>ExpBERT</strong> leads to substantial
improvements over a classifier that is trained on the input features only (No
Explanations). Importantly, using a semantic parser to try to parse
explanations doesn’t help much, since there are general explanations (<em>in
love</em>) that are difficult to convert to logical forms.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 285px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/expbert_results.jpg" /></p>
</div></figure>
<p>In the full paper, we compare to more baselines, explore larger relation
extraction tasks (e.g. <a href="https://nlp.stanford.edu/projects/tacred/">TACRED</a>),
conduct ablation studies to understand what kinds of explanations are
important, and examine how much more efficient explanations are compared to
additional data.</p>
<h3 id="shaping-visual-representations-with-language"><strong>Shaping Visual Representations with Language</strong></h3>
<p>The work we’ve just described uses natural language explanations for a single
task like marriage identification. However, <a href="https://plato.stanford.edu/entries/language-thought/">work in cognitive
science</a> suggests that
language also equips us with the right features and abstractions that help us
solve <em>future</em> tasks.
For example, explanations that indicate whether person <script type="math/tex">A</script> is married to
<script type="math/tex">B</script> also highlight other concepts that are crucial to human relationships:
<em>children</em>, <em>daughters</em>, <em>honeymoons</em>, and more. Knowing these additional
concepts is not just useful for identifying married people; it is also
important if we would later like to identify other relationships
(e.g. <em>siblings</em>, <em>mother</em>, <em>father</em>).</p>
<p>In machine learning, we might ask: how can language point out the right
features for challenging and underspecified domains, if we
ultimately wish to solve <em>new tasks</em> where no language is available? In our
<a href="https://arxiv.org/abs/1911.02683">second paper</a>, we explore this setting,
additionally increasing the challenge by seeing whether language can improve
the learning of representations across modalities—here, vision.</p>
<p>We’re specifically interested in few-shot visual reasoning tasks like the following (here, from the <a href="https://arxiv.org/abs/1704.04517">ShapeWorld</a> dataset):</p>
<figure class="figure"><div class="figure__main">
<video class="postimage_unpadded" style="max-width: 500px" autoplay="" muted="" loop="" playsinline="">
<source src="/blog/assets/img/posts/2020-11-23-learning-from-language/shapeworld.webm" type="video/webm" />
<source src="/blog/assets/img/posts/2020-11-23-learning-from-language/shapeworld.mp4" type="video/mp4" />
<p>Your browser doesn't support HTML5 video. Here is a <a href="/blog/assets/img/posts/2020-11-23-learning-from-language/shapeworld.mp4">link to the video</a> instead, which you can download and run with a player like <a href="https://www.videolan.org/vlc/index.html">VLC</a></p>
</video>
</div></figure>
<p>Given a small training set of examples of a visual concept, the task is to
determine whether a held-out test image expresses the same concept. Now, what
if we assume access to language explanations of the relevant visual concepts at
training time? Can we use these to learn a better model, <em>even if no language
is available at test time</em>?</p>
<p>We frame this as a <a href="https://arxiv.org/abs/1904.04232"><em>meta-learning</em></a> task:
instead of training and testing a model on a single task, we
train a model on a <em>set</em> of tasks, each with a small training set and
an accompanying language description (the <em>meta-train</em> set). We then test
generalization to a <em>meta-test</em> set of unseen tasks, for which no language is
available:</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 760px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/metalearning.jpg" /></p>
</div></figure>
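<p>One way to picture the meta-train and meta-test split described above is as a small data structure in which every task carries its own tiny training set, and only meta-train tasks carry a description. This is a minimal sketch; the class name, field names, and file names are hypothetical and not taken from the paper's code.</p>

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    """One few-shot task (hypothetical layout)."""
    support_images: List[str]          # small training set for one concept
    query_image: str                   # held-out image to classify
    query_label: bool                  # does the query express the concept?
    description: Optional[str] = None  # language, present only at meta-train time


# Meta-train tasks come with a language description of the concept.
meta_train = [
    Task(["a.png", "b.png"], "c.png", True, "at least two red squares"),
]

# Meta-test tasks are unseen concepts with no language available.
meta_test = [
    Task(["d.png", "e.png"], "f.png", False),
]
```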
<p>First, let’s look at how we might solve this task without language. One typical
approach is <strong>Prototypical Networks</strong>, where we learn some model <script type="math/tex">f_\theta</script>
(here, a <a href="https://arxiv.org/abs/1409.1556">deep convolutional neural network</a>)
that embeds the training images, averages them, and compares to an embedding of
the test image:</p>
<figure class="figure"><div class="figure__main">
<video class="postimage_unpadded" style="max-width: 800px" autoplay="" muted="" loop="" playsinline="">
<source src="/blog/assets/img/posts/2020-11-23-learning-from-language/lsl.webm" type="video/webm" />
<source src="/blog/assets/img/posts/2020-11-23-learning-from-language/lsl.mp4" type="video/mp4" />
<p>Your browser doesn't support HTML5 video. Here is a <a href="/blog/assets/img/posts/2020-11-23-learning-from-language/lsl.mp4">link to the video</a> instead, which you can download and run with a player like <a href="https://www.videolan.org/vlc/index.html">VLC</a></p>
</video>
</div></figure>
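<p>The embed, average, and compare steps of the prototype-based approach can be sketched in a few lines. This is a toy version under stated assumptions: <code class="highlighter-rouge">embed</code> is a placeholder for the learned network <script type="math/tex">f_\theta</script> (here the "images" are already feature vectors), and the comparison is assumed to be a sigmoid over a dot product, which is one reasonable choice rather than a detail confirmed by this post.</p>

```python
import math
from typing import List, Sequence

Vector = List[float]


def embed(image: Sequence[float]) -> Vector:
    """Placeholder for f_theta: the toy 'image' is already a feature vector."""
    return [float(value) for value in image]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors (the prototype)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def predict_same_concept(train_images, test_image) -> float:
    """Score in (0, 1) that the test image expresses the training concept."""
    prototype = mean_vector([embed(image) for image in train_images])
    similarity = dot(prototype, embed(test_image))
    return 1.0 / (1.0 + math.exp(-similarity))  # assumed sigmoid comparison


train_images = [[1.0, 0.0], [3.0, 0.0]]
match_score = predict_same_concept(train_images, [1.0, 5.0])
mismatch_score = predict_same_concept(train_images, [-1.0, 0.0])
```

<p>With these toy vectors the first test image points in the same direction as the prototype and scores above 0.5, while the second points away from it and scores below 0.5.</p>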
<p>To use language, we propose a simple approach called <strong>Language Shaped Learning</strong>
(LSL): if we have access to explanations at training time, we encourage the
model to learn representations that are not only helpful for classification,
but are <em>predictive of the language explanations</em>. We do this by introducing an
<em>auxiliary</em> training objective (i.e. it is not related to the ultimate task of
interest), where we simultaneously train a recurrent neural network (RNN)
decoder to predict the explanation(s) from the representation of the
input images. Crucially, training this decoder depends on the
parameters of our image model <script type="math/tex">f_\theta</script>, so this process should encourage
<script type="math/tex">f_\theta</script> to better encode the features and abstractions exposed in
language.</p>
<p>In effect, we are training the model to “think out loud” when representing
concepts at training time. At test time, we simply discard the RNN decoder, and
do classification as normal with the “language-shaped” image embeddings.</p>
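<p>Written out, the training objective described above is a sum of two terms. The notation here is our own shorthand for illustration: <script type="math/tex">\phi</script> denotes the RNN decoder parameters and <script type="math/tex">\lambda</script> is an assumed weighting hyperparameter that trades off the two losses.</p>

```latex
\mathcal{L}(\theta, \phi)
  = \mathcal{L}_{\text{cls}}(\theta)
  + \lambda \, \mathcal{L}_{\text{lang}}(\theta, \phi)
```

<p>Because the language term depends on <script type="math/tex">\theta</script>, its gradient flows back into the image model during training. At test time only <script type="math/tex">f_\theta</script> is used, so the decoder parameters <script type="math/tex">\phi</script> play no role in prediction.</p>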
<p>We apply this model to both the ShapeWorld dataset described above, and a more
realistic <a href="http://www.vision.caltech.edu/visipedia/CUB-200-2011.html">Birds</a>
dataset, with real images and human language:</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 800px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/birds.jpg" /></p>
</div></figure>
<p>In both cases, this auxiliary training objective improves performance over a
no-explanation baseline (<strong>Meta</strong>), and <a href="https://arxiv.org/abs/1711.00482"><em>Learning with Latent
Language</em></a> (<strong>L3</strong>), a similar model proposed
for this setting that uses language as a discrete bottleneck (see the paper for
details):</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 400px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/lsl_results.jpg" /></p>
</div></figure>
<p>In the full paper, we also explore which <em>parts</em> of language are most important
(spoiler: a little bit of everything), and <em>how much</em> language is needed for
LSL to improve over models that don’t use language (spoiler: surprisingly little!).</p>
<h3 id="moving-forward"><strong>Moving Forward</strong></h3>
<p>As NLP systems grow in their ability to understand and produce language, so too
grows the potential for machine learning systems to <em>learn from language</em> to
solve other challenging tasks. In the papers above, we’ve shown that deep
neural language models can be used to successfully learn from language
explanations to improve generalization across a variety of tasks in vision and
NLP.</p>
<p>We think this is an exciting new avenue for training machine learning models,
and similar ideas are already being explored in areas such as reinforcement
learning (<a href="https://arxiv.org/abs/1910.08210">4</a>,
<a href="https://arxiv.org/abs/1906.03926">5</a>). We envision a future where in order to
solve a machine learning task, we no longer have to collect a large labeled
dataset, but instead interact naturally and expressively with a model in the
same way that humans have interacted with each other for millennia—<em>through
language</em>.</p>
<h3 id="acknowledgments"><strong>Acknowledgments</strong></h3>
<p>Thanks to our coauthors (Pang Wei Koh, Percy Liang, and Noah Goodman), and to
Nelson Liu, Pang Wei Koh, and the rest of the SAIL blog team for reviewing and
publishing this blog post. This research was supported in part by the <a href="https://research.fb.com/fellowship/">Facebook
Fellowship</a> (to Pang Wei Koh), the <a href="https://www.nsfgrfp.org/">NSF Graduate Research Fellowship</a> (to Jesse Mu), <a href="https://www.tri.global/">Toyota Research
Institute</a>, and the <a href="https://www.onr.navy.mil/">Office of Naval Research</a>.</p>
Mon, 23 Nov 2020 00:00:00 -0800Stanford AI Lab Papers and Talks at CoRL 2020
/blog/corl-2020/
/blog/corl-2020/<figure class="figure"><div class="figure__main">
<p><img class="postimagethird" src="/blog/assets/img/posts/2020-11-16-corl-2020/logo.png" /></p>
</div></figure>
<p>The <a href="https://www.robot-learning.org/">Conference on Robot Learning</a> (CoRL) 2020 is being hosted virtually from November 16th - November 18th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!</p>
<h2 id="list-of-accepted-papers">List of Accepted Papers</h2>
<hr />
<h4 id="learning-3d-dynamic-scene-representations-for-robot-manipulation"><a href="https://arxiv.org/pdf/2011.01968.pdf">Learning 3D Dynamic Scene Representations for Robot Manipulation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img0" />
<strong>Authors</strong>: Zhenjia Xu, Zhanpeng He, Jiajun Wu, Shuran Song
<br /><strong>Contact</strong>: jiajunwu@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2011.01968.pdf">Paper</a> | <a href="https://www.youtube.com/watch?v=GQjYG3nQJ80">Video</a> | <a href="https://dsr-net.cs.columbia.edu/">Website</a>
<br /><strong>Keywords</strong>: scene representations, 3d perception, robot manipulation</p>
<hr />
<h4 id="learning-latent-representations-to-influence-multi-agent-interaction"><a href="https://drive.google.com/file/d/1_ezqLLEv4HLtj9vflRj0sq3PNOhaSnJm/view">Learning Latent Representations to Influence Multi-Agent Interaction</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img6" />
<strong>Authors</strong>: Annie Xie, Dylan P. Losey, Ryan Tolsma, Chelsea Finn, Dorsa Sadigh
<br /><strong>Contact</strong>: anniexie@stanford.edu
<br /><strong>Links:</strong> <a href="https://drive.google.com/file/d/1_ezqLLEv4HLtj9vflRj0sq3PNOhaSnJm/view">Paper</a> | <a href="https://ai.stanford.edu/blog/lili/">Blog Post</a> | <a href="https://sites.google.com/view/latent-strategies">Website</a>
<br /><strong>Keywords</strong>: multi-agent systems, human-robot interaction, reinforcement learning</p>
<hr />
<h4 id="learning-object-conditioned-exploration-using-distributed-soft-actor-critic"><a href="https://arxiv.org/pdf/2007.14545.pdf">Learning Object-conditioned Exploration using Distributed Soft Actor Critic</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img1" />
<strong>Authors</strong>: Ayzaan Wahid (Google), Austin Stone (Google), Brian Ichter (Google Brain), Kevin Chen (Stanford), Alexander Toshev (Google)
<br /><strong>Contact</strong>: ayzaan@google.com
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2007.14545.pdf">Paper</a>
<br /><strong>Keywords</strong>: object navigation, visual navigation</p>
<hr />
<h4 id="mats-an-interpretable-trajectory-forecasting-representation-for-planning-and-control-"><a href="https://arxiv.org/abs/2009.07517">MATS: An Interpretable Trajectory Forecasting Representation for Planning and Control</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img2" />
<strong>Authors</strong>: Boris Ivanovic, Amine Elhafsi, Guy Rosman, Adrien Gaidon, Marco Pavone
<br /><strong>Contact</strong>: borisi@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2009.07517">Paper</a> | <a href="https://www.youtube.com/watch?v=q6hMY2y-BcQ">Video</a>
<br /><strong>Keywords</strong>: trajectory forecasting, learning dynamical systems, motion planning, autonomous vehicles</p>
<hr />
<h4 id="model-based-reinforcement-learning-for-decentralized-multiagent-rendezvous"><a href="https://arxiv.org/abs/2003.06906">Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img3" />
<strong>Authors</strong>: Rose E. Wang, J. Chase Kew, Dennis Lee, Tsang-Wei Edward Lee, Tingnan Zhang, Brian Ichter, Jie Tan, Aleksandra Faust
<br /><strong>Contact</strong>: rewang@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2003.06906">Paper</a> | <a href="https://youtu.be/HqeYcO1DBUU">Video</a> | <a href="https://sites.google.com/view/multiagent-hpp/home">Website</a>
<br /><strong>Keywords</strong>: multiagent systems, model-based reinforcement learning</p>
<hr />
<h4 id="reinforcement-learning-with-videos--combining-offline-observations-with-interaction"><a href="https://arxiv.org/abs/2011.06507">Reinforcement Learning with Videos: Combining Offline Observations with Interaction</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img4" />
<strong>Authors</strong>: Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, Chelsea Finn
<br /><strong>Contact</strong>: karls@seas.upenn.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2011.06507">Paper</a> | <a href="https://sites.google.com/view/rl-with-videos">Website</a>
<br /><strong>Keywords</strong>: reinforcement learning, learning from observation</p>
<hr />
<h4 id="sampling-based-reachability-analysis-a-random-set-theory-approach-with-adversarial-sampling"><a href="https://arxiv.org/abs/2008.10180">Sampling-based Reachability Analysis: A Random Set Theory Approach with Adversarial Sampling</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img5" />
<strong>Authors</strong>: Thomas Lew, Marco Pavone
<br /><strong>Contact</strong>: thomas.lew@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2008.10180">Paper</a>
<br /><strong>Keywords</strong>: reachability analysis, robust planning and control, neural networks</p>
<h2 id="keynote">Keynote</h2>
<hr />
<h4 id="walking-the-boundary-of-learning-and-interaction-dorsa-sadigh">Walking the Boundary of Learning and Interaction (Dorsa Sadigh)</h4>
<figure class="figure"><div class="figure__main">
<p><img class="postimagethird" src="/blog/assets/img/posts/2020-11-16-corl-2020/keynote.png" /></p>
</div></figure>
<p><strong>Overview:</strong> There have been significant advances in the field of robot learning in the past decade. However, many challenges remain when considering how robot learning can advance interactive agents such as robots that collaborate with humans. These include autonomous vehicles that interact with human-driven vehicles or pedestrians, service robots collaborating with their users at home over short or long periods of time, and assistive robots helping patients with disabilities. This introduces an opportunity for developing new robot learning algorithms that can help advance interactive autonomy.</p>
<p>In this talk, I will discuss a formalism for human-robot interaction built upon ideas from representation learning. Specifically, I will first discuss the notion of latent strategies: low-dimensional representations sufficient for capturing non-stationary interactions. I will then talk about the challenges of learning such representations when interacting with humans, and how we can develop data-efficient techniques that enable actively learning computational models of human behavior from demonstrations, preferences, or physical corrections. Finally, I will introduce an intuitive control paradigm that enables seamless collaboration based on learned representations, and discuss how it can also be used to influence humans.</p>
<p><strong>Live Event:</strong> November 17th, 7:00AM - 7:45AM PST</p>
<hr />
<p>We look forward to seeing you at CoRL!</p>
Mon, 16 Nov 2020 00:00:00 -0800
Stanford AI Lab Papers and Talks at EMNLP 2020
/blog/emnlp-2020/
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/logo.png" /></p>
<p>The <a href="https://2020.emnlp.org/">Conference on Empirical Methods in Natural Language Processing</a> (EMNLP) 2020 is being hosted virtually from November 16th to November 20th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!</p>
<ul>
<li><a href="#main-conference">Main Conference</a></li>
<li><a href="#findings-of-emnlp">Findings of EMNLP</a></li>
<li><a href="#workshops-and-co-located-conferences">Workshops and Co-Located Conferences</a></li>
</ul>
<h2 id="main-conference">Main Conference</h2>
<hr />
<h4 id="pre-training-transformers-as-energy-based-cloze-models"><a href="https://www.aclweb.org/anthology/2020.emnlp-main.20.pdf">Pre-Training Transformers as Energy-Based Cloze Models</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img19" />
<strong>Authors</strong>: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
<br /><strong>Contact</strong>: kevclark@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://www.aclweb.org/anthology/2020.emnlp-main.20.pdf">Paper</a>
<br /><strong>Keywords</strong>: representation learning, self-supervised learning, energy-based models</p>
<hr />
<h4 id="alice-active-learning-with-contrastive-natural-language-explanations"><a href="https://arxiv.org/abs/2009.10259">ALICE: Active Learning with Contrastive Natural Language Explanations</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img8" />
<strong>Authors</strong>: Weixin Liang, James Zou, Zhou Yu
<br /><strong>Contact</strong>: wxliang@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2009.10259">Paper</a>
<br /><strong>Keywords</strong>: natural language explanation, class-based active learning, contrastive explanation</p>
<hr />
<h4 id="chexbert-combining-automatic-labelers-and-expert-annotations-for-accurate-radiology-report-labeling-using-bert"><a href="https://arxiv.org/abs/2004.09167">CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img1" />
<strong>Authors</strong>: Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, Matthew P. Lungren
<br /><strong>Contact</strong>: akshaysm@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2004.09167">Paper</a> | <a href="https://virtual.2020.emnlp.org/paper_main.55.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: bert, natural language processing, radiology, medical imaging, deep learning</p>
<hr />
<h4 id="autoqa-from-databases-to-qa-semantic-parsers-with-only-synthetic-training-data"><a href="https://www.aclweb.org/anthology/2020.emnlp-main.31/">AutoQA: From Databases To QA Semantic Parsers With Only Synthetic Training Data</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img21" />
<strong>Authors</strong>: Silei Xu, Sina J. Semnani, Giovanni Campagna, Monica S. Lam
<br /><strong>Contact</strong>: silei@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://www.aclweb.org/anthology/2020.emnlp-main.31/">Paper</a> | <a href="https://virtual.2020.emnlp.org/paper_main.3506.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: question answering, semantic parsing, language models, synthetic training data, data augmentation</p>
<hr />
<h4 id="data-and-representation-for-turkish-natural-language-inference"><a href="https://arxiv.org/pdf/2004.14963.pdf">Data and Representation for Turkish Natural Language Inference</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img14" />
<strong>Authors</strong>: Emrah Budur, Rıza Özçelik, Tunga Güngör, Christopher Potts
<br /><strong>Contact</strong>: emrah.budur@boun.edu.tr
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2004.14963.pdf">Paper</a> | <a href="https://github.com/boun-tabi/NLI-TR">Website</a>
<br /><strong>Keywords</strong>: sentence-level semantics, natural language inference, neural machine translation, morphologically rich language</p>
<hr />
<h4 id="intrinsic-evaluation-of-summarization-datasets"><a href="https://github.com/rishibommasani/rishibommasani.github.io/blob/master/papers/EMNLP2020.pdf">Intrinsic Evaluation of Summarization Datasets</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img6" />
<strong>Authors</strong>: Rishi Bommasani, Claire Cardie
<br /><strong>Contact</strong>: nlprishi@stanford.edu
<br /><strong>Links:</strong> <a href="https://github.com/rishibommasani/rishibommasani.github.io/blob/master/papers/EMNLP2020.pdf">Paper</a> | <a href="https://slideslive.com/38938755">Video</a> | <a href="https://rishibommasani.github.io/">Website</a> | <a href="https://virtual.2020.emnlp.org/paper_main.675.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: summarization, datasets, evaluation</p>
<hr />
<h4 id="learning-music-helps-you-read-using-transfer-to-study-linguistic-structure-in-language-models"><a href="https://arxiv.org/abs/2004.14601">Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img22" />
<strong>Authors</strong>: Isabel Papadimitriou, Dan Jurafsky
<br /><strong>Contact</strong>: isabelvp@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2004.14601">Paper</a>
<br /><strong>Keywords</strong>: transfer learning, analysis, music, hierarchical structure</p>
<hr />
<h4 id="localizing-open-ontology-qa-semantic-parsers-in-a-day-using-machine-translation"><a href="https://arxiv.org/pdf/2010.05106.pdf">Localizing Open-Ontology QA Semantic Parsers in a Day Using Machine Translation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img5" />
<strong>Authors</strong>: Mehrad Moradshahi, Giovanni Campagna, Sina J. Semnani, Silei Xu, Monica S. Lam
<br /><strong>Contact</strong>: mehrad@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2010.05106.pdf">Paper</a> | <a href="https://github.com/stanford-oval/SPL">Website</a>
<br /><strong>Keywords</strong>: machine translation, semantic parsing, localization</p>
<hr />
<h4 id="slm-learning-a-discourse-language-representation-with-sentence-unshuffling"><a href="https://arxiv.org/pdf/2010.16249.pdf">SLM: Learning a Discourse Language Representation with Sentence Unshuffling</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img18" />
<strong>Authors</strong>: Haejun Lee, Drew A. Hudson, Kangwook Lee, Christopher D. Manning
<br /><strong>Contact</strong>: dorarad@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2010.16249.pdf">Paper</a>
<br /><strong>Keywords</strong>: transformer, bert, language, understanding, nlp, squad, glue, sentences, discourse</p>
<hr />
<h4 id="utility-is-in-the-eye-of-the-user-a-critique-of-nlp-leaderboards"><a href="https://arxiv.org/pdf/2009.13888.pdf">Utility is in the Eye of the User: A Critique of NLP Leaderboards</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img0" />
<strong>Authors</strong>: Kawin Ethayarajh, Dan Jurafsky
<br /><strong>Contact</strong>: kawin@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2009.13888.pdf">Paper</a> | <a href="https://kawine.github.io/">Website</a>
<br /><strong>Keywords</strong>: nlp, leaderboard, utility, benchmark, fairness, efficiency</p>
<hr />
<h4 id="with-little-power-comes-great-responsibility"><a href="https://arxiv.org/abs/2010.06595">With Little Power Comes Great Responsibility</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img4" />
<strong>Authors</strong>: Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, Dan Jurafsky
<br /><strong>Contact</strong>: dcard@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.06595">Paper</a> | <a href="https://github.com/dallascard/NLP-power-analysis">Website</a>
<br /><strong>Keywords</strong>: statistical power, experimental methodology, leaderboards, machine translation, human evaluation</p>
<hr />
<h2 id="findings-of-emnlp">Findings of EMNLP</h2>
<hr />
<h4 id="desmog-detecting-stance-in-media-on-global-warming"><a href="https://www.aclweb.org/anthology/2020.findings-emnlp.296.pdf">DeSMOG: Detecting Stance in Media On Global Warming</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img16" />
<strong>Authors</strong>: Yiwei Luo, Dallas Card, Dan Jurafsky
<br /><strong>Contact</strong>: yiweil@stanford.edu
<br /><strong>Links:</strong> <a href="https://www.aclweb.org/anthology/2020.findings-emnlp.296.pdf">Paper</a> | <a href="http://stanford.edu/~yiweil/webpage.html">Website</a>
<br /><strong>Keywords</strong>: computational social science, framing, argumentation, stance, bias, climate change</p>
<hr />
<h4 id="investigating-transferability-in-pretrained-language-models"><a href="https://arxiv.org/abs/2004.14975">Investigating Transferability in Pretrained Language Models</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img15" />
<strong>Authors</strong>: Alex Tamkin, Trisha Singh, Davide Giovanardi, Noah Goodman
<br /><strong>Contact</strong>: atamkin@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2004.14975">Paper</a> | <a href="http://alextamkin.com">Website</a> | <a href="https://virtual.2020.emnlp.org/paper_WS-1.1165_F.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: finetuning, transfer learning, language models, bert, probing</p>
<hr />
<h4 id="stay-hungry-stay-focused-generating-informative-and-specific-questions-in-information-seeking-conversations"><a href="https://arxiv.org/pdf/2004.14530.pdf">Stay Hungry, Stay Focused: Generating Informative and Specific Questions in Information-Seeking Conversations</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img2" />
<strong>Authors</strong>: Peng Qi, Yuhao Zhang, Christopher D. Manning
<br /><strong>Contact</strong>: pengqi@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2004.14530.pdf">Paper</a> | <a href="https://qipeng.me/blog/learning-to-ask/">Blog Post</a> | <a href="https://virtual.2020.emnlp.org/paper_WS-1.69_F.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: conversational agents, question generation, natural language generation</p>
<hr />
<h4 id="do-language-embeddings-capture-scales"><a href="https://arxiv.org/abs/2010.05345">Do Language Embeddings Capture Scales?</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img11" />
<strong>Authors</strong>: Xikun Zhang*, Deepak Ramachandran*, Ian Tenney, Yanai Elazar, Dan Roth
<br /><strong>Contact</strong>: xikunz2@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.05345">Paper</a> | <a href="https://virtual.2020.emnlp.org/paper_findings.439.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: probing, analysis, bertology, scales, common sense knowledge</p>
<hr />
<h4 id="on-the-importance-of-adaptive-data-collection-for-extremely-imbalanced-pairwise-tasks"><a href="https://arxiv.org/abs/2010.05103">On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img7" />
<strong>Authors</strong>: Stephen Mussmann, Robin Jia, Percy Liang
<br /><strong>Contact</strong>: robinjia@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.05103">Paper</a> | <a href="https://worksheets.codalab.org/worksheets/0x39ba5559790b4099a7ff75f916ce19a4">Website</a>
<br /><strong>Keywords</strong>: active learning, robustness, label imbalance</p>
<hr />
<h4 id="pragmatic-issue-sensitive-image-captioning"><a href="https://arxiv.org/abs/2004.14451">Pragmatic Issue-Sensitive Image Captioning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img12" />
<strong>Authors</strong>: Allen Nie, Reuben Cohn-Gordon, Christopher Potts
<br /><strong>Contact</strong>: anie@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2004.14451">Paper</a> | <a href="https://slideslive.com/38940644/pragmatic-issuesensitive-image-captioning">Video</a>
<br /><strong>Keywords</strong>: controllable caption generation, question under discussion, discourse, pragmatics</p>
<hr />
<h2 id="workshops-and-co-located-conferences">Workshops and Co-Located Conferences</h2>
<hr />
<h4 id="bleu-neighbors-a-reference-less-approach-to-automatic-evaluation"><a href="https://arxiv.org/pdf/2004.12726.pdf">BLEU Neighbors: A Reference-less Approach to Automatic Evaluation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img3" />
<strong>Authors</strong>: Kawin Ethayarajh, Dorsa Sadigh
<br /><strong>Contact</strong>: kawin@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2004.12726.pdf">Paper</a> | <a href="https://kawine.github.io/">Website</a>
<br /><strong>Keywords</strong>: nlp, bleu, evaluation, nearest neighbors, dialogue</p>
<hr />
<h4 id="determining-question-answer-plausibility-in-crowdsourced-datasets-using-multi-task-learning"><a href="https://arxiv.org/abs/2011.04883">Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img17" />
<strong>Authors</strong>: Rachel Gardner, Maya Varma, Clare Zhu, Ranjay Krishna
<br /><strong>Contact</strong>: rachel0@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2011.04883">Paper</a>
<br /><strong>Keywords</strong>: noisy text, bert, plausibility, multi-task learning</p>
<hr />
<h4 id="explaining-the-trump-gap-in-social-distancing-using-covid-discourse"><a href="https://openreview.net/pdf/baa636711f681ae8664818f378d565b17065c604.pdf">Explaining the ‘Trump Gap’ in Social Distancing Using COVID Discourse</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img20" />
<strong>Authors</strong>: Austin van Loon, Sheridan Stewart, Brandon Waldon, Shrinidhi K. Lakshmikanth, Ishan Shah, Sharath Chandra Guntuku, Garrick Sherman, James Zou, Johannes Eichstaedt
<br /><strong>Contact</strong>: avanloon@stanford.edu
<br /><strong>Links:</strong> <a href="https://openreview.net/pdf/baa636711f681ae8664818f378d565b17065c604.pdf">Paper</a>
<br /><strong>Keywords</strong>: computational social science, social distancing, word2vec, vector semantics, twitter, bert</p>
<hr />
<h4 id="learning-adaptive-language-interfaces-through-decomposition"><a href="https://arxiv.org/abs/2010.05190">Learning Adaptive Language Interfaces through Decomposition</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img10" />
<strong>Authors</strong>: Siddharth Karamcheti, Dorsa Sadigh, Percy Liang
<br /><strong>Contact</strong>: skaramcheti@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.05190">Paper</a> | <a href="https://virtual.2020.emnlp.org/paper_WS-6.10.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: semantic parsing, interaction, decomposition</p>
<hr />
<h4 id="modeling-subjective-assessments-of-guilt-in-newspaper-crime-narratives"><a href="https://arxiv.org/abs/2006.09589">Modeling Subjective Assessments of Guilt in Newspaper Crime Narratives</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img23" />
<strong>Authors</strong>: Elisa Kreiss*, Zijian Wang*, Christopher Potts
<br /><strong>Contact</strong>: ekreiss@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.09589">Paper</a> | <a href="https://github.com/zijwang/modeling_guilt">Website</a>
<br /><strong>Keywords</strong>: psycholinguistics, pragmatics, token-level supervision, model attribution, news, guilt, hedges, corpus, subjectivity</p>
<hr />
<h4 id="neural-natural-language-inference-models-partially-embed-theories-of-lexical-entailment-and-negation"><a href="https://arxiv.org/abs/2004.14623">Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img13" />
<strong>Authors</strong>: Atticus Geiger, Kyle Richardson, Christopher Potts
<br /><strong>Contact</strong>: atticusg@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2004.14623">Paper</a> | <a href="https://atticusg.github.io/">Website</a>
<br /><strong>Keywords</strong>: entailment, intervention, causality, systematic generalization</p>
<hr />
<h4 id="structured-self-attention-weights-encode-semantics-in-sentiment-analysis"><a href="https://arxiv.org/abs/2010.04922">Structured Self-Attention Weights Encode Semantics in Sentiment Analysis</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img9" />
<strong>Authors</strong>: Zhengxuan Wu, Thanh-Son Nguyen, Desmond C. Ong
<br /><strong>Contact</strong>: wuzhengx@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.04922">Paper</a>
<br /><strong>Keywords</strong>: attention, explainability, sentiment analysis</p>
<hr />
<p>We look forward to seeing you at EMNLP 2020!</p>
Sun, 15 Nov 2020 00:00:00 -0800