The Stanford AI Lab Blog
http://ai.stanford.edu/blog/
The Stanford AI Lab (SAIL) Blog is a place for SAIL students, faculty, and researchers to share our work with the general public.
Wed, 03 Mar 2021 15:31:47 -0800
Neural Mechanics: Symmetry and Broken Conservation Laws In Deep Learning Dynamics
/blog/neural-mechanics/
<p>Just as the fundamental laws of classical and quantum mechanics taught us how to control and optimize the physical world for engineering purposes, a better understanding of the laws governing neural network learning dynamics can have a profound impact on the optimization of artificial neural networks. This raises a foundational question: what, if anything, can we quantitatively understand about the learning dynamics of state-of-the-art deep learning models driven by real-world datasets?</p>
<p>In order to make headway on this extremely difficult question, existing works have made major simplifying assumptions on the architecture, such as restricting to a single hidden layer <sup id="fnref:saad1995dynamics"><a href="#fn:saad1995dynamics" class="footnote">1</a></sup>, linear activation functions <sup id="fnref:saxe2013exact"><a href="#fn:saxe2013exact" class="footnote">2</a></sup>, or infinite width layers <sup id="fnref:jacot2018neural"><a href="#fn:jacot2018neural" class="footnote">3</a></sup>. These works have also ignored the complexity introduced by the optimizer through stochastic and discrete updates. In the present work, rather than introducing unrealistic assumptions on the architecture or optimizer, we identify combinations of parameters with simpler dynamics (as shown in Fig. 1) that can be solved exactly!</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-02-25-neural-mechanics/image1.gif" /></p>
</div></figure>
<p><strong>Fig. 1.</strong> <em>We plot the per-parameter dynamics (left) and per-filter squared Euclidean norm dynamics (right) for the convolutional layers of a VGG-16 model (with batch normalization) trained on Tiny ImageNet with SGD with learning rate <script type="math/tex">\eta = 0.1</script>, weight decay <script type="math/tex">\lambda = 10^{-4}</script>, and batch size <script type="math/tex">S = 256</script>. Each color represents a different convolutional block. While the parameter dynamics are noisy and chaotic, the neuron dynamics are smooth and patterned.</em></p>
<h2 id="symmetries-in-the-loss-shape-gradient-and-hessian-geometry">Symmetries in the loss shape gradient and Hessian geometry</h2>
<p>While we commonly initialize neural networks with random weights, their gradients and Hessians at all points in training, no matter the loss or dataset, obey certain geometric constraints. Some of these constraints have been noticed previously as a form of implicit regularization, while others have been leveraged algorithmically in applications from network pruning to interpretability. Remarkably, all these geometric constraints can be understood as consequences of numerous symmetries in the loss introduced by neural network architectures.</p>
<p>A set of parameters observes a symmetry in the loss if the loss doesn’t change under a certain transformation of these parameters. This invariance introduces associated geometric constraints on the gradient and Hessian. We consider three families of symmetries (translation, scale, and rescale) that commonly appear in modern neural network architectures.</p>
<ul>
<li>Translation symmetry is defined by the transformation <script type="math/tex">\psi(\theta, \alpha) = \theta + \alpha\mathbb{1}_{\mathcal{A}}</script> where <script type="math/tex">\mathbb{1}_{\mathcal{A}}</script> is the indicator vector for some subset <script type="math/tex">\mathcal{A}</script> of the parameters <script type="math/tex">\{\theta_1, ..., \theta_m\}</script>. Any network using the softmax function gives rise to translation symmetry for the parameters immediately preceding the function.</li>
<li>Scale symmetry is defined by the transformation <script type="math/tex">\psi(\theta, \alpha) = \alpha_\mathcal{A} \odot \theta</script> where <script type="math/tex">\alpha_\mathcal{A} = \alpha \mathbb{1}_\mathcal{A} + \mathbb{1}_\mathcal{A^\mathsf{c}}</script>. Batch normalization leads to scale invariance for the parameters immediately preceding the function.</li>
<li>Rescale symmetry is defined by the transformation <script type="math/tex">\psi(\theta, \alpha) = \alpha_{\mathcal{A}_1} \odot \alpha^{-1}_{\mathcal{A}_2} \odot \theta</script> where <script type="math/tex">\mathcal{A}_1</script> and <script type="math/tex">\mathcal{A}_2</script> are two disjoint sets of parameters. For networks with continuous, homogeneous activation functions <script type="math/tex">\phi(z) = \phi'(z)z</script> (e.g. ReLU, Leaky ReLU, linear), this symmetry emerges at every hidden neuron by considering all incoming and outgoing parameters to the neuron.</li>
</ul>
<p>These symmetries enforce geometric constraints on the gradient of a neural network <script type="math/tex">g</script>,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\textbf{Translation:}&\quad\langle g, \mathbb{1}_\mathcal{A} \rangle = 0\\
\textbf{Scale:}&\quad\langle g, \theta_\mathcal{A} \rangle = 0\\
\textbf{Rescale:}&\quad\langle g, \theta_{\mathcal{A}_1} - \theta_{\mathcal{A}_2}\rangle = 0
\end{aligned} %]]></script>
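<p>The three constraints above can be checked numerically on toy components. The following is a minimal sketch, not code from the paper: the toy losses, the data <code>X</code> and <code>y</code>, the constants <code>1.5</code> and <code>0.7</code>, and the helper <code>numerical_grad</code> are all illustrative assumptions, and the gradient is approximated by central finite differences.</p>

```python
import numpy as np

def numerical_grad(loss, theta, eps=1e-6):
    """Central finite-difference approximation of the gradient of a scalar loss."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (loss(theta + step) - loss(theta - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # hypothetical batch of inputs
y = rng.normal(size=8)        # hypothetical regression targets

def softmax_loss(theta):
    # Translation symmetry: softmax(theta) is unchanged by theta + alpha * 1.
    shifted = theta - theta.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[0]

def batchnorm_loss(theta):
    # Scale symmetry: normalized pre-activations are unchanged by alpha * theta (alpha > 0).
    z = X @ theta
    z_hat = (z - z.mean()) / z.std()
    return np.mean((z_hat - y) ** 2)

def linear_path_loss(theta):
    # Rescale symmetry: theta_2 * theta_1 * x is unchanged by (alpha * theta_1, theta_2 / alpha).
    return (theta[1] * theta[0] * 1.5 - 0.7) ** 2

theta_t = rng.normal(size=4)
theta_s = rng.normal(size=3)
theta_r = np.array([0.8, -1.3])

# <g, 1> for translation, <g, theta> for scale, g_1 * theta_1 - g_2 * theta_2 for rescale.
translation_residual = numerical_grad(softmax_loss, theta_t) @ np.ones(4)
scale_residual = numerical_grad(batchnorm_loss, theta_s) @ theta_s
g_r = numerical_grad(linear_path_loss, theta_r)
rescale_residual = g_r[0] * theta_r[0] - g_r[1] * theta_r[1]

print(translation_residual, scale_residual, rescale_residual)
```

<p>All three residuals should vanish up to finite-difference error, at any point in parameter space and for any data, which is the sense in which the constraints are geometric rather than data dependent.</p>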
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-02-25-neural-mechanics/image2.jpg" /></p>
</div></figure>
<p><strong>Fig. 2.</strong> <em>We visualize the vector fields associated with simple network components that have translation, scale, and rescale symmetry. On the right we consider the vector field associated with a neuron <script type="math/tex">% <![CDATA[
\sigma\left(\begin{bmatrix}\theta_1 & \theta_2\end{bmatrix}^\intercal x\right) %]]></script> where <script type="math/tex">\sigma</script> is the softmax function. In the middle we consider the vector field associated with a neuron <script type="math/tex">% <![CDATA[
\text{BN}\left(\begin{bmatrix}\theta_1 & \theta_2\end{bmatrix}\begin{bmatrix}x_1 & x_2\end{bmatrix}^\intercal\right) %]]></script> where <script type="math/tex">\text{BN}</script> is the batch normalization function. On the left we consider the vector field associated with a linear path <script type="math/tex">\theta_2\theta_1 x</script>.</em></p>
<p></p>
<h2 id="symmetry-leads-to-conservation-laws-under-gradient-flow">Symmetry leads to conservation laws under gradient flow</h2>
<p>We now consider how geometric constraints on gradients and Hessians, arising as a consequence of symmetry, impact the learning dynamics given by stochastic gradient descent (SGD). We will consider a model parameterized by <script type="math/tex">\theta</script>, a training dataset <script type="math/tex">\{x_{1}, ..., x_{N}\}</script> of size <script type="math/tex">N</script>, and a training loss <script type="math/tex">\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^N\ell(\theta, x_i)</script> with corresponding gradient <script type="math/tex">g(\theta) = \frac{\partial \mathcal{L}}{\partial\theta}</script>. The gradient descent update with learning rate <script type="math/tex">\eta</script> is <script type="math/tex">\theta^{(n+1)} = \theta^{(n)} - \eta g(\theta^{(n)})</script>, which is a forward Euler discretization with step size <script type="math/tex">\eta</script> of the ordinary differential equation (ODE) <script type="math/tex">\frac{d\theta}{dt} = -g(\theta)</script>. In the limit as <script type="math/tex">\eta \to 0</script>, gradient descent exactly matches the dynamics of this ODE, which is commonly referred to as gradient flow. Equipped with a continuous model for the learning dynamics, we now ask: how do these dynamics interact with the geometric properties introduced by symmetries?</p>
<p>Strikingly similar to <a href="https://www.google.com/url?q=https://en.wikipedia.org/wiki/Noether%2527s_theorem&sa=D&source=editors&ust=1614205020229000&usg=AOvVaw1FkghDm15tT1bYlTSo-QKm">Noether’s theorem</a>, which describes a fundamental relationship between symmetry and conservation for physical systems governed by Lagrangian dynamics, every symmetry of a network architecture has a corresponding “conserved quantity” through training under gradient flow. Just as the total kinetic and potential energy is conserved for an idealized spring in harmonic motion, certain combinations of parameters are constant under gradient flow dynamics.</p>
<p>Consider some subset of the parameters <script type="math/tex">\mathcal{A}</script> that respects either a translation, scale, or rescale symmetry. As discussed earlier, the gradient of the loss <script type="math/tex">g(\theta)</script> is always perpendicular to the vector field that generates the symmetry <script type="math/tex">\partial_\alpha \psi</script>. Projecting the gradient flow learning dynamics onto the generator vector field yields a differential equation <script type="math/tex">\langle\frac{d\theta}{dt}, \partial_\alpha \psi\rangle = 0</script>. Integrating this equation through time results in the conservation laws,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\textbf{Translation:}&\quad\langle \theta_\mathcal{A}(t), \mathbb{1} \rangle = \langle \theta_\mathcal{A}(0), \mathbb{1} \rangle\\
\textbf{Scale:}&\quad|\theta_\mathcal{A}(t)|^2 = |\theta_\mathcal{A}(0)|^2\\
\textbf{Rescale:}&\quad|\theta_{\mathcal{A}_1}(t)|^2 - |\theta_{\mathcal{A}_2}(t)|^2 = |\theta_{\mathcal{A}_1}(0)|^2 - |\theta_{\mathcal{A}_2}(0)|^2
\end{aligned} %]]></script>
<p>Each of these equations defines a conserved quantity through training, effectively restricting the possible trajectories the parameters can take through learning. For parameters with translation symmetry, their sum is conserved, effectively constraining their dynamics to a hyperplane. For parameters with scale symmetry, their Euclidean norm is conserved, effectively constraining their dynamics to a sphere. For parameters with rescale symmetry, their difference in squared Euclidean norm is conserved, effectively constraining their dynamics to a hyperbola.</p>
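<p>As a rough illustration of the rescale conservation law, the sketch below runs plain gradient descent on a linear path <script type="math/tex">\theta_2\theta_1 x</script> with a squared loss and tracks <script type="math/tex">\theta_1^2 - \theta_2^2</script>. The input <code>x</code>, target <code>y</code>, initial point, and step sizes are assumptions chosen for illustration. Gradient descent only approximates gradient flow, so the quantity drifts, but the drift shrinks as the step size decreases.</p>

```python
import numpy as np

def run_gradient_descent(theta, lr, total_time, x=1.5, y=0.7):
    """Gradient descent on L = (theta_2 * theta_1 * x - y)^2 for total_time / lr steps."""
    theta = np.array(theta, dtype=float)
    for _ in range(int(round(total_time / lr))):
        residual = theta[1] * theta[0] * x - y
        grad = 2 * residual * x * np.array([theta[1], theta[0]])
        theta = theta - lr * grad
    return theta

def rescale_quantity(theta):
    """The quantity conserved under gradient flow for a rescale symmetry."""
    return theta[0] ** 2 - theta[1] ** 2

theta0 = [0.8, -1.3]
q0 = rescale_quantity(theta0)

# Same total integration time, two different step sizes.
drift_coarse = abs(rescale_quantity(run_gradient_descent(theta0, 5e-2, 5.0)) - q0)
drift_fine = abs(rescale_quantity(run_gradient_descent(theta0, 1e-3, 5.0)) - q0)

print(drift_coarse, drift_fine)
```

<p>The residual drift at a finite step size is exactly the kind of broken conservation law that the rest of this post sets out to model.</p>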
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-02-25-neural-mechanics/image6.gif" /></p>
</div></figure>
<p><strong>Fig. 3.</strong> <em>Associated with each symmetry is a conserved quantity constraining the gradient flow dynamics to a surface. For translation symmetry (right) the flow is constrained to a hyperplane where the intercept is conserved. For scale symmetry (middle) the flow is constrained to a sphere where the radius is conserved. For rescale symmetry (left) the flow is constrained to a hyperbola where the axes are conserved. The color represents the value of the conserved quantity, where blue is positive and red is negative, and the black lines are level sets.</em>
</p>
<h2 id="a-realistic-continuous-model-for-stochastic-gradient-descent">A realistic continuous model for stochastic gradient descent</h2>
<p>While the conservation laws derived with gradient flow are quite striking, empirically we know they are broken, as demonstrated in Fig. 1. Gradient flow is too simple a continuous model for realistic SGD training: it fails to account for the effect of hyperparameters such as weight decay and momentum, the effect of stochasticity introduced by random batches of data, and the effect of discrete updates due to a finite learning rate. Here, we consider how to address these effects individually to construct more realistic continuous models of SGD.</p>
<p>Modeling weight decay. Explicit regularization through the addition of an <script type="math/tex">L_2</script> penalty on the parameters, with regularization constant <script type="math/tex">\lambda</script>, is a very common practice when training neural networks. Weight decay modifies the gradient flow trajectory, pulling the network towards the origin in parameter space.</p>
<p>Modeling momentum. Momentum is a common extension to SGD that uses an exponential moving average of gradients to update parameters rather than a single gradient evaluation. The method introduces an additional hyperparameter <script type="math/tex">\beta</script>, which controls how past gradients are used in future updates, resulting in a form of “inertia” that accelerates the learning dynamics, rescaling time but leaving the gradient flow trajectory intact.</p>
<p>Modeling stochasticity. Stochastic gradients arise when we consider a batch <script type="math/tex">\mathcal{B}</script> of size <script type="math/tex">S</script> drawn uniformly from the indices <script type="math/tex">\{1,...,N\}</script> forming the unbiased gradient estimate <script type="math/tex">\hat{g}_{\mathcal{B}}(\theta) = \frac{1}{S}\sum_{i\in\mathcal{B}}\nabla\ell(\theta, x_i)</script>. We can model the batch gradient <script type="math/tex">\hat{g}_{\mathcal{B}}(\theta)</script> as a noisy version of the true gradient <script type="math/tex">g(\theta)</script>. However, because both the batch gradient and true gradient observe the same geometric properties introduced by symmetry, this noise has a special low-rank structure. In other words, stochasticity introduced by random batches does not affect the gradient flow dynamics in the directions associated with symmetry.</p>
<p>Modeling discretization. Gradient descent always moves in the direction of steepest descent on a loss function <script type="math/tex">\mathcal{L}</script> at each step. However, due to the finite learning rate, it fails to remain on the continuous steepest descent path given by gradient flow. In order to model this discrepancy, we borrow tools from the numerical analysis of partial differential equations. In particular, we use modified equation analysis <sup id="fnref:warming1974modified"><a href="#fn:warming1974modified" class="footnote">4</a></sup>, which determines how to model the numerical artifacts introduced by a discretization of a PDE. In our paper we present two methods based on modified equation analysis and recent works <sup id="fnref:barrett2020implicit"><a href="#fn:barrett2020implicit" class="footnote">5</a></sup>, <sup id="fnref:kovachki2019analysis"><a href="#fn:kovachki2019analysis" class="footnote">6</a></sup>, which modify gradient flow, with either higher order derivatives of the loss or higher order temporal derivatives of the parameters, to account for the effect of discretization on the learning dynamics.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-02-25-neural-mechanics/image4.gif" /></p>
</div></figure>
<p><strong>Fig. 4.</strong> <em>We visualize the trajectories of gradient descent with momentum (black dots), gradient flow (blue line), and the modified dynamics (red line) on the quadratic loss <script type="math/tex">% <![CDATA[
\mathcal{L}(w) = w^\intercal\begin{bmatrix}2.5 & -1.5\\ -1.5 & 2 \end{bmatrix}w %]]></script>. The modified continuous dynamics visually track the discrete dynamics much better than the original gradient flow dynamics.</em></p>
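<p>A simplified version of this comparison can be reproduced in a few lines. The sketch below is an assumption-laden illustration rather than the code behind Fig. 4: it uses plain gradient descent without momentum, and it takes the modified dynamics to be gradient flow on the first-order modified loss <script type="math/tex">\mathcal{L} + \frac{\eta}{4}|\nabla\mathcal{L}|^2</script> in the style of reference 5. The learning rate, step count, and initial point are arbitrary choices. For a quadratic loss both flows are linear ODEs, so they can be solved exactly with an eigendecomposition.</p>

```python
import numpy as np

A = np.array([[2.5, -1.5], [-1.5, 2.0]])
H = 2 * A  # Hessian of L(w) = w^T A w, so the gradient is H @ w

def flow_solution(matrix, w0, t):
    """Exact solution of dw/dt = -matrix @ w at time t for a symmetric matrix."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    return eigvecs @ (np.exp(-eigvals * t) * (eigvecs.T @ w0))

lr, steps = 0.05, 20
w0 = np.array([1.0, 1.0])

# Discrete dynamics: plain gradient descent.
w = w0.copy()
for _ in range(steps):
    w = w - lr * (H @ w)

t = lr * steps
# Gradient flow: dw/dt = -H w.
w_flow = flow_solution(H, w0, t)
# Modified flow: the gradient of (lr / 4) * |H w|^2 is (lr / 2) * H @ H @ w.
w_modified = flow_solution(H + 0.5 * lr * (H @ H), w0, t)

err_flow = np.linalg.norm(w - w_flow)
err_modified = np.linalg.norm(w - w_modified)
print(err_flow, err_modified)
```

<p>With these settings the modified flow lands closer to the gradient descent iterate than the unmodified flow does, in line with the qualitative picture in Fig. 4.</p>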
<h2 id="combining-symmetry-and-modified-gradient-flow-to-derive-exact-learning-dynamics">Combining symmetry and modified gradient flow to derive exact learning dynamics</h2>
<p>We now study how weight decay, momentum, stochastic gradients, and finite learning rates all interact to break the conservation laws of gradient flow. Remarkably, even when using a more realistic continuous model for stochastic gradient descent, we can derive exact learning dynamics for the previously conserved quantities. To do this we (i) consider a realistic continuous model for SGD, (ii) project these learning dynamics onto the generator vector fields <script type="math/tex">\partial_\alpha \psi</script> associated with each symmetry, (iii) harness the geometric constraints introduced by symmetry to derive simplified ODEs, and (iv) solve these ODEs to obtain exact dynamics for the previously conserved quantities. We first consider the continuous model of SGD without momentum incorporating weight decay, stochasticity, and a finite learning rate. In this setting, the exact dynamics for the parameter combinations tied to the symmetries are,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\textbf{Translation:}&\quad\langle \theta_\mathcal{A}(t), \mathbb{1} \rangle = e^{-\lambda t} \langle \theta_\mathcal{A}(0), \mathbb{1} \rangle\\
\textbf{Scale:}&\quad|\theta_\mathcal{A}(t)|^2 = e^{- 2 \lambda t} |\theta_\mathcal{A}(0)|^2 + \eta \int_0^t e^{-2\lambda (t-\tau)} \left| g_\mathcal{A} \right|^2 d\tau\\
\textbf{Rescale:}&\quad|\theta_{\mathcal{A}_1} (t)|^2 - |\theta_{\mathcal{A}_2} (t)|^2 = \\
&\quad e^{- 2 \lambda t} (|\theta_{\mathcal{A}_1} (0)|^2 - |\theta_{\mathcal{A}_2} (0)|^2) + \eta \int_0^t e^{-2\lambda (t-\tau)} \left(\left| g_{\theta_{\mathcal{A}_1}} \right|^2 - \left| g_{\theta_{\mathcal{A}_2}} \right|^2\right)
d\tau
\end{aligned} %]]></script>
<p>Notice how these equations are equivalent to the conservation laws when <script type="math/tex">\eta = \lambda = 0</script>. Remarkably, even in typical hyperparameter settings (weight decay, stochastic batches, finite learning rates), these solutions match nearly perfectly with empirical results from modern neural networks (VGG-16) trained on real-world datasets (Tiny ImageNet), as shown in Fig. 5.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-02-25-neural-mechanics/image3.gif" /></p>
</div></figure>
<p><strong>Fig. 5.</strong> <em>We plot the column sum of the final linear layer (left) and the difference between squared channel norms of the fifth and fourth convolutional layer (right) of a VGG-16 model without batch normalization. We plot the squared channel norm of the second convolutional layer (middle) of a VGG-16 model with batch normalization. Both models are trained on Tiny ImageNet with SGD with learning rate <script type="math/tex">\eta = 0.1</script>, weight decay <script type="math/tex">\lambda = 10^{-4}</script>, batch size <script type="math/tex">S = 256</script>, for <script type="math/tex">100</script> epochs. Colored lines are empirical and black dashed lines are the theoretical predictions.</em></p>
<p>Translation dynamics. For parameters with translation symmetry, this equation implies that the sum of these parameters decays exponentially to zero at a rate proportional to the weight decay. In particular, the dynamics depend directly neither on the learning rate <script type="math/tex">\eta</script> nor on any information about the dataset, due to the lack of curvature in the gradient field for these parameters (as shown in Fig. 2).</p>
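<p>The translation dynamics can be checked on a small softmax regression problem. The sketch below is illustrative only: the synthetic data, model size, and hyperparameters are assumptions, weight decay is implemented as an <script type="math/tex">L_2</script> penalty gradient, and continuous time is identified with <script type="math/tex">t = \eta n</script> after <script type="math/tex">n</script> steps. Each column of the weight matrix (one weight per class for a given input feature) has translation symmetry, so the column sums should follow <script type="math/tex">e^{-\lambda t}</script> regardless of which minibatches are drawn.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, num_features, num_samples = 4, 5, 256
X = rng.normal(size=(num_samples, num_features))   # synthetic inputs
labels = rng.integers(num_classes, size=num_samples)
Y = np.eye(num_classes)[labels]                    # one-hot targets

lr, weight_decay, batch_size, steps = 0.1, 0.01, 32, 1000
W = rng.normal(size=(num_classes, num_features))
initial_column_sums = W.sum(axis=0)

for _ in range(steps):
    batch = rng.choice(num_samples, size=batch_size, replace=False)
    logits = X[batch] @ W.T
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Cross-entropy gradient; its column sums vanish because probs and Y both sum to one.
    grad = (probs - Y[batch]).T @ X[batch] / batch_size
    W -= lr * (grad + weight_decay * W)

empirical = W.sum(axis=0)
predicted = np.exp(-weight_decay * lr * steps) * initial_column_sums
print(empirical)
print(predicted)
```

<p>In this toy setting the discrete update shrinks each column sum by a factor <script type="math/tex">1 - \eta\lambda</script> per step, which agrees with the exponential prediction up to a small correction when <script type="math/tex">\eta\lambda</script> is small.</p>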
<p>Scale dynamics. For parameters with scale symmetry, this equation implies that the squared norm for these parameters is the sum of an exponentially decaying memory of the squared norm at initialization and an exponentially weighted integral of squared gradient norms accumulated through training. Compared to the translation dynamics, the scale dynamics do depend on the data through the gradient norms accumulated throughout training.</p>
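<p>The same kind of check works for the scale dynamics. The sketch below uses a hypothetical scale-invariant loss <script type="math/tex">-\langle \theta/|\theta|, u\rangle</script> with a random vector <code>u</code>, runs gradient descent with weight decay, and compares the measured squared norm with the integral solution, discretized one step at a time using the gradient norms recorded along the trajectory. All constants are assumptions chosen for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=5)  # hypothetical target direction

def scale_invariant_grad(theta):
    """Gradient of L(theta) = -<theta / |theta|, u>, which depends only on direction."""
    norm = np.linalg.norm(theta)
    direction = theta / norm
    return -(u - (direction @ u) * direction) / norm

lr, weight_decay, steps = 0.05, 0.1, 200
theta = rng.normal(size=5)
empirical = [theta @ theta]
predicted = [theta @ theta]

for _ in range(steps):
    grad = scale_invariant_grad(theta)
    # One-step discretization of the integral solution: decay the previous value
    # by exp(-2 * lambda * lr) and add lr * (lr * |g|^2) for the integral over the step.
    predicted.append(np.exp(-2 * weight_decay * lr) * predicted[-1] + lr * lr * (grad @ grad))
    theta = theta - lr * (grad + weight_decay * theta)
    empirical.append(theta @ theta)

empirical = np.array(empirical)
predicted = np.array(predicted)
max_rel_error = np.max(np.abs(empirical - predicted) / empirical)
print(max_rel_error)
```

<p>Because the gradient is orthogonal to the parameters, the measured squared norm obeys a simple recursion that the continuous solution tracks closely when the product of learning rate and weight decay is small.</p>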
<p>Rescale dynamics. For parameters with rescale symmetry, this equation is the sum of an exponentially decaying memory of the difference in squared norms at initialization and an exponentially weighted integral of the difference in squared gradient norms accumulated through training. Similar to the scale dynamics, the rescale dynamics do depend on the data through the gradient norms; however, unlike the scale dynamics, we have no guarantee that the integral term is always positive.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Despite being the central guiding principle in the exploration of the physical world, symmetry has been underutilized in understanding the mechanics of neural networks. In this paper, we constructed a unifying theoretical framework harnessing the geometric properties of symmetry and realistic continuous equations for SGD that model weight decay, momentum, stochasticity, and discretization. We use this framework to derive exact dynamics for meaningful combinations of parameters, which we experimentally verified on large scale neural networks and datasets. Overall, our work provides a first step towards understanding the mechanics of learning in neural networks without unrealistic simplifying assumptions.</p>
<p>For more details check out our ICLR <a href="https://openreview.net/forum?id=q8qLAbQBupm">paper</a> or this seminar <a href="http://www.physicsmeetsml.org/posts/sem_2021_02_24/">presentation</a>!</p>
<h3 id="acknowledgments">Acknowledgments</h3>
<p>We would like to thank our collaborator <a href="https://www.javiersagastuy.com/">Javier Sagastuy-Brena</a> and advisors <a href="https://profiles.stanford.edu/surya-ganguli">Surya Ganguli</a> and <a href="https://web.stanford.edu/~yamins/">Daniel Yamins</a>.
We would also like to thank <a href="https://web.stanford.edu/~meghas/">Megha Srivastava</a> for very helpful feedback on this post.</p>
<div class="footnotes">
<ol>
<li id="fn:saad1995dynamics">
<p>David Saad and Sara Solla. Dynamics of on-line gradient descent learning for multilayer neural networks. Advances in neural information processing systems, 8:302–308, 1995. <a href="#fnref:saad1995dynamics" class="reversefootnote">↩</a></p>
</li>
<li id="fn:saxe2013exact">
<p>Andrew M Saxe, James L McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proc. Natl. Acad. Sci. U. S. A., May 2019. <a href="#fnref:saxe2013exact" class="reversefootnote">↩</a></p>
</li>
<li id="fn:jacot2018neural">
<p>Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018. <a href="#fnref:jacot2018neural" class="reversefootnote">↩</a></p>
</li>
<li id="fn:warming1974modified">
<p>RF Warming and BJ Hyett. The modified equation approach to the stability and accuracy analysis of finite-difference methods. Journal of computational physics, 14(2):159–179, 1974. <a href="#fnref:warming1974modified" class="reversefootnote">↩</a></p>
</li>
<li id="fn:barrett2020implicit">
<p>David GT Barrett and Benoit Dherin. Implicit gradient regularization. arXiv preprint arXiv:2009.11162, 2020. <a href="#fnref:barrett2020implicit" class="reversefootnote">↩</a></p>
</li>
<li id="fn:kovachki2019analysis">
<p>Nikola B Kovachki and Andrew M Stuart. Analysis of momentum methods. arXiv preprint arXiv:1906.04285, 2019. <a href="#fnref:kovachki2019analysis" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Thu, 25 Feb 2021 00:00:00 -0800
Do Language Models Know How Heavy an Elephant Is?
/blog/scalar-probing/
<p>How heavy is an elephant? How expensive is a wedding ring?</p>
<p>Humans have a pretty good sense of <em>scale</em>, or reasonable ranges of these
<em>numeric attributes</em>, of different objects, but do pre-trained language
representations? Although pre-trained Language Models (LMs) like
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1810.04805&sa=D&source=editors&ust=1613552260369000&usg=AOvVaw2sJUKWCZGDMLa3LWoqOEZ7">BERT</a> have
shown a remarkable ability to learn all kinds of knowledge, including
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1909.01066&sa=D&source=editors&ust=1613552260369000&usg=AOvVaw27gyPje50D9HeU8ZaY_8VY">factual
knowledge</a>,
it remains unclear whether their representations can capture these types
of numeric attributes from text alone without explicit training data.</p>
<!-- ![](/assets/img/posts/2021-02-17-scalar-probing/image1.png) -->
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 500px" src="/blog/assets/img/posts/2021-02-17-scalar-probing/image1.png" /></p>
</div></figure>
<p>In our <a href="https://www.google.com/url?q=https://arxiv.org/abs/2010.05345&sa=D&source=editors&ust=1613552260370000&usg=AOvVaw2jns7eFEtJBLkPB-VDrx6F">recent
paper</a>,
we measure the amount of scale information that is captured in several
kinds of pre-trained text representations and show that, although
generally a <strong>significant amount</strong> of such information is captured, there is
still a <strong>large gap</strong> between their current performance and the theoretical
upper bound. We identify that specifically those text representations
that are <strong>contextual</strong> and <strong>good at numerical reasoning</strong> capture scale
better. We also introduce a <strong>new version of BERT</strong>, called <em>NumBERT</em>, with
improved numerical reasoning by <strong>replacing numbers in the pretraining
text corpus with their scientific notation</strong>, which more readily exposes
the magnitude to the model, and demonstrate that NumBERT representations
capture scale significantly better than all of the previous text
representations.</p>
<h1 id="scalar-probing">Scalar Probing</h1>
<p>In order to understand to what extent pre-trained text representations, like
BERT representations, capture scale information, we propose a task
called <em>scalar probing</em>: probing the ability to predict a
<em>distribution</em> over values of a scalar attribute of an object. In this
work, we focus specifically on three kinds of scalar attributes: weight,
length, and price.</p>
<p>Here is the basic architecture of our scalar probing task:</p>
<!-- ![](images/image2.png) -->
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 900px" src="/blog/assets/img/posts/2021-02-17-scalar-probing/image2.png" /></p>
</div></figure>
<p>In this example, we are trying to see whether the representation of
“dog” extracted by a pre-trained encoder can be used to predict/recover
the distribution of the weight of a dog through a linear model. We probe
three baseline language representations:
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1301.3781&sa=D&source=editors&ust=1613552260374000&usg=AOvVaw08p9HhtI6FTvvcqpFd5NDn">Word2vec</a>,
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1802.05365&sa=D&source=editors&ust=1613552260374000&usg=AOvVaw1ngIpQf6a40ItFoq0MM78w">ELMo</a>,
and
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1810.04805&sa=D&source=editors&ust=1613552260374000&usg=AOvVaw1BGbokiyXp_QdBgvlV6B2J">BERT</a>.
Since the latter two are contextual representations that operate on
sentences instead of words, we feed in sentences constructed using fixed
templates. For example, for weight, we use the template “The X is
heavy”, where X is the object of interest.</p>
<p>We explore the kind of probe that predicts a <em>point estimate</em> of the value
and the kind that predicts the <em>full distribution</em>. For predicting a point
estimate, we use a standard linear <strong>R</strong>e<strong>GR</strong>ession (which we denote as “<strong>rgr</strong>”)
trained to predict the log of the median of all values for each object
for the scale attribute under consideration. We predict the log because,
again, we care about the general scale rather than the exact value. The
loss is calculated using the prediction and the log of the median of the
ground-truth distribution. For predicting the full distribution, we use
a linear softmax <strong>M</strong>ulti-<strong>C</strong>lass <strong>C</strong>lassifier (which we denote as “<strong>mcc</strong>”) that produces a
categorical distribution over the 12 orders of magnitude. The
categorical distribution predicted using the representations of NumBERT (our improved
version of BERT, introduced <a href="#numbert">later</a>) is shown as
the orange histogram in the above example.</p>
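<p>To make the two probes concrete, here is a minimal NumPy sketch under several assumptions: the embeddings and targets are random stand-ins for frozen encoder outputs and DoQ statistics, the rgr probe is fit by ordinary least squares, and the mcc probe is trained by gradient descent on cross-entropy against the target distribution. The training details of the actual probes may differ.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
num_objects, embed_dim, num_buckets = 200, 16, 12

# Synthetic stand-ins (assumptions): real inputs would be frozen encoder
# embeddings, and real targets would come from the ground-truth distributions.
embeddings = rng.normal(size=(num_objects, embed_dim))
target_dists = rng.dirichlet(np.ones(num_buckets), size=num_objects)
log_medians = rng.normal(size=num_objects)

def softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probs, targets):
    return -np.mean(np.sum(targets * np.log(probs), axis=1))

# Append a constant column so that both probes include a bias term.
features = np.hstack([embeddings, np.ones((num_objects, 1))])

# rgr: linear regression onto the log of the median value (point estimate).
rgr_weights, *_ = np.linalg.lstsq(features, log_medians, rcond=None)
rgr_predictions = features @ rgr_weights

# mcc: linear softmax classifier over the 12 orders of magnitude (full distribution).
mcc_weights = np.zeros((embed_dim + 1, num_buckets))
initial_loss = cross_entropy(softmax(features @ mcc_weights), target_dists)
for _ in range(500):
    probs = softmax(features @ mcc_weights)
    mcc_weights -= 0.5 * features.T @ (probs - target_dists) / num_objects
mcc_predictions = softmax(features @ mcc_weights)
final_loss = cross_entropy(mcc_predictions, target_dists)

print(initial_loss, final_loss)
```

<p>Both probes are linear in the embedding, so any scale information they recover must already be linearly accessible in the pre-trained representation.</p>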
<p>The ground-truth distributions we use come from the <a href="https://www.google.com/url?q=https://arxiv.org/abs/1906.01327&sa=D&source=editors&ust=1613552260377000&usg=AOvVaw3IFP_sUANrnAsBdvZRbBJV">Distributions over
Quantities</a> (DoQ)
dataset which consists of <em>empirical counts</em> of scalar attribute values
associated with >350K nouns, adjectives, and verbs over 10 different
attributes, <em>automatically extracted</em> from a large web text corpus. Note
that during the construction of the dataset, all units for a certain
attribute are first unified to a single one (e.g.
centimeter/meter/kilometer -> meter) and the numeric values are scaled
accordingly. We convert the collected counts for each object-attribute
pair in DoQ into a <em>categorical distribution over 12 orders of magnitude</em>.
In the above example of the weight of a dog, the ground-truth
distribution is shown as the grey histogram, which is concentrated
around 10-100kg.</p>
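<p>The conversion from raw counts to a distribution over orders of magnitude can be sketched as follows. The bucket range is an assumption: the hypothetical parameter <code>min_exponent</code> sets the lowest order of magnitude, and values outside the 12 buckets are clipped into the end buckets. The example counts for dog weights in kilograms are made up for illustration.</p>

```python
import numpy as np

def magnitude_distribution(values, counts, min_exponent=-3, num_buckets=12):
    """Turn (value, count) pairs into a categorical distribution over orders of magnitude."""
    exponents = np.floor(np.log10(np.asarray(values, dtype=float))).astype(int)
    # Bucket k covers [10^(min_exponent + k), 10^(min_exponent + k + 1)).
    buckets = np.clip(exponents - min_exponent, 0, num_buckets - 1)
    histogram = np.bincount(buckets, weights=counts, minlength=num_buckets)
    return histogram / histogram.sum()

# Made-up counts of dog weights (in kg) as they might be collected from text.
dog_values = [8, 12, 30, 45, 120]
dog_counts = [3, 10, 20, 5, 1]
dog_distribution = magnitude_distribution(dog_values, dog_counts)

mode_bucket = int(dog_distribution.argmax())
print(dog_distribution)
print(mode_bucket)
```

<p>With <code>min_exponent=-3</code>, bucket 4 covers 10 to 100, so the mode of this toy distribution falls in the 10-100kg range.</p>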
<p><strong>The better the predictive performance is across all the object-attribute
pairs we are dealing with, the better the pre-trained representations
encode the corresponding scale information.</strong></p>
<h1 id="numbert"><a name="numbert"></a>NumBERT</h1>
<p>Before looking at the scalar probing results of these different language
representations, let’s also think about what kind of representations might
be good at capturing scale information and how to improve existing LMs
to capture scale better. All of these models are trained using large
online text corpora like Wikipedia, news, etc. How can their
representations pick up scale information from all this text?</p>
<p>Here is a piece of text from the first document I got when I searched
Google for “elephant weight”:</p>
<blockquote>
<p>“…African elephants can range from 5,000 pounds to more than 14,000 pounds (6,350 kilograms)…”</p>
</blockquote>
<p>So it is highly likely that <strong>the learning of scale is partly mediated by
the transfer of scale information from the numbers</strong> (here “5,000”,
“14,000”, etc.) <strong>to nouns</strong> (here “elephants”) and <strong>numeracy</strong>, i.e. the
ability to reason about numbers, <strong>is probably important for representing
scale</strong>!</p>
<p>However, <a href="https://www.google.com/url?q=https://www.aclweb.org/anthology/D19-1534/&sa=D&source=editors&ust=1613552260381000&usg=AOvVaw294kIi9L87A-KaO__fkYLk">previous
work</a> has
shown that existing pre-trained text representations, including BERT,
ELMo, and Word2Vec, are not good at reasoning over numbers. For example,
beyond the magnitude of ~500, they cannot even decode a number from its
word embedding, e.g. embedding(“710”) <script type="math/tex">\nrightarrow</script> 710. Thus, we propose to improve
the numerical reasoning abilities of these representations by replacing
every instance of a number in the LM training data with its <em>scientific
notation</em>, and re-pretraining BERT (which we call <em>NumBERT</em>). This enables
the model to more easily associate objects in the sentence directly with
the <em>magnitude</em> expressed in the <em>exponent</em>, ignoring the relatively
insignificant mantissa.
<!-- ![](images/image4.png) --></p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 900px" src="/blog/assets/img/posts/2021-02-17-scalar-probing/image4.png" /></p>
</div></figure>
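<p>The preprocessing idea can be sketched with a regular expression. This is only an illustration: the output format (a two-decimal mantissa followed by <code>e</code> and the exponent) and the helper names are assumptions, and the exact tokens used when pretraining NumBERT may look different.</p>

```python
import re

# Matches integers with optional thousands separators and an optional decimal part.
NUMBER_PATTERN = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")

def to_scientific(match):
    """Rewrite one matched number as mantissa + 'e' + exponent (assumed format)."""
    value = float(match.group(0).replace(",", ""))
    mantissa, exponent = f"{value:.2e}".split("e")
    return f"{mantissa}e{int(exponent)}"

def rewrite_numbers(text):
    return NUMBER_PATTERN.sub(to_scientific, text)

sentence = ("African elephants can range from 5,000 pounds to more than "
            "14,000 pounds (6,350 kilograms)")
rewritten = rewrite_numbers(sentence)
print(rewritten)
```

<p>After rewriting, the order of magnitude is carried by a single exponent token next to the noun, while the mantissa is rounded to two decimals in this sketch.</p>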
<h1 id="results">Results</h1>
<h3 id="scalar-probing-1">Scalar Probing</h3>
<!-- ![](images/image6.png) -->
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 500px" src="/blog/assets/img/posts/2021-02-17-scalar-probing/image6.png" /></p>
</div></figure>
<p>The above table shows the results of scalar probing on the DoQ data. We
use three evaluation metrics: <em>Accuracy</em>, <em>Mean Squared Error (MSE)</em>, and
<em>Earth Mover’s Distance (EMD)</em>, and we do the experiments in four domains:
<em>Lengths</em>, <em>Masses</em>, <em>Prices</em> and <em>Animal Masses</em> (a subset of Masses). For MSE
and EMD, the best possible score is 0, while we compute a loose <em>upper
bound</em> of accuracy by sampling from the ground-truth distribution and
evaluating against the mode. This upper bound achieves accuracies of
0.570 for lengths, 0.537 for masses, and 0.476 for prices.</p>
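<p>As a rough illustration of why a distributional metric is useful alongside accuracy, the sketch below computes both for distributions over ordered buckets. It assumes unit spacing between buckets and defines accuracy as agreement between the two modes; the function names and toy distributions are hypothetical, and the paper's exact metric definitions may differ.</p>

```python
from itertools import accumulate


def mode_accuracy(predicted, truth):
    """1.0 if the highest-scoring predicted bucket is the ground-truth mode."""
    argmax = lambda dist: max(range(len(dist)), key=dist.__getitem__)
    return float(argmax(predicted) == argmax(truth))


def emd_1d(p, q):
    """Earth Mover's Distance between two distributions over ordered buckets
    with unit spacing: the sum of absolute differences between their CDFs."""
    diffs = (a - b for a, b in zip(p, q))
    return sum(abs(c) for c in accumulate(diffs))


# Toy distributions over four buckets (made up for illustration).
truth = [0.1, 0.6, 0.3, 0.0]
near = [0.0, 0.3, 0.6, 0.1]  # most mass one bucket away from the truth
far = [0.0, 0.0, 0.3, 0.7]   # most mass two buckets away from the truth

for name, pred in [("near", near), ("far", far)]:
    print(name, mode_accuracy(pred, truth), round(emd_1d(pred, truth), 3))
```

<p>Both predictions miss the mode, so accuracy cannot tell them apart, while EMD assigns the smaller distance to the prediction that is closer to the ground truth.</p>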
<p>For the <em>Aggregate</em> baseline, for each attribute, we compute the empirical
distribution over buckets across all objects in the training set, and
use that as the predicted distribution for all objects in the test set.
Compared with this baseline, we can see that the <strong>mcc</strong> probe over the best
text representations captures about <strong>half</strong> (as measured by accuracy) to <strong>a
third</strong> (by MSE and EMD) of the distance to the upper bound mentioned
above, suggesting that <strong>while a significant amount of scalar information
is available, there is a long way to go to support robust commonsense
reasoning</strong>.</p>
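<p>The <em>Aggregate</em> baseline can be sketched in a few lines. This version assumes each training object comes with a normalized distribution over buckets and that objects are weighted equally; both are simplifying assumptions, and the function name is hypothetical.</p>

```python
def aggregate_baseline(train_distributions):
    """Average the per-object bucket distributions of the training set.

    The same averaged distribution is then predicted for every test object,
    so the baseline ignores the identity of the object entirely.
    """
    n = len(train_distributions)
    num_buckets = len(train_distributions[0])
    return [
        sum(dist[b] for dist in train_distributions) / n
        for b in range(num_buckets)
    ]


# Two toy training objects over three buckets (made-up numbers).
train = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]]
print(aggregate_baseline(train))  # [0.5, 0.25, 0.25]
```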
<p>Specifically, <strong>NumBERT representations do consistently better than all
the others</strong> on <em>Earth Mover’s Distance</em> (EMD), which is the <em>most
robust</em> metric because of its <a href="https://www.google.com/url?q=https://ieeexplore.ieee.org/document/710701&sa=D&source=editors&ust=1613552260385000&usg=AOvVaw221Lk2TXvCNo_SHGAj7IN6">better convergence
properties</a> and
<a href="https://www.google.com/url?q=http://proceedings.mlr.press/v97/liu19b.html&sa=D&source=editors&ust=1613552260386000&usg=AOvVaw1Q1SF3K0mlfjt8HERQiUVj">robustness to adversarial perturbations of the data
distribution</a>. <strong>Word2Vec
performs significantly worse than the contextual representations</strong> – even
though the task is <em>noncontextual</em> (since we do not have different
ground-truths for an object occurring in different contexts in our
setting). Also, despite being weaker than BERT on downstream NLP tasks,
<strong>ELMo does better on scalar probing</strong>, consistent with it <a href="https://www.google.com/url?q=https://www.aclweb.org/anthology/D19-1534/&sa=D&source=editors&ust=1613552260387000&usg=AOvVaw366Vf1Or1N_arhzIwSF0a4">being better at
numeracy</a> due
to its <em>character-level tokenization</em>.</p>
<h3 id="zero-shot-transfer">Zero-shot transfer</h3>
<p>Because DoQ is derived heuristically from web text and contains
noise, we also evaluate probes trained on DoQ on two datasets
containing <em>ground-truth labels</em> of scalar attributes:
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1706.03799&sa=D&source=editors&ust=1613552260388000&usg=AOvVaw1TjMr0Kp_kSo377e-Vl7KB">VerbPhysics</a> and
<a href="https://www.google.com/url?q=https://jmcauley.ucsd.edu/data/amazon/&sa=D&source=editors&ust=1613552260388000&usg=AOvVaw006j3ja6jmqXMQh2XejA0G">Amazon Price
Dataset</a>.
The first is a human-labeled dataset of relative comparisons, e.g.
(person, fox, weight, bigger). Predictions for this task are made by
comparing the point estimates for <strong>rgr</strong> and the highest-scoring buckets for
<strong>mcc</strong>. The second is a dataset of empirical distributions of product
prices on Amazon. We retrained a probe on DoQ prices using 12 power-of-4
buckets to support finer-grained predictions.</p>
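<p>The two ingredients above can be sketched as follows: assigning a value to one of 12 power-of-4 buckets, and turning two predicted bucket distributions into a relative comparison. The bucket edges (bucket 0 covers values below 4, the last bucket is open-ended), the label names and both function names are assumptions for illustration, not details taken from the paper.</p>

```python
def bucket_index(value, base=4, num_buckets=12):
    """Index of the power-of-`base` bucket containing `value`.

    Assumed layout: bucket 0 is [0, base), bucket k is [base**k, base**(k+1)),
    and the last bucket absorbs everything larger.
    """
    index, upper = 0, base
    while value >= upper and index < num_buckets - 1:
        upper *= base
        index += 1
    return index


def compare_mcc(dist_a, dist_b):
    """Compare two objects by their highest-scoring predicted buckets."""
    top = lambda dist: max(range(len(dist)), key=dist.__getitem__)
    a, b = top(dist_a), top(dist_b)
    if a > b:
        return "bigger"
    if a < b:
        return "smaller"
    return "similar"


print(bucket_index(20))                                  # 2, since 16 <= 20 < 64
print(compare_mcc([0.1, 0.2, 0.7], [0.6, 0.3, 0.1]))      # bigger
```

<p>For <strong>rgr</strong>, the same comparison would be applied directly to the two point estimates instead of the top buckets.</p>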
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 400px" src="/blog/assets/img/posts/2021-02-17-scalar-probing/image3.png" /></p>
<p><img class="postimage_unpadded" style="max-width: 400px" src="/blog/assets/img/posts/2021-02-17-scalar-probing/image5.png" /></p>
</div></figure>
<p>The results are shown in the tables above. On VerbPhysics (the table on
the top), <strong>rgr</strong>+NumBERT performed best, approaching the performance of
using DoQ as an oracle, though short of <a href="https://www.google.com/url?q=https://www.aclweb.org/anthology/P18-2102/&sa=D&source=editors&ust=1613552260389000&usg=AOvVaw1sQRwGz2TwHxKQUagdmqsf">specialized
models</a> for
this task. Scalar probes trained with <strong>mcc</strong> perform poorly, possibly
because a finer-grained model of the predicted distribution is not useful for
the 3-class comparative task. On the Amazon Price Dataset (the table on
the bottom), which is a full distribution prediction task, <strong>mcc</strong>+NumBERT did
best on both distributional metrics. On both zero-shot transfer tasks,
<strong>NumBERT representations were the best</strong> across all configurations of
metrics/objectives, suggesting that manipulating numeric representations
of the text in the pre-training corpora can significantly improve
performance on scale prediction.</p>
<h1 id="moving-forward">Moving Forward</h1>
<p>In the work above, we introduced a new task called <em>scalar probing</em>, which
measures how much information about the numeric attributes of objects
pre-trained text representations have captured. We found that while
there is a <strong>significant amount of scale information</strong> in object
representations (half to a third of the way to the theoretical upper bound), these
models are <strong>far from achieving common sense scale understanding</strong>. We also
proposed an <strong>improved version of BERT</strong>, called <em>NumBERT</em>, whose
representations <strong>capture scale information significantly better</strong> than all
the previous ones.</p>
<p>Scalar probing opens up new exciting research directions to explore. For
example, lots of work has pre-trained large-scale <em>vision & language
models</em>, like
<a href="https://www.google.com/url?q=https://arxiv.org/abs/1908.02265&sa=D&source=editors&ust=1613552260391000&usg=AOvVaw3-rig6UgNOniW4jV0cJEzz">ViLBERT</a> and
<a href="https://www.google.com/url?q=https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf&sa=D&source=editors&ust=1613552260392000&usg=AOvVaw0FSByZ1nvSs_nkiucRIZ4N">CLIP</a>.
Probing their representations to see how much scale information has been
captured and performing systematic comparisons between them and
representations learned by language-only models can be quite
interesting.</p>
<p>Also, models learning text representations that predict scale better can
have a <strong>great real-world impact</strong>. Consider a web query like:</p>
<blockquote>
<p>“How tall is the tallest building in the world?”</p>
</blockquote>
<p>With a common sense understanding of what a reasonable range of heights
for “building” is, we can detect errors in the current web QA system when there are mistakes in
retrieval or parsing, e.g. when a Wikipedia sentence about a building is
mistakenly parsed as saying it is 19 miles high instead of 19 meters.</p>
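<p>A toy version of such a sanity check might look like the sketch below. The plausible range for building heights is a hard-coded, hypothetical placeholder; in the setting described here it would instead come from the scale distribution predicted for the object.</p>

```python
METERS_PER_MILE = 1609.344

# Hypothetical plausible range (in meters); a real system would derive this
# from the predicted scale distribution for the object.
PLAUSIBLE_HEIGHT_M = {"building": (2.0, 1000.0)}


def is_plausible_height(obj, value, unit):
    """Check a parsed height against the plausible range for `obj`."""
    meters = value * METERS_PER_MILE if unit == "mile" else value
    low, high = PLAUSIBLE_HEIGHT_M[obj]
    return low <= meters <= high


print(is_plausible_height("building", 19, "meter"))  # True
print(is_plausible_height("building", 19, "mile"))   # False: about 30.6 km
```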
<p>Check out the paper <a href="https://www.google.com/url?q=https://arxiv.org/abs/2010.05345&sa=D&source=editors&ust=1613552260393000&usg=AOvVaw1QGVJEuhUKZ9jhfPl06j56">Do Language Embeddings Capture
Scales?</a> by
Xikun Zhang, Deepak Ramachandran, Ian Tenney, Yanai Elazar, and Dan
Roth.</p>
Wed, 17 Feb 2021 00:00:00 -0800Removing Spurious Features can Hurt Accuracy and Affect Groups Disproportionately
/blog/removing-spuriousfeature/
/blog/removing-spuriousfeature/<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS_CHTML"></script>
<p><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/feature.png" /></p>
<h1 id="introduction">Introduction</h1>
<p>Machine learning models are susceptible to learning irrelevant patterns.
In other words, they rely on some spurious features that we humans know
to avoid. For example, assume that you are training a model to predict
whether a comment is toxic on social media platforms. You would expect
your model to predict the same score for similar sentences with
different identity terms. For example, “some people are Muslim” and
“some people are Christian” should have the same toxicity score.
However, as shown in <sup id="fnref:dixon2018measuring"><a href="#fn:dixon2018measuring" class="footnote">1</a></sup>, training a convolutional
neural net leads to a model which assigns different toxicity scores to
the same sentences with different identity terms. Reliance on spurious
features is prevalent among many other machine learning models. For
instance, <sup id="fnref:xiao2020noise"><a href="#fn:xiao2020noise" class="footnote">2</a></sup> shows that state-of-the-art models in object
recognition like ResNet-50 <sup id="fnref:resnet"><a href="#fn:resnet" class="footnote">3</a></sup> rely heavily on background, so
changing the background can also change their predictions.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimagehalf" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image10.png" />
<img class="postimagehalf" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image1.png" />
<em>(Left) Machine learning models assign different toxicity scores to the
same sentences with different identity terms.
(Right) Machine learning models make different predictions on the same
object against different backgrounds.</em></p>
</div></figure>
<blockquote>
<p>Machine learning models rely on spurious features such as background in an image or identity terms in a comment. Reliance on spurious features conflicts with fairness and robustness goals.</p>
</blockquote>
<p>Of course, we do not want our model to rely on such spurious features
due to fairness as well as robustness concerns. For example, a model’s
prediction should remain the same for different identity terms
(fairness); similarly its prediction should remain the same with
different backgrounds (robustness). The first instinct to remedy this
situation would be to try to remove such spurious features, for example,
by masking the identity terms in the comments or by removing the
backgrounds from the images. However, removing spurious features can
lead to drops in accuracy at test time <sup id="fnref:zemel2013learning"><a href="#fn:zemel2013learning" class="footnote">4</a></sup><sup id="fnref:wang2019balanced"><a href="#fn:wang2019balanced" class="footnote">5</a></sup>. In this
blog post, we explore the causes of such drops in accuracy.</p>
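<p>As a concrete picture of the "first instinct" described above, the sketch below masks identity terms in a comment. The term list contains only the two terms from the example sentences and the mask token is arbitrary; both are illustrative assumptions, and a real system would need a far more complete vocabulary.</p>

```python
import re

# Illustrative term list taken from the example sentences in this post.
IDENTITY_TERMS = ["Muslim", "Christian"]
PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, IDENTITY_TERMS)) + r")\b",
    re.IGNORECASE,
)


def mask_identity_terms(comment, mask="[IDENTITY]"):
    """Replace every listed identity term with a single placeholder token."""
    return PATTERN.sub(mask, comment)


a = mask_identity_terms("some people are Muslim")
b = mask_identity_terms("some people are Christian")
print(a)       # some people are [IDENTITY]
print(a == b)  # True
```

<p>After masking, the two sentences are identical, so any model applied to the masked text necessarily gives them the same score.</p>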
<p>There are two natural explanations for accuracy drops:</p>
<ol>
<li>Core (non-spurious) features can be noisy or not expressive enough
so that even an optimal model has to use spurious features to
achieve the best accuracy
<sup id="fnref:khani2020noise"><a href="#fn:khani2020noise" class="footnote">6</a></sup><sup id="fnref:kleinberg2019simplicity"><a href="#fn:kleinberg2019simplicity" class="footnote">7</a></sup><sup id="fnref:credit_blur"><a href="#fn:credit_blur" class="footnote">8</a></sup>.</li>
<li>Removing spurious features can corrupt the core features
<sup id="fnref:zhao2019inherent"><a href="#fn:zhao2019inherent" class="footnote">9</a></sup><sup id="fnref:credit_sport"><a href="#fn:credit_sport" class="footnote">10</a></sup>.</li>
</ol>
<p>One valid question to ask is whether removing spurious features leads to
a drop in accuracy even in the absence of these two reasons. We answer
this question affirmatively in our work recently published at the ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) <sup id="fnref:paper"><a href="#fn:paper" class="footnote">11</a></sup>. Here, we explain our results.</p>
<blockquote>
<p>Removing spurious features can lead to a drop in accuracy even when spurious features are removed properly and core features exactly determine the target!</p>
</blockquote>
<figure class="figure"><div class="figure__main">
<p><img class="postimagehalf" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image14.png" />
<img class="postimagehalf" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image8.png" />
<em>(Left) When core features are not representative (blurred image), the
spurious feature (the background) provides extra information to identify
the object. (Right) Removing spurious features (gender
information) in the sport prediction task has corrupted other core
features (the weights and the bar).</em></p>
</div></figure>
<p>Before delving into our result, we note that understanding the reasons
behind the accuracy drop is crucial for mitigating such drops. Focusing
on the wrong mitigation method fails to address the accuracy drop.</p>
<blockquote>
<p>Before trying to mitigate the accuracy drop resulting from the removal of the spurious features, we must understand the reasons for the drop.</p>
</blockquote>
<table>
<thead>
<tr>
<th> </th>
<th>Previous work</th>
<th>Previous work</th>
<th>This work</th>
</tr>
</thead>
<tbody>
<tr>
<td> </td>
<td><img width="85%" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image18.png" /></td>
<td><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image19.png" /></td>
<td><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image20.png" /></td>
</tr>
<tr>
<td>Removing spurious features causes drops in accuracy because…</td>
<td>core features are noisy and not sufficiently expressive.</td>
<td>spurious features are not removed properly and thus corrupt core features.</td>
<td>a lack of training data causes spurious connections between some features and the target.</td>
</tr>
<tr>
<td>We can mitigate such drops by…</td>
<td>focusing on collecting more expressive features (e.g., high-resolution images)</td>
<td>focusing on more accurate methods for removing spurious features.</td>
<td>focusing on collecting more diverse training data. We show how to leverage unlabeled data to achieve such diversity.</td>
</tr>
</tbody>
</table>
<blockquote>
<p><img style="float: right;" width="30%" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/nut.png" /></p>
<h3 id="this-work-in-a-nutshell"><strong>This work in a nutshell:</strong></h3>
<ul>
<li>We study overparameterized models that fit training data perfectly.</li>
<li>We compare the “core model” that only uses core features (non-spurious) with the “full model” that uses both core features and spurious features.</li>
<li>Using the spurious feature, the full model can fit training data with a smaller norm.</li>
<li>In the overparameterized regime, since the number of training examples is less than the number of features, there are some directions of data variation that are not observed in the training data (unseen directions).</li>
<li>Though both models fit the training data perfectly, they have different “assumptions” for the unseen directions. This difference can lead to
<ul>
<li>Drop in accuracy</li>
<li>Affecting different test distributions (we also call them groups) disproportionately (increasing accuracy in some while decreasing accuracy in others).</li>
</ul>
</li>
</ul>
</blockquote>
<h1 id="noiseless-linear-regression">Noiseless Linear Regression</h1>
<p>Over the last few years, researchers have observed some surprising
phenomena about deep networks that conflict with classical machine
learning. For example, training models to zero training loss leads to
better generalization instead of overfitting <sup id="fnref:double_descent"><a href="#fn:double_descent" class="footnote">12</a></sup>. A line
of work <sup id="fnref:montanari"><a href="#fn:montanari" class="footnote">13</a></sup><sup id="fnref:aditi_michael"><a href="#fn:aditi_michael" class="footnote">14</a></sup> found that these unintuitive
results happen even for simple models such as linear regression if the
number of features is greater than the number of training examples, known
as the overparameterized regime.</p>
<p>Accuracy drops due to the removal of spurious features are also
unintuitive. Classical machine learning tells us that removing spurious
features should decrease generalization error (since these features are,
by definition, irrelevant for the task). Analogous to the mentioned
work, we will explain this unintuitive result in overparameterized
linear regression as well. </p>
<blockquote>
<p>Accuracy drop due to removal of the spurious feature can be explained in overparameterized linear regression.</p>
</blockquote>
<p>Let’s first formalize the noiseless linear regression setup. Recall
that we are going to study a setup in which the target is completely
determined by the core features, and the spurious feature is a single
feature that can be removed perfectly without affecting predictive
performance. Formally, we assume there are \(d\) core features
\(z \in \mathbb{R}^d\) that determine the target \(y \in
\mathbb{R}\) perfectly, i.e., \( y = {\theta^\star}^\top z\).
In addition, we assume there is a single spurious feature \(s\) that
can also be determined by the core features \(s =
{\beta^\star}^\top z\). Note that the spurious feature can have
information about features that determine the target or it can be
completely unrelated to the target (i.e., for all \(i\),
\(\beta^\star_i \theta^\star_i=0\)).</p>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image13.png" />
<em>We consider a setup where target (\(y\)) is a deterministic function
of core features (\(z\)). In addition, there is a spurious feature
(\(s\)) that can also be determined by the core feature. We compare
two models, the core model that only uses \(z\) to predict \(y\) and the full model which uses both \(z\) and \(s\) to predict
\(y\).</em></p>
<p>We consider two models:</p>
<ul>
<li>Core model that only uses the core features \(z\) to predict the
target \(y\), and it is parametrized by
\({\theta^\text{-s}}\). For a data point with core features
\(z\), its prediction is \(\hat y =
{\theta^\text{-s}}^\top z\).</li>
<li>Full model that uses the core features \(z\) and also uses the
spurious feature \(s\), and it is parametrized by
\({\theta^\text{+s}}\) and \(w\). For a data point with
core features \(z\) and a spurious feature \(s\), its
prediction is \(\hat y = {\theta^\text{+s}}^\top z + ws\).</li>
</ul>
<p>In this setup, neither of the two natural causes of an accuracy drop
after removing the spurious feature (depicted in the table
above) applies:</p>
<ol>
<li>The spurious feature \(s\) adds no information about the target
\(y\) beyond what already exists in the core features
\(z\) (reason 1),</li>
<li>Removing \(s\) does not corrupt \(z\) (reason 2).</li>
</ol>
<p>Motivated by recent work in deep learning, which speculates that
gradient descent converges to the minimum-norm solution that fits
training data perfectly <sup id="fnref:gunasekar2017implicit"><a href="#fn:gunasekar2017implicit" class="footnote">15</a></sup>, we consider the
minimum-norm solution. </p>
<ul>
<li>Training data: We assume we have \(n < d\) triples of
\((z_i, s_i, y_i)\)</li>
<li>Test data: We assume core features in the test data are from a
distribution with covariance matrix \(\Sigma =
\mathbb{E}[zz^\top]\) (we use the terms group and test data distribution
interchangeably).</li>
</ul>
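<p>The setup above can be sketched with NumPy. The sketch assumes the training examples are stacked as the rows of a matrix <code>Z</code> and uses the pseudoinverse to obtain the minimum-norm interpolating solution; the random data, dimensions and function names are illustrative assumptions rather than anything from the paper.</p>

```python
import numpy as np


def min_norm_core(Z, y):
    """Minimum L2-norm theta with Z @ theta == y (core model)."""
    return np.linalg.pinv(Z) @ y


def min_norm_full(Z, s, y):
    """Minimum-norm (theta, w) with Z @ theta + w * s == y (full model)."""
    X = np.column_stack([Z, s])  # append the spurious feature as a column
    solution = np.linalg.pinv(X) @ y
    return solution[:-1], solution[-1]


# Synthetic overparameterized problem: n < d (illustrative sizes).
rng = np.random.default_rng(0)
n, d = 5, 20
theta_star = rng.normal(size=d)  # true target parameters
beta_star = rng.normal(size=d)   # true spurious-feature parameters
Z = rng.normal(size=(n, d))
y, s = Z @ theta_star, Z @ beta_star

theta_core = min_norm_core(Z, y)
theta_full, w = min_norm_full(Z, s, y)

print(np.allclose(Z @ theta_core, y), np.allclose(Z @ theta_full + w * s, y))
print(theta_core @ theta_core, theta_full @ theta_full + w**2)
```

<p>Both models interpolate the training data, and the full model never needs a larger norm than the core model, because setting \(w = 0\) recovers the core solution as one feasible choice.</p>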
<p>In this simple setting, one might conjecture that removing the spurious
feature should only help accuracy. However, we show that this is not
always the case. We exactly characterize the test distributions that are
negatively affected by removing spurious features, as well as the ones
that are positively affected by it.</p>
<h1 id="example">Example</h1>
<p>Let’s first look at a simple example with only one training example and
three core features (\(z_1, z_2\) and \(z_3\)). Let the true
parameters be \(\theta^\star =[2,2,2]^\top\), which results in
\(y=2\), and let the spurious feature parameters be \({\beta^\star}
= [1,2,-2]^\top\), which results in \(s=1\).</p>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image11_1.png" /></p>
<p>First, note that the smallest L2-norm vector that can fit the training
data for the core model is \({\theta^\text{-s}}=[2,0,0]\). On
the other hand, in the presence of the spurious feature, the full model
can fit the training data perfectly with a smaller norm by assigning
weight \(1\) for the feature \(s\)
(\(\|{\theta^\text{-s}}\|_2^2 = 4\) while
\(\|{\theta^\text{+s}}\|_2^2 + w^2 = 2 < 4\)).</p>
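<p>These numbers can be checked by hand. With a single training example \(x\) and target \(y\), the minimum-norm weights that fit it are \(y\,x / \|x\|_2^2\). The sketch below applies this to the example, taking the training point to be \(z = [1,0,0]\), which is inferred from \(y=2\), \(s=1\) and the stated core solution rather than given explicitly in the text.</p>

```python
def min_norm_single(x, y):
    """Minimum L2-norm weights whose dot product with the one example x is y."""
    scale = y / sum(v * v for v in x)
    return [scale * v for v in x]


# Training example inferred from y = 2, s = 1 and theta_core = [2, 0, 0].
z, s, y = [1.0, 0.0, 0.0], 1.0, 2.0

theta_core = min_norm_single(z, y)            # core model: uses z only
*theta_full, w = min_norm_single(z + [s], y)  # full model: uses z and s

core_norm = sum(v * v for v in theta_core)
full_norm = sum(v * v for v in theta_full) + w * w

print(theta_core, core_norm)     # [2.0, 0.0, 0.0] 4.0
print(theta_full, w, full_norm)  # [1.0, 0.0, 0.0] 1.0 2.0
```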
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image11_2.png" /></p>
<p>Generally, in the overparameterized regime, since the number of training
examples is less than the number of features, there are some directions
of data variation that are not observed in the training data. In this
example, we do not observe any information about the second and third
features. The core model assigns weight \(0\) to the unseen
directions (weight \(0\) for the second and third features in this
example). However, the non-zero weight for the spurious feature leads to
a different assumption for the unseen directions. In particular, the
full model does not assign weight \(0\) to the unseen directions.
Indeed, by substituting \(s\) with \({\beta^\star}^\top
z\), we can view the full model as not using \(s\) but
implicitly assigning weight \(\beta^\star_2=2\) to the second
feature and \(\beta^\star_3=-2\) to the third feature (unseen
directions at training).</p>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image11_3.png" /></p>
<p>Let’s now look at different examples and the prediction of these two
models:</p>
<p><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image7.png" /></p>
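<p>In this example, removing \(s\) increases the error for a test
distribution with high deviations from zero on the second feature,
where the implicit weight \(2\) matches the true parameter \(\theta^\star_2=2\),
whereas removing \(s\) reduces the error for a test distribution
with high deviations from zero on the third feature, where the implicit
weight \(-2\) has the opposite sign of \(\theta^\star_3=2\).</p>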
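<p>The effect on each unseen direction can be verified directly from the numbers of this example. The sketch below substitutes \(s = {\beta^\star}^\top z\) into the full model to obtain its effective weights on \(z\), then prints the squared error of the core and full models on a unit test point along the second and the third feature. The helper names are hypothetical.</p>

```python
theta_star = [2.0, 2.0, 2.0]   # true target parameters
beta_star = [1.0, 2.0, -2.0]   # true spurious-feature parameters

theta_core = [2.0, 0.0, 0.0]             # minimum-norm core model
theta_full, w = [1.0, 0.0, 0.0], 1.0     # minimum-norm full model

# Substituting s = beta_star . z gives the full model's weights on z alone.
effective_full = [t + w * b for t, b in zip(theta_full, beta_star)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_error(weights, z):
    """Squared difference between the true target and the prediction at z."""
    return (dot(theta_star, z) - dot(weights, z)) ** 2


print(effective_full)  # [2.0, 2.0, -2.0]
for name, z in [("second", [0, 1, 0]), ("third", [0, 0, 1])]:
    print(name, squared_error(theta_core, z), squared_error(effective_full, z))
# second 4.0 0.0
# third 4.0 16.0
```

<p>Each printed row lists the core-model error first and the full-model error second, so the comparison for a given test direction can be read off directly.</p>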
<h1 id="main-result">Main result</h1>
<p>As we saw in the previous example, by using the spurious feature, the
full model incorporates \({\beta^\star}\) into its estimate. The
true target parameter (\(\theta^\star\)) and the true spurious
feature parameters (\({\beta^\star}\)) agree on some of the
unseen directions and do not agree on the others. Thus, depending on
which unseen directions are weighted heavily at test time, removing
\(s\) can increase or decrease the error.</p>
<p>More formally, the weight assigned to the spurious feature is
proportional to the projection of \(\theta^\star\) on
\({\beta^\star}\) on the seen directions. If this number is close
to the projection of \(\theta^\star\) on \({\beta^\star}\)
on the unseen directions (in comparison to 0), removing \(s\)
increases the error, and it decreases the error otherwise. Note that
since we are assuming noiseless linear regression and choose models that
fit training data, the model predicts perfectly in the seen directions
and only variations in unseen directions contribute to the error.</p>
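<p>The proportionality above can be made explicit under the minimum-norm assumption. What follows is a short derivation from the setup in this post, not a formula quoted from the paper; here \(\Pi\) stands for the orthogonal projection onto the seen directions. For a fixed \(w\), the smallest-norm \(\theta^\text{+s}\) that fits the training data is \(\Pi(\theta^\star - w\beta^\star)\), so \(w\) minimizes \(\|\Pi(\theta^\star - w\beta^\star)\|_2^2 + w^2\). Setting the derivative with respect to \(w\) to zero gives</p>
<p>\[ w = \frac{{\beta^\star}^\top \Pi\, \theta^\star}{1 + {\beta^\star}^\top \Pi\, \beta^\star}, \qquad \theta^\text{+s} + w\beta^\star = \Pi\theta^\star + w\,(I-\Pi)\beta^\star . \]</p>
<p>The second expression is the full model's effective weight on \(z\) after substituting \(s = {\beta^\star}^\top z\). Its error relative to \(\theta^\star\) is \((I-\Pi)(\theta^\star - w\beta^\star)\), compared with \((I-\Pi)\theta^\star\) for the core model. In the example above, \(\Pi\) keeps only the first coordinate, so \(w = 2/(1+1) = 1\), matching the weight found there.</p>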
<p><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image6.png" />
<em>(Left) The projection of \(\theta^\star\) on
\(\beta^\star\) is positive in the seen direction, but it is
negative in the unseen direction; thus, removing \(s\) decreases the
error. (Right) The projection of \(\theta^\star\) on
\(\beta^\star\) is similar in both seen and unseen directions;
thus, removing \(s\) increases the error.</em></p>
<blockquote>
<p>The drop in accuracy at test time depends on the relationship between the true target parameter (\(\theta^\star\)) and the true spurious feature parameters (\({\beta^\star}\)) in the seen directions and unseen directions.</p>
</blockquote>
<p>Let’s now formalize the conditions under which removing the spurious
feature (\(s\)) increases the error. Let \(Z \in \mathbb{R}^{n \times d}\)
be the matrix whose rows are the training examples, and let \(\Pi =
Z^\top(ZZ^\top)^{-1}Z\) denote the projection onto the span of the training data (seen
directions); thus \(I-\Pi\) denotes the projection onto the null space of the training data
(unseen directions). The equation below determines when removing the
spurious feature decreases the error.</p>
<p><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image9.png" />
<em>The left side is the difference between the projection of \(\theta^\star\) on \(\beta^\star\) in the seen direction
with their projection in the unseen direction scaled by test time
covariance. The right side is the difference between 0 (i.e., not using
spurious features) and the projection of \(\theta^\star\) on
\(\beta^\star\) in the unseen direction scaled by test time
covariance. Removing \(s\) helps if the left side is greater than
the right side.</em></p>
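<p>To see how the test covariance \(\Sigma\) enters, the sketch below computes the expected squared error \((\theta^\star - \hat\theta)^\top \Sigma\, (\theta^\star - \hat\theta)\) of both models on the three-feature example from earlier. It assumes the training examples are the rows of <code>Z</code>, builds the seen-direction projection as \(Z^\top(ZZ^\top)^{-1}Z\), and obtains the full model from a pseudoinverse; the function names and the two test covariances are illustrative choices.</p>

```python
import numpy as np


def seen_projection(Z):
    """Orthogonal projection onto the span of the training examples (rows of Z)."""
    return Z.T @ np.linalg.inv(Z @ Z.T) @ Z


def fitted_weights(Z, theta_star, beta_star):
    """Effective weights on z of the minimum-norm core and full models."""
    y, s = Z @ theta_star, Z @ beta_star
    core = seen_projection(Z) @ theta_star
    solution = np.linalg.pinv(np.column_stack([Z, s])) @ y
    full = solution[:-1] + solution[-1] * beta_star  # substitute s = beta^T z
    return core, full


def expected_error(weights, theta_star, Sigma):
    """Expected squared error under test covariance Sigma = E[z z^T]."""
    gap = theta_star - weights
    return gap @ Sigma @ gap


Z = np.array([[1.0, 0.0, 0.0]])
theta_star = np.array([2.0, 2.0, 2.0])
beta_star = np.array([1.0, 2.0, -2.0])
core, full = fitted_weights(Z, theta_star, beta_star)

second = np.diag([0.0, 1.0, 0.0])  # test variation only along feature 2
third = np.diag([0.0, 0.0, 1.0])   # test variation only along feature 3
for name, Sigma in [("second", second), ("third", third)]:
    print(name, expected_error(core, theta_star, Sigma),
          expected_error(full, theta_star, Sigma))
```

<p>Because both models fit the seen direction exactly, only the part of \(\Sigma\) that lies in the unseen directions contributes to either error.</p>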
<h1 id="experiments">Experiments</h1>
<p>While the theory applies only to linear models, we now show that in
non-linear models trained on real-world datasets, removing a spurious
feature reduces the accuracy and affects groups disproportionately.</p>
<p>Datasets. We are going to study the CelebA dataset <sup id="fnref:liu2015"><a href="#fn:liu2015" class="footnote">16</a></sup>, which
contains photos of celebrities along with 40 different attributes
(see our paper for the results on the
comment-toxicity-detection and MNIST datasets). We choose wearing
lipstick (indicating if a celebrity is wearing lipstick) as the target
and wearing earrings (indicating if a celebrity is wearing earrings) as
the spurious feature. </p>
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image5.png" /></p>
<p>Note that although wearing earrings is correlated with wearing lipstick,
we expect our model to not change its prediction if we tell the model
the person is wearing earrings.</p>
<p><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image3.png" /></p>
<p>In the CelebA dataset, wearing earrings is correlated with wearing
lipstick. In this dataset, a celebrity who wears earrings is almost
five times more likely to wear lipstick than not to wear
lipstick. Similarly, a celebrity who does not wear earrings is
almost twice as likely not to wear lipstick as to wear
lipstick.</p>
<p>Setup. We train a two-layer neural network with 128 hidden units. We
flatten the picture and concatenate the binary variable of wearing
earrings to it (we tuned a multiplier for it). We also want to know how
much each model relies on the spurious feature. In other words, we want
to know how much the model prediction changes as we change the wearing
earrings variable. We call this attacking the model (i.e., swapping the
value of the binary feature of wearing earrings). We run each experiment
50 times and report the average.</p>
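<p>The attack itself is simple to sketch: flip the binary spurious feature for every example and measure accuracy again. The two toy "models" and the four examples below are made up to show the mechanics; they stand in for the trained networks and the CelebA data, which are not reproduced here.</p>

```python
def accuracy(predict, examples):
    """Fraction of (features, label) pairs that `predict` classifies correctly."""
    return sum(predict(x) == label for x, label in examples) / len(examples)


def attack(examples, spurious_index):
    """Flip the binary spurious feature of every example; labels are unchanged."""
    flipped = []
    for x, label in examples:
        x = list(x)
        x[spurious_index] = 1 - x[spurious_index]
        flipped.append((x, label))
    return flipped


# Toy data: x = [core feature, spurious feature], label = target.
examples = [([1, 1], 1), ([0, 0], 0), ([1, 1], 1), ([0, 1], 0)]
core_model = lambda x: x[0]  # ignores the spurious feature
full_model = lambda x: x[1]  # relies entirely on the spurious feature

attacked = attack(examples, spurious_index=1)
print(accuracy(core_model, examples), accuracy(core_model, attacked))  # 1.0 1.0
print(accuracy(full_model, examples), accuracy(full_model, attacked))  # 0.75 0.25
```

<p>The gap between the clean and attacked accuracy is one way to quantify how much a model relies on the spurious feature.</p>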
<p><img class="postimage_75" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image12.png" /></p>
<p>Results. The diagram below shows the accuracy of the different models, and
their accuracies when they are attacked. Note that, because our attack
focuses on the spurious feature, the core model’s accuracy will remain
the same.</p>
<p><img class="postimage" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image16.png" /></p>
<p>Removing the wearing-earrings feature decreases the overall accuracy. The
change in accuracy is not uniform across groups. The
accuracy decreases in the group where people wear neither
lipstick nor earrings and in the group where they wear both lipstick and
earrings. On the other hand, accuracy increases for the groups that wear
only one of them.</p>
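<p>A per-group breakdown like this one can be computed by keying accuracy on the (lipstick, earrings) pair. The sketch below uses a handful of made-up records to show the bookkeeping; the function name and the data are hypothetical and do not reproduce the reported numbers.</p>

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Accuracy per (lipstick, earrings) group.

    `records` holds (lipstick, earrings, prediction) triples, where lipstick
    is the target and earrings is the spurious feature.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for lipstick, earrings, prediction in records:
        group = (lipstick, earrings)
        total[group] += 1
        correct[group] += int(prediction == lipstick)
    return {group: correct[group] / total[group] for group in total}


# Made-up predictions for illustration only.
records = [(1, 1, 1), (1, 1, 1), (0, 0, 0), (0, 1, 1), (0, 1, 0), (1, 0, 0)]
print(accuracy_by_group(records))
# {(1, 1): 1.0, (0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.0}
```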
<p>Let’s break down the diagram and analyze each section.</p>
<table>
<tbody>
<tr>
<td><img width="2000" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image4.png" /></td>
<td>All celebrities together: the accuracy is a reasonable 82%. The overall accuracy drops by 1% when we remove the spurious feature (core model accuracy). The full model relies heavily on the spurious feature, thus attacking the full model leads to a ~17% drop in overall accuracy.</td>
</tr>
<tr>
<td><img width="2000" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image2.png" /></td>
<td>The celebrities who follow the stereotype (people who do not have earrings or lipstick, and people who wear both) have a good accuracy overall (both above 85%). The accuracy of both groups drops when we remove the wearing-earrings feature (i.e., core model accuracy). Using the spurious feature helps their accuracy, thus attacking the full model leads to a ~30% drop in their accuracy.</td>
</tr>
<tr>
<td><img width="2000" src="/blog/assets/img/posts/2021-1-24-removing-spuriousfeature/image15.png" /></td>
<td>The celebrities who do not follow the stereotypes have a very low accuracy; this is especially bad for people who only wear earrings (33% accuracy in comparison to the average of 85%). Removing the wearing-earrings feature increases their accuracy substantially. Using the spurious feature does not help their accuracy, thus attacking the full model does not change accuracy for these groups.</td>
</tr>
</tbody>
</table>
<blockquote>
<p> In non-linear models trained on real-world datasets, removing a spurious feature reduces the accuracy and affects groups disproportionately.</p>
</blockquote>
<h1 id="qa-other-results">Q&A (Other results):</h1>
<p><strong>I know about my problem setting, and I am certain that disjoint features
determine the target and the spurious feature (i.e., for all \(i\),
\(\theta^\star_i\beta^\star_i=0\)). Can I be sure that my
model will not rely on the spurious feature, and removing the spurious
feature definitely reduces the error?</strong> No! Actually, for any
\(\theta^\star\) and \({\beta^\star}\), we can construct a
training set and two test sets with \(\theta^\star\) and
\({\beta^\star}\) as the true parameters and the spurious feature
parameter, such that removing the spurious feature reduces the error in
one but increases the error in the other one (see Corollary 1 in our
paper).</p>
<p><strong>I am collecting a balanced dataset such that the spurious feature and
the target are completely independent (i.e., \(p[y,s]= p[y]p[s]\)).
Can I be sure that my model will not rely on the spurious feature, and
removing the spurious feature definitely reduces the error?</strong>
No! For any
\(S \in \mathbb{R}^n\) and \(Y \in \mathbb{R}^n\), we can
generate a training set and two test sets with \(S\) and \(Y\)
as their spurious feature and targets, respectively, such that removing
the spurious feature reduces the error in one but increases the error in
the other (see Corollary 2 in our paper).</p>
<p><strong>What happens when we have many spurious features?</strong> Good question! Let’s
say \(s_1\) and \(s_2\) are two spurious features. We show
that:</p>
<ol>
<li>Removing \(s_1\) makes the model more sensitive to
\(s_2\), and</li>
<li>If a group has high error because of the new assumption about unseen
direction enforced by using \(s_2\), then it will have an even
higher error by removing \(s_1\).
(See Proposition 3 in our paper).</li>
</ol>
<p><strong>Is it possible to have the same model (a model with the same assumptions
on unseen directions as the full model) without relying on the spurious
feature (i.e., be robust against the spurious feature)?</strong> Yes! You can
recover the same model as the full model without relying on the spurious
feature via robust self-training and unlabeled data (See Proposition 4).</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this work, we first showed that overparameterized models are
incentivized to use spurious features in order to fit the training data
with a smaller norm. Then we demonstrated how removing these spurious
features altered the model’s assumption on unseen directions.
Theoretically and empirically, we showed that this change could hurt the
overall accuracy and affect groups disproportionately. We also proved
that robustness against spurious features (or error reduction by
removing the spurious features) cannot be guaranteed by any condition
on the target and spurious feature alone. Consequently, balanced datasets do
not guarantee a robust model and practitioners should consider other
features as well. Studying the effect of removing noisy spurious
features is an interesting future direction.</p>
<h1 id="acknowledgement">Acknowledgement</h1>
<p>I would like to thank Percy Liang, Jacob Schreiber and Megha Srivastava for their useful comments. The images in the introduction are from <sup id="fnref:xiao2020noise2"><a href="#fn:xiao2020noise2" class="footnote">17</a></sup><sup id="fnref:credit_gay_straight"><a href="#fn:credit_gay_straight" class="footnote">18</a></sup> <sup id="fnref:credit_blur2"><a href="#fn:credit_blur2" class="footnote">19</a></sup><sup id="fnref:credit_sport2"><a href="#fn:credit_sport2" class="footnote">20</a></sup>.</p>
<div class="footnotes">
<ol>
<li id="fn:dixon2018measuring">
<p>Dixon, Lucas, et al. “Measuring and mitigating unintended bias in text classification.” Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. 2018. <a href="#fnref:dixon2018measuring" class="reversefootnote">↩</a></p>
</li>
<li id="fn:xiao2020noise">
<p>Xiao, Kai, et al. “Noise or signal: The role of image backgrounds in object recognition.” arXiv preprint arXiv:2006.09994 (2020). <a href="#fnref:xiao2020noise" class="reversefootnote">↩</a></p>
</li>
<li id="fn:resnet">
<p>He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. <a href="#fnref:resnet" class="reversefootnote">↩</a></p>
</li>
<li id="fn:zemel2013learning">
<p>Zemel, Rich, et al. “Learning fair representations.” International Conference on Machine Learning. 2013. <a href="#fnref:zemel2013learning" class="reversefootnote">↩</a></p>
</li>
<li id="fn:wang2019balanced">
<p>Wang, Tianlu, et al. “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. <a href="#fnref:wang2019balanced" class="reversefootnote">↩</a></p>
</li>
<li id="fn:khani2020noise">
<p>Khani, Fereshte, and Percy Liang. “Feature Noise Induces Loss Discrepancy Across Groups.” International Conference on Machine Learning. PMLR, 2020. <a href="#fnref:khani2020noise" class="reversefootnote">↩</a></p>
</li>
<li id="fn:kleinberg2019simplicity">
<p>Kleinberg, Jon, and Sendhil Mullainathan. “Simplicity creates inequity: implications for fairness, stereotypes, and interpretability.” Proceedings of the 2019 ACM Conference on Economics and Computation. 2019. <a href="#fnref:kleinberg2019simplicity" class="reversefootnote">↩</a></p>
</li>
<li id="fn:credit_blur">
<p>photo from Torralba, Antonio. “Contextual priming for object detection.” International journal of computer vision 53.2 (2003): 169-191. <a href="#fnref:credit_blur" class="reversefootnote">↩</a></p>
</li>
<li id="fn:zhao2019inherent">
<p>Zhao, Han, and Geoff Gordon. “Inherent tradeoffs in learning fair representations.” Advances in neural information processing systems. 2019. <a href="#fnref:zhao2019inherent" class="reversefootnote">↩</a></p>
</li>
<li id="fn:credit_sport">
<p>photo from Wang, Tianlu, et al. “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations.” Proceedings of the IEEE International Conference on Computer Vision. 2019. <a href="#fnref:credit_sport" class="reversefootnote">↩</a></p>
</li>
<li id="fn:paper">
<p>Khani, Fereshte, and Percy Liang. “Removing Spurious Features can Hurt Accuracy and Affect Groups Disproportionately.” arXiv preprint arXiv:2012.04104 (2020). <a href="#fnref:paper" class="reversefootnote">↩</a></p>
</li>
<li id="fn:double_descent">
<p>Nakkiran, Preetum, et al. “Deep double descent: Where bigger models and more data hurt.” arXiv preprint arXiv:1912.02292 (2019). <a href="#fnref:double_descent" class="reversefootnote">↩</a></p>
</li>
<li id="fn:montanari">
<p>Hastie, Trevor, et al. “Surprises in high-dimensional ridgeless least squares interpolation.” arXiv preprint arXiv:1903.08560 (2019). <a href="#fnref:montanari" class="reversefootnote">↩</a></p>
</li>
<li id="fn:aditi_michael">
<p>Raghunathan, Aditi, et al. “Understanding and mitigating the tradeoff between robustness and accuracy.” arXiv preprint arXiv:2002.10716 (2020). <a href="#fnref:aditi_michael" class="reversefootnote">↩</a></p>
</li>
<li id="fn:gunasekar2017implicit">
<p>Gunasekar, Suriya, et al. “Implicit regularization in matrix factorization.” 2018 Information Theory and Applications Workshop (ITA). IEEE, 2018. <a href="#fnref:gunasekar2017implicit" class="reversefootnote">↩</a></p>
</li>
<li id="fn:liu2015">
<p>Liu, Ziwei, et al. “Deep learning face attributes in the wild.” Proceedings of the IEEE international conference on computer vision. 2015. <a href="#fnref:liu2015" class="reversefootnote">↩</a></p>
</li>
<li id="fn:xiao2020noise2">
<p>Xiao, Kai, et al. “Noise or signal: The role of image backgrounds in object recognition.” arXiv preprint arXiv:2006.09994 (2020). <a href="#fnref:xiao2020noise2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:credit_gay_straight">
<p>Garg, Sahaj, et al. “Counterfactual fairness in text classification through robustness.” Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. 2019. <a href="#fnref:credit_gay_straight" class="reversefootnote">↩</a></p>
</li>
<li id="fn:credit_blur2">
<p>photo from Torralba, Antonio. “Contextual priming for object detection.” International journal of computer vision 53.2 (2003): 169-191. <a href="#fnref:credit_blur2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:credit_sport2">
<p>photo from Wang, Tianlu, et al. “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations.” Proceedings of the IEEE International Conference on Computer Vision. 2019. <a href="#fnref:credit_sport2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 24 Jan 2021 00:00:00 -0800Blue People v. City of Ney
/blog/Bluepeoplevs.Neycity/
/blog/Bluepeoplevs.Neycity/<script type="text/javascript" src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.0/MathJax.js?config=TeX-AMS_CHTML"></script>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image8.jpg" /></p>
</div></figure>
<h1 id="introduction">Introduction</h1>
<p>Discriminatory behavior towards certain groups by machine learning (ML) models is especially concerning in critical applications such as hiring. This blog post explains one source of discrimination: the reliance of ML models on different groups’ data distributions. We will show that when ML models use noisy features (which are pervasive in the real world, e.g., exam scores), they’re incentivized to devalue a good candidate from a lower-performing group. This blog post is based on:</p>
<p><em>Fereshte Khani and Percy Liang, “Feature Noise Induces Loss Discrepancy
Across Groups.” International Conference on Machine Learning. PMLR, 2020</em></p>
<p>The findings are illustrated by reviewing the hiring process in the
fictitious city of Ney, where recently a group of people has accused the
government of discrimination.</p>
<h1 id="hiring-people-in-ney">Hiring people in Ney</h1>
<p>The government of Ney wants to hire qualified people. Each person in Ney has a skill level drawn from a normal distribution with mean \(\mu\) and standard deviation
\(\sigma_\text{skill}\). A person is qualified if their skill level is greater than 0 and non-qualified
otherwise. For example, Alice, with a skill level of 2, is
qualified, but Bob, with a skill level of -1, is not.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image13.png" />
<em>The skill level of the people in Ney is normally distributed with a mean of \(\mu\) and a standard deviation of \(\sigma_\text{skill}\).</em></p>
</div></figure>
<p>To assess people’s skills, the government created an exam. The exam score is a noisy indicator of the applicant’s skill since it cannot capture the true skill of a person (e.g., the same applicant would score differently on different versions of the SAT). In the city of Ney, exam noise is nice and simple: if an individual has skill \(z\), then their
score is distributed as \(\mathcal{N} (z,
\sigma_\text{noise}^2)\),
where \(\sigma_\text{noise}^2\) indicates the variance of noise
on the exam.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image11.png" />
<em>The exam score of an individual with a skill of \(z\) is a random variable normally distributed with a mean of \(z\) and a standard deviation of \(\sigma_\text{noise}\).</em></p>
</div></figure>
<p>The government wants to choose a threshold \(\tau\), and hire all
people whose exam scores are greater than \(\tau\). There are two
kinds of errors that the government can make:</p>
<ol>
<li>Not hiring a qualified person (\(z > 0 \land x \le \tau\))</li>
<li>Hiring a non-qualified person (\(z \le 0 \land x > \tau\))</li>
</ol>
<p>For simplicity, let’s assume the government cares about these two types
of errors equally and wants to minimize the overall error, i.e., the
number of non-qualified hired people plus the number of qualified
non-hired people.</p>
<script type="math/tex; mode=display">\begin{align}
\text{Error} = \mathbb{E}\left[[z>0] \neq [x > \tau]\right]\\
\end{align}</script>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image4.png" />
<em>The government’s goal is to find a cut-off threshold such that it minimizes the error.</em></p>
</div></figure>
<p>Given all exam scores and knowledge of the skill distribution of the people,
what cut-off threshold should the government use to minimize the error (the above equation)?
Is it a good strategy for the government to simply use 0 as the
threshold and hire all individuals with scores greater than zero?</p>
<p>Let’s consider an example where the skill distribution
is \(\mathcal{N}(-1,1)\), and the exam noise
has a standard deviation of \(\sigma_\text{noise}=1\). The following lines of code plot
the average error for various thresholds for this example. As
illustrated, 0 is not the best threshold to use. In fact, in this
example, a threshold of \(\tau=1\) leads to minimum error.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image1.png" />
<em>A simple example with \(\mu=-1\) and \(\sigma_\text{skill}=\sigma_\text{noise}=1\). As shown on the right, accepting individuals with a score higher than \(0\) does not result in the minimum error.</em></p>
</div></figure>
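<p>The error curve for this example can also be reproduced numerically. The snippet below is a minimal sketch, not the code behind the figure: it integrates the error over the skill \(z\) with a midpoint rule, using only the generative model described above. The function name <code>expected_error</code>, the integration range, the step count, and the threshold grid are all illustrative choices.</p>

```python
from statistics import NormalDist

def expected_error(tau, mu=-1.0, sigma_skill=1.0, sigma_noise=1.0, steps=4000):
    """Error = P([z > 0] != [x > tau]), computed by integrating over the skill z."""
    prior = NormalDist(mu, sigma_skill)      # distribution of true skill z
    noise = NormalDist(0.0, sigma_noise)     # exam noise, so x = z + noise
    lo, hi = mu - 8 * sigma_skill, mu + 8 * sigma_skill   # assumed integration range
    dz = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * dz              # midpoint of the i-th cell
        p_reject = noise.cdf(tau - z)        # P(x <= tau | z)
        # A qualified person (z > 0) is an error if rejected; otherwise if hired.
        wrong = p_reject if z > 0 else 1.0 - p_reject
        total += prior.pdf(z) * wrong * dz
    return total

thresholds = [i / 10 for i in range(-20, 41)]   # -2.0, -1.9, ..., 4.0
errors = [expected_error(t) for t in thresholds]
best_tau = thresholds[errors.index(min(errors))]
```

<p>On this grid, <code>best_tau</code> should land at or right next to 1, and <code>expected_error(0.0)</code> should come out noticeably larger than <code>expected_error(1.0)</code>.</p>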
<blockquote>
<blockquote>
<h4 id="the-government-wants-to-minimize-the-number-of-hired-people-with-negative-skill-levels--the-number-of-non-hired-people-with-positive-skill-levels-hiring-all-people-with-positive-exam-scores-a-noisy-indicator-of-the-skill-is-not-optimal">The government wants to minimize the number of hired people with negative skill levels + the number of non-hired people with positive skill levels. Hiring all people with positive exam scores (a noisy indicator of the skill) is not optimal.</h4>
</blockquote>
</blockquote>
<p>If 0 is not always the optimal threshold, then what is the optimal
threshold for minimizing error for different values of \(\mu,
\sigma_\text{skill}\) and \(\sigma_\text{noise}\)?
Generally, given a person’s exam score (\(x\)) and the skill level distribution (\(\mathbb{P}(z)\)), what can we infer
about their real skill (\(z\))? Here is where Bayesian inference
comes in.</p>
<h1 id="bayesian-inference-">Bayesian inference </h1>
<p>Let’s see what we can infer about a person’s skill given their exam score and knowing the skill level distribution
\(\mathbb{P} (z)\) (known as the <em>prior distribution</em> since it shows the prior over a person’s skill). Using Bayes rule, we can calculate \(\mathbb{P} (z|x)\) (known as the <em>posterior distribution</em> since it shows the distribution over a person’s skill after observing their score).</p>
<p>Let’s first consider two extreme cases:</p>
<ol>
<li>If the exam is completely precise
(i.e., \(\sigma_\text{noise}=0\)), then the exam score is
the exact indicator of a person’s skill (irrespective of the prior
distribution).</li>
<li>If the exam is pure noise (i.e., \(\sigma_\text{noise}
\rightarrow \infty\)), then the exam score is meaningless, and
the best estimate for a person’s skill is the average
skill \(\mu\) (irrespective of the exam score).</li>
</ol>
<p>Intuitively, when the noise variance has a value between \(0\) and \(\infty\), the best estimate of a person’s skill is a number
between their exam score (\(x\)) and the average skill
(\(\mu\)). The figure below shows the standard formulation of the
posterior distribution \(\mathbb{P} (z \mid x)\) after observing
an exam score (\(x_0\)). For more details on how to derive this
formula, see
<a href="https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf">these notes</a>.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image3.png" />
<em>Posterior distribution of a person’s skill after observing their exam score (\(x_0\)).</em></p>
</div></figure>
<p>Based on this formula (and as hypothesized), depending on the amount of noise, \(\mathbb{E} [z\mid x]\) is a number between \(x\) and \(\mu\).</p>
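<p>As a rough illustration, the posterior can be written as a small helper. This is a sketch under the post's assumptions (normal prior, normal exam noise); the function name <code>posterior</code> and the example numbers are hypothetical. The mean is a weighted average of the prior mean and the score, and the variance is the standard result for two Gaussians.</p>

```python
from statistics import NormalDist

def posterior(x, mu, sigma_skill, sigma_noise):
    """P(z | x) for a prior z ~ N(mu, sigma_skill^2) and a score x ~ N(z, sigma_noise^2)."""
    var_skill, var_noise = sigma_skill ** 2, sigma_noise ** 2
    # Weight on the observed score: 1 for a perfect exam, near 0 for pure noise.
    weight_on_score = var_skill / (var_skill + var_noise)
    mean = (1.0 - weight_on_score) * mu + weight_on_score * x
    std = (var_skill * var_noise / (var_skill + var_noise)) ** 0.5
    return NormalDist(mean, std)

balanced = posterior(x=2.0, mu=-1.0, sigma_skill=1.0, sigma_noise=1.0)
precise = posterior(x=2.0, mu=-1.0, sigma_skill=1.0, sigma_noise=0.0)
very_noisy = posterior(x=2.0, mu=-1.0, sigma_skill=1.0, sigma_noise=1e6)
```

<p>With equal skill and noise variances, the estimate sits halfway between the score and the average skill (<code>balanced.mean</code> is 0.5). The two extreme cases recover the score itself (<code>precise.mean</code>) and, approximately, the average skill (<code>very_noisy.mean</code>).</p>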
<blockquote>
<blockquote>
<h4 id="an-applicants-expected-skill-level-is-between-their-exam-score-and-the-average-skill-among-ney-people-if-the-exam-is-noisier-it-is-closer-to-the-average-skill-if-the-exam-is-more-precise-it-is-closer-to-the-exam-score">An applicant’s expected skill level is between their exam score and the average skill among Ney people. If the exam is noisier, it is closer to the average skill; if the exam is more precise, it is closer to the exam score.</h4>
</blockquote>
</blockquote>
<h1 id="optimal-threshold">Optimal threshold</h1>
<p>Now that we have exactly characterized the posterior distribution
(\(\mathbb{P} (z \mid x)\)), the government can find the optimal
threshold. For any exam score \(x\), if the government hires people
with score \(x\), it incurs \(\mathbb{P}(z \le 0 \mid x) \)
error (the probability of hiring a non-qualified person). On the other hand,
if it does not hire people with score \(x\), it
incurs \(\mathbb{P}(z > 0 \mid x)\) error (the probability of
not hiring a qualified person). Thus, in order to minimize the error, the
government should hire a person iff \(\mathbb{P} (z > 0 \mid x) >
\mathbb{P}(z \le 0 \mid x)\). Since the posterior distribution is
normal, and therefore symmetric around its mean, more than half of its mass
lies above 0 exactly when its mean does, so the government should hire an applicant
iff \(\mathbb{E}[z \mid x] > 0\).</p>
<p>Using the formulation in the previous section, we have:</p>
<script type="math/tex; mode=display">\begin{align}\mu \frac{\sigma_\text{noise}^2}{\sigma_\text{noise}^2 +
\sigma_\text{skill}^2} + x
\frac{\sigma_\text{skill}^2}{\sigma_\text{skill}^2 +
\sigma_\text{noise}^2} > 0 \iff x > -\mu
\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}
\end{align}</script>
<p>Therefore, the optimal threshold is:</p>
<script type="math/tex; mode=display">\bbox[5px, border: 2px solid grey]{
\text{optimal threshold} = -\mu\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}
}</script>
<p>In our running example with average skill \(\mu=-1\)
and \(\sigma_\text{skill} = \sigma_\text{noise}=1\), the optimal threshold is 1.
The figure below shows how the optimal threshold varies according
to \(\mu\) and \(\sigma_\text{noise}\).
As \(\sigma_\text{noise}\) increases or \(\mu\) decreases,
the optimal threshold moves farther away from \(0\).</p>
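<p>The boxed formula is short enough to evaluate directly. The helper below is an illustrative sketch; the name <code>optimal_threshold</code> and the swept values are assumptions chosen for the example.</p>

```python
def optimal_threshold(mu, sigma_skill, sigma_noise):
    """Score cut-off at which the posterior mean E[z | x] crosses zero."""
    return -mu * sigma_noise ** 2 / sigma_skill ** 2

# Running example from the post: mu = -1, sigma_skill = sigma_noise = 1.
running_example = optimal_threshold(mu=-1.0, sigma_skill=1.0, sigma_noise=1.0)  # 1.0

# Fixed mu = -1, increasingly noisy exam.
by_noise = [optimal_threshold(-1.0, 1.0, s) for s in (0.0, 0.5, 1.0, 2.0)]
# -> [0.0, 0.25, 1.0, 4.0]

# Fixed noise, increasingly negative average skill.
by_mean = [optimal_threshold(m, 1.0, 1.0) for m in (-0.5, -1.0, -2.0)]
# -> [0.5, 1.0, 2.0]
```

<p>Both sweeps move the threshold away from 0, and a noise-free exam gives a threshold of exactly 0.</p>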
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image5.png" />
<em>(left) The optimal threshold increases as the average of the prior distribution decreases (with a fixed exam noise \(\sigma_\text{noise} > 0\)). (right) The optimal threshold increases if the exam noise increases (with a fixed average skill \(\mu < 0\)). Note that, if exam scores are not noisy or the average skill is zero, then the optimal threshold is zero.</em></p>
</div></figure>
<blockquote>
<blockquote>
<h4 id="as-exams-become-more-noisy-or-the-average-skill-becomes-more-negative-the-optimal-threshold-moves-further-away-from-0">As exams become more noisy or the average skill becomes more negative, the optimal threshold moves further away from 0.</h4>
</blockquote>
</blockquote>
<h1 id="what-does-machine-learning-have-to-do-with-all-of-this">What does machine learning have to do with all of this?</h1>
<p>So far, we precisely identified the optimal cut-off threshold given the
exact knowledge of \(\mu, \sigma_\text{skill}\),
and \(\sigma_\text{noise}\). But how can the government find the
optimal threshold using observational data? This is where machine
learning (ML) comes into the picture.
Let’s imagine very favorable conditions. Let’s assume everyone (an infinite number of them!) takes the exam, the government hires all of them and observes their true skills. Further, assume the modeling assumption is perfectly correct (i.e., both the true prior distribution and conditional distribution are normal). What would happen if the government trains a model with an infinite number of \((x,z)\)
pairs?</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_50" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image6.png" />
<em>The government has collected lots of data and now wants to use ML models to predict the best threshold that minimizes the error.</em></p>
</div></figure>
<p>Before delving into this, we would like to note that in real-world
scenarios, we do not have infinite data (finite data issues); the
government does not hire everyone (selection bias issues), and the true
skill is not perfectly observable (target noise/biases issues).
Furthermore, the modeling assumptions are often incorrect (model
misspecification issues). Each of these issues may affect the model
adversely; however, in this blog post our goal is to analyze the model
decisions when none of these issues exist. In the next section, we will show that discrimination occurs even under these ideal conditions.</p>
<p>Under these very favorable conditions and with the right loss function,
machine learning algorithms can perfectly predict \(\mathbb{E} [z
\mid x]\) from \(x\) and can therefore find the optimal threshold
that minimizes the error. The following few lines of Python code show
how linear regression and logistic regression fit the data. In this
example, we set \(\mu = -1,
\sigma_\text{skill}=\sigma_\text{noise}=1\), and as shown in
the figure on the right, the cut-off threshold predicted by the model is
one, which matches the optimal threshold derived previously.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image2.png" />
<em>A simple example along with the predicted cut-off
threshold for linear and logistic regression. The predicted cut-off
threshold results in the minimum error, as previously discussed.</em></p>
</div></figure>
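<p>The linear-regression half of this experiment can be approximated with the standard library alone. The sketch below is not the code used for the figure: it draws a large but finite sample of \((x, z)\) pairs, fits \(z \approx a x + b\) by ordinary least squares in closed form, and reads off the score at which the fitted line crosses zero. The sample size, seed, and function name are illustrative assumptions, and logistic regression is omitted to keep the example dependency-free.</p>

```python
import random

def fit_linear_threshold(mu=-1.0, sigma_skill=1.0, sigma_noise=1.0,
                         n=100_000, seed=0):
    """Fit z ~ slope * x + intercept on simulated data; return the zero crossing."""
    rng = random.Random(seed)
    zs = [rng.gauss(mu, sigma_skill) for _ in range(n)]     # true skills
    xs = [z + rng.gauss(0.0, sigma_noise) for z in zs]      # noisy exam scores
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    cov_xz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    slope = cov_xz / var_x                    # closed-form least-squares slope
    intercept = mean_z - slope * mean_x
    threshold = -intercept / slope            # score where the predicted skill is 0
    return slope, intercept, threshold

slope, intercept, threshold = fit_linear_threshold()
```

<p>With these parameters, the fitted slope and intercept should be close to 0.5 and -0.5, which puts the learned threshold close to the optimal value of 1, up to sampling noise.</p>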
<blockquote>
<blockquote>
<h4 id="under-very-favorable-conditions-machine-learning-models-find-the-optimal-threshold-which-is-a-function-of-average-skill-exam-noise-and-skill-variance-among-people">Under very favorable conditions, machine learning models find the optimal threshold, which is a function of average skill, exam noise, and skill variance among people.</h4>
</blockquote>
</blockquote>
<h1 id="optimal-thresholds-for-different-groups">Optimal thresholds for different groups</h1>
<p>So far, we have shown how to calculate the optimal threshold and
illustrated that ML models also recover this threshold. Let’s now
analyze the optimal threshold when different groups exist in the
population. There are two kinds of people in the city of Ney: blue and red. The
blue people’s skills are normally distributed centered
on \(\mu_\text{blue}\), and the red people’s skills are normally
distributed centered on \(\mu_\text{red}\). The standard deviation for
both groups is \(\sigma_\text{skill}\). There can be various
reasons for disparities between groups, for example historically blue
people might not have been allowed to attend school.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image9.png" />
<em>In Ney, people are divided into two groups: blue and red. The blue people have a lower average skill level than the red people.</em></p>
</div></figure>
<p>First of all, let’s see what happens if the exam is completely precise. As
previously discussed, in this case the optimal threshold is 0 for
both groups, independent of their distributions. Thus, both groups are
held to the same standard, and the error for the government is 0.</p>
<blockquote>
<blockquote>
<h4 id="if-there-is-no-noise-in-the-exam-then-zero-is-the-optimal-threshold-for-both-groups-and-leads-to-zero-error">If there is no noise in the exam, then zero is the optimal threshold for both groups and leads to zero error.</h4>
</blockquote>
</blockquote>
<p>Now let’s analyze the case where the exam is noisy
(\(\sigma_\text{noise} > 0\)). As discussed in the prior
sections, the optimal threshold depends on the average of the prior
distribution, so the optimal threshold differs between the blue and red
groups. Therefore, if the government knows the demographic information,
it can lower the error by classifying the
groups separately. In particular, the
government can calculate the optimal threshold for blue and red people
using Bayesian inference.</p>
<script type="math/tex; mode=display">\begin{align}
\text{Red Threshold} = -\mu_\text{red} \frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2} \quad \quad \text{Blue Threshold} = -\mu_\text{blue}\frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}
\end{align}</script>
<blockquote>
<blockquote>
<h4 id="people-in-a-group-that-has-lower-average-skills-need-to-pass-a-higher-bar-for-hiring-not-only-do-blue-people-need-to-overcome-other-associated-effects-of-being-in-a-group-with-lower-average-skills-they-also-need-to-pass-a-higher-bar-to-get-hired---------">People in a group that has lower average skills need to pass a higher bar for hiring! Not only do blue people need to overcome other associated effects of being in a group with lower average skills, they also need to pass a higher bar to get hired. </h4>
</blockquote>
</blockquote>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image7.png" />
<em>The cut-off threshold for hiring is higher for blue people in comparison to the red people.</em></p>
</div></figure>
<p>As stated, the government uses a higher threshold for people in a group
with a lower average skill! Consider two individuals with the same skill
level but from different groups. The blue person is less likely to get
hired by the government than the red person. Surprisingly, blue people
who are already in a group with a lower average skill (which probably
affects their confidence and society’s view of them) need to also pass a
higher bar to get hired!</p>
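<p>To make the comparison concrete, the sketch below computes the chance that a person with a given true skill clears their own group's threshold. The group means (<code>MU_RED</code>, <code>MU_BLUE</code>), the skill level of 0.5, and the function name are hypothetical values chosen for illustration; the exam noise must be strictly positive for the calculation to apply.</p>

```python
from statistics import NormalDist

def hire_probability(skill, group_mean, sigma_skill=1.0, sigma_noise=1.0):
    """P(score > group threshold | true skill), using the group's optimal threshold."""
    threshold = -group_mean * sigma_noise ** 2 / sigma_skill ** 2
    score = NormalDist(skill, sigma_noise)    # exam score given the true skill
    return 1.0 - score.cdf(threshold)

MU_RED, MU_BLUE = 1.0, -1.0                   # hypothetical group means

# Two equally skilled (and qualified) applicants, one from each group.
p_red = hire_probability(skill=0.5, group_mean=MU_RED)
p_blue = hire_probability(skill=0.5, group_mean=MU_BLUE)
```

<p>Under these assumed numbers, the red applicant is hired with probability of roughly 0.93 and the blue applicant with probability of roughly 0.31, even though their skills are identical.</p>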
<p>Finally, note that the gap between thresholds for the different groups
grows as the noise increases.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-20-Bluepeoplevs-Neycity/image12.png" />
<em>As the exam noise increases, the gap between the optimal thresholds among different groups widens. Blue people need to get a better score than red people on the exam to get hired.</em></p>
</div></figure>
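<p>The widening gap follows directly from subtracting the two group thresholds given earlier; this is a one-line consequence of those formulas rather than an additional result:</p>
<script type="math/tex; mode=display">\begin{align}
\text{Blue Threshold} - \text{Red Threshold} = (\mu_\text{red} - \mu_\text{blue}) \frac{\sigma_\text{noise}^2}{\sigma_\text{skill}^2}
\end{align}</script>
<p>Since \(\mu_\text{red} > \mu_\text{blue}\), the gap is positive whenever the exam is noisy, grows quadratically in \(\sigma_\text{noise}\), and vanishes only when the exam is noise-free.</p>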
<blockquote>
<blockquote>
<h4 id="a-blue-person-has-a-lower-chance-of-getting-hired-in-comparison-with-a-red-person-with-the-same-skill">A blue person has a lower chance of getting hired in comparison with a red person with the same skill.</h4>
</blockquote>
</blockquote>
<h1 id="conclusion">Conclusion</h1>
<p>We examined the discriminatory effect of relying on noisy features. When ML models use noisy features, they’re naturally incentivized to devalue a good score when the candidate in question comes from an overall lower-performing group. Note that noisy features are prevalent in any real-world application (here, we assumed that noise is the same among all individuals, but it’s usually worse for disadvantaged groups). Ideally, we would like to improve the features to better reflect a candidate’s skill/potential or make the features more closely approximate the job requirements. If that’s not possible, it’s important to be conscious that the “optimal decision” is to discriminate, and we should adjust our process (e.g., hiring) in acknowledgment that group membership can shade an individual’s evaluation.</p>
<hr />
<h1 id="frequently-asked-questions">Frequently asked questions</h1>
<h5 id="can-we-just-remove-the-group-membership-information-so-the-model-treats-individuals-from-both-groups-similarly"><strong>Can we just remove the group membership information, so the model treats individuals from both groups similarly?</strong></h5>
<p>Unlike this example where group membership is a removable feature,
real-world datasets are more complex. Usually, datasets contain many
features such that the group membership can be predicted from them
(recall that ML models benefit from predicting group membership since it
lowers error). Thus, it is not obvious how to remove group membership in
these datasets. See
[<a href="http://proceedings.mlr.press/v28/zemel13.pdf">1</a>,<a href="https://arxiv.org/pdf/1707.00075.pdf">2</a>,<a href="https://arxiv.org/abs/1907.00020">3</a>]
for some efforts on removing group information.</p>
<h5 id="why-should-we-treat-these-two-groups-similarly-when-their-distributions-are-inherently-different-utilizing-group-membership-information-reduces-error-overall-and-for-both-groups"><strong>Why should we treat these two groups similarly when their distributions are inherently different? Utilizing group membership information reduces error overall and for both groups!</strong></h5>
<p>Fairness in machine learning usually studies the impact of ML algorithms
on groups according to protected attributes such as sex, sexual
orientation, race, etc. Usually, there has been some discrimination
towards these groups throughout history, which leads to huge disparities
among their distributions. For example, women (because of their sex)
were not allowed to go to universities. Thus, these disparities are not
inherent and could (and probably should!) change over time. For
instance, see women in the labor force
[<a href="https://www.dol.gov/agencies/wb/data/facts-over-time/women-in-the-labor-force%23civilian-labor-force-by-sex">4</a>].</p>
<p>Another reason to avoid relying on disparities among protected groups in
models is feedback loops. Feedback loops might exacerbate distributional
disparities among protected groups over time (e.g., few women get
accepted → self-doubt among women increases → women perform
worse on the exam → fewer women get accepted, and so on). For
instance, see
[<a href="https://arxiv.org/abs/1806.08010">5</a>]
and
[<a href="https://arxiv.org/abs/1706.09847">6</a>].</p>
<p>Finally, note that although the government’s objective may be to minimize the
error by weighting the costs of hiring non-qualified candidates and not hiring
qualified candidates equally, it is not clear whether a group’s
objectives should be the same. For example, a group might be worse off
as a result of the government not hiring its qualified members than as a result of
the government hiring its non-qualified members (for example, in
settings where the lack of minority role models in higher-level
positions leads to a lower perceived sense of belonging in other members
of a group). Thus, using group membership to minimize the error is not
necessarily the most beneficial outcome for a group, and depending on
the context we might need to minimize other objectives.</p>
<h5 id="what-about-other-notions-of-fairness-in-machine-learning"><strong>What about other notions of fairness in machine learning?</strong></h5>
<p>In this blog post, we studied the ML model’s prediction for two similar individuals (here, with the same \(z\)) from different groups (blue vs. red). This is referred to as the counterfactual notion of fairness. Another common notion, the statistical notion of fairness, looks at the groups as a whole and compares their incurred error (it is also common to compare the error incurred by the qualified members of different groups, known as equal opportunity [<a href="https://arxiv.org/pdf/1610.02413.pdf">7</a>]). Statistical and counterfactual notions of fairness are independent of each other, and satisfying one does not guarantee satisfying the other. Feature noise also causes a trade-off between these two notions of fairness, which is beyond this blog post’s scope. See our paper [<a href="https://arxiv.org/abs/1911.09876">8</a>] for critiques regarding these two notions and the effect of feature noise on statistical notions of fairness.</p>
<h1 id="acknowledgement">Acknowledgement</h1>
<p>I would like to thank Percy Liang, Megha Srivastava, Frieda Rong, Rishi Bommasani, Yeganeh Alimohammadi, and Michelle Lee for their useful comments.</p>
Sun, 20 Dec 2020 00:00:00 -0800A Model-Based Approach Towards Identifying the Brain's Learning Algorithms
/blog/lr-identify/
/blog/lr-identify/<h3 id="introduction"><strong>Introduction</strong></h3>
<p>One of the tenets of modern neuroscience is that the brain modifies the
strengths of its synaptic connections (“weights”) during learning in
order to better adapt to its environment. However, the underlying
learning rules (“weight updates”) in the brain are currently unknown.
Many proposals have been suggested. At one end are Hebbian-style
mechanisms, which seem biologically plausible but are not very effective
as learning algorithms because they prescribe purely local changes: the
weight between two neurons increases only if they activate
together. At the other end is backpropagation, which is effective from a learning
perspective because it assigns credit to neurons along the entire downstream
path from outputs to inputs, but has numerous biologically implausible
elements.</p>
<p>A major long-term goal of computational neuroscience is to identify
which learning rules actually drive learning in the brain. A further
difficulty is that we do not even have strong ideas for what needs to be
measured in the brain to quantifiably assert that one learning rule is
more consistent with those measurements than another learning rule. So
how might we approach these issues? We take a simulation-based approach,
meaning that experiments are done on artificial neural networks rather
than real brains. We train over a thousand artificial neural networks
across a wide range of possible learning rule types (conceived of as
“optimizers”), system architectures, and tasks, where the ground truth
learning rule is known, and quantify the impact of these choices. Our
work suggests that recording activities from several hundred neurons,
measured semi-regularly during learning, may provide a good basis to
identify learning rules, a testable hypothesis within reach of
current neuroscience tools!</p>
<h3 id="background-a-plethora-of-theories-and-a-paucity-of-evidence"><strong>Background: A Plethora of Theories and a Paucity of Evidence</strong></h3>
<p>The brain modifies the connections between neurons during learning to
improve behavior; however, the underlying rules that govern these
modifications are unknown. The most famous proposed learning rule is
“Hebbian learning”, also known by the mantra “neurons that fire
together, wire together”. In this proposal, a synaptic connection
strengthens if one neuron (“pre-synaptic”) consistently sends a signal
to another neuron (“post-synaptic”). The changes prescribed by Hebbian
learning are “local” in that they do not take into account a synapse’s
influence further downstream in the network. This locality makes
learning rather slow even in the cases where additional issues, such as
the weight changes becoming arbitrarily large, are mitigated. Though
there have been many suggested theoretical strategies to deal with this
problem, commonly involving simulations with artificial neural networks
(ANNs), these strategies appear difficult to scale up to solve
large-scale tasks such as ImageNet categorization
[<a href="https://arxiv.org/abs/1807.04587">1</a>].</p>
<p>This property of local changes is in stark contrast to backpropagation,
the technique commonly used to optimize artificial neural networks. In
backpropagation, as the name might suggest, an error signal is
propagated backward along the entire downstream path from the outputs of
a model to the inputs of the model. This allows credit to be effectively
assigned to every neuron along the path.</p>
<p>Although backpropagation has long been a standard component of deep
learning, its plausibility as a <em>biological</em> learning rule (i.e. how the
brain modifies the strengths of its synaptic connections) is called into
question for several reasons. Chief among them is that backpropagation
requires perfect symmetry, whereby the backward error-propagating
weights are the transpose of the forward inference weights, for which
there is currently little biological support
[<a href="https://www.sciencedirect.com/science/article/pii/S0364021387800253">2</a>,
<a href="https://www.nature.com/articles/337129a0">3</a>].</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-09-lr-identify/weight_symmetry.gif" /></p>
<figcaption>
<b>Avoiding weight symmetry.</b> Backpropagation naturally couples the
forward and backward weights. This constraint can be relaxed by
uncoupling them, thereby generating a spectrum of learning rule
hypotheses about how the backward weights may be updated.
For more details, see our recent <a href="https://arxiv.org/abs/2003.01513">prior work</a>.
</figcaption>
</div></figure>
<p>Recent approaches, from us and others
[<a href="https://arxiv.org/abs/1904.05391">4</a>,
<a href="https://arxiv.org/abs/2003.01513">5</a>], introduce approximate
backpropagation strategies that do not require this symmetry, and can
still succeed at large-scale learning as backpropagation does. However,
given the number of proposals, a natural question to ask is how
realistic they are. At the moment, our hypotheses are governed by domain
knowledge that specifies what “can” and “cannot” be biologically
plausible (e.g. “exact weight symmetry is likely not possible” or
“separate forward and backward passes during learning seem
implausible”), as well as characterizations of ANN task performance
under a given learning rule (which is not always directly measurable
from animal behavior). To answer this
question, we need to be able to empirically <em>refute</em> hypotheses. In
other words, we would ideally want to know what biological data to
collect in order to claim that one hypothesis is more likely than
another.</p>
<p>More concretely, we can ask: what specific measurements from the brain,
in the form of individual activation patterns over time, synaptic
strengths, or paired-neuron input-output relations, would allow one to
draw quantitative comparisons of whether the observations are more
consistent with one or another specific learning rule? For example,
suppose we record neural responses (“activation patterns”) while an
animal is learning a task. Would these data be sufficient to enable us
to broadly differentiate between learning rule hypotheses, e.g. by
reliably indicating that one learning rule’s changes over time more
closely match the changes measured from real data than those prescribed
by another learning rule?</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-09-lr-identify/neuron_schematic.gif" /></p>
<figcaption>
Some potential observables to measure on which to separate candidate
learning rule hypotheses. (Pyramidal neuron schematic adapted from Figure
4 of [<a href="https://www.nature.com/articles/s41583-020-0277-3">6</a>])
</figcaption>
</div></figure>
<p>Answering this question turns out to be a substantial challenge, because
it is difficult on purely theoretical grounds to identify which patterns
of neural changes arise from given learning rules, without also knowing
the overall network connectivity and reward target (if any) of the
learning system.</p>
<p>But there may be a silver lining. While ANNs consist of units that are
highly simplified with respect to biological neurons, progress
within the past few years has shown that the internal representations that
emerge in trained deep ANNs often overlap strongly with representations
in the brain, and are in fact quantifiably similar to many
neurophysiological and behavioral observations in animals
[<a href="https://www.nature.com/articles/s41593-019-0520-2">7</a>]. For
instance, task-optimized, deep convolutional neural networks (CNNs) have
emerged as quantitatively accurate models of encoding in primate visual
cortex [<a href="https://www.pnas.org/content/111/23/8619">8</a>,
<a href="https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003915">9</a>,
<a href="https://www.jneurosci.org/content/35/27/10005">10</a>]. This is
due to (1) their cortically-inspired architecture, a cascade of
spatially-tiled linear and nonlinear operations; and (2) their being
optimized to perform certain behaviors that animals must perform to
survive, such as object recognition
[<a href="https://www.nature.com/articles/nn.4244">11</a>]. CNNs trained
to recognize objects on ImageNet predict neural responses of primate
visual cortical neurons better than any other model class. Thus, these
models are, at the moment, some of our best algorithmic
“theories” of the brain -- a system that was ultimately not designed by
us, but rather the product of millions of years of evolution. On the
other hand, ANNs <em>are</em> designed by us -- so the ground truth learning
rule is known and every unit (artificial “neuron”) can be measured up to
machine precision.</p>
<p>Can we marry what we can measure in neuroscience with what we can
conclude from machine learning, in order to identify what experimentally
measurable observables may be most useful for inferring the underlying
learning rule? If we can’t do this in our models, then it seems very
unlikely that we could do this in the real brain. But if we can do this
in principle, then we are in a position to generate predictions as to
what data to collect, and whether that is even within reach of current
experimental neuroscience tools.</p>
<h3 id="methods"><strong>Methods</strong></h3>
<p>We adopt a two-stage “virtual experimental” approach. In the first
stage, we train ANNs with different learning rules, across a variety of
architectures, tasks, and associated hyperparameters. These will serve
as our “model organisms” on which we will subsequently perform idealized
neuroscience measurements. In the second stage, we calculate aggregated
statistics (“measurements”) from each layer of the models as features
from which to train simple classifiers that classify the category that a
given learning rule belongs to (specified below). These classifiers
include the likes of a linear SVM, as well as simple non-linear ones
such as a Random Forest and a 1D convolutional two-layer perceptron.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/approach_schematic.png" /></p>
<figcaption>
<b>Overall approach.</b> Observable statistics are generated from each
neural network's layer, through the model training process for each
learning rule. We take a quantitative approach whereby a classifier is
cross-validated and trained on a subset of these trajectories and
evaluated on the remaining data.
</figcaption>
</div></figure>
<p>Generating a large-scale dataset is crucial to this endeavor, in order
to both emulate a variety of experimental neuroscience scenarios and be
able to derive robust conclusions from them. Thus, in the first stage,
we train ANNs on tasks and architectures that have been shown to explain
variance in neural responses from sensory (visual and auditory)
brain areas [<a href="https://www.pnas.org/content/111/23/8619">8</a>,
<a href="https://www.sciencedirect.com/science/article/pii/S0896627318302502?via%3Dihub">12</a>].
These include <em>supervised</em> tasks across vision and audition, as well as
<em>self-supervised</em> ones. We consider both shallow and deep feedforward
architectures on these tasks, that are of depth comparable to what is
considered reasonable from the standpoint of shallower non-primate (e.g.
mouse
[<a href="https://www.nature.com/articles/s41586-019-1716-z">13</a>]) and
deeper primate sensory systems
[<a href="https://www.pnas.org/content/111/23/8619">8</a>,
<a href="https://arxiv.org/abs/1807.00053">14</a>,
<a href="https://www.biorxiv.org/content/10.1101/407007v2.full">15</a>].</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/table.png" /></p>
<figcaption>
The learning rules, tasks, architectures, and hyperparameters from which
we generate data, comprising over a thousand training experiments in total.
</figcaption>
</div></figure>
<p>In the second stage, we train classifiers on the observable statistics from these ANNs to predict the learning rules (as specified in the table above) used to train them.
The four learning rules were chosen as they span the space of commonly
used variants of backpropagation (<a href="http://proceedings.mlr.press/v28/sutskever13.pdf">SGDM</a> and <a href="https://arxiv.org/abs/1412.6980">Adam</a>), as well as potentially
more biologically-plausible “local” learning rules (<a href="https://arxiv.org/abs/1411.0247">Feedback
Alignment (FA)</a> and <a href="https://arxiv.org/abs/2003.01513">Information Alignment (IA)</a>) that efficiently
train networks at scale to varying degrees of performance but avoid exact weight
symmetry.</p>
<p>Because the primary aim of this study is to determine the extent to which
different learning rules lead to different encodings within ANNs, we
begin by defining representative features that can be drawn from the
course of model training. For each layer in a model, we consider three
measurements: <em>weights</em> of the layer, <em>activations</em> from the layer, and
<em>layer-wise activity change</em> of a given layer’s outputs relative to its
inputs. We choose ANN weights to analogize to synaptic strengths in the
brain, activations to analogize to post-synaptic firing rates, and
layer-wise activity changes to analogize to paired measurements that
involve observing the change in post-synaptic activity with respect to
changes induced by pre-synaptic input.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/statistics.gif" /></p>
<figcaption>
Defining observable statistics.
</figcaption>
</div></figure>
<p>For each measure, we consider three functions applied to it: “identity”,
“absolute value”, and “square”. Finally, for each function of the
weights and activations, we consider seven statistics, and for the
layer-wise activity change observable, we only use the mean statistic
due to computational restrictions. This results in a total of 45
continuous-valued observable statistics for each layer, though 24
observable statistics are ultimately used for training the classifiers,
since we remove any statistic that has a divergent value during the
course of model training. We also use a ternary indicator of layer
position in the model hierarchy: “early”, “middle”, or “deep”
(represented as a one-hot categorical variable).</p>
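<p>The count of 45 follows from the description above: three functions times seven statistics for each of the weights and activations, plus three functions times one statistic for the layer-wise activity change. The sketch below checks that arithmetic and applies the three functions to a toy measure. The seven individual statistics are not listed in this post, so only their count is used. The helper names and toy values are assumptions for illustration.</p>

```python
from statistics import mean

# The three functions named in the text, applied element-wise to a measure.
FUNCTIONS = {
    "identity": lambda x: x,
    "absolute value": abs,
    "square": lambda x: x * x,
}

# Number of statistics per function, as stated in the text. The seven
# individual statistics are not named in this post, so only the count appears.
STATS_PER_MEASURE = {
    "weights": 7,
    "activations": 7,
    "layer-wise activity change": 1,  # mean only
}

def count_observable_statistics():
    """Count the continuous-valued statistics available per layer."""
    return sum(len(FUNCTIONS) * n for n in STATS_PER_MEASURE.values())

def mean_statistics(values):
    """Compute the mean of each function of a measure (toy illustration)."""
    return {name: mean(fn(v) for v in values) for name, fn in FUNCTIONS.items()}

total = count_observable_statistics()   # 3*7 + 3*7 + 3*1
toy_weights = [-2.0, -1.0, 1.0, 2.0]    # hypothetical layer weights
toy_means = mean_statistics(toy_weights)
```

<p>On the toy weights, the identity mean is zero while the absolute-value and square means are not, which hints at why several functions of the same measure can carry different information.</p>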
<h3 id="we-can-separate-learning-rules-from-aggregate-statistics-of-the-weights-activations-or-layer-wise-activity-changes"><strong>We Can Separate Learning Rules from Aggregate Statistics of the Weights, Activations, or Layer-wise Activity Changes</strong></h3>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/example.png" /></p>
<figcaption>
Across tasks, different learning rules give rise to perceptible
differences in observable statistics.
</figcaption>
</div></figure>
<p>Already by eye, one can pick up distinctive differences across the
learning rules for each of the training trajectories of these metrics.
Of course, this is not systematic enough to clearly judge one set of
observables versus another, but provides some initial assurance that
these metrics seem to capture some inherent differences in learning
dynamics across rules.</p>
<p>So these initial observations seem promising, but we want to make this
approach more quantitative. Suppose for each layer we concatenate the
trajectories of each observable and the position in the model hierarchy
that this observable came from. Can we generalize well across held-out
examples?</p>
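<p>The concatenation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the study's pipeline: the observable names, the trajectory values, and the helper functions are all hypothetical.</p>

```python
LAYER_POSITIONS = ("early", "middle", "deep")

def one_hot_position(position):
    """Encode the layer's position in the hierarchy as a one-hot vector."""
    return [1.0 if position == name else 0.0 for name in LAYER_POSITIONS]

def build_feature_vector(trajectories, position):
    """Concatenate observable trajectories with the layer-position code.

    trajectories: dict mapping an observable statistic's name to its list
    of values over training. Names are sorted so that every layer yields
    features in the same order.
    """
    features = []
    for name in sorted(trajectories):
        features.extend(trajectories[name])
    features.extend(one_hot_position(position))
    return features

# Hypothetical trajectories for one layer, sampled at three training steps.
layer_trajectories = {
    "weights_mean": [0.10, 0.08, 0.05],
    "activations_mean": [1.20, 0.90, 0.70],
}
x = build_feature_vector(layer_trajectories, "middle")
```

<p>Each layer of each trained model then contributes one such vector, labeled with the learning rule that produced it.</p>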
<p>It turns out that the answer is, in fact, yes. Across all classes of
observables, the Random Forest attains the highest test accuracy, and
all observable measures perform similarly under this classifier.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/conf_mats.png" /></p>
<figcaption>
<b>Test set confusion matrices.</b> Random Forest performs the best and differences in learning rate policy
(Adam vs. SGDM) are more difficult to distinguish.
</figcaption>
</div></figure>
<p>Looking at confusion matrices on the test set, we see that the Random
Forest rarely mistakes one learning rule for any of the others. And
when the classifiers do make mistakes, they generally tend to confuse
Adam vs. SGDM more so than IA vs. FA, suggesting that they are able to
pick up more on differences (reflected in the observable statistics) due
to the high-dimensional direction of the gradient tensor than on those due
to its magnitude (the latter being directly tied to learning rate
policy).</p>
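<p>For readers unfamiliar with confusion matrices, the sketch below shows how one is tallied for the four learning rule labels. The labels in the example are made up purely to exercise the functions and are not results from the study.</p>

```python
RULES = ("SGDM", "Adam", "FA", "IA")

def confusion_matrix(true_labels, predicted_labels, classes=RULES):
    """Return counts[i][j]: examples of class i predicted as class j."""
    index = {name: i for i, name in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(true_labels, predicted_labels):
        counts[index[truth]][index[guess]] += 1
    return counts

def accuracy(counts):
    """Fraction of examples on the diagonal of the confusion matrix."""
    correct = sum(counts[i][i] for i in range(len(counts)))
    total = sum(sum(row) for row in counts)
    return correct / total

# Made-up labels purely to exercise the functions; not results from the study.
y_true = ["SGDM", "SGDM", "Adam", "Adam", "FA", "FA", "IA", "IA"]
y_pred = ["SGDM", "Adam", "Adam", "SGDM", "FA", "FA", "IA", "IA"]
cm = confusion_matrix(y_true, y_pred)
acc = accuracy(cm)
```

<p>Off-diagonal entries in the SGDM and Adam rows correspond to the kind of confusion discussed above.</p>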
<h3 id="adding-back-some-experimental-neuroscience-realism"><strong>Adding Back Some Experimental Neuroscience Realism</strong></h3>
<p>Up until this point, we have had access to all input types, the full learning trajectory, and noiseless access to all units when making our virtual measurements of ANN observable statistics.
But in a real experiment where someone were to
collect such data from a neural circuit, the situation would be far from
this ideal scenario. We therefore explore experimental realism in
several ways, in order to identify which observable measures are robust
across these scenarios.</p>
<h4 id="access-to-only-portions-of-the-learning-trajectory-subsampling-observable-trajectories"><strong><em>Access to only portions of the learning trajectory: subsampling observable trajectories</em></strong></h4>
<p>The results presented thus far were obtained with access to the entire
learning trajectory of each model. Often, however, an experimentalist
collects data throughout learning at regularly spaced intervals. We
capture this variability by randomly sampling a fixed number of points
at a fixed temporal spacing for each trajectory, which we refer to as a
“subsample period”.</p>
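<p>One reading of this subsampling scheme is sketched below: choose a random starting point, then take a fixed number of samples spaced one subsample period apart. The random choice of start, the function name, and the toy trajectory are assumptions made for illustration.</p>

```python
import random

def subsample_trajectory(trajectory, num_points, period, rng):
    """Pick num_points values spaced `period` steps apart.

    The starting index is drawn at random so that the sampled window can
    fall anywhere along the trajectory (an assumption of this sketch).
    """
    span = (num_points - 1) * period  # steps covered by the samples
    if span >= len(trajectory):
        raise ValueError("trajectory too short for this subsampling")
    start = rng.randrange(len(trajectory) - span)
    return [trajectory[start + i * period] for i in range(num_points)]

# Hypothetical trajectory: one observable recorded at 100 training steps.
trajectory = list(range(100))
rng = random.Random(0)
dense = subsample_trajectory(trajectory, num_points=5, period=1, rng=rng)
sparse = subsample_trajectory(trajectory, num_points=5, period=20, rng=rng)
```

<p>Both <code>dense</code> and <code>sparse</code> contain five samples, but the sparse version spans most of the trajectory while the dense one covers only a short window.</p>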
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/sparse_subsampling.png" /></p>
<figcaption>
Sparse subsampling across learning trajectory is most robust to
trajectory undersampling.
</figcaption>
</div></figure>
<p>We find across observable measures that robustness to undersampling of
the trajectory is largely dependent on the subsample period length. As
the subsample period length increases (in the middle and right-most
columns), the Random Forest classification performance increases
compared to the same number of sampled points for a smaller period
(depicted in the left-most column).</p>
<p>Taken together, these results suggest that data consisting of
measurements collected temporally further apart across the learning
trajectory is more robust to undersampling than data collected closer
together in training time. Furthermore, across individual observable
measures, the weights are overall the most robust to undersampling of
the trajectory, but with sufficiently frequent samples we can achieve
comparable performance with the activations.</p>
<h4 id="incomplete-and-noisy-measurements-subsampling-units-and-gaussian-noise-before-collecting-observables"><strong><em>Incomplete and noisy measurements: subsampling units and Gaussian noise before collecting observables</em></strong></h4>
<p>The aggregate statistics computed from the observable measures thus far
have operated under the idealistic assumption of noiseless access to
every unit in the model. However, in most datasets, there is a
significant amount of unit undersampling as well as non-zero measurement
noise. How do these two factors affect learning rule identification, and
in particular, how robust to noise and subsampling are particular observable
measures?</p>
<p>Addressing this question would provide insight into the types of
experimental neuroscience paradigms that may be most useful for
identifying learning rules, and predict how certain experimental tools
may fall short for given observables. For instance, optical imaging
techniques can use fluorescent indicators of electrical activities of
neurons to give us simultaneous access to thousands of neurons.
But these techniques can have lower temporal resolution and signal-to-noise than
electrophysiological recordings that more directly measure the
electrical activities of neurons, which in turn may lack the same
coverage.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_100" src="/blog/assets/img/posts/2020-12-09-lr-identify/subsample_noise.png" /></p>
<figcaption>
<b>Activations are the most robust to measurement noise and unit
undersampling.</b> Reported here is Random Forest test set accuracy in
separating IA vs. FA, averaged over 10 train/test splits per random
sampling and simulated measurement noise seed.
</figcaption>
</div></figure>
<p>To account for these tradeoffs, we model measurement noise as an
additive white Gaussian noise process added to units of ResNet-18
trained on the ImageNet and self-supervised SimCLR tasks. We choose IA
vs. FA since the differences between them are conceptually stark: IA
imposes dynamics on the feedback error weights during learning, whereas
FA keeps them fixed. If there are scenarios of measurement noise and
unit subsampling where we are at chance accuracy for this problem (50%),
then it may establish a strong constraint on learning rule
separability more generally.</p>
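<p>The two degradations described here, unit subsampling and additive white Gaussian noise, can be sketched in a few lines. This is a simplified illustration rather than the study's implementation. The layer values, the fraction of units kept, and the noise level are hypothetical.</p>

```python
import random
from statistics import mean

def noisy_unit_subsample(units, fraction, noise_std, rng):
    """Keep a random fraction of units and add Gaussian noise to each.

    units: list of per-unit values (e.g. activations) at one time point.
    fraction: share of units the simulated experiment can record from.
    noise_std: standard deviation of the additive white Gaussian noise.
    """
    num_kept = max(1, round(fraction * len(units)))
    kept = rng.sample(units, num_kept)
    return [value + rng.gauss(0.0, noise_std) for value in kept]

# Hypothetical layer of 1000 units with made-up activation values.
rng = random.Random(0)
layer = [rng.uniform(0.0, 1.0) for _ in range(1000)]

clean_mean = mean(layer)
measured = noisy_unit_subsample(layer, fraction=0.1, noise_std=0.5, rng=rng)
measured_mean = mean(measured)  # statistic computed from the degraded data
```

<p>Observable statistics are then computed from the degraded values instead of the full, noiseless layer, and the classifier is trained on those.</p>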
<p>Our results suggest that if one makes experimental measurements by
imaging synaptic strengths, it is still crucial that the optical imaging
readout not be very noisy, since even with the number of units typically
recorded currently (on the order of several hundred to several thousand
synapses), a noisy imaging strategy of synaptic strengths may be
rendered ineffective.</p>
<p>Instead, current electrophysiological techniques that measure the
activities from hundreds of units could form a good set of neural data
to separate learning rules. Recording more units with these techniques
can improve learning rule separability from the activities, but it does
not seem necessary, at least in this setting, to record a majority of
units to perform this separation effectively.</p>
<h3 id="conclusions"><strong>Conclusions</strong></h3>
<p>As experimental techniques in neuroscience continue to advance, we will
be able to record data from more neurons with higher temporal
resolution. But even if we had the perfect measurement tools, it is not
clear ahead of time what should be measured in order to identify the
learning rule(s) operative within a given neural circuit, or whether
this is even possible in principle. Our model-based approach
demonstrates that we can identify learning rules <em>solely</em> on the basis of
standard types of experimental neuroscience measurements from the
weights, activations, or layer-wise activity changes, without knowledge
of the architecture or loss target of the learning system.</p>
<p>Additionally, our results suggest the following prescription for the type of
experimental neuroscience data to be collected towards this goal:</p>
<p><strong>Electrophysiological recordings of post-synaptic activities
from a neural circuit on the order of several hundred units, frequently
measured at wider intervals during the course of learning, may provide a
good basis on which to identify learning rules.</strong></p>
<p>We have made our <a href="https://github.com/neuroailab/lr-identify">dataset, code, and interactive
tutorial</a> publicly
available so that others can analyze these properties without needing to
train neural networks themselves. Our dataset may also be of interest to
researchers theoretically or empirically investigating learning in deep
neural networks. For further details, check out our <a href="https://arxiv.org/abs/2010.11765">NeurIPS 2020
paper</a>.</p>
<h3 id="acknowledgements"><strong>Acknowledgements</strong></h3>
<p>I would like to thank my collaborator Sanjana Srivastava
and advisors Surya Ganguli and Daniel Yamins. I would also like to
thank Jacob Schreiber, Sidd Karamcheti, and Andrey Kurenkov for their
editorial suggestions on this post.</p>
Wed, 09 Dec 2020 00:00:00 -0800
iGibson: A Simulation Environment to Train AI Agents in Large Realistic Scenes
/blog/igibson/
/blog/igibson/
<h2 id="why-simulation-for-ai">Why simulation for AI?</h2>
<p>We are living in a Golden Age of simulation environments in AI and robotics. Looking back ten years, simulation environments were rare, with only a handful of available solutions, and were complex and used only by experts. Today, there are many available simulation environments, and most papers in AI and robotics at first-tier conferences such as NeurIPS, CoRL, or even ICRA and IROS make some use of them. What has changed?</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-08-igibson/sim_img.png" /></p>
</div></figure>
<p>This extensive use of simulation environments is the result of several trends:</p>
<ul>
<li>First, the increasing role of machine learning in robotics creates a demand for more data (for example, interactive experiences) than what can be generated in real time <sup id="fnref:dexterity"><a href="#fn:dexterity" class="footnote">1</a></sup><sup id="fnref:todorov"><a href="#fn:todorov" class="footnote">2</a></sup><sup id="fnref:peng"><a href="#fn:peng" class="footnote">3</a></sup><sup id="fnref:robosuite"><a href="#fn:robosuite" class="footnote">4</a></sup>. Also, the initial data collection process often involves random exploration that may be dangerous for physical robots or their surroundings.</li>
<li>Second, simulation environments have matured to be more robust, realistic (visually and physically), user friendly and accessible to all types of users, and the necessary computation to simulate complex physics is reasonably fast on most modern machines. Therefore, simulation environments have the potential to lower the barrier to entry in robotics, even for researchers without the funds to acquire expensive real robot platforms.</li>
<li>Finally, the increasing number of robotic solutions to tasks such as grasping, navigation or manipulation has brought more attention to a critical absence in our community: the lack of repeatable benchmarks. Mature sciences are based on experiments that can be easily and reliably replicated, so that different techniques, theories, and solutions can be compared in fair conditions. Simulation environments can help us establish repeatable benchmarks, which are very difficult to achieve with real robots, and these benchmarks can in turn help us understand the status of our field.</li>
</ul>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-08-igibson/image9.png" /></p>
</div></figure>
<h2 id="why-igibson">Why iGibson?</h2>
<p>These ideas motivated us in the Stanford Vision and Learning Lab to develop a simulation environment that can serve as a “playground” to train and test interactive AI agents – an environment we call iGibson (*see the note on naming at the bottom of the post). What makes iGibson special? To understand this, let’s first define what a simulation environment is and how it is different from a physics simulator. A physics simulator is an engine capable of computing the physical effect of actions on an environment (e.g. motion of bodies when a force is applied, or flow of liquid particles when being poured). There are many existing physics simulation engines. The best known in robotics are Bullet and its Python extension, PyBullet, MuJoCo, Nvidia PhysX and Flex, UnrealEngine, DART, Unity, and ODE. Given a physical problem (objects, forces, particles, and physics parameters), these engines compute the temporal evolution of the system. On the other hand, a simulation environment is a framework that includes a physics simulator, a renderer of virtual signals, and a set of assets (i.e. models of scenes, objects, and robots) that can be used to create simulations of problems to study and develop solutions for different tasks. The decision on what physics engine to use is based on the type of physical process that dominates the problem, for example rigid body physics or motion of fluids. However, to decide on what simulation environment to use, researchers are guided by the application domain they are interested in, and the research questions they want to explore. With iGibson, we aim to support the study of interactive tasks in large realistic scenes, guided by high quality virtual visual signals.</p>
<h2 id="comparison-to-existing-simulators">Comparison to existing simulators</h2>
<p>No existing simulation environments support developing solutions for problems involving interactions in large scale scenes like full houses. There are several simulation environments for tasks with stationary arms, such as Meta-World, RLBench, RoboSuite or DoorGym, but none of them include large realistic scenes like homes with multiple rooms for tasks that include navigation. For navigation, our previous version, Gibson (v1), and Habitat have proven to be great environments that allow researchers to study visual and language guided navigation. However, the included assets (scenes) are single meshes that cannot change when interactions are applied, like opening doors or moving objects.</p>
<p>Finally, a set of recent simulation environments allow for scene-level interactive tasks, such as Sapien, AI2Thor and ThreeDWorld (TDW). Sapien focuses on interaction with articulated objects (doors, cabinets, and drawers). TDW is a multi-modal simulator with audio, high quality visuals, and simulation of flexible materials and liquids via Nvidia Flex. But neither Sapien nor TDW includes fully interactive scenes aligned with real object distribution and layout as part of the environment. AI2Thor includes fully interactive scenes, but the interactions are scripted: interactable objects are annotated with the possible actions they can receive. When the agent is close enough to an object and the object is in the right state (precondition), the agent can select a predefined action, and the object is “transitioned” to the next state (postcondition). RoboThor, an alternative version of AI2Thor, enables continuous interactions but focuses on navigation. It provides limited sensory signals (only RGB-D images) to the agent, which is always embodied as a <a href="http://www.locobot.org/">locobot</a>, a low-cost platform with limited interaction capabilities. Here at SVL, we want to study complex, long-horizon mobile manipulation tasks such as tidying a house or searching for objects, which requires access to fully interactive realistic large-scale scenes.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-12-08-igibson/image10.png" /></p>
</div></figure>
<h2 id="igibsons-new-features">iGibson’s new features</h2>
<p>The main focus of iGibson is interactivity: enabling realistic interactions in large scenes. For that, we have included several key features:</p>
<ul>
<li>Fifteen fully interactive visually realistic scenes representing real world homes with furniture and articulated object models annotated with materials and dynamics properties.</li>
<li>Capabilities to import models from CubiCasa5K <sup id="fnref:cubicasa"><a href="#fn:cubicasa" class="footnote">5</a></sup> and 3D-Front <sup id="fnref:3dfront"><a href="#fn:3dfront" class="footnote">6</a></sup>, giving access to more than 12000 additional interactive home scenes.</li>
<li>Realistic virtual sensor signals, including high quality RGB images from a physics-based renderer, depth maps, 1-beam and 16-beam virtual LiDAR signals, semantic/instance/material segmentation, optical and scene flow, and surface normals.</li>
<li>Domain randomization for visual texture, dynamics properties and object instances for endless variations of scenes.</li>
<li>Human-computer interface for humans to provide demonstrations of fully physical interactions with the scenes.</li>
<li>Integration with sampling-based motion planners to facilitate motion of robotic bases (navigation in 2D layout) and arms (interaction in 3D space).</li>
</ul>
<figure class="figure"><div class="figure__main">
<p><img class="postimagehalf" src="/blog/assets/img/posts/2020-12-08-igibson/image5.gif" />
<img class="postimagehalf" src="/blog/assets/img/posts/2020-12-08-igibson/image1.gif" /></p>
</div></figure>
<figure class="figure"><div class="figure__main">
<p><img class="postimagehalf" src="/blog/assets/img/posts/2020-12-08-igibson/image3.gif" />
<img class="postimagehalf" src="/blog/assets/img/posts/2020-12-08-igibson/image8.gif" /></p>
</div></figure>
<h2 id="using-igibson-for-robot-learning">Using iGibson for robot learning</h2>
<p>These novel features in iGibson allow us to study and develop solutions for new interactive tasks in large environments. One of these new problems is Interactive Navigation, where the agents need to interact with the environment to change its configuration, for example, to open doors or push obstacles away. This is a common type of navigation in our homes and offices, but non-interactive simulation environments cannot be used to study it. In iGibson we have developed hierarchical reinforcement learning solutions for interactive navigation that decide explicitly what part of the body to use in the next phase of the task: the arm (for interactions), the base (for navigation) or the combination of both <sup id="fnref:hrl4in2"><a href="#fn:hrl4in2" class="footnote">7</a></sup>. We also propose a new learning solution for interactive navigation that integrates a motion planner: the learning algorithm decides on the next point to interact with, and the motion planner finds a collision free path to that point of interaction <sup id="fnref:relmogen2"><a href="#fn:relmogen2" class="footnote">8</a></sup>. But these are just the tip of the iceberg: many of SVL’s projects are leveraging iGibson to study a wide variety of interactive tasks in large realistic scenes.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimagehalf" src="/blog/assets/img/posts/2020-12-08-igibson/image11.gif" />
<img class="postimagehalf" src="/blog/assets/img/posts/2020-12-08-igibson/image6.gif" /></p>
</div></figure>
<h2 id="summary">Summary</h2>
<p>Simulation environments have the potential to support researchers in their study of robotics and embodied AI problems. With iGibson, SVL contributes to the community with an open source, fully academically developed simulation environment for interactive tasks in large realistic scenes. If you want to start using it, visit <a href="http://svl.stanford.edu/igibson/">our website</a> and download it. Setup should be straightforward, and we’re happy to answer any questions about getting the simulator up and running for your research! You can also read <a href="https://arxiv.org/pdf/2012.02924.pdf">our preprint on arXiv</a>. We hope we can facilitate new avenues of research in robotics and AI.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-08-igibson/image7.png" /></p>
</div></figure>
<hr />
<p>* A note on Gibson - Our simulation environment takes its name from James J. Gibson [1904-1979]. Gibson was an influential psychologist and cognitive scientist with, at the time, disruptive ideas. He pushed forward a new concept of perception as 1) an ecological process that cannot and should not be studied in isolation from the environment, and 2) an active process that needs agency and interactivity. This was in contrast to the predominant view of the time, which treated perception as a passive process where signals “arrive” and “are processed” by the brain. Instead, he argued that agents seek information, interacting with the environment to reveal it. He also coined the term “affordance” as the opportunity the environment offers to an agent to perform a task. This is a quote from a colleague summarizing his research that directly connects to the guiding principle behind our work in the iGibson team: “ask not what’s inside your head, but what your head is inside of”.</p>
<div class="footnotes">
<ol>
<li id="fn:dexterity">
<p>Andrychowicz, Marcin, et al. “Learning dexterous in-hand manipulation.” The International Journal of Robotics Research 39.1 (2020): 3-20. <a href="#fnref:dexterity" class="reversefootnote">↩</a></p>
</li>
<li id="fn:todorov">
<p>Rajeswaran, Aravind, et al. “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations.” Robotics: Science and Systems, 2017. <a href="#fnref:todorov" class="reversefootnote">↩</a></p>
</li>
<li id="fn:peng">
<p>Peng, Xue Bin, et al. “Sfv: Reinforcement learning of physical skills from videos.” ACM Transactions on Graphics (TOG) 37.6 (2018): 1-14. <a href="#fnref:peng" class="reversefootnote">↩</a></p>
</li>
<li id="fn:robosuite">
<p>Zhu, Yuke, et al. “robosuite: A modular simulation framework and benchmark for robot learning.” arXiv preprint arXiv:2009.12293 (2020). <a href="#fnref:robosuite" class="reversefootnote">↩</a></p>
</li>
<li id="fn:cubicasa">
<p>Kalervo, Ahti, et al. “Cubicasa5k: A dataset and an improved multi-task model for floorplan image analysis.” Scandinavian Conference on Image Analysis. Springer, Cham, 2019. <a href="#fnref:cubicasa" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3dfront">
<p>Fu, Huan, et al. “3D-FRONT: 3D Furnished Rooms with layOuts and semaNTics.” arXiv preprint arXiv:2011.09127 (2020). <a href="#fnref:3dfront" class="reversefootnote">↩</a></p>
</li>
<li id="fn:hrl4in2">
<p>Li, Chengshu, et al. “Hrl4in: Hierarchical reinforcement learning for interactive navigation with mobile manipulators.” Conference on Robot Learning. PMLR, 2020. <a href="#fnref:hrl4in2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:relmogen2">
<p>Xia, Fei, et al. “Relmogen: Leveraging motion generation in reinforcement learning for mobile manipulation.” arXiv preprint arXiv:2008.07792 (2020). <a href="#fnref:relmogen2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Tue, 08 Dec 2020 00:00:00 -0800Stanford AI Lab Papers and Talks at NeurIPS 2020
/blog/neurips-2020/
/blog/neurips-2020/<p><img class="postimage_75" src="https://ai.stanford.edu/blog/assets/img/posts/2020-12-06-neurips-2020/logo.png" /></p>
<p>The <a href="https://neurips.cc">Neural Information Processing Systems</a> (NeurIPS) 2020 conference is being hosted virtually from Dec 6th to Dec 12th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!</p>
<h2 id="list-of-accepted-papers">List of Accepted Papers</h2>
<hr />
<h4 id="provably-efficient-reward-agnostic-navigation-with-linear-value-iteration">Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration</h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img33" />
<strong>Authors</strong>: Andrea Zanette, Alessandro Lazaric, Mykel Kochenderfer, Emma Brunskill
<br /><strong>Contact</strong>: zanette@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2008.07737.pdf">Paper</a>
<br /><strong>Keywords</strong>: reinforcement learning, function approximation, exploration</p>
<hr />
<h4 id="acceleration-with-a-ball-optimization-oracle"><a href="https://arxiv.org/abs/2003.08078">Acceleration with a Ball Optimization Oracle</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img29" />
<strong>Authors</strong>: Yair Carmon, Arun Jambulapati, Qijia Jiang, Yujia Jin, Yin Tat Lee, Aaron Sidford, Kevin Tian
<br /><strong>Contact</strong>: kjtian@stanford.edu
<br /><strong>Award nominations:</strong> Oral presentation
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2003.08078">Paper</a>
<br /><strong>Keywords</strong>: convex optimization, local search, trust region methods</p>
<hr />
<h4 id="banditpam-almost-linear-time-k-medoids-clustering-via-multi-armed-bandits"><a href="https://arxiv.org/abs/2006.06856">BanditPAM: Almost Linear Time k-Medoids Clustering via Multi-Armed Bandits</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img10" />
<strong>Authors</strong>: Mo Tiwari, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris Piech, Ilan Shomorony
<br /><strong>Contact</strong>: Motiwari@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.06856">Paper</a> | <a href="https://studio.slideslive.com/web_recorder/share/20201019T224008Z__NeurIPS_posters__17289__bandit-pam-almost-linear-time?s=c3456b98-724c-4903-b216-e4cd5810b6b8">Video</a>
<br /><strong>Keywords</strong>: clustering, k-means, k-medoids, multi-armed bandits</p>
<hr />
<h4 id="caspr-learning-canonical-spatiotemporal-point-cloud-representations"><a href="https://geometry.stanford.edu/projects/caspr/content/CaSPR_CR.pdf">CaSPR: Learning Canonical Spatiotemporal Point Cloud Representations</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img20" />
<strong>Authors</strong>: Davis Rempe, Tolga Birdal, Yongheng Zhao, Zan Gojcic, Srinath Sridhar, Leonidas J. Guibas
<br /><strong>Contact</strong>: drempe@stanford.edu
<br /><strong>Links:</strong> <a href="https://geometry.stanford.edu/projects/caspr/content/CaSPR_CR.pdf">Paper</a> | <a href="https://www.youtube.com/watch?v=1CrITE28DeM">Video</a> | <a href="https://geometry.stanford.edu/projects/caspr/">Website</a>
<br /><strong>Keywords</strong>: 3d vision, dynamic point clouds, representation learning</p>
<hr />
<h4 id="compositional-explanations-of-neurons"><a href="https://arxiv.org/abs/2006.14032">Compositional Explanations of Neurons</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img15" />
<strong>Authors</strong>: Jesse Mu, Jacob Andreas
<br /><strong>Contact</strong>: muj@stanford.edu
<br /><strong>Award nominations:</strong> Oral presentation
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.14032">Paper</a>
<br /><strong>Keywords</strong>: interpretability, explanation, deep learning, computer vision, natural language processing, adversarial examples</p>
<hr />
<h4 id="continuous-meta-learning-without-tasks"><a href="https://arxiv.org/abs/1912.08866">Continuous Meta-Learning without Tasks</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img35" />
<strong>Authors</strong>: James Harrison, Apoorva Sharma, Chelsea Finn, Marco Pavone
<br /><strong>Contact</strong>: jharrison@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/1912.08866">Paper</a>
<br /><strong>Keywords</strong>: meta-learning, continuous learning, changepoint detection</p>
<hr />
<h4 id="deep-learning-versus-kernel-learning-an-empirical-study-of-loss-landscape-geometry-and-the-time-evolution-of-the-neural-tangent-kernel"><a href="https://arxiv.org/abs/2010.15110">Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img16" />
<strong>Authors</strong>: Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, Surya Ganguli
<br /><strong>Contact</strong>: sfort1@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.15110">Paper</a>
<br /><strong>Keywords</strong>: loss landscape, neural tangent kernel, linearization, taylorization, basin, nonlinear advantage</p>
<hr />
<h4 id="diversity-can-be-transferred-output-diversification-for-white--and-black-box-attacks"><a href="https://arxiv.org/abs/2003.06878">Diversity can be Transferred: Output Diversification for White- and Black-box Attacks</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img8" />
<strong>Authors</strong>: Yusuke Tashiro, Yang Song, Stefano Ermon
<br /><strong>Contact</strong>: ytashiro@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2003.06878">Paper</a> | <a href="https://github.com/ermongroup/ODS">Website</a>
<br /><strong>Keywords</strong>: adversarial examples, deep learning, robustness</p>
<hr />
<h4 id="evidential-sparsification-of-multimodal-latent-spaces-in-conditional-variational-autoencoders"><a href="https://arxiv.org/abs/2010.09164">Evidential Sparsification of Multimodal Latent Spaces in Conditional Variational Autoencoders</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img1" />
<strong>Authors</strong>: Masha Itkina, Boris Ivanovic, Ransalu Senanayake, Mykel J. Kochenderfer, Marco Pavone
<br /><strong>Contact</strong>: mitkina@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.09164">Paper</a> | <a href="https://github.com/sisl/EvidentialSparsification">Website</a>
<br /><strong>Keywords</strong>: sparse distributions, generative models, discrete latent spaces, behavior prediction, image generation</p>
<hr />
<h4 id="federated-accelerated-stochastic-gradient-descent"><a href="https://papers.nips.cc/paper/2020/hash/39d0a8908fbe6c18039ea8227f827023-Abstract.html">Federated Accelerated Stochastic Gradient Descent</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img18" />
<strong>Authors</strong>: Honglin Yuan, Tengyu Ma
<br /><strong>Contact</strong>: yuanhl@stanford.edu
<br /><strong>Award nominations:</strong> Best Paper Award at the Workshop on Federated Learning for User Privacy and Data Confidentiality, held in conjunction with ICML 2020 (FL-ICML’20)
<br /><strong>Links:</strong> <a href="https://papers.nips.cc/paper/2020/hash/39d0a8908fbe6c18039ea8227f827023-Abstract.html">Paper</a> | <a href="https://github.com/hongliny/FedAc-NeurIPS20">Website</a>
<br /><strong>Keywords</strong>: federated learning, local sgd, acceleration, fedac</p>
<hr />
<h4 id="fourier-transform-based-attribution-priors-improve-the-interpretability-and-stability-of-deep-learning-models-for-genomics"><a href="https://proceedings.neurips.cc/paper/2020/hash/1487987e862c44b91a0296cf3866387e-Abstract.html">Fourier-transform-based attribution priors improve the interpretability and stability of deep learning models for genomics</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img11" />
<strong>Authors</strong>: Alex Michael Tseng, Avanti Shrikumar, Anshul Kundaje
<br /><strong>Contact</strong>: amtseng@stanford.edu
<br /><strong>Links:</strong> <a href="https://proceedings.neurips.cc/paper/2020/hash/1487987e862c44b91a0296cf3866387e-Abstract.html">Paper</a> | <a href="https://github.com/amtseng/fourier_attribution_priors">Website</a>
<br /><strong>Keywords</strong>: deep learning, interpretability, attribution prior, computational biology, genomics</p>
<hr />
<h4 id="from-trees-to-continuous-embeddings-and-back-hyperbolic-hierarchical-clustering"><a href="https://arxiv.org/abs/2010.00402">From Trees to Continuous Embeddings and Back: Hyperbolic Hierarchical Clustering</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img5" />
<strong>Authors</strong>: Ines Chami, Albert Gu, Vaggos Chatziafratis, Christopher Ré
<br /><strong>Contact</strong>: chami@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.00402">Paper</a> | <a href="https://www.youtube.com/watch?v=11bIx4v_Mz4&feature=youtu.be&ab_channel=HazyResearch">Video</a> | <a href="https://github.com/HazyResearch/HypHC">Website</a>
<br /><strong>Keywords</strong>: hierarchical clustering, hyperbolic embeddings</p>
<hr />
<h4 id="frugalml-how-to-use-ml-prediction-apis-more-accurately-and-cheaply"><a href="https://papers.nips.cc/paper/2020/file/789ba2ae4d335e8a2ad283a3f7effced-Paper.pdf">FrugalML: How to Use ML Prediction APIs More Accurately and Cheaply</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img27" />
<strong>Authors</strong>: Lingjiao Chen, Matei Zaharia, James Zou
<br /><strong>Contact</strong>: lingjiao@stanford.edu
<br /><strong>Award nominations:</strong> Oral Presentation
<br /><strong>Links:</strong> <a href="https://papers.nips.cc/paper/2020/file/789ba2ae4d335e8a2ad283a3f7effced-Paper.pdf">Paper</a> | <a href="https://venturebeat.com/2020/07/21/frugalml-switches-between-apis-to-improve-image-classification-and-cut-costs/">Blog Post</a> | <a href="https://github.com/lchen001/FrugalML">Website</a>
<br /><strong>Keywords</strong>: machine learning as a service, ensemble learning, meta learning, systems for machine learning</p>
<hr />
<h4 id="generative-3d-part-assembly-via-dynamic-graph-learning"><a href="https://arxiv.org/abs/2006.07793">Generative 3D Part Assembly via Dynamic Graph Learning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img19" />
<strong>Authors</strong>: Jialei Huang*, Guanqi Zhan*, Qingnan Fan, Kaichun Mo, Lin Shao, Baoquan Chen, Leonidas J. Guibas, Hao Dong
<br /><strong>Contact</strong>: fqnchina@gmail.com, kaichun@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.07793">Paper</a> | <a href="https://hyperplane-lab.github.io/Generative-3D-Part-Assembly/">Website</a>
<br /><strong>Keywords</strong>: 3d part assembly, dynamic graph learning, graph neural network</p>
<hr />
<h4 id="gradient-surgery-for-multi-task-learning"><a href="https://arxiv.org/pdf/2001.06782.pdf">Gradient Surgery for Multi-Task Learning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img7" />
<strong>Authors</strong>: Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, Chelsea Finn
<br /><strong>Contact</strong>: tianheyu@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2001.06782.pdf">Paper</a> | <a href="https://github.com/tianheyu927/PCGrad">Website</a>
<br /><strong>Keywords</strong>: multi-task learning, deep reinforcement learning</p>
<hr />
<h4 id="hippo-recurrent-memory-with-optimal-polynomial-projections"><a href="https://arxiv.org/abs/2008.07669">HiPPO: Recurrent Memory with Optimal Polynomial Projections</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img39" />
<strong>Authors</strong>: Albert Gu*, Tri Dao*, Stefano Ermon, Atri Rudra, Chris Ré
<br /><strong>Contact</strong>: albertgu@stanford.edu, trid@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2008.07669">Paper</a> | <a href="https://hazyresearch.stanford.edu/hippo">Blog Post</a>
<br /><strong>Keywords</strong>: representation learning, time series, recurrent neural networks, lstm, orthogonal polynomials</p>
<hr />
<h4 id="identifying-learning-rules-from-neural-network-observables"><a href="https://arxiv.org/abs/2010.11765">Identifying Learning Rules From Neural Network Observables</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img13" />
<strong>Authors</strong>: Aran Nayebi, Sanjana Srivastava, Surya Ganguli, Daniel L.K. Yamins
<br /><strong>Contact</strong>: anayebi@stanford.edu
<br /><strong>Award nominations:</strong> Spotlight Presentation
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.11765">Paper</a> | <a href="https://github.com/neuroailab/lr-identify">Website</a>
<br /><strong>Keywords</strong>: computational neuroscience, learning rule, deep networks</p>
<hr />
<h4 id="improved-techniques-for-training-score-based-generative-models"><a href="https://arxiv.org/pdf/2006.09011.pdf">Improved Techniques for Training Score-Based Generative Models</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img28" />
<strong>Authors</strong>: Yang Song, Stefano Ermon
<br /><strong>Contact</strong>: songyang@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2006.09011.pdf">Paper</a>
<br /><strong>Keywords</strong>: score-based generative modeling, score matching, deep generative models</p>
<hr />
<h4 id="language-through-a-prism-a-spectral-approach-for-multiscale-language-representations"><a href="https://arxiv.org/abs/2011.04823">Language Through a Prism: A Spectral Approach for Multiscale Language Representations</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img12" />
<strong>Authors</strong>: Alex Tamkin, Dan Jurafsky, Noah Goodman
<br /><strong>Contact</strong>: atamkin@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2011.04823">Paper</a>
<br /><strong>Keywords</strong>: bert, signal processing, self-supervised learning, interpretability, multiscale</p>
<hr />
<h4 id="large-scale-methods-for-distributionally-robust-optimization"><a href="https://arxiv.org/pdf/2010.05893.pdf">Large-Scale Methods for Distributionally Robust Optimization</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img14" />
<strong>Authors</strong>: Daniel Levy, Yair Carmon, John Duchi, Aaron Sidford
<br /><strong>Contact</strong>: danilevy@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2010.05893.pdf">Paper</a>
<br /><strong>Keywords</strong>: robustness, dro, optimization, large-scale, optimal</p>
<hr />
<h4 id="learning-physical-graph-representations-from-visual-scenes"><a href="https://proceedings.neurips.cc/paper/2020/hash/4324e8d0d37b110ee1a4f1633ac52df5-Abstract.html">Learning Physical Graph Representations from Visual Scenes</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img0" />
<strong>Authors</strong>: Daniel Bear, Chaofei Fan, Damian Mrowca, Yunzhu Li, Seth Alter, Aran Nayebi, Jeremy Schwartz, Li Fei-Fei, Jiajun Wu, Josh Tenenbaum, Daniel L. Yamins
<br /><strong>Contact</strong>: dbear@stanford.edu
<br /><strong>Links:</strong> <a href="https://proceedings.neurips.cc/paper/2020/hash/4324e8d0d37b110ee1a4f1633ac52df5-Abstract.html">Paper</a> | <a href="https://neuroailab.github.io/physical-scene-graphs/">Blog Post</a> | <a href="https://github.com/neuroailab/PSGNets">Website</a>
<br /><strong>Keywords</strong>: structure learning, graph learning, visual scene representations, unsupervised learning, unsupervised segmentation, object-centric representation, intuitive physics</p>
<hr />
<h4 id="mopo-model-based-offline-policy-optimization"><a href="https://arxiv.org/pdf/2005.13239.pdf">MOPO: Model-based Offline Policy Optimization</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img6" />
<strong>Authors</strong>: Tianhe Yu*, Garrett Thomas*, Lantao Yu, Stefano Ermon, James Zou, Sergey Levine, Chelsea Finn, Tengyu Ma
<br /><strong>Contact</strong>: tianheyu@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2005.13239.pdf">Paper</a> | <a href="https://github.com/tianheyu927/mopo">Website</a>
<br /><strong>Keywords</strong>: offline reinforcement learning, model-based reinforcement learning</p>
<hr />
<h4 id="measuring-robustness-to-natural-distribution-shifts-in-image-classification"><a href="https://arxiv.org/abs/2007.00644">Measuring Robustness to Natural Distribution Shifts in Image Classification</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img4" />
<strong>Authors</strong>: Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, Ludwig Schmidt
<br /><strong>Contact</strong>: rtaori@stanford.edu
<br /><strong>Award nominations:</strong> Spotlight
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2007.00644">Paper</a> | <a href="https://modestyachts.github.io/imagenet-testbed/">Website</a>
<br /><strong>Keywords</strong>: machine learning, robustness, image classification</p>
<hr />
<h4 id="minibatch-stochastic-approximate-proximal-point-methods"><a href="https://proceedings.neurips.cc//paper_files/paper/2020/hash/fa2246fa0fdf0d3e270c86767b77ba1b-Abstract.html">Minibatch Stochastic Approximate Proximal Point Methods</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img36" />
<strong>Authors</strong>: Hilal Asi, Karan Chadha, Gary Cheng, John Duchi
<br /><strong>Contact</strong>: chenggar@stanford.edu
<br /><strong>Award nominations:</strong> Spotlight talk
<br /><strong>Links:</strong> <a href="https://proceedings.neurips.cc//paper_files/paper/2020/hash/fa2246fa0fdf0d3e270c86767b77ba1b-Abstract.html">Paper</a>
<br /><strong>Keywords</strong>: stochastic optimization, sgd, aprox</p>
<hr />
<h4 id="model-based-adversarial-meta-reinforcement-learning"><a href="https://proceedings.neurips.cc/paper/2020/file/73634c1dcbe056c1f7dcf5969da406c8-Paper.pdf">Model-based Adversarial Meta-Reinforcement Learning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img38" />
<strong>Authors</strong>: Zichuan Lin, Garrett Thomas, Guangwen Yang, Tengyu Ma
<br /><strong>Contact</strong>: lzcthu12@gmail.com, gwthomas@stanford.edu
<br /><strong>Links:</strong> <a href="https://proceedings.neurips.cc/paper/2020/file/73634c1dcbe056c1f7dcf5969da406c8-Paper.pdf">Paper</a>
<br /><strong>Keywords</strong>: model-based rl, meta-rl, minimax</p>
<hr />
<h4 id="multi-plane-program-induction-with-3d-box-priors"><a href="http://bpi.csail.mit.edu/data/paper/2020NeurIPS-BPI.pdf">Multi-Plane Program Induction with 3D Box Priors</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img9" />
<strong>Authors</strong>: Yikai Li, Jiayuan Mao, Xiuming Zhang, William T. Freeman, Joshua B. Tenenbaum, Noah Snavely, Jiajun Wu
<br /><strong>Contact</strong>: jiajunwu@cs.stanford.edu
<br /><strong>Links:</strong> <a href="http://bpi.csail.mit.edu/data/paper/2020NeurIPS-BPI.pdf">Paper</a> | <a href="http://bpi.csail.mit.edu/data/img/intro.mp4">Video</a> | <a href="http://bpi.csail.mit.edu/">Website</a>
<br /><strong>Keywords</strong>: visual program induction, 3d vision, image editing</p>
<hr />
<h4 id="multi-label-contrastive-predictive-coding"><a href="https://arxiv.org/abs/2007.09852">Multi-label Contrastive Predictive Coding</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img25" />
<strong>Authors</strong>: Jiaming Song, Stefano Ermon
<br /><strong>Contact</strong>: jiaming.tsong@gmail.com
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2007.09852">Paper</a>
<br /><strong>Keywords</strong>: representation learning, mutual information</p>
<hr />
<h4 id="neural-bridge-sampling-for-evaluating-safety-critical-autonomous-systems"><a href="https://arxiv.org/abs/2008.10581">Neural Bridge Sampling for Evaluating Safety-Critical Autonomous Systems</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img41" />
<strong>Authors</strong>: Aman Sinha, Matthew O’Kelly, Russ Tedrake, John Duchi
<br /><strong>Contact</strong>: amans@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2008.10581">Paper</a>
<br /><strong>Keywords</strong>: safety, probabilistic methods, autonomous systems</p>
<hr />
<h4 id="neuron-shapley-discovering-the-responsible-neurons"><a href="https://papers.nips.cc/paper/2020/file/41c542dfe6e4fc3deb251d64cf6ed2e4-Paper.pdf">Neuron Shapley: Discovering the Responsible Neurons</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img32" />
<strong>Authors</strong>: Amirata Ghorbani, James Zou
<br /><strong>Contact</strong>: amiratag@stanford.edu
<br /><strong>Links:</strong> <a href="https://papers.nips.cc/paper/2020/file/41c542dfe6e4fc3deb251d64cf6ed2e4-Paper.pdf">Paper</a>
<br /><strong>Keywords</strong>: interpretability, deep learning, shapley value</p>
<hr />
<h4 id="no-subclass-left-behind-fine-grained-robustness-in-coarse-grained-classification-problems"><a href="https://arxiv.org/abs/2011.12945">No Subclass Left Behind: Fine-Grained Robustness in Coarse-Grained Classification Problems</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img23" />
<strong>Authors</strong>: Nimit Sharad Sohoni, Jared Alexander Dunnmon, Geoffrey Angus, Albert Gu, Christopher Ré
<br /><strong>Contact</strong>: nims@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2011.12945">Paper</a> | <a href="https://hazyresearch.stanford.edu/hidden-stratification">Blog Post</a> | <a href="https://youtu.be/dI6nByor3rY">Video</a>
<br /><strong>Keywords</strong>: classification, robustness, clustering, neural feature representations</p>
<hr />
<h4 id="off-policy-policy-evaluation-for-sequential-decisions-under-unobserved-confounding"><a href="https://papers.nips.cc/paper/2020/hash/da21bae82c02d1e2b8168d57cd3fbab7-Abstract.html">Off-policy Policy Evaluation For Sequential Decisions Under Unobserved Confounding</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img26" />
<strong>Authors</strong>: Hongseok Namkoong, Ramtin Keramati, Steve Yadlowsky, Emma Brunskill
<br /><strong>Contact</strong>: keramati@stanford.edu
<br /><strong>Links:</strong> <a href="https://papers.nips.cc/paper/2020/hash/da21bae82c02d1e2b8168d57cd3fbab7-Abstract.html">Paper</a>
<br /><strong>Keywords</strong>: off-policy policy evaluation, unobserved confounding, reinforcement learning</p>
<hr />
<h4 id="one-solution-is-not-all-you-need-few-shot-extrapolation-via-structured-maxent-rl"><a href="https://arxiv.org/abs/2010.14484">One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img21" />
<strong>Authors</strong>: Saurabh Kumar, Aviral Kumar, Sergey Levine, Chelsea Finn
<br /><strong>Contact</strong>: szk@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.14484">Paper</a>
<br /><strong>Keywords</strong>: robustness, diversity, reinforcement learning</p>
<hr />
<h4 id="point-process-models-for-sequence-detection-in-high-dimensional-neural-spike-trains"><a href="https://arxiv.org/abs/2010.04875">Point process models for sequence detection in high-dimensional neural spike trains</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img2" />
<strong>Authors</strong>: Alex H. Williams, Anthony Degleris, Yixin Wang, Scott W. Linderman
<br /><strong>Contact</strong>: ahwillia@stanford.edu
<br /><strong>Award nominations:</strong> Selected for Oral Presentation
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.04875">Paper</a> | <a href="https://github.com/lindermanlab/PPSeq.jl">Website</a>
<br /><strong>Keywords</strong>: bayesian nonparametrics, unsupervised learning</p>
<hr />
<h4 id="predictive-coding-in-balanced-neural-networks-with-noise-chaos-and-delays"><a href="https://papers.nips.cc/paper/2020/file/c236337b043acf93c7df397fdb9082b3-Paper.pdf">Predictive coding in balanced neural networks with noise, chaos and delays</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img24" />
<strong>Authors</strong>: Jonathan Kadmon, Jonathan Timcheck, Surya Ganguli
<br /><strong>Contact</strong>: kadmonj@stanford.edu
<br /><strong>Links:</strong> <a href="https://papers.nips.cc/paper/2020/file/c236337b043acf93c7df397fdb9082b3-Paper.pdf">Paper</a>
<br /><strong>Keywords</strong>: neuroscience, predictive coding, chaos</p>
<hr />
<h4 id="probabilistic-circuits-for-variational-inference-in-discrete-graphical-models"><a href="https://arxiv.org/abs/2010.11446">Probabilistic Circuits for Variational Inference in Discrete Graphical Models</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img34" />
<strong>Authors</strong>: Andy Shih, Stefano Ermon
<br /><strong>Contact</strong>: andyshih@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.11446">Paper</a>
<br /><strong>Keywords</strong>: variational inference, discrete, high-dimensions, sum product networks, probabilistic circuits, graphical models</p>
<hr />
<h4 id="provably-good-batch-off-policy-reinforcement-learning-without-great-exploration"><a href="https://proceedings.neurips.cc/paper/2020/file/0dc23b6a0e4abc39904388dd3ffadcd1-Paper.pdf">Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img30" />
<strong>Authors</strong>: Yao Liu, Adith Swaminathan, Alekh Agarwal, Emma Brunskill
<br /><strong>Contact</strong>: yaoliu@stanford.edu
<br /><strong>Links:</strong> <a href="https://proceedings.neurips.cc/paper/2020/file/0dc23b6a0e4abc39904388dd3ffadcd1-Paper.pdf">Paper</a>
<br /><strong>Keywords</strong>: reinforcement learning, off-policy, batch reinforcement learning</p>
<hr />
<h4 id="pruning-neural-networks-without-any-data-by-iteratively-conserving-synaptic-flow"><a href="https://papers.nips.cc/paper/2020/hash/46a4378f835dc8040c8057beb6a2da52-Abstract.html">Pruning neural networks without any data by iteratively conserving synaptic flow</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img22" />
<strong>Authors</strong>: Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, Surya Ganguli
<br /><strong>Contact</strong>: kunin@stanford.edu
<br /><strong>Links:</strong> <a href="https://papers.nips.cc/paper/2020/hash/46a4378f835dc8040c8057beb6a2da52-Abstract.html">Paper</a> | <a href="https://www.youtube.com/watch?v=8l-TDqpoUQs">Video</a> | <a href="https://github.com/ganguli-lab/Synaptic-Flow">Website</a>
<br /><strong>Keywords</strong>: network pruning, sparse initialization, lottery ticket</p>
<hr />
<h4 id="robust-sub-gaussian-principal-component-analysis-and-width-independent-schatten-packing"><a href="https://arxiv.org/abs/2006.06980">Robust Sub-Gaussian Principal Component Analysis and Width-Independent Schatten Packing</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img31" />
<strong>Authors</strong>: Arun Jambulapati, Jerry Li, Kevin Tian
<br /><strong>Contact</strong>: kjtian@stanford.edu
<br /><strong>Award nominations:</strong> Spotlight presentation
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.06980">Paper</a>
<br /><strong>Keywords</strong>: robust statistics, principal component analysis, positive semidefinite programming</p>
<hr />
<h4 id="self-training-avoids-using-spurious-features-under-domain-shift"><a href="https://arxiv.org/abs/2006.10032">Self-training Avoids Using Spurious Features Under Domain Shift</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img17" />
<strong>Authors</strong>: Yining Chen*, Colin Wei*, Ananya Kumar, Tengyu Ma (*equal contribution)
<br /><strong>Contact</strong>: cynnjjs@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.10032">Paper</a>
<br /><strong>Keywords</strong>: self-training, pseudo-labeling, domain shift, robustness</p>
<hr />
<h4 id="wasserstein-distances-for-stereo-disparity-estimation"><a href="https://arxiv.org/abs/2007.03085">Wasserstein Distances for Stereo Disparity Estimation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-12-06-neurips-2020/img40" />
<strong>Authors</strong>: Divyansh Garg, Yan Wang, Bharath Hariharan, Mark Campbell, Kilian Q. Weinberger, Wei-Lun Chao
<br /><strong>Contact</strong>: divgarg@stanford.edu
<br /><strong>Award nominations:</strong> Spotlight
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2007.03085">Paper</a> | <a href="https://slideslive.com/38937842">Video</a> | <a href="https://div99.github.io/W-Stereo-Disp/">Website</a>
<br /><strong>Keywords</strong>: depth estimation, disparity estimation, autonomous driving, 3d object detection, statistical learning</p>
<hr />
<p>We look forward to seeing you at NeurIPS 2020!</p>
Sun, 06 Dec 2020 00:00:00 -0800Learning from Language Explanations
/blog/learning-from-language/
/blog/learning-from-language/<p>Imagine you’re a machine learning practitioner and you want to solve some classification problem, like classifying groups of colored squares as being either 1s or 0s. Here’s what you would typically do: collect a large dataset of examples, label the data, and train a classifier:</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 700px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/examples.jpg" /></p>
</div></figure>
<p><em>But humans don’t learn like this</em>. We have a very powerful and intuitive mechanism for communicating information about the world - <strong>language</strong>!</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 500px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/language.jpg" /></p>
</div></figure>
<p>With just the phrase <em>at least 2 red squares</em>, we’ve summarized the entire dataset presented above in a much more efficient manner.</p>
<p><strong>Language is a crucial medium for human learning:</strong> we use it to <a href="https://www.npr.org/2010/01/18/122701268/i-have-a-dream-speech-in-its-entirety">convey beliefs</a> about the world, <a href="https://www.nature.com/articles/ncomms7029">teach others</a>, and describe things that are hard to <a href="https://en.wikipedia.org/wiki/Saturn">experience directly</a>. Thus, language ought to be a simple and effective way to supervise machine learning models. Yet past approaches to learning from language have struggled to scale up to the general tasks targeted by modern deep learning systems and the freeform language explanations used in these domains. In two short papers presented at ACL 2020 this year, we use deep neural models to learn from language explanations to help tackle a variety of challenging tasks in natural language processing (NLP) and computer vision.</p>
<ul>
<li><a href="https://arxiv.org/abs/2005.01932">ExpBERT: Representation Engineering with Natural Language Explanations</a></li>
<li><a href="https://arxiv.org/abs/1911.02683">Shaping Visual Representations with Language for Few-shot Classification</a></li>
</ul>
<h3 id="whats-the-challenge"><strong>What’s the challenge?</strong></h3>
<p>Given that language is such an intuitive interface for humans to teach others,
why is it so hard to use language for machine learning?</p>
<p>The principal challenge is the <a href="https://arxiv.org/html/cs/9906002">grounding
problem</a>: understanding language
explanations in the context of other inputs. Building models that can
understand rich and ambiguous language is tricky enough, but building models
that can relate language to the surrounding world is even more challenging. For
instance, given the explanation <em>at least two red squares</em>, a model must not
only understand the terms <em>red</em> and <em>square</em>, but also how they refer to
particular parts of (often complex) inputs.</p>
<p>Past work (<a href="https://www.aclweb.org/anthology/D17-1161">1</a>,
<a href="https://www.aclweb.org/anthology/P18-1029.pdf">2</a>,
<a href="https://arxiv.org/abs/1805.03818">3</a>) has relied on <a href="https://cs.stanford.edu/~pliang/papers/executable-cacm2016.pdf">semantic
parsers</a> which
convert natural language statements (e.g. <em>at least two red squares</em>) to formal
logical representations (e.g. <code class="highlighter-rouge">Count(Square AND Red) >= 2</code>). If we can easily
check whether explanations apply to our inputs by executing these logical
formulas, we can use our explanations as features to train our model.
However, semantic parsers only work on simple domains
where we can hand-engineer a logical grammar of explanations we might expect to
see. They struggle to handle richer and vaguer language or scale up to more
complex inputs, such as images.</p>
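<p>To make the idea of executing a logical formula concrete, here is a minimal sketch of what a hand-engineered form for <em>at least two red squares</em> might look like once it is turned into code. The <code class="highlighter-rouge">Shape</code> class and both function names are hypothetical and are not taken from any of the cited parsers; the point is only that the output is a 0/1 feature a downstream model can consume.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    kind: str   # e.g. "square" or "circle" (hypothetical input format)
    color: str  # e.g. "red" or "blue"


def count_matching(shapes, kind, color):
    """Count(kind AND color): number of shapes that have both attributes."""
    return sum(1 for s in shapes if s.kind == kind and s.color == color)


def at_least_two_red_squares(shapes):
    """Executable form of the explanation; returns a 0/1 indicator feature."""
    return int(count_matching(shapes, "square", "red") >= 2)


example = [Shape("square", "red"), Shape("square", "red"), Shape("circle", "blue")]
feature = at_least_two_red_squares(example)
```

<p>Every new explanation needs its own hand-written function like this one, which is why the approach is hard to scale beyond small domains.</p>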
<p>Fortunately, modern deep neural language models such as
<a href="https://arxiv.org/abs/1810.04805">BERT</a> are beginning to show promise at
solving many language understanding tasks. Our papers propose to alleviate the
grounding problem by using neural language models that are either trained to
ground language explanations in the domain of interest, or come pre-trained
with general-purpose “knowledge” that can be used to interpret explanations. We
will show that these neural models allow us to learn from richer and more
diverse language for more challenging settings.</p>
<h3 id="representation-engineering-with-natural-language-explanations"><strong>Representation Engineering with Natural Language Explanations</strong></h3>
<p>In our <a href="https://arxiv.org/abs/2005.01932">first paper</a>, we examine how to build text classifiers with language
explanations.
Consider the task of <em>relation extraction</em>, where we are given a
short paragraph and must identify whether two people mentioned in the
paragraph are <strong>married</strong>. While state-of-the-art NLP models can likely solve
this task from data alone, humans might use language to describe ways to tell
whether two people are married—for example, <em>people who go on honeymoons are
typically married</em>. Can such language explanations be used to train better
classifiers?</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 700px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/expbert_dataset.jpg" /></p>
</div></figure>
<p>In the same way that we might take an input <script type="math/tex">x</script>, and extract features (e.g.
the presence of certain words) to train a model, we can use explanations to
provide additional features. For example, knowing that honeymoons are relevant
for this task, if we can create a honeymoon feature that reliably activates
whenever the two people in a paragraph are described as going on a honeymoon,
this should be a useful signal for training a better model.</p>
<p>But creating such features requires some sort of explanation <strong>interpretation</strong>
mechanism that tells us whether an explanation is true for an input. Semantic
parsers are one such tool: given <em><script type="math/tex">A</script> and <script type="math/tex">B</script> went on honeymoon</em>, we could
parse this explanation into a logical form which, when run on an input,
produces 1 if the word <em>honeymoon</em> appears between <script type="math/tex">A</script> and <script type="math/tex">B</script>. But what about
a vaguer explanation like <em><script type="math/tex">A</script> and <script type="math/tex">B</script> are in love</em>? How can we parse this?</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 800px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/semantic_parsing_examples.jpg" /></p>
</div></figure>
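<p>As a rough illustration of the indicator feature described above, the sketch below checks whether a keyword occurs between two mentions in a whitespace-tokenized sentence. The function name <code class="highlighter-rouge">between_feature</code> and the tokenization are assumptions made for this example, not the output of a real semantic parser.</p>

```python
def between_feature(tokens, a, b, keyword):
    """Return 1 if `keyword` occurs strictly between mentions `a` and `b`, else 0."""
    if a not in tokens or b not in tokens:
        return 0
    # Use the first occurrence of each mention, in whichever order they appear.
    i, j = sorted((tokens.index(a), tokens.index(b)))
    return int(keyword in tokens[i + 1:j])


honeymoon = between_feature(
    "Alice went on a honeymoon with Bob".split(), "Alice", "Bob", "honeymoon"
)
in_love = between_feature(
    "Alice and Bob are in love".split(), "Alice", "Bob", "honeymoon"
)
```

<p>The first sentence activates the feature and the second does not, even though both hint at the same relationship. A rule this literal has no way to express a vaguer explanation such as <em>in love</em>.</p>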
<p>While semantic parsing is efficient and accurate in small domains, it can be
overly <em>brittle</em>, as it can only interpret explanations which adhere to a fixed
set of grammatical rules and functions that we must specify in advance (e.g.
<code class="highlighter-rouge">contains</code> and <code class="highlighter-rouge">extract_text</code>).
Instead, we turn to the soft reasoning
capabilities of <a href="https://arxiv.org/abs/1810.04805">BERT</a>, a neural language model. BERT is particularly effective
at the task of <em>textual entailment</em>: determining whether a sentence implies or
contradicts another sentence (e.g. does <em>She ate pizza</em> imply that <em>She ate
food?</em> Yes!). In our proposed <strong>ExpBERT</strong> model, we take a BERT model
trained for textual entailment, and instead ask it to identify whether a
paragraph in our task <em>entails</em> an explanation. The features produced by BERT
during this process replace the indicator features produced by the semantic
parser above.</p>
<figure class="figure"><div class="figure__main">
<video class="postimage_unpadded" style="max-width: 800px" autoplay="" muted="" loop="" playsinline="">
<source src="/blog/assets/img/posts/2020-11-23-learning-from-language/expbert.webm" type="video/webm" />
<source src="/blog/assets/img/posts/2020-11-23-learning-from-language/expbert.mp4" type="video/mp4" />
<p>Your browser doesn't support HTML5 video. Here is a <a href="/blog/assets/img/posts/2020-11-23-learning-from-language/expbert.mp4">link to the video</a> instead, which you can download and run with a player like <a href="https://www.videolan.org/vlc/index.html">VLC</a></p>
</video>
</div></figure>
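<p>The overall shape of this feature pipeline can be sketched as follows. This is not the ExpBERT implementation: <code class="highlighter-rouge">toy_pair_features</code> is a deliberately crude word-overlap placeholder standing in for the entailment model, and concatenating the per-explanation features into one vector is an assumption about how they are combined.</p>

```python
def toy_pair_features(paragraph, explanation):
    """Placeholder for an entailment model: NOT BERT, only word-overlap statistics."""
    p = set(paragraph.lower().split())
    e = set(explanation.lower().split())
    overlap = len(p & e)
    return [overlap / max(len(e), 1), float(overlap > 0)]


def explanation_features(paragraph, explanations, pair_encoder=toy_pair_features):
    """Encode each (paragraph, explanation) pair and concatenate the results."""
    features = []
    for explanation in explanations:
        features.extend(pair_encoder(paragraph, explanation))
    return features


explanations = ["they went on honeymoon", "they are in love"]
vector = explanation_features("Alice and Bob went on honeymoon in Paris", explanations)
```

<p>Swapping <code class="highlighter-rouge">pair_encoder</code> for a real entailment model would yield soft, dense features in place of these overlap counts, while the classifier trained on top of <code class="highlighter-rouge">vector</code> stays the same.</p>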
<p>Does the soft reasoning power of BERT improve over semantic parsing? On the
marriage identification task, we find that <strong>ExpBERT</strong> leads to substantial
improvements over a classifier that is trained on the input features only (No
Explanations). Importantly, using a semantic parser to try to parse
explanations doesn’t help much, since there are general explanations (<em>in
love</em>) that are difficult to convert to logical forms.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 285px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/expbert_results.jpg" /></p>
</div></figure>
<p>In the full paper, we compare to more baselines, explore larger relation
extraction tasks (e.g. <a href="https://nlp.stanford.edu/projects/tacred/">TACRED</a>),
conduct ablation studies to understand what kinds of explanations are
important, and examine how much more efficient explanations are compared to
additional data.</p>
<h3 id="shaping-visual-representations-with-language"><strong>Shaping Visual Representations with Language</strong></h3>
<p>The work we’ve just described uses natural language explanations for a single
task like marriage identification. However, <a href="https://plato.stanford.edu/entries/language-thought/">work in cognitive
science</a> suggests that
language also equips us with the right features and abstractions that help us
solve <em>future</em> tasks.
For example, explanations that indicate whether person <script type="math/tex">A</script> is married to
<script type="math/tex">B</script> also highlight other concepts that are crucial to human relationships:
<em>children</em>, <em>daughters</em>, <em>honeymoons</em>, and more. These additional
concepts are not just useful for identifying married people; they are also
important if we would later like to identify other relationships
(e.g. <em>siblings</em>, <em>mother</em>, <em>father</em>).</p>
<p>In machine learning, we might ask: how can language point out the right
features for challenging and underspecified domains, if we
ultimately wish to solve <em>new tasks</em> where no language is available? In our
<a href="https://arxiv.org/abs/1911.02683">second paper</a>, we explore this setting,
additionally increasing the challenge by seeing whether language can improve
the learning of representations across modalities—here, vision.</p>
<p>We’re specifically interested in few-shot visual reasoning tasks like the following (here, from the <a href="https://arxiv.org/abs/1704.04517">ShapeWorld</a> dataset):</p>
<figure class="figure"><div class="figure__main">
<video class="postimage_unpadded" style="max-width: 500px" autoplay="" muted="" loop="" playsinline="">
<source src="/blog/assets/img/posts/2020-11-23-learning-from-language/shapeworld.webm" type="video/webm" />
<source src="/blog/assets/img/posts/2020-11-23-learning-from-language/shapeworld.mp4" type="video/mp4" />
<p>Your browser doesn't support HTML5 video. Here is a <a href="/blog/assets/img/posts/2020-11-23-learning-from-language/shapeworld.mp4">link to the video</a> instead, which you can download and run with a player like <a href="https://www.videolan.org/vlc/index.html">VLC</a></p>
</video>
</div></figure>
<p>Given a small training set of examples of a visual concept, the task is to
determine whether a held-out test image expresses the same concept. Now, what
if we assume access to language explanations of the relevant visual concepts at
training time? Can we use these to learn a better model, <em>even if no language
is available at test time</em>?</p>
<p>We frame this as a <a href="https://arxiv.org/abs/1904.04232"><em>meta-learning</em></a> task:
instead of training and testing a model on a single task, we
train a model on a <em>set</em> of tasks, each with a small training set and
an accompanying language description (the <em>meta-train</em> set). We then test
generalization to a <em>meta-test</em> set of unseen tasks, for which no language is
available:</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 760px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/metalearning.jpg" /></p>
</div></figure>
<p>First, let’s look at how we might solve this task without language. One typical
approach is <strong>Prototypical Networks</strong>, where we learn some model <script type="math/tex">f_\theta</script>
(here, a <a href="https://arxiv.org/abs/1409.1556">deep convolutional neural network</a>)
that embeds the training images, averages them, and compares to an embedding of
the test image:</p>
<figure class="figure"><div class="figure__main">
<video class="postimage_unpadded" style="max-width: 800px" autoplay="" muted="" loop="" playsinline="">
<source src="/blog/assets/img/posts/2020-11-23-learning-from-language/lsl.webm" type="video/webm" />
<source src="/blog/assets/img/posts/2020-11-23-learning-from-language/lsl.mp4" type="video/mp4" />
<p>Your browser doesn't support HTML5 video. Here is a <a href="/blog/assets/img/posts/2020-11-23-learning-from-language/lsl.mp4">link to the video</a> instead, which you can download and run with a player like <a href="https://www.videolan.org/vlc/index.html">VLC</a></p>
</video>
</div></figure>
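<p>The embed, average, and compare steps can be sketched in a few lines. In this toy version the <code class="highlighter-rouge">embed</code> argument stands in for <script type="math/tex">f_\theta</script>, and the choice of a dot product followed by a sigmoid as the comparison is an assumption made for illustration; the description above only says that the two embeddings are compared.</p>

```python
import math


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def prototype_score(embed, train_images, test_image):
    """Embed the training images, average them, and compare to the test embedding."""
    prototype = mean_vector([embed(x) for x in train_images])
    logit = dot(prototype, embed(test_image))
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid: score for "same concept"


def identity(x):
    """Toy embedding: the inputs here are already feature vectors."""
    return x


train = [[1.0, 0.0], [3.0, 0.0]]
score_match = prototype_score(identity, train, [1.0, 5.0])
score_mismatch = prototype_score(identity, train, [-1.0, 0.0])
```

<p>Here the prototype is the average of the two training vectors, so a test vector pointing in the same direction scores above 0.5 and one pointing the opposite way scores below it.</p>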
<p>To use language, we propose a simple approach called <strong>Language Shaped Learning</strong>
(LSL): if we have access to explanations at training time, we encourage the
model to learn representations that are not only helpful for classification,
but are <em>predictive of the language explanations</em>. We do this by introducing an
<em>auxiliary</em> training objective (i.e. it is not related to the ultimate task of
interest), where we simultaneously train a recurrent neural network (RNN)
decoder to predict the explanation(s) from the representation of the
input images. Crucially, training this decoder depends on the
parameters of our image model <script type="math/tex">f_\theta</script>, so this process should encourage
<script type="math/tex">f_\theta</script> to better encode the features and abstractions exposed in
language.</p>
<p>In effect, we are training the model to “think out loud” when representing
concepts at training time. At test time, we simply discard the RNN decoder, and
do classification as normal with the “language-shaped” image embeddings.</p>
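<p>One way to picture the training objective is as a sum of two losses, sketched below for a single example. The names <code class="highlighter-rouge">lsl_loss</code> and <code class="highlighter-rouge">language_weight</code> are hypothetical, and <code class="highlighter-rouge">token_probs</code> stands in for the probabilities an RNN decoder would assign to each word of the explanation; the exact weighting used in the paper is not specified here.</p>

```python
import math


def binary_cross_entropy(prob, label):
    """Classification loss for one example; `prob` is the predicted P(label = 1)."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))


def token_nll(token_probs):
    """Average negative log-likelihood assigned to the explanation tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def lsl_loss(prob, label, token_probs=None, language_weight=1.0):
    """Classification loss plus the auxiliary language loss, when language is given."""
    loss = binary_cross_entropy(prob, label)
    if token_probs is not None:  # explanations exist only at training time
        loss += language_weight * token_nll(token_probs)
    return loss


train_loss = lsl_loss(0.5, 1, token_probs=[0.5, 0.5])  # with an explanation
test_loss = lsl_loss(0.5, 1)                           # no language available
```

<p>Because both terms are computed from the same image representation, minimizing the language term pushes gradients into the image model as well, which is the mechanism that shapes the embeddings.</p>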
<p>We apply this model to both the ShapeWorld dataset described above, and a more
realistic <a href="http://www.vision.caltech.edu/visipedia/CUB-200-2011.html">Birds</a>
dataset, with real images and human language:</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 800px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/birds.jpg" /></p>
</div></figure>
<p>In both cases, this auxiliary training objective improves performance over a
no-explanation baseline (<strong>Meta</strong>), and <a href="https://arxiv.org/abs/1711.00482"><em>Learning with Latent
Language</em></a> (<strong>L3</strong>), a similar model proposed
for this setting that uses language as a discrete bottleneck (see the paper for
details):</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_unpadded" style="max-width: 400px" src="/blog/assets/img/posts/2020-11-23-learning-from-language/lsl_results.jpg" /></p>
</div></figure>
<p>In the full paper, we also explore which <em>parts</em> of language are most important
(spoiler: a little bit of everything), and <em>how much</em> language is needed for
LSL to improve over models that don’t use language (spoiler: surprisingly little!).</p>
<h3 id="moving-forward"><strong>Moving Forward</strong></h3>
<p>As NLP systems grow in their ability to understand and produce language, so too
grows the potential for machine learning systems to <em>learn from language</em> to
solve other challenging tasks. In the papers above, we’ve shown that deep
neural language models can be used to successfully learn from language
explanations to improve generalization across a variety of tasks in vision and
NLP.</p>
<p>We think this is an exciting new avenue for training machine learning models,
and similar ideas are already being explored in areas such as reinforcement
learning (<a href="https://arxiv.org/abs/1910.08210">4</a>,
<a href="https://arxiv.org/abs/1906.03926">5</a>). We envision a future where in order to
solve a machine learning task, we no longer have to collect a large labeled
dataset, but instead interact naturally and expressively with a model in the
same way that humans have interacted with each other for millennia—<em>through
language</em>.</p>
<h3 id="acknowledgments"><strong>Acknowledgments</strong></h3>
<p>Thanks to our coauthors (Pang Wei Koh, Percy Liang, and Noah Goodman), and to
Nelson Liu, Pang Wei Koh, and the rest of the SAIL blog team for reviewing and
publishing this blog post. This research was supported in part by the <a href="https://research.fb.com/fellowship/">Facebook
Fellowship</a> (to Pang Wei Koh), the <a href="https://www.nsfgrfp.org/">NSF Graduate Research Fellowship</a> (to Jesse Mu), <a href="https://www.tri.global/">Toyota Research
Institute</a>, and the <a href="https://www.onr.navy.mil/">Office of Naval Research</a>.</p>
Mon, 23 Nov 2020 00:00:00 -0800Stanford AI Lab Papers and Talks at CoRL 2020
/blog/corl-2020/
/blog/corl-2020/<figure class="figure"><div class="figure__main">
<p><img class="postimagethird" src="/blog/assets/img/posts/2020-11-16-corl-2020/logo.png" /></p>
</div></figure>
<p>The <a href="https://www.robot-learning.org/">Conference on Robot Learning</a> (CoRL) 2020 is being hosted virtually from November 16th - November 18th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!</p>
<h2 id="list-of-accepted-papers">List of Accepted Papers</h2>
<hr />
<h4 id="learning-3d-dynamic-scene-representations-for-robot-manipulation"><a href="https://arxiv.org/pdf/2011.01968.pdf">Learning 3D Dynamic Scene Representations for Robot Manipulation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img0" />
<strong>Authors</strong>: Zhenjia Xu, Zhanpeng He, Jiajun Wu, Shuran Song
<br /><strong>Contact</strong>: jiajunwu@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2011.01968.pdf">Paper</a> | <a href="https://www.youtube.com/watch?v=GQjYG3nQJ80">Video</a> | <a href="https://dsr-net.cs.columbia.edu/">Website</a>
<br /><strong>Keywords</strong>: scene representations, 3d perception, robot manipulation</p>
<hr />
<h4 id="learning-latent-representations-to-influence-multi-agent-interaction"><a href="https://drive.google.com/file/d/1_ezqLLEv4HLtj9vflRj0sq3PNOhaSnJm/view">Learning Latent Representations to Influence Multi-Agent Interaction</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img6" />
<strong>Authors</strong>: Annie Xie, Dylan P. Losey, Ryan Tolsma, Chelsea Finn, Dorsa Sadigh
<br /><strong>Contact</strong>: anniexie@stanford.edu
<br /><strong>Links:</strong> <a href="https://drive.google.com/file/d/1_ezqLLEv4HLtj9vflRj0sq3PNOhaSnJm/view">Paper</a> | <a href="https://ai.stanford.edu/blog/lili/">Blog Post</a> | <a href="https://sites.google.com/view/latent-strategies">Website</a>
<br /><strong>Keywords</strong>: multi-agent systems, human-robot interaction, reinforcement learning</p>
<hr />
<h4 id="learning-object-conditioned-exploration-using-distributed-soft-actor-critic"><a href="https://arxiv.org/pdf/2007.14545.pdf">Learning Object-conditioned Exploration using Distributed Soft Actor Critic</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img1" />
<strong>Authors</strong>: Ayzaan Wahid (Google), Austin Stone (Google), Brian Ichter (Google Brain), Kevin Chen (Stanford), Alexander Toshev (Google)
<br /><strong>Contact</strong>: ayzaan@google.com
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2007.14545.pdf">Paper</a>
<br /><strong>Keywords</strong>: object navigation, visual navigation</p>
<hr />
<h4 id="mats-an-interpretable-trajectory-forecasting-representation-for-planning-and-control-"><a href="https://arxiv.org/abs/2009.07517">MATS: An Interpretable Trajectory Forecasting Representation for Planning and Control</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img2" />
<strong>Authors</strong>: Boris Ivanovic, Amine Elhafsi, Guy Rosman, Adrien Gaidon, Marco Pavone
<br /><strong>Contact</strong>: borisi@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2009.07517">Paper</a> | <a href="https://www.youtube.com/watch?v=q6hMY2y-BcQ">Video</a>
<br /><strong>Keywords</strong>: trajectory forecasting, learning dynamical systems, motion planning, autonomous vehicles</p>
<hr />
<h4 id="model-based-reinforcement-learning-for-decentralized-multiagent-rendezvous"><a href="https://arxiv.org/abs/2003.06906">Model-based Reinforcement Learning for Decentralized Multiagent Rendezvous</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img3" />
<strong>Authors</strong>: Rose E. Wang, J. Chase Kew, Dennis Lee, Tsang-Wei Edward Lee, Tingnan Zhang, Brian Ichter, Jie Tan, Aleksandra Faust
<br /><strong>Contact</strong>: rewang@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2003.06906">Paper</a> | <a href="https://youtu.be/HqeYcO1DBUU">Video</a> | <a href="https://sites.google.com/view/multiagent-hpp/home">Website</a>
<br /><strong>Keywords</strong>: multiagent systems, model-based reinforcement learning</p>
<hr />
<h4 id="reinforcement-learning-with-videos--combining-offline-observations-with-interaction"><a href="https://arxiv.org/abs/2011.06507">Reinforcement Learning with Videos: Combining Offline Observations with Interaction</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img4" />
<strong>Authors</strong>: Karl Schmeckpeper, Oleh Rybkin, Kostas Daniilidis, Sergey Levine, Chelsea Finn
<br /><strong>Contact</strong>: karls@seas.upenn.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2011.06507">Paper</a> | <a href="https://sites.google.com/view/rl-with-videos">Website</a>
<br /><strong>Keywords</strong>: reinforcement learning, learning from observation</p>
<hr />
<h4 id="sampling-based-reachability-analysis-a-random-set-theory-approach-with-adversarial-sampling"><a href="https://arxiv.org/abs/2008.10180">Sampling-based Reachability Analysis: A Random Set Theory Approach with Adversarial Sampling</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-16-corl-2020/img5" />
<strong>Authors</strong>: Thomas Lew, Marco Pavone
<br /><strong>Contact</strong>: thomas.lew@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2008.10180">Paper</a>
<br /><strong>Keywords</strong>: reachability analysis, robust planning and control, neural networks</p>
<h2 id="keynote">Keynote</h2>
<hr />
<h4 id="walking-the-boundary-of-learning-and-interaction-dorsa-sadigh">Walking the Boundary of Learning and Interaction (Dorsa Sadigh)</h4>
<figure class="figure"><div class="figure__main">
<p><img class="postimagethird" src="/blog/assets/img/posts/2020-11-16-corl-2020/keynote.png" /></p>
</div></figure>
<p><strong>Overview:</strong> There have been significant advances in the field of robot learning in the past decade. However, many challenges still remain when considering how robot learning can advance interactive agents such as robots that collaborate with humans. This includes autonomous vehicles that interact with human-driven vehicles or pedestrians, service robots collaborating with their users at homes over short or long periods of time, or assistive robots helping patients with disabilities. This introduces an opportunity for developing new robot learning algorithms that can help advance interactive autonomy.</p>
<p>In this talk, I will discuss a formalism for human-robot interaction built upon ideas from representation learning. Specifically, I will first discuss the notion of latent strategies: low-dimensional representations sufficient for capturing non-stationary interactions. I will then talk about the challenges of learning such representations when interacting with humans, and how we can develop data-efficient techniques that enable actively learning computational models of human behavior from demonstrations, preferences, or physical corrections. Finally, I will introduce an intuitive control paradigm that enables seamless collaboration based on learned representations, and discuss how it can also be used to influence humans.</p>
<p><strong>Live Event:</strong> November 17th, 7:00AM - 7:45AM PST</p>
<hr />
<p>We look forward to seeing you at CoRL!</p>
Mon, 16 Nov 2020 00:00:00 -0800Stanford AI Lab Papers and Talks at EMNLP 2020
/blog/emnlp-2020/
/blog/emnlp-2020/<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/logo.png" /></p>
<p>The <a href="https://2020.emnlp.org/">Conference on Empirical Methods in Natural Language Processing</a> (EMNLP) 2020 is being hosted virtually from November 16th - November 20th. We’re excited to share all the work from SAIL that’s being presented, and you’ll find links to papers, videos and blogs below. Feel free to reach out to the contact authors directly to learn more about the work that’s happening at Stanford!</p>
<ul>
<li><a href="#main-conference">Main Conference</a></li>
<li><a href="#findings-of-emnlp">Findings of EMNLP</a></li>
<li><a href="#workshops-and-co-located-conferences">Workshops and Co-Located Conferences</a></li>
</ul>
<h2 id="main-conference">Main Conference</h2>
<hr />
<h4 id="pre-training-transformers-as-energy-based-cloze-models"><a href="https://www.aclweb.org/anthology/2020.emnlp-main.20.pdf">Pre-Training Transformers as Energy-Based Cloze Models</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img19" />
<strong>Authors</strong>: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning
<br /><strong>Contact</strong>: kevclark@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://www.aclweb.org/anthology/2020.emnlp-main.20.pdf">Paper</a>
<br /><strong>Keywords</strong>: representation learning, self-supervised learning, energy-based models</p>
<hr />
<h4 id="alice-active-learning-with-contrastive-natural-language-explanations"><a href="https://arxiv.org/abs/2009.10259">ALICE: Active Learning with Contrastive Natural Language Explanations</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img8" />
<strong>Authors</strong>: Weixin Liang, James Zou, Zhou Yu
<br /><strong>Contact</strong>: wxliang@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2009.10259">Paper</a>
<br /><strong>Keywords</strong>: natural language explanation, class-based active learning, contrastive explanation</p>
<hr />
<h4 id="chexbert-combining-automatic-labelers-and-expert-annotations-for-accurate-radiology-report-labeling-using-bert"><a href="https://arxiv.org/abs/2004.09167">CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img1" />
<strong>Authors</strong>: Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, Matthew P. Lungren
<br /><strong>Contact</strong>: akshaysm@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2004.09167">Paper</a> | <a href="https://virtual.2020.emnlp.org/paper_main.55.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: bert, natural language processing, radiology, medical imaging, deep learning</p>
<hr />
<h4 id="autoqa-from-databases-to-qa-semantic-parsers-with-only-synthetic-training-data"><a href="https://www.aclweb.org/anthology/2020.emnlp-main.31/">AutoQA: From Databases To QA Semantic Parsers With Only Synthetic Training Data</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img21" />
<strong>Authors</strong>: Silei Xu, Sina J. Semnani, Giovanni Campagna, Monica S. Lam
<br /><strong>Contact</strong>: silei@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://www.aclweb.org/anthology/2020.emnlp-main.31/">Paper</a> | <a href="https://virtual.2020.emnlp.org/paper_main.3506.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: question answering, semantic parsing, language models, synthetic training data, data augmentation</p>
<hr />
<h4 id="data-and-representation-for-turkish-natural-language-inference"><a href="https://arxiv.org/pdf/2004.14963.pdf">Data and Representation for Turkish Natural Language Inference</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img14" />
<strong>Authors</strong>: Emrah Budur, Rıza Özçelik, Tunga Güngör, Christopher Potts
<br /><strong>Contact</strong>: emrah.budur@boun.edu.tr
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2004.14963.pdf">Paper</a> | <a href="https://github.com/boun-tabi/NLI-TR">Website</a>
<br /><strong>Keywords</strong>: sentence-level semantics, natural language inference, neural machine translation, morphologically rich language</p>
<hr />
<h4 id="intrinsic-evaluation-of-summarization-datasets"><a href="https://github.com/rishibommasani/rishibommasani.github.io/blob/master/papers/EMNLP2020.pdf">Intrinsic Evaluation of Summarization Datasets</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img6" />
<strong>Authors</strong>: Rishi Bommasani, Claire Cardie
<br /><strong>Contact</strong>: nlprishi@stanford.edu
<br /><strong>Links:</strong> <a href="https://github.com/rishibommasani/rishibommasani.github.io/blob/master/papers/EMNLP2020.pdf">Paper</a> | <a href="https://slideslive.com/38938755">Video</a> | <a href="https://rishibommasani.github.io/">Website</a> | <a href="https://virtual.2020.emnlp.org/paper_main.675.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: summarization, datasets, evaluation</p>
<hr />
<h4 id="learning-music-helps-you-read-using-transfer-to-study-linguistic-structure-in-language-models"><a href="https://arxiv.org/abs/2004.14601">Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img22" />
<strong>Authors</strong>: Isabel Papadimitriou, Dan Jurafsky
<br /><strong>Contact</strong>: isabelvp@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2004.14601">Paper</a>
<br /><strong>Keywords</strong>: transfer learning, analysis, music, hierarchical structure</p>
<hr />
<h4 id="localizing-open-ontology-qa-semantic-parsers-in-a-day-using-machine-translation"><a href="https://arxiv.org/pdf/2010.05106.pdf">Localizing Open-Ontology QA Semantic Parsers in a Day Using Machine Translation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img5" />
<strong>Authors</strong>: Mehrad Moradshahi, Giovanni Campagna, Sina J. Semnani, Silei Xu, Monica S. Lam
<br /><strong>Contact</strong>: mehrad@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2010.05106.pdf">Paper</a> | <a href="https://github.com/stanford-oval/SPL">Website</a>
<br /><strong>Keywords</strong>: machine translation, semantic parsing, localization</p>
<hr />
<h4 id="slm-learning-a-discourse-language-representation-with-sentence-unshuffling"><a href="https://arxiv.org/pdf/2010.16249.pdf">SLM: Learning a Discourse Language Representation with Sentence Unshuffling</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img18" />
<strong>Authors</strong>: Haejun Lee, Drew A. Hudson, Kangwook Lee, Christopher D. Manning
<br /><strong>Contact</strong>: dorarad@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2010.16249.pdf">Paper</a>
<br /><strong>Keywords</strong>: transformer, bert, language, understanding, nlp, squad, glue, sentences, discourse</p>
<hr />
<h4 id="utility-is-in-the-eye-of-the-user-a-critique-of-nlp-leaderboards"><a href="https://arxiv.org/pdf/2009.13888.pdf">Utility is in the Eye of the User: A Critique of NLP Leaderboards</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img0" />
<strong>Authors</strong>: Kawin Ethayarajh, Dan Jurafsky
<br /><strong>Contact</strong>: kawin@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2009.13888.pdf">Paper</a> | <a href="https://kawine.github.io/">Website</a>
<br /><strong>Keywords</strong>: nlp, leaderboard, utility, benchmark, fairness, efficiency</p>
<hr />
<h4 id="with-little-power-comes-great-responsibility"><a href="https://arxiv.org/abs/2010.06595">With Little Power Comes Great Responsibility</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img4" />
<strong>Authors</strong>: Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, Dan Jurafsky
<br /><strong>Contact</strong>: dcard@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.06595">Paper</a> | <a href="https://github.com/dallascard/NLP-power-analysis">Website</a>
<br /><strong>Keywords</strong>: statistical power, experimental methodology, leaderboards, machine translation, human evaluation</p>
<hr />
<h2 id="findings-of-emnlp">Findings of EMNLP</h2>
<hr />
<h4 id="desmog-detecting-stance-in-media-on-global-warming"><a href="https://www.aclweb.org/anthology/2020.findings-emnlp.296.pdf">DeSMOG: Detecting Stance in Media On Global Warming</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img16" />
<strong>Authors</strong>: Yiwei Luo, Dallas Card, Dan Jurafsky
<br /><strong>Contact</strong>: yiweil@stanford.edu
<br /><strong>Links:</strong> <a href="https://www.aclweb.org/anthology/2020.findings-emnlp.296.pdf">Paper</a> | <a href="http://stanford.edu/~yiweil/webpage.html">Website</a>
<br /><strong>Keywords</strong>: computational social science, framing, argumentation, stance, bias, climate change</p>
<hr />
<h4 id="investigating-transferability-in-pretrained-language-models"><a href="https://arxiv.org/abs/2004.14975">Investigating Transferability in Pretrained Language Models</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img15" />
<strong>Authors</strong>: Alex Tamkin, Trisha Singh, Davide Giovanardi, Noah Goodman
<br /><strong>Contact</strong>: atamkin@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2004.14975">Paper</a> | <a href="http://alextamkin.com">Website</a> | <a href="https://virtual.2020.emnlp.org/paper_WS-1.1165_F.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: finetuning, transfer learning, language models, bert, probing</p>
<hr />
<h4 id="stay-hungry-stay-focused-generating-informative-and-specific-questions-in-information-seeking-conversations"><a href="https://arxiv.org/pdf/2004.14530.pdf">Stay Hungry, Stay Focused: Generating Informative and Specific Questions in Information-Seeking Conversations</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img2" />
<strong>Authors</strong>: Peng Qi, Yuhao Zhang, Christopher D. Manning
<br /><strong>Contact</strong>: pengqi@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2004.14530.pdf">Paper</a> | <a href="https://qipeng.me/blog/learning-to-ask/">Blog Post</a> | <a href="https://virtual.2020.emnlp.org/paper_WS-1.69_F.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: conversational agents, question generation, natural language generation</p>
<hr />
<h4 id="do-language-embeddings-capture-scales"><a href="https://arxiv.org/abs/2010.05345">Do Language Embeddings Capture Scales?</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img11" />
<strong>Authors</strong>: Xikun Zhang*, Deepak Ramachandran*, Ian Tenney, Yanai Elazar, Dan Roth
<br /><strong>Contact</strong>: xikunz2@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.05345">Paper</a> | <a href="https://virtual.2020.emnlp.org/paper_findings.439.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: probing, analysis, bertology, scales, common sense knowledge</p>
<hr />
<h4 id="on-the-importance-of-adaptive-data-collection-for-extremely-imbalanced-pairwise-tasks"><a href="https://arxiv.org/abs/2010.05103">On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img7" />
<strong>Authors</strong>: Stephen Mussmann, Robin Jia, Percy Liang
<br /><strong>Contact</strong>: robinjia@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.05103">Paper</a> | <a href="https://worksheets.codalab.org/worksheets/0x39ba5559790b4099a7ff75f916ce19a4">Website</a>
<br /><strong>Keywords</strong>: active learning, robustness, label imbalance</p>
<hr />
<h4 id="pragmatic-issue-sensitive-image-captioning"><a href="https://arxiv.org/abs/2004.14451">Pragmatic Issue-Sensitive Image Captioning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img12" />
<strong>Authors</strong>: Allen Nie, Reuben Cohn-Gordon, Christopher Potts
<br /><strong>Contact</strong>: anie@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2004.14451">Paper</a> | <a href="https://slideslive.com/38940644/pragmatic-issuesensitive-image-captioning">Video</a>
<br /><strong>Keywords</strong>: controllable caption generation, question under discussion, discourse, pragmatics</p>
<hr />
<h2 id="workshops-and-co-located-conferences">Workshops and Co-Located Conferences</h2>
<hr />
<h4 id="bleu-neighbors-a-reference-less-approach-to-automatic-evaluation"><a href="https://arxiv.org/pdf/2004.12726.pdf">BLEU Neighbors: A Reference-less Approach to Automatic Evaluation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img3" />
<strong>Authors</strong>: Kawin Ethayarajh, Dorsa Sadigh
<br /><strong>Contact</strong>: kawin@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/pdf/2004.12726.pdf">Paper</a> | <a href="https://kawine.github.io/">Website</a>
<br /><strong>Keywords</strong>: nlp, bleu, evaluation, nearest neighbors, dialogue</p>
<hr />
<h4 id="determining-question-answer-plausibility-in-crowdsourced-datasets-using-multi-task-learning"><a href="https://arxiv.org/abs/2011.04883">Determining Question-Answer Plausibility in Crowdsourced Datasets Using Multi-Task Learning</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img17" />
<strong>Authors</strong>: Rachel Gardner, Maya Varma, Clare Zhu, Ranjay Krishna
<br /><strong>Contact</strong>: rachel0@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2011.04883">Paper</a>
<br /><strong>Keywords</strong>: noisy text, bert, plausibility, multi-task learning</p>
<hr />
<h4 id="explaining-the-trump-gap-in-social-distancing-using-covid-discourse"><a href="https://openreview.net/pdf/baa636711f681ae8664818f378d565b17065c604.pdf">Explaining the ‘Trump Gap’ in Social Distancing Using COVID Discourse</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img20" />
<strong>Authors</strong>: Austin van Loon, Sheridan Stewart, Brandon Waldon, Shrinidhi K. Lakshmikanth, Ishan Shah, Sharath Chandra Guntuku, Garrick Sherman, James Zou, Johannes Eichstaedt
<br /><strong>Contact</strong>: avanloon@stanford.edu
<br /><strong>Links:</strong> <a href="https://openreview.net/pdf/baa636711f681ae8664818f378d565b17065c604.pdf">Paper</a>
<br /><strong>Keywords</strong>: computational social science, social distancing, word2vec, vector semantics, twitter, bert</p>
<hr />
<h4 id="learning-adaptive-language-interfaces-through-decomposition"><a href="https://arxiv.org/abs/2010.05190">Learning Adaptive Language Interfaces through Decomposition</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img10" />
<strong>Authors</strong>: Siddharth Karamcheti, Dorsa Sadigh, Percy Liang
<br /><strong>Contact</strong>: skaramcheti@cs.stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.05190">Paper</a> | <a href="https://virtual.2020.emnlp.org/paper_WS-6.10.html">Virtual Conference Room</a>
<br /><strong>Keywords</strong>: semantic parsing, interaction, decomposition</p>
<hr />
<h4 id="modeling-subjective-assessments-of-guilt-in-newspaper-crime-narratives"><a href="https://arxiv.org/abs/2006.09589">Modeling Subjective Assessments of Guilt in Newspaper Crime Narratives</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img23" />
<strong>Authors</strong>: Elisa Kreiss*, Zijian Wang*, Christopher Potts
<br /><strong>Contact</strong>: ekreiss@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2006.09589">Paper</a> | <a href="https://github.com/zijwang/modeling_guilt">Website</a>
<br /><strong>Keywords</strong>: psycholinguistics, pragmatics, token-level supervision, model attribution, news, guilt, hedges, corpus, subjectivity</p>
<hr />
<h4 id="neural-natural-language-inference-models-partially-embed-theories-of-lexical-entailment-and-negation"><a href="https://arxiv.org/abs/2004.14623">Neural Natural Language Inference Models Partially Embed Theories of Lexical Entailment and Negation</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img13" />
<strong>Authors</strong>: Atticus Geiger, Kyle Richardson, Chris Potts
<br /><strong>Contact</strong>: atticusg@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2004.14623">Paper</a> | <a href="https://atticusg.github.io/">Website</a>
<br /><strong>Keywords</strong>: entailment, intervention, causality, systematic generalization</p>
<hr />
<h4 id="structured-self-attention-weights-encode-semantics-in-sentiment-analysis"><a href="https://arxiv.org/abs/2010.04922">Structured Self-Attention Weights Encode Semantics in Sentiment Analysis</a></h4>
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-15-emnlp-2020/img9" />
<strong>Authors</strong>: Zhengxuan Wu, Thanh-Son Nguyen, Desmond C. Ong
<br /><strong>Contact</strong>: wuzhengx@stanford.edu
<br /><strong>Links:</strong> <a href="https://arxiv.org/abs/2010.04922">Paper</a>
<br /><strong>Keywords</strong>: attention, explainability, sentiment analysis</p>
<hr />
<p>We look forward to seeing you at EMNLP 2020!</p>
Sun, 15 Nov 2020 00:00:00 -0800Learning to Influence Multi-Agent Interaction
/blog/lili/
/blog/lili/<p>Interaction with others is an important part of everyday life. No matter
the situation – whether it be playing a game of chess, carrying a
box together, or navigating lanes of traffic – we’re able to
seamlessly compete against, collaborate with, and acclimate to other
people.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimagethird" src="/blog/assets/img/posts/2020-11-14-lili/motiv0.jpg" />
<img class="postimagethird" src="/blog/assets/img/posts/2020-11-14-lili/motiv1.jpg" />
<img class="postimagethird" src="/blog/assets/img/posts/2020-11-14-lili/motiv2.png" /></p>
</div></figure>
<p>Likewise, as robots become increasingly prevalent and capable, their
interaction with humans and other robots is inevitable. However, despite
the many advances in robot learning, most current algorithms are
designed for robots that act in isolation. These methods miss out on the
fact that other agents are also learning and changing – and so the
behavior the robot learns for the current interaction may not work
during the next one! Instead, can robots learn to seamlessly interact
with humans and other robots by taking their changing strategies into
account? In our new work (<a href="http://iliad.stanford.edu/pdfs/publications/xie2020learning.pdf">paper</a>,
<a href="https://sites.google.com/view/latent-strategies/">website</a>), we
begin to investigate this question.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-14-lili/hockey_sac.gif" /></p>
<figcaption>
A standard reinforcement learning agent (left) based on <a href="https://arxiv.org/abs/1801.01290">Soft
Actor-Critic</a> (<b>SAC</b>) assumes that
the opponent (right) follows a fixed strategy, and only blocks on its
left side.
</figcaption>
</div></figure>
<p>Interactions with humans are difficult for robots because humans and
other intelligent agents don’t have fixed behavior – their
strategies and habits change over time. In other words, they update
their actions in response to the robot and thus continually change the
robot’s learning environment. Consider the robot on the left (the agent)
learning to play air hockey against the non-stationary robot on the
right. Rather than hitting the same shot every time, the other robot
modifies its policy between interactions to exploit the agent’s
weaknesses. If the agent ignores how the other robot changes, then it
will fail to adapt accordingly and learn a poor policy.</p>
<p>The best defense for the agent is to block where it thinks the opponent
will target next. The robot therefore needs to anticipate how the
behavior of the other agent will change, and model how its own actions
affect the other’s behavior. People can deal with these scenarios on a
daily basis (e.g., driving, walking), and they do so without explicitly
modeling every low-level aspect of each other’s policy.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-14-lili/motiv3.gif" /></p>
</div></figure>
<p>Humans tend to be bounded-rational (i.e., their rationality is limited
by knowledge and computational capacity), and so likely keep track of
much less complex entities during interaction. Inspired by how humans
solve these problems, we recognize that robots also do not need to
explicitly model every low-level action another agent will make.
Instead, we can capture the hidden, underlying intent – what we call
a latent strategy (in the sense that it underlies the actions of the
agent) – of other agents through learned low-dimensional
representations. These representations are learned by optimizing neural
networks based on experience interacting with these other agents.</p>
<h3 id="learning-and-influencing-latent-intent">Learning and Influencing Latent Intent</h3>
<p>We propose a framework for learning latent representations of another
agent’s policy: <strong>Learning and Influencing Latent Intent (LILI)</strong>. The
agent of our framework identifies the relationship between its behavior
and the other agent’s future strategy, and then leverages these latent
dynamics to influence the other agent, purposely guiding them towards
policies suitable for co-adaptation. At a high level, the robot learns
two things: a way to predict latent strategy, and a policy for
responding to that strategy. The robot learns these during interaction
by “thinking back” to prior experiences, and figuring out what
strategies and policies it should have used.</p>
<figure class="figure"><div class="figure__main">
<p><img src="/blog/assets/img/posts/2020-11-14-lili/method.png" /></p>
</div></figure>
<h4 id="modeling-agent-strategies">Modeling Agent Strategies</h4>
<p>The first step, shown on the left side of the diagram above, is to learn
to represent the behavior of other agents. Many prior works assume
access to the underlying intentions or actions of other agents, which
can be a restrictive assumption. We instead recognize that a
low-dimensional representation of their behavior, i.e., their latent
strategy, can be inferred from the dynamics and rewards experienced by
the agent during the current interaction. Therefore, given a sequence of
interactions, we can train an
<a href="https://en.wikipedia.org/wiki/Autoencoder">encoder-decoder</a>
model; the encoder embeds interaction <script type="math/tex">k</script> and predicts the next
latent strategy <script type="math/tex">z^{k+1}</script>, and the decoder takes this prediction
and reconstructs the transitions and rewards observed during interaction
<script type="math/tex">k+1</script>.</p>
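<p>As a rough illustration of how the training data for such a model could be laid out, the sketch below pairs each interaction with the one that follows it, so that an encoder sees interaction k and a decoder is scored on interaction k+1. The names <code>Transition</code>, <code>make_training_pairs</code>, <code>encode</code>, and <code>decode_reward</code> are hypothetical and not taken from the paper, and the one-parameter encoder and decoder are stand-ins for the neural networks a real implementation would train by gradient descent.</p>

```python
from collections import namedtuple

# Hypothetical container for one step of experience (not from the paper).
Transition = namedtuple("Transition", ["state", "action", "next_state", "reward"])


def make_training_pairs(interactions):
    """Pair interaction k (encoder input) with interaction k+1 (decoder target)."""
    return list(zip(interactions[:-1], interactions[1:]))


def encode(interaction, weight):
    """Toy encoder: a scaled mean reward stands in for the latent strategy z^{k+1}."""
    mean_reward = sum(t.reward for t in interaction) / len(interaction)
    return weight * mean_reward


def decode_reward(z, transition, weight):
    """Toy decoder: predicts a reward from z alone (this toy ignores state and action)."""
    return weight * z


def reconstruction_loss(pairs, enc_w, dec_w):
    """Mean squared error between predicted and observed rewards in interaction k+1."""
    errors = []
    for current, following in pairs:
        z_next = encode(current, enc_w)
        for t in following:
            errors.append((decode_reward(z_next, t, dec_w) - t.reward) ** 2)
    return sum(errors) / len(errors)


# Three tiny made-up interactions with one transition each.
interactions = [
    [Transition(0.0, 1.0, 1.0, 1.0)],
    [Transition(1.0, 1.0, 2.0, 1.0)],
    [Transition(2.0, 1.0, 3.0, 1.0)],
]
pairs = make_training_pairs(interactions)
loss = reconstruction_loss(pairs, enc_w=1.0, dec_w=1.0)
```

<p>The point of the pairing is that the latent strategy is never labeled: the only supervision is how well the prediction made from interaction k explains what the agent experiences in interaction k+1.</p>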
<h4 id="influencing-by-optimizing-for-long-term-rewards">Influencing by Optimizing for Long-Term Rewards</h4>
<p>Given a prediction of what strategy the other agent will follow next,
the agent can learn how to <em>react</em> to it, as illustrated on the right
side of the diagram above. Specifically, we train an agent policy
<script type="math/tex">\pi_\theta(a | s, z^i)</script> with reinforcement learning (RL) to
make decisions conditioned on the latent strategy <script type="math/tex">z^i</script> predicted
by the encoder.</p>
<p>However, beyond simply <em>reacting</em> to the predicted latent strategy, an
intelligent agent should proactively <em>influence</em> this strategy to
maximize rewards over repeated interactions. Returning to our hockey
example, consider an opponent with three different strategies: it fires
to the left, down the middle, or to the right. Moreover, left-side shots
are easier for the agent to block and so give a higher reward when
successfully blocked. The agent should influence its opponent to adopt
the left strategy more frequently in order to earn higher long-term
rewards.</p>
<p>For learning this influential behavior, we train the agent policy
<script type="math/tex">\pi_\theta</script> to maximize rewards across multiple interactions:</p>
<script type="math/tex; mode=display">\max_\theta~\sum_{i=1}^{\infty} \gamma^i~ \mathbb{E} \left[ \sum_{t=1}^H R(s, z^i) \right]</script>
<p>With this objective, the agent learns to generate interactions that
influence the other agent, and hence the system, toward outcomes that
are more desirable for the agent or for the team as a whole.</p>
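<p>To make the difference between this objective and a single-interaction objective concrete, here is a minimal sketch that evaluates both on one sampled sequence of rewards. It assumes a finite number of interactions (the sum in the objective is infinite) and uses a single sample in place of the expectation; the function names and the reward values are made up for illustration.</p>

```python
def influence_objective(rewards_per_interaction, gamma):
    """Discounted sum of per-interaction returns, with interactions indexed from i = 1."""
    total = 0.0
    for i, rewards in enumerate(rewards_per_interaction, start=1):
        total += (gamma ** i) * sum(rewards)
    return total


def single_interaction_objective(rewards):
    """Return of one interaction only, with no credit for later interactions."""
    return sum(rewards)


# Two interactions of horizon H = 2 with made-up rewards.
rewards = [[1.0, 1.0], [2.0, 2.0]]
value = influence_objective(rewards, gamma=0.5)
first_only = single_interaction_objective(rewards[0])
```

<p>Because later interactions contribute to the total, an action that lowers the reward now but steers the other agent toward a more favorable strategy next time can still increase the objective.</p>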
<h3 id="experiments">Experiments</h3>
<h4 id="2d-navigation">2D Navigation</h4>
<p>We first consider a simple point mass navigation task. Similar to
pursuit-evasion games, the agent needs to reach the other agent (i.e.,
the target) in a 2D plane. This target moves one step clockwise or
counterclockwise around a circle depending on where the agent ended the
previous interaction. Because the agent starts off-center, some target
locations can be reached more efficiently than others. Importantly, the
agent never observes the location of the target.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-14-lili/pm.png" /></p>
</div></figure>
<p>Below, we visualize 25 consecutive interactions from policies learned by
Soft Actor-Critic (<strong>SAC</strong>) (a standard RL algorithm), <strong>LILI (no influence)</strong>,
and <strong>LILI</strong>. <strong>LILI (no influence)</strong> corresponds to our approach without the
influencing objective; i.e., the agent optimizes rewards accumulated in
a <em>single</em> interaction. The gray circle represents the target, while the
teal line marks the trajectory taken by the agent and the teal circle
marks the agent’s position at the final timestep of the interaction.</p>
<figure class="figure"><div class="figure__main">
<figure class="postfigurethird">
<img src="/blog/assets/img/posts/2020-11-14-lili/pm_sac.gif" />
<figcaption>
<b>SAC</b>
</figcaption>
</figure>
<figure class="postfigurethird">
<img src="/blog/assets/img/posts/2020-11-14-lili/pm_lili_no_influence.gif" />
<figcaption>
<b>LILI (no influence)</b>
</figcaption>
</figure>
<figure class="postfigurethird">
<img src="/blog/assets/img/posts/2020-11-14-lili/pm_lili.gif" />
<figcaption>
<b>LILI</b>
</figcaption>
</figure>
</div></figure>
<p>The <strong>SAC</strong> policy, at convergence, moves to the center of the circle in
every interaction. Without knowledge of or any mechanism to infer where
the other agent is, the center of the circle gives the highest stable
rewards. In contrast, <strong>LILI (no influence)</strong> successfully models the other
agent’s behavior dynamics and correctly navigates to the other agent,
but isn’t trained to influence the other agent. Our full approach <strong>LILI</strong>
<em>does</em> learn to influence: it traps the other agent at the top of the
circle, where the other agent is closest to the agent’s starting
position and yields the highest rewards.</p>
<h4 id="robotic-air-hockey">Robotic Air Hockey</h4>
<p>Next, we evaluate our approach on the air hockey task, played between
two robotic agents. The agent first learns alongside a robot opponent,
then plays against a human opponent. The opponent is a rule-based agent
which always aims away from where the agent last blocked. When blocking,
the robot does not know where the opponent is aiming, and only observes
the vertical position of the puck. We additionally give the robot a
bonus reward if it blocks a shot on the left of the board, which
incentivizes the agent to influence the opponent into aiming left.</p>
<p>In contrast to the <strong>SAC</strong> agent, the <strong>LILI</strong> agent learns to anticipate
the opponent’s future strategies and successfully block the different
incoming shots.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-14-lili/hockey_lili.gif" /></p>
</div></figure>
<p>Because the agent receives a bonus reward for blocking left, it should
lead the opponent into firing left more often. <strong>LILI (no influence)</strong> fails
to guide the opponent into taking advantage of this bonus: the
distribution over the opponent’s strategies is uniform. In contrast,
<strong>LILI</strong> leads the opponent to strike left 41% of the time, demonstrating
the agent’s ability to influence the opponent. Specifically, the agent
manipulates the opponent into alternating between the left and middle
strategies.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-14-lili/influence.jpg" /></p>
</div></figure>
<p>Finally, we test the policy learned by <strong>LILI (no influence)</strong> against a
human player following the same strategy pattern as the robot opponent.
Importantly, the human has imperfect aim and so introduces new noise to
the environment. We originally intended to test our approach <strong>LILI</strong> with
human opponents, but we found that – although <strong>LILI</strong> worked well when
playing against another robot – the learned policy was too brittle
and did not generalize to playing alongside human opponents. However,
the policy learned with <strong>LILI (no influence)</strong> was able to block 73% of
shots from the human.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-14-lili/human.gif" /></p>
</div></figure>
<h3 id="final-thoughts">Final Thoughts</h3>
<p>We proposed a framework for multi-agent interaction that represents the
behavior of other agents with learned high-level strategies, and
incorporates these strategies into an RL algorithm. Robots with our
approach were able to anticipate how their behavior would affect another
agent’s latent strategy, and actively influenced that agent for more
seamless co-adaptation.</p>
<p>Our work represents a step towards building robots that act alongside
humans and other agents. To this end, we’re excited about these next
steps:</p>
<ul>
<li>
<p>The agents we examined in our experiments had a small number of simple strategies determining their behavior. We’d like to study the scalability of our approach to more complex agent strategies that we’re likely to see in humans and intelligent agents.</p>
</li>
<li>
<p>Instead of training alongside artificial agents, we hope to study the human-in-the-loop setting in order to adapt to the dynamic needs and preferences of real people.</p>
</li>
</ul>
<hr />
<p>This post is based on the following paper:</p>
<p>Annie Xie, Dylan P. Losey, Ryan Tolsma, Chelsea Finn, Dorsa Sadigh.
<a href="http://iliad.stanford.edu/pdfs/publications/xie2020learning.pdf"><strong>Learning Latent Representations for Multi-Agent Interaction.</strong></a>
<a href="https://sites.google.com/view/latent-strategies/">Project webpage</a></p>
<p>Finally, thanks to Dylan Losey, Chelsea Finn, Dorsa Sadigh, Andrey Kurenkov, and Michelle Lee for valuable feedback on this post.</p>
Sat, 14 Nov 2020 00:00:00 -0800Bootleg: Chasing the Tail with Self-Supervised Named Entity Disambiguation
/blog/bootleg/
/blog/bootleg/<figure style="text-align: center">
<img style="width: 20%;" src="/blog/assets/img/posts/2020-11-12-bootleg/logo.png" />
</figure>
<p>Named entity disambiguation (NED) is the process of mapping “strings” to “things” in a knowledge base. You have likely already used a system that requires NED multiple times today. Every time you ask a question to your personal assistant or issue a search query on your favorite browser, these systems use NED to understand what people, places, and things (entities) are being talked about.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-12-bootleg/ned_example_1.svg" /></p>
<figcaption style="text-align: left;">Named entity disambiguation example. The ambiguous “Lincoln” refers to the car, not the person or location.</figcaption>
</div></figure>
<p>Take the example shown above. You ask your personal assistant “What is the average gas mileage of a Lincoln?”. The assistant would need NED to know that “Lincoln” refers to Lincoln Motors (the car company)—not the former president or city in Nebraska. The ambiguity of mentions in text is what makes NED so challenging as it requires the use of subtle cues.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_90" src="/blog/assets/img/posts/2020-11-12-bootleg/ned_distribution_2.svg" /></p>
<figcaption style="text-align: left;">The spectrum of entities. Popular (head) entities occur frequently in data while rare (tail) entities are infrequent.</figcaption>
</div></figure>
<p>NED gets more interesting when we examine the full spectrum of entities shown above, specifically the rarer <em>tail</em> and <em>unseen</em> entities. These are entities that occur infrequently or not at all in data. <strong>Performance over the tail is critical because the majority of entities are rare.</strong> In <a href="https://www.wikidata.org/wiki/Wikidata:Main_Page">Wikidata</a>, only 13% of entities even have Wikipedia pages as a source of textual information.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_60" src="/blog/assets/img/posts/2020-11-12-bootleg/frequency_plot_3.svg" /></p>
<figcaption style="text-align: left;">Bootleg compared to a BERT-based baseline model (<a href="https://arxiv.org/pdf/2005.14253.pdf">Févry et al. 2020</a>), showing average F1 versus the number of times an entity occurred in the training data. Because Wikidata has 15x as many entities as Wikipedia (most of them rare) and the baseline model needs to see an entity 100 times on average to achieve 60 F1, the baseline model would need to train on data 1,500x the size of Wikipedia to achieve 60 F1 over all entities.</figcaption>
</div></figure>
<p>Prior approaches to NED use BERT-based systems to memorize textual patterns associated with an entity (e.g., Abraham Lincoln is associated with “president”). As shown above, the SotA BERT-based <strong>baseline</strong> from <a href="https://arxiv.org/pdf/2005.14253.pdf">Févry et al.</a> does a great job at memorizing patterns over popular entities (it achieves 86 F1 points over all entities). For the rare entities, it does much worse (58 F1 points lower on the tail). One way to improve tail performance is simply to train on more data, but this would likely require training on data 1,500x the size of Wikipedia for the model to achieve 60 F1 points over all entities!</p>
<p>In this blog post, we present <strong>Bootleg</strong>, a self-supervised approach to NED that is better able to handle rare entities.</p>
<h1 id="tail-disambiguation-through-ned-reasoning-patterns">Tail Disambiguation through NED Reasoning Patterns</h1>
<p>The question we are left with is: how do we disambiguate these rare entities? <strong>Our insight is that humans disambiguate entities, including rare entities, by using signals from text as well as from entity relations and types.</strong> For example, the sentence “What is the gas mileage of a Lincoln?” requires reasoning that cars have a gas mileage, not people or locations. The same reasoning tells us that the mention of “Bluebird” in “What is the average gas mileage of a Bluebird?” refers to the car, a Nissan Bluebird, not the animal. Our goal in Bootleg is to train a model to reason over entity types and relations and better identify these tail entities.</p>
<p>Through empirical analysis, we found four reasoning patterns for NED, shown and defined in the figure below.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_90" src="/blog/assets/img/posts/2020-11-12-bootleg/reasoning_patterns_4.svg" /></p>
<figcaption style="text-align: left;">Four reasoning patterns of NED. Each pattern uses some combination of entity, type, and relation information.</figcaption>
</div></figure>
<p>These patterns rely on signals from entities, types, and relations. Luckily, <strong>tail entities do not have equally rare types and relations</strong>. This means we should be able to learn type and relation patterns from our data that can apply to tail entities.</p>
<h1 id="bootleg-a-model-for-tail-ned">Bootleg: A Model for Tail NED</h1>
<p>Bootleg takes as input a sentence, determines the possible entity candidates that could be mentioned in the sentence, and outputs the most likely candidates. The core insight that enables Bootleg to better identify rare entities is in how it internally represents entities.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_60" src="/blog/assets/img/posts/2020-11-12-bootleg/candidate_embedding_5.svg" /></p>
<figcaption style="text-align: left;">The creation of an entity candidate representation. Each candidate is a combination of an entity, type, and relation learned embedding.</figcaption>
</div></figure>
<p>Similar to how words are often represented by continuous word embeddings (e.g., <a href="https://arxiv.org/pdf/1810.04805.pdf">BERT</a> or <a href="https://arxiv.org/pdf/1802.05365.pdf">ELMo</a>), Bootleg represents entity candidates as a combination of a unique entity embedding, a type embedding, and a relation embedding, as shown above. For example, each car entity will get the <em>same</em> car type embedding (likewise for relations) which will encode patterns learned over all cars in the training data. A rare car can then use this global “car type” knowledge for disambiguation, as it will have the car embedding as part of its representation.</p>
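<p>As a rough illustration of why a shared type embedding helps rare entities, here is a minimal sketch in plain Python. It assumes the three embeddings are simply concatenated; the embedding size, the lookup tables, and the names <code class="highlighter-rouge">Lincoln (car)</code> and <code class="highlighter-rouge">manufactured_by</code> are hypothetical and are not taken from the Bootleg code base.</p>

```python
import random

random.seed(0)
DIM = 4  # illustrative embedding size (assumption)


def new_embedding(dim=DIM):
    """Return a random vector standing in for a learned embedding."""
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


# Hypothetical lookup tables: one vector per entity, per type, per relation.
entity_emb = {"Nissan Bluebird": new_embedding(), "Lincoln (car)": new_embedding()}
type_emb = {"car": new_embedding()}
relation_emb = {"manufactured_by": new_embedding()}


def candidate_representation(entity, entity_type, relation):
    """Concatenate entity, type, and relation vectors (assumed combination)."""
    return entity_emb[entity] + type_emb[entity_type] + relation_emb[relation]


rare = candidate_representation("Nissan Bluebird", "car", "manufactured_by")
popular = candidate_representation("Lincoln (car)", "car", "manufactured_by")

# The type slice is identical for both candidates, so whatever is learned
# about "car" from popular entities is also available to the rare one.
shared_type_slice = rare[DIM:2 * DIM] == popular[DIM:2 * DIM]
```

<p>Only the entity-specific slice differs between the two candidates; the type and relation slices are shared across every entity that has them.</p>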
<p>To output the correct entities, Bootleg uses these representations in a stacked <a href="https://arxiv.org/pdf/1706.03762.pdf">Transformer</a> module to allow the model to naturally learn the useful patterns for disambiguation without hard-coded rules. Bootleg then scores the output candidate representations and returns the most likely candidates.</p>
<p>There are other exciting techniques we present in our <a href="https://arxiv.org/pdf/2010.10363.pdf">paper</a> regarding regularization and weak labeling to improve tail performance.</p>
<h1 id="bootleg-improves-tail-performance-and-allows-for-knowledge-transfer">Bootleg Improves Tail Performance and Allows for Knowledge Transfer</h1>
<p>Our simple insight of training a model to reason over types and relations <strong>provides state-of-the-art performance on three standard NED benchmarks</strong> – matching or exceeding SotA by up to 5.6 F1 points – and <strong>outperforms a BERT-based NED baseline by 5.4 F1 points over all entities and 40 F1 points over tail entities</strong> (see F1 versus entity occurrence plot above).</p>
<p>
<figure class="figure"><div class="figure__main">
<table>
<thead>
<tr>
<th>Benchmark</th>
<th>System</th>
<th>Precision</th>
<th>Recall</th>
<th>F1</th>
</tr>
</thead>
<tbody>
<tr>
<td rowspan="2"><a href="https://www.hoffart.ai/wp-content/papercite-data/pdf/hoffart-2012vx.pdf">KORE50</a></td>
<td><a href="https://www.mdpi.com/2073-8994/11/4/453">Hu et al., 2019</a></td>
<td>80.0</td>
<td>79.8</td>
<td>79.9</td>
</tr>
<tr>
<td>Bootleg</td>
<td><b>86.0</b></td>
<td><b>85.4</b></td>
<td><b>85.7</b></td>
</tr>
<tr>
<td rowspan="2"><a href="https://link.springer.com/chapter/10.1007/978-3-642-41335-3_9">RSS500</a></td>
<td><a href="https://arxiv.org/pdf/1802.01074.pdf">Phan et al., 2019</a></td>
<td>82.3 </td>
<td>82.3</td>
<td>82.3</td>
</tr>
<tr>
<td>Bootleg</td>
<td><b>82.5</b></td>
<td><b>82.5</b></td>
<td><b>82.5</b></td>
</tr>
<tr>
<td rowspan="2"><a href="https://www.aclweb.org/anthology/D11-1072.pdf">AIDA CoNLL YAGO</a></td>
<td><a href="https://arxiv.org/pdf/2005.14253.pdf">Févry et al., 2020</a></td>
<td>-</td>
<td><b>96.7</b></td>
<td>-</td>
</tr>
<tr>
<td>Bootleg</td>
<td>96.9</td>
<td><b>96.7</b></td>
<td>96.8</td>
</tr>
</tbody>
</table>
</div></figure>
</p>
<p>We’ll now show how the entity knowledge encoded in Bootleg’s entity representations can transfer to non-NED tasks. We extract our entity representations and use them in both a production task at a major technology company and a relation extraction task. We find that the use of Bootleg embeddings in the production task provides an 8% lift in performance and even improves quality for Spanish, French, and German. We repeat this experiment by adding Bootleg representations to a SotA model for the <a href="https://arxiv.org/pdf/2004.14855.pdf">TACRED</a> relation extraction task (see <a href="https://github.com/HazyResearch/bootleg/tree/master/tutorials/downstream_tutorial">tutorial</a>). We find this Bootleg-enhanced model sets a new SotA by 1 F1 point.</p>
<p>
<figure class="figure"><div class="figure__main">
<table>
<thead>
<tr>
<th>Model</th>
<th>TACRED F1</th>
</tr>
</thead>
<tbody>
<tr>
<td>Bootleg-Enhanced</td>
<td><b>80.3</b></td>
</tr>
<tr>
<td><a href="https://arxiv.org/pdf/1909.04164.pdf">KnowBERT</a></td>
<td>79.3</td>
</tr>
<tr>
<td><a href="https://arxiv.org/pdf/1907.10529.pdf">SpanBERT</a></td>
<td>78.0</td>
</tr>
</tbody>
</table>
</div></figure>
</p>
<p>These results suggest that Bootleg entity representations can transfer entity knowledge to other language tasks!</p>
<h1 id="recap">Recap</h1>
<p>To recap, we described the problem of the tail of NED and showed that existing NED systems fall short at disambiguating these rare, yet important entities. We then introduced four reasoning patterns for NED and described how we trained Bootleg to learn these patterns through the use of embeddings and Transformer modules. We finally showed that Bootleg is a SotA NED system that better disambiguates rare entities than prior methods. Further, Bootleg learns representations that can transfer entity knowledge to non-NED tasks.</p>
<p>We are actively developing Bootleg and would love to hear your thoughts. See our <a href="http://hazyresearch.stanford.edu/bootleg/">website</a>, <a href="https://github.com/HazyResearch/bootleg">source code</a>, and <a href="https://arxiv.org/pdf/2010.10363.pdf">paper</a>.</p>
Thu, 12 Nov 2020 00:00:00 -0800Measuring Bias in NLP (with Confidence!)
/blog/bias-nlp/
/blog/bias-nlp/<p>Countless studies have found that “bias” – typically with respect to race and gender – pervades the <a href="https://arxiv.org/abs/1904.03310">embeddings</a> and <a href="https://arxiv.org/abs/1804.09301">predictions</a> of the black-box models that dominate natural language processing (NLP). For example, the language model <a href="https://en.wikipedia.org/wiki/GPT-3">GPT-3</a>, of OpenAI fame, can generate <a href="https://www.technologyreview.com/2020/10/23/1011116/chatbot-gpt3-openai-facebook-google-safety-fix-racist-sexist-language-ai/">racist rants</a> when given the right prompt. Attempts to detect hate speech can themselves harm minority populations, <a href="https://www.aclweb.org/anthology/P19-1163.pdf">whose dialect is more likely to be flagged as hateful</a>.</p>
<p>This, in turn, has led to a wave of work on how to “<a href="http://papers.nips.cc/paper/6228-man-is-to-computer-programmer-as-woman-is-to-homemaker-d">debias</a>” models, only for others to find ways in which debiased models <a href="https://arxiv.org/abs/1903.03862">are still biased</a>, and so on.</p>
<p>But are these claims of NLP models being biased (or unbiased) being made with enough evidence?</p>
<p>Consider the sentence <em>“The doctor gave instructions to the nurse before she left.”</em> A <a href="https://en.wikipedia.org/wiki/Coreference#Coreference_resolution">co-reference resolution system</a>, tasked with finding which person the pronoun “she” is referring to<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>, may incorrectly predict that it’s the nurse. Does this incorrect prediction – which conforms to gender stereotypes that doctors are usually male – mean that the system is gender-biased? Possibly – but it may also make mistakes in the other direction with equal frequency (e.g., thinking “he” refers to a nurse when it doesn’t). What if the system makes gender-stereotypical mistakes on not one sentence, but 100, or 1000? Then we could be more confident in claiming that it’s biased.</p>
<p>In my ACL 2020 paper, “<a href="https://www.aclweb.org/anthology/2020.acl-main.262/">Measuring Fairness under Uncertainty with Bernstein Bounds</a>”, I go over how, in the haste to claim the presence or absence of bias, the inherent uncertainty in measuring bias is often overlooked in the literature:</p>
<ul>
<li>
<p><strong>Bias is not a single number</strong>. When we test how biased a model is, we are <em>estimating</em> its bias on a sample of the data; our estimate may suggest that the model is biased or unbiased, but the opposite could still be true.</p>
</li>
<li>
<p><strong>This uncertainty can be captured using confidence intervals.</strong> Instead of reporting a single number for bias, practitioners should report an interval, based on factors such as the desired confidence and the proposed definition of “bias”.</p>
</li>
<li>
<p><strong>Existing datasets are too small to conclusively identify bias.</strong> Existing datasets for measuring specific biases can only be used to make 95% confidence claims when the bias estimate is egregiously high; to catch more subtle bias, the NLP community needs bigger datasets.</p>
</li>
</ul>
<p>Although this problem can exist with any kind of model, we focus on a remedy for classification models in particular.</p>
<h3 id="bernstein-bounded-unfairness">Bernstein-Bounded Unfairness</h3>
<p>A bias estimate, made using a small sample of data, likely differs from the true bias (i.e., at the population-level). How can we express our uncertainty about the estimate? We propose a method called Bernstein-bounded unfairness that translates this uncertainty into a confidence interval<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
<p>Let’s say we want to measure whether some <a href="https://en.wikipedia.org/wiki/Protected_group">protected group</a> <script type="math/tex">A</script> – a group that is legally protected due to an attribute such as race or gender – is being discriminated against by some classifier, relative to some unprotected group <script type="math/tex">B</script>. They occur in the population with frequencies <script type="math/tex">\gamma_A</script> and <script type="math/tex">\gamma_B</script> respectively. We need</p>
<ul>
<li>
<p>An annotation function <script type="math/tex">f</script> that maps each example <script type="math/tex">x</script> to <script type="math/tex">A, B,</script> or neither. Note that the annotation function maps inputs to the protected/unprotected groups, not to the output space <script type="math/tex">Y</script>. For example, if we wanted to study how a sentiment classifier performed across different racial groups, then the inputs <script type="math/tex">x</script> would be sentences, labels <script type="math/tex">y</script> would be the sentiment, and the annotation function <script type="math/tex">f</script> might map <script type="math/tex">x</script> to {white, non-white} depending on the racial group of the sentence author.</p>
</li>
<li>
<p>A cost function <script type="math/tex">c : (y, \hat{y}) \to [0,C]</script> that describes the cost of incorrectly predicting <script type="math/tex">\hat{y}</script> when the true label is <script type="math/tex">y</script>, where <script type="math/tex">C</script> is the maximum possible cost. Since a model making an incorrect prediction for <script type="math/tex">x</script> is an undesirable outcome for the group that <script type="math/tex">x</script> belongs to, we frame this as a cost that must be borne by the group.</p>
</li>
</ul>
<p>We want to choose these functions such that our bias metric of choice – which we call the <em>groupwise disparity</em> <script type="math/tex">\delta(f,c)</script> – can be expressed as the difference in expected cost borne by the protected and unprotected groups. Given a model that makes predictions <script type="math/tex">\hat{y}_a</script> for protected <script type="math/tex">x_a \in A</script> and <script type="math/tex">\hat{y}_b</script> for unprotected <script type="math/tex">x_b \in B</script>, we want to express the bias as:</p>
<script type="math/tex; mode=display">\delta(f,c) = \mathbb{E}_a[c(y_a, \hat{y}_a)] - \mathbb{E}_b[c(y_b, \hat{y}_b)]</script>
<p>If the protected group is incurring higher costs in expectation, it is being biased against. For example, if we want to determine whether a classifier is more accurate on the unprotected group <script type="math/tex">B</script>, then we would set the cost function to be the 1-0 loss (1 for an incorrect prediction, 0 for a correct one). If <script type="math/tex">B</script> has a lower cost on average than <script type="math/tex">A</script>, then it would mean that the classifier is more accurate on <script type="math/tex">B</script>.</p>
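<p>To make the definition concrete, here is a minimal sketch that estimates the groupwise disparity on a toy sample as a plain difference of per-group mean costs under the 1-0 loss. The sample, the function names, and the use of unweighted group means are assumptions made for illustration; the paper works with an amortized form of the disparity that this sketch does not reproduce.</p>

```python
def zero_one_cost(y, y_hat):
    """1-0 loss: 1 for an incorrect prediction, 0 for a correct one."""
    return 0.0 if y == y_hat else 1.0


def groupwise_disparity(examples, cost):
    """Mean cost on group A minus mean cost on group B.

    `examples` is an iterable of (group, y_true, y_pred) triples, where
    `group` is the output of the annotation function ("A", "B", or other).
    """
    costs = {"A": [], "B": []}
    for group, y, y_hat in examples:
        if group in costs:  # examples mapped to neither group are ignored
            costs[group].append(cost(y, y_hat))
    mean_a = sum(costs["A"]) / len(costs["A"])
    mean_b = sum(costs["B"]) / len(costs["B"])
    return mean_a - mean_b


# Hypothetical toy sample: two of four errors on A, one of four on B.
sample = [
    ("A", 1, 0), ("A", 1, 1), ("A", 0, 1), ("A", 0, 0),
    ("B", 1, 1), ("B", 0, 0), ("B", 1, 0), ("B", 0, 0),
]
delta_hat = groupwise_disparity(sample, zero_one_cost)
```

<p>A positive value means the protected group bears a higher average cost on this sample; with only eight examples, such an estimate says very little about the true disparity.</p>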
<p>For a desired confidence level <script type="math/tex">\rho \in [0,1)</script>, a dataset of <script type="math/tex">n</script> examples, and the variance <script type="math/tex">\sigma^2</script> of the amortized groupwise disparity across examples, the half-width <script type="math/tex">t</script> of the confidence interval would be as follows (in this formula, <script type="math/tex">B</script> denotes a constant rather than the unprotected group):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
t &= \frac{B + \sqrt{B^2 - 8 n \sigma^2 \log \left[\frac{1}{2} (1 - \rho) \right]}}{2n} \\
\text{where } B &= -\frac{2 C}{3 \gamma} \log \left[ \frac{1}{2} (1 - \rho) \right], \gamma = \min(\gamma_A, \gamma_B)
\end{aligned} %]]></script>
<p>If we set <script type="math/tex">\rho = 0.95</script>, we could claim with 95% confidence that the true bias experienced by the protected group lies in the interval <script type="math/tex">[ \hat{\delta} - t, \hat{\delta} + t]</script>, where <script type="math/tex">\hat{\delta}</script> is our bias estimate.</p>
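<p>The formula for <script type="math/tex">t</script> translates directly into a few lines of Python. This is a sketch that transcribes the expression above; the function name and the numerical settings are illustrative assumptions.</p>

```python
import math


def bernstein_half_width(n, sigma_sq, C, gamma, rho):
    """Half-width t of the interval [delta_hat - t, delta_hat + t].

    n        : number of examples
    sigma_sq : variance of the amortized groupwise disparity
    C        : maximum possible cost
    gamma    : min(gamma_A, gamma_B)
    rho      : desired confidence level in [0, 1)
    """
    log_term = math.log(0.5 * (1.0 - rho))  # negative for rho in [0, 1)
    B = -(2.0 * C) / (3.0 * gamma) * log_term
    return (B + math.sqrt(B * B - 8.0 * n * sigma_sq * log_term)) / (2.0 * n)


# Illustrative settings: 500 examples, 1-0 loss, balanced groups,
# worst-case variance, 95% confidence.
t = bernstein_half_width(n=500, sigma_sq=4.0, C=1.0, gamma=0.5, rho=0.95)
```

<p>With these settings the half-width comes out at roughly 0.25, so a sample estimate would have to exceed that magnitude before zero falls outside the interval.</p>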
<h3 id="why-we-need-bigger-datasets">Why We Need Bigger Datasets</h3>
<p>If we want to say with 95% confidence that a classifier is biased <em>to some extent</em> – but want to spend as little time annotating data as possible – we need to find the smallest <script type="math/tex">n</script> such that <script type="math/tex">0 \not\in [ \hat{\delta} - t, \hat{\delta} + t]</script>. We can do this by working backwards from the formula for <script type="math/tex">t</script> given above (see paper for details).</p>
<p>Let’s go back to our original example. Say we want to figure out whether a co-reference resolution system, tasked with matching pronouns to the nouns they refer to, is gender-biased or not. We have a dataset of 500 examples to test whether the model does better on gender-stereotypical examples (e.g., a female nurse) than non-gender-stereotypical examples (e.g., a male nurse). Since we are measuring the difference in accuracy, we set the cost function to be the 1-0 loss.</p>
<p>On this dataset, our bias estimate for a model we’re evaluating is <script type="math/tex">\hat{\delta} = 0.05</script>. Is this enough to claim with 95% confidence that the model is gender-biased?</p>
<p>In this scenario <script type="math/tex">C = 1, \hat{\delta} = 0.05, \rho = 0.95</script>. We assume that there are equally many stereotypical and non-stereotypical examples and that the variance is maximal, so <script type="math/tex">\gamma = 0.5, \sigma^2 = 4</script>.</p>
<p>With these settings, <script type="math/tex">n > 11903</script>; we would need a dataset of more than 11903 examples to claim with 95% confidence that the co-reference resolution system is gender-biased. This is roughly 3.8 times larger than <a href="https://arxiv.org/abs/1804.06876">WinoBias</a>, the largest dataset currently available for this purpose. We could only use WinoBias if <script type="math/tex">\hat{\delta} = 0.0975</script> – that is, if the sample bias were almost twice as high.</p>
<p align="center">
<img src="/blog/assets/img/posts/2020-11-11-bias-nlp/bbu_3.png" style="width: 80%" />
<figcaption>As seen above, the WinoBias dataset cannot be used to make claims of bias with 95% confidence unless the sample bias is egregiously high.</figcaption>
</p>
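<p>The dataset size quoted above can be checked numerically. The sketch below inverts the expression for <script type="math/tex">t</script> by setting <script type="math/tex">t = |\hat{\delta}|</script> and solving for <script type="math/tex">n</script>; this inversion is our own rearrangement of the formula in this post, and the paper's derivation may differ in its details. The function name is hypothetical.</p>

```python
import math


def min_examples(delta_hat, sigma_sq, C, gamma, rho):
    """Dataset size at which the half-width t equals |delta_hat|.

    Obtained by rearranging n*t^2 - B*t + 2*sigma^2*log((1 - rho)/2) = 0,
    the quadratic whose positive root is the formula for t.
    """
    log_term = -math.log(0.5 * (1.0 - rho))  # positive
    t = abs(delta_hat)
    return log_term * (2.0 * sigma_sq + 2.0 * C * t / (3.0 * gamma)) / (t * t)


# Settings from the co-reference example: 1-0 loss, balanced groups,
# worst-case variance, 95% confidence.
n_for_small_bias = min_examples(0.05, sigma_sq=4.0, C=1.0, gamma=0.5, rho=0.95)
n_for_large_bias = min_examples(0.0975, sigma_sq=4.0, C=1.0, gamma=0.5, rho=0.95)
```

<p>The first value lands just under 11,903 and the second at a little over 3,100, in line with the figures quoted in the text. Because the required size grows roughly as one over the squared bias estimate, halving the bias we want to detect roughly quadruples the data we need.</p>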
<h3 id="conclusion">Conclusion</h3>
<p>In the haste to claim the presence or absence of bias in models, the uncertainty in estimating bias is often overlooked in the literature. A model’s bias is often thought of as a single number, even though this number is ultimately an estimate and not the final word on whether the model is or is not biased.</p>
<p>We proposed a method called Bernstein-bounded unfairness for capturing this uncertainty using confidence intervals. To faithfully reflect the range of possible conclusions, we recommend that NLP practitioners measuring bias not only report their bias estimate but also this confidence interval.</p>
<p>What if we want to catch more subtle bias? Although it may be possible to derive tighter confidence intervals, what we really need are larger bias-specific datasets. The datasets we currently have are undoubtedly helpful, but they need to be much larger in order to diagnose biases with confidence.</p>
<h5 id="acknowledgements">Acknowledgements</h5>
<p class="small-text">
Many thanks to Krishnapriya Vishnubhotla, Michelle Lee, and Kaitlyn Zhou for their feedback on this blog post.
</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The goal of coreference resolution more broadly is to find all expressions that refer to the same entity in a text. For example, in “I gave my mother Sally a gift for her birthday.”, the terms “my mother”, “Sally”, and “her” all refer to the same entity. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>We use <a href="https://en.wikipedia.org/wiki/Bernstein_inequalities_(probability_theory)">Bernstein’s inequality</a> to derive the confidence intervals, hence the name Bernstein-bounded unfairness. This inequality tells us with what probability the average of <script type="math/tex">n</script> independent random variables will be within a constant <script type="math/tex">t</script> of their true mean <script type="math/tex">\mu</script>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Wed, 11 Nov 2020 00:00:00 -0800Learning to Fix Programs from Error Messages
/blog/DrRepair/
/blog/DrRepair/<h3 id="machine-learning-for-program-repair"><strong>Machine Learning for Program Repair</strong></h3>
<p>When writing programs, a lot of time is spent debugging or fixing source code errors, both for beginners (imagine the intro programming classes you took) as well as for professional developers (for example, <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42184.pdf">this case study from Google</a> <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>). Automating program repair could dramatically enhance the productivity of both programming and learning programming. In <a href="https://arxiv.org/pdf/2005.10636.pdf">our recent work</a> published at ICML 2020, we study how to use machine learning to repair programs automatically.</p>
<h3 id="problem-setting"><strong>Problem Setting</strong></h3>
<p>Programmers write programs incrementally: write code, compile or execute it, and if there are any errors, repair the program based on the received feedback. Can we model and solve this problem with machine learning?</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-11-08-DrRepair/task.png" /></p>
</div></figure>
<p>Let’s say we have a broken C++ program (figure left), where the <code class="highlighter-rouge">char</code> in line 5 should actually be <code class="highlighter-rouge">string</code>. When we compile it, we get an error (figure top right), which says “line 9 is requesting for size in <code class="highlighter-rouge">a</code> which is of type <code class="highlighter-rouge">char</code>”. From this message, a programmer can notice that the error is related to the type of the variable <code class="highlighter-rouge">a</code>, track how <code class="highlighter-rouge">a</code> has been used or declared in the source code, reaching line 5, and then edit the line to fix the error. Thus, the concrete task we want our machine learning model to solve is, given broken code (figure left) and an error message (figure top right), <strong>localize</strong> the error line (line 5) and <strong>generate a repaired version</strong> of it (“string tmp, a, b;”) (figure bottom right).</p>
<p><strong>Challenges</strong>:
This task poses two main challenges. First, on the modeling side, we need to connect and jointly reason over two modalities, the program and the error message: for instance, tracking variables that caused the error as we saw in the example above. Second, on the training data side, we need an efficient source of data that provides supervision for correcting broken programs; unfortunately, existing labeled datasets with &lt;broken code, fixed code&gt; pairs are small, hard to come by, and don’t scale up. In this work, we introduce promising solutions to those two challenges by: 1) modeling program repair with a program-feedback graph, and 2) introducing a self-supervised training scheme that uses unlabeled programs.</p>
<h3 id="modeling-approach-program-feedback-graph"><strong>Modeling Approach: Program-Feedback Graph</strong></h3>
<p>How can we effectively connect the two modalities (programs and error messages) and perform the reasoning needed for repair? To achieve this, we introduce a program-feedback graph, a joint graph representation that connects symbols across the program and error message. For instance, the compiler message in the example mentions <code class="highlighter-rouge">a</code>, <code class="highlighter-rouge">size</code>, and <code class="highlighter-rouge">char</code>, so we connect these symbols to their occurrences in the source code, to capture semantic correspondence. This way, we treat the two modalities in a shared semantic space rather than separately. We then perform reasoning over the symbols in this space using <a href="https://arxiv.org/abs/1710.10903">graph attention</a> <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p>
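<p>As a rough illustration of the linking step, the sketch below connects the symbols quoted in a compiler message to their occurrences in the source. The toy program, the message text, and the function names are assumptions made for this example; the actual graph construction in DrRepair may differ, and the graph attention step is not shown.</p>

```python
import re


def identifiers(text):
    """Split a line of code into identifier-like tokens."""
    return re.findall(r"[A-Za-z_]\w*", text)


def quoted_symbols(error_message):
    """Symbols that the compiler message puts in single quotes."""
    return set(re.findall(r"'([^']+)'", error_message))


def program_feedback_edges(code_lines, error_message):
    """Return (symbol, line_number, token_index) for every occurrence in
    the code of a symbol mentioned in the error message."""
    symbols = quoted_symbols(error_message)
    edges = []
    for line_no, line in enumerate(code_lines, start=1):
        for idx, tok in enumerate(identifiers(line)):
            if tok in symbols:
                edges.append((tok, line_no, idx))
    return edges


# Hypothetical broken program: `a` is declared as char on line 5
# but used as if it were a string on line 9.
code = [
    "int main() {",
    "  int n;",
    "  int m;",
    "  n = 0;",
    "  char tmp, a, b;",
    "  m = 1;",
    "  tmp = b;",
    "  n = n + m;",
    "  n = a.size();",
    "  return n;",
    "}",
]
message = "line 9: request for member 'size' in 'a', which is of non-class type 'char'"
edges = program_feedback_edges(code, message)
linked_lines = sorted({line_no for _, line_no, _ in edges})
```

<p>In this toy case the edges touch only lines 5 and 9, which are the declaration and the use that a programmer would inspect to locate the bug.</p>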
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-08-DrRepair/graph.png" /></p>
</div></figure>
<p>Specifically, for the model architecture, we build on the encoder-decoder framework commonly used in NLP, which encodes input sequences (in our case, the program and error message; next figure bottom) and then decodes outputs (in our case, the localized line index, and the repaired version of the line; figure top), and we incorporate a graph attention module applied to the program-feedback graph in the intermediate layer of the architecture (figure middle).</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-11-08-DrRepair/model.png" /></p>
</div></figure>
<h3 id="training-approach-self-supervised-learning"><strong>Training Approach: Self-Supervised Learning</strong></h3>
<p>Our second technique is self-supervised learning. Labeled datasets of program repair are small, but there are vast amounts of unlabeled programs available online. For example, GitHub has more than 30M public repositories. Using this large amount of freely available code to improve learning program repair would significantly enhance the scalability and reliability of our system.
Our idea is as follows: we first collect unlabeled, working programs from online resources such as GitHub and codeforces.com (figure left). We then design randomized program corruption procedures (e.g. delete/insert/replace tokens) and corrupt the unlabeled programs (figure middle). As a result, the corrupted programs give us errors (figure right). This way, we can create a lot of new examples of program repair, &lt;broken code, error message, fixed code&gt;. We can use this extra data to pre-train the program repair model, and then fine-tune on the labeled target dataset.</p>
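<p>The corruption step can be sketched as a small function that applies one random token-level edit to a working program. The token vocabulary, the function name, and the uniform choice of operation are assumptions made for illustration; the corruption procedures used in the paper are more elaborate, and the error message would come from actually compiling the corrupted program, which is not done here.</p>

```python
import random

# Hypothetical pool of tokens used for insertions and replacements.
PUNCTUATION = (";", "(", ")", "{", "}", ",")


def corrupt(tokens, rng, vocab=PUNCTUATION):
    """Apply one random delete/insert/replace edit to a token list.

    Returns (operation, corrupted_tokens); the input list is not modified.
    """
    tokens = list(tokens)
    op = rng.choice(["delete", "insert", "replace"])
    pos = rng.randrange(len(tokens))
    if op == "delete":
        del tokens[pos]
    elif op == "insert":
        tokens.insert(pos, rng.choice(vocab))
    else:
        # Never replace a token with itself, so the program always changes.
        choices = [v for v in vocab if v != tokens[pos]]
        tokens[pos] = rng.choice(choices)
    return op, tokens


rng = random.Random(0)
fixed = "int i = 0 ;".split()
op, broken = corrupt(fixed, rng)

# One self-supervised training pair: the model sees `broken` (plus the
# compiler message it triggers) and is trained to output `fixed`.
training_pair = (broken, fixed)
```

<p>Because the original program is known to work, the fixed side of each pair comes for free, which is what lets this scheme scale to large amounts of unlabeled code.</p>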
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-11-08-DrRepair/self-supervised.png" /></p>
</div></figure>
<h3 id="lets-use-our-program-repair-model"><strong>Let’s use our program repair model!</strong></h3>
<p>We apply and evaluate our repair model (which we call DrRepair) on two benchmark tasks:</p>
<ul>
<li>Correcting C programs written by students (<a href="https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14603">DeepFix dataset</a><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>)</li>
<li>Correcting the output of C++ program synthesis (<a href="https://arxiv.org/abs/1906.04908">SPoC dataset</a><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>)</li>
</ul>
<p><strong>Application to DeepFix (Correcting Student Programs)</strong></p>
<p>In DeepFix, the task is to correct C programs written by students in an intro programming class so that they will compile. The input programs may have multiple lines with errors, so we apply the repair model iteratively, addressing one error at a time. For instance, the following figure shows an example program in DeepFix, which has a compiler error saying that “<code class="highlighter-rouge">i</code> is undeclared”. DrRepair repairs this error by inserting a declaration of <code class="highlighter-rouge">i</code> in line 5. After this fix, we notice that there is another error, which says “expected semicolon before brace”. We can apply the repair model again - this time, the model inserts a semicolon in line 12, and now the repaired program compiles successfully! This approach is conducive to the idea of iterative refinement: we can keep running the repair model and progressively fixing errors.</p>
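<p>The iterative procedure can be sketched as a simple loop around a compiler and a repair model. Both are passed in as callables; <code class="highlighter-rouge">toy_compile</code> and <code class="highlighter-rouge">toy_repair</code> below are hypothetical stand-ins that only know about missing semicolons, and the step limit is an assumed safeguard rather than a detail taken from the paper.</p>

```python
def iterative_repair(lines, compile_fn, repair_fn, max_steps=5):
    """Repeatedly compile and repair one line at a time.

    compile_fn(lines) returns None on success or an error object.
    repair_fn(lines, error) returns (line_index, repaired_line).
    Returns (final_lines, compiled_successfully).
    """
    lines = list(lines)
    for _ in range(max_steps):
        error = compile_fn(lines)
        if error is None:
            return lines, True
        line_index, new_line = repair_fn(lines, error)
        lines[line_index] = new_line
    return lines, compile_fn(lines) is None


def toy_compile(lines):
    """Stand-in compiler: report the first statement lacking a terminator."""
    for i, line in enumerate(lines):
        stripped = line.rstrip()
        if stripped and not stripped.endswith((";", "{", "}")):
            return (i, "expected ';'")
    return None


def toy_repair(lines, error):
    """Stand-in repair model: append the missing semicolon."""
    i, _message = error
    return i, lines[i].rstrip() + ";"


program = ["int main() {", "  int i = 0", "  i = i + 1", "  return i;", "}"]
repaired, ok = iterative_repair(program, toy_compile, toy_repair)
```

<p>Each pass fixes the first reported error and recompiles, so a program with two broken lines is repaired in two passes, mirroring the one-error-at-a-time procedure described above.</p>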
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-11-08-DrRepair/application_deepfix.png" /></p>
</div></figure>
<p><strong>What is the effect of using error messages, program-feedback graphs, and self-supervised pre-training?</strong> Existing repair systems studied on DeepFix did not use compiler error messages - they aimed to directly translate from broken code to fixed code. To see the effect of using error messages in the first place, we tried removing all our techniques from the system: the use of compiler messages, program-feedback graphs, and pre-training. This version of our model (“ours: no compiler” in the figure below) achieves 34% repair accuracy on DeepFix, which is comparable to the existing systems. Now we add compiler messages to our input. We find that this model achieves much better performance and generalization (62.5% accuracy; “ours: base” in the figure). This suggests that with access to error messages, the model learns the right inductive bias to repair the code based on the feedback. Next, we add program-feedback graphs and self-supervised pre-training. We find that both provide further improvements (“ours: base+graph” and “ours: base+graph+pretrain”), and our final system can fix 68.2% of the broken programs in DeepFix!</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage_75" src="/blog/assets/img/posts/2020-11-08-DrRepair/result_deepfix.png" /></p>
</div></figure>
<p><strong>Application to SPoC (Natural Language to Code)</strong></p>
<p>Program synthesis systems, in particular those that translate natural language descriptions (e.g. English) into code (e.g. Python, C++), are useful because they can help a wider range of people use programming languages. In SPoC (Pseudocode-to-Code), the task is to synthesize a C++ implementation from pseudocode, a natural language description of a program. However, one challenge experienced by existing synthesizers (machine translation models applied to SPoC) is that they tend to output inconsistent code that does not compile - for instance, in the figure below, the variable <code class="highlighter-rouge">i</code> is declared twice in the synthesized code. We find that we can apply our program repair model to this invalid code and fix it, helping the program synthesis task. In the evaluation on SPoC, the use of our repair model improves the final synthesis success rate from the existing system’s 34% to 37.6%.</p>
<figure class="figure"><div class="figure__main">
<p><img class="postimage" src="/blog/assets/img/posts/2020-11-08-DrRepair/application_spoc.png" /></p>
</div></figure>
<h3 id="conclusion"><strong>Conclusion</strong></h3>
<p>In this work, we studied how to use machine learning to repair programs from error messages, and developed three key insights:</p>
<ol>
<li>Error messages provide a crucial signal for learning program repair.</li>
<li>Program-feedback graphs (joint representations of code &amp; error messages) help model the reasoning of repair (e.g. tracking variables that caused the error).</li>
<li>Self-supervised learning allows us to turn freely-available, unlabeled programs (e.g. GitHub code) into useful training examples of program repair.</li>
</ol>
<p>This work also provides a general framework of “learning from feedback”, which has various applications: editing documents based on comments, learning from users in interactive dialog, etc.</p>
<p>You can check out our full paper (ICML 2020) <a href="https://arxiv.org/pdf/2005.10636.pdf">here</a> and our source code/data on <a href="https://github.com/michiyasunaga/DrRepair">GitHub</a>. You can also find the presentation slides on this work <a href="https://cs.stanford.edu/~myasu/files/DrRepair_slides.pdf">here</a>. If you have questions, please feel free to email us!</p>
<ul>
<li>Michihiro Yasunaga: <a href="mailto:myasu@cs.stanford.edu">myasu@cs.stanford.edu</a></li>
</ul>
<h3 id="acknowledgments"><strong>Acknowledgments</strong></h3>
<p>Many thanks to Percy Liang, as well as members of the P-Lambda lab and the Stanford NLP group for their valuable feedback, and to Sidd Karamcheti and Andrey Kurenkov for edits on this blog post!</p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p><a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42184.pdf">Programmers’ Build Errors: A Case Study (at Google)</a>. Hyunmin Seo, Caitlin Sadowski, Sebastian Elbaum, Edward Aftandilian, Robert Bowdidge. 2014 <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p><a href="https://arxiv.org/abs/1710.10903">Graph Attention Networks</a>. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, Yoshua Bengio. 2018. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p><a href="https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14603">DeepFix: Fixing common C language errors by deep learning</a>. Rahul Gupta, Soham Pal, Aditya Kanade, Shirish Shevade. 2017. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p><a href="https://arxiv.org/abs/1906.04908">SPoC: Search-based Pseudocode to Code</a>. Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken and Percy Liang. 2019. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Sun, 08 Nov 2020 00:00:00 -0800Adapting on the Fly to Test Time Distribution Shift
/blog/adaptive-risk-minimization/
/blog/adaptive-risk-minimization/<p>Imagine that you are building the next generation machine learning model for handwriting transcription. Based on previous iterations of your product, you have identified a key challenge for this rollout: after deployment, new end users often have different and unseen handwriting styles, leading to <em>distribution shift</em>. One solution for this challenge is to learn an <em>adaptive</em> model that can specialize and adjust to each user’s handwriting style over time. This solution seems promising, but it must be balanced against concerns about ease of use: requiring users to provide feedback to the model may be cumbersome and hinder adoption. Is it possible instead to learn a model that can adapt to new users <em>without labels</em>?</p>
<p>In many scenarios, including this example, the answer is “yes”. Consider the ambiguous example shown enlarged in the figure below. Is this character a “2” with a loop or a <a href="https://en.wikipedia.org/wiki/A#English">double-storey “a”</a>? For a non-adaptive model that pays attention to the biases in the training data, the reasonable prediction would be “2”. However, even without labels, we can extract useful information from the user’s other examples: an adaptive model, for example, can observe that this user has written “2”s without loops and conclude that this character is thus more likely to be “a”.</p>
<figure class="figure"><div class="figure__main">
<p><img width="100%" src="/blog/assets/img/posts/2020-11-05-adaptive-risk-minimization/intro.gif" /></p>
</div></figure>
<p>Handling the distribution shift that arises from deploying a model to new users is an important motivating example for unlabeled adaptation, but it is far from the only one. In an ever-changing world, autonomous cars need to adapt to new weather conditions and locations, image classifiers need to adapt to new cameras with different intrinsics, and recommender systems need to adapt to users’ evolving preferences. Humans have demonstrated the ability to <a href="http://pages.cs.wisc.edu/~jerryzhu/pub/tie.pdf">adapt without labels</a> by inferring information from the distribution of test examples. Can we develop methods that allow machine learning models to do the same?</p>
<p>This question has enjoyed growing attention from researchers, with a number of recent works proposing methods for unlabeled test time adaptation. In this post, I will survey these works as well as other prominent frameworks for handling distribution shift. With this broader context in mind, I will then discuss our recent work (see the paper <a href="https://arxiv.org/abs/2007.02931">here</a> and the code <a href="https://github.com/henrikmarklund/arm">here</a>), in which we propose a problem formulation that we term <strong>adaptive risk minimization</strong>, or ARM.</p>
<h2 id="diving-into-distribution-shift">Diving into Distribution Shift</h2>
<p>The vast majority of work in machine learning follows the canonical framework of <strong>empirical risk minimization</strong>, or ERM. ERM methods assume that there is no distribution shift, so the test distribution exactly matches the training distribution. This assumption simplifies the development and analysis of powerful machine learning methods but, as discussed above, is routinely violated in real-world applications. To move beyond ERM and learn models that generalize in the face of distribution shift, we must introduce additional assumptions. However, we must carefully choose these assumptions such that they are still realistic and broadly applicable.</p>
<p>How do we maintain realism and applicability? One answer is to model the assumptions on the conditions that machine learning systems face in the real world. For example, in the ERM setting, models are evaluated on each test point one at a time, but in the real world, these test points are often available sequentially or in <em>batches</em>. For handwriting transcription, for example, we can imagine collecting entire sentences and paragraphs from new users. If there is distribution shift, observing multiple test points can be useful either to infer the test distribution or otherwise adapt the model to this new distribution, even in the absence of labels.</p>
<p>Many recent methods that use this assumption can be classified as <strong>test time adaptation</strong>, including <a href="https://arxiv.org/abs/1603.04779">batch normalization</a>, <a href="https://arxiv.org/abs/1802.03916">label shift estimation</a>, <a href="https://arxiv.org/abs/1909.13231">rotation prediction</a>, <a href="https://arxiv.org/abs/2006.10726">entropy minimization</a>, and more. Oftentimes, these methods build in strong inductive biases that enable useful adaptation; for example, rotation prediction is well aligned with many image classification tasks. But these methods generally either propose heuristic training procedures or do not consider the training procedure at all, relying instead on pretrained models.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> This raises the question: can test time adaptation be further enhanced by improved training, such that the model can make better use of the adaptation procedure?</p>
<p>We can gain insight into this question by investigating other prominent frameworks for handling distribution shift and, in particular, the assumptions these frameworks make. In real-world applications, the training data generally does not consist only of input-label pairs; instead, there are additional <em>meta-data</em> associated with each example, such as time and location, or the particular user in the handwriting example. These meta-data can be used to organize the training data into <em>groups</em>,<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> and a common assumption in a number of frameworks is that the test time distribution shifts represent either new group distributions or new groups altogether. This assumption still allows for a wide range of realistic distribution shifts and has driven the development of numerous practical methods.</p>
<p>For example, <strong>domain adaptation</strong> methods typically assume access to two training groups: source and target data, with the latter being drawn from the test distribution. Thus, these methods augment training to focus on the target distribution, such as through <a href="https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.370.4921&rep=rep1&type=pdf">importance</a> <a href="http://sifaka.cs.uiuc.edu/czhai/pub/acl07.pdf">weighting</a> or learning <a href="https://arxiv.org/abs/1505.07818">invariant</a> <a href="https://arxiv.org/abs/1702.05464">representations</a>. Methods for <a href="http://papers.neurips.cc/paper/3019-mixture-regression-for-covariate-shift.pdf"><strong>group</strong></a> <a href="https://arxiv.org/abs/1611.02041"><strong>distributionally robust</strong></a> <a href="https://arxiv.org/abs/1911.08731"><strong>optimization</strong></a> and <a href="https://papers.nips.cc/paper/4312-generalizing-from-several-related-classification-tasks-to-a-new-unlabeled-sample"><strong>domain</strong></a> <a href="https://arxiv.org/abs/2007.01434"><strong>generalization</strong></a> do not directly assume access to data from the test distribution, but instead use data drawn from multiple training groups in order to learn a model that generalizes at test time to new groups (or new group distributions). So, these prior works have largely focused on the training procedure and generally do not adapt at test time (despite the name “domain adaptation”).</p>
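<p>To make the idea of importance weighting concrete, here is a minimal sketch in which each training loss is reweighted by the ratio of target to source probability. The function name <code>importance_weighted_risk</code> and the toy numbers are assumptions made for illustration, and the sketch assumes the two probabilities are known exactly, whereas practical methods have to estimate the ratio.</p>

```python
def importance_weighted_risk(losses, source_probs, target_probs):
    """Average per-example losses, reweighted by target/source probability ratios.

    Each training example i was drawn from the source distribution with
    probability source_probs[i]; target_probs[i] is its probability under the
    target distribution. Both are assumed known here, which is rarely true in
    practice, where the ratio has to be estimated.
    """
    weights = [t / s for s, t in zip(source_probs, target_probs)]
    weighted = [w * loss for w, loss in zip(weights, losses)]
    return sum(weighted) / len(losses)


# Toy example with two kinds of inputs. The source data is mostly the first
# kind, while the target distribution is mostly the second kind.
losses = [0.1, 0.1, 0.1, 0.9]            # the model does poorly on the rare kind
source_probs = [0.75, 0.75, 0.75, 0.25]  # source probability of each example's kind
target_probs = [0.25, 0.25, 0.25, 0.75]  # target probability of each example's kind

unweighted = sum(losses) / len(losses)
reweighted = importance_weighted_risk(losses, source_probs, target_probs)
print(unweighted, reweighted)
```

<p>In this toy setting, the reweighted average puts most of its mass on the kind of input that dominates the target distribution, so it reflects the model's weakness there far more than the plain average does.</p>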
<h2 id="combining-training-and-test-assumptions">Combining Training and Test Assumptions</h2>
<p>Prior frameworks for distribution shift have assumed either training groups or test batches, but we are not aware of any prior work that uses both assumptions. In our work, we demonstrate that it is precisely this conjunction that allows us to <em>learn to adapt</em> to test time distribution shift, by simulating both the shift and the adaptation procedure at training time. In this way, our framework can be understood as a <strong>meta-learning</strong> framework, and we refer interested readers to this <a href="https://bair.berkeley.edu/blog/2017/07/18/learning-to-learn/">blog post</a> for a detailed overview of meta-learning.</p>
<h3 id="adaptive-risk-minimization">Adaptive Risk Minimization</h3>
<p>Our work proposes <a href="https://arxiv.org/abs/2007.02931">adaptive risk minimization</a>, or ARM, which is a problem setting and objective that makes use of both groups at training time and batches at test time. This synthesis provides a general and principled answer, through the lens of meta-learning, to the question of how to train for test time adaptation. In particular, we <em>meta-train</em> the model using simulated distribution shifts, which is enabled by the training groups, such that it exhibits strong <em>post-adaptation</em> performance on each shift. The model therefore directly learns how to best leverage the adaptation procedure, which it then executes in the exact same way at test time. If we can identify which test distribution shifts are likely, such as seeing data from new end users, then we can better construct simulated training shifts, such as sampling data from only one particular training user.</p>
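<p>In rough notation, introduced here only as a sketch (the paper gives the precise statement), let <script type="math/tex">z</script> index a group, let <script type="math/tex">g(\cdot\,; \theta)</script> be the prediction model, and let <script type="math/tex">h(\cdot\,; \phi)</script> be the adaptation model that maps the parameters <script type="math/tex">\theta</script> and a batch of <script type="math/tex">K</script> unlabeled inputs to adapted parameters <script type="math/tex">\theta'</script>. The objective can then be summarized as</p>
<script type="math/tex; mode=display">\min_{\theta, \phi} \; \mathbb{E}_{z \sim p(z)} \, \mathbb{E}_{(x_k, y_k)_{k=1}^{K} \sim p(x, y \mid z)} \left[ \frac{1}{K} \sum_{k=1}^{K} \ell\big(g(x_k; \theta'), y_k\big) \right], \qquad \theta' = h(\theta, x_1, \ldots, x_K; \phi),</script>
<p>where <script type="math/tex">\ell</script> is the per-example loss. The labels <script type="math/tex">y_k</script> enter only through the loss and never through <script type="math/tex">h</script>, which is what makes the adaptation unlabeled.</p>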
<figure class="figure"><div class="figure__main">
<p><img width="100%" src="/blog/assets/img/posts/2020-11-05-adaptive-risk-minimization/arm.gif" /></p>
</div></figure>
<p>The training procedure for optimizing the ARM objective is illustrated in the graphic above. From the training data, we sample different batches that simulate different group distribution shifts. An <em>adaptation model</em> then has the opportunity to adapt the model parameters using the unlabeled examples. This allows us to meta-train the model for post-adaptation performance by directly performing gradient updates on both the model and the adaptation model.</p>
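<p>The batch sampling step can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the released implementation: the function name <code>sample_group_batches</code>, the uniform choice of group, and sampling with replacement are all assumptions made here.</p>

```python
import random
from collections import defaultdict


def sample_group_batches(group_ids, num_batches, batch_size, rng):
    """Sample batches of example indices, each drawn from a single group.

    Every batch simulates one distribution shift: the model only sees inputs
    from one training group, as it would when deployed to one new user.
    """
    indices_by_group = defaultdict(list)
    for index, group in enumerate(group_ids):
        indices_by_group[group].append(index)
    groups = sorted(indices_by_group)

    batches = []
    for _ in range(num_batches):
        # Assumption: groups are chosen uniformly at random.
        group = rng.choice(groups)
        # Sample with replacement so that small groups can still fill a batch.
        batch = rng.choices(indices_by_group[group], k=batch_size)
        batches.append((group, batch))
    return batches


# Toy meta-data: the user who wrote each of eight training examples.
group_ids = ["u1", "u1", "u2", "u2", "u2", "u3", "u3", "u3"]
batches = sample_group_batches(
    group_ids, num_batches=4, batch_size=3, rng=random.Random(0)
)

for group, batch in batches:
    # In a full training step (not shown), the adaptation model would use the
    # unlabeled inputs of `batch` to adapt the model, and the labeled loss
    # after adaptation would drive gradient updates on both sets of parameters.
    print(group, batch)
```

<p>Because every batch contains examples from only one group, the post-adaptation loss on that batch measures how well the model copes with that particular simulated shift.</p>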
<figure class="figure"><div class="figure__main">
<p><img width="100%" src="/blog/assets/img/posts/2020-11-05-adaptive-risk-minimization/methods.png" /></p>
<figcaption>
We draw inspiration from contextual meta-learning (left) and gradient-based meta-learning (right) in order to devise methods for ARM. We investigate two different methods that fall under contextual meta-learning. These methods are described in detail in <a href="https://arxiv.org/abs/2007.02931">our paper</a>.
</figcaption>
</div></figure>
<p>The connection to meta-learning is one key advantage of the ARM framework, as we are not starting from scratch when devising methods for solving ARM. In our work in particular, we draw inspiration from both <a href="https://arxiv.org/abs/1807.01613">contextual meta-learning</a> and <a href="https://arxiv.org/abs/1703.03400">gradient-based meta-learning</a> to develop three methods for solving ARM, which we name ARM-CML, ARM-BN, and ARM-LL. We omit the details of these methods here, but they are illustrated in the figure above and described in full in <a href="https://arxiv.org/abs/2007.02931">our paper</a>.</p>
<p>The diversity of methods that we construct demonstrates the versatility and generality of the ARM problem formulation. But do we actually observe empirical gains using these methods? We investigate this question next.</p>
<h3 id="experiments">Experiments</h3>
<p>In our experiments, we first conducted a thorough study of the proposed ARM methods compared to various baselines, prior methods, and ablations, on four different image classification benchmarks exhibiting group distribution shift. <a href="https://arxiv.org/abs/2007.02931">Our paper</a> provides full details on the benchmarks and comparisons.</p>
<figure class="figure"><div class="figure__main">
<p><img width="100%" src="/blog/assets/img/posts/2020-11-05-adaptive-risk-minimization/results.png" /></p>
<figcaption>
We found that ARM methods empirically resulted in both better worst case (WC) and average (Avg) performance across groups compared to prior methods, indicating both better robustness and performance from the final trained models.
</figcaption>
</div></figure>
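<p>For readers unfamiliar with these two metrics, the sketch below computes them from per-group accuracies. It assumes that the average weights every group equally, and the function name <code>group_accuracies</code> and the toy predictions are made up for illustration.</p>

```python
from collections import defaultdict


def group_accuracies(predictions, labels, group_ids):
    """Return a dict mapping each group to its accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions, labels, group_ids):
        total[group] += 1
        correct[group] += int(pred == label)
    return {group: correct[group] / total[group] for group in total}


# Toy predictions for two hypothetical test users.
predictions = ["a", "2", "a", "a", "2", "2"]
labels = ["a", "2", "a", "2", "2", "a"]
group_ids = ["u1", "u1", "u1", "u2", "u2", "u2"]

per_group = group_accuracies(predictions, labels, group_ids)
worst_case = min(per_group.values())                 # accuracy on the hardest group
average = sum(per_group.values()) / len(per_group)   # groups weighted equally
print(per_group, worst_case, average)
```

<p>A model can score well on average while failing badly on one group, which is why the two numbers are reported side by side.</p>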
<p>In our main study, we found that ARM methods do better across the board, in terms of both worst case and average test performance across groups, compared to a number of prior methods along with other baselines and ablations. The simplest method, ARM-BN, which can be implemented in just a few lines of additional code, often performed the best. This empirically shows the benefits of meta-learning, in that the model can be meta-trained to take greater advantage of the adaptation procedure.</p>
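<p>To give a sense of why so little code is needed for batch normalization based adaptation, the sketch below shows only the normalization step: features are standardized with the statistics of the current unlabeled batch rather than with statistics stored from training. It is a simplified stand-in and not the released implementation; the function name <code>normalize_with_batch_stats</code> and the toy numbers are assumptions.</p>

```python
import math


def normalize_with_batch_stats(batch, eps=1e-5):
    """Standardize each feature using the mean and variance of this batch.

    `batch` is a list of equal-length feature vectors (lists of floats).
    No labels are needed: only the inputs of the batch are used.
    """
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dim)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dim)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dim)]
        for row in batch
    ]


# Hypothetical test batch from a new user whose features are shifted and
# scaled relative to the training data.
shifted_batch = [[10.0, 100.0], [12.0, 140.0], [14.0, 180.0]]
normalized = normalize_with_batch_stats(shifted_batch)
print(normalized)
```

<p>Whatever offset and scale a particular user introduces, the normalized features are centered at zero with roughly unit variance, so later layers see inputs in a familiar range.</p>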
<figure class="figure"><div class="figure__main">
<p><img width="75%" src="/blog/assets/img/posts/2020-11-05-adaptive-risk-minimization/femnist.gif" /></p>
</div></figure>
<p>We also conducted some qualitative analyses, in which we investigated a test situation similar to the motivating example described at the beginning, with a user that wrote double-storey “a”s. We empirically found that models trained with ARM methods did in fact successfully adapt and predict “a” in this situation, when given enough examples of the user’s handwriting that included other “a”s and “2”s. This supports our original hypothesis that training adaptive models is an effective way to deal with distribution shift.</p>
<p>We believe that the motivating example from the beginning as well as the empirical results in our paper convincingly argue for further study into general techniques for <em>adaptive models</em>. We have presented a general scheme for meta-training these models to better harness their adaptation capabilities, but a number of open questions remain, such as devising better adaptation procedures themselves. This broad research direction will be crucial for machine learning models to truly realize their potential in complex, real-world environments.</p>
<hr />
<p>Thanks to Chelsea Finn and Sergey Levine for providing valuable feedback on this post.
This blog post also appeared on the <a href="https://bair.berkeley.edu/blog/2020/11/05/arm/">Berkeley AI Research Blog</a>.</p>
<p>Part of this post is based on the following paper:</p>
<p>Marvin Zhang*, Henrik Marklund*, Nikita Dhawan*, Abhishek Gupta, Sergey Levine, Chelsea Finn.
<a href="https://arxiv.org/abs/2007.02931"><strong>Adaptive Risk Minimization: A Meta-Learning Approach for Tackling Group Shift.</strong></a>
<a href="https://sites.google.com/view/adaptive-risk-minimization">Project webpage</a>
<a href="https://github.com/henrikmarklund/arm">Open source code</a></p>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>On the flip side, applicability to even pretrained models can be seen as a strength of these methods. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Alternatively referred to as domains, subpopulations, tasks, users, and more. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Thu, 05 Nov 2020 00:00:00 -0800