Introduction
Machine learning models are susceptible to learning irrelevant patterns. In other words, they rely on some spurious features that we humans know to avoid. For example, assume that you are training a model to predict whether a comment is toxic on social media platforms. You would expect your model to predict the same score for similar sentences with different identity terms. For example, “some people are Muslim” and “some people are Christian” should have the same toxicity score. However, as shown in ^{1}, training a convolutional neural net leads to a model which assigns different toxicity scores to the same sentences with different identity terms. Reliance on spurious features is prevalent among many other machine learning models. For instance, ^{2} shows that state of the art models in object recognition like Resnet50 ^{3} rely heavily on background, so changing the background can also change their predictions .
Machine learning models rely on spurious features such as background in an image or identity terms in a comment. Reliance on spurious features conflicts with fairness and robustness goals.
Of course, we do not want our model to rely on such spurious features due to fairness as well as robustness concerns. For example, a model’s prediction should remain the same for different identity terms (fairness); similarly its prediction should remain the same with different backgrounds (robustness). The first instinct to remedy this situation would be to try to remove such spurious features, for example, by masking the identity terms in the comments or by removing the backgrounds from the images. However, removing spurious features can lead to drops in accuracy at test time ^{4}^{5}. In this blog post, we explore the causes of such drops in accuracy.
There are two natural explanations for accuracy drops:
 Core (nonspurious) features can be noisy or not expressive enough so that even an optimal model has to use spurious features to achieve the best accuracy ^{6}^{7}^{8}.
 Removing spurious features can corrupt the core features ^{9}^{10}.
One valid question to ask is whether removing spurious features leads to a drop in accuracy even in the absence of these two reasons. We answer this question affirmatively in our recently published work in ACM Conference on Fairness, Accountability, and Transparency (ACM FAccT) ^{11}. Here, we explain our results.
Removing spurious features can lead to drop in accuracy even when spurious features are removed properly and core features exactly determine the target!
Before delving into our result, we note that understanding the reasons behind the accuracy drop is crucial for mitigating such drops. Focusing on the wrong mitigation method fails to address the accuracy drop.
Before trying to mitigate the accuracy drop resulting from the removal of the spurious features, we must understand the reasons for the drop.
Previous work  Previous work  This work  

Removing spurious features causes drops in accuracy because…  core features are noisy and not sufficiently expressive.  spurious features are not removed properly and thus corrupt core features.  a lack of training data causes spurious connections between some features and the target. 
We can mitigate such drops by…  focusing on collecting more expressive features (e.g., highresolution images)  focusing on more accurate methods for removing spurious features.  focusing on collecting more diverse training data. We show how to leverage unlabeled data to achieve such diversity. 
This work in a nutshell:
 We study overparameterized models that fit training data perfectly.
 We compare the “core model” that only uses core features (nonspurious) with the “full model” that uses both core features and spurious features.
 Using the spurious feature, the full model can fit training data with a smaller norm.
 In the overparameterized regime, since the number of training examples is less than the number of features, there are some directions of data variation that are not observed in the training data (unseen directions).
 Though both models fit the training data perfectly, they have different “assumptions’’ for the unseen directions. This difference can lead to
 Drop in accuracy
 Affecting different test distributions (we also call them groups) disproportionately (increasing accuracy in some while decreasing accuracy in others).
Noiseless Linear Regression
Over the last few years, researchers have observed some surprising phenomena about deep networks that conflict with classical machine learning. For example, training models to zero training loss leads to better generalization instead of overfitting ^{12}. A line of work ^{13}^{14} found that these unintuitive results happen even for simple models such as linear regression if the number of features are greater than the number of training data, known as the overparameterized regime.
Accuracy drops due to the removal of spurious features is also unintuitive. Classical machine learning tells us that removing spurious features should decrease generalization error (since these features are, by definition, irrelevant for the task). Analogous to the mentioned work, we will explain this unintuitive result in overparameterized linear regression as well.
Accuracy drop due to removal of the spurious feature can be explained in overparameterized linear regression.
Let’s first formalize the noiseless linear regression setup. Recall that we are going to study a setup in which the target is completely determined by the core features, and the spurious feature is a single feature that can be removed perfectly without affecting predictive performance. Formally, we assume there are \(d\) core features \(z \in \mathbb{R}^d\) that determine the target \(y \in \mathbb{R}\) perfectly, i.e., \( y = {\theta^\star}^\top z\). In addition, we assume there is a single spurious feature \(s\) that can also be determined by the core features \(s = {\beta^\star}^\top z\). Note that the spurious feature can have information about features that determine the target or it can be completely unrelated to the target (i.e., for all \(i\), \(\beta^\star_i \theta^\star_i=0\)).
We consider a setup where target (\(y\)) is a deterministic function of core features (\(z\)). In addition, there is a spurious feature (\(s\)) that can also be determined by the core feature. We compare two models, the core model that only uses \(z\) to predict \(y\) and the full model which uses both \(z\) and \(s\) to predict \(y\).
We consider two models:
 Core model that only uses the core features \(z\) to predict the target \(y\), and it is parametrized by \({\theta^\text{s}}\). For a data point with core features \(z\), its prediction is \(\hat y = {\theta^\text{s}}^\top z\).
 Full model that uses the core features \(z\) and also uses the spurious feature \(s\), and it is parametrized by \({\theta^\text{+s}}\), and \(w\), For a data point with core feature \(z\) and a spurious feature \(s\), its prediction is \(\hat y = {\theta^\text{+s}}^\top z + ws\).
In this setup, the mentioned two reasons that naturally can cause accuracy drop after removing the spurious feature (depicted in the table above) do not exist.
 The spurious feature \(s\) adds no information about the target \(y\) beyond what already exists in the core features \(z\) (reason 1),
 Removing \(s\) does not corrupt \(z\) (reason 2).
Motivated by recent work in deep learning, which speculates that gradient descent converges to the minimumnorm solution that fits training data perfectly ^{15}, we consider the minimumnorm solution.
 Training data: We assume we have \(n < d\) triples of \((z_i, s_i, y_i)\)
 Test data: We assume core features in the test data are from a distribution with covariance matrix \(\Sigma = \mathbb{E}[zz^\top]\) (we use group and test data distribution exchangeably).
In this simple setting, one might conjecture that removing the spurious feature should only help accuracy. However, we show that this is not always the case. We exactly characterize the test distributions that are negatively affected by removing spurious features, as well as the ones that are positively affected by it.
Example
Let’s first look at a simple example with only one training data and three core features (\(z_1, z_2\) and \(z_3\)). Let the true parameters \(\theta^\star =[2,2,2]^\top\) which results in \(y=2\), and let the spurious feature parameter \({\beta^\star} = [1,2,2]^\top\) which results in \(s=1\).
First, note that the smallest L2norm vector that can fit the training data for the core model is \({\theta^\text{s}}=[2,0,0]\). On the other hand, in the presence of the spurious feature, the full model can fit the training data perfectly with a smaller norm by assigning weight \(1\) for the feature \(s\) (\({\theta^\text{s}}_2^2 = 4\) while \({\theta^\text{+s}}_2^2 + w^2 = 2 < 4\)).
Generally, in the overparameterized regime, since the number of training examples is less than the number of features, there are some directions of data variation that are not observed in the training data. In this example, we do not observe any information about the second and third features. The core model assigns weight \(0\) to the unseen directions (weight \(0\) for the second and third features in this example). However, the nonzero weight for the spurious feature leads to a different assumption for the unseen directions. In particular, the full model does not assign weight \(0\) to the unseen directions. Indeed, by substituting \(s\) with \({\beta^\star}^\top z\), we can view the full model as not using \(s\) but implicitly assigning weight \(\beta^\star_2=2\) to the second feature and \(\beta^\star_3=2\) to the third feature (unseen directions at training).
Let’s now look at different examples and the prediction of these two models:
In this example, removing \(s\) reduces the error for a test distribution with high deviations from zero on the second feature, whereas removing \(s\) increases the error for a test distribution with high deviations from zero on the third feature.
Main result
As we saw in the previous example, by using the spurious feature, the full model incorporates \({\beta^\star}\) into its estimate. The true target parameter (\(\theta^\star\)) and the true spurious feature parameters (\({\beta^\star}\)) agree on some of the unseen directions and do not agree on the others. Thus, depending on which unseen directions are weighted heavily in the test time, removing \(s\) can increase or decrease the error.
More formally, the weight assigned to the spurious feature is proportional to the projection of \(\theta^\star\) on \({\beta^\star}\) on the seen directions. If this number is close to the projection of \(\theta^\star\) on \({\beta^\star}\) on the unseen directions (in comparison to 0), removing \(s\) increases the error, and it decreases the error otherwise. Note that since we are assuming noiseless linear regression and choose models that fit training data, the model predicts perfectly in the seen directions and only variations in unseen directions contribute to the error.
(Left) The projection of \(\theta^\star\) on \(\beta^\star\) is positive in the seen direction, but it is negative in the unseen direction; thus, removing \(s\) decreases the error. (Right) The projection of \(\theta^\star\) on \(\beta^\star\) is similar in both seen and unseen directions; thus, removing \(s\) increases the error.
Drop in accuracy in test time depends on the relationship between the true target parameter (\(\theta^\star\)) and the true spurious feature parameters (\({\beta^\star}\)) in the seen directions and unseen direction.
Let’s now formalize the conditions under which removing the spurious feature (\(s\)) increases the error. Let \(\Pi = Z(ZZ^\top)^{1}Z\) denote the column space of training data (seen directions), thus \(I\Pi\) denotes the null space of training data (unseen direction). The below equation determines when removing the spurious feature decreases the error.
The left side is the difference between the projection of \(\theta^\star\) on \(\beta^\star\) in the seen direction with their projection in the unseen direction scaled by test time covariance. The right side is the difference between 0 (i.e., not using spurious features) and the projection of \(\theta^\star\) on \(\beta^\star\) in the unseen direction scaled by test time covariance. Removing \(s\) helps if the left side is greater than the right side.
Experiments
While the theory applies only to linear models, we now show that in nonlinear models trained on realworld datasets, removing a spurious feature reduces the accuracy and affects groups disproportionately.
Datasets. We are going to study the CelebA dataset ^{16} which contains photos of celebrities along with 40 different attributes. \footnote{See our paper for the results on the commenttoxicitydetection and MNIST datasets} We choose wearing lipstick (indicating if a celebrity is wearing lipstick) as the target and wearing earrings (indicating if a celebrity is wearing earrings) as the spurious feature.
Note that although wearing earrings is correlated with wearing lipstick, we expect our model to not change its prediction if we tell the model the person is wearing earrings.
In the CelebA dataset wearing earrings is correlated with wearing lipstick. In this dataset, if a celebrity wears earrings, it is almost five times more likely that they will wear lipstick than not wearing lipstick. Similarly, if a celebrity does not wear earrings, it is almost two times more likely for them not to wear lipstick than wearing lipstick.
Setup. We train a twolayer neural network with 128 hidden units. We flatten the picture and concatenate the binary variable of wearing earrings to it (we tuned a multiplier for it). We also want to know how much each model relies on the spurious feature. In other words, we want to know how much the model prediction changes as we change the wearing earrings variable. We call this attacking the model (i.e, swapping the value of the binary feature of wearing earrings). We run each experiment 50 times and report the average.
Results. The below diagram shows the accuracy of different models, and their accuracies when they are attacked. Note that, because our attack focuses on the spurious feature, the core model’s accuracy will remain the same.
Removal of the wearing lipstick decreases the overall accuracy. The decrease in accuracy is not monotonic among different groups. The accuracy has decreased in the group where people are not wearing lipstick or earrings and in the group that they both have lipstick and earrings. On the other hand, accuracy increases for the group that only wears one of them.
Let’s break down the diagram and analyze each section.
All celebrities together: have a reasonable accuracy of 82% The overall accuracy drops 1% when we remove the spurious feature (core model accuracy). The full model relies on the spurious feature a lot, thus attacking the full model leads to a ~ 17% drop in overall accuracy.  
The celebrities who follow the stereotype (people who do not have earrings or lipstick, and people who wear both) have a good accuracy overall (both above 85%); The accuracy of both groups drop as we remove the wearing earrings (i.e., core model accuracy). Using the spurious feature helps their accuracy, thus attacking the full model leads to a ~30% drop in their accuracy.  
The celebrities who do not follow the stereotypes have a very low accuracy; this is especially worse for people who only wear earrings (33% accuracy in comparison to the average of 85%). Removing the wearing earring increases their accuracy substantially. Using the spurious feature does not help their accuracy, thus attacking the full model does not change accuracy for these groups. 
In nonlinear models trained on realworld datasets, removing a spurious feature reduces the accuracy and affects groups disproportionately.
Q&A (Other results):
I know about my problem setting, and I am certain that disjoint features determine the target and the spurious feature (i.e., for all \(i\), \(\theta^\star_i\beta^\star_i=0\)). Can I be sure that my model will not rely on the spurious feature, and removing the spurious feature definitely reduces the error? No! Actually, for any \(\theta^\star\) and \({\beta^\star}\), we can construct a training set and two test sets with \(\theta^\star\) and \({\beta^\star}\) as the true parameters and the spurious feature parameter, such that removing the spurious feature reduces the error in one but increases the error in the other one (see Corollary 1 in our paper).
I am collecting a balanced dataset such that the spurious feature and the target are completely independent (i.e., \(p[y,s]= p[y]p[s]\)). Can I be sure that my model will not rely on the spurious feature, and removing the spurious feature definitely reduces the error? No! for any \(S \in \mathbb{R}^n\) and \(Y \in \mathbb{R}^n\), we can generate a training set and two test sets with \(S\) and \(Y\) as their spurious feature and targets, respectively, such that removing the spurious feature reduces the error in one but increases the error in the other (see Corollary 2 in our paper).
What happens when we have many spurious features? Good question! Let’s say \(s_1\) and \(s_2\) are two spurious features. We show that:
 Removing \(s_1\) makes the model more sensitive against \(s_2\), and
 If a group has high error because of the new assumption about unseen direction enforced by using \(s_2\), then it will have an even higher error by removing \(s_1\). (See Proposition 3 in our paper).
Is it possible to have the same model (a model with the same assumptions on unseen directions as the full model) without relying on the spurious feature (i.e., be robust against the spurious feature)? Yes! You can recover the same model as the full model without relying on the spurious feature via robust selftraining and unlabeled data (See Proposition 4).
Conclusion
In this work, we first showed that overparameterized models are incentivized to use spurious features in order to fit the training data with a smaller norm. Then we demonstrated how removing these spurious features altered the model’s assumption on unseen directions. Theoretically and empirically, we showed that this change could hurt the overall accuracy and affect groups disproportionately. We also proved that robustness against spurious features (or error reduction by removing the spurious features) cannot be guaranteed under any condition of the target and spurious feature. Consequently, balanced datasets do not guarantee a robust model and practitioners should consider other features as well. Studying the effect of removing noisy spurious features is an interesting future direction.
Acknowledgement
I would like to thank Percy Liang, Jacob Schreiber and Megha Srivastava for their useful comments. The images in the introduction are from ^{17}^{18} ^{19}^{20}.

Dixon, Lucas, et al. “Measuring and mitigating unintended bias in text classification.” Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. 2018. ↩

Xiao, Kai, et al. “Noise or signal: The role of image backgrounds in object recognition.” arXiv preprint arXiv:2006.09994 (2020). ↩

He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. ↩

Zemel, Rich, et al. “Learning fair representations.” International Conference on Machine Learning. 2013. ↩

Wang, Tianlu, et al. “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. ↩

Khani, Fereshte, and Percy Liang. “Feature Noise Induces Loss Discrepancy Across Groups.” International Conference on Machine Learning. PMLR, 2020. ↩

Kleinberg, Jon, and Sendhil Mullainathan. “Simplicity creates inequity: implications for fairness, stereotypes, and interpretability.” Proceedings of the 2019 ACM Conference on Economics and Computation. 2019. ↩

photo from Torralba, Antonio. “Contextual priming for object detection.” International journal of computer vision 53.2 (2003): 169191. ↩

Zhao, Han, and Geoff Gordon. “Inherent tradeoffs in learning fair representations.” Advances in neural information processing systems. 2019. ↩

photo from Wang, Tianlu, et al. “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations.” Proceedings of the IEEE International Conference on Computer Vision. 2019. ↩

Khani, Fereshte, and Percy Liang. “Removing Spurious Features can Hurt Accuracy and Affect Groups Disproportionately.” arXiv preprint arXiv:2012.04104 (2020). ↩

Nakkiran, Preetum, et al. “Deep double descent: Where bigger models and more data hurt.” arXiv preprint arXiv:1912.02292 (2019). ↩

Hastie, T., Montanari, A., Rosset, S., & Tibshirani, R. J. (2019). Surprises in highdimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560. ↩

Raghunathan, Aditi, et al. “Understanding and mitigating the tradeoff between robustness and accuracy.” arXiv preprint arXiv:2002.10716 (2020). ↩

Gunasekar, Suriya, et al. “Implicit regularization in matrix factorization.” 2018 Information Theory and Applications Workshop (ITA). IEEE, 2018. ↩

Liu, Ziwei, et al. “Deep learning face attributes in the wild.” Proceedings of the IEEE international conference on computer vision. 2015. ↩

Xiao, Kai, et al. “Noise or signal: The role of image backgrounds in object recognition.” arXiv preprint arXiv:2006.09994 (2020). ↩

Garg, Sahaj, et al. “Counterfactual fairness in text classification through robustness.” Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. 2019. ↩

photo from Torralba, Antonio. “Contextual priming for object detection.” International journal of computer vision 53.2 (2003): 169191. ↩

photo from Wang, Tianlu, et al. “Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations.” Proceedings of the IEEE International Conference on Computer Vision. 2019. ↩