Productive Struggle: The Future of Human Learning in the Age of AI

January 29, 2025

Walking through our computer science building, we can see ChatGPT on nearly every screen. Today, students can use AI at every stage of their learning process. For example, instead of struggling to figure out how to start a coding assignment, students can simply copy and paste the question into an AI model. Even if the solution doesn’t work perfectly out of the box, they can re-prompt the model with its own solution and an error description to receive a fixed solution.

We can’t help but compare this to our own experiences learning to program during undergrad. We remember the struggle of writing our first lines of code, the days spent debugging with friends at the student center, and the feeling of success after a night’s sleep when finally fixing the bug. We didn’t enjoy being stuck in the moment, but now we look back and understand that going through these surmountable struggles was important for our learning. Our productive struggles not only helped us reach the correct solution in the short run, but also taught us how to write stronger, less error-prone code in the long run.

**Figure 1:** Productive struggle is important for human learning. Have we become too dependent on ChatGPT?

AI systems like ChatGPT are undeniably exciting, but they also challenge the very essence of how we humans learn. These systems excel at tasks that require years of training to master, such as competitive mathematics or college-level programming ¹. They are also becoming more accessible, ready to be used whenever a task poses the slightest bit of difficulty for us. Despite progress in AI, skills such as literacy are declining in both children and adults ², raising the question: what aspects of learning do we want technology to cultivate?

We need to struggle in order to develop new skills ³⁴⁵. By “struggle”, we mean the effort students put into understanding a concept and working through challenges to uncover solutions that are not immediately obvious. While we may not enjoy the struggles of learning, struggle teaches persistence and deepens understanding. Our worry is that with AI, we may develop a habit of avoiding struggle, and that habit risks eroding the depth of our knowledge⁶.

How do we preserve meaningful learning in a world where answers are just a prompt away?

The Evaluation Paradox

Our traditional paradigms for evaluating AI systems often rely on user satisfaction ratings or benchmark assessments, metrics that research has shown to be insufficient for education. For example, Hiroko Warshauer and James Hiebert show that effective support for struggle in pedagogical settings requires attention to multiple dimensions, such as the nature of a teacher’s language ⁷³, the design of tasks⁸, and the broader learning environment⁹¹⁰¹¹. Furthermore, studies by Arthur Glenberg and Michael Pressley et al. have shown that students often overestimate their own understanding and may prefer systems that reduce struggle in the short term ¹²¹³. How is this paradox between what users prefer and what they actually learn reflected in our current AI systems?

In our work on evaluating interactions between humans and AI systems ¹⁴ (in this case, language models) for information-seeking tasks like question answering, we also observed a disconnect between users’ views of helpfulness and their task performance: the language models that users self-reported as helpful were not always the ones that led to higher task accuracy (Figure 2)! This result was attributed to users placing misplaced trust in the “confident” and “definitive” language generated by certain language models, particularly those with additional fine-tuning ¹⁵. Only when these users encountered a confident-sounding answer that was obviously wrong did their assessment of the language model rapidly decline. On the other hand, a few users working with language models that provided indirect and lengthier answers trusted that their struggle was intentional and held value, including one participant who stated “the task may not be as fun if the AI would give you all the answers!”¹⁶.

**Figure 2:** We asked users to answer College Chemistry questions from the MMLU dataset while given access to different language models (LMs) for help. Those interacting with stronger LMs (e.g. instruction-tuned models) received more direct and confident responses than those interacting with weaker LMs, even in the presence of hallucinations. This resulted in a discrepancy between user-reported helpfulness and task performance, suggesting overreliance that can hinder learning.

Another domain where this evaluation paradox occurs is rehabilitation technology, where it is crucial for the patient to trust that any robot-assisted therapy (e.g. repetitive movements) will actually lead to long-term improvement ¹⁷. In our work on AI-assisted motor learning ¹⁸, we again saw that users self-reported a lower preference for the type of AI assistance that actually led to improved learning (Figure 3). Our experiment asked participants to learn to control a vehicle in a simulated environment, and we found that our AI-based instruction encouraged participants to learn a new skill: successfully operating the vehicle in reverse. However, participants found reversing uncomfortable and frustrating. In this setting, AI succeeded at helping students learn a new skill, leading to overall task improvement, but failed to inspire student resilience.

**Figure 3:** Participants who received personalized AI training in our parking simulator were encouraged to try to learn to reverse – a challenging skill that, when learned, leads to task improvement (e.g. taking less time to park). While participants found personalized AI training less helpful (left) than the control training curricula, it led to higher task performance (middle) and greater usage of this skill in evaluation trials (right), showing how user preference is a poor proxy for learning gain.

If students’ self-reported assessments and engagement aren’t necessarily reflective of learning, what then does good teaching look like when students are struggling? Can we help teachers guide their students through productive struggle with AI?

Fostering Productive Struggle by Empowering Teachers

While many people can be a teacher (parents, mentors, tutors, traditional classroom teachers), good teachers are hard to come by. Good teachers gain their expertise through years of training or trial and error, and students who most need experienced teachers often have the least access to them ¹⁹²⁰²¹. This inequity impacts the quality of students’ education and the nature of how they struggle.

One exciting direction is using AI to help human educators create better moments of productive struggle for their students. In earlier research, we observed that novice educators have difficulty helping struggling students, particularly under time pressure ²². These educators were not sure how to nudge students or how to come up with the right thing to say on the spot. Without guidance on how to effectively foster productive struggle, they frequently defaulted to providing the solution to the student. This meant missed opportunities to turn a student’s struggle into meaningful learning!

To address this, we developed Tutor CoPilot ²³, an AI-powered system designed to provide live suggestions to human tutors on how to foster productive struggle (Figure 4). Tutor CoPilot is a language model that generates expert-like suggestions on scaffolding the student’s learning, such as asking a guiding question or providing a hint. Unlike generic tools like ChatGPT that risk providing the answer to students or may not be able to engage a student for an entire hour of learning, Tutor CoPilot focuses on amplifying the tutor’s ability to foster productive struggle.

**Figure 4:** Illustration of Tutor CoPilot. Tutor CoPilot provides real-time suggestions for tutors on how to help struggling students. These suggestions follow expert-informed strategies, like asking a guiding question or providing a hint.

We tested Tutor CoPilot in a large randomized controlled trial and found that the technology improved student performance on math tests (Figure 5, left). But what actually changed in the tutors’ instruction to enable students to learn better? Were the tutors actually fostering productive struggle? When we looked at all the tutors’ language, we found that tutors who had access to Tutor CoPilot were indeed using language that better scaffolded learning and fostered productive struggle, such as prompting students to explain their answers (Figure 5, right). Tutors who didn’t have access to Tutor CoPilot tended to give away the answer and solution strategy. By improving how tutors foster productive struggle, Tutor CoPilot helped students learn better!

**Figure 5:** Tutor CoPilot results. (Left) We found that students working with tutors who had access to Tutor CoPilot were 4 percentage points (p.p.) more likely to pass their math lesson tests. (Right) Tutors with access to Tutor CoPilot used strategies that better fostered productive struggle, whereas tutors who didn’t have access gave away answers and gave generic encouragement to the students.

There are other exciting approaches that invite us to reimagine AI’s role in engaging real educators and students in productive struggle. For example, work from CU Boulder explores how AI can help students collaborate better with their peers by establishing community agreements ²⁴. Work from Amplify leverages technology to make classroom learning more social by enabling students to try different ideas, share their math observations and have more meaningful classroom discussions ²⁵. These examples illustrate how we can support learning by inviting student curiosity and empowering human relationships between students and educators via technology.

Conclusion

At its heart, learning is about so much more than just finding the right answer: it’s about building resilience, fostering curiosity, and enriching the journey of discovery for students ²⁶. Despite incredible advances in technology such as AI, human skills – from literacy to fine motor skills – are continuing to decline, with some blaming increased screen time and technology reliance²⁷. In an era where AI can deliver instant solutions and gratification, we must reconsider how to actively preserve the essential aspects of learning.

We believe AI’s role in education isn’t to eliminate struggle but to enhance it. Whether it’s by empowering educators with tools like Tutor CoPilot, or empowering students with inquiry-driven environments, we wish to ensure that AI supports deeper, more meaningful learning experiences. Let’s build a future where AI systems encourage—not shortcut—meaningful learning from the get-go!

Contact: rewang@cs.stanford.edu and megha@cs.stanford.edu

1. OpenAI. “GPT-4 Technical Report.” ArXiv abs/2303.08774 (2023).
2. Malkus, Nat. Testing Theories of Why: Four Keys to Interpreting US Student Achievement Trends. American Enterprise Institute, https://www.aei.org/research-products/report/testing-theories-of-why-four-keys-to-interpreting-us-student-achievement-trends/
3. Warshauer, H.K. Productive struggle in middle school mathematics classrooms. J Math Teacher Educ 18, 375–400 (2015). https://doi.org/10.1007/s10857-014-9286-3
4. Hiebert, J., & Grouws, D. A. (2007). The Effects of Classroom Mathematics Teaching on Students’ Learning. In F. Lester (Ed.), Second Handbook of Research on Mathematics Teaching and Learning (pp. 371-404). Charlotte, NC: Information Age.
5. Dewey, John. How We Think. D. C. Heath & Co., Publishers, 1910, www.gutenberg.org/files/37423/37423-h/37423-h.htm.
6. Do we still need to learn certain skills if AI can do them better? Yes. The calculator didn’t make learning mathematics obsolete: students still need to master the principles of mathematics, including arithmetic. Similarly, even if AI can generate code, mastering the principles is still important for understanding, debugging and adapting that code.
7. Warshauer, Hiroko K. “Strategies to Support Productive Struggle.” Mathematics Teaching in the Middle School, vol. 20, no. 7, National Council of Teachers of Mathematics, Mar. 2015, pp. 390-393. JSTOR, https://www.jstor.org/stable/10.5951/mathteacmiddscho.20.7.0390.
8. Hiebert, James, and Diana Wearne. “Instructional Tasks, Classroom Discourse, and Students’ Learning in Second-Grade Arithmetic.” American Educational Research Journal, vol. 30, no. 2, Summer 1993, pp. 393-425.
9. Yeager, David S., et al. “A National Experiment Reveals Where a Growth Mindset Improves Achievement.” Nature, vol. 573, 2019, pp. 364-369.
10. Noddings, Nel. “Small Groups as a Setting for Research on Mathematical Problem Solving.” Teaching and Learning Mathematical Problem Solving: Multiple Research Perspectives, edited by Edward A. Silver, 1st ed., Routledge, 1985.
11. Schoenfeld, Alan H. “Ideas in the Air: Speculations on Small Group Learning, Environmental and Cultural Influences on Cognition, and Epistemology.” Teaching and Learning Mathematical Problem Solving: Multiple Research Perspectives, edited by Edward A. Silver, Routledge, 1985.
12. Glenberg, Arthur M., Alex Cherry Wilkinson, and William Epstein. “The Illusion of Knowing: Failure in the Self-Assessment of Comprehension.” Memory & Cognition, vol. 10, 1982, pp. 597-602.
13. Pressley, Michael, et al. “Sometimes adults miss the main ideas and do not realize it: Confidence in responses to short-answer and multiple-choice comprehension questions.” Reading Research Quarterly, 1990: 232-249.
14. Lee, Mina, Megha Srivastava, Amelia Hardy, John Thickstun, et al. “Evaluating Human-Language Model Interaction.” Transactions on Machine Learning Research, vol. 9, 2023.
15. Srivastava, Megha, and John Thickstun. “Observations from HALIE: A Closer Look at Human-LM Interactions in Information-Seeking Contexts.” Center for Research on Foundation Models Blog, Stanford University, 2023.
16. Srivastava, Megha, and John Thickstun. “Observations from HALIE: A Closer Look at Human-LM Interactions in Information-Seeking Contexts.” Center for Research on Foundation Models Blog, Stanford University, 2023.
17. Kellmeyer, Philipp, et al. “Social Robots in Rehabilitation: A Question of Trust.” Science Robotics, vol. 3, no. 21, 22 Aug. 2018, DOI: 10.1126/scirobotics.aat1587.
18. Srivastava, Megha, et al. “Assistive Teaching of Motor Control Tasks to Humans.” Advances in Neural Information Processing Systems 36 (2022). https://arxiv.org/pdf/2211.14003
19. Hong, Joe, and Erica Yee. “Low-Income Students Are More Likely to Be in Classrooms with Underqualified Teachers.” KQED, 21 July 2022, https://www.kqed.org.
20. Peske, Heather G., and Kati Haycock. Teaching Inequality: How Poor and Minority Students Are Shortchanged on Teacher Quality. The Education Trust, 2006.
21. Clotfelter, Charles T., et al. “High Poverty Schools and the Distribution of Teachers and Principals.” North Carolina Review, 2007.
22. Wang, Rose E., et al. “Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes.” NAACL 2024, https://arxiv.org/abs/2310.10648.
23. Wang, Rose E., et al. “Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise.” arXiv, 26 Jan. 2025, https://arxiv.org/abs/2410.03017.
24. Breideband, Thomas, et al. “The Community Builder (CoBi): Helping Students to Develop Better Small Group Collaborative Learning Skills.” CSCW ’23 Companion: Companion Publication of the 2023 Conference on Computer Supported Cooperative Work and Social Computing, 2023.
25. Meyer, Dan. “The Math Kids Most Want to Learn.” Dan Meyer Blog, 19 June 2024. https://danmeyer.substack.com/p/the-math-kids-most-want-to-learn
26. Meyer, Dan. “Amplify Desmos Math is More than a Curriculum.” Dan Meyer Blog, 18 August 2024. https://danmeyer.substack.com/p/amplify-desmos-math-is-more-than
27. National Geographic. “Kids are losing fine motor skills – and screens might be to blame.” 28 January 2025.
