[July 20, 2021] Our work was recently covered by the New York Times here. You can also find a technical preprint here.
tl;dr. With the rise of large online computer science courses, there is an abundance of high-quality content. At the same time, the sheer size of these courses makes high-quality feedback to student work more and more difficult. Talk to any educator, and they will tell you how instrumental instructor feedback is to a student’s learning process. Unfortunately, giving personalized feedback isn’t cheap: for a large online coding course, this could take months of labor. Today, large online courses either don’t offer feedback at all or take shortcuts that sacrifice the quality of the feedback given.
Several computational approaches have been proposed to automatically produce personalized feedback, but each falls short: they either require too much upfront work by instructors or are limited to very simple assignments. A scalable algorithm for feedback to student code that works for university-level content remains to be seen. Until now, that is. In a recent paper, we proposed a new AI system based on meta-learning that trains a neural network to ingest student code and output feedback. Given a new assignment, this AI system can quickly adapt with little instructor work. On a dataset of student solutions to Stanford’s CS106A exams, we found the AI system to match human instructors in feedback quality.
To test the approach in a real-world setting, we deployed the AI system at Code in Place 2021, a large online computer science course spun out of Stanford with over 12,000 students, to provide feedback to an end-of-course diagnostic assessment. The students’ reception to the feedback was overwhelmingly positive: across 16,000 pieces of feedback given, students agreed with the AI feedback 97.9% of the time, compared to 96.7% agreement to feedback provided by human instructors. This is, to the best of our knowledge, the first successful deployment of machine learning based feedback to open-ended student work.
In the middle of the pandemic, while everyone is forced to social distance in the confines of their own homes, thousands of people across the world were hard at work figuring out why their code was stuck in an infinite loop. Stanford CS106A, one of the university’s most popular courses and its largest introductory programming offering with nearly 1,600 students every year, grew even bigger. Dubbed Code in Place, CS106A instructors Chris Piech, Mehran Sahami and Julie Zelenski wanted to make the curriculum and teaching philosophy of CS106A publicly available as an uplifting learning experience for students and adults alike during a difficult time. In its inaugural showing in April ‘20, Code in Place pulled together 908 volunteer teachers to run an online course for 10,428 students from around the world. One year later, with the pandemic still in full force in many areas of the world, Code in Place kicked off again, growing to over 12,000 students and 1,120 volunteer teachers.
While crowd-sourcing a teaching team did make a lot of things possible for Code in Place that usual online courses lack, there are still limits to what can be done with a class of this scale. In particular, one of the most challenging hurdles was providing high-quality feedback to 10,000 students.
What is feedback?
Everyone knows high quality content is an important ingredient for learning, but another equally important but more subtle ingredient is getting high quality feedback. Knowing the breakdown of what you did well and what the areas for improvement are, is fundamental to understanding. Think back to when you first got started programming: for me, small errors that might be obvious to someone more experienced, cause a lot of frustration. This is where feedback comes in, helping students overcome this initial hurdle with instructor guidance. Unfortunately, feedback is something online code education has struggled with. With popular “massively open online courses” (MOOCs), feedback on student code boils down to compiler error messages, standardized tooltips, or multiple-choice quizzes.
You can find an example of each below. On the left, multiple choice quizzes are simple to grade and can easily assign numeric scores to student work. However, feedback is limited to showing the right answer, which does little to help students understand their underlying misconceptions. The middle picture shows an example of an opaque compiler error complaining about a syntax issue. As a beginner learning to code, error messages are very intimidating and difficult to interpret. Finally, on the right, we see an example of a standardized tooltip: upon making a mistake, a pre-specified message is shown. Pre-specified messages tend to be very vague: here, the tooltip just tells us our solution is wrong and to try something different.
It makes a lot of sense why MOOCs settle for subpar feedback: it’s really difficult to do otherwise! Even for Stanford CS106A, the teaching team is constantly fighting the clock in office hours in an attempt to help everyone. Outside of Stanford, where classes may be more understaffed, instructors are already unable to provide this level of individualized support. With large online courses, the sheer size makes any hope of providing feedback unimaginable. Last year, Code in Place gave a diagnostic assessment during the course for students to summarize what they have learned. However, there was no way to give feedback scalably to all these student solutions. The only option was to release the correct solutions online for students to compare to their own work, displacing the burden of feedback onto the students.
Code in Place and its MOOC cousins are examples of a trend of education moving online, which might only grow given the lasting effects of the pandemic. This shift surfaces a very important challenge: can we provide feedback at scale?
The feedback challenge.
In 2014, Code.org, one of the largest online platforms for code education, launched an initiative to crowdsource thousands of instructors to provide feedback to student solutions [1,2]. The hope of the initiative was to tag enough student solutions with feedback so that for a new student, Code.org could automatically provide feedback by matching the student’s solution to a bank of solutions already annotated with feedback by an instructor. Unfortunately, Code.org quickly found that even after thousands of aggregate hours spent providing feedback, instructors were only scratching the surface. New students were constantly coming up with new mistakes and new strategies. The initiative was cancelled after two years and has not been reproduced since.
We might ask: why did this happen? What is it about feedback that makes it so difficult to scale? In our research, we came up with two parallel explanations.
First, providing feedback to student code is hard work. As an instructor, every student solution requires me to reason about the student’s thought process to uncover what misconceptions they might have had. If you have ever had to debug someone else’s code, providing feedback is at least as hard as that. In a previous research paper, we found that producing feedback for only 800 block-based programs took a teaching team a collective 24.9 hours. If we were to do that for all of Code in Place, it would take 8 months of work.
Second, students approach the same programming problem in an exponential number of ways. Almost every new student solution will be unique, and a single misconception can manifest itself in seemingly infinite ways. As a concrete example, even after seeing a million solutions to a Code.org problem, there is still a 15% chance that a new student generates a solution never seen before. Perhaps not coincidentally, it turns out the distribution of student code closely follows the famous Zipf distribution, which reveals an extremely “long tail” of rare solutions that only one student will ever submit. Moreover, this close relationship to Zipf doesn’t just apply to Code.org; it is a much more general phenomenon. We see similar patterns for student work for university level programming assignments in Python and Java, as well as free response solutions to essay-like prompts.
So, if asking instructors to manually provide feedback at scale is nearly impossible, what else can we do?
Automating feedback.
“If humans can’t do it, maybe machines can” (famous last words). After all, machines process information a lot faster than humans do. There have been several approaches applying computational techniques to provide feedback, the simplest of which is unit tests. An instructor can write a collection of unit tests for the core concepts and use them to evaluate student solutions. However, unit tests expect student code to compile and, often, student code does not due to errors. If we wish to give feedback on partially complete solutions, we need to be able to handle non-compiling code. Given the successes of AI and deep learning in computer vision and natural language, there have been attempts of designing AI systems to automatically provide feedback, even when student code does not compile.
Supervised Learning
Given a dataset of student code, we can ask an instructor to provide feedback for each of the solutions, creating a labeled dataset. This can be used to train a deep learning model to predict feedback for a new student solution. While this is great in theory, in practice, compiling a sufficiently large and diverse dataset is difficult. In machine learning, we are accustomed to datasets with millions of labeled examples since annotating an image is both cheap and requires no domain knowledge. On the other hand, annotating student code with feedback is both time-consuming and needs expertise, limiting datasets to be a few thousand examples in size. Given the Zipf-like nature of student code, it is very unlikely that a dataset of this size can capture all the different ways students approach a problem. This is reflected in practice as supervised attempts perform poorly on new student solutions.
Generative Grading
While annotating student code is difficult work, instructors are really good at thinking about how students would tackle a coding problem and what mistakes they might make along the way. Generative grading [2,3] asks instructors to distill this intuition about student cognition into an algorithm called a probabilistic grammar. Instructors specify what misconceptions a student might make and how that translates to code. For example, if a student forgets a stopping criterion resulting in an infinite loop, their program likely contains a “while” statement with no “break” condition. Given such an algorithm, we can run it forward to generate a full student solution with all misconceptions already labeled. Doing this repeatedly, we curate a large dataset to train a supervised model. This approach was very successful on block-based code, where performance rivaled human instructors. However, the success of it hinges on a good algorithm. While tractable for block-based programs, it became exceedingly difficult to build a good algorithm for university level assignments where student code is much more complex.
The supervised approach requires the instructor to curate a dataset of student solutions with feedback where as the generative grading approach requires the instructor to build an algorithm to generate annotated data. In contrast, the meta-learning approach requires the instructor to annotate feedback for K examples across N programming problems. K is typically very small (~10) and N not much larger (~100).
Meta-learning how to give feedback.
So far, neither approach is quite right. In different ways, supervised learning and generative grading both expect too much from the instructor. As they stand, for every new coding exercise, the instructor would have to put in days, if not weeks to months of effort. In an ideal world, we would shift more of the burden of feedback onto the AI system. While we would still like instructors to play a role, the AI system should bear the onus of quickly adapting to every new exercise. To accomplish this, we built an AI system to “learn how to learn” to give feedback.
Meta-learning is an old idea from the 1990s [9, 10] that has seen a resurgence in the last five years. Recall that in supervised learning a model is trained to solve a single task; in meta-learning, we solve many tasks at once. The catch is that we are limited to a handful of labeled examples for every task. Whereas supervised learning gets lots of labels for one task, we spread the annotation effort evenly across many tasks, leaving us with a few labels per task. In research literature, this is called the few-shot classification problem. The upside to meta-learning is that after training, if your model is presented with a new task that it has not seen before, it can quickly adapt to solve it with only a “few shots” (i.e., a few annotations from the new task).
So, what does meta-learning for feedback look like? To answer that, we first need to describe what composes a “task” in the world of educational feedback. Last year, we compiled a dataset of student solutions from eight CS106A exams collected over the last three academic years. Each exam consists of four to six programming exercises in which the student must write code (but is unable to run or compile it for testing). Every student solution is annotated by an instructor using a feedback rubric containing a list of misconceptions tailored to a single problem. As an example, consider a coding exercise that asks the student to write a Python program that requires string insertion. A potential feedback rubric is shown in the left image: possible misconceptions are inserting at the wrong location or inserting the wrong string. So, we can treat every misconception as its own task. The string insertion example would comprise of four tasks.
One of the key ideas of this approach is to frame the feedback challenge as a few-shot classification problem. Remember that the reasons why previous methods for automated feedback struggled were the (1) high cost of annotation and (2) diversity of student solutions. Casting feedback as a few-shot problem cleverly circumvents both challenges. First, meta-learning can leverage previous data on old exams to learn to provide feedback to a new exercise with very little upfront cost. We only need to label a few examples for the new exercise to adapt the meta-learner and importantly, do not need to train a new model from scratch. Second, there are two ways to handle diversity: you can go for “depth” by training on a lot of student solutions for a single problem to see different strategies, or you can go for “breadth” and get sense of diverse strategies through student solutions on a lot of different problems. Meta-learning focuses its efforts on capturing “breadth”, accumulating more generalizable knowledge that can be shared across problems.
We will leave the details of the meta-learner to the technical report. In short, we propose a new deep neural network called a ProtoTransformer Network that combines the strengths of BERT from natural language processing and Prototypical Networks from few-shot learning literature. This architecture, in tandem with technical innovations – creating synthetic tasks for code, self-supervised pretraining on unlabeled code, careful encoding of variable and function names, and adding question and rubric descriptions as side information – together produce a highly performant AI system for feedback. To help ground this in context, we include three examples on the bottom of the last page of the AI system predicting feedback to student code. The predictions were taken from actual model output on student submissions.
Main Results
Aside from looking at qualitative examples, we can measure its performance quantitatively by evaluating the correctness of the feedback an AI system gave on exercises not used in training. A piece of feedback is considered correct if a human instructor annotated the student solution with it.
We consider two experimental settings for evaluation:
-
Held-out Questions: we randomly pick 10% of questions across all exams to evaluate the meta-learner. This simulates instructors providing feedback for part of every exam, leaving a few questions for the AI to give feedback for.
-
Held-out Exams: we hold out an entire exam for evaluation. This is a much harder setting as we know nothing about the new exam but also most faithfully represents an autonomous feedback system.
We measure the performance of human instructors by asking several teaching assistants to grade the same student solution and recording agreement. We also compare the meta-learner to a supervised baseline. As shown in the graph on the previous page, the meta-learner outperforms the supervised baseline by up to 24 percentage points, showcasing the utility of meta-learning. More surprisingly, we find that the meta-learner surpasses human performance by 6% in held-out questions. However, there is still room for improvement as we fall short 8% to human performance on held-out exams – a harder challenge. Despite this, we find these results encouraging: previous methods for feedback could not handle the complexity of university assignments, let alone approach, or match the performance of instructors.
Automated feedback for Code in Place.
Taking a step back, we began with the challenge of feedback, an important ingredient to a student’s learning process that is frustratingly difficult to scale, especially for large online courses. Many attempts have been made towards this, some based on crowdsourcing human effort and others based on computational approaches with and without AI, but all of which have faced roadblocks. In late May ‘21, we built and tested an approach based on meta-learning, showing surprisingly strong results on university level content. But admittedly, the gap between ML research and deployment can be large, and it remained to be shown that our approach can give high quality feedback at scale in a live application. Come June, Code in Place ‘21 was gearing up for its diagnostic assessment.
In an amazing turnout, Code in Place ‘21 had 12,000 students. But grading 12,000 students each solving 5 problems would be beyond intractable. To put it into numbers, it would take 8 months of human labor, or more than 400 teaching assistants working standard nine-to-five shifts to manually grade all 60,000 solutions.
The Code in Place ‘21 diagnostic contained five new questions that were not in the CS106A dataset used to train the AI system. However, the questions were similar in difficulty and scope, and correct solutions were roughly the same length as those in CS106A. Because the AI system was trained with meta-learning, it could quickly adapt to these new questions. Volunteers from the teaching team helped annotate a small portion of the student solutions that the AI meta-learning algorithm requires.
To showcase feedback to students, we were joined by Alan Chang and together we built an application for students to see their solutions and AI feedback (see image above). We were transparent in informing students that an AI was providing feedback. For each predicted misconception, we associated it with a message (shown in the blue box) to the student. We carefully crafted the language of these messages to be helpful and supportive of the student’s learning. We also provided finer grained feedback by highlighting portions of the code that the AI system weighted more strongly in making its prediction. In the image above, the student forgot to cast the height to an integer. In fact, the highlighted line should be height = int(input(…)), which the AI system picked up on.
Human versus AI Feedback
For each question, we asked the student to rate the correctness of the feedback provided by clicking either a “thumbs up” or a “thumbs down” before they can proceed to the next question (see lower left side of the image above). Additionally, after a student reviewed all their feedback, we asked them to rate the AI system holistically on a five-point scale. As part of the deployment, some of the student solutions were given feedback by humans but students did not know which ones. So, we can compare students’ holistic and per-question rating when given AI feedback versus instructor feedback.
Here’s what we found:
- 1,096 students responded to a survey after receiving 15,134 pieces of feedback. The reception was overwhelmingly positive: Across all 15k pieces of feedback, students agreed with AI suggestions 97.9% ± 0.001 of the time.
- We compared student agreement with AI feedback against agreement with instructor feedback, where we surprisingly found the AI system surpass human instructors: 97.9% > 96.7% (p-value 0.02). The improvement was driven by higher student ratings on constructive feedback – times when the algorithm suggested an improvement.
- On the five-point scale, the average holistic rating of usefulness by students was 4.6 ± 0.018 out of 5.
- Given the wide diversity of students participating in Code in Place, we segmented the quality of AI feedback by gender and country, where we found no statistically significant difference across groups.
To the best of our knowledge, this was both the first successful deployment of AI-driven feedback to open-ended student work and the first successful deployment of prototype networks in a live application. With promising results in both a research and a real-world setting, we are optimistic about the future of artificial intelligence in code education and beyond.
How could AI feedback impact teaching?
A successful deployment of an automated feedback system raises several important questions about the role of AI in education and more broadly, society.
To start, we emphasize that what makes Code in Place so successful is its amazing teaching team made up of over 1,000 section leaders. While feedback is an important part of the learning experience, it is one component of a larger ecosystem. We should not incorrectly conclude from our results that AI can automate teaching or replace instructors – nor should the system be used for high-stakes grading. Instead, we should view AI feedback as another tool in the toolkit for instructors to better shape an amazing learning experience for students.
Further, we should evaluate our AI systems with a double bottom line of both performance and fairness. Our initial experiments suggest that the AI is not biased but our initial results are being supplemented by a more thorough audit. To minimize the chance of providing incorrect feedback to student work, future research should encourage AI systems to learn to say: “I don’t know”.
Third, we find it important that progress in education research be public and available for others to critique and build upon.
Finally, this research opens so many directions moving forward. We hope to use this work to enable teachers to better reach their potential. Moreover, an AI feedback makes it scalable to study not just students’ final solutions, but the process of how students solve their assignments. Finally, there is a novel opportunity for computational approaches towards unraveling the science of how students learn.
Acknowledgements
Many thanks to Chelsea Finn, Chris Piech, and Noah Goodman for their guidance. Special thanks to Chris for his support the last three years through the successes and failures towards AI feedback prediction. Also, thanks to Alan Cheng, Milan Mosse, Ali Malik, Yunsung Kim, Juliette Woodrow, Vrinda Vasavada, Jinpeng Song, and John Mitchell for great collaborations. Thank you to Mehran Sahami, Julie Zelenki, Brahm Capoor and the Code in Place team who supported this project. Thank you to the section leaders who provided all the human feedback that the AI was able to learn from. Thank you to the Stanford Institute for Human-Centered Artificial Intelligence (in particular the Hoffman-Yee Research Grant) and the Stanford Interdisciplinary Graduate Fellowship for their support.