The Dangers of Mixing Student Data and Machine Learning

Machine learning is capable of amazing things. Speech recognition was a fragile novelty fifteen years ago and now it’s ubiquitous. Self-driving cars are on the verge of breaking through. Chess and Go are now mastered by machines. At the same time we are gathering unprecedented amounts of data on our students. We track their behavior in class and their usage of the Learning Management System (LMS) outside class. We measure their performance through exam scores, quiz scores, answers to in-class questions, and evaluations of their writing. To supplement this information, we have demographics, surveys, and measures of their performance in other classes. It seems obvious that combining machine learning with all this data should yield important insights into student learning, and in fact edtech companies, from the smallest startups to the biggest players, are investing big money to do exactly this. And I think it’s really dangerous.

Machine learning is very good at prediction. It identifies which combinations of values across a large number of variables are associated with particular outcomes. For example: males, ages 12–18, who play video games are likely to enjoy Marvel movies. These predictions can be highly accurate, but the patterns behind them are often not easily phrased in human language. It’s as if the algorithm says “Trust me: I know from looking at the data that people with characteristics like Bob’s prefer Marvel movies to DC movies. Just don’t ask me why.” We’re only slowly figuring out how to summarize these patterns in ways that are useful beyond pure “trust me” predictions. Without insight into “why?” I’m not sure how much we can learn about student learning.
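To make that “trust me” flavor concrete, here’s a toy sketch in Python. Everything in it is invented for illustration (the features, the made-up Marvel rule, the person named Bob); the point is only that a black-box model can score a new case without offering any human-readable reason.

```python
# Toy illustration only: every variable and number below is invented.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Made-up features: age, weekly video-game hours, male indicator
X = np.column_stack([
    rng.integers(10, 60, n),   # age
    rng.poisson(5, n),         # weekly gaming hours
    rng.integers(0, 2, n),     # male = 1
])
# Made-up outcome: prefers Marvel movies
prefers_marvel = ((X[:, 0] >= 12) & (X[:, 0] <= 18) &
                  (X[:, 1] > 3) & (X[:, 2] == 1)).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, prefers_marvel)

# The model happily scores a new person ("Bob")...
bob = np.array([[15, 10, 1]])
print(model.predict_proba(bob))

# ...but the fitted object is hundreds of decision trees. It predicts;
# it doesn't explain. That's the "trust me" problem.
```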

The bigger problem is that correlation does not equal causation. Doctors talk about risk factors for a disease. They don’t explicitly say that old age, fatty foods, and a passive lifestyle cause heart disease, even though those are strong predictors. Social scientists work extremely hard to figure out when an observed correlation is a causal effect. Vitamin D is unambiguously associated with great health outcomes, but a large study recently found that the relationship isn’t causal. Instead, people with high levels of vitamin D are those who spend more time outside, and it seems to be the outdoor physical activity that has the positive causal effect on health. That is, even though the positive association exists, supplementing people’s diets with vitamin D has no effect on their health.
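Here’s a minimal simulation of that kind of confounding, with numbers I made up rather than anything from the study: outdoor time drives both vitamin D levels and health, vitamin D itself does nothing, and yet the observational association comes out strongly positive.

```python
# Toy numbers, not the study's: a hidden common cause creates a non-causal link.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

outdoor_time = rng.normal(0, 1, n)                      # hidden common cause
vitamin_d    = 2 * outdoor_time + rng.normal(0, 1, n)   # driven by outdoor time
health       = 3 * outdoor_time + rng.normal(0, 1, n)   # also driven by outdoor time
                                                        # (vitamin D never appears here)

# Naive observational regression of health on vitamin D: a clear positive slope.
slope = np.cov(vitamin_d, health)[0, 1] / np.var(vitamin_d)
print(f"observed association: {slope:.2f}")  # about 1.2

# But because vitamin D has no term in the health equation, handing everyone a
# supplement (raising vitamin_d while outdoor_time stays fixed) changes nothing.
```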

In my own classes, the students who spend the most time studying often aren’t the ones earning the highest exam scores. If I were to interpret this as a causal effect, I would want to discourage them from studying so much. But that reading ignores the fact that the students who study the most are often starting with weaker skills than other students, and they are studying hard in order to catch up. It’s also possible that the students who study most are studying inefficiently.

Here’s another example: Students who attend my scheduled office hours tend to do better on my exams. It’s so seductive to interpret this as evidence of the value of my one-on-one teaching, but that would ignore what econometricians call selection bias: The students who attend office hours are often the most curious and hard-working, and they would do better than other students even if I wrote gibberish on my blackboard and recited bad poetry when they came to my office.

The best-case scenario is that unleashing machine learning on student data identifies students at risk and allows us to focus our teaching energy on figuring out what those students need in order to succeed. It’s also possible that as the technology improves it will generate interesting hypotheses about the causal determinants of academic success. But we will need to be very careful not to overreach. What if we find that students who regularly interact with the LMS during the semester are more likely to get A’s? Does this mean we should push all students to do so? If the relationship is causal, then yes. Perhaps this spaced interaction induces more learning than cramming right before an exam. But it’s equally likely that students who have lots of other good study habits are the ones driving this positive association, and it’s those other good study habits (which we don’t observe) that actually induce more learning. In that case, just encouraging (or forcing) students to interact with the LMS more regularly would have no effect at all. It could even have a negative effect if students shift their effort away from more constructive activities.
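A quick toy simulation of that LMS scenario (again, entirely invented numbers) shows how far apart the observed association and the effect of an actual intervention can be when an unobserved habit variable is doing the work.

```python
# Entirely invented numbers: unobserved good study habits drive both regular
# LMS use and grades, so the observed gap is large even though a forced "use
# the LMS" push would accomplish nothing.
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

good_habits = rng.normal(0, 1, n)                          # unobserved by us
lms_regular = good_habits + rng.normal(0, 1, n) > 0        # who uses the LMS regularly
grade       = 70 + 8 * good_habits + rng.normal(0, 5, n)   # habits matter, LMS use doesn't

# Observational comparison: regular LMS users look much better.
print("observed gap:", grade[lms_regular].mean() - grade[~lms_regular].mean())

# Hypothetical randomized experiment: push a random half of students onto the LMS.
pushed = rng.integers(0, 2, n).astype(bool)
grade_after_push = 70 + 8 * good_habits + rng.normal(0, 5, n)  # the push changes nothing
print("experimental effect:",
      grade_after_push[pushed].mean() - grade_after_push[~pushed].mean())

# The first gap is large (several points); the second hovers near zero. Only the
# experiment tells us what pushing students onto the LMS would actually do.
```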

At the beginning of the term, most of my students walk in the door of my econometrics classroom knowing that correlation does not always equal causation. They spend the next several weeks learning methods that can tease out the difference through carefully designed experiments or careful analysis of observational data. Machine learning is great for prediction, but right now it’s lousy for learning how causal processes work. And it’s knowledge of how the learning process actually works that we need to improve our teaching.