IB, headquartered in Geneva, opted to use a statistical formula instead—adding to the growing list of tech fixes proposed to automate away fallout from the pandemic. The workings of the IB diploma—and the timing of the results—proved particularly harmful for IB students applying to US colleges. Unlike AP tests, which are typically separate from high school grades, the IB results are intended to reflect a student’s work for the year. IB students are often granted college admission based on predicted grades, and they submit their final results when they become available over the summer. Some colleges, including NYU and Northeastern, warn on their admissions pages that students whose IB results don’t get close enough to those predictions may lose their place.
In normal times, IB diploma students select six subjects, from options such as physics and philosophy, and receive final grades determined in part by assignments but mostly by written tests administered in the spring. The program is offered by nearly 900 public schools in the US and is common in international schools around the world. In March, IB canceled all tests and said it would calculate each student’s final grades using a method developed by an unnamed educational organization that specializes in data analysis.
The idea was to use prior patterns to infer what a student would have scored in a 2020 not dominated by a deadly pandemic. IB did not disclose details of the methodology but said grades would be calculated based on a student’s assignment scores, predicted grades, and historical IB results from their school. The foundation said grade boundaries were set to reflect the challenges of remote learning during a pandemic. For schools where historical data was lacking, predictions would build on data pooled from other schools instead.
In a video IB posted about the process, Antony Furlong, the foundation’s manager for assessment research and design, said the system essentially created “a bespoke equation” for every school.
One visual arts teacher at a US school says what she and coworkers have seen suggests it wasn’t well tailored. “When I saw the marks, I was floored,” she says. “I am always conservative in my predicted grades, but every single student except one were downgraded.” Of 15 students she works with, four have to rethink their plans for this fall, because they missed out on college places, something she didn’t expect for any of them.
Determining whether IB’s system had flaws is challenging without knowing its formula or the inputs and outputs. The fact that some people dislike the output of a data analysis doesn’t mean the analysis is incorrect. But Suresh Venkatasubramanian, a professor at the University of Utah who studies the social consequences of automated decisionmaking, says it appears IB could have deployed its system more responsibly. “All this points to what happens when you try to install some sort of automated process without transparency,” he says. “The burden of proof should be on the system to justify its existence.”
Data analysis is more powerful than ever but remains far from being able to predict complex future human actions. Models that extrapolate from past statistical trends can end up treating individuals unfairly when their circumstances differ from the historical pattern, even if the results match past patterns on average.
Venkatasubramanian says that basing a student’s grades on past trends at their school, potentially unrelated to the student’s own school career, could be unfair. Using data from other schools—as IB did for schools with little track record—is a “red flag,” he says, because it would mean some students’ grades were calculated differently than others.
Constance Lavergne, whose son in the UK received lower-than-expected IB grades and missed out on his preferred college, is one of many parents struggling to understand what happened. She says her experience working closely with data analysts in the tech industry makes her suspicious of IB’s methodology. It would naturally generate noisier results for smaller classes, like her son’s, because they offer fewer past data points, she suggests. “There’s something wrong with the algorithm,” Lavergne says.
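Lavergne's point about small classes can be illustrated with a short simulation. The sketch below is not IB's method, which was not disclosed. It assumes, purely for illustration, that a school's historical average is estimated from past grades on the 1-to-7 IB scale, and that individual grades vary around that average with a standard deviation of one grade point. The function names and the class sizes are hypothetical.

```python
import random
import statistics


def standard_error(score_sd: float, n_students: int) -> float:
    """Textbook standard error of an average taken over n_students results."""
    return score_sd / n_students ** 0.5


def simulated_spread(score_sd: float, n_students: int,
                     trials: int = 2000, seed: int = 0) -> float:
    """Estimate how much a class average wobbles from chance alone.

    Each trial draws n_students grade deviations from a normal distribution
    (an assumption made for illustration) and records their mean. The return
    value is the standard deviation of those means across all trials.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = [
        statistics.fmean(rng.gauss(0.0, score_sd) for _ in range(n_students))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)


if __name__ == "__main__":
    assumed_sd = 1.0  # assumption: grades vary by about one point
    for class_size in (4, 25, 100):  # hypothetical class sizes
        print(class_size,
              round(standard_error(assumed_sd, class_size), 3),
              round(simulated_spread(assumed_sd, class_size), 3))
```

Under these assumptions, an average built from four past results is about five times as uncertain as one built from 100, which is the kind of extra noise Lavergne describes for small classes. Whether IB's formula behaved this way cannot be determined from what the foundation has published.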