Top Marks AI Correlates With Edexcel Better Than Humans: IGCSE English Literature - Literary Heritage Question

Richard Davis, CEO: 85-essay analysis shows Top Marks AI achieving a Pearson correlation coefficient of 0.91, besting experienced markers, January 13, 2025

Time and again, we’re asked this crucial question: how accurate are Top Marks’ GCSE English AI marking tools?

So we have been conducting a series of experiments to help provide answers.

In today’s experiment, we will be looking at the Edexcel IGCSE - specifically, the 30-mark Literary Heritage question.

On their website, Edexcel have published 85 exemplar essays for the Literary Heritage question, representing a broad range of answer quality. These essays are made available for standardisation purposes - so teachers can see what various levels of response actually look like in the wild.

We downloaded all 85 of these essays – all handwritten – and put them through our Literary marking tool. Then we measured the correlation between the official marks the board gave each essay and the marks Top Marks AI gave it.

We used a measurement called the Pearson correlation coefficient. In short:

  • A value of 1 would mean perfect correlation -- when one marker gives a high score, the other always does too, and when one gives a low score, the other always does too.
  • A value of 0 means no correlation whatsoever -- knowing one marker's score tells you nothing about what the other marker gave.
  • Negative values would mean the markers systematically disagree -- when one gives high scores, the other gives low scores.
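
If you want to check this sort of figure on your own marking data, the coefficient is straightforward to compute. Here is a minimal Python sketch using NumPy, with made-up marks for six essays rather than our actual dataset:

```python
import numpy as np

# Hypothetical marks out of 30 for six essays (illustrative only --
# not the data from our study).
board_marks = np.array([24, 18, 9, 27, 15, 21])
ai_marks = np.array([23, 19, 11, 26, 14, 22])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two sets of marks.
r = np.corrcoef(board_marks, ai_marks)[0, 1]
print(f"Pearson r = {r:.2f}")
```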

For context, how do humans perform?

What sort of correlation do experienced human markers achieve when marking essays already marked by a chief examiner?

Cambridge Assessment conducted a rigorous study to measure precisely this. 200 GCSE English scripts - which had already been marked by a chief examiner - were sent to a team of experienced human markers. These experienced markers were not told what marks the chief examiner had given these scripts. Nor were they shown any annotations.

The Pearson correlation coefficient between the scores these experienced examiners gave and those of the chief examiner was just below 0.7. This indicated a positive correlation, though one far from perfect. If you are interested, you can find the study here.

How did Top Marks AI perform?

Across the 85 essays, Top Marks achieved a correlation above 0.91 -- an incredibly strong positive correlation that far outperforms the experienced human markers in the Cambridge study. (Like those markers, Top Marks AI was not privy to the “correct marks” or any annotations.)

Moreover, 80% of the marks we gave were within a 10% tolerance of the board’s official mark -- in other words, within 3 marks of it on this 30-mark question.

Another interesting metric is the Mean Absolute Error (MAE), on which Top Marks AI scored 2.4: on average, the AI differed from the board by 2.4 marks, comfortably within tolerance. As a percentage, that’s an average difference of 8%.

In contrast, in that same Cambridge study, experienced examiners marking a 40-mark question showed a Mean Absolute Error of 5.64 marks -- a 14.1% difference. Scaled proportionally to a 30-mark question, that is an MAE of 4.23 marks, the same 14.1%. These results highlight the exceptional accuracy of Top Marks AI compared to traditional marking practices.
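
For the curious, here is how the tolerance and MAE figures above are calculated -- again a minimal Python sketch on made-up marks rather than our actual dataset:

```python
import numpy as np

# Hypothetical marks out of 30 (illustrative only).
board_marks = np.array([24, 18, 9, 27, 15, 21])
ai_marks = np.array([23, 19, 11, 26, 14, 22])
max_mark = 30

errors = np.abs(ai_marks - board_marks)

mae = errors.mean()                     # Mean Absolute Error, in marks
mae_pct = 100 * mae / max_mark          # MAE as a % of the marks available
tolerance = 0.10 * max_mark             # a 10% tolerance = 3 marks out of 30
within_pct = 100 * (errors <= tolerance).mean()

print(f"MAE: {mae:.2f} marks ({mae_pct:.1f}% of the total)")
print(f"Within {tolerance:.0f} marks of the board: {within_pct:.0f}%")
```

(Scaling an MAE between question sizes, as we did above, is just a proportional conversion: 5.64 / 40 = 14.1%, and 14.1% of 30 = 4.23 marks.)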

For transparency, you can also access the 85 exemplars we used here, and see where we sourced them from.

Can I see a graph to help me visualise this?

Absolutely.

First, here’s a scatter graph to show you what a theoretical perfect correlation of 1 would look like:

Perfect Correlation Graph

Now, let’s look at the real-life graph, drawn from the data above:

Actual Correlation Graph

On the horizontal axis, we have the mark given by the exam board; on the vertical, the mark given by Top Marks AI. The individual dots are the essays -- each dot’s position tells us both the mark given by the exam board and the mark given by Top Marks AI. You can see how closely it resembles the theoretical graph depicting perfect correlation.
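
If you would like to draw the same kind of scatter graph from your own data, here is a minimal matplotlib sketch, once more using made-up marks rather than our actual dataset:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical marks out of 30 (illustrative only).
board_marks = np.array([24, 18, 9, 27, 15, 21])
ai_marks = np.array([23, 19, 11, 26, 14, 22])

plt.scatter(board_marks, ai_marks)           # one dot per essay
plt.plot([0, 30], [0, 30], linestyle="--")   # the line of perfect agreement, y = x
plt.xlabel("Mark given by exam board")
plt.ylabel("Mark given by Top Marks AI")
plt.title("Top Marks AI vs. exam board marks")
plt.show()
```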

The Handwriting Factor

As mentioned, all the essays we downloaded were handwritten. That Top Marks was able to correlate so closely with the official board marks indicates not only its marking efficacy but also the strength of its transcription technology.

It’s worth noting that the largest outlier in our data—Exemplar 29, with a difference of 10.2 marks—is explicitly identified in the board materials as having been written by a dyslexic student. This suggests that the discrepancy may stem from challenges in transcribing the handwriting, rather than any shortfall in the marking capabilities of Top Marks AI.

Discover how Top Marks AI can revolutionise assessment in education. Contact us at info@topmarks.ai.