To illustrate the inherent uncertainty in unreliable and inaccurate assessments of students’ ability and attainment, I make an analogy with William Tell’s chances of hitting the apple. He could see where the apple was placed and could hit it with precision. Tell was also reliable: his record meant that he was expected to hit the apple every time he tried.
However, a less able fairground marksman might have boasted that he could ‘hit’ the apple to within plus or minus one apple width, with over 90% reliability. This means he is likely to be both less accurate and much less reliable. One apple width too high is a miss that might as well have missed by a mile; one apple width too low and a new volunteer is needed. Taking bets on this would hardly attract the bookies, since they would be more interested in the chances of an actual hit. The 90% reliability boast might hide inaccuracy too often, and the reliability of hitting the actual target could be as low as 40%.
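The arithmetic behind such a boast can be sketched with a short simulation. The spread of 0.9 apple widths used below is purely illustrative, chosen so that roughly nine shots in ten land within one apple width of the target while far fewer actually hit the apple itself:

```python
import random

random.seed(42)

APPLE = 1.0        # width of the target: one apple (or one grade band)
TRIALS = 100_000
SPREAD = 0.9       # shooter's error (standard deviation), in apple widths - illustrative

# Each shot's error from dead centre, drawn from a normal distribution.
errors = [random.gauss(0.0, SPREAD) for _ in range(TRIALS)]

# A "hit" lands on the apple itself; "near" lands within one apple width of it.
hit_rate = sum(abs(e) <= APPLE / 2 for e in errors) / TRIALS
near_rate = sum(abs(e) <= APPLE / 2 + APPLE for e in errors) / TRIALS

print(f"claimed reliability (within one width): {near_rate:.0%}")   # roughly 90%
print(f"actual hits on the apple:               {hit_rate:.0%}")    # roughly 40%
```

The boasted figure and the true hit rate are both properties of the same spread of shots; quoting only the first conceals the second.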
The same weakness was evident in the Ofqual data from 2018, shown here again, and it persisted in the August 2020 report by Ofqual ‘Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: interim report August 2020’. TEFS concluded that Ofqual merely demonstrates that they “get the grades wrong most of the time, but do this with great precision” (reported by TEFS 18th August 2020 ‘Exams 2020 and the demise of Ofqual, who pays the ferryman?’).
As an aside, Tell was less confident of his reliable accuracy: in the story of the apple, he is reported to have kept a spare bolt to hand in case he missed, so that he could use it to kill his accuser, the Habsburg bailiff Albrecht Gessler.
Teachers must guess what Ofqual might validate.
Yet Ofqual persist with the boast that over 90% of their grades are reliable to within plus or minus one grade. This was reinforced by the Acting Ofqual Chief Regulator, Glenys Stacey, who told an Education Committee hearing in September that grades were “reliable to one grade either way”.
This is a peculiar boast, since the Ofqual report in August, ‘Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: interim report’, criticised the likelihood of teachers predicting the grade eventually awarded. Yet Ofqual admit that their own validated grades are, in turn, only reliable to within plus or minus one grade. Their validation relies on comparing a mark given by an ‘ordinary’ examiner with that awarded separately by a ‘senior’ examiner. Teachers were nevertheless expected to do better than that, the argument being: “Not only are those approaches more susceptible to potential bias, it is not possible to validate the accuracy of these approaches due to the absence of equivalent authentic data for the purposes of testing”. In predicting grades, a teacher must in effect guess which of three grades is likely to be awarded in each case. If they predict a B grade for a student and the exam result comes out as A or C, the teacher is deemed to have got it wrong. Ofqual might instead consider that they themselves got it wrong and look again at the inconsistent outcome.
The current approach to examinations is as hopeless as it is misleading. There are terrible and real consequences and ‘casualties’. One grade lower and a student misses out on a career. See something of my personal experience of this, and a lucky escape, in 1972, when the margins at grade boundaries were very tight, in TEFS 17th August 2017 ‘A-Level Playing Field or not: Have things changed over time?’.
In a recent and highly critical posting, ‘No, Minister. England’s school exams are not “the fairest way”’ (London School of Economics 20th November 2020), Dennis Sherwood maintains his strong criticism of the existing examination system that Ofqual is so determined to defend. While Scotland and Wales are having doubts and will not hold examinations in the summer of 2021, England (and Northern Ireland) are pressing ahead. Sherwood’s observation that “the virus is undoubtedly shredding the level playing-field of learning – if it ever existed in the first place” is telling, as he also shreds the claim that “The Prime Minister and Education Secretary are clear that exams will go ahead, as they are the fairest and most accurate way to measure a pupil’s attainment” (quoted in many media outlets, such as FE News on 12th October 2020).
This follows another article by Sherwood on the Higher Education Policy Institute (HEPI) site ‘School exams don’t have to be fair – that’s official’ on 19th November 2020. There he concludes “To me, exam grades should be fair, and a key aspect of ‘fairness’ is reliability. And grades that are ‘reliable to one grade either way’ can never be fair”.
Citing a scathing letter from the Education Committee Chair, Robert Halfon, to the Education Secretary, Gavin Williamson, on 10th November 2020, he notes: “Ofqual was established under the Apprenticeship, Skills, Children and Learning Act 2009 to preserve the integrity of exam qualifications but it has no specific statutory remit to ensure fairness”. This is indeed correct. It seems fairness is neither a requirement of, nor a constraint on, the examination system.
Accuracy trumps maintaining standards.
Most students might expect that both accuracy and reliability must trump maintaining standards as an outcome. It seems logical that hitting the target without casualties should be the main aim. Yet the notion of accuracy presupposes that there is a target that is visible, definitive and does not move. That is, it is an apple; it can be seen and it is placed on one spot. After referring to ‘accuracy’ for some time, Ofqual have more recently avoided that term and use ‘definitive’ instead. That is, they at least define the target. But the idea of accuracy breaks down if someone moves the apple; and if it is obscured in some way, it may not be hit very reliably. The image of a blindfolded fairground marksman comes to mind.
In a paper on the Higher Education Policy Institute site in January 2019, Dennis Sherwood asked ‘1 school exam grade in 4 is wrong. Does this matter?’. The reasoning behind this problem with examinations was set out very well by Sherwood earlier this week in WONKHE with ‘PQA’s underlying – and false – assumption’. In the context of a possible move to post qualification admissions (PQA), he stresses that all marking is inherently “fuzzy”. If marks land close to grade boundaries, then the grades become less reliable, and PQA does nothing to address this.
This idea is not supposition but is taken from data released by Ofqual in November 2018, ‘Marking consistency metrics – An update’. The report offers little by way of explanation, but it sets out a lack of reliability and shows that the probability of awarding a grade correctly drops considerably for marks near grade boundaries. It could be as low as 52% for some students in some subjects: “The probability of receiving the ‘definitive’ qualification grade varies by qualification and subject, from 0.96 (a mathematics qualification) to 0.52 (an English language and literature qualification)”. The marking is calibrated in advance by seeding and by the marking of some whole scripts, or individual items, by senior markers. The apple is defined, but they might move the apple. It is in comparison with these ‘definitive’ marks that the inaccurate results emerge.
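The effect of boundary proximity can be illustrated with a simple calculation. Assuming, purely for illustration, Gaussian marking noise and a ten-mark grade band (these figures are mine, not Ofqual’s), the chance of landing the ‘definitive’ grade falls sharply for a mark close to a boundary:

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def p_definitive(mark, lo, hi, sd):
    """Probability that a mark, re-marked with Gaussian noise of
    standard deviation sd, still falls inside the band [lo, hi)."""
    return phi((hi - mark) / sd) - phi((lo - mark) / sd)

# A ten-mark band with a marking spread of three marks (illustrative values).
mid  = p_definitive(65.0, 60.0, 70.0, 3.0)   # mark in the middle of the band
edge = p_definitive(61.0, 60.0, 70.0, 3.0)   # mark one point above the boundary

print(f"mid-band:      {mid:.2f}")    # around 0.90
print(f"near boundary: {edge:.2f}")   # around 0.63
```

The same marker, with the same consistency, is far less likely to reproduce the ‘definitive’ grade for the candidate sitting just above a boundary.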
After stating “The probability of receiving the definitive grade or adjacent grade is above 0.95 for all qualifications”, Ofqual admits, in a strange triple-negative statement, “This is not to say that there are not components or qualifications where the marking consistency cannot be improved”. They conclude “Through identifying appropriate thresholds of acceptability, exam boards should channel additional resource and support to those components or qualifications which most need improving. Exam boards should, additionally, be looking for opportunities to incrementally improve marking”.
The obvious conclusion is that Ofqual should concentrate first on making their marking much more accurate and reliable across all subjects. By comparison, setting standards by a ‘normalisation’ process is a relatively trivial task. The danger, however, lies in setting a minimum standard for university access and then using a tightening of grade boundaries to restrict numbers. This would go back to the stealth tactics used in the 1970s to define those whom Robbins considered to be in the ‘pool of ability’ (TEFS 17th August 2017 ‘A-Level Playing Field or not: Have things changed over time?’).
The ‘Robbins Principle’ stated in 1963 that Higher Education courses “should be available to all who were qualified for them by ability and attainment”. The question remains today of how we determine the size of the ‘pool’ and who is in the ‘pool of ability’ (Robbins Committee on Higher Education Report, 1963).
Last year, Dennis Sherwood asked the question ‘Students will be given more than 1.5 million wrong GCSE, AS and A level grades this summer. Here are some potential solutions. Which do you prefer?’ (HEPI June 2019). In answer, he offered several solutions for rectifying the reliability problem.
The range of suggestions included restructuring examinations to make them more objective, double marking (commonplace in university examinations), making appeals and remarking more readily available, training examiners better, tighter checks on marks near grade boundaries, and wider boundaries for more subjective topics. Ofqual might consider these, but they do not get around the simple reality that a degree of uncertainty remains. Sherwood suggests it would be better simply to accept the ‘fuzziness’ of marking in general and that any one mark will carry a definable margin of error. This is meant in the scientific sense: any measurement is an estimate with an inherent confidence limit that is a property of the test. It doesn’t mean there is an ‘error’, just that the test isn’t giving absolute values. Therefore, simply state the mark and its error or confidence limits when giving an award.
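Sherwood’s suggestion could be sketched as follows. The band widths, the Gaussian noise assumption and the helper function are all hypothetical, but they show how an award might state a mark with its confidence limits and list every grade consistent with them, rather than pretending to a single definitive grade:

```python
def grade_with_interval(mark, sd, boundaries):
    """Report a mark as an interval (mark +/- margin) and list all grades
    whose band overlaps that interval. Assumes Gaussian marking noise."""
    margin = 1.96 * sd            # approximately 95% confidence limits
    lo, hi = mark - margin, mark + margin
    grades = [g for g, (b_lo, b_hi) in boundaries.items()
              if b_lo < hi and lo < b_hi]
    return (lo, hi), grades

# Hypothetical grade bands (lower bound inclusive, upper bound exclusive).
bands = {"A": (70, 100), "B": (60, 70), "C": (50, 60)}

# A mark of 68 with a marking spread of 3 marks (illustrative figures).
interval, possible = grade_with_interval(68, 3.0, bands)

print(f"mark 68, 95% interval: {interval[0]:.1f} to {interval[1]:.1f}")
print(f"grades consistent with the mark: {possible}")
```

On these illustrative numbers, a mark of 68 would be reported as consistent with both an A and a B, making the uncertainty explicit instead of hiding it behind a single letter.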
A more detailed explanation is in a longer paper from June 2019 ‘The (un)reliability of public examination grades: evidence, explanation, solutions’.
This comes from what some might call the philosophical school of ‘pragmatism’, which simply identifies a problem and devises a solution. If there is a margin of error that can be estimated, then we should acknowledge this and not set a definitive mark or grade as if it magically represents the full or true ability of a candidate. Indeed, all data on a student should be brought into play. Schools have more information on their candidates, and this should be used to check the consistency of the examination results. As a scientist, I would be remiss if I ignored earlier data points that were not consistent with the results of one experiment; indeed, one experiment is hardly definitive. I also tended to approach problems without much in the way of underlying assumptions, other than that there may be explanations and solutions we have yet to observe or think about. The pragmatic approach is not constrained by adherence to any past philosophy or dogma. It works by demonstrable results.
The author takes full responsibility for the interpretation of the arguments of others and makes no promise of reliability. Dennis Sherwood is thanked for offering valuable comments on the text and its 'accuracy'.
Mike Larkin, retired from Queen's University Belfast after 37 years teaching Microbiology, Biochemistry and Genetics. He has served on the Senate and Finance and planning committee of a Russell Group University.