
THREE: Accuracy, reliability and the ‘William Tell’ effect

To illustrate the inherent uncertainty in unreliable and inaccurate assessments of the ability and attainment of students, I make an analogy with predicting the accuracy of William Tell hitting the apple. He could see where the apple was placed and could hit it with precise accuracy. Tell was also reliable and his record meant that he was expected to hit the apple every time he tried. 

However, a less able fairground marksman might have boasted that he could ‘hit’ the apple, with + or – one apple width accuracy, at over 90% reliability. This means he is likely to be less accurate and much less reliable. One apple width too high is a miss that might as well have missed by a mile. One apple width too low and a new volunteer is needed. Taking bets on this would hardly attract the bookies since they would be more interested in the chances of an actual hit most of the time. The 90% reliability boast might hide inaccuracy too often and the reliability of hitting the actual target could be as low as 40%. 
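The gap between the fairground marksman's boast and his real chance of a hit can be put into rough numbers. The sketch below is an illustration only. It assumes, purely for the sake of argument, that his vertical error is normally distributed around the centre of the apple, and it reads "within one apple width" as landing on the apple or up to one width above or below it. The function name `hit_probability` and its parameters are hypothetical.

```python
from statistics import NormalDist


def hit_probability(boast: float, apple_width: float = 1.0) -> float:
    """Chance of hitting the apple itself, given a 'within one apple width' boast.

    Assumption (not from the source): vertical error is normally distributed
    around the centre of the apple.
    """
    # "Within one apple width" is read here as: on the apple, or up to one
    # width above or below it, i.e. within 1.5 widths of the centre.
    tolerance = 1.5 * apple_width
    # z-score of the two-sided interval that contains the boasted share of shots
    z = NormalDist().inv_cdf(0.5 + boast / 2)
    sigma = tolerance / z
    shot = NormalDist(mu=0.0, sigma=sigma)
    half = apple_width / 2
    # Probability that the shot lands within half a width of the centre
    return shot.cdf(half) - shot.cdf(-half)


print(round(hit_probability(0.90), 2))
```

Under these assumptions, a 90% boast corresponds to an actual hit rate only a little above 40%, which is in line with the figure suggested above. A different error distribution would give a different number.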

This was amply demonstrated in a figure from 2018, shown here again, that persisted in the August 2020 report by Ofqual ‘Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: interim report August 2020’. TEFS concluded that Ofqual merely demonstrates that they “get the grades wrong most of the time, but do this with great precision” (reported by TEFS 18th August 2020 ‘Exams 2020 and the demise of Ofqual, who pays the ferryman?’). 

As an aside, Tell was less confident of his reliable accuracy and in the story of the apple he is reported to have kept a spare bolt on hand in case he missed so he could use it to kill his accuser, the Habsburg bailiff, Albrecht Gessler. 

Teachers must guess what Ofqual might validate. 

Yet Ofqual persists with the boast that over 90% of its grades are reliable to within +/- one grade. This was reinforced in September by the Acting Ofqual Chief Regulator, Glenys Stacey, who noted at an Education Committee hearing that grades were “reliable to one grade either way”.

This is a peculiar boast, since the Ofqual report in August ‘Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: interim report’ cast doubt on the likelihood of teachers predicting the awarded grade accurately. Yet Ofqual admits that its own validated grades are, in turn, reliable only to within +/- one grade. Its validation relies on comparing the mark given by an ‘ordinary’ examiner with that awarded separately by a ‘senior’ examiner. Teachers, however, are expected to do better than that, on the argument that “Not only are those approaches more susceptible to potential bias, it is not possible to validate the accuracy of these approaches due to the absence of equivalent authentic data for the purposes of testing”. In predicting grades, teachers must in effect guess which of three grades will be awarded in each case. If they predict a B grade for a student, the exam result could be A or C, and the teacher is then deemed to have got it wrong. Instead, Ofqual might consider that the error was its own and look again at the inconsistent outcome. 

The current approach to examinations is as hopeless as it is misleading. There are terrible and real consequences and ‘casualties’. One grade lower and a student misses out on a career. For something of my personal experience of this, and a lucky escape in 1972 when the margins at grade boundaries were very tight, see TEFS 17th August 2017 ‘A-Level Playing Field or not: Have things changed over time?’.

Criticism continues. 

In a recent and highly critical posting, ‘No, Minister. England’s school exams are not ‘the fairest way’’ (London School of Economics 20th November 2020), Dennis Sherwood maintains his strong criticism of the existing examination system that Ofqual is so determined to defend. While Scotland and Wales are having doubts and not holding examinations in the summer of 2021, England (and Northern Ireland) are pressing ahead. Sherwood’s observation that “the virus is undoubtedly shredding the level playing-field of learning – if it ever existed in the first place.” is telling, as he also shreds the notion that “The Prime Minister and Education Secretary are clear that exams will go ahead, as they are the fairest and most accurate way to measure a pupil’s attainment” (quoted in many media outlets such as FE News on 12th October 2020). 

This follows another article by Sherwood on the Higher Education Policy Institute (HEPI) site ‘School exams don’t have to be fair – that’s official’ on 19th November 2020. There he concludes “To me, exam grades should be fair, and a key aspect of ‘fairness’ is reliability. And grades that are ‘reliable to one grade either way’ can never be fair”. 

Citing a scathing letter from the Education Committee Chair, Robert Halfon, to the Education Secretary, Gavin Williamson, on 10th November 2020, he notes “Ofqual was established under the Apprenticeship, Skills, Children and Learning Act 2009 to preserve the integrity of exam qualifications but it has no specific statutory remit to ensure fairness”. This is indeed correct. It seems fairness is neither a requirement of, nor a constraint on, the examination system. 

Accuracy trumps maintaining standards. 

Most students might expect that both accuracy and reliability must trump maintaining standards as an outcome. It seems logical that hitting the target without casualties should be the main aim. Yet the notion of accuracy presupposes that there is a target that is visible, definitive and does not move. That is, it is an apple, it can be seen and is placed on one spot. After referring to ‘accuracy’ for some time, Ofqual have more recently avoided using ‘accuracy’ as a term and use ‘definitive’ instead. That is, they define the target at least. But there the idea of accuracy breaks down if someone moves the apple. If it is obscured in some way, then it may not be hit very reliably. The idea of a fairground marksman who is blindfolded comes to mind. 

In a paper on the Higher Education Policy Institute site in January 2019, Dennis Sherwood asked ‘1 school exam grade in 4 is wrong. Does this matter?’. The reasoning behind this problem with examinations was set out very well by Sherwood earlier this week in WONKHE with ‘PQA’s underlying – and false – assumption’. In the context of a possible move to post qualification admissions (PQA), he stresses that “all marking is inherently ‘fuzzy’”. If marks land close to grade boundaries, then they become less reliable. PQA does nothing to address this. 

This idea is not supposition but is taken from data released by Ofqual in November 2018, ‘Marking consistency metrics – An update’. The report offers little by way of explanation but sets out a lack of reliability and shows that the probability of assessing a grade correctly drops considerably for marks near grade boundaries. This could be as low as 52% for some students in some subjects: “The probability of receiving the ‘definitive’ qualification grade varies by qualification and subject, from 0.96 (a mathematics qualification) to 0.52 (an English language and literature qualification)”. The marking is calibrated by seeding and by having senior markers mark some whole scripts, or individual items, in advance. The apple is defined, but they might move the apple. It is in comparison with these ‘definitive’ marks that the inaccurate results emerge. 
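Why reliability falls away near a grade boundary can be shown with a toy model. The sketch below is not Ofqual's method. It assumes that an ordinary examiner's mark is the ‘definitive’ mark plus normally distributed noise, treats marks as continuous, and uses invented grade boundaries and an invented marking spread. The name `prob_definitive_grade` and all the numbers are hypothetical.

```python
from bisect import bisect_right
from statistics import NormalDist


def prob_definitive_grade(definitive_mark, boundaries, marking_sd):
    """Probability that the awarded mark falls in the same grade band
    as the definitive mark.

    Assumptions (illustrative only): awarded mark = definitive mark plus
    normal noise with standard deviation marking_sd; boundaries is an
    ascending list of the lowest mark for each grade above the bottom one.
    """
    band = bisect_right(boundaries, definitive_mark)
    lower = boundaries[band - 1] if band > 0 else float("-inf")
    upper = boundaries[band] if band < len(boundaries) else float("inf")
    awarded = NormalDist(mu=definitive_mark, sigma=marking_sd)
    return awarded.cdf(upper) - awarded.cdf(lower)


# Hypothetical boundaries and marking spread, chosen only for illustration
boundaries = [40, 50, 60, 70]
for mark in (55, 59, 50):
    print(mark, round(prob_definitive_grade(mark, boundaries, marking_sd=3), 2))
```

In this model a script in the middle of a band is graded consistently far more often than one sitting next to a boundary, and a script exactly on a boundary has only about an even chance of receiving the definitive grade.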

After stating “The probability of receiving the definitive grade or adjacent grade is above 0.95 for all qualifications”, Ofqual admits in a strange triple negative statement “This is not to say that there are not components or qualifications where the marking consistency cannot be improved”. They conclude “Through identifying appropriate thresholds of acceptability, exam boards should channel additional resource and support to those components or qualifications which most need improving. Exam boards should, additionally, be looking for opportunities to incrementally improve marking”. 

The obvious conclusion is that Ofqual should concentrate first on getting their marking much more accurate and reliable across all subjects. By comparison, setting standards by a ‘normalisation’ process is a relatively trivial task. However, the danger lies in setting a minimum standard for university access and then using a tightening of grade boundaries to restrict numbers. This would go back to the stealth tactics used in the 1970s to define those who Robbins considered were in the ‘pool of ability’ (TEFS 17th August 2017 ‘A-Level Playing Field or not: Have things changed over time?’).

The ‘Robbins Principle’ stated in 1963 that Higher Education “should be available to all who were qualified for them by ability and attainment”. The question remains today about how we determine the size of the ‘pool’ and who is in the ‘pool of ability’ (Robbins Committee on Higher Education Report, 1963). 

Possible solutions. 

Last year, Dennis Sherwood asked the question ‘Students will be given more than 1.5 million wrong GCSE, AS and A level grades this summer. Here are some potential solutions. Which do you prefer?’ (HEPI June 2019). In answer, he offered several ways to rectify the reliability problem.

The range of suggestions included restructuring examinations to make them more objective, double marking (commonplace in university examinations), making appeals and remarking more widely available, training examiners better, applying tighter checks on marks near grade boundaries and setting wider boundaries for more subjective topics. Ofqual might consider these, but they do not get around the simple reality that there is a degree of uncertainty. Sherwood suggests it would be better to just accept the ‘fuzziness’ of marking in general and that any one mark will be +/- a definable margin of error. This is defined in the scientific sense, in that any measurement is an estimate with an inherent confidence limit that is a property of the test. It doesn't mean there is 'error', just that the test isn't giving absolute values. The solution, therefore, is simply to state the mark together with its error or confidence limits when giving an award. 
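As a minimal sketch of what stating a mark with its margin might look like, the following lists every grade that is consistent with a mark once its margin of error is taken into account. The boundaries, grade labels, margin and the function names `grades_within_margin` and `describe_award` are all hypothetical and are not taken from Sherwood's papers or from Ofqual.

```python
from bisect import bisect_right


def grades_within_margin(mark, margin, boundaries, labels):
    """Grades consistent with mark +/- margin.

    boundaries: ascending list of the lowest mark for each grade above the
    bottom one; labels: grade names from lowest to highest, so
    len(labels) == len(boundaries) + 1. All values here are illustrative.
    """
    lowest = bisect_right(boundaries, mark - margin)
    highest = bisect_right(boundaries, mark + margin)
    return labels[lowest:highest + 1]


def describe_award(mark, margin, boundaries, labels):
    """State the mark, its margin and the grade or grades it supports."""
    grades = grades_within_margin(mark, margin, boundaries, labels)
    return f"{mark} +/- {margin} marks (grade {' or '.join(grades)})"


# Hypothetical boundaries and labels, chosen only for illustration
boundaries = [40, 50, 60, 70]
labels = ["E", "D", "C", "B", "A"]
print(describe_award(58, 3, boundaries, labels))
print(describe_award(55, 3, boundaries, labels))
```

A mark well inside a band still reports a single grade, while a mark close to a boundary is reported honestly as lying between two.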

A more detailed explanation is in a longer paper from June 2019 ‘The (un)reliability of public examination grades: evidence, explanation, solutions’.

This comes from what some might call the philosophical school of ‘pragmatism’ that simply identifies a problem and devises a solution. If there is a margin of error that can be estimated, then we should acknowledge this and not set a definitive mark or grade as if it magically represents the full or true ability of a candidate. Indeed, all data on a student should be brought into play. Schools have more information on the candidates, and this should be considered to check the consistency of the examination results. As a scientist, I would be remiss if I ignored earlier data points that were not consistent with the results of one experiment. Indeed, one experiment is hardly definitive. I also tended to approach problems without much in the way of underlying assumptions, other than there may be explanations and solutions we have yet to observe or think about. The pragmatic approach is not constrained by adherence to any past philosophy or dogma. It works by demonstrable results. 

The author takes full responsibility for the interpretation of the arguments of others and makes no promise of reliability. Dennis Sherwood is thanked for offering valuable comments on the text and its 'accuracy'.

Mike Larkin, retired from Queen's University Belfast after 37 years teaching Microbiology, Biochemistry and Genetics. He has served on the Senate and Finance and planning committee of a Russell Group University.
