THREE: Accuracy, reliability and the ‘William Tell’ effect

To illustrate the inherent uncertainty in unreliable and inaccurate assessments of the ability and attainment of students, I make an analogy with predicting the accuracy of William Tell hitting the apple. He could see where the apple was placed and could hit it with precise accuracy. Tell was also reliable and his record meant that he was expected to hit the apple every time he tried. 

However, a less able fairground marksman might have boasted that he could ‘hit’ the apple, with + or – one apple width accuracy, at over 90% reliability. This means he is likely to be less accurate and much less reliable. One apple width too high is a miss that might as well have missed by a mile. One apple width too low and a new volunteer is needed. Taking bets on this would hardly attract the bookies since they would be more interested in the chances of an actual hit most of the time. The 90% reliability boast might hide inaccuracy too often and the reliability of hitting the actual target could be as low as 40%. 

This was amply demonstrated in a figure from 2018, shown here again, that persisted in the August 2020 report by Ofqual ‘Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: interim report August 2020’. TEFS concluded that Ofqual merely demonstrates that they “get the grades wrong most of the time, but do this with great precision” (reported by TEFS 18th August 2020 ‘Exams 2020 and the demise of Ofqual, who pays the ferryman?’). 

As an aside, Tell was less confident of his reliable accuracy and in the story of the apple he is reported to have kept a spare bolt on hand in case he missed so he could use it to kill his accuser, the Habsburg bailiff, Albrecht Gessler. 

Teachers must guess what Ofqual might validate. 

Yet Ofqual persist with the boast that over 90% of their grades are reliable to within +/- one grade. This was reinforced in September by the Acting Ofqual Chief Regulator, Glenys Stacey, who noted at an Education Committee hearing that grades were “reliable to one grade either way”.

This is a peculiar boast since the Ofqual report in August ‘Awarding GCSE, AS, A level, advanced extension awards and extended project qualifications in summer 2020: interim report’ criticised the likelihood of teachers predicting grades accurately to the grade awarded. But Ofqual admits its own validated grades are, in turn, reliable only to within +/- one grade. Their validation relies on comparison of a mark made by an ‘ordinary’ examiner to that awarded separately by a ‘senior’ examiner. Yet teachers are expected to be better than that, using the argument: “Not only are those approaches more susceptible to potential bias, it is not possible to validate the accuracy of these approaches due to the absence of equivalent authentic data for the purposes of testing”. In predicting grades, teachers must guess which of three grades is likely to be awarded in each case. If they predict a B grade for a student, the exam result could be A or C and the teacher is judged to have got it wrong. Instead, Ofqual might consider that they themselves got it wrong and look again at the inconsistent outcome. 
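The gap between “within one grade either way” and “the right grade” can be put in numbers. The sketch below uses an invented distribution of grade errors, chosen only to echo the fairground marksman's 90% boast and 40% hit rate; the figures are assumptions for illustration and are not Ofqual data.

```python
# Hypothetical distribution of how far an awarded grade lands from the
# 'definitive' grade, measured in whole grades. Illustrative numbers only.
grade_offset_probability = {-2: 0.05, -1: 0.25, 0: 0.40, 1: 0.25, 2: 0.05}


def probability_within(offsets, tolerance):
    """Probability that the awarded grade is within `tolerance` grades
    of the definitive grade."""
    return sum(p for offset, p in offsets.items() if abs(offset) <= tolerance)


exact = probability_within(grade_offset_probability, 0)
within_one = probability_within(grade_offset_probability, 1)

print(f"Exactly the definitive grade: {exact:.0%}")
print(f"Within one grade either way:  {within_one:.0%}")
```

With these made-up numbers, the same set of results supports both a headline of 90% reliability and the less comfortable statement that most grades miss the target.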

The current approach to examinations is as hopeless as it is misleading. There are terrible and real consequences and ‘casualties’. One grade lower and a student misses out on a career. See something of my personal experience of this, and a lucky escape, in 1972 when the margins of boundaries were very tight, in TEFS 17th August 2017 ‘A-Level Playing Field or not: Have things changed over time?’.

Criticism continues. 

In a highly critical recent posting, ‘No, Minister. England’s school exams are not ‘the fairest way’’ (London School of Economics 20th November 2020), Dennis Sherwood maintains his strong criticism of the existing examination system that Ofqual is so determined to defend. While Scotland and Wales are having doubts and not holding examinations in the summer of 2021, England (and Northern Ireland) are pressing ahead. Sherwood’s observation that “the virus is undoubtedly shredding the level playing-field of learning – if it ever existed in the first place.” is telling as he also shreds the notion that “The Prime Minister and Education Secretary are clear that exams will go ahead, as they are the fairest and most accurate way to measure a pupil’s attainment” (quoted in many media outlets such as FE News on 12th October 2020). 

This follows another article by Sherwood on the Higher Education Policy Institute (HEPI) site ‘School exams don’t have to be fair – that’s official’ on 19th November 2020. There he concludes “To me, exam grades should be fair, and a key aspect of ‘fairness’ is reliability. And grades that are ‘reliable to one grade either way’ can never be fair”. 

Citing a scathing letter from the Education Committee Chair, Robert Halfon, to the Education Secretary, Gavin Williamson, on 10th November 2020, he notes “Ofqual was established under the Apprenticeship, Skills, Children and Learning Act 2009 to preserve the integrity of exam qualifications but it has no specific statutory remit to ensure fairness”. This is indeed correct. It seems fairness is neither a requirement of, nor a constraint on, the examination system. 

Accuracy trumps maintaining standards. 

Most students might expect that both accuracy and reliability must trump maintaining standards as an outcome. It seems logical that hitting the target without casualties should be the main aim. Yet the notion of accuracy presupposes that there is a target that is visible, definitive and does not move. That is, it is an apple, it can be seen and is placed on one spot. After referring to ‘accuracy’ for some time, Ofqual have more recently avoided using ‘accuracy’ as a term and use ‘definitive’ instead. That is, they define the target at least. But there the idea of accuracy breaks down if someone moves the apple. If it is obscured in some way, then it may not be hit very reliably. The idea of a fairground marksman who is blindfolded comes to mind. 

In a paper on the Higher Education Policy Institute site in January 2019, Dennis Sherwood asked ‘1 school exam grade in 4 is wrong. Does this matter?’. The reasoning behind this problem with examinations was set out very well by Sherwood earlier this week in WONKHE with ‘PQA’s underlying – and false – assumption’. In the context of a possible move to post qualification admissions (PQA) he stresses that “all marking is inherently ‘fuzzy’”. If marks land close to grade boundaries, then the resulting grades become less reliable. PQA does nothing to address this. 

This idea is not supposition, but is taken from data released by Ofqual in November 2018 in ‘Marking consistency metrics – An update’. The report offers little by way of explanation but sets out a lack of reliability and shows that the probability of assessing a grade correctly drops considerably for marks near grade boundaries. This could be as low as 52% for some students in some subjects: “The probability of receiving the ‘definitive’ qualification grade varies by qualification and subject, from 0.96 (a mathematics qualification) to 0.52 (an English language and literature qualification)”. The marking is calibrated by seeding and by the marking of some whole scripts, or individual items, by senior markers in advance. The apple is defined, but they might move the apple. It is from comparison with these ‘definitive’ marks that the inaccurate results emerge. 
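Why marks near a boundary are less reliable can be illustrated with a simple model. The sketch below assumes, purely for illustration, that an ordinary examiner's mark is the ‘definitive’ mark plus normally distributed error; the grade band of 60 to 70 marks and the standard deviation of 3 marks are hypothetical and are not taken from the Ofqual report.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def probability_same_grade(definitive_mark, lower_boundary, upper_boundary, marking_sd):
    """Probability that a mark with normal error around `definitive_mark`
    still falls inside the grade band [lower_boundary, upper_boundary).
    Marks are treated as continuous for simplicity."""
    return (normal_cdf(upper_boundary, definitive_mark, marking_sd)
            - normal_cdf(lower_boundary, definitive_mark, marking_sd))


# Hypothetical grade band and marking spread (assumptions, not Ofqual figures).
LOWER, UPPER, MARKING_SD = 60, 70, 3.0

for mark in (65, 62, 60):
    p = probability_same_grade(mark, LOWER, UPPER, MARKING_SD)
    print(f"Definitive mark {mark}: chance of the same grade = {p:.2f}")
```

Under these assumptions a script in the middle of the band usually keeps its grade, while a script sitting on the boundary has roughly an even chance of being awarded the grade below.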

After stating “The probability of receiving the definitive grade or adjacent grade is above 0.95 for all qualifications”, Ofqual admits in a strange triple negative statement “This is not to say that there are not components or qualifications where the marking consistency cannot be improved”. They conclude “Through identifying appropriate thresholds of acceptability, exam boards should channel additional resource and support to those components or qualifications which most need improving. Exam boards should, additionally, be looking for opportunities to incrementally improve marking”. 

The obvious conclusion is that Ofqual should concentrate first on getting their marking much more accurate and reliable across all subjects. By comparison, setting standards by a ‘normalisation’ process is a relatively trivial task. However, the danger lies in setting a minimum standard for university access and then using a tightening of grade boundaries to restrict numbers. This would go back to the stealth tactics used in the 1970s to define those who Robbins considered were in the ‘pool of ability’ (TEFS 17th August 2017 ‘A-Level Playing Field or not: Have things changed over time?’).

The ‘Robbins Principle’ stated in 1963 that Higher Education “should be available to all who were qualified for them by ability and attainment”. The question remains today about how we determine the size of the ‘pool’ and who is in the ‘pool of ability’ (Robbins Committee on Higher Education Report, 1963). 

Possible solutions. 

Last year, Dennis Sherwood asked the question ‘Students will be given more than 1.5 million wrong GCSE, AS and A level grades this summer. Here are some potential solutions. Which do you prefer?’ (HEPI June 2019). In answer he offered several ways to rectify the reliability problem.

The range of suggestions included restructuring examinations to make them more objective, double marking (commonplace in university examinations), making appeals and re-marking more available, training examiners better, tighter checks on marks near grade boundaries and wider boundaries for more subjective topics. Ofqual might consider these, but they do not get around the simple reality that there is a degree of uncertainty. Sherwood suggests it would be better to just accept the ‘fuzziness’ of marking in general and that any one mark will be +/- a definable margin of error. This is defined in the scientific sense in that any measurement is an estimate with an inherent confidence limit that is a property of the test. It doesn't mean there is a mistake, just that the test isn't giving absolute values. Therefore, simply state the mark and its error or confidence limits in giving an award. 
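What stating a mark with its margin might look like can be sketched in a few lines. The grade boundaries, the margin of 3 marks and the function names below are all hypothetical, invented for illustration; this is one possible reading of the suggestion, not Sherwood's own specification.

```python
# Hypothetical minimum marks for each grade, highest grade first.
GRADE_BOUNDARIES = [("A", 70), ("B", 60), ("C", 50), ("D", 40), ("E", 30)]
GRADE_ORDER = [grade for grade, _ in GRADE_BOUNDARIES] + ["U"]


def grade_for(mark):
    """Grade implied by a single mark under the hypothetical boundaries."""
    for grade, minimum in GRADE_BOUNDARIES:
        if mark >= minimum:
            return grade
    return "U"


def compatible_grades(mark, margin):
    """All grades consistent with the interval mark - margin to mark + margin."""
    best = GRADE_ORDER.index(grade_for(mark + margin))
    worst = GRADE_ORDER.index(grade_for(mark - margin))
    return GRADE_ORDER[best:worst + 1]


def award_statement(mark, margin):
    """Report the mark with its margin rather than a single bare grade."""
    grades = "/".join(compatible_grades(mark, margin))
    return f"{mark} +/- {margin} marks (consistent with grade {grades})"


print(award_statement(64, 3))  # comfortably inside one grade band
print(award_statement(61, 3))  # interval straddles a grade boundary
```

The design choice here is that a candidate close to a boundary is reported as consistent with both neighbouring grades, which makes the uncertainty visible instead of hiding it behind a single letter.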

A more detailed explanation is in a longer paper from June 2019 ‘The (un)reliability of public examination grades: evidence, explanation, solutions’.

This comes from what some might call the philosophical school of ‘pragmatism’ that simply identifies a problem and devises a solution. If there is a margin of error that can be estimated, then we should acknowledge this and not set a definitive mark or grade as if it magically represents the full or true ability of a candidate. Indeed, all data on a student should be brought into play. Schools have more information on the candidates, and this should be considered to check the consistency of the examination results. As a scientist, I would be remiss if I ignored earlier data points that were not consistent with the results of one experiment. Indeed, one experiment is hardly definitive. I also tended to approach problems without much in the way of underlying assumptions, other than there may be explanations and solutions we have yet to observe or think about. The pragmatic approach is not constrained by adherence to any past philosophy or dogma. It works by demonstrable results. 

The author takes full responsibility for the interpretation of the arguments of others and makes no promise of reliability. Dennis Sherwood is thanked for offering valuable comments on the text and its 'accuracy'.

Mike Larkin, retired from Queen's University Belfast after 37 years teaching Microbiology, Biochemistry and Genetics. He has served on the Senate and Finance and planning committee of a Russell Group University.
