10 things I still hate about data

I haven’t blogged for a while. But then I turned 50 (today!) and thought: what better way to spend a significant birthday than to have a proper rant about stuff that annoys me? I sat down to write a list (I have to sit down to write a list. I’m 50) and realised, somewhat unsurprisingly, there is still plenty of nonsense that deserves a good old public berating. Some are old classics – I’m looking at you ‘flightpaths’ – and some haven’t even happened yet, but they are all equally annoying. So, without further ado (I can say ‘ado’ now. I’m 50), here are 10 things I still hate about data.

Target Setting

Everyone loves a target. The problem is that everyone wants a good one. Pupils and parents see targets as a done deal: the higher the target, the higher the grade the pupil will achieve, as if the grade written on a piece of paper is fated.

The bigger issue with targets is the way in which they are often derived. Secondary schools generate targets by drawing a straight line between KS2 results and GCSE grades, and we are then into the murky territory of GCSE grades being used across KS3 and KS4. These lines usually represent some sort of average gradient: the average grades achieved by pupils with particular starting points. But the problem with averages is that only around half of pupils meet or exceed them; the other half fall short. Up and down the country, every department in every school is expecting of every pupil what, by definition, only around half will achieve.

The next problem is trying to predict what grade pupils are likely to achieve so we can have an ongoing indication of which pupils are on-track or not on-track to meet their targets. If there’s one thing that the disruption of 2020 and 2021 has taught us, it’s that predicting GCSE outcomes is not exactly straightforward. Even without any inflation, it’s just not possible for all pupils to achieve the grades their teachers think they deserve.

And let’s not forget primary schools. Here, there really is just one target and that’s to meet expected standards. ‘Working Towards’ is not a target – we cannot target pupils to fall short of expected standards even if we think that they will. And we certainly can’t track towards targeted scaled scores (unless, of course, pupils sit multiple practice tests across Year 6, but even that is no guarantee of outcome).

Let’s be honest about targets. At the risk of sounding all Mr Popper’s Penguins, if their purpose is problematic, and they are simply predictions based on probability, then perhaps ponder the principle of the process. If you want to read more on why targets are a problem, here’s Ben Newmark’s blog post on the subject.
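To make the ‘average gradient’ point concrete, here is a minimal sketch of the kind of straight-line target generation described above. The lookup values are entirely hypothetical (not any published transition matrix), and real systems use finer-grained data, but the principle is the same: the target is, by construction, the average outcome for that starting point, so roughly half of pupils will fall short of it.

```python
# A minimal, hypothetical sketch of straight-line target setting. The lookup
# below is illustrative only: the average GCSE grade achieved by pupils with
# a given KS2 scaled score (not real figures).

AVERAGE_GCSE_BY_KS2 = {
    90: 2.1, 95: 3.0, 100: 4.2, 105: 5.3, 110: 6.5, 115: 7.4, 120: 8.2,
}

def straight_line_target(ks2_score: float) -> float:
    """Interpolate an 'average gradient' GCSE target from a KS2 scaled score."""
    points = sorted(AVERAGE_GCSE_BY_KS2.items())
    # Clamp to the range of the lookup so we never extrapolate.
    ks2_score = max(points[0][0], min(points[-1][0], ks2_score))
    lo_score, lo_grade = points[0]
    for hi_score, hi_grade in points[1:]:
        if ks2_score <= hi_score:
            fraction = (ks2_score - lo_score) / (hi_score - lo_score)
            return round(lo_grade + fraction * (hi_grade - lo_grade), 1)
        lo_score, lo_grade = hi_score, hi_grade
    return points[-1][1]

print(straight_line_target(103))  # ~4.9: the *average* grade for this start point
```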

Flightpaths

Assessment without levels is a lie. Levels are alive and well and working overtime in a school near you. Sure, they might be called steps, or stages, or bands, but if it quacks like a duck… Levels were removed for a number of good reasons: they labelled children, they told us nothing about what pupils could or couldn’t do, they implied comparability between subjects, they encouraged pace at the expense of depth, they gave the illusion of linear progression. An entire accountability system was built around that last point, a system that demanded all pupils make two levels of progress across key stage 2 and three levels across key stages 3 and 4, ignoring the fact that progressing from level 1 to 3 was not the same as progressing from 3 to 5. It got even madder when we started talking about pupils making one level every two years, or one and a half sublevels every year, or a point per term. In the end, we’d gone so far down the rabbit hole, we’d convinced ourselves that 3b+ was a real thing.

So, levels were ditched but inevitably, in the vacuum left by their absence, all manner of crazy systems bloomed. Secondary schools have their GCSE flightpaths, with grades often split into -/=/+ subdivisions. These grades may be ‘working at’ – the grade the pupil would supposedly get if they took the exam now – or ‘working towards’ – the grade they are on-track to achieve at the end of KS4. Some schools have both, and some have targets as well as predictions, and some subtract one from the other to create some kind of progress measure, which is basically alchemy.

Meanwhile, primary schools have 3-, 4-, 6-, or even 10-step-per-year approaches, which are usually variations on the ‘emerging, developing, secure’ theme (emerging means autumn, by the way), perhaps with a ‘mastery’ band reserved for the end of the year. Because, you know, mastery is what bright kids do after Easter.

The problem is progress. Despite knowing deep down that progress cannot really be measured – unless you use standardised assessment, and even then it’s problematic – schools really want progress measures and are willing to suspend their disbelief in pursuit of that goal. Unfortunately, progress measures break assessment every time. They broke levels, they broke P scales, they broke those Early Years Development Matters age/stage bands (hint: they are not discrete, they overlap!), and they will break your assessment system. I suggest simply recording whether pupils are working below, within, or above curriculum expectations. You really don’t need an 86-step flightpath. Honestly.

Measuring the progress of pupils with SEND

Even those schools that have moved away from levels-style systems for general assessment still feel the need – or perhaps are under pressure – to maintain such approaches for pupils with SEND. “We have to show the small steps of progress” is a common refrain. But what is a small step? What is a common unit of progress for pupils with SEND? Even Ofsted now admit they got it wrong when it came to SEND data: ‘Because of the often vastly different types of pupils’ needs, inspectors will not compare the outcomes achieved by pupils with SEND with those achieved by other pupils with SEND in the school, locally or nationally‘ (paragraph 362, Ofsted Handbook).

If we can’t accurately measure the progress of pupils in general, how can we possibly come up with a common system to measure the progress of pupils with SEND? It will inevitably involve some sort of flightpath with a large number of mini-steps and a hardwired expected rate of progress. This is exactly why P scales became damaged beyond repair, and sadly there are already schools that are breaking up pre-key stage standards into sublevels and attempting to use them for tracking and measuring progress. But we all know that ‘SEND’ is not some homogeneous group that can be lumped together for convenience – what constitutes good progress for one pupil may not be what we’d expect of another.

Consider this from the Engagement Model: ‘Progress for these pupils can also be variable. They may make progress for a period, but then either plateau or lose some of the gains they have made, before progress starts again. These patterns of progress are typical for pupils who are not engaged in subject specific study. Preventing or slowing a decline in the pupils’ performance may also be an appropriate outcome of intervention.’ Think about that: good progress may be defined as declining at a lesser rate than would have occurred without intervention. There is no such thing as linear progression, and this is especially the case when it comes to pupils with SEND, but there are agencies out there that still believe, or pretend, that this is the case, and it is painful that many schools are essentially having to make data up to complete a form to secure funding. Of all the points in this top 10 list, this is probably the one that annoys me the most.

The whole SEND vs EAL progress thing

I’m not advocating a return to CVA – CVA was overcomplicated, made allowances for the relatively low achievement of certain groups, and possibly resulted in a widening of the gap between the most disadvantaged and others – but there are big problems with the current system of VA measures. Perhaps the biggest of these issues is that pupils with similarly low start points but with very different characteristics (e.g. SEND and EAL) are compared to one another. Because the system takes no account of context, it cannot differentiate between pupils except on the basis of prior attainment, and if pupils have the same prior attainment they will be placed into the same prior attainment group (PAG). The end of key stage benchmarks that these pupils are compared against at KS2 and KS4 are therefore the average of the outcomes of a potentially extremely diverse group. These benchmarks end up being too high for one subgroup of the PAG (i.e. those with SEND) and way too low for the other (i.e. those with EAL). This is because EAL pupils tend to have an inflationary effect on benchmarks; pupils with SEND have the opposite effect, and this can result in some extraordinary pupil-level progress scores.

Indeed, the highest KS2 progress score I’ve ever seen was for an EAL pupil who had a KS1 average score of 6 and therefore slotted into a low PAG with a KS2 maths benchmark of 85. They ended up scoring 118 on the maths test and their progress score was +33. That single score added an extra point to the cohort’s overall progress score, which is enough to make the difference between a cohort categorised as ‘average’ and one that is ‘significantly above’.

Meanwhile, pupils with SEND tend to get negative progress scores, and it is almost impossible for such pupils to make ‘above average’ progress if they do not sit the test. Rather than being omitted from the cohort measures, they are included with a nominal – i.e. made-up – score. When it comes to accountability measures, it doesn’t really pay to have pupils with SEND, even though their excellent progress can be readily demonstrated in other ways.
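To see how one outlier can move a whole cohort, here’s a rough sketch of the value added arithmetic, using made-up benchmarks and a deliberately simplified cohort (the real calculation has more wrinkles – nominal scores, the writing teacher assessment and so on – so treat this purely as an illustration).

```python
# A simplified sketch of value added progress scores. Benchmarks are illustrative.
# Pupil progress = KS2 score minus the national benchmark for their prior
# attainment group (PAG); cohort progress = mean of the pupil scores.

PAG_BENCHMARKS = {"low": 85.0, "middle": 100.0, "high": 110.0}  # made-up values

def pupil_progress(ks2_score: float, pag: str) -> float:
    return ks2_score - PAG_BENCHMARKS[pag]

# 32 pupils who exactly meet their benchmarks (progress of 0 each)...
cohort = [0.0] * 32
# ...plus the EAL pupil described above: low PAG, benchmark 85, scored 118.
cohort.append(pupil_progress(118, "low"))  # +33

cohort_score = sum(cohort) / len(cohort)
print(round(cohort_score, 1))  # 1.0 - one pupil adds a whole point to the cohort score
```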

The whole reception baseline infant junior school thing

In the 2017 primary assessment consultation, questions were asked about how progress could be measured for infant, first, junior, and middle schools in future, once KS1 assessment is ditched and the reception baseline takes over. The options were as follows:

  1. A reception to key stage 1 progress measure for infant and first schools, and a key stage 1 to 2 progress measure for junior and middle schools. This would require maintaining statutory key stage 1 teacher assessments for pupils in infant and first schools.
  2. Hold all schools to account on the basis of reception to key stage 2 measures, hopefully encouraging greater collaboration between infant, first, junior and middle schools.

Needless to say, neither was popular, and when the DfE published their response, they answered all questions except this one. We had to wait months for a decision and when it came it really was quite bizarre – the DfE decided to do nothing. All-through primary schools (i.e. those with pupils from reception to year 6) will be held to account for the progress made by pupils from their reception baseline to KS2 results. All other types of school – those non-all-through schools – will have attainment measures alone and ‘will have responsibility for evidencing progress [to Ofsted and others] based on their own assessment information‘. Putting aside the fact that Ofsted is saying they will not be looking at schools’ internal data (so you can’t show them the results from standardised tests, for example), this means that approximately 3000 schools (a whole quintile’s worth) will not be included in the national progress measures. It also means that whilst infant and first schools have a statutory duty to administer the reception baseline, they have no further stake in the game because it will not be used to measure the progress their pupils make. Neither will those baseline scores be used to calculate the progress of pupils in junior and middle schools. In fact, those baseline scores will only be used in cases where an infant school pupil at some point finds their way into an all-through primary school. A weird and surely unsustainable situation.

The KS1-2 primary progress measures that haven’t happened yet

In 2016, pupils at KS1 were assessed under the new national curriculum and were assigned new assessment codes: BLW (working below the expected standard, either on P scales or pre-key stage standards), WTS (working towards the expected standard), EXS (working at the expected standard), and GDS (working at greater depth within the expected standard). The DfE collected these KS1 assessment codes; they did not collect the scaled scores from the KS1 tests. This cohort reached the end of KS2 in 2020 and should have been the first group of pupils to have a measure of progress based entirely on the new national curriculum. This, of course, didn’t happen, and it didn’t happen in 2021 either, which means we are still waiting to find out the fine details of this progress measure.

No doubt the concept will remain the same: each pupil’s KS2 result will be compared to the national average score of pupils in the same prior attainment group (PAG). But here’s the problem: fewer KS1 outcomes mean fewer PAGs, and fewer PAGs mean a far blunter instrument. With levels, there were 24 PAGs at KS2; with the new KS1 assessments there just aren’t that many possible combinations, and without the KS1 test scores it’s not possible to differentiate further (unless they use phonics scores. Please don’t use phonics scores). Even worse, it’s likely that a majority of pupils nationally will fall into the same PAG (i.e. pupils assessed as EXS in reading, writing, and maths). This is no basis for a reliable and meaningful accountability measure.

And of course, we still have the age-old problem of the baseline (in this case KS1 results) being based on teacher assessment, which is prone to bias and distortion. Expect these measures, when they happen this year, to be garbage, and the DfE should consider whether they’re really worth pursuing.
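As a back-of-the-envelope illustration of the bluntness problem: if the DfE were to follow the old approach of converting each KS1 outcome to a point score and grouping pupils by their average across reading, writing and maths (an assumption on my part – the methodology hasn’t been published), the new codes simply don’t generate many distinct groups. Compare the result below with the 24 PAGs the levels-based measure supported.

```python
# Hypothetical sketch: how many prior attainment groups could the new KS1
# codes support? Assumes each code maps to a point score and pupils are
# grouped by their average across reading, writing and maths (an assumption;
# the actual methodology has not been published).

from itertools import product

POINTS = {"BLW": 1, "WTS": 2, "EXS": 3, "GDS": 4}  # illustrative mapping

combinations = set()
averages = set()
for reading, writing, maths in product(POINTS, repeat=3):
    combinations.add((reading, writing, maths))
    averages.add((POINTS[reading] + POINTS[writing] + POINTS[maths]) / 3)

print(len(combinations))  # 64 raw combinations of codes...
print(len(averages))      # ...but only 10 distinct average point scores to group on
```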

The use of teacher assessment in primary accountability measures

As mentioned above, and in more detail in this recent blog post, teacher assessment is prone to bias and distortion, and yet almost the entire primary accountability system relies on it. The issues at KS1 are well known and widely discussed, but we have similar issues with phonics (yes, THAT graph), the early years foundation stage profile (another weird graph), and writing assessment at KS2. Note that KS2 writing results are used to hold primary schools to account, but do not form part of the baseline for Progress 8, where only KS2 scores in reading and maths are involved. This probably tells us something about how the powers that be perceive the accuracy of those data.

Both the Making Data Work report and the final report of the Commission on Assessment Without Levels warn of the dangers of using assessment for multiple purposes, of expecting it to be a useful tool for teaching and learning and an accurate account of standards, whilst also using it to hold teachers to account. In the various graphs depicted in the sources linked above, we see the inevitable outcome of this regime – data is bent out of shape in order to paint a particular picture. In some cases, where the assessment is purely a result, it may be inflated. In other cases, where it is to be used as a baseline, it may be depressed. In the case of KS1 results, the assessment is in tension – it is both a result and a baseline – and is pulled back and forth as if caught in a tug-of-war between competing purposes. In such a scenario, assessment may be reduced to a Goldilocks approach: not too hot, not too cold, just right. We can have accurate teacher assessment or we can use it for accountability. That’s the choice.

The high stakes nature of assessment in primary schools

Really this is a continuation of the above, but it gives us an opportunity to consider an alternative. Currently, there are seven – Yes! Seven! – statutory assessments taking place in primary schools: the reception baseline at the start of reception, the early years foundation stage profile at the end of reception, phonics in year 1 and year 2, key stage 1 tests and teacher assessments at the end of year 2, the multiplication tables check in year 4, and key stage 2 tests and teacher assessments at the end of year 6. All of these are fairly high stakes – with the stakes increasing towards KS2 – and many are prone to distortion, as discussed above. In addition, pupils may undergo a great deal of practice in the run-up to tests, which ramps up the stakes still further. Consequently, current data is probably not a fair and accurate representation of pupils’ attainment and progress.

This recent report from the EDSK think tank offers a compelling and radical alternative: scrap the current array of assessments and replace them with online adaptive tests every two years. These would be lower stakes, take less time, could be administered easily during the course of normal teaching, would not require all pupils to take the test at the same time, would provide more regular checks on pupils’ progress, and are more difficult to practise for (no past papers; adaptive, so pupils are asked different questions). A further benefit is that the tests can contain national reference questions that samples of pupils would attempt every year, thus allowing accurate tracking of national standards. This approach has already been adopted elsewhere, including in Wales, Denmark, and Australia. Why not England?
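For anyone unfamiliar with how adaptive testing works, here is a toy illustration of the principle (this is not EDSK’s proposal, and real adaptive tests use item response theory rather than this crude up/down staircase): the difficulty of the next question depends on the answer to the last one, so no two pupils sit the same fixed paper and past-paper practice buys very little.

```python
import random

# Toy sketch of an adaptive test: difficulty moves up after a correct answer
# and down after an incorrect one. Real adaptive tests use item response
# theory; this staircase only shows why pupils see different questions and
# why practising a fixed paper doesn't help.

def run_adaptive_test(answers_correctly, n_items: int = 10, start_difficulty: int = 5) -> float:
    difficulty = start_difficulty                      # scale of 1 (easy) to 10 (hard)
    responses = []
    for _ in range(n_items):
        correct = answers_correctly(difficulty)
        responses.append((difficulty, correct))
        difficulty = min(10, difficulty + 1) if correct else max(1, difficulty - 1)
    # Crude ability estimate: average difficulty of the items answered correctly.
    solved = [d for d, ok in responses if ok]
    return sum(solved) / len(solved) if solved else 1.0

# Simulate a pupil who is secure up to difficulty 7 and shaky beyond that.
pupil = lambda difficulty: random.random() < (0.9 if difficulty <= 7 else 0.2)
print(round(run_adaptive_test(pupil), 1))
```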

Scaled scores vs standardised scores

I am still having this conversation on a regular basis, and I blame the DfE for choosing a scale for KS1 and KS2 tests that looks almost exactly like the scores derived from the various commercial tests that are commonly used in schools (e.g. NFER, Rising Stars, Star, GL). KS2 tests generate scores in the range of 80 to 120, with a score of 100 representing the expected standard; pupils achieving 100 or more have met the expected standard. A standardised test will generate scores in the range of around 60 to 140 (they can go lower and higher), with 100 representing the average score. That final point – 100 represents the average – is the important bit, because that is very different from the meaning of 100 on a KS2 test, where 100 represents the standard. Tests from the likes of NFER and Rising Stars are norm-referenced: any pupil that takes one is compared to the ‘norm group’ – a large, representative, national sample – and their standardised score is a reflection of their position within that group. If a pupil achieves a score of 100 or more, they are in the top half of the norm group; if they achieve a score below 100, they are in the bottom half. We assume the norm group to be representative of the national population, and so take the score to indicate the pupil’s position nationally. It doesn’t matter how hard or easy the test is; the scores always have the same meaning. Note that norm-referenced tests do not have a pass mark.

Contrast this with KS2 tests, where 100 represents a standard rather than the average. Unlike a standardised test, where only 50% of pupils can achieve a score of 100 or more (because it represents the average), at KS2 no such quota exists, and in that way it is more akin to a driving test, where everyone can pass as long as they meet the standard. KS2 tests are therefore closer to criterion-referenced tests. In the KS2 tests taken in 2019, 73% of pupils achieved a score of 100 or more in reading (i.e. met the expected standard) and 79% did so in maths. So, on one type of test (e.g. NFER), no more than 50% can achieve a score of 100 or more, and yet on the other type of test (i.e. KS2) 79% can reach that threshold, and that number is likely to rise in the next few years. If we want a standardised score that better approximates the KS2 expected standard, we therefore need a lower threshold. A standardised score of 88 captures pupils in the top 79% nationally, and therefore approximates to the expected standard for maths; a standardised score of 91 captures the top 73% nationally and therefore approximates to the expected standard in reading. In reality, we probably want to nudge those thresholds up a bit, because we risk counting too many borderline pupils, but clearly a threshold of 100 is not a useful proxy: it will only identify those pupils in the top half, and it’s more than the top half that achieve the expected standard. In short, standardised scores and scaled scores are not the same thing, they are not directly comparable, and it is immensely frustrating that they are so similar in appearance.
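If we assume the commercial tests are standardised to a normal distribution with a mean of 100 and a standard deviation of 15 (the usual convention, but check your provider’s technical manual; it is an assumption here), the percentages above drop straight out of the maths:

```python
from statistics import NormalDist

# Standardised scores are typically scaled to a normal distribution with
# mean 100 and standard deviation 15 (assumed here). This shows what
# proportion of the norm group a given threshold captures, and why 88 and 91
# roughly line up with the 2019 KS2 pass rates for maths and reading.

norm_group = NormalDist(mu=100, sigma=15)

def proportion_at_or_above(standardised_score: float) -> float:
    """Proportion of the norm group scoring at or above this score."""
    return 1 - norm_group.cdf(standardised_score)

for score in (100, 91, 88):
    print(score, f"{proportion_at_or_above(score):.0%}")
# 100 -> 50%, 91 -> 73%, 88 -> 79%
```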

Performance Tables

I’m not anti-accountability; I just think the performance – or league – tables have a pernicious effect on education and foster competition between schools. The high stakes nature of the tests, the issues with progress measures, the risk of distortion of data, the lack of context – all these things combine to create a set of unreliable and poorly understood data that is supposed to help parents make informed choices of schools for their children. But how many people really know what a progress score of -0.85 means, and why one school is classified as average and another as above average despite having the same results? Who looks at the confidence interval and works out that one or two more pupils getting higher grades could have made all the difference to the school’s overall performance band? Or realises that the reason that school has lower headline figures is its high proportion of pupils with EHCPs? What about the SEND resource, whose results feed into those of the main school rather than being separated out? Do we think about mobility when interpreting a school’s data? And the impact of moderation on writing results? Does anyone question why that junior school has twice the proportion of high prior attaining pupils as the other (all-through) primary schools in the area? Does the local paper think about these things when they download the results and print them in rank order in a double-page spread? Does the LA take these things into consideration when identifying ‘schools causing concern’? Does Ofsted do the same when prioritising schools for inspection?
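On the confidence interval point specifically, here is a rough sketch of how the banding works (simplifying the published methodology, and with an illustrative pupil-level standard deviation): a school is only ‘above average’ if the whole 95% confidence interval around its score sits above zero, which is why one or two pupils’ results can flip the band.

```python
from math import sqrt

# Rough sketch of progress score banding. The pupil-level standard deviation
# is illustrative; the real methodology has more detail, but the principle is
# that the band depends on the confidence interval, not just the score.

def progress_band(school_score: float, n_pupils: int, pupil_sd: float = 5.5) -> str:
    margin = 1.96 * pupil_sd / sqrt(n_pupils)          # 95% confidence interval
    lower, upper = school_score - margin, school_score + margin
    if lower > 0:
        return "above average"
    if upper < 0:
        return "below average"
    return "average"

# Two schools of 35 pupils with near-identical scores land in different bands:
print(progress_band(1.8, 35))  # 'average' - the interval still straddles zero
print(progress_band(2.0, 35))  # 'above average'
```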

I’m not anti-accountability but I do think performance tables should be scrapped. Or at least overhauled to focus more on context and to provide information in the form of narrative rather than numbers. This is what Ofsted has attempted to do with the IDSR, which is infinitely better than the RAISE reports that preceded it (apart from the obsession with quintiles). The DfE should at least consider a similar strategy for their Compare Schools website. Data without context or explanation is dangerous, and the performance tables, as they currently stand, are not doing schools any favours. Time for a change.

—-

That’s it, that’s my list of 10 things I hate about data. It’s not an exhaustive list and I could go on but this post is long enough. I’ve been writing this for 7 hours already. I should probably go and enjoy what’s left of my 50th birthday.

Thanks for reading (if you got this far).
