Defence against perverse incentives

I recently attended the JUSCO (junior school collaboration) conference in Birmingham organised by Chris McDonald (@chrismcd53). It was a great day packed with interesting talks and heated debate, and if you had to use one word to sum up the feeling in the room it would have to be ‘frustration’. That feeling was perhaps best encapsulated in Dr Rebecca Allen’s talk (@drbeckyallen), in which she showed the stark contrast in progress measures between all-through primary schools and junior schools and postulated that “either there is stuff that’s going on in your schools that really isn’t as helpful as it could be […] or there’s something that’s gone wrong with the way the government is measuring school performance”. Becky then went on to show the contrast in inspection outcomes between infant and junior schools: the former are 2.8 times more likely to be judged outstanding than the latter, and, perhaps unsurprisingly, RI and inadequate judgements are far more prevalent amongst junior schools than amongst infant schools.

Inevitably, much of the discussion that followed concentrated on the direct impact of over-inflation of KS1 results by infant schools, but an arguably bigger impact results from the depression of KS1 results by all-through primary schools, where perverse incentives exist to make those results as low as possible. Junior schools, with no control over their pupils’ start points, end up unfairly compared to a national baseline that is engineered to maximise progress. In an attempt to illustrate the issue I created the following diagram. I call it the swirling vortex of despair.

It shows how junior school pupils are at a huge disadvantage in the progress race because the school has no control over the baseline, and how pupils who make good progress in reality end up with negative scores when compared against supposedly similar pupils nationally. It’s like entering a fun run only to discover that the other competitors are elite athletes in disguise.

But this is not all about junior schools. The current system of measuring progress from KS1 to KS2 is hugely flawed, and it is deeply concerning that such high stakes are attached to such bad data. The combination of ill-defined, crudely scored, best-fit sublevels at one end and a mix of test results and weird, clunky nominal scores at the other hardly makes for an accurate measure of progress. Add in those perverse incentives to keep the baseline as low as possible whilst inflating KS2 writing teacher assessments and finding ways to exclude less able pupils from measures, and we have a mess of a system that favours the most creative (or the least honest). And it’s set to get worse in 2020, when the current year 3, with their new-format KS1 results, reach the end of KS2. The decision not to collect KS1 test scores seems a missed opportunity when we consider what we will probably end up with: instead of a refined series of start points based on scaled scores, we will have a handful of prior attainment groups, each containing tens of thousands of pupils, all of whom will have the same KS2 benchmarks. An avoidable disaster waiting to happen.
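To make the mechanics concrete, here is a minimal sketch of how a value-added progress measure of this sort works, using entirely invented numbers. The DfE’s actual calculation differs in its details, but the underlying principle – compare each pupil’s KS2 outcome with the national average for pupils with the same start point – is the one that matters here, and it shows why a depressed baseline pays.

```python
import statistics
from collections import defaultdict

# Entirely invented pupils: (prior attainment group, KS2 scaled score).
national_cohort = [
    ("low", 95), ("low", 97), ("low", 99),
    ("middle", 101), ("middle", 103), ("middle", 105),
    ("high", 106), ("high", 108), ("high", 110),
]

# National benchmark: the average KS2 score of pupils in each group.
by_group = defaultdict(list)
for group, score in national_cohort:
    by_group[group].append(score)
benchmarks = {g: statistics.mean(s) for g, s in by_group.items()}

def progress_score(group: str, ks2_score: float) -> float:
    """A pupil's KS2 score relative to the benchmark for their start point."""
    return ks2_score - benchmarks[group]

# The same KS2 result looks very different depending on the baseline:
print(progress_score("middle", 103))  # 0: bang on the 'middle' benchmark
print(progress_score("low", 103))     # 6: flattering, thanks to a lower baseline
```

With only a handful of coarse groups, every pupil nudged into a lower one gains the full gap between group benchmarks, which is exactly the perverse incentive described above.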

And so we need a better baseline, and this is the hot topic in the recently launched consultation on the future of primary assessment. Most seem to favour a baseline taken early in the reception year, and this is most likely the direction of travel. After all, surely it makes sense to measure progress from when pupils start primary school rather than from a point 3/7ths of the way through. Whatever the start point, any future baseline assessment needs to be principled and robust, and refined enough to provide a suitable number of prior attainment groups. Unfortunately, and inevitably, those perverse incentives to ensure a low start point will still exist, so how do we avoid them?

Moderation
One option is to continue with the current arrangement of moderating a sample of schools each year, but I would argue that this has not proved particularly effective; if it had, we wouldn’t have all these issues and I wouldn’t be writing this blog post. It’s probably time to consider other options. Alternatively, moderation could be carried out after submission of data, which might help ensure schools err more on the side of caution. More likely, though, it would just create resentment.

School-to-school support
This could take a number of forms: schools moderating each other’s baseline assessments (this already happens a lot anyway), teachers from a neighbouring school invigilating the assessment in the classroom (think National Lottery independent adjudicator with a clipboard), or actively administering the assessment. I’m not sure how popular the latter would be, either with staff or with children.

Use of technology
If pupils were to do the assessment via an iPad app, there are benefits in terms of instant data collection and feedback for the user. Plus – and here’s the sinister bit – algorithms can spot unusual patterns (think betting apps), which can help discourage gaming. However, there are no doubt access issues for some pupils, and what if they struggle to complete tasks at the first attempt? Do they get another go? It would also mean purchasing a lot of iPads. I recall that one of the six providers of the last attempt at a baseline assessment had such a solution and evidently it wasn’t particularly popular – it didn’t make it to the final three – but that doesn’t mean it’s not worth another look.
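As a purely illustrative sketch of the sort of pattern-spotting such an app could do – assuming item-level response data is available, and not modelled on any real provider’s product – consider a pupil who fails the easiest items but passes the hardest: an aberrant response pattern worth a second look, and a crude cousin of the person-fit statistics used in computerised testing.

```python
# Hypothetical check on item-level responses: (item difficulty, correct?).
def aberrance(responses: list[tuple[float, bool]]) -> float:
    """Mean difficulty of items answered correctly minus mean difficulty of
    items answered incorrectly. Pupils normally pass easy items and fail hard
    ones (negative result); a strongly positive result - hard items passed,
    easy items failed - is the unusual pattern worth flagging."""
    passed = [d for d, ok in responses if ok]
    failed = [d for d, ok in responses if not ok]
    if not passed or not failed:
        return 0.0  # everything right or everything wrong: nothing to compare
    return sum(passed) / len(passed) - sum(failed) / len(failed)

# Typical pattern: easy items right, hard items wrong.
print(aberrance([(0.25, True), (0.5, True), (0.75, False), (1.0, False)]))  # -0.5
# Aberrant pattern: easy items wrong, hard items right -> flag for review.
print(aberrance([(0.25, False), (0.5, False), (0.75, True), (1.0, True)]))  # 0.5
```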

Random checks
This would probably only work if the assessment were carried out in all schools on the same day, and I’m assuming this won’t happen. It is more likely that the assessment will be carried out over a number of days, which would mean schools submitting the dates of assessment in advance, like an athlete declaring their whereabouts. Also, who would carry out these random checks? This is probably a non-starter. It would be massively unpopular.

Data analysis
Unlike levels, which were broad, vague and non-standardised, and therefore lacked an accurate reference point (yes, 2B was the ‘expected’ outcome, but no one could really agree on what a 2B actually was), a standardised assessment based on sample testing will provide a more reliable measure. Schools or areas with consistently low baseline scores, where all or nearly all pupils are below average, may then warrant further investigation.
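Here is a minimal sketch of that check, with invented numbers, assuming the baseline yields standardised scores on a conventional mean-100, SD-15 scale: flag any school whose cohort mean sits more than a standard deviation below the national average year after year.

```python
import statistics

NATIONAL_MEAN, NATIONAL_SD = 100.0, 15.0  # assumed standardised scale

def consistently_low(yearly_scores: dict[int, list[float]],
                     threshold_sds: float = 1.0) -> bool:
    """True if every cohort's mean baseline score falls more than
    threshold_sds standard deviations below the national mean."""
    return all(
        statistics.mean(scores) < NATIONAL_MEAN - threshold_sds * NATIONAL_SD
        for scores in yearly_scores.values()
    )

# Three hypothetical cohorts, all averaging roughly 20 points below national.
school = {
    2018: [78, 82, 80, 79, 84, 81],
    2019: [80, 77, 83, 79, 82, 78],
    2020: [81, 79, 80, 82, 78, 80],
}
print(consistently_low(school))  # True: worth a closer look, not proof of gaming
```

A real check would also account for cohort size, since small cohorts bounce around from year to year, but the principle is straightforward: a below-average intake is common; an intake in which nearly every pupil, every year, is well below average is statistically improbable.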

I understand that all of this sounds rather Big Brother, but the alternative is that we carry on as we are, with unreliable progress measures against which critical judgements of school performance are made. If we are going to have progress measures – and who wants their performance judged on attainment alone? – then they absolutely have to be based on credible data. That means having an awkward conversation about the gaming that arises from perverse incentives and the steps that can be taken to avoid it, because the current situation of high-stakes performance measures, floor standards and coasting thresholds based on utterly unreliable data is unsustainable.
