My previous six posts have examined the position on educational purpose taken by Professor Biesta. I have concluded that when he (like many of his child-centred colleagues) says that we should focus more on purpose, he does not mean to clarify but rather to obfuscate that purpose. He means to place responsibility on individual teachers to decide what their various, implicit, and often meaningless purposes should be. This leaves no possibility of taking systematic action to achieve such objectives, or of giving any clear account to the rest of society of how effectively this has been done. It is a model that sits uncomfortably with Professor Biesta’s professed desire to improve democratic accountability. In this post, I turn to the reasons why Daisy Christodoulou also opposes the explicit description of educational purpose.
Many would rate Daisy Christodoulou, ideologically speaking, as the polar opposite of Gert Biesta. Yet when it comes to stating explicitly the purposes of education, Biesta and Christodoulou share the same position. Professor Biesta does not see a need to state the purpose of education in meaningful terms because he does not think teachers should be accountable for delivering purposes imposed on children from without. Daisy Christodoulou does not see a need to state the purpose of education because she thinks such purposes are not only legitimate but also obvious. Even if we might have some trouble defining exactly what we mean by “soft skills”, Christodoulou argues, no-one could object to our students becoming numerate and literate—so why do we not just get on with achieving these obvious goals?
To attain an objective, it helps if it is precisely defined. If, on a military exercise, you share with your colleagues only a vague notion of the location of the enemy position to be captured, you should not be surprised when your midnight attack ends in a shambles. Bill Schmidt’s work on “curriculum coherence” suggests that American children are taught “long laundry lists of seemingly unrelated, separate topics” which are “highly repetitive, unfocused, unchallenging, and incoherent”, instead of what we should be teaching, which is “really a handful of basic ideas”. Even when it comes to something as apparently clear-cut as “numeracy” or “maths”, we seem to have very different ideas of what this involves. If we are to ensure that teachers are aiming at the right target in the first place, we need to define our objectives at a more granular level: to describe in detail what a numerate person looks like, breaking down high-level objectives into their constituent parts.
Such a description of granular learning objectives is what was attempted by the model of criterion referencing that was formally introduced by the 1988 Education Reform Act and formally abandoned in 2013. Daisy Christodoulou played a significant part in the attack on criterion referencing, both as a member of the Commission on Assessment without Levels and in her blogs and presentations.
Our model of criterion referencing was certainly flawed—but the problem lay not with the principle that we should attempt to define our educational objectives; it lay with the particular way in which this was done.
The first problem was the brevity, and consequent lack of clarity, of the rubrics by which criteria were defined. In his evidence to the 2011 Framework for the National Curriculum, Professor Paul Black, chairman of the original 1987 Task Group on Assessment and Testing (TGAT), suggested that attainment targets should be described in longer, discursive paragraphs rather than short phrases (p.43).
A second problem was frequently a lack of clarity in the underlying concept that the rubric described. Daisy has quoted the example given by Dylan Wiliam regarding the comparison of fractions. A criterion that required that students should be able to “compare two fractions to identify which is larger” would not indicate a clear level of attainment, because the difficulty of this task depends very largely on the particular fractions that are to be compared. Research by K. M. Hart in 1981 suggested that 90% of 14–15 year olds could tell which of 3/7 or 5/7 was the larger, while only 15% could do the same for 5/7 and 5/9: the first pair share a denominator, so only the numerators need comparing, while the second pair requires the less intuitive insight that, for equal numerators, a larger denominator means a smaller fraction. Although this particular criterion was invented by Dylan Wiliam for the specific purpose of illustrating what a bad criterion reference might look like, it is fair to say that many statements of attainment in the National Curriculum were drafted in a similar way, giving very little idea of the difficulty of the performance that was being required.
This problem could be solved by the provision and analysis of exemplars, as was specifically recommended by Professor Black’s original TGAT report: “We recommend that attainment targets be exemplified as far as possible using specimen tasks. Such tasks can then assist in the communication of these targets” (para.56). This was poorly done, if at all. Each Statement of Attainment (SoA) in the 1988 curriculum was accompanied by a single example, which was brief, included no student response to give an impression of the expected standard of performance, and was in many cases inconsistent with the rubric of the SoA itself.
As well as communicating the meaning of a criterion, exemplars can also be used to verify whether the expected standard of work is clear. The sort of inconsistency in anticipated performance identified by Professor Wiliam would immediately become apparent if not only the exemplars but also the application of the criterion to the work of real students were monitored and correlated. Needless to say, this was never done.
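The sort of monitoring proposed here can be sketched very simply. In this illustration (all names and judgements are invented, not real study data), two markers apply the same criterion to the same six scripts; their rate of agreement is a rough indicator of how clearly the criterion pins down a standard:

```python
# Hypothetical mastery judgements (1 = criterion met) by two markers
# applying the same Statement of Attainment to the same six scripts.
teacher_a = [1, 1, 0, 1, 0, 1]
teacher_b = [1, 0, 0, 1, 1, 1]

# Proportion of scripts on which the two markers agree.
agreement = sum(a == b for a, b in zip(teacher_a, teacher_b)) / len(teacher_a)
print(f"agreement: {agreement:.0%}")  # markers agree on 4 of the 6 scripts
```

A low agreement rate would flag a criterion whose rubric and exemplars were failing to communicate a consistent standard.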
A third problem was the use of binary measurement metrics: either a student had “got it” or they hadn’t—but it was not allowed that they might lie anywhere in between. As Daniel Koretz says in Measuring Up, such binary criteria “are just cut scores on a continuum of performance” (Kindle location 3362). As teachers naturally wish to see their students do well, students tended to be credited with mastery on the basis of the most superficial performances, aggravating grade inflation and trivializing the curriculum.
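Koretz’s point can be illustrated in a few lines. In this minimal sketch (the pupils, marks, and cut score are all invented), a “binary” criterion is simply a threshold drawn across a continuous scale, and the thresholding discards most of the information:

```python
# Hypothetical percentage marks on a continuous scale of performance.
scores = {"Amy": 49, "Ben": 51, "Cal": 95}
CUT = 50  # the "mastery" criterion is just a cut score on that continuum

mastery = {pupil: mark >= CUT for pupil, mark in scores.items()}
# Ben and Cal are both recorded as "masters" despite a 44-point gap,
# while Amy and Ben, only 2 points apart, fall either side of the line.
```

The binary record preserves none of the distance between Ben and Cal, while exaggerating the difference between Amy and Ben.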
A fourth problem was that relationships were asserted between different criteria, which were organised into scales of ascending difficulty, without any empirical research to establish that, even if the level 3 Statement of Attainment could be assumed to have a consistent meaning, it was in fact more difficult than the level 2.
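The missing empirical check would not have been difficult in principle. A hypothetical sketch (all response data here is invented): compute each item’s facility—the proportion of pupils who succeed on it—and test whether the item placed at the higher level really is harder:

```python
# Invented pass/fail responses from 8 pupils on one item drawn from each
# of two Statements of Attainment placed at different levels.
responses = {
    "level_2_item": [1, 1, 0, 1, 1, 1, 0, 1],
    "level_3_item": [1, 0, 0, 1, 0, 1, 0, 0],
}

# Facility = proportion of pupils passing the item.
facility = {item: sum(r) / len(r) for item, r in responses.items()}

# The asserted ordering holds only if the level 3 item has lower facility.
ordering_holds = facility["level_3_item"] < facility["level_2_item"]
```

Run across a representative sample of pupils and items, this kind of check would either validate the claimed hierarchy of levels or expose it as an armchair construction.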
The fifth, and perhaps the most serious, problem of all is what I would term the logistical problem. The 1988 National Curriculum introduced 297 Statements of Attainment for Maths and 407 for Science. Teachers could not realistically track student performance against so many different criteria, and so it was no surprise that the 1993 Dearing Review, complaining about the “meaningless ticking of myriad boxes” (p.62), drastically cut the number of attainment targets.
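Some back-of-envelope arithmetic shows the scale of the burden (the class size of 30 is my assumption, not a figure from the source):

```python
maths_statements = 297  # Statements of Attainment for Maths, 1988 curriculum
pupils = 30             # assumed size of a typical class

# Separate tracking judgements for one class, in one subject alone.
judgements = maths_statements * pupils
print(f"{judgements:,}")  # prints "8,910"
```

Nearly nine thousand box-ticks per class for Maths alone—before Science’s 407 statements are even considered—makes Dearing’s complaint easy to understand.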
In solving the fifth problem, Dearing introduced the sixth, which was one of over-simplification. The new high-level criteria—what became known as “levels”—no longer operated at an appropriate level of granularity to support teachers who needed to manage formative assessment. Instead, student performance had to be rated using a “best fit” methodology, which often credited students with levels of attainment that did not reliably represent their true standard of performance.
The recommendations of Professor Black’s original TGAT report were largely ignored in the subsequent implementation of the National Curriculum. The whole concept of criterion referencing was misapplied, and subsequent attempts to mitigate the original problem only succeeded in compounding the mess. Twenty-five years later, the approach has been abandoned, apparently on the basis that, because we did something really badly, it is not worth trying to do it well. When I mentioned criterion referencing at the Commons Education Select Committee conference, a delegate from Ofqual gave me a look of pity: criterion referencing has become a dirty word.
No-one seems to have analysed what went wrong (as I have done above) or how it might be fixed (as I shall do in a future post). No-one appears to be interested in why we still need to describe our educational objectives with precision and at a granular level. This will be the subject of the ninth part in this series.