Baseline 2018: BERA task force commentary

BERA Baseline Briefing Document: The validity and utility of the proposed baseline test in England

This BERA Baseline Briefing document sets out the case against using a baseline assessment test of pupils in reception to create a new progress measure to hold schools in England to account at the end of KS2.
BERA convened an Expert Panel to consider whether the evidence from the assessment literature could justify such a test being used for these purposes. The conclusion of the Expert Panel is that it cannot. This document is intended to inform public debate by making accessible the reasons why the proposals are flawed. In the Panel's view, the proposals will not lead to accurate comparisons between schools, as policymakers assume, nor will they work in the best interests of children and their parents.

Validity

In coming to this conclusion the Panel have paid particular attention to the proposed test’s validity:
Validity refers to the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests (AERA Standards, 2014, p.11).
The first requirement in evaluating the validity of any test is to establish its purpose. Only then can we judge whether it achieves this purpose and is thus a valid test.
Unusually for a national test, the new baseline test in reception has only one overriding purpose1. This is to provide data on the achievement levels of pupils on entry into their reception class in primary schools which, seven years later, will be used to make ‘value-added’ comparisons among schools. The comparison will be made by setting the baseline assessment against the same pupils’ Key Stage 2 scores in Year 6.
In its evaluation the Panel considered whether this is a legitimate purpose and whether the proposed test will be able to achieve it ‘fairly and accurately’, as Justine Greening argued (DfE, 2017). The specific questions raised were:
1. Is it legitimate to use baseline assessment for school accountability purposes seven years hence?
2. Will the proposed tests be accurate or fair?
3. What recognition is being given to contextual factors in the interpretation of the data?
4. Will this form of accountability lead to useful comparisons of schools?
5. What is the likely impact of these accountability measures on pupils and schools?
6. Are there better alternatives to baseline testing?
1 Paul Newton has identified at least 18 purposes for which assessments can be used, ranging from monitoring national standards to providing teachers with information about individual pupils (Newton, 2007).

Executive Summary:

The Panel have three major concerns.
1. Children are being exposed to tests that will offer no formative help in establishing their needs or in developing teaching strategies. The morality of this is questionable, as is the use of time, money and resources for an assessment that is of little or no direct value to those involved in it. We know of no other assessment system internationally which offers so little formative feedback and in which the data lie dormant for a substantial period (seven years in this case). The ethical case for this practice has not been made.
2. Any value added calculation on which the school is held accountable will be highly unreliable. Any 20-minute test of four-year-olds will be strongly affected by age, first language and home background. To make no adjustment for these factors, and to use the combined scores to determine a baseline from which to judge the school, violates recognised international testing standards (AERA et al., 2014; Raudenbush, 2004). It also ignores decades of school effectiveness research and evidence from the DfE’s own past work on contextual value added indicators. With no recognition of the effects of other background factors on progress, the baseline test cannot allow ‘like with like’ comparisons of schools to be made.
Any presentation of school value added scores should recognise the inherent statistical unreliability by indicating the confidence intervals around them (Foley & Goldstein, 2012). The confidence intervals would reveal that making fine (rank order) distinctions between schools in the form of ranked league tables is invalid. As we argue in more detail below, experience using secondary school data suggests that the utility of the data for making school choices and holding schools to account will be extremely limited. This arises fundamentally because the initial tests themselves will have low reliability, and the long time gap between baseline and KS2 further reduces the predictability of the latter from the former. The problem is exacerbated by the small size of pupil cohorts in primary schools and by the extent of, and variation in, rates of pupil mobility between primary schools. Given the proposed high-stakes nature of this assessment there may also be the added complication of schools ‘playing the system’ – would a lower baseline score advantage the school in any future value added calculation?
3. This is an untried experiment. To evaluate the proposed baseline system properly, one would have to wait until at least 2027, when the first cohort has taken KS2 tests. Without such evidence, we argue that it would be unethical to impose such a system on schools. This is not a secure basis on which to judge what a school offers current four-year-olds.

Section 1. Is it legitimate to use baseline assessment for school accountability purposes seven years hence?

1.1 The ethics of testing young children for accountability purposes.
The Panel asked: Is it ethical to ask very young children to sit a baseline attainment test, on entrance to school, from which neither they nor their teachers will receive any direct benefit? Assessment of young children may be justified if the purpose is diagnostic or formative, and used to support a child’s learning. This was the rationale of the Early Years Foundation Stage assessments and other diagnostic tools such as PIPS.
But responses to the DfE’s consultation on the baseline tests suggest that many early years practitioners believe it is unethical to test children who have just arrived in school, often from very diverse backgrounds, and who may be settling into an unfamiliar environment (Bradbury and Roberts-Holmes, 2016). The decision to conduct the test as soon as possible in the autumn term heightens these concerns.
The ethical issues become particularly acute when the results will be used for school accountability purposes rather than to support individual children’s learning. It is noteworthy that the Centre for Evaluation and Monitoring (CEM), which ran one of the three baseline pilot schemes, withdrew from the test development bidding process on the grounds that it was ‘verging on the immoral’ to use the test for accountability purposes alone (quoted in Bradbury et al., 2018).
This ethical concern is supported by findings from America, where there has been extensive research into early years assessments, particularly in relation to the concept of school readiness (Shepard, 1998; LaParo & Pianta, 2000; Kim & Suen, 2003). The general finding is that ‘Instability is more the case than not in early childhood development, and tests of accountability that overlook the implications of this variability will mislead policy makers, the public and children’s teachers’ (Meisels & Atkins-Burnett, 2006, p. 543).
1.2 Should four-year-olds be tested for accountability purposes?
The Panel asked: Will the testing process meet the requirements that fair testing involves?
The research evidence suggests that in any early years assessment system multiple assessment measures are required and results should always be interpreted with caution (Meisels & Atkins-Burnett, 2006). A survey of 44 studies by Kim and Suen (2003) concluded: ‘the predictive power of any early assessment from any single study is not generalizable, regardless of the design or quality of the research’ (p. 561).
This variability was evidenced in the initial pilot of baseline testing, in which three different assessment methodologies were used: observational (Early Excellence), a teacher-led assessment that was computer-based for the main areas (BASE – CEM), and a resource-based assessment with a mixture of tasks and observational checklists (NFER). The outcome was that the three sets of results were judged to lack sufficient comparability to create a fair starting point from which to measure pupils’ progress (STA, 2016, p. 20). Switching to a single provider with a single assessment methodology only masks the problem of how a particular test format will determine the results, particularly with young children with little or no experience of test taking. If a different test were used, the results for many children would differ, as the Government’s own study illustrated.
As the results for individual children would differ on different tests, so too would their predictive power, given the stated intention of comparing schools’ effectiveness in ‘value added’ calculations seven years later.
1.3 Can baseline tests in Reception be used to calculate school value added at KS2?
The Panel asked: will the test be fit for purpose?
Modern validity theorising centres on construct validity: what is the construct, domain or skill being tested? Only when we know this can we decide whether the assessment is fit for purpose. Two major threats to validity are failing to sample the domain adequately and assessing elements that are not part of the construct2. This raises particular issues in the case of a baseline test designed to be used for accountability purposes.
In explaining why the government had ruled out observational tests, the early years minister Nadhim Zahawi (TES, 6/3/2018) said: ‘The data from the baseline needed to correlate with key stage 2 assessment so that “like for like” comparisons could be made.’ Such a statement is inherently confused: correlations will exist even between unlike tests, and they could have been explored using data from observational tests.
In fact, it is not yet clear how close an alignment is really intended between a reception baseline and end of Key Stage 2 tests, nor what the consequences are of making this the goal. The test developer, NFER, had previously introduced a ‘practical, child-friendly baseline assessment’ using resources ‘such as counting bears, plastic shapes, number cards and picture cards’. But this assessment approach was not designed principally to correlate with the far narrower formal testing that takes place at Key Stage 2. If policy makers insist on close alignment with concepts tested at Key Stage 2, then the baseline tests may well be narrowed (this happened with the original Key Stage 1 tasks).
As things stand we will not know for seven years how valid any proposed alignment is. Nonetheless, what is tested at baseline has to be first and foremost an aspect of cognitive development appropriate for that age. Certainly, the content of the baseline test should not be based on or treated as preparation for the content of the Key Stage 2 tests.
1.4. Is the development brief for the test appropriate?
The development brief for the tests (DfE, 2017) stipulates that the test will be 20 minutes long, will be accessible to 99 per cent of the cohort, and will offer a wide spread of marks with no more than 2.5 per cent of takers getting full marks.
The Panel asked: Are these proposals appropriate?
One of the threats to validity is the way tests are scored and marks aggregated (Crooks, Kane & Cohen, 1996). Crooks et al. detail these threats in terms of:

  • Scoring fails to capture some important qualities of task performance
  • Undue emphasis on some criteria, forms or styles of response
  • Lack of intra-rater or inter-rater consistency
  • Scoring too analytic or holistic
  • Aggregated tasks are too diverse
  • Inappropriate weights given to different aspects of performance (p. 270)

Validity theory often adopts Samuel Messick’s (1989) classic ‘threats to validity’ of construct under-representation and construct-irrelevant variance. An example of the former would be focusing a language test on writing and ignoring speaking; an example of the latter would be a maths test which requires such a high level of reading skill that it rewards good readers rather than good mathematicians.
A proposed 20-minute test that samples the three areas of literacy, numeracy and, if included as proposed, self-regulation is likely to fall foul of this checklist by being too diverse to aggregate meaningfully. An overall score must vary with whatever weightings are given to the three constructs included in it3.
The intention to add these up to produce a simple overall score ignores the fact that children may show different performance in different domains and that some domains may be better predictors of later Key Stage 2 results than others. In fact, past research has already indicated that the three domains of early literacy, numeracy and self-regulation are indeed distinct and better treated separately in predicting children’s later attainment (see, for example, the major longitudinal DfE-funded Effective Pre-school, Primary and Secondary Education research (EPPSE 3-16+), which followed children’s attainment, progress and social-behavioural development across successive phases of education, including from baseline at reception to Key Stage 1 and later to Key Stage 2: Sammons et al., 2002; 2003; 2004; 2008a; 2008b; Sylva et al., 2004; 2006). The intention to produce one overall score is therefore misguided, as the illustration below shows.
The unreliability inherent in a 20-minute test of young children across a range of skills has not been estimated and reported by the test developer, nor have any proposals been forthcoming as to how predictive validity will be investigated and reported across different years, even though the evidence is that tests may have different predictive validity for different groups of pupils (Tymms et al., 2014, pp. 21, 31). In addition, past research commissioned by the Qualifications and Curriculum Authority suggested that different baseline tests vary in the extent to which they identify variation in performance between various pupil groups (Sammons, Sylva & Mujtaba, 2000).

3 The test could offer a profile across the three constructs but the sampling would be so limited in each that it would lack reliability.

Section 2. Will the proposed tests be accurate or fair?

2.1 How reliable will the baseline tests be?
The Panel expect the baseline tests to show low levels of reliability as a result of the format of the tests, the inexperience of young children in test taking, and the inevitable variations in administration as teachers seek to explain what is required to naïve test takers and to ease children’s anxieties.
Reliability is about ‘the consistency of outcomes that would be observed from an assessment process were it to be repeated… Reliability is about quantifying the luck of the draw’ (Newton, 2009, p. 51). The level of reliability is affected by whether pupils would have received different results had they taken the test on a different day, taken a different version of it, had a different assessor or been introduced to it in a different way.
The baseline test poses considerable reliability problems as the test takers are young and inexperienced. Many will not have taken such a test before and may still be anxious about starting school. Those administering the test may themselves be unfamiliar with this form of assessment and there will be inconsistencies between teachers and schools in the level of support they offer different children. Unlike an exam for older children, test administration will need to be individualised, and this could itself prove a source of considerable inconsistency both between children and between different classes and schools. This unreliability will lower the predictability of the KS2 test scores.
2.2 The impact on the learner
It cannot be assumed, as the policy seems to do, that a short baseline test will have no impact on the test taker. Such tests always have an impact, whether direct or indirect. This baseline test is designed to sum up, in 20 minutes, what a child is bringing into school. Background factors such as deprivation, home language and age (Stiggins, 2000) may mean some children will have limited success on the tests and be disproportionately stressed. This may lead to labelling of the children who struggled the most – something that could become self-fulfilling (Hart et al., 2004; Boaler, 2009). This is a particular risk if teachers are encouraged to make premature judgments about children’s abilities and their family context from the test-taking process. The NFER and DfE must systematically evaluate the impact of the test on pupils, teachers, parents and schools to ensure that no negative consequences flow from its introduction.
2.3. Can fair judgements be made using the baseline data?
The Panel asked: How will the results be interpreted and used? The widely accepted definition of validity as ‘the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests’ (AERA Standards, 2014, p.11) emphasises the need for careful interpretation of the results. A test can be well-constructed and scored but the results can then be misunderstood or misused. For example, some media have reported that Year 6 children scoring at level 3 in reading are ‘unable to read’ or ‘illiterate’ – an inaccurate and misleading interpretation.
This is a particular risk in the case of test data collected from the very young. Scoring of the test must adjust properly for the age of the child, for the simple reason that at such a young age a few months’ difference may lead to pronounced developmental differences (six months at age four is a considerable developmental period).
2.3.1 The effects of child age by month.
The proposed baseline reception test will be taken by children who vary in age by up to 12 months. This is a very important consideration because it is widely recognised that there are important age-related developmental effects that are especially striking for young children. Well-constructed tests for the very young are therefore frequently age-standardised, controlling explicitly for each month of age difference.
Typically autumn-born children show a strong advantage in attainment over their younger summer-born peers (differences being most marked between the oldest September-born and youngest summer-born children, who are nearly a year apart in age). An early value added study of pupils’ progress from reception baseline to the end of Key Stage 1, conducted for formative school improvement purposes (not for accountability), revealed large age effects in all the Key Stage 1 areas covered (reading, writing, maths and science) (Sammons & Smees, 1998). Importantly the authors noted that ‘Older pupils did better in all areas than younger members of the year group. Because prior attainment [at reception entry] is controlled, this means that older pupils made more progress over the infant years, as well as having higher initial attainments at entry.’ (Sammons & Smees, 1998, p. 398).
This problem remains intractable, as more recent research comparing children’s performance on entry to school with the progress made in the first year of school reveals (Tymms et al., 2014). The authors show in Table 9 (p. 33) that the correlations with child age range from 0.21 for phonological awareness to 0.30 for early mathematics. For the total score the age correlation is 0.31, representing a substantial effect of age. Indeed, Tymms et al. (2014) also note, in relation to the age effect for personal, social and emotional development: ‘The older the child on starting school, the higher the ratings tend to be on each item. The effect sizes are modest but clear; the older children were seen to concentrate more, to feel more comfortable, to communicate better, to have better relationships and so on. There was a fairly constant effect across all items.’ (ibid., p. 36). Thus to ignore age effects on attainment may prove misleading, while statistical control for age effects is highly desirable.
2.3.2 Age cohorts in primary schools.
For the Government’s stated school accountability purposes, the age effect is particularly problematic since pupil cohorts in primary schools are quite small (often only one or two classes of children) and the distribution of younger and older children can be quite uneven. Schools that serve more children who are young for their year at entry may appear to have less favourable effects on children’s later attainment in Year 6 than schools serving more children old for their school year. Unless age effects are controlled at both baseline and for outcome tests at Key Stage 2 it is not possible to establish how schools’ VA results will be affected by the proportion of younger or older children in their cohorts.

Section 3. What recognition is being given to contextual factors in the interpretation of the data?

3.1 The impact of pupil and teacher mobility
For accountability purposes, the interpretation of the baseline data will first happen at school level at the end of KS2. Yet in measuring progress over the seven years from Reception to the end of primary school there are no proposals to take into account the length of time a pupil has been in the same school, the time a Head Teacher has been in post and accountable for the pupils’ progress, or the rate of teacher mobility.
Mobility in school, in relation both to teachers and to pupils, has been a long-discussed issue. Separate Infant and Junior schools have already been excluded from this proposed accountability measure because the continuity for pupils and the accountability for heads are disjointed when pupils move from one such setting to another (in 2017, the proportion of pupils in Infant and Junior maintained schools in England who would be affected was 13%).
Research also tells us that nationally about 20 per cent of pupil moves in England happen at non-standard times of year (Sharma, 2016). This varies by region: in London, non-standard admissions are 20% higher than in other regions of England. Not only is the rate of mobility inconsistent across areas of England, and across schools within an area, but the contexts of the pupils who move differ. Schools with very low percentages of free school meal pupils generally have very low levels of pupil mobility (Rodda et al., 2013). By contrast, schools in disadvantaged and diverse areas tend to experience more pupil mobility. Disadvantaged pupils are over-represented among mobile pupils, as are SEND pupils, and the mobile population is more ethnically diverse than the overall pupil population (ibid.).
Considerable work needs to be done to understand the impact of using a measure of pupil progress where pupil mobility is disproportionate across schools and areas of the country. The likely difference in context for these children and their learning needs will depend on how baseline scores are treated when children move school. If mobile pupils were taken out of the progress measure in all schools, then on average a couple of pupils per class would be missing from the accountability measure; in some schools, however, it could be many more. If the Reception baseline results follow a pupil who moves school during the primary phase, as is the current practice with the measure between KS1 and KS2, then a school will be held accountable for a pupil’s progress from a starting point that it did not set, having had the pupil for only a relatively short part of their school life in which to make up progress.
For all these reasons, baseline to Key Stage 2 pupil progress scores are unlikely to translate easily into simple judgements about what the school has added. Similar questions are raised by the movement of head teachers. A recent study reported that while 84% of primary school head teachers remained as head in the same school from year to year, retention rates are falling. They are lower when schools are deemed inadequate by Ofsted, become academies or join MATs, or have a higher percentage of low-attaining pupils. Against this background the study outlined the need for strategies to “retain effective head teachers within the profession and to build a stronger pipeline of new head teachers” (Lynch et al., 2017). Across seven years, therefore, the chances of a change of head will be quite high, and some schools may experience several changes. As with pupil mobility, more work needs to be done to understand the usefulness of an accountability measure with such a far-reaching end point. What will it measure, if many heads and school leaders, as well as teachers and pupils, may not stay with a school for the full seven-year period envisaged from baseline in reception to Key Stage 2?
3.2 The impact of socio-economic and family factors
There is strong evidence that other child characteristics also affect both attainment and relative attainment in value added measures. Different effects have been found not just for age but for the early years home learning environment, parents’ educational levels, family socio-economic status, family income and neighbourhood disadvantage, as well as English as an Additional Language (see research on the Millennium Cohort Study as well as other longitudinal research funded by the DfE, such as the EPPSE research: Melhuish et al., 2008; Sammons et al., 2002; 2003; 2004; 2008a; 2008b; 2008c; 2015).
Since 2010, successive governments’ decision to drop the contextualisation of schools’ results in accountability comparisons has not made such effects disappear. They remain, but become a source of unmeasured bias. The Tymms et al. (2014) study found significant effects related to the disadvantage of the neighbourhood a child lived in (from the neighbourhood IDACI score) but did not study the effects of family income using a child’s free school meal (FSM) status. The Sammons and Smees (1998) study on baseline assessments revealed significant effects related to a child’s FSM status, over and above the effects of age, on both attainment at baseline and value added attainment. Ignoring such effects makes the proposed use of the reception baseline test for school accountability particularly inappropriate, since comparisons will not take proper account of differences between schools in the characteristics of their pupil intakes. This will systematically favour schools serving fewer disadvantaged pupils and penalise schools serving higher numbers of disadvantaged children.

Section 4. Will this form of accountability lead to useful comparisons of schools?

England, relative to most other countries, employs unusually high-stakes accountability procedures in its education system. Schools are judged both by the results of national tests and examinations, including whether performance targets are met, and by the outcomes of systematic school inspections by Ofsted4. Both these systems carry serious consequences for schools, particularly if they perform below target thresholds or are judged by Ofsted to be in need of improvement.
4.1 The utility of school performance data for parental choice of school.
The government justifies the use of data to rank schools and produce league tables in the interests of parental choice. While there is little research at primary school level on the use of school comparisons (league tables) to inform parental choice of schools, there is extensive research at secondary level that helps to draw out the implications of using progress measures from reception to KS2 for such a purpose. This is crucial since it is quite clear from the DfE response to the consultation on baseline testing (DfE, 2017, p15) that this is the main purpose of the proposals.
It is generally recognised that the only proper way to make comparisons between schools is to adjust for the prior attainments of their pupils when they enter those schools and also to control for other relevant characteristics of pupil intakes. Such adjustments lead to what are known as ‘value added’ comparisons, and this is what is proposed for the baseline tests: school-level attainment at Year 6 will be adjusted using the reception baseline assessments, but without controlling for any contextual factors such as age, FSM, EAL or SEN status.
Leckie and Goldstein (2012) used the national pupil database to make an in-depth analysis of the adequacy of secondary school value added rankings to inform parental school choice. In particular they demonstrated that school comparisons at year 11, adjusted for KS2 attainment at year 6, led to value added league tables that were in fact of very limited use for making choices between schools.
They concluded that attempting to rank schools in this way is unsatisfactory for several reasons.
• First, the value added score itself is subject to considerable statistical uncertainty, resulting from the limited number of students in each school cohort on which it can be based.
• Second, to be of any use for choice, one must extrapolate the results from a cohort of students who entered the school six or so years before the year of entry that concerns a parent5.
4 West, Mattei & Roberts (2011) observe that, while there are various types of accountability, in education in England it is the managerial and market forms that dominate. This stems from policies based on choice and competition which necessitate standardised public data to aid comparison and choice. Other countries have been more reluctant to adopt such policies (Mattei, 2012).
5 Leckie and Goldstein (2011) presented a simple way of describing these uncertainties using graphs that gave a clear pictorial comparison by allowing the user to vary the factors affecting the uncertainty of the actual value added scores and by looking at comparisons of different schools. See Appendix 1

In addition, it is often the case that specific subgroups of children, such as initially low-achieving ones, are of interest, in which case any school comparisons become even less reliable since they will be based on smaller numbers.
All of this raises the question: given a set of VA school effects at any given time, what is the prediction for children starting school next year? To make such a prediction one essentially needs to add, to the current ‘error’ expressed in the usual confidence intervals, the uncertainty of predicting from the past cohort to the one starting now. Leckie and Goldstein (2011) found that the estimated school-level correlation across two cohorts five years apart was just 0.64, so combining these uncertainties gives a very weak prediction. Likewise, for pure accountability purposes it is of little use to refer back to events seven years in the past based on simplistic measures.
Comparable analyses do not yet exist for reception baseline tests, as far as we know, yet the analogous uncertainties are almost certainly greater, with much wider confidence intervals that would have to be placed around any school comparisons. First, the time gap between reception and Year 6 is greater than between Year 6 and Year 11; second, the reliability of the baseline tests will be lower than that of the Key Stage 2 tests used by Leckie and Goldstein, because of the inherent measurement error associated with the proposals outlined above. This will be compounded by the fact that age cohorts are much smaller in primary than in secondary schools. Given all these difficulties, the final outcome will be of little use for parental choice.
For all these reasons, we argue that it is irrational to pursue baseline assessment for the stated purpose of school value added comparisons without a proper study by independent researchers of the likely utility of the effort. The present plans do not take note of past DfE work or academic studies of school effectiveness. Pressing ahead with baseline testing to make school comparisons is likely to prove a waste of time and money given the many problems we have enumerated.

Section 5. What is the likely impact of these accountability measures on pupils and schools?

5.1 Delaying feedback
The policy intention appears to be to hold the test data until the cohort reaches Key Stage 2. It is unclear whether there will be any limited release of data to schools during the test year, perhaps only of school aggregate scores. If aggregate scores are released, this will encourage the production of wholly inadmissible ranked league tables of school performance. If nothing is released, teachers and parents will find this frustrating (‘why administer a test that doesn’t help teaching and learning?’), even if withholding results may prevent over-interpretation of individual results from a potentially unreliable test.
In practice, experience from the previous round of baseline testing suggested that many schools did not actually use the data for any purpose. It was seen as just another externally imposed task to be got out of the way. As many schools continued to use their own assessments anyway, this had negative consequences for teachers’ workload.
All this is likely to stoke resentment at having to put children through a ‘useless’ test and at the costs of development and administration – estimated at £10 million per year. This comes at a time of considerable budget reductions in most schools.
5.2 Gaming baseline scores
The well-known Campbell’s Law states: ‘The more any quantitative social indicator is used for social decision-making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor’ (Campbell, 1976, p. 85). There is now extensive evidence that when test results are used in high-stakes accountability systems in education, such as NAPLAN in Australia and the Key Stage 2, GCSE and GCE performance tables in England, they distort the system as schools react by ‘gaming’ to get better results (Koretz et al., 2001; Hursh, 2005; Boyle & Bragg, 2006; Klenowski & Wyatt-Smith, 2012).
Baseline tests in reception are different in some respects, as good scores may not necessarily be seen as a benefit. But this may not inhibit game playing; rather, it simply poses schools new questions. Is it better to start from a low baseline so that the school’s later value added scores seem more impressive (Coe, 2017), or to demonstrate to parents that the school has a ‘good’ intake as measured by the overall baseline test result? However a school ‘plays’ the tests, the result will be a great deal of variability, further reducing the reliability, and therefore the validity, of the tests for the stated purpose of school accountability.
5.3 Distorting younger children’s provision: Concerns raised within the education community
With a baseline test there is always a debate about when to measure the baseline so as to capture the most progress. Many primary schools with Reception classes will also have Nursery classes, and a growing number of those will be catering for 2-year-olds. Generally these are more vulnerable 2-year-olds who are receiving free early education. What will the impact of baseline be on the Nursery curriculum? If the test itself is narrow, then pressure to prepare pupils may narrow the pre-school curriculum in harmful ways.
Many schools see the Early Years as one Key Stage, and already monitor progress in the Foundation Stage from the beginning of Nursery using the Early Years Foundation Stage Profile. Implementing the Reception baseline could split this Early Years work. It could mean either narrowing the developmental work with 2-year-olds to prepare Nursery children for the assessments at Reception, or deflating Nursery outcomes so that Reception baselines stay lower. From the point of view of parents, it could also mean pressure to coach pre-schoolers to ensure they were test-ready, if parents think this means the best results for their children.
5.4 The impact on children in Reception: Concerns raised within the education community
The impact on children undertaking the baseline has been a key concern for Early Years practitioners. Children start school with a wide range of experiences, and ‘settling in’ is a very important part of reception schooling. The temptation could be to administer the testing as early as possible, to get the lowest measure and so capture the progress made in settling in. But this could interfere with building positive relationships and be stressful for a child orientating themselves in an unfamiliar environment. Concerns have been raised about children who might not be able to complete a task and the risk of them feeling they have failed or that school is too hard. Settling-in time varies; age is a consideration, as is context; some children will take longer to settle into school; and the longer-term impact of a difficult start to school needs to be understood. All this will reduce the chances of the test producing reliable data.
This is why the majority of Early Years practitioners favoured the Early Excellence Baseline Programme6 when they had the opportunity to choose. It fitted better with Early Years good practice which is based on professional judgement and the understanding of the developing child as a learner. It also allowed practitioners to ensure settling in time was as stress free as possible, particularly for children who hadn’t spent time away from home or caregivers before; were new to the language; or were unfamiliar with the sorts of activities or resources in a school environment.
5.5 Unintended consequences of baseline testing at 4: Concerns raised within the education community
The baseline test for Reception is not intended to have any diagnostic value for schools or individual children. However, since administering teachers will see the test scores produced, the baseline test could see some children unnecessarily labelled as low ability at the very start of their formal education. This is likely to be a particular issue for summer-born pupils, EAL children and those with SEN. Unless contextual information is collected, the data will not indicate the reasons for these children scoring low.
Practitioners value the more observational style of assessment precisely because it encourages them to use their professional judgment in a more fluid way to support development of young children regardless of their starting point. It has an immediate positive purpose. This was reflected in the fact that observational tests were the most popular choice for baseline in 2015-16. As things stand, the time spent on the baseline tests may result in time lost collecting more useful information about young learners that would enable EY staff to plan and discuss best support for individuals in the Foundation Stage.
5.6 The impact of ’gaming’ baseline on the curriculum across the primary school
It is widely recognised that high stakes accountability can have unintentional impacts on teaching and the way a curriculum is delivered. One of the concerns about using baseline tests to measure progress towards the indicators in the Key Stage 2 tests is the narrowing of the curriculum which may follow. A relentless emphasis on Literacy and Numeracy all the way through primary school is already leading to less focus on foundation subjects such as Science and limiting access to the Arts. Schools have been increasingly using tick boxes and prescribed tools to make decisions on content for lesson planning. This is already reducing too much teaching time to pre-planned drill. Baseline tests with a narrow focus on literacy and numeracy may well further entrench such practice.

Section 6. Are there better alternatives to baseline testing?

The panel considered alternatives to baseline testing that could answer questions about school accountability in a much more productive way.
In her 2002 Reith Lectures, the philosopher Onora O’Neill critiqued current accountability measures in the public services for their emphasis on ‘performance indicators chosen for ease of measurement and control rather than because they measure quality of performance accurately’ (p.54). She called for intelligent accountability in which more trust would be placed in professionals and more attention paid to self-governance, ‘since much that has to be accounted for is not easily measured it cannot be boiled down to a set of stock performance indicators’ (p.58). Her vision was of accountability which ‘provides substantive and knowledgeable independent judgement of an institution’s or professional’s work’ (p.58).

6 The Early Excellence Baseline Scheme was chosen by over 11,000 out of 17,000 primary schools, making it the most popular choice from a practitioners’ perspective for baseline testing in 2015-16.
Crooks (2007) provides six principles for intelligent accountability in education:
1. It preserves and enhances trust among the key participants in the accountability processes
2. It involves participants in the process, offering them a strong sense of professional responsibility and initiative
3. It encourages deep, worthwhile responses rather than surface window dressing
4. It recognises and attempts to compensate for the severe limitations of performance indicators in capturing educational quality
5. It provides well-founded and effective feedback…to support good decision-making
6. It leaves the majority of participants more enthusiastic and motivated in their work.
This is a far more defensible approach than a single measure based on aggregating results from a 20 minute test to be used seven years later as an indicator of school progress.
6.1 Examples of intelligent accountability in the early years include collaborations between academics, Local Authorities and schools that have encouraged reflection on the value of the data collected and the purposes to which they are best put. The Surrey value added project, for example, collected reception baseline data that could be used by practitioners to support individual pupils, while also informing school improvement planning more broadly. It took into account, and made explicit, the important roles of child age and other background effects. It also provided separate measures of school performance in different areas (not one total measure), while taking into account the statistical uncertainty associated with calculating value added measures of school effects. Schools received their own results alongside, in anonymised form, those of other schools in their LA. Participation was voluntary and schools agreed not to use their VA results for marketing purposes. The intention was to support schools in the formative use of data, to ask ‘intelligent’ questions and to focus on improvement. Importantly, to this end the LA provided teachers and schools with guidance and resources on individual education plans to support children whose baseline scores suggested they might need extra support (Sammons & Smees, 1998).
6.2 For many years the London Education Research Network (LERN) has championed the notion of using performance data effectively to aid school improvement and the network has shared good practice in data use across London Boroughs. The approach relies on good partnership working that can foster open and productive conversations between LA School Improvement staff, LA Education data teams and school leadership teams. The aim of the partnership is to underpin robust self-evaluation at school level by providing good comparative performance information and comprehensive training on using national and local data tools. All parties are encouraged to ask intelligent questions of the appropriate data and reflect on how to feed that back most productively into practice.
To do this well requires adequate finance and the appropriate discharge of responsibilities across the different service levels. The reduction in the size of LA school improvement services, along with LAs’ diminished roles and responsibilities for their schools, is making this kind of conversation harder to maintain, though there are examples of where it has continued as a traded service, e.g. Wandsworth, Hounslow, Southwark and Lambeth in London. These LAs have continued to provide effective training at local level in the use of data and research to support school improvement. Lambeth has summarised and documented some of the research it has conducted exploring this practice through a case study of its own journey (Demie, F., 2013, Using Data to Raise Achievement: Good Practice in Schools). Working with the updated Ofsted framework, Local Authorities have placed a bigger focus on the quality of the curriculum and real-time observation of children’s work and progress. This could be championed further as an alternative way of making effective use of already existing local and national data. Lambeth LA has also continued to provide effective training at national level to share good practice in the use of data and research to support school improvement.
Other examples of intelligent accountability come from work undertaken by Hampshire Education Authority, which demonstrated that these principles could be implemented well (Yang et al., 1999). These all serve as workable examples that could be adopted.

7. Conclusion: Key Questions for the DfE

We have raised a number of serious concerns about the introduction of a baseline test in Reception and its proposed use to measure progress to KS2 for accountability purposes. We consider the proposals as they stand to be flawed.
In this closing section we put forward a number of issues that the DfE needs to address before implementing any new baseline tests for accountability purposes.

Given the radical and untested nature of this form of baseline testing, the DfE and NFER will need to offer reassurances to those involved that they have taken into account how they will:
1. Evaluate the impact of the testing on pupils, teachers and schools.
2. Establish whether any groups (for example by SES, ethnicity or gender) are advantaged or disadvantaged by the test.
3. Evaluate the impact of pupil mobility, particularly in more disadvantaged schools and areas, and the allowances they will make for this.
4. Calculate the proportion of schools taking the baseline test, and whether these schools will be advantaged or disadvantaged in the value added system relative to separate Infant and Junior schools.
5. Make available reliability data on the administration and scoring of the tests.
6. Monitor the predictive value of the test and how this compares to the current Early Years assessments.

Further information is also required:

• For children sitting KS2 tests in 2017, what was the relationship between their results and elements of the Foundation Stage assessment, and what proportion of these children were in the same primary school after the 6-7 year period?
• What proportion of pupils in this summer’s school census started in the same school at the beginning of Reception (a mobility study)?
• What proportion of schools in England are separate Infant and Junior schools – and so exempt from this accountability measure?
• What is the impact of controlling for age and other contextual factors beyond schools’ control in school VA comparisons (based on past DfE and other effectiveness research and on studies using baseline measures)? In particular, do schools serving more disadvantaged pupils show poorer results if only VA analyses, rather than contextual value added measures, are calculated?
• Will there be an equalities impact assessment during the testing of the new baseline materials to examine the impact for different groups of children?

References

American Educational Research Association, Joint Committee on Standards for Educational and Psychological Testing, American Psychological Association, and National Council on Measurement in Education (2014) Standards for Educational and Psychological Testing, Washington, DC : American Educational Research Association.
Bradbury, A. and Roberts-Holmes, G. (2016) The Introduction of Reception Baseline Assessment: ‘They are children… not robots, not machines’. London: National Union of Teachers (NUT) and Association of Teachers and Lecturers (ATL). Available at https://www.teachers.org.uk/sites/default/files2014/baselineassessment–final-10404.pdf
Campbell, D. T. (1976) Assessing the impact of planned social change. Evaluation and Program Planning, 2, 1, 67-90.
Department for Education (DfE) (2017) Primary Assessment in England: Government consultation response. London: DfE.
Ecob R & Goldstein H. (1983). Instrumental Variable Methods for the Estimation of Test Score Reliability . Journal of Education Statistics 8 (3) 223-241.
Foley, B. and Goldstein, H. (2012) Measuring Success: League tables in the public sector. London: British Academy.
Boaler, J. (2009) The elephant in the classroom, London: Souvenir Press.
Bradbury, A., Jarvis, P., Nutbrown, C., Roberts-Holmes, G., Stewart, N. and Whitebread, D. (2018) More than a score. Baseline assessment: why it doesn’t add up. London: More Than a Score. Available from https://morethanascorecampaign.files.wordpress.com/2018/02/neu352-baseline-a4-16pp-crop.pdf
Coe, R. (2017) Education Select Committee evidence quoted in Bradbury et al. (2018).
Crooks, T. J. (2007) Principles for Intelligent Accountability, with illustrations from education, Inaugural Professorial Lecture, University of Otago 4/10/2007.
Crooks, T. J., Kane, M.T. and Cohen, A.S. (1996) Threats to the valid use of assessment, Assessment in Education, 3, 3, 265-285.
Educational Testing Service (ETS) (2004) ETS international principles for fairness review of assessment. Princeton, NJ: ETS.
Hart, S., Dixon, A., Drummond, M.J. and McIntyre, D. (2004) Learning without Limits, Maidenhead, Open University Press.
Kim, J., & Suen, H. K. (2003). Predicting children’s academic achievement from early assessment scores: A validity generalization study. Early Childhood Research Quarterly, 18, 547–566.
LaParo, K. M. and Pianta, R. C. (2000) Predicting children’s competence in the early school years: A meta-analytic review. Review of Educational Research, 70, 4, 443-484.
Lynch, S., Mills, B., Theobald, K., Worth, J. (2017) Keeping Your Head: NFER Analysis of Headteacher Retention. Slough: NFER
Mattei, P. (2012) Market accountability in schools: policy reforms in England, Germany, France and Italy, Oxford Review of Education, 38, 3, 247-266.
Messick, S. (1989) Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18, 2, 5-11.
Meisels, S. and Atkins-Burnett, S. (2006) Evaluating early childhood assessments: A differential analysis, in K. McCartney and D. Phillips (Eds) Blackwell Handbook of Early Childhood Development, Chapter 26, 533-549. Hoboken, NJ: Blackwell-Wiley.
Newton, P. E. (2007) Clarifying the purposes of educational assessment. Assessment in Education: Principles, Policy & Practice, 14, 2, 149-170.
Newton, P. E. (2009) The reliability of results from national curriculum testing in England. Educational Research, 51, 2, 181-212.
O’Neill, O. (2002) A Question of Trust: BBC Reith Lectures 2002. London: BBC.
Raudenbush, S. (2004) Schooling, Statistics, and Poverty: Can We Measure School Improvement? Princeton, NJ: Educational Testing Service
Rodda, M. with Hallgarten, J. and Freeman, J. (2013) Between the cracks: Exploring in-year admissions in schools in England. London: RSA Action and Research Centre
Shepard, L. A., Kagan, S. L, & Wurtz, E. (Eds.). (1998). Principles and recommendations for early childhood assessments. Washington, DC: National Education Goals Panel.
Standards and Testing Agency (STA) (2016) Reception baseline comparability study: Results of the 2015 study. London: DfE. Available at https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/514581/Reception_baseline_comparability_study.pdf
Stiggins, R. J. (2000) Student-Involved Classroom Assessment. London: Pearson.
West, A., Mattei, P. and Roberts, J. (2011) Accountability and sanctions in English schools, British Journal of Educational Studies, 59,1, 41-62.
Gipps, C.& Stobart, G. (2010) Fairness, in B.McGraw, E. Baker, & P. Peterson (Eds) International Encyclopedia of Education, 3rd Edition, 56-60, Elsevier.
Leckie, G. and Goldstein, H. (2011). Understanding uncertainty in school league tables. Fiscal studies, 32, 207-224.
Melhuish, E., Sylva, K., Sammons, P., Siraj-Blatchford, I., Taggart, B. & Phan, M. (2008), Effects of the Home Learning Environment and preschool center experience upon literacy and numeracy development in early primary school. Journal of Social Issues, Vol. 64 (1) pp. 95-114.
Sammons, P., Sylva, K., & Mujtaba, T. (2000) What Do Baseline Assessment Schemes Measure? A Comparison of the QCA and Signposts Schemes, Report prepared for the Qualifications and Curriculum Authority, London: Institute of Education University of London.
Sammons, P. & Smees, R. (1998) Measuring Pupil Progress at Key Stage 1: using baseline assessment to investigate value added, School Leadership and Management, Vol. 18, No 1, pp 389-407.
Sammons, P., Sylva, K., Melhuish, E. C., Siraj-Blatchford, I., Taggart, B., & Elliot, K. (2002). The Effective Provision of Pre-school Education Project, Technical Paper 8a: Measuring the impact on children’s cognitive development over the pre-school years. London: Institute of Education, University of London/DfES. http://dera.ioe.ac.uk/18189/11/EPPE_TechnicalPaper_08a_2002.pdf
Sammons, P., Sylva, K., Melhuish, E. C., Siraj-Blatchford, I., Taggart, B & Elliot, K. (2003). The Effective Provision of Pre-school Education Project, Technical Paper 8b: Measuring the impact on children’s social behavioural development over the pre-school years. London: Institute of Education, University of London/DfES. http://dera.ioe.ac.uk/18189/12/EPPE_TechnicalPaper_08b_2003.pdf
Sammons, P., Sylva, K., Melhuish, E., Siraj-Blatchford, I., Taggart, B, Elliott, K., & Marsh, A. (2004). The Effective Provision of Pre-school Education (EPPE) Project: Technical Paper 11: The continuing effect of pre-school education at age 7 years. London: Institute of Education, University of London. http://dera.ioe.ac.uk/18189/15/EPPE_TechnicalPaper_11_2004.pdf
Sammons, P., Sylva, K., Melhuish, E., Siraj-Blatchford, I., Taggart, B., and Jelicic, H. (2008a). Influences on Children’s Development and Progress in Key Stage 2: Social/behavioural outcomes in Year 6. Research Report DCSF-RR049, ISBN 978 1 84775 230 7. London: DCSF http://dera.ioe.ac.uk/18192/1/DCSF-RR049.pdf
Sammons, P., Sylva, K., Melhuish, E., Siraj-Blatchford, I., Taggart, B., and Hunt, S. (2008b). Influences on Children’s Attainment and Progress in Key Stage 2: Cognitive outcomes in Year 6. Effective Pre-School and Primary Education 3-11 Project (EPPE 3-11), Research Report DCSF-RR048. London: DCSF http://dera.ioe.ac.uk/18190/1/DCSF-RR048.pdf
Sammons, P., Anders, Y., Sylva, K., Melhuish, E., Siraj-Blatchford, I., Taggart, B. and Barreau, S. (2008), ‘Children’s Cognitive Attainment and Progress in English Primary Schools During Key Stage 2: Investigating the potential continuing influences of pre-school education, Zeitschrift für Erziehungswissenschaften, 10. Jahrg., Special Issue (Sonderheft) 11/2008, pp. 179-198.
Sammons, P., Toth, K., Sylva, K., Melhuish, E., Siraj, I., & Taggart, B. (2015) The long-term role of the home learning environment in shaping students’ academic attainment in secondary school, Journal of Children’s Services, Vol. 10, No. 3, pp189-201. Emerald Literati Outstanding Paper Award 2016 Journal of Children’s Services
Sharma, N. (2016) Pupil Mobility: what does it cost London? A London Councils Member Briefing April 2016, London: London Councils
Sylva, K., Melhuish, E., Sammons, P., Siraj-Blatchford, I. and Taggart, B. (2004). The Effective Provision of Pre-School Education (EPPE) Project: Technical Paper 12 – The final report. London: DfES / Institute of Education. http://dera.ioe.ac.uk/18189/16/EPPE_TechnicalPaper_12_2004.pdf
Sylva, K., Melhuish, E., Sammons, P., Siraj-Blatchford, I. and Taggart, B. (2006) Influences on children’s attainment, progress and social/behavioural development in primary school, pp. 22-62, Part 1 of Promoting Equality in the Early Years: Report to The Equalities Review. London: Cabinet Office. Effective Pre-school and Primary Education 3-11 (EPPE 3-11) Team (2007), http://ro.uow.edu.au/cgi/viewcontent.cgi?article=2176&context=sspapers
Tymms, P., Merrell, C., Hawker, D. & Nicholson, F. (2014) Performance Indicators in Primary Schools: A comparison of performance on entry to school and the progress made in the first year in England and four other jurisdictions, Department for Education Research Report. 344 https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/318052/RR344_-_Performance_Indicators_in_Primary_Schools.pdf