Papers 2020

Forecasting volcanic eruptions

Forecasting the repose between eruptions at a volcano is a key goal of volcanology for emergency planning and preparedness. Previous studies have used the statistical distribution of prior repose intervals to estimate the probability of a certain repose interval occurring in the future, and to offer insights into the underlying physical processes that govern eruption frequency. However, distributions are only decipherable after the eruption, when a full dataset is available, or not at all in the case of an incomplete time series. There is therefore value in an approach to forecasting likely repose intervals that does not assume an underlying distribution and that can make use of additional information that may be related to the duration of repose. The use of a non-parametric survival model is novel in volcanology, as the size of eruption records is typically insufficient. Here, we apply a non-parametric Bayesian grouped-time Markov chain Monte Carlo (MCMC) survival model to the extensive 58-year eruption record (1956 to 2013) of Vulcanian explosions at Sakura-jima volcano, Japan. The model allows the use of multiple observed and recorded data sets, such as plume height or seismic amplitude, even if some of the information is incomplete, so any relationships between explosion variables and the subsequent or prior repose interval can be investigated. The model successfully forecast future repose intervals for Sakura-jima using information about the prior plume height, plume colour and repose durations. For plume height, smaller plumes are followed by shorter repose intervals. This provides one of the first statistical models that uses plume height to quantitatively forecast explosion frequency.
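
As an illustration of the general approach (not the paper's exact model or data), the sketch below fits a grouped-time survival model by random-walk Metropolis sampling, with a single covariate standing in for plume height. The bin structure, priors and parameter values are hypothetical placeholders rather than the Sakura-jima record.

```python
# Minimal sketch of a grouped-time (discrete-time) survival model with one
# covariate, fitted by random-walk Metropolis. All data here are simulated
# placeholders, not the Sakura-jima eruption record.
import numpy as np

rng = np.random.default_rng(0)

K = 6                                  # number of grouped time bins
n = 200
plume = rng.normal(size=n)             # standardised plume height (placeholder)
true_alpha = np.linspace(-2.0, -0.5, K)
true_beta = -0.6                       # larger plume -> lower hazard -> longer repose

def sample_interval(x):
    """Draw a repose interval (bin index); force the event into bin K if it
    has not occurred earlier, so the sketch avoids handling censoring."""
    for j in range(K):
        h = 1.0 / (1.0 + np.exp(-(true_alpha[j] + true_beta * x)))
        if rng.random() < h:
            return j + 1
    return K

repose_bin = np.array([sample_interval(x) for x in plume])

def log_post(alpha, beta):
    """Discrete-time survival log likelihood plus weak normal priors."""
    lp = -0.5 * (np.sum(alpha**2) + beta**2) / 10.0
    for x, k in zip(plume, repose_bin):
        h = 1.0 / (1.0 + np.exp(-(alpha[:k] + beta * x)))
        lp += np.sum(np.log1p(-h[:-1])) + np.log(h[-1])
    return lp

# Random-walk Metropolis over (alpha_1..alpha_K, beta)
theta = np.zeros(K + 1)
cur = log_post(theta[:K], theta[K])
draws = []
for it in range(3000):
    prop = theta + 0.05 * rng.normal(size=K + 1)
    new = log_post(prop[:K], prop[K])
    if np.log(rng.random()) < new - cur:
        theta, cur = prop, new
    if it >= 1500:
        draws.append(theta.copy())

draws = np.array(draws)
print("posterior mean beta:", draws[:, K].mean())  # plume height vs. hazard
```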

Anonymisation of linked datasets

The requirement to anonymise datasets that are to be released for secondary analysis should be balanced by the need to allow their analysis to provide efficient and consistent parameter estimates. The proposal in this paper is to integrate the processes of anonymisation and data analysis. The first stage adds random noise with known distributional properties to some or all variables in a released (already pseudonymised) data set, in a setting where the values of some identifying and sensitive variables for data subjects of interest are also available to an external ‘attacker’ who wishes to identify those data subjects in order to interrogate their records in the dataset. The second, analysis, stage accounts for the noise addition in the data to provide the required parameter estimates. Where the characteristics of the noise are made available to the analyst by the data provider, we propose a new method that allows a valid analysis. This is formally a measurement error model, and we describe a Bayesian MCMC algorithm that recovers consistent estimates of the true model parameters. A novel method for handling categorical data is presented, and the paper shows how an appropriate noise distribution can be determined.
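
To illustrate the basic idea of noise addition followed by a noise-aware analysis, the sketch below adds Gaussian noise of known variance to one released variable and then corrects the naive regression slope with a simple moment-based adjustment. This is a simplified stand-in for the Bayesian MCMC measurement error model described in the paper; the variable names and noise scale are illustrative.

```python
# Minimal sketch: release w = x + noise with known variance, then correct the
# naive regression slope for attenuation. A moment-based correction for a
# single continuous predictor, not the paper's Bayesian MCMC algorithm.
import numpy as np

rng = np.random.default_rng(1)

n = 5000
x = rng.normal(size=n)                   # sensitive variable (placeholder)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)

# Anonymisation stage: the data provider releases w and publishes the noise
# variance tau2 so that a valid analysis remains possible.
tau2 = 0.5
w = x + rng.normal(scale=np.sqrt(tau2), size=n)

# Naive analysis ignoring the added noise: the slope is attenuated.
beta_naive = np.cov(w, y)[0, 1] / np.var(w, ddof=1)

# Correction using the known noise variance: var(w) = var(x) + tau2.
beta_corrected = np.cov(w, y)[0, 1] / (np.var(w, ddof=1) - tau2)

print(f"naive slope:     {beta_naive:.3f}")
print(f"corrected slope: {beta_corrected:.3f}")   # close to the true 1.5
```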

Use of accountability for school progress

In Australia, under the National Assessment Plan, educational accountability testing in literacy and numeracy (NAPLAN) is undertaken with all students in Years 3, 5, 7 and 9 to monitor student achievement and inform policy. However, the extent to which these data have been analyzed to report student progress is limited. This article reports a study analyzing Year 3 and Year 5 NAPLAN reading and numeracy data, together with school and student information, for a single student cohort from Queensland, Australia, to examine student achievement and progress. The analyses use longitudinal multilevel modelling, incorporating an enhanced approach to missing data imputation, given that such data frequently involve large amounts of missing data and failure to account properly for such missing data may bias interpretations of analyses. Further, statistical adjustments to deal with the impact of measurement error, an aspect not previously addressed in such analyses, are undertaken. A particular focus of the analyses is the achievement of Australian Indigenous and non-Indigenous students. International and national data demonstrate a considerable achievement gap between these students. “Closing the gap” is a core Australian education equity policy, with NAPLAN data used as a primary indicator of policy impact. Overall, the analyses indicate that a greater understanding of progress for all students is available from Australian data if appropriate analyses are undertaken. However, the analyses also demonstrate not only that the gap between Australian Indigenous and non-Indigenous student progress increases as students move through school but also that there is diversity of achievement within the Indigenous student cohort. Implications for policy are considered.
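
For illustration, the sketch below fits a two-level random-intercept model of Year 5 attainment on Year 3 attainment and an Indigenous-status indicator, using simulated data with hypothetical variable names (y3, y5, indigenous, school_id). It does not reproduce the paper's missing-data imputation or measurement error adjustments.

```python
# Minimal sketch of a two-level progress model: Year 5 score regressed on
# Year 3 score and an Indigenous-status indicator, with a random intercept
# for school. Data and variable names are simulated placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

n_schools, per_school = 60, 40
school_id = np.repeat(np.arange(n_schools), per_school)
school_effect = rng.normal(scale=5.0, size=n_schools)[school_id]

indigenous = rng.binomial(1, 0.1, size=school_id.size)
y3 = 400 + school_effect - 20 * indigenous + rng.normal(scale=30, size=school_id.size)
y5 = (60 + 0.9 * y3 + school_effect - 10 * indigenous
      + rng.normal(scale=25, size=school_id.size))

df = pd.DataFrame({"school_id": school_id, "indigenous": indigenous,
                   "y3": y3, "y5": y5})

# Random-intercept model: progress conditional on prior attainment, with the
# Indigenous coefficient summarising the adjusted gap at Year 5.
model = smf.mixedlm("y5 ~ y3 + indigenous", df, groups=df["school_id"])
result = model.fit()
print(result.summary())
```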

Measurement error in scale scores

The presence of randomly distributed measurement errors in scale scores such as those used in educational and behavioural assessments implies that careful adjustments to statistical model estimation procedures are required if inferences are required for ‘true’ as opposed to ‘observed’ relationships. In many cases this requires the use of external values for ‘reliability’ statistics or ‘measurement error variances’, which may be provided by a test constructor or else inferred or estimated by the data analyst. Popular measures include those described as ‘internal consistency’ estimates, as well as measures based on data grouping. All such measures, however, make particular assumptions that may be questionable but are often not examined. In this paper we focus on scaled scores derived from aggregating a set of indicators, and set out a general methodological framework for exploring different ways of estimating reliability statistics and measurement error variances, critiquing certain approaches and suggesting more satisfactory methods in the presence of longitudinal data. In particular, we explore the assumption of local (conditional) item response independence and show how a failure of this assumption can lead to biased estimates in statistical models using scaled scores as explanatory variables. We illustrate our methods using a large longitudinal dataset of mathematics test scores from Queensland, Australia.
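
The sketch below illustrates the central point about local independence: Cronbach's alpha is computed for a summed scale score, first with locally independent item errors and then with a shared nuisance factor, in which case alpha overstates reliability and the implied measurement error variance is too small. The simulated items are placeholders, not the Queensland data or the paper's longitudinal estimators.

```python
# Minimal sketch: Cronbach's alpha as an internal-consistency reliability
# estimate for a summed scale score, and how a shared nuisance factor that
# violates local item independence inflates it. Items are simulated.
import numpy as np

rng = np.random.default_rng(3)

def cronbach_alpha(items):
    """items: (n_persons, n_items) array of item scores."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

n, k = 2000, 20
ability = rng.normal(size=n)

# Locally independent items: unique errors only.
indep = ability[:, None] + rng.normal(scale=1.0, size=(n, k))

# Items sharing a nuisance factor (e.g. a common testing-occasion effect),
# so residuals are correlated and local independence fails.
nuisance = rng.normal(scale=0.7, size=n)
dep = ability[:, None] + nuisance[:, None] + rng.normal(scale=1.0, size=(n, k))

for name, items in [("independent errors", indep), ("correlated errors", dep)]:
    alpha = cronbach_alpha(items)
    total = items.sum(axis=1)
    # Measurement error variance of the scale score implied by alpha:
    err_var = total.var(ddof=1) * (1 - alpha)
    print(f"{name}: alpha = {alpha:.3f}, implied error variance = {err_var:.1f}")

# With correlated errors, alpha treats the shared nuisance as true score, so
# it overstates reliability and understates the measurement error variance.
```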