Bayesian modeling for Classic Test Theory

Bayesian Psychometric Modeling by Levy and Mislevy (2016) provides a nice discussion of Bayesian modeling for the Classic Test Theory (CTT). The classic references are Lord and Novick (1968) and McDonald (1999).

Initially, Levy and Mislevy discuss the CTT by considering a single test and assuming that the measurement model parameters and the hyperparameters are known. Then they generalize to the case where there are multiple tests, still assuming that the measurement model parameters and hyperparameters are known. Finally, they consider the case where the measurement model parameters and hyperparameters are unknown. In their book, Levy and Mislevy use WinBUGS but here I re-write one of their models in Stan. I will consider the case in which the measurement model parameters and hyperparameters are unknown.

For each examinee $i$, $x_i$ is the observable test score. The test can be made up by several items and the test score is the sum of the responses to the individual items. According to CTT, $x_i$ is the sum of a true score (for each examinee) and of an error component:

$$x_i = T_i + E_i,$$
where `$T_i$` is the true score for examinee `$i$`, with mean `$\mu$` and variance `$\sigma_T^2$` and `$E$` is the error component for examinee `$i$`, with mean 0 and variance `$\sigma_E^2$`. Errors are assumed to be uncorrelated with true scores in the population,
$$\rho_{TE} = 0.$$

The variance of the observed scores, $\sigma^2_x$, is the sum of the true variance and of the error variance,

$$\sigma^2_x = \sigma^2_T + \sigma^2_E,$$

from which the reliability of the test is given by the ratio between the variance of the true scores and the variance of the observed scores (that is, the proportion of observed score variance that is accounted for by the true score variance):

$$\rho = \frac{\sigma^2_T}{\sigma^2_x}.$$

The reliability $\rho$ also corresponds to the squared correlation between the observed scores and the true scores, $\rho = \rho_{xT}^2$.

Levy and Mislevy (2016) discuss how the true scores can be estimated with Kelley’s formula (Kelley, 1923). Kelley’s formula shifts the estimate of each examinee’s true score in the direction of the sample mean by a measure which depends on the reliability of the test. If $\rho = 1$ the observed score becomes the estimate of the true score; if $\rho = 0$, $\mu_x$ becomes the best estimate of the true score. The best estimate of the true score will therefore fall between $x_i$ and $\mu_x$, and it will be closer to $x_i$ if the reliability of the test is higher.

Levy and Mislevy (2016) show that the posterior mean obtained from a Bayesian implementation of the CTT is consistent with Kelley’s formula. They provide a formal proof of this, but here I will only replicate the MCMC estimation via Stan. I will consider here their last example, in which they consider the case of the CTT with an unknown measurement model. Specifically, Levy and Mislevy analyze the data of 10 examinees, each of them with 5 observable measures. This can be understood as the case in which the same test is administered to each examinee on five occasions, with no effects of learning or fatigue.

I focus here on the case in which $\mu_T$, $\sigma_T^2$, and $\sigma_E^2$ are unknown. The purpose is to estimate the reliability of the test and to estimate the true scores of the examinees by using only the observed scores.

I have re-written in Stan the WinBUGS model that Levy and Mislevy (2016) present in section 8.3.3.3. Since I will use rstan, I saved the following code in a file called ctt_2.stan.

data {
  int<lower=0> N; //the number of subjects (rows in the X matrix)
  int<lower=0> J; //the number of columns in the X matrix
  matrix[N, J] X; //the data matrix
}
parameters {
  vector[N] T; //the true scores
  real mu_T; // mean of the distribution of true scores
  real<lower=0> sigma_T; // sd of the distribution of true scores
  real<lower=0> sigma_E; // sd of the distribution of errors
}
transformed parameters {
}
model {
  // Priors
  mu_T ~ normal(0, 200);
  sigma_T ~ normal(0, 20);
  sigma_E ~ normal(0, 20);
  // Likelihood
  for (i in 1:N) {
      T[i] ~ normal(mu_T, sigma_T);
      for (j in 1:J) {
          X[i, j] ~ normal(T[i], sigma_E);
      }
  }
}
generated quantities {
  real<lower=0> reliability;
  real<lower=0> reliability_composite;

  reliability = sigma_T^2 / (sigma_T^2 + sigma_E^2);
  reliability_composite = J * reliability / ((J - 1) * reliability + 1);
}

Differently from Levy and Mislevy (2016), who used priors for $\mu_T$, $\sigma_T$, and $\sigma_E$ centered on the correct values, I use much less informative priors in order to test the robustness of the estimation. I will not discuss here the details of the code. It is easy to find many introductions of Stan on the web. Let see instead how we can run this code in R. I used the following instructions:

rm(list = ls(all = TRUE))
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

The previous lines of code will load Stan in the R environment after clearing the workspace. With the following instructions, I create a list with the data in the appropriate format for Stan. N is the number of examinees, J is the number of administrations of the test for each examinee, and X is the data matrix. Each row of X contains the results of one examinee, with the columns providing the results of the repeated administrations of the test.

stan_dat <- list(
  N = 10,
  J = 5,
  X = matrix(
    c(
      80, 77, 80, 73, 73,
      83, 79, 78, 78, 77,
      85, 77, 88, 81, 80,
      76, 76, 76, 78, 67,
      70, 69, 73, 71, 77,
      87, 89, 92, 91, 87,
      76, 75, 79, 80, 75,
      86, 75, 80, 80, 82,
      84, 79, 79, 77, 82,
      96, 85, 91, 87, 90
    ), nrow = 10, ncol = 5, byrow = TRUE)
)

I fitted the model contained in the ctt_2.stan file as follows:

fit <- stan(file = "ctt_2.stan",
               data = stan_dat,
               pars = c("T", "mu_T", "sigma_T",
                        "sigma_E", "reliability",
                        "reliability_composite"),
               iter = 20000,
               chains = 4)

Here are the results:

Inference for Stan model: ctt_2.
4 chains, each with iter=20000; warmup=10000; thin=1;
post-warmup draws per chain=10000, total post-warmup draws=40000.

                         mean se_mean   sd    2.5%     25%     50%     75%   97.5% n_eff Rhat
T[1]                    76.85    0.01 1.55   73.85   75.81   76.84   77.87   79.92 32011    1
T[2]                    79.07    0.01 1.54   76.05   78.04   79.07   80.08   82.12 33954    1
T[3]                    82.06    0.01 1.52   79.09   81.03   82.06   83.08   85.03 32860    1
T[4]                    74.99    0.01 1.53   71.96   73.97   74.99   76.00   78.01 32862    1
T[5]                    72.57    0.01 1.56   69.51   71.53   72.55   73.59   75.71 32570    1
T[6]                    88.55    0.01 1.57   85.41   87.53   88.58   89.60   91.58 31220    1
T[7]                    77.20    0.01 1.52   74.23   76.18   77.20   78.20   80.21 32944    1
T[8]                    80.58    0.01 1.52   77.57   79.57   80.58   81.59   83.56 33246    1
T[9]                    80.19    0.01 1.53   77.18   79.16   80.18   81.20   83.21 33681    1
T[10]                   89.12    0.01 1.58   85.94   88.07   89.14   90.18   92.17 31818    1
mu_T                    80.09    0.01 2.17   75.80   78.76   80.09   81.43   84.40 24525    1
sigma_T                  6.47    0.01 1.94    3.80    5.14    6.12    7.40   11.20 21723    1
sigma_E                  3.56    0.00 0.41    2.86    3.26    3.52    3.81    4.48 23602    1
reliability              0.74    0.00 0.11    0.50    0.67    0.75    0.82    0.92 24608    1
reliability_composite    0.93    0.00 0.04    0.83    0.91    0.94    0.96    0.98 25083    1
lp__                  -107.17    0.02 2.83 -113.67 -108.86 -106.81 -105.12 -102.70 14242    1

Samples were drawn using NUTS(diag_e) at Sun Nov 27 09:08:53 2016.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

The Stan implementation of Levy and Mislevy’s model works. For example, for the estimate of the true score of the first examinee, Levy and Mislevy (2016) report a mean of 76.85, a standard deviation of 1.53, and a 95% CI of $[73.87, 79.83]$. With much more generic priors, I find a mean of 76.85, a standard deviation of 1.55, and a 95% CI of $[73.85, 79.92]$. Levy and Mislevy report a reliability of the composite score of 0.93, with a standard deviation of 0.04, and a 95% CI of $[0.86, 0.98]$. I find a reliability of 0.94, with a standard deviation of 0.04, and a 95% CI of $[0.83, 0.98]$.

It is interesting that, even when almost uninformative priors are used, correct estimates for the parameters can be obtained.

Corrado Caudek
Corrado Caudek
Psychometrics and Quantitative Psychology

Investigating cognitive processes and individual differences.