Title: | Classification Accuracy and Consistency under Item Response Theory |
---|---|
Description: | Computes classification accuracy and consistency indices under Item Response Theory. Implements the total score IRT-based methods in Lee, Hanson & Brennen (2002) and Lee (2010), the IRT-based methods in Rudner (2001, 2005), and the total score nonparametric methods in Lathrop & Cheng (2014). For dichotomous and polytomous tests. |
Authors: | Quinn N. Lathrop |
Maintainer: | Quinn N. Lathrop <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.4 |
Built: | 2025-02-21 04:59:33 UTC |
Source: | https://github.com/cran/cacIRT |
Computes classification accuracy and consistency under Item Response Theory by the approach proposed by Lee, Hanson & Brennen (2002) and Lee (2010), the approach proposed by Rudner (2001, 2005), and the approach proposed by Lathrop & Cheng (2014).
Package: | cacIRT |
Type: | Package |
Version: | 1.3 |
Date: | 2015-08-15 |
License: | GPL (>= 2) |
This packages computes classification accuracy and consistency indices with two approaches proposed by Lee, Hanson & Brennan (2002) and Lee (2010) or by Rudner (2001, 2005). The two functions class.Lee()
and class.Rud()
are the wrapper functions for the most common implementations of the respective approaches. They accept a range of inputs: ability estimates, quadrature points, or response data matrix and item parameters. Marginal indices are computed with either the D (using a theoretical or simulated distribution) or P (using the sample directly) method (see Lee (2010)). The function recursive.raw()
computes the probabilities of total scores given ability and item parameters and may be of interest outside of classification.
The major difference between the Lee approach and the Rudner approach is the scale that the classification occurs on. The Lee approach uses the total score scale, and finds the probability of each total score given an examinee's latent ability estimate and the item parameters. The cut score is also given as a total score. The Rudner approach occurs on the latent trait scale, and is given a cut score on the latent trait scale. Dispite their similarities, the two estimators generally do not estimate the same index, see Lathrop & Cheng (2013) and Lathrop (2015) for discussion and simulation studies.
A new nonparametric approach is also provided with pnr()
and Lee.pnr()
. It is a nonparametric extension to the Lee approach and is explained and tested in Lathrop & Cheng (2014). This approach does not require an assumption of a parametric IRT model or a parametric ability distribution and is often more accurate when those assumptions are violated compared to parametric approaches.
Polytomous tests (where item responses are in more categories than two ordered categories) are easily computed with Lee.pnr()
and class.Rud
. To use Lee's (2010) approach with polytomous or mixed format tests, use Lee.poly.P()
, Lee.poly.D()
, and/or gen.rec.raw()
.
Quinn N. Lathrop
Maintainer: <quinn.lathrop @ gmail.edu>
Lathrop, Q. N., & Cheng, Y. (2013) Two Approaches to Estimation of Classification Accuracy Rate Under Item Response Theory. Applied Psychological Measurement, 37, 226-241.
Lathrop, Q. N., & Cheng, Y. (2014). A Nonparametric Approach to Estimate Classification Accuracy and Consistency. Journal of Educational Measurement, 51(3), 318-334.
Lee, W. (2010) Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47, 1-17.
Lee, W., Hanson, B. A., & Brennan, R. L. (2002) Estimating consistency and accuracy indices for multiple classifications. Applied Psychological Measurement, 26, 412-432.
Lee, W., & Kolen, M. J. (2008) IRT-class: IRT classification consistency and accuracy (version 2.0).
Rudner, L. M. (2001) Computing the expected proportions of misclassified examinees. Practical Assessment, Research & Evaluation, 7(14), 1-5.
Rudner, L. M. (2005) Expected classification accuracy. Practical Assessment Research & Evaluation, 10(13), 1-4.
Computes classification accuracy and consistency with Lee's approach. The probability of each possible total score conditional on ability is found with recursive.raw
. Those probabilities are grouped according to the cut scores and used to estimate the indices. See references or code for details.
class.Lee(cutscore, ip, ability = NULL, rdm = NULL, quadrature = NULL, D = 1.7) Lee.D(cutscore, ip, quadrature, D = 1.7) Lee.P(cutscore, ip, theta, D = 1.7)
class.Lee(cutscore, ip, ability = NULL, rdm = NULL, quadrature = NULL, D = 1.7) Lee.D(cutscore, ip, quadrature, D = 1.7) Lee.P(cutscore, ip, theta, D = 1.7)
cutscore |
A scalar or vector of cut scores on the True Score scale. If you have cut scores on the theta scale, you can transform them with |
ip |
Matrix of item parameters, columns are discrimination, difficultly, guessing, respectively. For 1PL and 2PL, still give a Jx3 matrix, with |
ability , theta
|
Ability estimates for each subject. |
rdm |
The response data matrix with rows as subjects and columns as items |
quadrature |
A list containing 1) The quadrature points and 2) Their corresponding weights |
D |
Scaling constant for IRT parameters, defaults to 1.7, alternatively often set to 1. |
Must give only one ability, rdm, or quadrature. If ability is given, those scores are used for the P method. If rdm is given, ability is estimated with MLE (perfect response patterns given a -4 or 4) and used for the P method. If quadrature, the D method is used. class.Lee
calls Lee.D
or Lee.P
.
Marginal |
A matrix with two columns of marginal accuracy and consistency per cut score (and simultaneous if multiple cutscores are given) |
Conditional |
A list of two matrixes, one for conditional accuracy and one for conditional consistency. Each matrix has one row per subject (or quadrature point). |
In order to score above a cut, an examinee must score at or above the cut score. Since we are working on the total score scale, be aware that if a cut score is given with a decimal (like 2.4), the examinee must have a total score at the next integer or more (so 3 or more) to score above the cut.
Quinn N. Lathrop
Lee, W. (2010) Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47, 1–17.
##from rdm, item parameters denote 4 item 1PL test, cut score at x=2 ##only print marginal indices params<-matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) rdm<-sim(params, rnorm(100)) class.Lee(2, params, rdm = rdm)$Marginal ##or from 40 quadrature points and weights, 2 cut scores quad <- normal.qu(40) class.Lee(c(2,3), params, quadrature = quad, D = 1)$Marginal
##from rdm, item parameters denote 4 item 1PL test, cut score at x=2 ##only print marginal indices params<-matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) rdm<-sim(params, rnorm(100)) class.Lee(2, params, rdm = rdm)$Marginal ##or from 40 quadrature points and weights, 2 cut scores quad <- normal.qu(40) class.Lee(c(2,3), params, quadrature = quad, D = 1)$Marginal
Computes classification accuracy and consistency with Rudner's approach. For each examinee, a normal distribution is created with mean at the ability estimate and standard deviation equal to the standard error of the ability estimate. Rudner's method assumes the standard error is conditionally normally distributed. The area under this normal curve between cut scores is used to estimate the indices. See references.
class.Rud(cutscore, ip, ability = NULL, se = NULL, rdm = NULL, quadrature = NULL, D = 1.7) Rud.D(cutscore, quadrature, sem) Rud.P(cutscore, theta, sem)
class.Rud(cutscore, ip, ability = NULL, se = NULL, rdm = NULL, quadrature = NULL, D = 1.7) Rud.D(cutscore, quadrature, sem) Rud.P(cutscore, theta, sem)
cutscore |
A scalar or vector of cut scores on the theta scale. Should not include +- inf, the function will include them. |
ip |
Matrix of item parameters, columns are discrimination, difficultly, guessing. For 1PL and 2PL, still give a Jx3 matrix, with |
ability , theta
|
Ability estimates for each subject. |
se , sem
|
Standard errors of ability estimates |
rdm |
The response data matrix with rows as subjects and columns as items |
quadrature |
A list containing [[1]] The quadrature points and [[2]] Their corresponding weights |
D |
The scaling constant for the IRT parameters, defaults to 1.7, alternatively often set to 1. |
Must give only ability and se, rdm, or quadrature. If ability and se are given, those scores are used for the P method. If rdm is given, ability and se are estimated with MLE (perfect response patterns given a -4 or 4) and used for the P method. If quadrature, the D method is used.
Marginal |
A matrix with two columns of marginal accuracy and consistency per cut score and/or simultaneous |
Conditional |
A list of two matrixes, one for conditional accuracy and one for conditional consistency. Each matrix has one row per subject (or quadrature point). |
class.Rud
is a wrapper for Rud.P
and Rud.D
.
Quinn Lathrop
Rudner, L. M. (2001) Computing the expected proportions of misclassified examinees. Practical Assessment, Research & Evaluation, 7(14), 1–5.
Rudner, L. M. (2005) Expected classification accuracy. Practical Assessment Research & Evaluation, 10(13), 1–4.
##from rdm, item parameters denote 4 item 1PL test, cut score at theta=.5 ##only return marginal indices params<-matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) rdm<-sim(params, rnorm(100)) class.Rud(.5, params, rdm = rdm)$Marginal ##or from 40 quadrature points and weights, 2 cut scores quad <- normal.qu(40) class.Rud(c(-.5,1.5), params, quadrature = quad, D = 1)$Marginal
##from rdm, item parameters denote 4 item 1PL test, cut score at theta=.5 ##only return marginal indices params<-matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) rdm<-sim(params, rnorm(100)) class.Rud(.5, params, rdm = rdm)$Marginal ##or from 40 quadrature points and weights, 2 cut scores quad <- normal.qu(40) class.Rud(c(-.5,1.5), params, quadrature = quad, D = 1)$Marginal
Computes classification accuracy and consistency with Lee's approach for polytomous tests. The probability of each possible total score conditional on ability is found with gen.rec.raw()
. Those probabilities are grouped according to the cut scores and used to estimate the indices.
Lee.poly.D(cutscore, Pij, quadrature) Lee.poly.P(cutscore, Pij, theta)
Lee.poly.D(cutscore, Pij, quadrature) Lee.poly.P(cutscore, Pij, theta)
cutscore |
A scalar or vector of cut scores on the True Score scale. If you have cut scores on the theta scale, you can transform them with |
Pij |
An NxMxJ array of probabilities. Each slice of the array represents an item. Within a slice, each row corresponds to the respective element in |
theta |
Ability estimates for each subject. Must correspond to the first dimension of |
quadrature |
A list containing 1) The quadrature points and 2) Their corresponding weights. Must correspond to the first dimension of |
The polytomous generalization to class.Lee
. Requires the user build the Pij
array.
Marginal |
A matrix with two columns of marginal accuracy and consistency per cut score (and simultaneous if multiple cutscores are given) |
Conditional |
A matrix of conditional accuracy and conditional consistency |
In order to score above a cut, an examinee must score at or above the cut score. Since we are working on the total score scale, be aware that if a cut score is given with a decimal (like 2.4), the examinee must have a total score at the next integer or more (so 3 or more) to score above the cut.
If the test is mixed format (some dichotomous, some polytomous items), Pij
must be of an appropriate size for the item with the most response categories. The response categories that do no appear in other items can be filled with zeros. Note also that the function assumes response categories are scored as 0,1,2,3,...,M
While this function is needed for polytomous tests for the Lee approach, class.Rud()
works directly with polytomous tests when given the ability estimate and the standard error and so does not need an analogous set of functions.
Quinn N. Lathrop
Lee, W. (2010) Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47, 1–17.
#Same example as \code{class.Lee()}, #build \code{Pij} the same as in the example for \code{gen.rec.raw()}. params <- matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) theta <- rnorm(100) Pij.flat <- irf(params, theta)$f Pij.array <- array(NA, dim = c(length(theta), 2, nrow(params))) Pij.array[,1,] <- 1 - Pij.flat #P(X_j = 0 | theta_i) Pij.array[,2,] <- Pij.flat #P(X_j = 1 | theta_i) Lee.poly.P(2, Pij.array, theta)$Marginal #in the dichotomous case, identical to \code{Lee.P()} Lee.P(2, params, theta)$Marginal #For Rudner and polytomous tests, compute the theta estimate and se and use those as input theta.est <- theta #just for example theta.se <- SEM(params, theta.est) #also for example, SEM() assumes 3PL model, #but if you use mirt or similar package, #the theta estimates and their se will be available Rud.P(.5, theta.est, theta.se)$Marginal
#Same example as \code{class.Lee()}, #build \code{Pij} the same as in the example for \code{gen.rec.raw()}. params <- matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) theta <- rnorm(100) Pij.flat <- irf(params, theta)$f Pij.array <- array(NA, dim = c(length(theta), 2, nrow(params))) Pij.array[,1,] <- 1 - Pij.flat #P(X_j = 0 | theta_i) Pij.array[,2,] <- Pij.flat #P(X_j = 1 | theta_i) Lee.poly.P(2, Pij.array, theta)$Marginal #in the dichotomous case, identical to \code{Lee.P()} Lee.P(2, params, theta)$Marginal #For Rudner and polytomous tests, compute the theta estimate and se and use those as input theta.est <- theta #just for example theta.se <- SEM(params, theta.est) #also for example, SEM() assumes 3PL model, #but if you use mirt or similar package, #the theta estimates and their se will be available Rud.P(.5, theta.est, theta.se)$Marginal
Computes classification accuracy and consistency with Lathrop & Cheng's (2014) approach. First, the kernel-smoothed estimate of the probability of a correct response, conditional on observed total score, is found with pnr()
. Then, the method proceeds similar to class.Lee()
. Using the nonparametric approach does not require a parametric IRT model, keeps the problem on the total score scale, and can produce more accurate CA and CC estimates when the IRT model's assumptions are violated (see Lathrop & Cheng, 2014).
Lee.pnr(cutscore, pnr.out) pnr(resp, bw.g = NULL, alpha = .5)
Lee.pnr(cutscore, pnr.out) pnr(resp, bw.g = NULL, alpha = .5)
cutscore |
A scalar or vector of cut scores on the total score scale. Should not include 0 or the max total score, as the end points are added internally. |
pnr.out |
The output from
|
resp |
The response data matrix with rows as subjects and columns as items. Because the method is based on total score, the method is not robust to missing data. Any |
bw.g |
The global bandwidth parameter. The default of NULL will estimate the global bandwidth with the optimal (in terms of MSE) estimate of the bandwidth for normally distributed variables. The default is generally a good starting point. |
alpha |
The adaptivity of the bandwidth parameter. A value of 0 means no adaptation and each evaluation point uses the value in |
Marginal |
A matrix with two columns of marginal accuracy and consistency per cut score (and simultaneous if multiple cutscores are given) |
Conditional |
A list of two matrixes, one for conditional accuracy and one for conditional consistency. Each matrix has one row per evaluation point. |
The function pnr()
is modified from Ramsay's (1991) kernel-smoothed response functions, specifically because they occur conditional total score (and not conditional on a latent trait) and the addition of an adaptive bandwidth (which helps performance when the distribution of total scores is not normal.)
There is no "D" method of marginalization (as there is for class.Rud
and class.Lee
). But if there is a theoretical distribution of total scores, the pnr.out[[2]]
can be adjusted to match this theoretical distribution.
Quinn N. Lathrop
Lathrop, Q. N., & Cheng, Y. (2014). A Nonparametric Approach to Estimate Classification Accuracy and Consistency. Journal of Educational Measurement, 51(3), 318-334.
Lee, W. (2010) Classification consistency and accuracy for complex assessments using item response theory. Journal of Educational Measurement, 47, 1-17.
Ramsay, J. O. (1991). Kernel Smoothing Approaches to Item Characteristic Curve Estimation. Psychometrika, 56(4), 611-630.
#Simulate simple response data params <- matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) theta <- rnorm(100) rdm <- sim(params, theta) pnr.out <- pnr(rdm) resultsNP <- Lee.pnr(3, pnr.out)
#Simulate simple response data params <- matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) theta <- rnorm(100) rdm <- sim(params, theta) pnr.out <- pnr(rdm) resultsNP <- Lee.pnr(3, pnr.out)
Recursively computes the probabilities of each possible total score conditional on ability.
recursive.raw(ip, theta, D = 1.7) gen.rec.raw(Pij, theta.names = NULL)
recursive.raw(ip, theta, D = 1.7) gen.rec.raw(Pij, theta.names = NULL)
ip |
Jx3 matrix of item parameters, columns are discrimination, difficulty, and guessing; in that order. |
theta |
Vector of abilities or points to condition on. |
D |
The scaling constant for the IRT parameters, defaults to 1.7, alternatively often set to 1. |
Pij |
Either: (1) an NxJ matrix of probabilities of correct response, where each row corresponds to the respective element in or (2) an NxMxJ array of probabilities. Each slice of the array represents an item. Within a slice, each row corresponds to the respective element in |
theta.names |
Optional vector to use as row.names in the output matrix. Should correspond to the first dimension of |
A matrix of theta points by possible total score 0,1, . . . ,J.
As described in Huynh 1990.
If the test is mixed format (some dichotomous, some polytomous items), to use gen.rec.raw()
, Pij
must be of an appropriate size for the item with the most response categories. The response categories that do no appear in other items can be filled with zeros. Note also that the function assumes response categories are scored as 0,1,2,3,...,M
Quinn Lathrop
theta <- c(-1,0, 1) params<-matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) #using IRT model and item parameters rec.mat <- recursive.raw(params, theta) #using user supplied probability array Pij.flat <- irf(params, theta)$f #through matrix input rec.mat2 <- gen.rec.raw(Pij.flat, theta) #through array input (this can be generalized to polytomous tests) Pij.array <- array(NA, dim = c(length(theta), 2, nrow(params))) Pij.array[,1,] <- 1 - Pij.flat #P(X_j = 0 | theta_i) Pij.array[,2,] <- Pij.flat #P(X_j = 1 | theta_i) rec.mat3 <- gen.rec.raw(Pij.array, theta) #same results max(c(rec.mat-rec.mat3, rec.mat2-rec.mat3))
theta <- c(-1,0, 1) params<-matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) #using IRT model and item parameters rec.mat <- recursive.raw(params, theta) #using user supplied probability array Pij.flat <- irf(params, theta)$f #through matrix input rec.mat2 <- gen.rec.raw(Pij.flat, theta) #through array input (this can be generalized to polytomous tests) Pij.array <- array(NA, dim = c(length(theta), 2, nrow(params))) Pij.array[,1,] <- 1 - Pij.flat #P(X_j = 0 | theta_i) Pij.array[,2,] <- Pij.flat #P(X_j = 1 | theta_i) rec.mat3 <- gen.rec.raw(Pij.array, theta) #same results max(c(rec.mat-rec.mat3, rec.mat2-rec.mat3))
Creates a 2x2 table (or larger for multiple cutscores), to compute the proportion of correctly classified, or consistently classified, subjects. Can easily be modified to return kappa or other summaries as needed.
TOtable.F(theta, os, theta.cutoff, os.cutoff)
TOtable.F(theta, os, theta.cutoff, os.cutoff)
theta |
a vector of scores, could be True scores, or simply the first test's scores |
os |
a vector of scores with elements corresponding to those given for theta. Could be observed scores, or the second test's scores |
theta.cutoff |
the cut score on the same scale as the scores given in theta |
os.cutoff |
the cut score on the same scale as the scores given in os |
The cut scores need to be on the same scale as their subjects' scores, but do not need to match each other. For example, one test scored with theta.hat and a theta scaled cut score, and the second scored on total score and a number correct cut score. See example.
Table |
classification table with margins |
P |
P or proportion of subjects on the diagonal, raw accuracy/consistency |
This is useful in simulation studies when we have usually unknowable knowledge, either a second test (for consistency), or True scores or Theta (for accuracy).
Quinn Lathrop
##with simple 4 item 1PL simulated test: ## find True classification accuracy of theta.hat and total score ## use a cut score of theta = 0 (assume true score for total score cut score) params<-matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) theta <-rnorm(100) rdm<-sim(params, theta) theta.hat <- MLE(rdm,params) total.score <- rowSums(rdm) t.cut<-0 x.cut <- sum(irf(params, t.cut)$f) TOtable.F(theta, theta.hat, t.cut, t.cut)$P TOtable.F(theta, total.score, t.cut, x.cut)$P # for 1PL model total score is sufficient for theta.hat. # repeat with 2PL or 3PL items to see effect of classifying # on theta.hat instead of total score
##with simple 4 item 1PL simulated test: ## find True classification accuracy of theta.hat and total score ## use a cut score of theta = 0 (assume true score for total score cut score) params<-matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) theta <-rnorm(100) rdm<-sim(params, theta) theta.hat <- MLE(rdm,params) total.score <- rowSums(rdm) t.cut<-0 x.cut <- sum(irf(params, t.cut)$f) TOtable.F(theta, theta.hat, t.cut, t.cut)$P TOtable.F(theta, total.score, t.cut, x.cut)$P # for 1PL model total score is sufficient for theta.hat. # repeat with 2PL or 3PL items to see effect of classifying # on theta.hat instead of total score
Modified from the package irtoys
.
iif(ip, x, D = 1.7) irf(ip, x, D = 1.7) MLE(resp, ip, D = 1.7, min= -4, max = 4) normal.qu(n = 15, lower = -4, upper = 4, mu = 0, sigma = 1) SEM(ip, x, D = 1.7) sim(ip, x, D = 1.7) tif(ip, x, D = 1.7)
iif(ip, x, D = 1.7) irf(ip, x, D = 1.7) MLE(resp, ip, D = 1.7, min= -4, max = 4) normal.qu(n = 15, lower = -4, upper = 4, mu = 0, sigma = 1) SEM(ip, x, D = 1.7) sim(ip, x, D = 1.7) tif(ip, x, D = 1.7)
ip |
A Jx3 matrix of item parameters. Columns are discrimination, difficulty, and guessing |
x |
Vector of theta points |
resp |
Response data matrix, subjects by items |
min , max
|
MLE is undefined for perfect scores. These parameters define the range in which to search for the MLE, if the score is perfect, the min or max will be returned. |
n |
Number of quadrature points wanted |
lower , upper
|
Range of points wanted |
mu , sigma
|
The normal distribution from which points and weights are taken |
D |
The scaling constant for the IRT parameters, defaults to 1.7, alternatively often set to 1. |
iif
gives item information, irf
gives item response function, MLE
returns maximum likelihood estimates of theta (perfect scores get +-4), normal.qu
returns a list length 2 of normal quadrature points and weights, SEM
gives the standard error of measurement at the given ability points, sim
returns simulated response matrix, tif
gives the test information function.
Quinn N. Lathrop
Partchev, I. (2014) irtoys: Simple interface to the estimation and plotting of IRT models. R package version 0.1.7.
params<-matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) rdm<-sim(params, rnorm(100)) theta.hat <- MLE(rdm, params) theta.se <- SEM(rdm, params) ## transform a cut score of theta = 0 to the expected true score scale t.cut <- 0 x.cut <- sum(irf(params, t.cut)$f)
params<-matrix(c(1,1,1,1,-2,1,0,1,0,0,0,0),4,3) rdm<-sim(params, rnorm(100)) theta.hat <- MLE(rdm, params) theta.se <- SEM(rdm, params) ## transform a cut score of theta = 0 to the expected true score scale t.cut <- 0 x.cut <- sum(irf(params, t.cut)$f)