Mixed-format (MF) tests, which combine multiple-choice (MC) items and free-response (FR) items, are now widely used. Unlike MC items, FR items are scored by human raters, whose subjectivity introduces error into the scores (Schoonen, 2005); this error is called a 'rater effect'.
The most commonly studied rater effects are severity/leniency and the consistency (or variability) of that severity. A rater who systematically assigns higher or lower scores than other raters is lenient or severe, respectively. A rater whose scores deviate unsystematically from the appropriate scores exhibits variability (inconsistency) of severity. Because these rater effects reduce the accuracy and reliability of measurement, it is recommended to use a model that properly accounts for them.
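As a toy numerical illustration of these two effects (hypothetical numbers, not data from this study): severity/leniency shifts a rater's mean score, while inconsistency inflates the spread of a rater's deviations from the appropriate scores.

```python
import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(3.0, 1.0, size=200)  # hypothetical FR item scores

# A severe rater scores lower on average; a lenient rater, higher.
severe = true_scores - 0.5 + rng.normal(0, 0.2, size=200)
lenient = true_scores + 0.5 + rng.normal(0, 0.2, size=200)

# An inconsistent rater has no systematic shift,
# but large random deviations from the true scores.
inconsistent = true_scores + rng.normal(0, 1.0, size=200)

print(severe.mean() < true_scores.mean() < lenient.mean())
print(np.std(inconsistent - true_scores) >
      np.std(severe - (true_scores - 0.5)))
```

Both comparisons come out true: the severe and lenient raters differ from the true mean in opposite directions, and the inconsistent rater's deviations are far more variable than the severe rater's.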
Several studies have examined the reliability of MF tests using multivariate generalizability theory. Powers and Brennan (2009) estimated the reliability of MF tests, but their design did not include a rater facet. Moses and Kim (2015) added a ratings facet to the design, rather than a rater facet. More recently, Kim et al. (2016) extended this line of work and introduced a new design for MF tests. Although that study is valuable for its new indices describing MF tests, it is limited by its reliance on real data.
In this paper, a simulation study is used to investigate how rater effects influence the reliability of MF tests. The research question is how various rater effects affect the reliability of MF tests as the design and the ratio of MC to FR items are varied.
MF test data were generated under 12 rater conditions and 3 designs, and the reliabilities were computed. Designs without rater effects yielded higher reliabilities than designs incorporating them. Differences in rater severity affected reliability more than the consistency of severity did. When rater effects were small, adding FR items increased reliability. Finally, reliabilities under designs in which raters were crossed with persons were lower than under designs in which raters were nested.
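The core of such a simulation can be sketched as follows. This is a minimal persons × raters crossed design with normally distributed true scores and fixed per-rater severity offsets; the generating model, the function names `simulate_ratings` and `g_and_phi`, and all parameter values are assumptions of this sketch, not the study's actual specification. Variance components are estimated from the usual two-way mean squares, and the relative (G) and absolute (Phi, dependability) coefficients are computed from them.

```python
import numpy as np

def simulate_ratings(n_persons, n_raters, severity_sd, error_sd, seed=0):
    """Generate scores under a persons x raters crossed design.

    Each rater adds a fixed severity/leniency offset b_r; a larger
    severity_sd means raters differ more in severity.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, size=(n_persons, 1))     # true scores
    b = rng.normal(0.0, severity_sd, size=(1, n_raters))  # rater severity
    e = rng.normal(0.0, error_sd, size=(n_persons, n_raters))
    return theta + b + e

def g_and_phi(scores):
    """Estimate variance components from a two-way crossed design
    (one observation per cell) and return the relative (G) and
    absolute (Phi) coefficients for the observed number of raters."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    pm = scores.mean(axis=1)                     # person means
    rm = scores.mean(axis=0)                     # rater means
    ms_p = n_r * np.sum((pm - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((rm - grand) ** 2) / (n_r - 1)
    resid = scores - pm[:, None] - rm[None, :] + grand
    ms_e = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))
    var_p = max((ms_p - ms_e) / n_r, 0.0)        # person variance
    var_r = max((ms_r - ms_e) / n_p, 0.0)        # rater (severity) variance
    var_e = ms_e                                 # residual variance
    g = var_p / (var_p + var_e / n_r)            # relative decisions
    phi = var_p / (var_p + (var_r + var_e) / n_r)  # absolute decisions
    return g, phi
```

Comparing, say, `severity_sd=0.0` against `severity_sd=1.0` with everything else held fixed, Phi drops while G is essentially unchanged, since rater severity variance enters only the absolute error term. This mirrors the finding that differences in severity matter most for the dependability index.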
In summary, the conclusions of the study are as follows. Differences in rater severity affect reliability more than the other rater effect does, particularly the dependability index. When the differences in severity are large and the consistency of severity is low, reliabilities are overestimated. Reliabilities under rater-nested designs are higher than under the crossed design. Finally, the effect of the ratio of MC to FR items differs depending on the rater effects.