In recent years, mixed-format tests, which combine multiple-choice items and free-response items, have been widely used in criterion-referenced testing, from classroom assessment to large-scale standardized tests. This type of test combines the guaranteed objectivity and scoring efficiency of multiple-choice items with the ability of free-response items to measure the subjects' more comprehensive understanding. Furthermore, such tests are increasingly developed on the basis of item response theory, which is useful for solving many problems in educational measurement and whose models give the test results more practical implications. Likewise, standard-setting methods that apply this test theory are increasingly used to set the cut score in criterion-referenced evaluation.
Given this situation, classification accuracy and consistency, which correspond to the validity and reliability of criterion-referenced evaluation, should be determined with a classification index estimation method that is based on the same test theory and reflects the psychometric issues of mixed-format tests. Several methods have been proposed for estimating the classification indices of mixed-format tests from a single test administration. However, few studies have evaluated the performance of the Rudner method and the Guo method, which set the cut score on the ability scale of item response theory.
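To make the idea of classification indices on the ability scale concrete, the following is a minimal sketch in the spirit of the Rudner approach for a single cut score. It assumes that each subject's estimation error is approximately normal with standard deviation equal to the standard error of the ability estimate. The function name `rudner_style_indices`, the consistency formula (the probability that two independent administrations agree), and the sample values are illustrative assumptions, not the procedure or data used in this study.

```python
from statistics import NormalDist


def rudner_style_indices(theta_hat, se, cut):
    """Expected classification accuracy and consistency for one cut score.

    Assumption (not from the source): the true ability of each subject is
    normally distributed around theta_hat with standard deviation se (> 0).
    """
    std_normal = NormalDist()
    accuracy_sum = 0.0
    consistency_sum = 0.0
    for t, s in zip(theta_hat, se):
        # Probability that the true ability lies at or above the cut score.
        p_above = 1.0 - std_normal.cdf((cut - t) / s)
        p_below = 1.0 - p_above
        # Accuracy: probability that the observed classification is correct.
        accuracy_sum += p_above if t >= cut else p_below
        # Consistency: probability that two independent classifications agree.
        consistency_sum += p_above ** 2 + p_below ** 2
    n = len(theta_hat)
    return accuracy_sum / n, consistency_sum / n


# Hypothetical ability estimates and standard errors, for illustration only.
theta_hat = [-1.2, -0.1, 0.4, 1.5]
se = [0.35, 0.30, 0.30, 0.40]
accuracy, consistency = rudner_style_indices(theta_hat, se, cut=0.0)
print(accuracy, consistency)
```

Subjects whose estimates lie close to the cut score contribute values near 0.5 to both sums, which is why both indices depend on where the cut score falls relative to the ability distribution.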
Thus, this study conducted a simulation to examine whether the classification indices estimated by the Rudner method and the Guo method differ according to the test length, the item composition ratio, and the cut score location of mixed-format tests, to investigate the interactions among these three study conditions, and to determine which of the two methods produces more accurate estimates.
For this purpose, the two-parameter logistic model for the multiple-choice items and the generalized partial credit model for the free-response items were used to generate a mixed-format test for each study condition. The tests consisted of 20 items or 60 items, of which 10%, 30%, or 50% were free-response items, and the subjects' abilities were drawn from the standard normal distribution. Next, maximum likelihood estimation was applied to estimate the subjects' ability parameters, and the classification accuracy, classification consistency, and kappa coefficient were then estimated by the Rudner method and the Guo method for cut scores of -1.0, -0.5, 0, 0.5, and 1.0. In addition, the "true" classification indices, which serve as the criterion for evaluating the accuracy of the two methods, were calculated and compared with the estimated classification indices, and the standard error of the estimates, bias, and root mean square error were calculated for each method.
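The data-generation step described above can be sketched as follows. This is a minimal illustration of the two-parameter logistic model and the generalized partial credit model in their standard forms. The function names, the item parameters, and the random seed are hypothetical and are not taken from the study.

```python
import math
import random


def p_2pl(theta, a, b):
    """Probability of a correct response under the two-parameter logistic model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def gpcm_probs(theta, a, steps):
    """Category probabilities (scores 0..len(steps)) under the GPCM."""
    # Cumulative sums of a * (theta - b_v); score 0 has an empty sum of 0.
    z = [0.0]
    for b_v in steps:
        z.append(z[-1] + a * (theta - b_v))
    z_max = max(z)  # subtract the maximum for numerical stability
    exp_z = [math.exp(v - z_max) for v in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]


def simulate_subject(theta, mc_items, fr_items, rng):
    """Generate one subject's responses to a mixed-format test."""
    responses = []
    for a, b in mc_items:
        responses.append(1 if rng.random() < p_2pl(theta, a, b) else 0)
    for a, steps in fr_items:
        probs = gpcm_probs(theta, a, steps)
        responses.append(rng.choices(range(len(probs)), weights=probs)[0])
    return responses


# Hypothetical 20-item test with 10% free-response items (18 MC + 2 FR).
rng = random.Random(0)
mc_items = [(1.0, -1.0 + 0.1 * i) for i in range(18)]      # (a, b) pairs
fr_items = [(0.8, [-0.5, 0.0, 0.5]), (1.2, [-1.0, 1.0])]   # (a, step parameters)
theta = rng.gauss(0.0, 1.0)  # ability drawn from the standard normal distribution
responses = simulate_subject(theta, mc_items, fr_items, rng)
print(responses)
```

Repeating `simulate_subject` over many sampled abilities yields one simulated data set per study condition. Ability estimation and the classification indices would then be computed on those responses.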
The results of this study are summarized as follows. First, the longer the mixed-format test, the larger the classification indices, regardless of the method. Second, the classification indices tended to increase as the proportion of free-response items in the mixed-format test increased, and this tendency was more pronounced when the test was short. Third, as the cut score approached zero, the classification accuracy and consistency indices decreased, while the kappa coefficient increased.
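The opposite movement of consistency and kappa in the third result is compatible with how Cohen's kappa corrects for chance agreement, as the short sketch below illustrates. The function name and all numbers are hypothetical values chosen for illustration, not results of this study, and the sketch assumes that both classifications share the same proportion of subjects above the cut score.

```python
def kappa_from_agreement(p_agree, p_above):
    """Cohen's kappa for a two-category classification.

    p_agree: observed proportion of consistent classifications.
    p_above: assumed marginal proportion classified above the cut score.
    """
    # Chance agreement when both classifications share the same marginals.
    p_chance = p_above ** 2 + (1.0 - p_above) ** 2
    return (p_agree - p_chance) / (1.0 - p_chance)


# Hypothetical values: a cut score far from zero gives unbalanced groups and
# high chance agreement; a cut score at zero gives balanced groups.
kappa_extreme_cut = kappa_from_agreement(p_agree=0.90, p_above=0.84)
kappa_center_cut = kappa_from_agreement(p_agree=0.85, p_above=0.50)
print(kappa_extreme_cut, kappa_center_cut)
```

With balanced groups the chance agreement is at its minimum of 0.5, so a lower raw consistency can still correspond to a higher kappa.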
The conclusions based on the results of the study are as follows. First, it is reasonable to use the two methods to estimate the classification accuracy and consistency of mixed-format tests, but for the kappa coefficient it should be noted that both methods can yield inaccurate values depending on the test length and the cut score location. Second, to evaluate the subjects' achievement levels accurately and consistently, a mixed-format test should be constructed with a sufficient number of items. Third, a mixed-format test consisting of a small number of items should be used with care, taking into account the composition ratio of multiple-choice items and free-response items. Fourth, when the cut score is located at a low or high level in the distribution of the subjects' abilities, the Guo method performs relatively worse than the Rudner method.