
하기욱 Kyung Hee University Graduate School of Pan-Pacific International Studies 2025 Domestic doctoral dissertation
Achieving and sustaining the intended outcomes of development projects is the ultimate goal of international development cooperation. However, despite its significance, research on the sustainability of development projects and its evaluation has been insufficient. In particular, sustainability criteria have not been adequately addressed in the evaluation of Korean ODA projects, and studies aimed at improving this aspect remain limited. Given these concerns, this study aims to establish a more systematic approach to sustainability evaluation. To achieve this, a comprehensive review of previous studies was conducted to identify factors influencing the sustainability of development projects, and a framework for the influential factors of sustainability was established based on these findings. Subsequently, a Delphi study was conducted to derive 21 core sustainability evaluation items appropriate for end-of-project evaluations, and their relative importance was analyzed using the Analytic Hierarchy Process (AHP). Furthermore, a meta-evaluation using content analysis was performed on 134 KOICA end-of-project evaluation reports to examine the current state of sustainability evaluations. The analysis revealed that KOICA’s end-of-project evaluations primarily focus on recipient countries' financial, human, and policy resources, while other critical factors tend to be overlooked. Additionally, sustainability evaluation items were not comprehensively applied, and only a limited number of items were explicitly reflected in the evaluations. In-depth interviews further revealed that the low quality of sustainability evaluations is largely attributable to factors such as evaluators' lack of expertise and the absence of detailed evaluation guidelines. This study contributes to development evaluation research by quantitatively validating sustainability evaluation items and diagnosing the current state of sustainability evaluations. 
Methodologically, it enhances the rigor of sustainability evaluation research by employing expert consensus methods and systematic content analysis. Practically, the findings provide valuable implications for revising KOICA’s evaluation guidelines, strengthening evaluator capacity, and improving the overall sustainability evaluation framework. However, this study primarily focuses on prospective sustainability within KOICA’s bilateral project end-of-project evaluations and does not encompass the long-term continuation of project outcomes (actual sustainability). Future research should expand the scope of sustainability evaluation studies to include various project types and sectors while verifying the applicability of the proposed sustainability evaluation items in actual evaluations. Furthermore, this study offers insights applicable beyond the field of evaluation to the broader domain of development cooperation. Projects that do not thoroughly consider sustainability factors during planning are inherently less likely to be sustainable, and sustainability constraints may arise if recipient institutions are not adequately prepared to maintain project outcomes and key activities during the project implementation process. The establishment of more comprehensive sustainability evaluation items, as proposed in this study, provides guidance for stakeholders involved in project planning and implementation, helping them recognize critical sustainability factors in their decision-making processes.
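The abstract above reports that the relative importance of the evaluation items was analyzed with the Analytic Hierarchy Process (AHP). As a rough illustration of how AHP turns pairwise judgments into weights, the following is a minimal sketch that assumes the row geometric-mean method and Saaty's random index for the consistency check. The three items and their judgments are hypothetical and are not taken from the study.

```python
import math

# Saaty's random consistency index for small matrix sizes.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix):
    """Priority weights via the row geometric-mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(matrix, weights):
    """Consistency ratio CR = CI / RI, with lambda_max estimated from A*w."""
    n = len(matrix)
    aw = [sum(a * w for a, w in zip(row, weights)) for row in matrix]
    lambda_max = sum(x / w for x, w in zip(aw, weights)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


# Hypothetical pairwise judgments for three evaluation items
# (entry [i][j] says how much more important item i is than item j).
pairwise = [
    [1.0, 3.0, 5.0],
    [1.0 / 3.0, 1.0, 3.0],
    [1.0 / 5.0, 1.0 / 3.0, 1.0],
]

weights = ahp_weights(pairwise)
cr = consistency_ratio(pairwise, weights)
print([round(w, 3) for w in weights], round(cr, 3))
```

A consistency ratio below 0.1 is the conventional threshold for accepting a set of judgments. How the study aggregated judgments across its expert panel is not stated in the abstract.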
A Comparative Study of Researchers' and Designers' Perceptions of Evaluation Items in Post-Occupancy Evaluation
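Post-occupancy evaluation (POE) research is complete not when a design outcome has been evaluated, but when a cycle has formed that feeds the results back into the planning and design process. Evaluating and presenting the results is the researcher's job, but feeding them back into design is the job of the designer. An approach to the design applicability of POE research should therefore begin with an understanding of designers, and in particular of how designers perceive design information. It is doubtful, however, whether researchers currently know in any concrete way how designers perceive the various kinds of design information and evaluation items. To improve the design applicability of POE, this study compares researchers' and designers' perceptions of POE so that researchers can better understand designers. To this end, the evaluation items used in POE studies of multi-family housing were mapped to the design information that designers use during the design process. Following a literature review, a survey was conducted to compare researchers' and designers' perceptions of these items, with the following results. First, a comparison of commonly known design information and evaluation items showed that the two overlap in part and differ in part. Some of the evaluation items used in POE are 'design information' that designers consider directly helpful, while others are regarded as unusable in the design process. When POE results are provided to designers, the emphasis should therefore be on the results for items that are suitable as design information. Second, more than half of both researchers and designers responded that the evaluation items presented in the survey would be helpful for design, that is, useful as design information. At the same time, most respondents said that their source of design information in actual design work is existing, similar design cases. Designers thus view POE results positively but rarely use them as design information in practice, which indicates that researchers need to present evaluation results to designers in a form that is easier to use. Third, among the categories of evaluation items, those related to the floor plan of the dwelling showed the highest importance and the highest usability in both groups, while the items with the lowest usability concerned maintenance, occupants' behavioral characteristics, and economic efficiency, information that is difficult to bring into the design process. The items showing a marked gap between importance and usability were those related to site characteristics, construction quality, occupants' behavioral characteristics, maintenance, and economic efficiency. This pattern appeared in both groups, and most respondents regarded these items as high in importance but low in usability as design information. There were also items for which perceptions of importance or usability clearly differed between researchers and designers. This study examined differences in perception between POE researchers and architectural designers using the evaluation items employed in POE research. The two groups were similar in some respects and clearly distinct in others, which shows that researchers who conduct POE studies and designers who are to adopt the results hold different values. Future research should analyze these between-group differences in greater depth and link them directly to improving the design applicability of POE research.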
This study starts from criticism of the design applicability of post-occupancy evaluation (POE) research and aims to improve that applicability. Researchers evaluate buildings and present the results, but designers are the ones who apply those results to design. The design applicability problem of POE may therefore stem, at least in part, from differences in perception between researchers and designers. An approach to solving the problem should begin with understanding the differences between researchers and designers, especially in how they perceive various kinds of design information. A survey of researchers' and designers' perceptions of evaluation items was conducted, and the findings are as follows. First, a comparison of existing evaluation items and design information shows that they have both common and differing elements. Designers consider some of the evaluation items used in POE to be directly applicable 'design information' and others not. When presenting POE results to designers, researchers should keep in mind that the results must be relevant to design work. Second, more than half of the respondents (both researchers and designers) answered that the evaluation items presented in the survey would be useful for design. At the same time, most of them answered that they obtain design information from existing plans of similar cases. Designers thus view POE results positively but tend not to use them as design information in practice. This also means that the POE results researchers present are not applicable enough, which is why researchers need to supply POE results that designers can apply more easily. Third, the item-level findings are as follows: items on the floor plan of a house show the highest importance and applicability in both groups, while items on maintenance, occupants' behavioral characteristics, and economic value show the lowest applicability.
The items with sharp differences between importance and applicability are those on location conditions, construction quality, occupants' behavioral characteristics, maintenance, and economic value; most respondents rated these items high in importance but low in applicability. There are also some items whose importance and applicability are perceived distinctly differently by researchers and designers. These are the findings of the comparative study of researchers' and designers' views on POE evaluation items. The two groups show some similarities and some differences. The study shows that researchers who perform POE research and designers who receive its results differ in perception and understanding. Further work is needed to analyze these perceptual differences between the groups in greater depth and to connect them more directly to improving the design applicability of POE.
최홍남 Graduate School, Yonsei University 2022 Domestic doctoral dissertation
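Since 1960, in the course of rapid industrialization and urbanization, related laws and standards were applied uniformly, without regard to local development conditions such as environmental, historical, and cultural assets, in order to supply housing and build infrastructure quickly. This damaged traditional urban fabric and structure and produced a uniform high-rise, high-density urban landscape. In addition, with rapid technological progress and the arrival of the Fourth Industrial Revolution, advanced information and communication technologies have been applied, construction technology has been innovated, and new construction materials have been actively developed. Construction-related technologies such as fire protection, disaster prevention, and structural engineering have advanced, yet it has repeatedly been pointed out that the rigidity of laws and institutions limits their application. In response to these institutional limits, the special architectural district system was introduced through the 2007 revision of the Building Act. It allows various statutes and standards to be waived or relaxed without being bound by existing building regulations, and allows processing procedures to be integrated under the relevant statutes. The introduction of the special architectural district system is regarded as a new alternative that departs from the rigid legal framework, enables new attempts such as the application of new technologies and creative architectural design, improves the design quality and landscape value of buildings, and enhances the publicness of the surrounding area. Since the introduction of the system, however, the absence of an evaluation framework, including a lack of specific evaluation criteria and operating guidelines, has meant that stakeholders do not share a common understanding of the system's purpose. The bodies involved in operating the system also interpret it differently and have been unable to reconcile diverging opinions, so the system has not been actively used, contrary to its original intent. This study therefore derives evaluation items for the system's stated aims, 'securing publicness' and 'improving design quality' through creative design, and derives the importance and priority of these items so that diverging opinions can be reconciled in plan evaluation. It also analyzes the need to apply evaluation items differently at each stage of the operating procedure in order to propose an evaluation framework for operating the system efficiently. First, evaluation criteria are needed that, in evaluating special architectural district plans, give priority to openness, safety, and accessibility with respect to publicness and to diversity with respect to design quality. Specifically, evaluation should focus on visual openness for openness, pedestrian safety for safety, form and mass design for landscape quality, pedestrian access for accessibility, harmony with the surrounding environment and buildings for harmony, and specialized design plans and varied building forms for diversity. Second, the operating procedure should be improved so that evaluation items are simplified at the district designation stage to reduce the burden, while each item is evaluated in detail at the architectural review stage so that publicness is secured and creative design is realized. At designation, evaluation should focus on harmony, impact, openness, accessibility, and diversity in terms of the relationship with the city and the surrounding area; at the architectural review stage, overall design quality should be evaluated on the basis of the plan submitted at designation. Third, the plan evaluation showed lower scores for publicness-related items than for design-quality items, so the evaluation framework and system need to be improved to secure publicness. In particular, evaluation criteria are needed to improve planning for barrier-free design, enhancement of the local community, adaptation to the existing site and natural setting, facility flexibility and expandability, and the application of new technologies. Because items such as facility flexibility and expandability and the application of new technologies cannot fully reflect future demand and technological progress, they should be separated from the other items and given their own evaluation criteria and incentive measures. The special architectural district was first introduced in 2008 to enhance publicness, improve the urban landscape, raise the level of construction technology, and improve building-related institutions through the construction of creatively designed buildings. In the meantime, the absence of an evaluation framework, including a lack of specific evaluation criteria and operating guidelines, has kept the system from being actively used.
For this reason, on November 3, 2021, the Ministry of Land, Infrastructure and Transport put in place and implemented operating guidelines to promote the use of special architectural districts. For the system to be used and spread more widely, the operating guidelines must be accompanied by continued discussion and improvement of the evaluation framework, including more specific evaluation items and criteria. Fourth, applying the Building Act exceptions available in special architectural districts is helping to realize architectural plans with better publicness and design quality than the current law would yield. If the expansion of these exceptions and the criteria for applying them are systematized through continued review and monitoring so as to keep the system flexible, this will help realize the system's aims of securing publicness and improving design quality. This study analyzed evaluation items for securing publicness and improving design quality, and the need to evaluate items at each stage of the operating procedure, with a view to improving the evaluation framework and operation of the special architectural district system. Its significance lies in presenting priorities among qualitative evaluations by reporting the importance of the evaluation items and in verifying the need to apply evaluation items differently by operating stage; it can serve as basic data for improving related institutions and operations in the future. Since 1960, in the course of rapid industrialization and urbanization, related laws and standards have been applied uniformly for the sake of rapid housing supply and infrastructure development, without considering regional development conditions such as environmental, historical, and cultural assets. This damaged traditional urban fabric and structure and produced a uniform high-rise, high-density urban landscape. In addition, with recent rapid technological development and the advent of the Fourth Industrial Revolution, advanced IT is being applied, architectural technology is being innovated, and advanced new materials are being actively developed. Although fire protection, disaster prevention, and structural technologies have advanced as a result, concerns have repeatedly been raised that rigid laws and systems limit the application of new technologies. In response, the ‘Special Architectural District’ system was introduced through the 2007 revision of the Building Act; it allows various laws and standards to be waived or relaxed regardless of existing building regulations and allows processing procedures to be integrated in accordance with relevant laws and regulations.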
The designation of a special architectural district is regarded as a new alternative that enables new construction technologies and creative architectural designs that would not be possible under existing regulations, improves the design quality of buildings, and improves the publicness and landscape value of the surrounding area. Since the system was introduced, however, the lack of specific evaluation standards and operating guidelines has left the parties involved without a shared understanding of its purpose, and the bodies that operate the system interpret it differently and have been unable to reconcile their opinions. As a result, the system has not been actively used, contrary to its original purpose. This study therefore aims to derive evaluation items for ‘securing publicness’ and ‘improving design quality’ through creative design, which are the purposes of the special architectural district system, and to derive the importance and priority of these items so that differing opinions can be reconciled in plan evaluation. It also aims to present an evaluation system for operating the system efficiently by analyzing the need to apply evaluation items differently at each stage of the operating procedure. First, when special architectural district plans are evaluated, evaluation criteria are needed that prioritize openness, safety, and accessibility in terms of publicness and diversity in terms of design quality. Specifically, the evaluation should focus on visual openness for openness, pedestrian safety for safety, form and mass design for landscape, pedestrian access for accessibility, harmony with the surrounding environment and buildings for harmony, and specialized design plans and varied building forms for diversity.
Second, the operating procedure should be improved so that evaluation items are simplified at the designation stage to reduce the burden, while each item is evaluated in detail during architectural review so that publicness is secured and creative design is realized. At designation, the evaluation should focus on harmony, impact, openness, accessibility, and diversity in terms of the relationship with the city and surrounding areas; at the review stage, overall design quality should be evaluated on the basis of the plan submitted at designation. Third, the plan evaluation showed lower scores for items related to publicness than for items related to design quality, so the evaluation system needs to be improved to secure publicness. Specifically, evaluation criteria are needed for barrier-free planning, improvement of the local community, adaptation to the existing site and natural setting, facility flexibility and expandability, and the application of new technologies. However, because facility flexibility and expandability and the application of new technologies cannot fully reflect future demand and technological development, separate evaluation criteria and incentive measures should be prepared for them apart from the other evaluation items. The special architectural district was first introduced in 2008 to promote publicness, improve the urban landscape, raise the level of construction technology, and improve building-related policies through the construction of creatively designed buildings. Until now, the policy has not been actively used because an evaluation system has been absent, including specific evaluation standards and operating guidelines.
To address this problem, on November 3, 2021, the Ministry of Land, Infrastructure and Transport reorganized and implemented operating guidelines to revitalize special architectural districts. To revitalize and spread the system, the operating guidelines must be accompanied by continued discussion and improvement of the evaluation system, such as more specific evaluation items and criteria. Fourth, the application of special provisions of the Building Act in special architectural districts is contributing to the realization of building plans with better publicness and design quality than the current law would yield. If the expansion of special provisions and the standards for applying them are systematized through continuous review and monitoring so as to keep the policy flexible, this will contribute to realizing the purpose of the special architectural district policy, which is to secure publicness and improve design quality. This study analyzed evaluation items for securing publicness and improving design quality, and the need to evaluate items at each stage of the operating procedure, with a view to improving the evaluation system and operation of the special architectural district system. Its significance lies in presenting priorities among qualitative evaluations by reporting the importance of evaluation items and in verifying the need to apply evaluation items differently by operating stage. It can also be used as basic data for improving related systems and operational directions in the future.
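Development of Evaluation Items for Bidding to Host International Sports Events: Focusing on the Expected Effects of Hosting
김원풍 Kookmin University Graduate School 2024 Domestic doctoral dissertation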
This study was conducted to develop evaluation items for hosting international sports events, with the aim of contributing to the advancement of host regions and society by focusing on the diverse anticipated outcomes of such events. To this end, domestic cases of evaluation systems for international sports event bidding were examined and analyzed, while overseas cases were referenced for the conceptual exploration of evaluation guidelines and the ideal objectives pursued through event legacy. In addition, a Delphi survey was conducted with experts in the field of sports events. As a result, a total of 31 evaluation items across four domains were identified. Specifically, this research systematically established and presented evaluation items that comprehensively assess the anticipated outcomes of hosting international sports events in the domains of sports, society, economy, and environment: In the sports domain, the evaluation items address the potential for qualitative and quantitative growth in sports, as well as the capacity building of professionals and related organizations. In the social domain, the items include the assessment of positive effects on local communities, such as resident satisfaction, community spirit, quality of life, social equity, and regional development. The economic domain encompasses items for evaluating both direct economic effects, such as local consumption, income, and employment, and indirect effects, including tourism and business opportunities. In the environmental domain, the evaluation focuses on minimizing negative environmental impacts, preserving natural resources, enhancing environmental awareness, and promoting civic consciousness. The significance of this study lies in its systematic development of evaluation items that enable comprehensive assessment of the anticipated outcomes of international sports events, based on case analysis and expert consultation. 
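Furthermore, the findings provide a new direction for improving future evaluation systems for international sports events and are expected to serve as a foundation for related academic discourse. Additionally, the evaluation items derived from this study can serve as empirical standards and guidelines for stakeholders seeking to host international sports events, enabling them to objectively assess their own hosting capabilities and to establish clear purposes and goals. Institutionalizing the submission of assessment results and supporting documentation based on these criteria would provide a basis for effectively preventing indiscriminate event bidding, which has been identified as a persistent issue. Moreover, by consistently linking these evaluation items to post-event performance assessments, this research is expected to contribute to the establishment of a knowledge management system that enhances the capacity for hosting and managing international sports events in Korea. This study was conducted to develop bid evaluation items centered on the diverse expected effects of events, so that bringing international sports events to Korea can contribute to the development of host regions and society. To this end, domestic and overseas cases of bid evaluation for international sports events were surveyed and analyzed, and a Delphi survey of experts in the sports event field was conducted; as a result, a total of 31 bid evaluation items in four domains were derived. Specifically, this study systematized and presented bid evaluation items that comprehensively assess the expected effects of hosting international sports events in the sports, social, economic, and environmental domains, as follows. First, the items in the sports domain assess the potential for qualitative and quantitative growth of sports and for strengthening the capacity of professionals and related organizations. Second, the social domain includes items that assess the positive effects of the event on the local community, such as resident satisfaction, sense of community, quality of life, social equality, and community development. Third, the economic domain includes items that assess direct economic effects such as local consumption, income, and employment, and indirect economic effects such as the creation of tourism and business opportunities. Fourth, the environmental domain includes items that assess efforts to minimize negative environmental impacts of the event, preserve the natural environment, raise environmental awareness, and improve civic consciousness. The results are significant in that they systematize and present bid evaluation items that comprehensively assess expected event effects by domain, based on domestic and overseas case studies and the opinions of an expert panel. The results are also expected to play an important role in laying the groundwork for related academic discussion by suggesting a new direction for improving evaluation systems for international sports events.
Meanwhile, the bid evaluation items derived in this study can also serve as empirical standards and guidelines that allow parties seeking to host international sports events to assess their hosting capacity objectively and to set clear purposes and goals for the bid. Institutionalizing the submission of assessment results and related materials based on these standards and guidelines would therefore provide an institutional basis for effectively preventing, in advance, the indiscriminate bidding attempts that have long been identified as a problem. Furthermore, applying the proposed items consistently to post-event performance evaluation is expected to contribute to building a domestic knowledge management system for strengthening the capacity to bid for and operate international sports events.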
Studies of Rater and Item Effects in Rater Models
Zhao, Yihan Columbia University ProQuest Dissertations & Theses 2020 Overseas doctoral dissertation (DDOD)
The goal underlying educational testing is to measure psychological constructs in a particular domain and to produce valid inferences about examinees’ ability. To achieve this goal of getting a precise ability evaluation, test developers construct questions with different formats, such as multiple-choice (MC) items, and open-ended questions or constructed response (CR) test items, for example, essay items. In recent years, large-scale assessments have implemented CR items in addition to MC items as an essential component of the educational assessment landscape. However, utilizing CR items in testing involves two main challenges, including rater effects and rater correlations. One challenge is the error added by human raters’ subjective judgments, such as rater severity and rater central tendency. Rater severity effect refers to the effect that raters may tend to give consistently low or high ratings that cause biased ability evaluation (Leckie & Baird, 2011). Central tendency describes when raters tend to use middle categories in the scoring rubric and avoid using extreme criteria (Saal et al., 1980). The second challenge is that multiple raters usually grade an examinee’s essay for quality control purposes; however, ratings based on the same item are correlated and need to be handled carefully by appropriate statistical procedures (Eckes, 2011; Kim, 2009). To solve these problems, DeCarlo (2010) proposed an HRM-SDT model that extended the traditional signal detection theory (SDT) model used in the first level of HRM. The HRM-SDT model not only considers the hierarchical structure of rating data but also deals with various rater effects beyond rater severity. This research examined to what extent the HRM-SDT separates rater effects (i.e., rater severity and rater central tendency) from item effects (i.e., item difficulty). 
Accordingly, one goal of this study was to simulate various rater effects and item effects to investigate the performance of the HRM-SDT model with respect to separating these effects. The other goal was to compare the fit of the HRM-SDT model with that of a model commonly used in language assessments, the Rasch model, under different simulation conditions and to examine the difference between the two models in terms of separating rater and item effects. To answer these questions, Simulation A and Simulation B were conducted. In Simulation A, seven sets of parameters were varied. Simulation B addressed some questions of particular interest using another four sets of parameters, where both the rater and item parameters were simultaneously varied. This study found that the HRM-SDT accurately recovered parameters and clearly detected and separated changes in rater severity, rater central tendency, and item difficulty in most conditions.
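To make the separation problem concrete, the following minimal sketch shows a simplified dichotomous Rasch-type model with an additive rater severity term. This form is an assumption chosen for illustration: the abstract does not specify the Rasch variant that was fitted, and the HRM-SDT model itself is not reproduced here. The function name and parameter values are hypothetical.

```python
import math


def rated_correct_prob(theta, item_difficulty, rater_severity):
    """P(positive rating) under a simplified Rasch-type model.

    Assumed form: logit = theta - item_difficulty - rater_severity.
    """
    logit = theta - item_difficulty - rater_severity
    return 1.0 / (1.0 + math.exp(-logit))


# A harder item and a more severe rater shift the logit in the same way,
# so a single rater-item pairing cannot tell the two effects apart.
p_hard_item = rated_correct_prob(theta=1.0, item_difficulty=0.5, rater_severity=0.0)
p_severe_rater = rated_correct_prob(theta=1.0, item_difficulty=0.0, rater_severity=0.5)
print(round(p_hard_item, 4), round(p_severe_rater, 4))
```

In this additive form the two effects can be distinguished only when raters and items are crossed in the design. The simplified form also has no parameter for rater central tendency, which is one of the rater effects the abstract says the HRM-SDT model addresses.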
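Effects of Test Length, Item Composition Ratio, and Cut Score Location of Mixed-Format Tests on the Estimation of Classification Accuracy and Classification Consistency Using Item Response Theory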
In recent years, mixed-format tests composed of multiple-choice items and free-response items have been used frequently in criterion-referenced testing, from classroom assessment to large-scale standardized tests. This type of test combines the objectivity and scoring efficiency of multiple-choice items with the ability of free-response items to measure examinees' understanding more comprehensively. Such tests are increasingly developed on the basis of item response theory (IRT), which is useful for solving many problems in educational measurement and, through its models, gives test results more practical implications. Likewise, standard-setting methods based on IRT are increasingly used to set cut scores in criterion-referenced evaluation. Given this situation, determining classification accuracy and classification consistency, which correspond to the validity and reliability of criterion-referenced decisions, requires a classification index estimation method that uses the same test theory and reflects the psychometric issues of mixed-format tests. Several methods have been proposed for determining the classification indices of mixed-format tests from a single test administration. However, few studies have evaluated the performance of the Rudner method and the Guo method, which set the cut score on the IRT ability scale. This study therefore conducted a simulation to examine whether the classification indices estimated by the Rudner and Guo methods differ according to the length, item composition ratio, and cut score location of mixed-format tests, to investigate the interactions among these three conditions, and to determine which of the two methods produces more accurate estimates.
For this purpose, the two-parameter logistic model was used for the multiple-choice items and the generalized partial credit model for the free-response items to generate mixed-format test data under each study condition. The tests consisted of 20 or 60 items, with free-response items making up 10%, 30%, or 50% of the test, and examinees' abilities were drawn from the standard normal distribution. Next, maximum likelihood estimation was applied to estimate examinees' ability parameters, and classification accuracy, classification consistency, and the kappa coefficient were then estimated by the Rudner method and the Guo method for cut scores of -1.0, -0.5, 0, 0.5, and 1.0. In addition, the “true” classification indices, which serve as the criterion for evaluating the accuracy of the two methods, were calculated and compared with the estimated indices, and the standard error of the estimates, bias, and root mean square error were calculated for each method. The results of this study are summarized as follows. First, the longer the mixed-format test, the higher the classification indices, regardless of method. Second, the classification indices tended to increase as the proportion of free-response items in the mixed-format test increased, and this tendency was more prominent when the test was short. Third, as the cut score moved closer to zero, the classification accuracy and consistency indices became smaller, while the kappa coefficient became larger. The conclusions based on these results are as follows. First, it is reasonable to use the two methods to estimate the classification accuracy and consistency of mixed-format tests, but for the kappa coefficient, users should note that the two methods can produce inaccurate values depending on test length and cut score location.
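Second, for accurate and consistent evaluation of examinees' achievement levels, a mixed-format test should be constructed with a sufficient number of items. Third, when a mixed-format test consisting of a small number of items is administered, care should be taken with the composition ratio of multiple-choice and free-response items. Fourth, when the cut score is located at a low or high level of the examinee ability distribution, the performance of the Guo method is relatively lower than that of the Rudner method. In recent years, many tests that implement criterion-referenced evaluation, from classroom assessment to large-scale standardized testing, have used mixed-format tests composed of multiple-choice items, which ensure objective and efficient scoring, and free-response items, which can measure examinees' higher-order thinking more comprehensively. Such tests are increasingly developed on the basis of item response theory, which is useful for solving many problems in educational measurement and, through its models, gives test results more practical implications. Likewise, standard-setting methods based on item response theory are increasingly applied when cut scores are set in criterion-referenced evaluation. Given these developments in educational practice, determining classification accuracy and classification consistency, which correspond to the validity and reliability of decisions classifying examinees' achievement levels, requires a classification index estimation method that uses the same test theory and takes into account the psychometric issues of mixed-format tests. Several methods have accordingly been proposed for determining the classification indices of a mixed-format test from a single administration, but almost no studies have evaluated the performance of the Rudner and Guo methods, which set cut scores on the IRT ability scale and estimate classification indices there. This study therefore used simulation to examine whether the classification indices estimated by the Rudner and Guo methods differ by test length, item composition ratio, and cut score location, whether the three conditions interact, and which method yields more accurate estimates. To this end, the two-parameter logistic model was used for the multiple-choice items and the generalized partial credit model for the free-response items; simulated data were generated repeatedly for each condition, with test lengths of 20 and 60 items and free-response proportions of 10%, 30%, and 50% at each length, and examinee abilities were drawn from the standard normal distribution. Examinee ability parameters were then estimated by maximum likelihood, and classification accuracy, classification consistency, and kappa coefficient estimates were computed with both methods for cut scores of -1.0, -0.5, 0, 0.5, and 1.0. True classification indices, the criterion for evaluating the performance and accuracy of the two methods, were also computed for each condition and compared with the estimates to obtain the standard error, bias, and root mean square error of estimation for each method. The results are as follows. First, classification indices increased as the mixed-format test became longer. Second, classification indices increased as the proportion of free-response items increased, and this pattern was more pronounced when the test was short.
Third, the closer the cut score was to zero, the smaller the classification accuracy and consistency indices became, whereas the kappa coefficient showed the opposite pattern. The conclusions drawn from these results are as follows. First, using the two methods to estimate the classification accuracy and consistency indices of mixed-format tests is reasonable, but caution is needed with the kappa coefficient because the two methods can yield inaccurate values depending on test length and cut score location. Second, to evaluate examinees' achievement levels accurately and consistently, a mixed-format test should include a sufficient number of items. Third, when a mixed-format test with a small number of items is administered, attention should be paid to the ratio of multiple-choice to free-response items. Fourth, when the cut score lies at a low or high level of the examinee ability distribution, the Guo method performs relatively worse than the Rudner method.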
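The role of the cut score location can be illustrated with a minimal sketch of a Rudner-style expected classification accuracy for a single cut score. It assumes that each examinee's true ability is normally distributed around the ability estimate with the reported standard error. The function names and example values are hypothetical, and the sketch does not reproduce the study's simulation or the Guo method.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rudner_accuracy(theta_hats, std_errors, cut):
    """Expected classification accuracy for one cut score.

    For each examinee, take the probability that the true ability lies on
    the same side of the cut as the estimate, assuming
    true ability ~ Normal(theta_hat, se**2), then average over examinees.
    """
    total = 0.0
    for theta_hat, se in zip(theta_hats, std_errors):
        p_below = normal_cdf((cut - theta_hat) / se)
        total += (1.0 - p_below) if theta_hat >= cut else p_below
    return total / len(theta_hats)


# Hypothetical ability estimates and standard errors.
theta_hats = [-1.2, -0.4, 0.1, 0.6, 1.5]
std_errors = [0.35, 0.30, 0.30, 0.30, 0.40]

for cut in (-1.0, 0.0, 1.0):
    print(cut, round(rudner_accuracy(theta_hats, std_errors, cut), 3))
```

Examinees whose estimates lie close to the cut contribute probabilities near 0.5, so accuracy tends to fall when the cut sits where most examinees are. With abilities drawn from a standard normal distribution, that region is around zero, which is in line with the third result reported above.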
I only Care about the Americano: Basic Item Effect
Eo, Mi Yeon Korea University Graduate School 2017 Domestic master's thesis
The Republic of Korea is sometimes referred to as the Coffee Republic because Koreans consume so much coffee every day. Accordingly, coffee shops abound in Korea, and competition in this industry is severe. Prices vary even when similar items are sold. For example, one coffee shop near a university sells a cup of Americano for 1,000 won, whereas another coffee shop, such as Starbucks, sells the same type of coffee for 4,100 won. Assume that café W offers an inexpensive cup of Americano but sells its other items at higher prices than other stores; conversely, café Y sells all items at a slightly lower price than other cafés. In this situation, are consumers equally satisfied with the two cafés? Which café do consumers prefer? This study is founded on these questions. Some industries or stores offer basic items. For example, Americano is a basic item in the coffee shop industry. A basic item has characteristics that influence how consumers judge a store they are not familiar with. If a basic item such as Americano is sold at a low price, consumers perceive this coffee as “cheap”. This perception can be extended to other items, and thus consumers conclude that the store's overall prices are low. This judgment leads to brand preference. This series of processes is called the basic item effect. The basic item effect was confirmed through Studies 1 and 2. Participants in the "one very" condition (café W) reported more positive overall price perception and brand preference than those in the "all little" condition (café Y). The basic item effect was observed in Study 1. However, the basic item effect was not affected by the number of items. In Study 2, the basic item effect also emerged, and its strength differed depending on price sensitivity. This study not only presents a new topic for future research on consumers and price but also provides evidence to verify other interesting phenomena.
The results of this study have important implications for marketers who want to start a new business.
The Use of Large Language Models to Predict Item Properties
Smart, Francis Michigan State University ProQuest Dissertations & 2024 Overseas doctoral dissertation (DDOD)
Calibrating items is a crucial yet costly requirement, both for new tests and for existing ones whose items become outdated through changing relevance or overexposure. Traditionally, calibration involves giving items to a large number of participants, a process that requires substantial time and resources. To reduce these costs, researchers have sought alternative calibration methods. Before the emergence of Large Language Models (LLMs), these methods mainly relied on expert opinions or computational analysis of item features. Yet the accuracy of experts in predicting item performance has varied, and computational approaches often struggle to capture the intricate semantic details of test items. The emergence of LLMs might offer a new avenue for addressing the need for item calibration. These models, popularized by OpenAI (for example, through the GPT series), have shown remarkable abilities in mimicking complex human thought processes and performing advanced reasoning tasks. Their achievements in passing sophisticated exams and executing cross-language translations underline their potential. However, their capacity for predicting item properties in test calibration has not been thoroughly investigated. Traditional calibration relies heavily on direct human interaction, such as pretesting and expert assessment, or on statistical modeling of item features through resource-intensive machine learning algorithms. This dissertation explores the potential of LLMs to predict item characteristics, a task that has traditionally required human insight or complex statistical models. With the increasing accessibility of high-performance LLMs from organizations like OpenAI, Meta, and Google, and through open-source platforms such as HuggingFace.com, there is promising ground for investigation.
This study examines whether LLMs could replace human efforts in item calibration tasks. To evaluate the effectiveness of LLMs in predicting item properties, this dissertation implements a training and testing framework, focusing on assessing both the relative and absolute difficulties of items. It undertakes three theoretical investigations: first, examining the ability of LLMs to predict the relative difficulty of items; second, assessing the feasibility of using multiple LLMs as substitutes for test-takers and attempting to use their responses as predictors of item difficulty; and third, applying a search algorithm, guided by LLM predictions of relative difficulty, to ascertain absolute difficulties. The findings indicate that the models are statistically significant predictors of relative item difficulty but have modest explanatory power, with adjusted R-squared values of around 5-10%. However, using LLMs to predict relative item difficulties through pairwise comparisons proves more promising, achieving a pairwise accuracy of about 62% and predicted correlations with item difficulty ranging between 0.36 and 0.42. This suggests that although LLMs show potential in certain aspects of item calibration, their effectiveness varies depending on the specific task. This is a potentially promising result that warrants further exploration of the capabilities of LLMs for item calibration, which could lead to more efficient and cost-effective methods in test development and maintenance.
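The pairwise-comparison evaluation and the search for absolute difficulty described above can be sketched roughly as follows. This is an illustrative sketch only, not the dissertation's implementation: the function names, the `harder` comparator that stands in for an LLM judgment, and the use of a binary search over anchor items with known difficulties are all assumptions, since the abstract does not specify the search algorithm.

```python
from itertools import combinations
from typing import Callable, Optional, Sequence, Tuple


def pairwise_accuracy(
    difficulties: Sequence[float],
    harder: Callable[[int, int], bool],
) -> float:
    """Share of item pairs where harder(i, j) matches the known difficulties.

    harder(i, j) is a hypothetical stand-in for an LLM judgment that
    item i is harder than item j. Tied pairs are skipped.
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(difficulties)), 2):
        if difficulties[i] == difficulties[j]:
            continue
        total += 1
        if harder(i, j) == (difficulties[i] > difficulties[j]):
            correct += 1
    return correct / total if total else float("nan")


def locate_difficulty(
    anchor_difficulties: Sequence[float],
    harder_than_anchor: Callable[[int], bool],
) -> Tuple[Optional[float], Optional[float]]:
    """Bracket a new item's difficulty between two calibrated anchors.

    anchor_difficulties must be sorted in ascending order.
    harder_than_anchor(k) is a hypothetical comparator: True if the new
    item is judged harder than anchor k. Returns (lower, upper) bounds;
    None marks an open end of the scale.
    """
    lo, hi = 0, len(anchor_difficulties)
    while lo < hi:
        mid = (lo + hi) // 2
        if harder_than_anchor(mid):
            lo = mid + 1  # new item sits above this anchor
        else:
            hi = mid      # new item sits at or below this anchor
    lower = anchor_difficulties[lo - 1] if lo > 0 else None
    upper = anchor_difficulties[lo] if lo < len(anchor_difficulties) else None
    return lower, upper
```

With a comparator that is right only about 62% of the time, a single binary search would often take a wrong branch, so in practice each comparison would presumably need to be repeated or aggregated before it is trusted.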
As information technology has advanced rapidly around the world, its role and usability have grown. Every country invests in personnel training and R&D to strengthen its future information capabilities, and every country has adopted IT in its military information systems early by developing IT infrastructure and information and communication services. However, cyber breaches have increased in proportion to the growing investment in, and use of, IT. Although the defense information security management system is not the same as an ISMS under ISO/IEC 27001, there are military information affairs instructions that describe organizational security evaluation, and military security affairs instructions that describe procedures and criteria for auditing and measuring security in order to protect military information. As threats in cyberspace have increased, the military has analyzed and evaluated the vulnerabilities of each information system and organization since 2003. That is not enough, however, to check and evaluate a system as large and complex as a military information system comprehensively and systematically. This paper therefore proposes a methodology for evaluating the information security level of military information systems, together with control items (13 control areas, 41 control items) tailored to military information. The control items are based on ISO/IEC 27001 (BS7799) and draw on the concepts of SSE-CMM (for the production, development, and operation of security technology), CC (for the security level of information security products and systems), KCMVP (for the security level of cryptographic modules), and national security accreditation. An information system has the particular property that if even one control item has a low security level, the system as a whole is weak.
The evaluation levels proposed in this paper reflect this property. The method for evaluating the information security level of a military information system is as follows. First, we check the 4 management processes and 15 requirements of the PDCA (Plan-Do-Check-Act) cycle covered by ISO/IEC 27001 or G-ISMS. Second, we check the 3 requirements and 12 check items for documentation covered by ISO/IEC 27001 or G-ISMS. Third, we choose, from the control items proposed in this paper, those appropriate to the particular military information system. Fourth, we evaluate the military information system against the chosen control items. Here, a military information system consists of the management, operation, and technology fields, and its main factors are assets, policy, organization, managers and users, physical and environmental facilities, security systems, and so on. Fifth, we determine the information security level by applying the proposed evaluation levels to the management system and control items. In particular, the proposed methodology can serve as a baseline criterion for application to real military information systems. We developed and optimized the security control items in light of the special environment of military information systems and the current level of technology. The security control items and evaluation levels proposed in this paper for evaluating the information security level of military information systems can support systematic and efficient management of their operation and assets.
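The weakest-link property described above, where one low-scoring control item makes the whole system weak, can be illustrated with a short sketch. The abstract does not give the paper's actual scoring rule, so taking the minimum level across control items is an assumption, as are the function names, the integer level scale, and the example item names.

```python
from typing import List, Mapping


def overall_security_level(item_levels: Mapping[str, int]) -> int:
    """Return the system level as the lowest level among control items.

    Assumption: levels are integers where a higher number means stronger
    security, and the system is only as strong as its weakest item.
    """
    if not item_levels:
        raise ValueError("at least one control item is required")
    return min(item_levels.values())


def weakest_items(item_levels: Mapping[str, int]) -> List[str]:
    """List the control items that determine the overall level."""
    lowest = overall_security_level(item_levels)
    return sorted(name for name, level in item_levels.items() if level == lowest)


# Hypothetical control items and levels, for illustration only.
example = {"access control": 4, "cryptography": 2, "physical security": 5}
print(overall_security_level(example), weakest_items(example))
```

Under this rule an average would hide the weak cryptography score, whereas the minimum surfaces it as the item to fix first.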
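A Study on the Development of an Evaluation System for Professional Manpower Training Programs in High-Tech Fields: Focusing on Digital Manufacturing Equipment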
남기솔 Hanbat National University Graduate School of Industry 2023 Master's thesis (domestic)
With the rapid emergence and evolution of high-tech industries, securing specialized skills and talent has become a key strategy for strengthening national competitiveness.
Korea's national human resource development policy has been implemented as a means of economic development, and policies for fostering human resources in science and technology are still being pursued. The basic plan for science and technology talent policy is established every five years, and the fourth plan, currently under way, focuses on the workforce's ability to respond to change. In policy, not only formulation but also goal achievement and process matter, and policy evaluation helps guide a policy to success. Policy evaluation is the work of verifying the effects of related activities and systematically revealing the causal relationship between results and actions. Moreover, as investment in national R&D programs and manpower training programs increases, securing the efficiency and justification of these policies and programs is increasingly emphasized. However, research on evaluating human resource training programs in science and technology is lacking relative to its importance and necessity. A main reason is that the nature of educational programs makes performance management difficult: educational outcomes emerge at widely differing times and are hard to track and manage. In addition, the educational content of each field has very distinctive characteristics, which makes uniform evaluation difficult. The need to develop a dedicated evaluation system for human resource training programs, especially in science and technology, and to conduct related research has therefore been raised several times. However, as recent studies also point out, research on this topic has been passive or has had clear limitations. This study therefore aims to establish an evaluation system for interim and self-checks of professional manpower training programs in high-tech fields.
Specifically, it aims to build an evaluation system that reflects the characteristics of the target program and takes into account the linked outcomes of education and research. The target program is a professional manpower training program of the Ministry of Trade, Industry and Energy; it is related to HRST (Human Resources in Science and Technology) policy and is also closely tied to industrial policy. The program aims to train experts in digital manufacturing equipment. Currently, one lead organization and five universities participate, and the program has entered its fourth year. The study proceeded in three main steps. First, the target program was analyzed and structured based on the logic model. Second, an evaluation system reflecting the program's characteristics was developed based on a literature review. Finally, practical usefulness was confirmed through an empirical case study applying the evaluation system. As a result, 4 stages, 13 items, and 33 indicators were developed to evaluate the entire program implementation process. AHP (Analytic Hierarchy Process) analysis was used to derive weights for the items and stages, and the weights were evenly distributed across all stages. Four experts with experience in implementing similar programs evaluated the target program using the proposed system. The evaluation showed that program performance was relatively strong in the activity stage but weak in the outcome stage, for example in the employment rate. This study is distinctive in three respects. First, it builds an evaluation system that considers the characteristics of both science and engineering education and the target program. Second, it selects evaluation items and indicators according to the stages of the logic model. Third, it demonstrates practical usefulness on a program that was actually carried out.
The results of this study are expected to help improve the performance of the target program and to be usable for self-checks of similar professional manpower training programs in high-tech fields. They can also serve as prior research for building evaluation systems and developing indicators in other fields. However, this study is limited in that it covers a single program; follow-up methodological research is needed to build evaluation systems that cover multiple programs. In addition, although the evaluation system was validated on the target program, this validation is closer to a case study. The value of the proposed methodology therefore needs to be confirmed through comparative analysis against the results of applying other or existing evaluation systems.
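As a rough illustration of how AHP turns expert pairwise judgments into stage weights, the sketch below uses the row geometric mean method, a common approximation to the principal-eigenvector weights. The abstract does not say which variant the study used, so the method choice, the function names, and the example matrix are assumptions; the random index values are Saaty's commonly cited ones.

```python
import math
from typing import List, Sequence

# Saaty's random consistency index for matrix sizes 3 to 5.
RANDOM_INDEX = {3: 0.58, 4: 0.90, 5: 1.12}


def ahp_weights(matrix: Sequence[Sequence[float]]) -> List[float]:
    """Approximate AHP priority weights with the row geometric mean.

    matrix[i][j] states how much more important criterion i is than
    criterion j, so matrix[j][i] should equal 1 / matrix[i][j].
    """
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_ratio(
    matrix: Sequence[Sequence[float]], weights: Sequence[float]
) -> float:
    """Consistency ratio (CR); values below 0.1 are usually accepted."""
    n = len(matrix)
    if n not in RANDOM_INDEX:
        raise ValueError("this sketch supports 3 to 5 criteria only")
    # Estimate lambda_max as the mean of (A w)_i / w_i.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    consistency_index = (lambda_max - n) / (n - 1)
    return consistency_index / RANDOM_INDEX[n]


# Hypothetical judgments for four logic-model stages; not data from the study.
judgments = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
stage_weights = ahp_weights(judgments)
print(stage_weights, consistency_ratio(judgments, stage_weights))
```

A matrix of all ones is the limiting case of evenly distributed weights; real expert judgments would typically produce unequal entries, and the consistency ratio indicates whether those judgments are coherent enough to use.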