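Measures the ability to communicate by combining all four language skills: Reading, Listening, Speaking, and Writing.
Is 100% academically focused, measuring the kind of English used in academic settings.
Provides fair and objective scoring.
Provides valid and reliable information to help score users make effective decisions.
ETS recognizes that the reliability of test scores is an important measure of a test's quality. Reliability matters because it indicates how consistently a test measures test takers' ability. Testing, like any other measurement activity, is easily influenced by factors unrelated to the ability being measured; such irrelevant factors produce so-called "measurement error," which in turn determines how reliable the test scores are. The more reliable the scores are, the more confidence score users (often university admissions committee staff, who use the scores to decide which applicants to admit) have in using them to make decisions about test takers. In educational measurement, score reliability is regarded as a statistical index that quantifies and evaluates how consistent test scores are. In an official ETS research report, ETS states that the reliability estimate for TOEFL test scores is approximately 0.95. [2]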
An important measure of the quality of a test is how reliable the test scores are. Reliability is important because it indicates how consistently a test measures test takers’ ability. Testing, like other measurement events, is subject to the influence of many factors that are not relevant to the ability being measured. Such irrelevant factors contribute to what is called “measurement error,” which in turn determines how reliable test scores are. The more reliable the scores are, the more confidence score users have in using the scores for making important decisions about test takers. In educational measurement, score reliability is a statistical index to quantify and evaluate how consistent test scores are.
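The link between measurement error and reliability described above can be written out using classical test theory. The text does not name the model behind its reliability figures, so the notation below is an assumption for illustration: $X$ is an observed score, $T$ the true score, $E$ the measurement error, and $\rho_{XX'}$ the reliability coefficient.

```latex
% Classical test theory sketch (assumed model, not stated in the source)
\begin{align}
  X &= T + E, \qquad \sigma_X^2 = \sigma_T^2 + \sigma_E^2 \\
  \rho_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2} \\
  \mathrm{SEM} &= \sigma_E = \sigma_X \sqrt{1 - \rho_{XX'}}
\end{align}
% Worked example with a reliability estimate of 0.95:
%   SEM = \sigma_X \sqrt{1 - 0.95} = \sigma_X \sqrt{0.05} \approx 0.224\,\sigma_X
```

Under this model, a reliability of 0.95 would mean that about 5% of the observed score variance is attributable to measurement error, and the standard error of measurement would be a little under a quarter of the score standard deviation.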
The three sections of the TOEFL test (Listening, Structure/Writing, and Reading) are designed to measure different skills within the domain of English language proficiency. It is commonly recognized that these skills are interrelated; persons who are highly proficient in one area tend to be proficient in the other areas as well. If this relationship were perfect, there would be no need to report scores for each section. The scores would represent the same information repeated several times.
In this special study, the test performance of repeaters who took a second test within 30 days of a first test, in the period from January to August 2007, was examined and evaluated. Only small changes were observed between the repeaters' first and second test scores. In addition, the effect sizes of the mean score changes for the four sections and the total score were small, which supports the conclusion that the mean score changes are negligible. Moderate to high correlations between the two sets of test scores indicated a high degree of consistency in the rank ordering of repeaters' scores. In the context of the data used in the study, the correlations reflect the test-retest reliability of alternate forms, except that the data were not collected under a controlled design.
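The three quantities mentioned in this study (mean score change, effect size, and test-retest correlation) can be sketched in a few lines of Python. This is only an illustration: the scores are hypothetical and not taken from the study, and the effect-size definition (mean change divided by the pooled standard deviation of the two administrations) is an assumption, since the text does not say which formula was used.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def effect_size(first, second):
    """Standardized mean change: (mean2 - mean1) / pooled SD.

    Assumed definition; the study's exact formula is not given in the text.
    """
    pooled_sd = sqrt((stdev(first) ** 2 + stdev(second) ** 2) / 2)
    return (mean(second) - mean(first)) / pooled_sd


# Hypothetical total scores for six repeaters (NOT data from the study).
first_test = [62, 75, 81, 90, 98, 104]
second_test = [64, 74, 84, 91, 97, 107]

mean_change = mean(second_test) - mean(first_test)
d = effect_size(first_test, second_test)
r = pearson_r(first_test, second_test)

print(f"mean change: {mean_change:.2f}")
print(f"effect size: {d:.2f}")
print(f"test-retest correlation: {r:.3f}")
```

With data like these, a small effect size alongside a high correlation would match the pattern the study reports: scores shift only slightly on average, and repeaters keep roughly the same rank order across the two administrations.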