Here’s a scenario familiar to many SLPs. You’ve administered several standardized language tests to your student (e.g., CELF-5 & TILLS). You expected to see roughly similar scores across tests. Much to your surprise, you find that while your student attained somewhat average scores on one assessment, s/he completely bombed the second assessment, and you have no idea why that happened.
So you go on social media and start crowdsourcing for information from a variety of SLPs located in a variety of states and countries in order to figure out what has happened and what you should do about this. Of course, the problem in such situations is that while some responses will be spot on, many will be utterly inappropriate. Luckily, the answer lies much closer than you think, in the actual technical manual of the administered tests.
So what is responsible for such a drastic discrepancy? A few things, actually. For starters, unless both tests were co-normed (used the same sample of test takers), be prepared to see disparate scores due to the differing ability levels of the children in each test's normative group. Another important factor in the score discrepancy is how accurately each test differentiates disordered children from typically functioning ones.
Let’s compare two actual language tests to learn more. For the purpose of this exercise let us select The Clinical Evaluation of Language Fundamentals-5 (CELF-5) and the Test of Integrated Language and Literacy (TILLS). The former is a very familiar entity to numerous SLPs, while the latter is just coming into its own, having been released in the market only several years ago.
Both tests share a number of similarities. Both were created to assess the language abilities of children and adolescents with suspected language disorders. Both assess aspects of language and literacy (albeit not to the same degree nor with the same level of thoroughness). Both can be used for language disorder classification purposes, or can they?
Actually, my last statement is rather debatable. A careful perusal of the CELF-5 reveals that its normative sample of 3000 children included a whopping 23% of children with language-related disabilities. In fact, the folks from the Leaders Project did such an excellent and thorough job reviewing its psychometric properties that, rather than repeating that information, I will refer readers to the review of the CELF-5's limitations on the Leaders Project website. Furthermore, even the CELF-5 developers themselves have stated that: “Based on CELF-5 sensitivity and specificity values, the optimal cut score to achieve the best balance is -1.33 (standard score of 80). Using a standard score of 80 as a cut score yields sensitivity and specificity values of .97.”
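To see how a cut score of -1.33 lines up with a standard score of 80, here is a minimal sketch of the conversion. It assumes the conventional standard-score scale with a mean of 100 and a standard deviation of 15; the function names are my own and do not come from any test manual.

```python
# Assumption: standard scores are on the conventional scale
# (mean 100, standard deviation 15).
MEAN = 100.0
SD = 15.0


def z_to_standard_score(z: float) -> float:
    """Convert a z-score (distance from the mean in SDs) to a standard score."""
    return MEAN + SD * z


def standard_score_to_z(score: float) -> float:
    """Convert a standard score back to a z-score."""
    return (score - MEAN) / SD


# The cut score quoted by the test developers: -1.33 SDs.
print(round(z_to_standard_score(-1.33)))  # 80
```

The same arithmetic works in reverse, so you can check how many standard deviations below the mean any reported standard score falls.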
In other words, obtaining a standard score of 80 on the CELF-5 indicates that a child presents with a language disorder. Of course, as many SLPs already know, the eligibility criteria in the schools require language scores far below that in order for the student to qualify for language therapy services.
In fact, the test’s authors are fully aware of that and acknowledge that in the same document. “Keep in mind that students who have language deficits may not obtain scores that qualify him or her for placement based on the program’s criteria for eligibility. You’ll need to plan how to address the student’s needs within the framework established by your program.”
But here is another issue: the CELF-5 sensitivity group was very small, consisting of only “67 children ranging from 5;0 to 15;11”, whose only requirement was to score 1.5 SDs below the mean “on any standardized language test”. As the Leaders Project reviewers point out: “This means that the 67 children in the sensitivity group could all have had severe disabilities. They might have multiple disabilities in addition to severe language disorders including severe intellectual disabilities or Autism Spectrum Disorder making it easy for a language disorder test to identify this group as having language disorders with extremely high accuracy.” (pgs. 7-8)
Of course, this raises the question: why would anyone continue to administer a test to students if its administration (a) does not guarantee disorder identification and (b) will not make the student eligible for language therapy despite demonstrated need?
The problem is that even though SLPs are mandated to use a variety of quantitative clinical observations and procedures in order to reliably qualify students for services, standardized tests still carry more weight than they should. Consequently, it is important for SLPs to select the right test to make their job easier.
The TILLS is far less well known than the CELF-5, yet in the few years it has been on the market it has made its presence felt as a solid assessment tool with valid and reliable psychometric properties. Again, the venerable Dr. Carol Westby has already done such an excellent job reviewing its psychometric properties that I will refer readers to her review rather than repeat that information here. The upshot of her review is as follows: “The TILLS does not include children and adolescents with language/literacy impairments (LLIs) in the norming sample. Since the 1990s, nearly all language assessments have included children with LLIs in the norming sample. Doing so lowers overall scores, making it more difficult to use the assessment to identify students with LLIs.” (pg. 11)
Now, many proponents of including children with language disorders in the normative sample will make a variation of the following claim: “You CANNOT diagnose a language impairment if children with language impairment were not included in the normative sample of that assessment!” Here’s a major problem with such an assertion. When a child is referred for a language assessment, we really have no way of knowing whether this child has a language impairment until we actually finish testing them. We are in fact attempting to confirm or refute this, hopefully via the use of reliable and valid testing. However, if the normative sample includes many children with language and learning difficulties, this significantly affects the accuracy of our identification, since we need to compare this child’s results to those of typically developing children, not disordered ones, in order to learn whether the child has a disorder in the first place. As per Peña, Spaulding and Plante (2006), “the inclusion of children with disabilities may be at odds with the goal of classification, typically the primary function of the speech pathologist’s assessment. In fact, by including such children in the normative sample, we may be “shooting ourselves in the foot” in terms of testing for the purpose of identifying disorders.” (p. 248)
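The effect of a mixed normative sample can be illustrated with a little arithmetic. The sketch below is purely hypothetical: the raw-score means and standard deviations are invented for illustration and do not come from any test manual, and the 23% proportion simply echoes the figure mentioned earlier in this post. It assumes standard scores on the usual mean-100, SD-15 scale.

```python
from math import sqrt

# Hypothetical raw-score distributions (invented for illustration only).
TYPICAL_MEAN, TYPICAL_SD = 50.0, 10.0      # typically developing children
DISORDER_MEAN, DISORDER_SD = 35.0, 10.0    # children with language disorders
DISORDER_SHARE = 0.23                      # share of disordered children in the mixed sample


def mixed_norms(p: float) -> tuple[float, float]:
    """Mean and SD of a sample mixing the two groups, with share p disordered."""
    mean = (1 - p) * TYPICAL_MEAN + p * DISORDER_MEAN
    # Variance of a mixture: weighted E[X^2] minus the squared overall mean.
    second_moment = ((1 - p) * (TYPICAL_SD**2 + TYPICAL_MEAN**2)
                     + p * (DISORDER_SD**2 + DISORDER_MEAN**2))
    return mean, sqrt(second_moment - mean**2)


def standard_score(raw: float, mean: float, sd: float) -> float:
    """Standard score (mean 100, SD 15) of a raw score against given norms."""
    return 100 + 15 * (raw - mean) / sd


raw = 37.0  # hypothetical raw score of the referred child
typical_only = standard_score(raw, TYPICAL_MEAN, TYPICAL_SD)
mixed = standard_score(raw, *mixed_norms(DISORDER_SHARE))
print(f"Against typical-only norms: {typical_only:.1f}")
print(f"Against mixed norms:        {mixed:.1f}")
```

With these made-up numbers, the mixed sample has a lower mean and a larger spread, so the same raw score converts to a noticeably higher standard score. That is the mechanism by which a child can look "average" on one test and impaired on another.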
Then there’s a variation of this assertion, which I have seen in several Facebook groups: “Children with language disorders score at the low end of normal distribution”. Once again, such an assertion is incorrect, since Spaulding, Plante & Farinella (2006) have actually shown that, on average, these kids will score at least 1.28 SDs below the mean, which is not the low average range of normal distribution by any means. As per the authors: “Specific data supporting the application of “low score” criteria for the identification of language impairment is not supported by the majority of current commercially available tests. However, alternate sources of data (sensitivity and specificity rates) that support accurate identification are available for a subset of the available tests.” (p. 61)
Now, let us get back to your child in question, who performed so differently on the two administered tests. Given his clinically observed difficulties, you fully expected your testing to confirm them. But you are now more confused than before. Don’t be! Look up each test’s sensitivity and specificity in its technical manual. Vance and Plante (1994) put forth the following criteria for accurate identification of a disorder (discriminant accuracy): “90% should be considered good discriminant accuracy; 80% to 89% should be considered fair. Below 80%, misidentifications occur at unacceptably high rates”, leading to “serious social consequences” for misidentified children (p. 21).
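If a manual reports raw classification counts rather than percentages, sensitivity and specificity are easy to compute yourself. Here is a minimal sketch that applies the Vance and Plante (1994) bands quoted above. The counts are hypothetical, the function names are my own, and treating anything between 89% and 90% as "fair" is my assumption, since the quoted criteria do not address that gap.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Share of children WITH a disorder whom the test correctly flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Share of typically developing children whom the test correctly clears."""
    return true_neg / (true_neg + false_pos)


def rate_accuracy(value: float) -> str:
    """Label a sensitivity or specificity value using the Vance & Plante bands."""
    if value >= 0.90:
        return "good"
    if value >= 0.80:  # assumption: values just under 90% count as fair
        return "fair"
    return "unacceptable"


# Hypothetical validation study: 50 children with a disorder, 50 without.
sens = sensitivity(true_pos=45, false_neg=5)
spec = specificity(true_neg=42, false_pos=8)
print(f"Sensitivity {sens:.0%}: {rate_accuracy(sens)}")
print(f"Specificity {spec:.0%}: {rate_accuracy(spec)}")
```

Both numbers matter: a test can be excellent at flagging disordered children while still over-identifying typical ones, or the reverse.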
Review the sensitivity and specificity of your test/s, take a look at the normative samples, and see whether anything unusual jumps out at you that suggests the administered test may have issues with assessing what it purports to assess. Then, after supplementing your standardized testing results with good quality clinical data (e.g., narrative samples, dynamic assessment tasks, etc.), consider creating a solidly referenced purchasing pitch to your administration to invest in more valid and reliable standardized tests.
I hope you find this information helpful in your quest to better serve the clients on your caseload. If you are interested in learning more about evidence-based assessment practices, as well as the psychometric properties of various standardized speech-language tests, visit the SLPs for Evidence-Based Practice group on Facebook.
- Peña, E. D., Spaulding, T. J., & Plante, E. (2006). The composition of normative groups and diagnostic decision-making: Shooting ourselves in the foot. American Journal of Speech-Language Pathology, 15, 247-254.
- Spaulding, T. J., Plante, E., & Farinella, K. A. (2006). Eligibility criteria for language impairment: Is the low end of normal always appropriate? Language, Speech, and Hearing Services in Schools, 37, 61-72.
- Vance, R., & Plante, E. (1994). Selection of preschool language tests: A data-based approach. Language, Speech, and Hearing Services in Schools, 25, 15-24.
7 thoughts on “Help, My Student has a Huge Score Discrepancy Between Tests and I Don’t Know Why?”
After learning that CELF-5 over-inflates scores so much I feel so uneasy about using it. I use the TILLS now and supplement it with other non-standardized assessments.
I have just started using the TILLS and find the results to be much more meaningful in a functional way. I never knew about cut scores and it makes so much more sense to include subtests that show the relationship between spoken and written language! I found the “Social Communication” subtest useful and was able to do more social/pragmatic assessment which correlated well with that short subtest. I do hope that the TILLS stimulus book goes online soon!