If you’ve done any research at all into pre-employment assessments, you’ve probably stumbled upon the words “validity” and “validation” more than once. But what exactly does it mean to say that a pre-employment assessment has validity?
In this blog, you’ll find out:
- What validity means
- The 4 different aspects of validity
- What validation is
- Other things that need to be considered when validating a pre-employment assessment
What does Validity Mean?
You’ve probably heard the word “validity” many times when others talk about the quality of an assessment. And yes, validity is often used to describe a test that is truthful, accurate and authentic (Hubley & Zumbo, 1996), but do you know what it actually entails?
“Validity is an integrated evaluative judgement of the degree to which empirical evidence and theoretical rationales support the accuracy and appropriateness of inferences and actions based on test score” (Messick, 1987).
To put it more simply (because I can imagine that the definition above is perhaps not the easiest to understand), validity is the extent to which a test measures what it claims to measure. Validity helps us to know how well an assessment can accurately explain the underlying outcome of interest or answer the intended study question.
Some might think that validity is a property of the test or assessment itself. It is, however, a property of the interpretation or specific meaning of the test’s scores (Messick, 1995). Long story short, it is the interpretations of the scores, and the implications that follow from them, that need to be validated.
This provides perspective on whether test score interpretations are consistent over time or explain more complicated systems of testing (e.g., empirical data or social consequences) (Messick, 1989; Borsboom et al., 2004). Although the concept of validity is always multi-faceted and intertwined with other concepts, it is possible to break it down into several distinct aspects to gain a deeper understanding.
In this article, I’ll use terms like “assessment”, “instrument” and “measure” to describe any form of measurement, including scales, tests, tools, questionnaires and surveys. A “construct” or “attribute”, by contrast, refers to an unobservable theoretical variable that is used to describe a phenomenon (Brahma, 2009).
The 4 Different Aspects of Validity
Does the concept of validity still sound a bit complex? No need to worry. We are going to have a closer look at validity from multiple aspects (Taherdoost, 2016).
Source: Taherdoost, H. (2016)
After deciding on particular behaviors or constructs that we want to measure, how would we make sure there is a causal relationship between those constructs and the test result?
This is the point where we need to ensure we have a sufficient level of construct validity. Simply put, it reflects how well we translate or transcribe a concept, idea or behavior into a practical, functioning assessment.
Construct validity consists of two primary elements, which are discriminant validity and convergent validity.
- Discriminant validity shows whether measures of two constructs that are supposed to be unrelated are actually unrelated to each other.
- In contrast, convergent validity describes how well two measures of constructs that are theoretically related to each other are in fact related.
For example, if we intend to measure self-motivation, we need to make sure that the items in our instrument are actually related to theoretically defined self-motivation (e.g., amotivation, self-efficacy, intrinsic/extrinsic motivation etc.; convergent validity), but not related (too much) to creativity (discriminant validity).
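As a rough illustration of the self-motivation example, the sketch below correlates a new motivation score with a theoretically related measure and with a theoretically unrelated one. All scores are made up, the variable names are hypothetical, and a plain Pearson correlation is only one of several ways to examine convergent and discriminant validity.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical scores for eight candidates (invented for illustration).
new_motivation = [12, 15, 9, 20, 17, 11, 14, 18]   # the instrument under study
self_efficacy = [30, 34, 25, 41, 38, 27, 33, 39]   # theoretically related
creativity = [5, 5, 3, 3, 7, 7, 4, 6]              # theoretically unrelated

# Convergent evidence: we hope for a strong correlation.
convergent_r = pearson(new_motivation, self_efficacy)
# Discriminant evidence: we hope for a correlation close to zero.
discriminant_r = pearson(new_motivation, creativity)

print(f"convergent r = {convergent_r:.2f}")
print(f"discriminant r = {discriminant_r:.2f}")
```

How strong is strong enough depends on the construct and the field, so any cut-off you apply to these two numbers is a judgment call rather than a fixed rule.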
Content validity represents the extent to which items in an instrument are relevant and reflect the definition of the construct for a particular assessment (i.e., item quality; Haynes et al., 1995; Almanasreh et al., 2019).
To establish content validity, it is important to go through several steps: an intensive literature review, creating assessment items based on existing literature, conducting pilot studies with a small target group, preliminary analyses, deleting undesirable items, and evaluation by expert judges.
After finishing those processes, we can have more confidence that the tool we developed only includes necessary items and is validated in the sense that the score from our assessment genuinely represents what we want it to measure. Makes sense, right?
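To make the expert-judge step a little more concrete, here is a minimal sketch of one commonly used summary, the content validity index (CVI). It is an assumption on my part that judges rate each item on a 4-point relevance scale; the ratings, item names and the 0.78 cut-off are purely illustrative and not prescribed by the sources cited above.

```python
# Hypothetical relevance ratings (1 = not relevant ... 4 = highly relevant)
# given by five expert judges to four draft items.
ratings = {
    "item_1": [4, 4, 3, 4, 3],
    "item_2": [2, 3, 1, 2, 2],
    "item_3": [4, 3, 4, 4, 4],
    "item_4": [3, 4, 2, 3, 4],
}


def item_cvi(item_ratings):
    """Share of judges who rated the item as relevant (a rating of 3 or 4)."""
    relevant = sum(1 for rating in item_ratings if rating >= 3)
    return relevant / len(item_ratings)


# Item-level index for every draft item.
i_cvi = {item: item_cvi(r) for item, r in ratings.items()}

# Scale-level index: the average of the item-level values.
s_cvi_ave = sum(i_cvi.values()) / len(i_cvi)

# Assumed cut-off, used here only to show how items could be flagged for review.
CUTOFF = 0.78
flagged = [item for item, value in i_cvi.items() if value < CUTOFF]

print(i_cvi)
print(f"S-CVI/Ave = {s_cvi_ave:.2f}, flagged for review: {flagged}")
```

A flagged item is a candidate for rewording or deletion, not an automatic reject; the judges’ written comments usually matter as much as the number.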
If a measure is meant to distinguish between people or to make predictions about future performance, then we have to make sure our instrument has adequate criterion validity.
It captures the extent to which a measure is associated with the predicted outcome. This helps us understand how well an instrument can predict the score of a criterion variable (e.g., job performance). For example, whenever employers use a test to predict employees’ future performance, the test should have sufficient criterion validity, meaning that it truly predicts future job-related performance and not some other, unrelated performance (e.g., how good this person is at keeping their desk tidy).
It can also be further divided into predictive (whether it predicts what it is supposed to predict), concurrent (whether it differentiates between groups that it should be able to differentiate), incremental (whether it improves/adds to predictions based on already available information) and postdictive validity (whether the score from current tests relates to the scores of tests done in the past).
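Incremental validity in particular is easier to grasp with numbers. The sketch below asks how much a new test adds to the prediction of job performance on top of an interview rating that is already available. The data and variable names are invented, and the two-predictor R² formula is just one simple way to look at it; a real study would use far more people and a proper regression model.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R^2 of a linear regression on two predictors, from pairwise correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Hypothetical data for eight hires (invented for illustration).
interview = [3, 4, 2, 5, 4, 3, 5, 2]                     # already available
new_test = [55, 60, 58, 72, 50, 65, 70, 45]              # instrument under study
performance = [6.0, 7.0, 5.5, 9.0, 6.0, 7.0, 8.5, 4.5]   # criterion, measured later

r_perf_interview = pearson(performance, interview)
r_perf_test = pearson(performance, new_test)
r_interview_test = pearson(interview, new_test)

# Variance in performance explained by the interview alone.
r2_interview_only = r_perf_interview ** 2
# Variance explained by the interview and the new test together.
r2_both = r_squared_two_predictors(r_perf_interview, r_perf_test, r_interview_test)
# Incremental validity: what the new test adds on top of the interview.
incremental_validity = r2_both - r2_interview_only

print(f"R^2 interview only: {r2_interview_only:.2f}")
print(f"R^2 interview + new test: {r2_both:.2f}")
print(f"Incremental validity (delta R^2): {incremental_validity:.2f}")
```

The correlation between the new test and later performance (`r_perf_test`) is the predictive part; the difference in R² is the incremental part.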
Face validity refers to the personal judgment on whether the tests and assessments appear to be associated with what they are supposed to measure. It is usually based on the beliefs of the test-takers.
In other words, it is made up of stakeholder perceptions regarding the assessment’s feasibility, relevance, and appropriateness. These are judged from the appearance and the items within the assessments.
It is possibly the weakest form of validity, but it could ultimately influence the impression of fairness and favourability towards the tests.
Next to all those types of validity mentioned before, we also need to consider the reliability of our instruments. Reliability refers to the degree to which a measure can provide stable and consistent results.
If the test is taken multiple times, will the results be similar? Think of a scale, for example. If you stood on a scale twice within a short time frame and the displayed weight was very different the second time, the scale would not be a very reliable measure of weight.
In practice, we cannot claim that our instrument is valid based solely on any one of those types of validity in isolation from the others. Instead, we need to consider them as a whole to ensure the overall credibility of our instrument. For instance, if a test has high reliability but lacks satisfactory construct validity, the test scores will show a consistent pattern, but we cannot confirm the causal relationship between the attribute and the test result. The general rule is that the more types of validity we can demonstrate, the stronger our evidence that the assessment is credible.
Additionally, there are also some validities that are more crucial than others. For example, reliability is a prerequisite to establishing validity and almost every test needs a sufficient level of construct validity. Without construct validity, assessment results will be misleading since we cannot justify that the result is based on a causal relationship rather than some random measurement errors (Brahma, 2009).
What exactly is validation?
Broadly speaking, validation is the process of providing evidence and a convincing argument to back up our inferences from the assessment, as well as showing that there are no other applicable inferences (Hubley & Zumbo, 2011). Several statistical methods can be used to assess the various forms of validity, such as item response theory (IRT) modeling, exploratory and confirmatory factor analysis, and network analysis. The details of these methods are beyond the scope of this article.
The most straightforward way to validate a new tool is to compare its outcomes to those of a pre-existing tool. The pre-existing tool should measure the same or a similar construct and, of course, already be validated. If the new assessment measures the same construct, it should be strongly associated with the pre-existing one, and we should see a similar pattern of responses from people using both.
In a review of 33 studies on gamification, Lumsden and colleagues (2016) concluded that, although the reported correlations differed in complexity across study designs, most of the studies successfully validated gamified assessments by comparing the results to non-game assessments of the same construct.
Other things we need to consider
Within Messick’s (1989) validation framework, validity can be viewed from two perspectives: the evidential basis and the consequential basis.
The evidential basis is about achieving all the types of validity we mentioned previously (except for face validity) and also providing evidence of relevance and utility (whether the tests are useful and relevant to the target group).
The consequential basis, on the other hand, is about unforeseen or unintended consequences of legitimate test interpretation and use.
The consequential basis of test interpretation reminds us to handle our assessment with caution and care. While designing our instrument, we need to consider two aspects of the consequential basis of testing:
- The personal or social values in the construct of interest and in the name/label chosen to represent the construct. E.g., consider the difference between “Early Development Instrument” and “Developmental Immaturity Instrument”.
- The personal or social values that are reflected by the implicit theory underneath the construct and its measurement. E.g., consider the impact of ageing in cognitive ability tests.
At the time that we develop our instrument, we are often more concerned about how to avoid unanticipated negative or adverse effects from the use of the test.
However, Hubley & Zumbo (2011) suggested that positive effects also need to be considered when assessing validity and score interpretation. Two examples of invalidity threats that yield positive effects are construct underrepresentation and construct-irrelevant variance (Downing & Haladyna, 2009).
Construct underrepresentation: This threat refers to undersampling or biased sampling in the first place (be it from the content domain, selection, or criterion of instrument items) causing a mismatch with the initial construct definition, ultimately leading to failure to sample the proper domain. The outcomes of the instrument will appear to be promising, but in fact, they can be misleading and create an improper interpretation of the test score.
Construct-irrelevant variance: This threat is a systematic error in the data, created by adding irrelevant variables to the construct. It adds “noise” to the assessment data, which reduces our ability to interpret assessment outcomes in the proposed way.
Interpretation of test scores is determined by various forms of validity evidence, including but not necessarily limited to: criterion-related, convergent/discriminant, sample groups, content, score structure, reliability, and generalizability/invariance evidence, as well as (un)intended social and personal consequences and side effects (Hubley & Zumbo, 2011).
A final remark: Validation is an ongoing process!
That is simply because everything evolves over time. For example, many of the words and phrases we use now are not ones anyone would have used 10 years ago.
Cultural and social values also change over time, ultimately affecting how we recognize or look at the same operationalized constructs. Therefore, we need to continuously validate the inferences we make from test results.
Happy validating! 😉
Almanasreh, E., Moles, R., & Chen, T. F. (2019). Evaluation of methods used for estimating content validity. Research in social and administrative pharmacy, 15, 214-221. https://doi.org/10.1016/j.sapharm.2018.03.066
Borsboom, D., Mellenbergh, G. J., & Van Heerden, J. (2004). The concept of validity. Psychological review, 111, 1061. https://doi.org/10.1037/0033-295X.111.4.1061
Brahma, S. S. (2009). Assessment of Construct Validity in Management Research. Journal of Management Research (09725814), 9(2).
Downing, S. M., & Haladyna, T. M. (2009). Validity and its threats. Assessment in health professions education, 1, 21-56.
Haynes, S. N., Richard, D., & Kubany, E. S. (1995). Content validity in psychological assessment: A functional approach to concepts and methods. Psychological assessment, 7, 238. https://doi.org/10.1037/1040-3590.7.3.238
Hubley, A. M., & Zumbo, B. D. (2011). Validity and the consequences of test interpretation and use. Social Indicators Research, 103(2), 219-230. https://doi.org/10.1007/s11205-011-9843-4
Hubley, A. M., & Zumbo, B. D. (1996). A dialectic on validity: Where we have been and where we are going. The Journal of General Psychology, 123, 207-215. https://doi.org/10.1080/00221309.1996.9921273
Lumsden, J., Edwards, E. A., Lawrence, N. S., Coyle, D., & Munafò, M. R. (2016). Gamification of cognitive assessment and cognitive training: a systematic review of applications and efficacy. JMIR serious games, 4(2), e5888. https://doi.org/10.2196/games.5888
Messick, S. (1995). Standards of validity and the validity of standards in performance assessment. Educational measurement: Issues and practice, 14, 5-8. https://doi.org/10.1111/j.1745-3992.1995.tb00881.x
Messick, S. (1987). Validity. ETS Research Report Series, 1987, i-208. https://doi.org/10.1002/j.2330-8516.1987.tb00244.x
Messick, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational researcher, 18, 5-11. https://doi.org/10.3102/0013189X018002005
Peter, J. P. (1981). Construct validity: A review of basic issues and marketing practices. Journal of marketing research, 18(2), 133-145. https://doi.org/10.1177/002224378101800201
Taherdoost, H. (2016). Validity and reliability of the research instrument; how to test the validation of a questionnaire/survey in a research. How to test the validation of a questionnaire/survey in a research (August 10, 2016). http://dx.doi.org/10.2139/ssrn.3205040