Test Validity – Content and Construct Validity

Why Content and Construct Validity Matter More Than You Think

Let's say you've been dating someone for a few months, and they consistently show up on time for every date. That's reliability – they're consistent and predictable. But does showing up on time mean they're actually in love with you? Not necessarily. Consistency doesn't tell you what's really going on beneath the surface.

This is exactly the challenge we face with psychological tests. A test can give us consistent scores every single time (that's reliability), but if it's not actually measuring what we think it's measuring, those consistent scores are useless. This is where validity comes in – the concept that tells us whether our test is actually doing its job.

For the EPPP, you need to understand that modern test theory views validity as one unified concept rather than separate types. According to the Standards for Educational and Psychological Testing, validity is "the degree to which evidence and theory support the interpretations of test scores for proposed uses of tests." But you'll still need to know the traditional three types: content validity, construct validity, and criterion-related validity. Today, we're focusing on the first two.

The Modern View of Validity: One Concept, Multiple Evidence Sources

Think of validity like evaluating whether someone is qualified for a job. You wouldn't just look at their resume (one piece of evidence). You'd check their work samples, interview them, call references, maybe give them a test assignment, and see if they fit the company culture. Each piece of evidence tells you something different, but together they give you confidence in your hiring decision.

Modern validity works the same way. Instead of three separate "types" of validity, we now talk about five sources of evidence that all contribute to the overall validity of a test:

  • Evidence based on test content
  • Evidence based on the response process
  • Evidence based on internal structure
  • Evidence based on relationships with other variables
  • Evidence based on consequences of testing

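The traditional three types of validity fit within these sources: content validity lines up with evidence based on test content, criterion-related validity with evidence based on relationships with other variables, and construct validity draws mainly on internal structure and relationships with other variables. For now, let's dive deep into content and construct validity.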

Content Validity: Does Your Test Cover the Territory?

Content validity answers this question: Does your test include a good representative sample of the entire domain you're trying to measure?

Imagine you're a chef applying for a job at an Italian restaurant, but the hiring test only asks you to make desserts. You might ace the tiramisu and panna cotta, but does that test really capture your full ability as an Italian chef? What about pasta, risotto, sauces, and seafood dishes? The test has poor content validity because it doesn't represent the full domain of Italian cooking.

Content validity is especially important for achievement tests and work samples. If you're creating a test to see if psychology students understand research methods, your test had better include questions about experimental design, statistics, ethics, and measurement, not just statistics.

How to Establish Content Validity

Building content validity happens during test development, not after. Here's the process:

  1. Define the domain clearly: Map out exactly what you're trying to measure. If it's a test of depression symptoms, you need to define what depression includes based on current research and diagnostic criteria.

  2. Create a representative sample of items: Your test items should cover all important aspects of the domain, in proportion to their importance. If cognitive symptoms make up about 40% of the depression domain as you've defined it, roughly 40% of your items should address cognition.

  3. Get expert review: Subject matter experts systematically review every item to ensure comprehensive coverage. They're checking: "Does this test ask about everything important? Are there gaps? Are there items that don't belong?"
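
One way to picture step 2 is as a test blueprint that turns domain weights into item counts. Here's a minimal sketch. Only the 40% cognitive figure comes from the example above; the other content areas, their weights, the 20-item test length, and the function name `allocate_items` are all made up for illustration.

```python
def allocate_items(domain_weights, total_items):
    """Split total_items across content areas in proportion to their weights.

    domain_weights maps each area to an integer percentage (summing to 100).
    Leftover items from rounding down go to the areas with the largest
    remainders, so the counts always add up to total_items.
    """
    quotas = {}
    remainders = {}
    for area, percent in domain_weights.items():
        quotas[area], remainders[area] = divmod(percent * total_items, 100)
    leftover = total_items - sum(quotas.values())
    for area in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        quotas[area] += 1
    return quotas


# Hypothetical blueprint for a depression measure (weights are assumptions).
blueprint = {"cognitive": 40, "affective": 30, "somatic": 20, "behavioral": 10}
item_counts = allocate_items(blueprint, total_items=20)

for area, count in item_counts.items():
    print(f"{area}: {count} items")
```

The expert reviewers in step 3 would then check the actual items against a blueprint like this, looking for gaps and for items that don't belong.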

Face Validity: The Validity That Isn't Really Validity

Here's something that trips people up: face validity sounds like it should be important, but it's not actually a type of validity. Face validity just means the test looks valid to the person taking it.

Think about taking a personality test where all the questions are about your favorite colors and foods. Even if that test somehow predicted job performance, you'd probably think "this is ridiculous" and not take it seriously. That's low face validity, and it matters because it affects test-taker motivation.

But sometimes you don't want face validity. If you're creating a test to catch people lying about criminal behavior, you don't want test-takers to easily figure out what you're measuring. You want them to answer honestly, not game the system.

Remember: Face validity is about appearance and test-taker buy-in, not about whether the test actually works.

Construct Validity: Measuring the Invisible

Now we get to the trickier concept. Construct validity is essential when you're trying to measure something you can't directly observe – a hypothetical trait that exists in theory but can't be seen or touched.

Depression, intelligence, motivation, extroversion, anxiety – these are all constructs. You can't hold intelligence in your hand or see motivation under a microscope. You can only infer these traits exist by observing behavior and measuring their effects.

Here's a modern analogy: Your smartphone tells you the battery is at 47%. You can't actually see the electrical charge, but the phone measures various indicators and infers the battery level. When that number is accurate (when your phone actually dies around 0% and not suddenly at 20%), the measurement has good validity. When your phone says 30% but dies immediately, the measurement lacks validity.

Establishing construct validity is more complex than content validity. You need to gather evidence from multiple angles that your test truly measures the construct you claim it measures.

The Two Key Components: Convergent and Divergent Validity

Think about how you'd prove to a skeptical friend that someone is really your romantic partner, not just a roommate. You'd show evidence that you act like partners (you hold hands, you go on dates, you know each other's families) – that's convergent validity: your relationship shows the characteristics it should show. You'd also show that you don't act like just roommates or business partners (you're not splitting utilities 50/50 or having formal meetings about household tasks) – that's divergent validity (also called discriminant validity): your relationship doesn't show characteristics it shouldn't show.

For tests:

  • Convergent validity: Scores on your test should correlate highly with other measures of the same or similar constructs
  • Divergent validity: Scores on your test should correlate weakly with measures of unrelated constructs

If you've developed a new anxiety test, it should correlate strongly with existing anxiety measures (convergent validity) but weakly with measures of completely different things like creativity or shoe size (divergent validity).
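
To see what those two patterns look like in numbers, here's a small simulation sketch. Everything in it is invented for illustration: the sample size, the noise levels, and the idea that each test score is a latent trait plus random error.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 1000

# Hypothetical latent traits, unrelated to each other by construction.
anxiety = rng.normal(size=n_people)
creativity = rng.normal(size=n_people)

# Each observed test score = latent trait + measurement error.
new_anxiety_test = anxiety + 0.5 * rng.normal(size=n_people)
established_anxiety_test = anxiety + 0.5 * rng.normal(size=n_people)
creativity_test = creativity + 0.5 * rng.normal(size=n_people)

# Convergent evidence: correlation with another measure of the same construct.
r_convergent = np.corrcoef(new_anxiety_test, established_anxiety_test)[0, 1]

# Divergent evidence: correlation with a measure of an unrelated construct.
r_divergent = np.corrcoef(new_anxiety_test, creativity_test)[0, 1]

print(f"convergent r = {r_convergent:.2f}")
print(f"divergent r  = {r_divergent:.2f}")
```

With real data you wouldn't know the latent traits, of course. You'd only have the observed scores and the two correlations.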

The Multitrait-Multimethod Matrix: A Comprehensive Validity Check

This technique has a mouthful of a name, but the concept is straightforward. It's a systematic way to check both convergent and divergent validity simultaneously by looking at the same traits measured in different ways.

The logic: If your test is valid, it should correlate with other measures of the same trait even when those measures use completely different methods. And it shouldn't correlate with measures of different traits even when they use the same method.

Breaking Down the Matrix

Let's use a concrete example. You've created a self-report test to measure how sociable middle-school students are. To validate it, you gather four correlation coefficients: one from giving your own test twice, and three from comparing it with other measures:

Your Test               | Compared With               | Type of Coefficient      | What You Want
------------------------|-----------------------------|--------------------------|---------------------------------------
Self-report sociability | Same test, taken twice      | Monotrait-Monomethod     | High correlation (reliability)
Self-report sociability | Teacher-rated sociability   | Monotrait-Heteromethod   | High correlation (convergent validity)
Self-report sociability | Self-report impulsivity     | Heterotrait-Monomethod   | Low correlation (divergent validity)
Self-report sociability | Teacher-rated impulsivity   | Heterotrait-Heteromethod | Low correlation (divergent validity)

Let's decode those intimidating terms:

  • Mono = same
  • Hetero = different
  • Trait = the characteristic you're measuring
  • Method = how you're measuring it

Monotrait-Monomethod (same trait, same method): This is actually just reliability. You're correlating your test with itself, perhaps through internal consistency or test-retest correlation.

Monotrait-Heteromethod (same trait, different method): This is your strongest evidence of convergent validity. When your self-report sociability test correlates highly with a completely different measure of sociability (like teacher ratings), that's powerful evidence you're measuring what you think you're measuring.

Heterotrait-Monomethod (different trait, same method): This tells you about divergent validity, but it's tricky. If you find a correlation here, it might be because of the trait similarity, or it might be because of the shared method (method variance). For example, two self-report tests might correlate just because some people tend to answer "strongly agree" to everything regardless of content.

Heterotrait-Heteromethod (different trait, different method): When this correlation is low, it's the cleanest evidence of divergent validity. Your test of sociability doesn't relate to teacher ratings of impulsivity – exactly what you'd hope for unrelated constructs.
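
If the four labels still blur together, it can help to see that each one is just two yes/no comparisons. The sketch below labels any pair of measures. The `Measure` type and `classify` function are hypothetical helpers, not part of any standard library.

```python
from typing import NamedTuple


class Measure(NamedTuple):
    trait: str   # what is being measured
    method: str  # how it is being measured


def classify(a: Measure, b: Measure) -> str:
    """Name the multitrait-multimethod coefficient for a pair of measures."""
    trait_part = "monotrait" if a.trait == b.trait else "heterotrait"
    method_part = "monomethod" if a.method == b.method else "heteromethod"
    return f"{trait_part}-{method_part}"


# What you hope to see for each coefficient, following the table above.
WHAT_YOU_WANT = {
    "monotrait-monomethod": "high (reliability)",
    "monotrait-heteromethod": "high (convergent validity)",
    "heterotrait-monomethod": "low (divergent validity)",
    "heterotrait-heteromethod": "low (divergent validity)",
}

your_test = Measure("sociability", "self-report")
comparisons = [
    Measure("sociability", "self-report"),
    Measure("sociability", "teacher rating"),
    Measure("impulsivity", "self-report"),
    Measure("impulsivity", "teacher rating"),
]

for other in comparisons:
    label = classify(your_test, other)
    print(f"{other.method} {other.trait}: {label}, want {WHAT_YOU_WANT[label]}")
```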

Factor Analysis: Finding Hidden Patterns in Test Data

If the multitrait-multimethod matrix is like getting references for a job candidate, factor analysis is like using AI to analyze thousands of job applications and discover which skills actually cluster together in successful employees.

Factor analysis is a statistical technique that looks at correlations among multiple tests and identifies underlying patterns (factors). It helps answer: "When people score high on Test A, what other tests do they tend to score high on? Do certain tests naturally group together?"

The Four Basic Steps

  1. Administer multiple tests: Give your test plus several others measuring related and unrelated constructs to a group of people.

  2. Create a correlation matrix: Calculate how every test correlates with every other test. This creates a big table of correlation coefficients.

  3. Derive the initial factor matrix: Use statistical software to identify underlying factors that explain the patterns of correlations. (You don't need to know the math for the EPPP.)

  4. Rotate and interpret the factors: Rotate the factors to make the results clearer, then figure out what each factor represents by looking at which tests load highly on it.
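
You won't have to run a factor analysis on the EPPP, but seeing the four steps as code can make them less abstract. The sketch below rests on several assumptions: the data are simulated (two unrelated latent traits, three tests each), principal-component extraction stands in for whatever extraction method your software uses, and the rotation is a compact varimax routine written for this example. Real analyses would normally use a dedicated statistics package.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonal varimax rotation of a (tests x factors) loading matrix."""
    n_tests, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        col_ss = np.sum(rotated**2, axis=0)
        gradient = loadings.T @ (rotated**3 - rotated * col_ss / n_tests)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt
        new_criterion = s.sum()
        if criterion != 0.0 and new_criterion < criterion * (1 + tol):
            break
        criterion = new_criterion
    return loadings @ rotation


# Step 1: "administer" six tests (simulated scores, purely illustrative).
rng = np.random.default_rng(42)
n_people = 500
trait_1 = rng.normal(size=n_people)
trait_2 = rng.normal(size=n_people)


def noisy_test(trait, weight):
    """Standardized test score that reflects the trait with the given weight."""
    error = rng.normal(size=n_people)
    return weight * trait + np.sqrt(1 - weight**2) * error


scores = np.column_stack(
    [noisy_test(trait_1, 0.9) for _ in range(3)]
    + [noisy_test(trait_2, 0.7) for _ in range(3)]
)

# Step 2: correlation matrix (every test with every other test).
corr = np.corrcoef(scores, rowvar=False)

# Step 3: initial factor matrix, keeping the two largest components.
eigvals, eigvecs = np.linalg.eigh(corr)
top = np.argsort(eigvals)[::-1][:2]
initial = eigvecs[:, top] * np.sqrt(eigvals[top])

# Step 4: rotate, then see which factor each test loads on most strongly.
rotated = varimax(initial)
dominant = np.argmax(np.abs(rotated), axis=1)

print(np.round(rotated, 2))
print("dominant factor per test:", dominant)
```

The sign of a whole column of loadings is arbitrary, so the sketch compares absolute values when picking each test's dominant factor.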

Reading a Rotated Factor Matrix: The Part You Need to Know

Here's where factor analysis gets practical for test validation. Let's say you've developed a new test of locus of control (Test A). You administer it along with two existing locus of control tests (Tests B and C) and three tests of self-esteem (Tests D, E, and F), which research shows is unrelated to locus of control.

The factor analysis produces this rotated factor matrix:

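Test                          | Factor I (Locus of Control) | Factor II (Self-Esteem) | Communality
------------------------------|-----------------------------|-------------------------|------------
Test A (new locus of control) | .80                         | .10                     | .65
Test B (locus of control)     | .85                         | .12                     | .73
Test C (locus of control)     | .76                         | .15                     | .60
Test D (self-esteem)          | .13                         | .85                     | .74
Test E (self-esteem)          | .14                         | .76                     | .60
Test F (self-esteem)          | .25                         | .70                     | .55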

Understanding Factor Loadings

The numbers under Factor I and Factor II are factor loadings – they're correlation coefficients showing how strongly each test relates to each factor. Think of them like compatibility percentages on a dating app: a high loading means the test and factor are a strong match.

To interpret a factor loading, square it. The result is the proportion of variance in the test that is explained by that factor (this interpretation holds when the factors are orthogonal, as they are here).

For Test A:

  • Loading on Factor I = .80, squared = .64 (64% of the variance in Test A is explained by Factor I)
  • Loading on Factor II = .10, squared = .01 (1% of the variance in Test A is explained by Factor II)

Test A loads heavily on Factor I and barely on Factor II. Looking at the pattern, all the locus of control tests load on Factor I, and all the self-esteem tests load on Factor II. This gives us clear evidence that:

  • Convergent validity: Test A correlates strongly with the factor it should correlate with (locus of control)
  • Divergent validity: Test A correlates weakly with the factor it shouldn't correlate with (self-esteem)
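
The same squaring works for every row of the matrix. As a quick sketch, with the loadings copied from the table above (the variable names are just for illustration):

```python
# Rotated loadings from the table: (Factor I, Factor II).
loadings = {
    "Test A": (0.80, 0.10),
    "Test B": (0.85, 0.12),
    "Test C": (0.76, 0.15),
    "Test D": (0.13, 0.85),
    "Test E": (0.14, 0.76),
    "Test F": (0.25, 0.70),
}

# Squared loading = proportion of a test's variance explained by that factor.
variance_explained = {
    test: (f1**2, f2**2) for test, (f1, f2) in loadings.items()
}

for test, (v1, v2) in variance_explained.items():
    print(f"{test}: Factor I {v1:.0%}, Factor II {v2:.0%}")
```

Reading down the output, the locus of control tests get most of their explained variance from Factor I and the self-esteem tests from Factor II, which is the convergent/divergent pattern described above.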

What's That Communality Column?

Communality tells you how much of each test's variance is explained by all the factors combined. For Test A, the communality is .65, meaning 65% of the variance in Test A scores is explained by the factor analysis.

You can calculate communality yourself when factors are orthogonal (uncorrelated with each other) by squaring all the factor loadings for a test and adding them up:

Test A communality = (.80)² + (.10)² = .64 + .01 = .65

But when factors are oblique (correlated with each other), this calculation doesn't work – the math gets more complicated. For the EPPP, just know that communality represents total explained variance.

The remaining variance (100% - 65% = 35% for Test A) is called uniqueness – it's the part of the test that's either measuring something unique to that test or just random error.
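
Here's the orthogonal-factor calculation applied to the whole table, as a minimal sketch. The loadings and the reported communalities are copied from the matrix above; small differences between the computed and reported values are presumably due to the loadings being rounded to two decimals.

```python
import numpy as np

tests = ["A", "B", "C", "D", "E", "F"]

# Columns: Factor I, Factor II (from the rotated factor matrix).
loadings = np.array([
    [0.80, 0.10],
    [0.85, 0.12],
    [0.76, 0.15],
    [0.13, 0.85],
    [0.14, 0.76],
    [0.25, 0.70],
])
reported_communality = np.array([0.65, 0.73, 0.60, 0.74, 0.60, 0.55])

# Orthogonal factors only: communality = sum of squared loadings per test.
communality = (loadings**2).sum(axis=1)

# Uniqueness is whatever variance the factors leave unexplained.
uniqueness = 1 - communality

for name, h2, u2, rep in zip(tests, communality, uniqueness, reported_communality):
    print(f"Test {name}: communality {h2:.2f} (table: {rep:.2f}), uniqueness {u2:.2f}")
```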

Common Misconceptions That Trip People Up

Misconception 1: "If a test is reliable, it must be valid."

Nope. A bathroom scale could consistently tell you that you weigh 50 pounds more than you actually do. It's perfectly reliable (consistent), but completely invalid (inaccurate). Reliability is necessary for validity, but not sufficient. You can have reliability without validity, but you can't have validity without reliability.
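
The bathroom scale can be put into numbers with a tiny simulation. This is only a sketch: the true weight, the 50-pound bias, and the amount of random wobble are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)

true_weight = 150.0  # assumed true weight in pounds
scale_bias = 50.0    # the scale always reads about 50 pounds heavy

# Ten weigh-ins: nearly identical readings, all of them wrong.
readings = true_weight + scale_bias + rng.normal(0, 0.2, size=10)

spread = readings.std(ddof=1)               # small spread = consistent (reliable)
average_error = readings.mean() - true_weight  # large error = inaccurate (invalid)

print(f"spread across weigh-ins: {spread:.2f} lb")
print(f"average error: {average_error:.1f} lb")
```

A small spread with a large average error is exactly the "reliable but not valid" pattern.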

Misconception 2: "Face validity is a real type of validity that matters."

Face validity is about appearances and user acceptance, not actual test quality. A test can look valid but measure nothing useful, or it can look silly but actually work well. Don't elevate face validity to the same status as real validity evidence.

Misconception 3: "Content validity requires statistical proof."

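Unlike construct validity, content validity doesn't rely on correlating your test with other measures. It's established primarily through careful test construction and expert judgment. Indices that summarize expert agreement do exist (Lawshe's content validity ratio is one), but there's no correlation-based "content validity coefficient" to compute.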

Misconception 4: "High factor loadings on one factor mean the test is valid."

Not quite. For good construct validity, you want high loadings on the correct factor (convergent validity) and low loadings on irrelevant factors (divergent validity). It's the pattern across multiple factors that matters.

Misconception 5: "Convergent and divergent validity are separate from construct validity."

They're not separate – they're the two main ways we establish construct validity. Think of construct validity as the umbrella, with convergent and divergent validity as the supporting poles holding it up.

Practice Tips for Remembering

For the Multitrait-Multimethod Matrix:

Create a simple memory anchor using the phrase "Same Same is Boring" to remember that monotrait-monomethod is just reliability (not the interesting validity evidence).

Then remember: "Same trait, different method = GOOD convergent validity" (you want a high correlation). "Different trait = GOOD divergent validity" (you want a low correlation).

For Factor Analysis:

Think "Loading = Correlation" every time you see factor loadings. Then remember to square them to get the percentage of explained variance.

Make a mental note: Communality = TOTAL explained, Uniqueness = LEFTOVER

For Content vs. Construct Validity:

Content = Coverage (does the test cover the domain?) Construct = Converge and Diverge (does the test relate appropriately to other measures?)

Visual Memory Trick:

Picture content validity as a pizza. If you're claiming to sell a complete pizza, you'd better include all the slices. Missing toppings or sections mean poor content validity.

Picture construct validity as your phone's GPS. It should cluster you with people going to the same destination (convergent) and separate you from people going elsewhere (divergent).

Real-World Applications in Clinical Practice

Understanding these validity concepts isn't just about passing the EPPP – it affects how you practice psychology.

Scenario 1: You're considering using a new brief depression screening tool in your practice. Before adopting it, check: Does it have content validity (does it measure all aspects of depression, not just mood)? Does it have construct validity (does it correlate with established measures like the Beck Depression Inventory but not with unrelated constructs)?

Scenario 2: A school district wants you to develop a test to identify students who need social skills intervention. You'll need strong content validity – expert teachers and child psychologists should review your items to ensure you're covering all relevant social skills. Face validity matters here too; if kids think the test is stupid, they won't try their best.

Scenario 3: You're reading research about a new personality measure. The authors report factor analysis results showing their "neuroticism" scale loads on the same factor as established anxiety and depression measures, but not on factors related to extroversion. That's good evidence of construct validity, which should increase your confidence in the measure.

Key Takeaways

  • Validity is now viewed as a unitary concept with five sources of evidence, but you need to know the traditional three types: content, construct, and criterion-related validity.

  • Content validity establishes whether a test adequately samples from the domain it claims to measure. It's built during test development through clear domain definition and expert review.

  • Face validity isn't real validity – it's just whether the test looks valid to test-takers. It affects motivation but not actual measurement quality.

  • Construct validity is essential for tests measuring hypothetical traits. It's established through convergent validity (high correlations with related measures) and divergent validity (low correlations with unrelated measures).

  • The multitrait-multimethod matrix systematically examines convergent and divergent validity by comparing tests measuring the same or different traits using the same or different methods.

  • Monotrait-heteromethod coefficients (same trait, different method) provide the strongest evidence of convergent validity.

  • Heterotrait coefficients (different traits) should be low to demonstrate divergent validity.

  • Factor analysis identifies underlying patterns in test data. High factor loadings on the correct factor show convergent validity; low loadings on unrelated factors show divergent validity.

  • Factor loadings are correlations between tests and factors. Square them to find the percentage of variance explained.

  • Communality is the total variance in a test explained by all factors combined. Calculate it by squaring and adding factor loadings when factors are orthogonal.

  • Reliability is necessary but not sufficient for validity. A test must be consistent before it can be accurate, but consistency doesn't guarantee accuracy.

Remember: validity isn't about whether a test is "good" or "bad" in some absolute sense. It's about whether the test scores can be meaningfully interpreted for their intended purpose. A test might have excellent validity for one use but poor validity for another. Always consider the specific claims being made about test scores and whether the evidence supports those particular interpretations.