
Research – Internal/External Validity

7: Research Methods & Statistics

Why These Concepts Will Save You From Making Wrong Conclusions

Picture this: You've just spent six months running a therapy group, carefully tracking everyone's progress. Your results look amazing – participants show real improvement! But when you try to replicate this success with new clients, nothing happens. Or worse, you write up your findings, and reviewers tear them apart because you can't prove your therapy actually caused the improvements.

This frustrating scenario happens because of validity issues. Understanding internal and external validity is like having a quality control system for research. Internal validity asks: "Did my treatment actually cause these results, or was something else going on?" External validity asks: "Will this work for other people, in other places, at other times?" Without both, research findings become expensive guesses rather than reliable knowledge.

For the EPPP, you need to recognize when validity is threatened and know how to protect it. More importantly, as a future psychologist, you'll need these concepts to evaluate research critically and design interventions that actually work beyond your office.

The Core Distinction: Cause vs. Application

Think of research validity like cooking a new recipe. Internal validity is about knowing whether your specific ingredients and techniques actually created that delicious meal – or whether you just happened to be really hungry that day. External validity is about knowing whether this recipe will work in someone else's kitchen, with their equipment, at different altitudes, for different numbers of people.

Internal validity is the degree to which you can confidently say "X caused Y" in your specific study. It's about the integrity of your cause-and-effect conclusion. Did the meditation app actually reduce anxiety, or did participants just get less anxious because summer started and their workload naturally decreased?

External validity is the degree to which your findings travel beyond your original study. If meditation worked for college students in your lab, will it work for middle-aged professionals in their homes? Will it work next year? Will a different meditation app also work?

The Five Flavors of External Validity

External validity breaks down into five types, each asking a different "Will this work when..." question:

| Type | The Question It Asks | Example |
| --- | --- | --- |
| Population Validity | Will this work for other people beyond my participants? | You studied therapy with college students – will it work for retirees? |
| Ecological Validity | Will this work in real-world settings? | It worked in a controlled lab – will it work in a busy clinic? |
| Temporal Validity | Will this work at different times? | It worked in 2024 – will it work in 2030? |
| Treatment Variation Validity | Will variations of this treatment also work? | This specific CBT protocol worked – will other CBT approaches work too? |
| Outcome Validity | Will this affect related outcomes too? | It reduced depression scores – will it also improve life satisfaction? |

The Seven Threats to Internal Validity: What Messes Up Your Cause-and-Effect Conclusions

1. History: When the World Interferes

History happens when external events crash your research party. Imagine you're studying whether a stress-management workshop reduces employee anxiety. You measure anxiety before the workshop, run the program, then measure anxiety again. Great results! Except... during your study, the company announced everyone's getting raises and better benefits. Is it your workshop or the good news that reduced anxiety?

Real-world scenario: During the pandemic, countless therapy studies got contaminated by history. A researcher studying loneliness interventions in early 2020 would have had worldwide lockdowns affecting their results – a massive historical event unrelated to their actual intervention.

The fix: Include a control group and randomly assign participants. If both your treatment and control groups experience the same historical events, you can still detect whether your treatment had an additional effect. However, if something happens to just one group (like a fire alarm during one group's session), that's trickier to control and you need to acknowledge it in your interpretation.

2. Maturation: People Change on Their Own

Maturation is about natural change over time. People heal, grow, learn, get tired, or develop regardless of your intervention. You're testing a reading program for kids – of course they get better at reading over six months! They're growing and learning constantly.

This happens with adults too. If you're studying a career satisfaction program over two years, participants naturally gain experience, build relationships with coworkers, and get better at their jobs simply through time and practice. Your program might get credit for changes that would have happened anyway.

The fix: Same as history – use multiple groups with random assignment. If both groups mature similarly but one group shows additional improvement, you know your treatment contributed something beyond natural maturation.
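The logic of that fix can be written out as simple arithmetic. The sketch below is only an illustration: the scores are made-up numbers, and `mean_change` is a hypothetical helper, not part of any standard analysis package. It treats the control group's change as an estimate of what history and maturation did on their own, and subtracts it from the treatment group's change.

```python
from statistics import mean

# Hypothetical reading scores (illustrative numbers, not real data).
treatment_pre = [40, 42, 38, 45, 41]
treatment_post = [52, 55, 49, 58, 53]
control_pre = [41, 39, 43, 44, 40]
control_post = [47, 45, 48, 50, 46]


def mean_change(pre, post):
    """Average post-minus-pre change for one group."""
    return mean(after - before for before, after in zip(pre, post))


treatment_change = mean_change(treatment_pre, treatment_post)
control_change = mean_change(control_pre, control_post)

# The control group's change estimates history + maturation;
# whatever is left over is the change beyond those shared influences.
change_beyond_control = treatment_change - control_change

print(f"Treatment group change: {treatment_change:.1f}")
print(f"Control group change:   {control_change:.1f}")
print(f"Change beyond control:  {change_beyond_control:.1f}")
```

This subtraction is only meaningful if the two groups were comparable to begin with, which is why the control group is paired with random assignment.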

3. Differential Selection: Starting from Different Places

Despite its name, differential selection is really about how you assign people to groups. Imagine comparing two therapy approaches, but you let people choose which one they want. The self-confident, motivated people all pick Therapy A, while uncertain, skeptical people pick Therapy B. Even if both therapies are equally effective, Therapy A will probably show better results because those participants started with advantages.

Think about fitness apps: If one app attracts already-active people while another attracts complete beginners, comparing their outcomes is unfair. The groups started from different places.

The fix: Random assignment. This is the gold standard for a reason – it distributes all known and unknown differences across groups randomly, making them similar at the start.
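A small simulation can make the contrast concrete. Everything in this sketch is assumed for illustration: the participants are simulated, the "motivation" score is invented, and self-selection is modeled crudely as "everyone above 50 picks Therapy A".

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical participants with a baseline motivation score (0-100).
participants = [{"id": i, "motivation": rng.randint(0, 100)} for i in range(200)]

# Self-selection (assumed rule): motivated people choose Therapy A.
self_a = [p for p in participants if p["motivation"] >= 50]
self_b = [p for p in participants if p["motivation"] < 50]

# Random assignment: shuffle a copy, then split it in half.
shuffled = participants[:]
rng.shuffle(shuffled)
random_a, random_b = shuffled[:100], shuffled[100:]


def avg_motivation(group):
    """Mean baseline motivation for a list of participants."""
    return mean(p["motivation"] for p in group)


self_gap = avg_motivation(self_a) - avg_motivation(self_b)
random_gap = avg_motivation(random_a) - avg_motivation(random_b)

print(f"Baseline gap with self-selection:   {self_gap:.1f}")
print(f"Baseline gap with random assignment: {random_gap:.1f}")
```

With self-selection the groups differ sharply before any therapy starts; with random assignment the baseline gap shrinks to chance-level noise.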

4. Statistical Regression: Extreme Scores Move Toward the Middle

Statistical regression (regression to the mean) is beautifully sneaky. When you select people specifically because they scored extremely high or low on something, their scores will likely move toward average on the next measurement – even without any intervention.

Here's why: Extreme scores partly reflect true extremeness and partly reflect random variation (measurement error, bad day, lucky day). On remeasurement, the random part averages out, pulling scores toward the middle.

Career example: You identify the company's worst-performing salespeople for a special training program. Next quarter, they improve! Amazing training, right? Maybe... but some of them probably had unusually bad quarters due to temporary factors (sick kid, car troubles, tough client load). They would have bounced back somewhat without any intervention.

The fix: Don't study only extreme scorers, or ensure your control group also has extreme scorers who experience similar regression effects.
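Regression to the mean is easy to see in a simulation where no intervention happens at all. The sketch below assumes each salesperson has a stable "true" skill and that each quarter adds independent random noise; the means, spreads, and sample size are arbitrary choices for illustration.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Stable true skill plus independent quarter-to-quarter noise
# (sick kid, car troubles, tough client load). No training is applied.
true_skill = [rng.gauss(100, 10) for _ in range(1000)]
quarter1 = [skill + rng.gauss(0, 10) for skill in true_skill]
quarter2 = [skill + rng.gauss(0, 10) for skill in true_skill]

# Select the 100 worst performers based on quarter 1 only.
cutoff = sorted(quarter1)[100]
worst = [i for i, score in enumerate(quarter1) if score < cutoff]

q1_mean = mean(quarter1[i] for i in worst)
q2_mean = mean(quarter2[i] for i in worst)

print(f"Worst performers, quarter 1: {q1_mean:.1f}")
print(f"Same people, quarter 2:      {q2_mean:.1f}")
print(f"Everyone, quarter 2:         {mean(quarter2):.1f}")
```

The selected group "improves" in quarter 2 purely because the bad luck that helped put them in the bottom group does not repeat, yet they remain below the overall average because part of their low score reflected true skill.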

5. Testing: The Pretest Changes How People Respond

Testing effects occur when taking a test the first time changes how someone responds the second time. This happens through practice effects (getting familiar with the test), sensitization (the pretest makes you think about issues differently), or memory (remembering your previous answers).

Modern example: You're testing whether social media breaks improve mental health. You give a pretest about social media use that asks questions like "Do you scroll mindlessly?" and "Does Instagram make you feel inadequate?" Just answering these questions might make participants more aware of their habits and naturally change their behavior, independent of your actual intervention.

The fix: Either skip the pretest entirely or use the Solomon four-group design (more on this later) to measure the pretest's effects.

6. Instrumentation: When Your Measuring Tool Changes

Instrumentation problems happen when your measurement instrument changes during the study. Human raters often drift with practice, becoming more accurate, stricter, or more lenient. Equipment can degrade or be recalibrated. Survey questions might be revised mid-study.

Psychology scenario: You're rating therapy session quality, watching videos and scoring them on various dimensions. As you watch 100+ sessions, you naturally get better at noticing subtle signs of good technique. Sessions you watch later might get unfairly higher scores than earlier sessions, even if they're actually the same quality.

The fix: Keep your measurement tools consistent. If using human raters, regularly check their reliability against standard examples. If change is unavoidable, acknowledge this limitation when interpreting results.
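One simple way to check a rater against standard examples is to have them re-score the same anchor sessions at intervals and look at the average signed difference from an agreed reference score. The sketch below is a minimal version of that idea; the scores, the 1-10 scale, and the `DRIFT_TOLERANCE` threshold are all assumptions made for illustration, and formal reliability statistics would be used in a real study.

```python
from statistics import mean

# Hypothetical expert-consensus scores for five anchor videos (1-10 scale).
reference = [4, 6, 7, 5, 8]

# The same rater re-scores the anchors at two points in the study.
rater_week1 = [4, 6, 6, 5, 8]
rater_week12 = [5, 7, 8, 6, 9]

DRIFT_TOLERANCE = 0.5  # assumed threshold, chosen only for illustration


def mean_bias(ratings, standard):
    """Average signed difference from the standard (positive = lenient)."""
    return mean(r - s for r, s in zip(ratings, standard))


for label, ratings in [("week 1", rater_week1), ("week 12", rater_week12)]:
    bias = mean_bias(ratings, reference)
    status = "recalibrate" if abs(bias) > DRIFT_TOLERANCE else "ok"
    print(f"{label}: bias {bias:+.1f} -> {status}")
```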

7. Differential Attrition: When Dropouts Change Your Groups

Differential attrition happens when people drop out of different groups for different reasons, changing what those groups represent. This is particularly nasty because you often don't know why people left or how they differed from those who stayed.

Real scenario: You're comparing intensive therapy (three sessions per week) to standard therapy (one session per week). People drop out of both groups, but they drop out for different reasons. Intensive therapy loses people who can't handle the time commitment or emotional intensity – often those who need help most. Standard therapy loses people who aren't seeing fast enough progress – the most impatient or severely affected. By the end, you're not comparing what you think you're comparing.

The fix: This one's tough. Try to minimize attrition through engagement strategies, track who drops out and why, and acknowledge attrition as a limitation. Statistical techniques can sometimes help, but prevention is better.
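Tracking who drops out can start with a very simple tabulation: the dropout rate in each group and the baseline severity of dropouts versus completers. The records and the `attrition_summary` helper below are hypothetical, invented only to show the bookkeeping.

```python
from statistics import mean

# Hypothetical enrollment records: (group, baseline severity, completed?)
records = [
    ("intensive", 28, False), ("intensive", 25, False), ("intensive", 14, True),
    ("intensive", 16, True), ("intensive", 27, False), ("intensive", 15, True),
    ("standard", 26, True), ("standard", 15, True), ("standard", 24, True),
    ("standard", 17, False), ("standard", 27, True), ("standard", 14, True),
]


def attrition_summary(group):
    """Dropout rate and baseline severity of dropouts vs. completers."""
    rows = [r for r in records if r[0] == group]
    dropouts = [severity for _, severity, done in rows if not done]
    completers = [severity for _, severity, done in rows if done]
    return {
        "dropout_rate": len(dropouts) / len(rows),
        "dropout_severity": mean(dropouts) if dropouts else None,
        "completer_severity": mean(completers) if completers else None,
    }


summary = {group: attrition_summary(group) for group in ("intensive", "standard")}
for group, stats in summary.items():
    print(group, stats)
```

In this made-up data the intensive group loses its most severe cases, so the people who finish intensive therapy started out healthier than the people who finish standard therapy. Any difference in final outcomes would be confounded with that baseline difference.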

The Four Threats to External Validity: When Results Don't Travel Well

1. Reactivity: The "Being Studied" Effect

Reactivity happens when people act differently because they know they're in a study. Two specific problems fuel this:

Demand characteristics are cues that tell participants what's expected. If you're testing a "revolutionary new relaxation technique" with obvious pre/post stress measurements, participants figure out they're supposed to feel more relaxed. They might report feeling calmer partly because that's what good participants do.

Dating app analogy: People behave differently when they know someone's watching. Your profile browsing behavior during a study about "finding meaningful connections" will be different from your normal 2 AM swiping habits.

Experimenter expectancy occurs when researchers unconsciously (or consciously) influence results. An experimenter who believes a treatment works might smile more with treatment group participants, spend more time with them, or subtly encourage certain responses. They might even record ambiguous responses differently based on which group they came from.

The fix: Use unobtrusive measures (observe naturally occurring behavior), deception (ethically debatable and requires debriefing), single-blind technique (participants don't know which group they're in), or double-blind technique (neither participants nor researchers know group assignments).

2. Multiple Treatment Interference: When Treatments Get Tangled

Multiple treatment interference happens in within-subjects designs where each person experiences multiple conditions. Earlier conditions affect responses to later ones.

Practical example: You're testing three sleep meditation techniques – Progressive Muscle Relaxation (PMR), Guided Imagery (GI), and Breathing Exercises (BE). Everyone tries all three in order: PMR, then GI, then BE. Breathing exercises work best! But wait – maybe breathing exercises only worked well because people were already relaxed from trying two other techniques first. Or maybe everyone was just exhausted and fell asleep faster by the third night regardless of technique.

This is like tasting wine: your third glass always tastes different than it would if you'd drunk it first, because the previous wines affected your palate.

The fix: Use counterbalancing – have different groups experience the conditions in different orders. If breathing exercises work best regardless of order, you know they're genuinely most effective.
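For three conditions, complete counterbalancing means using all six possible orders equally often. The sketch below generates those orders and cycles hypothetical participant IDs through them; in a real study you would also randomize which participant receives which order.

```python
from itertools import permutations

conditions = ["PMR", "GI", "BE"]
orders = list(permutations(conditions))  # 3! = 6 possible orders

# Hypothetical participant IDs. Cycling through the orders gives every
# order the same number of participants when the count is a multiple of 6.
participants = [f"P{n:02d}" for n in range(1, 13)]
schedule = {p: orders[i % len(orders)] for i, p in enumerate(participants)}

for participant, order in schedule.items():
    print(participant, " -> ".join(order))
```

Because every technique appears first, second, and third equally often, order effects such as accumulated relaxation or fatigue are spread evenly across the three techniques instead of favoring whichever one always came last.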

3. Selection-Treatment Interaction: When Your Participants Are Special

Selection-treatment interaction threatens external validity when your research participants differ from the general population in ways that affect how they respond to treatment. Volunteers for studies tend to be more motivated, more trusting of research, higher in openness, and more desperate for help than non-volunteers.

Career scenario: You test a productivity app using volunteers from a psychology department. They're tech-savvy, motivated to self-improve, used to tracking and analyzing their behavior, and intrinsically interested in psychology. Your app works great! Then you try to market it to construction workers, retail staff, or lawyers. Completely different response – these groups have different tech comfort, work structures, and motivations.

The fix: Randomly select participants from your target population rather than relying on volunteers. In practice, this is often impossible, so you acknowledge this limitation and replicate studies with different samples.

4. Pretest-Treatment Interaction: When Testing Changes Treatment Response

Pretest-treatment interaction (pretest sensitization) occurs when taking a pretest changes how people respond to the treatment itself. The pretest essentially becomes an accidental part of your intervention.

Example: You're studying whether diversity training reduces prejudice. Your pretest asks detailed questions about racial attitudes, stereotypes, and biases. Just completing this questionnaire makes participants more aware of these issues and primes them to pay extra attention during training. The training might appear more effective in your study than it would be in real workplaces where people don't first complete a consciousness-raising questionnaire.

The Solomon Four-Group Design Solution:

This elegant design lets you detect and measure pretest effects:

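| Group | Pretest? | Treatment? | Posttest? | Purpose |
| --- | --- | --- | --- | --- |
| 1 | Yes | Yes | Yes | Standard experimental group |
| 2 | Yes | No | Yes | Controls for pretest + maturation |
| 3 | No | Yes | Yes | Treatment without pretest sensitization |
| 4 | No | No | Yes | Pure control |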

By comparing Groups 1 and 3 (both get treatment, only one gets pretest), you see whether the pretest affected treatment response. By comparing Groups 2 and 4 (neither gets treatment), you see whether the pretest alone affected outcomes.
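These two comparisons can be worked through with posttest means. The numbers below are hypothetical (higher scores are assumed to mean a better outcome). One point worth noticing: the Group 1 vs. Group 3 difference contains both the effect of the pretest by itself and any extra boost the pretest gives to the treatment, so subtracting the Group 2 vs. Group 4 difference isolates the pretest-treatment interaction.

```python
# Hypothetical posttest means for the four groups (higher = better outcome).
posttest_means = {
    1: 30.0,  # pretest + treatment
    2: 22.0,  # pretest only
    3: 26.0,  # treatment only
    4: 20.0,  # neither
}

# Groups 1 vs. 3: effect of the pretest among treated participants.
pretest_with_treatment = posttest_means[1] - posttest_means[3]

# Groups 2 vs. 4: effect of the pretest alone (a testing effect).
pretest_alone = posttest_means[2] - posttest_means[4]

# If the pretest adds more under treatment than it does on its own,
# the remainder is the pretest-treatment interaction.
interaction = pretest_with_treatment - pretest_alone

# Groups 3 vs. 4: treatment effect uncontaminated by any pretest.
treatment_unpretested = posttest_means[3] - posttest_means[4]

print(f"Pretest effect with treatment: {pretest_with_treatment:.1f}")
print(f"Pretest effect alone:          {pretest_alone:.1f}")
print(f"Pretest-treatment interaction: {interaction:.1f}")
print(f"Treatment effect, no pretest:  {treatment_unpretested:.1f}")
```

A real analysis would test these differences statistically rather than compare raw means, but the arithmetic shows what each pair of groups contributes.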

Common Misconceptions That Trip Up Students

Misconception 1: "External validity is more important than internal validity."

Reality: You need internal validity first. If you can't establish that your treatment caused your results, who cares if those results generalize? You'd be efficiently spreading misinformation. Internal validity is the foundation; external validity is the extension built on it.

Misconception 2: "Random assignment solves all problems."

Reality: Random assignment to treatment and control groups specifically addresses the differential selection, maturation, and history threats to internal validity. It does nothing for testing effects, instrumentation, or external validity threats such as reactivity. It's powerful but not magic.

Misconception 3: "Control groups prevent all threats to internal validity."

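Reality: Control groups are essential but insufficient. Instrumentation problems can affect both groups, differential attrition can still change who remains in each group, and a control group does not handle testing effects or statistical regression unless the design accounts for them (for example, by making sure the control group also includes extreme scorers).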

Misconception 4: "If a study has good internal validity, it must have good external validity."

Reality: These are independent. You can have a brilliantly controlled lab study (high internal validity) that tells you nothing about real-world applications (low ecological validity). Conversely, naturalistic observation might have great external validity but terrible internal validity because you can't control confounding variables.

Misconception 5: "All threats can be completely eliminated."

Reality: Research is about tradeoffs. Perfect control often reduces generalizability. Perfect naturalness eliminates control. Good researchers minimize the most relevant threats for their research question and acknowledge remaining limitations honestly.

Memory Strategies for EPPP Success

For Internal Validity Threats, use the acronym "HI MITTS":

  • History
  • Instrumentation
  • Maturation
  • I (differential selection – think "I pick groups badly")
  • Testing
  • T (statistical regression – think "Toward the mean")
  • S (differential attrition – think "Some leave")

For External Validity Threats, think "RAMP":

  • Reactivity
  • A (selection-treatment interaction – think "Are participants special?")
  • Multiple treatment interference
  • Pretest-treatment interaction

Internal vs. External shortcut:

  • Internal = "Did I find the real cause INSIDE my study?"
  • External = "Will this EXTEND outside my study?"

Random assignment vs. random selection:

  • Random selection helps external validity (representative sample)
  • Random assignment helps internal validity (equivalent groups)
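The two steps are easy to confuse because both involve randomness, so it can help to see them as separate operations. In this sketch the population list, the sample size, and the group sizes are all hypothetical.

```python
import random

rng = random.Random(7)  # fixed seed so the sketch is repeatable

# Hypothetical sampling frame: everyone in the target population.
population = [f"person_{i}" for i in range(10_000)]

# Random selection: decides who gets INTO the study
# (supports external validity).
sample = rng.sample(population, k=60)

# Random assignment: decides which GROUP each sampled person joins
# (supports internal validity).
rng.shuffle(sample)
treatment, control = sample[:30], sample[30:]

print(f"Selected {len(sample)} of {len(population)} people")
print(f"Treatment group: {len(treatment)}, control group: {len(control)}")
```

A study can do either step without the other: volunteers who are randomly assigned give equivalent groups but a possibly unrepresentative sample, while a random sample split by self-selection gives a representative sample but non-equivalent groups.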

Remember the big fixes:

  • Most internal validity threats: Random assignment to multiple groups
  • Reactivity: Blind techniques
  • Multiple treatment interference: Counterbalancing
  • Pretest threats: Solomon four-group design

Key Takeaways for the EPPP

  • Internal validity = Can you conclude X caused Y? External validity = Will this work elsewhere/otherwise?

  • Seven internal validity threats: History, maturation, differential selection, statistical regression, testing, instrumentation, differential attrition

  • Four external validity threats: Reactivity (including demand characteristics and experimenter expectancy), multiple treatment interference, selection-treatment interaction, pretest-treatment interaction

  • Random assignment is the primary defense against history, maturation, and differential selection threats to internal validity

  • Counterbalancing controls multiple treatment interference in within-subjects designs

  • Blind techniques control reactivity: single-blind limits demand characteristics, and double-blind also limits experimenter expectancy

  • Solomon four-group design detects and controls pretest effects on both internal and external validity

  • External validity has five types: population, ecological, temporal, treatment variation, and outcome validity

  • You cannot have meaningful external validity without first establishing internal validity

  • Perfect validity is impossible; research involves strategic tradeoffs based on priorities

Understanding validity isn't just about passing the EPPP – it's about becoming a psychologist who can distinguish between solid evidence and wishful thinking, who designs interventions that actually work, and who interprets research with appropriate skepticism and wisdom. Every time you read a study, ask: "What threatened validity here, and did the researchers address it?" This critical thinking will serve you throughout your career.
