Resources / 7: Research Methods & Statistics / Correlation and Regression

Correlation and Regression

7: Research Methods & Statistics

Mastering Correlation and Regression: Your Guide to Understanding Relationships in Data

Why This Matters for Your Practice

Picture this: You're working with a client who's struggling with depression. You notice they mention poor sleep almost every session. You start wondering—is there actually a connection between their sleep quality and depression severity? And if so, how strong is that connection? Could you use their sleep patterns to predict how they might be feeling next week?

These are exactly the kinds of questions that correlation and regression help us answer. As a psychologist, you'll constantly be working with relationships between variables—whether you're reading research articles, conducting your own studies, or making clinical predictions. Understanding correlation and regression isn't just about passing the EPPP; it's about becoming a better consumer of research and a more informed practitioner.

The Foundation: What Correlation Really Tells Us

At its core, correlation measures how two things move together. When you're analyzing data, you typically have:

  • Predictor variables (X variables): The thing you think might predict or relate to something else
  • Criterion variables (Y variables): The outcome you're trying to understand or predict

Think of it like tracking your smartphone battery and screen time. Screen time is your predictor—you suspect it affects something. Battery life is your criterion—the outcome you're curious about. Correlation tells you whether more screen time consistently relates to lower battery life, and how strong that relationship is.

Correlation coefficients typically range from -1.0 to +1.0. Here's what those numbers mean:

  • +1.0: Perfect positive relationship (as one goes up, the other always goes up by a predictable amount)
  • 0: No relationship whatsoever
  • -1.0: Perfect negative relationship (as one goes up, the other always goes down by a predictable amount)

Most real-world correlations fall somewhere in between. A correlation of .70 is considered strong, .50 is moderate, and .30 is weak.

Important Subscript Alert

Here's a trick that will save you confusion on the exam: Look at the subscript of any correlation coefficient. If it shows two different letters (like r_xy), you're looking at the relationship between two different variables. If it shows the same letter twice (like r_xx), you're looking at a reliability coefficient, which is something entirely different and covered in test construction material.

Three Critical Assumptions: When Correlation Works and When It Doesn't

Before you can trust a correlation coefficient, three conditions need to be met. When they're violated, your correlation might be lying to you.

Assumption 1: Linear Relationships

The relationship between your variables needs to be linear—meaning if you plotted them on a graph, they'd roughly form a straight line, not a curve.

Imagine tracking anxiety levels throughout the day. If anxiety starts low in the morning, peaks at midday, then drops again in the evening, that's a curved (nonlinear) relationship with time. A regular correlation coefficient would miss this pattern and underestimate the actual relationship. You'd get a weak correlation even though there's clearly a strong pattern—it's just not a straight-line pattern.

Assumption 2: Unrestricted Range

You need variation in your data. If everyone in your study scored about the same, you can't see how variables relate across different levels.

Suppose you're studying the relationship between years of therapy experience and treatment effectiveness, but you only survey therapists who've been practicing for 8-10 years. You're missing the beginners and the veterans. Your correlation might suggest experience doesn't matter much, when really you just didn't have enough range in your data to see the true relationship.

Assumption 3: Homoscedasticity

This intimidating word simply means that the variability in your criterion variable should be roughly the same across all levels of your predictor variable.

Think about predicting therapy outcomes from initial symptom severity. If clients with mild symptoms all have pretty similar outcomes (low variability), but clients with severe symptoms have wildly different outcomes (high variability), you've violated this assumption. Your predictions will be accurate for some predictor values but not others.

Choosing the Right Correlation: A Practical Guide

The type of correlation you use depends on how your variables are measured. Here's a breakdown:

Correlation TypeWhen to Use ItExample
Pearson rBoth variables are continuous (interval or ratio scales) and linearly relatedCorrelating therapy hours completed with depression scale scores
Spearman rhoBoth variables are ranksCorrelating class rank in graduate school with ranking on clinical evaluations
Point BiserialOne continuous variable and one true dichotomyCorrelating test scores (continuous) with gender identity (male/female dichotomy)
BiserialOne continuous variable and one artificial dichotomyCorrelating full exam scores (continuous) with pass/fail status (artificially created dichotomy)
Contingency CoefficientBoth variables are nominal (categories)Correlating preferred therapy orientation (CBT, psychodynamic, humanistic) with geographic region (Northeast, South, West, Midwest)

The distinction between true and artificial dichotomies matters: A true dichotomy is naturally binary—you either have a psychology license or you don't. An artificial dichotomy happens when you take something continuous and split it into two groups—like taking GPA scores and dividing them into "high achievers" and "low achievers."

When Pearson r Isn't Enough

If you suspect your variables have a nonlinear relationship but both are continuous, use eta instead of Pearson r. Eta can capture curved relationships that Pearson r would miss.

From Correlation to Understanding: The Coefficient of Determination

Here's where correlation gets really practical. When you square a correlation coefficient, you get something called the coefficient of determination (r²). This tells you the percentage of variability in one variable that's explained by the other.

Let's say you find a correlation of .60 between number of therapy sessions attended and improvement in anxiety symptoms. Square that: .60 × .60 = .36, or 36%.

This means that 36% of the differences we see in anxiety improvement can be explained by how many sessions people attended. The other 64%? That's due to other factors—maybe medication, social support, life circumstances, therapeutic relationship quality, or random chance.

This perspective is humbling and important. Even a respectable correlation of .60 means that most of what's happening is due to other factors we haven't measured. This is why clinical judgment matters—numbers tell part of the story, never the whole story.

Making Predictions: Regression Analysis

Correlation tells you how strong a relationship is. Regression takes the next step: it lets you make actual predictions.

When you run a regression analysis on your data, you create a regression equation. This equation becomes a formula you can use to predict someone's criterion score based on their predictor score.

Here's the key principle: The stronger the correlation, the more accurate your predictions.

With a correlation of .90, your predictions will be pretty close to reality. With a correlation of .30, your predictions will be in the ballpark but with lots of room for error. This is why we combine multiple sources of information when making clinical decisions—no single predictor is ever perfect.

Leveling Up: Multiple Predictors with Multivariate Techniques

Real life is complicated. Usually, outcomes are influenced by multiple factors, not just one. That's where multivariate techniques come in.

Multiple Regression: Predicting One Outcome from Multiple Predictors

Multiple regression lets you use several predictors to estimate a single continuous outcome.

Imagine predicting a client's recovery from depression. You might include therapy attendance, medication adherence, social support level, exercise frequency, and sleep quality as predictors. Multiple regression tells you which predictors matter most and creates an equation that combines all of them for the best prediction.

There are two main approaches:

Simultaneous (standard) multiple regression: You throw all predictors into the equation at once. Use this when theory or previous research suggests all your predictors are important.

Stepwise multiple regression: The analysis adds predictors one at a time, starting with the strongest predictor and adding others only if they improve prediction. Use this when you want to identify the minimum number of predictors needed for good prediction—helpful when assessment time is limited.

The Multicollinearity Problem

Here's what you want: each predictor should correlate strongly with the outcome but weakly with other predictors. When predictors correlate strongly with each other, that's called multicollinearity, and it's a problem.

Why? Because you're essentially measuring the same thing multiple times. If you're predicting job satisfaction and you include "stress at work" and "feeling overwhelmed at work" as separate predictors, these are probably measuring nearly the same thing. They're not giving you new information—they're redundant.

Other Multivariate Techniques

TechniqueWhen to UseExample
Canonical CorrelationMultiple continuous predictors, multiple continuous criteriaUsing personality traits to predict both job satisfaction and work productivity simultaneously
Discriminant Function AnalysisMultiple predictors (any type), one nominal criterionUsing symptoms, history, and test scores to predict which diagnostic category a client falls into
Logistic RegressionSame as discriminant analysis but when assumptions aren't metPredicting treatment dropout (yes/no) from various client characteristics when your data violates normality assumptions

Logistic regression deserves special mention because it's becoming increasingly popular. It's more flexible than discriminant function analysis because it doesn't require normally distributed predictors. In modern research, you'll see it frequently when the outcome is binary—will a client relapse or not? Will they respond to treatment or not?

Advanced Tool: Structural Equation Modeling (SEM)

SEM is the sophisticated cousin of regression and factor analysis. While you won't need to run SEM for the exam, you should understand what it does because you'll encounter it constantly in research articles.

The Key Concepts

Observed variables are things you directly measure—actual test scores, questionnaire responses, behavioral counts. Think of them as the raw data you collect.

Latent variables are theoretical constructs you can't directly measure but infer from multiple observed variables. Depression is a latent variable—you can't measure it directly, but you infer it from symptoms like sleep problems, low mood, and appetite changes.

Exogenous variables are the starting points—nothing in your model predicts or explains them. They're like independent variables.

Endogenous variables are predicted or explained by other variables in your model. They're like dependent variables, though they can also predict other variables.

How SEM Works: The Five-Step Process

  1. Model Specification: You draw a diagram showing your hypothesized relationships based on theory and previous research. Observed variables go in squares, latent variables in circles, and arrows show relationships.

  2. Model Identification: (Technical stuff—the analysis checks whether the model is mathematically solvable)

  3. Model Estimation: (More technical stuff—calculating the specific values)

  4. Model Evaluation: You check how well your proposed model fits the actual data using measures like the goodness-of-fit index (GFI). It's like asking, "Does my theory match reality?"

  5. Model Modification: If the fit isn't good, you revise your model and try again.

Why SEM Matters

SEM lets researchers test complex theories about how multiple variables relate to each other, including indirect effects. For example, you might hypothesize that having a strong social network (structural social support) reduces loneliness not directly, but by increasing the quality of social interactions (functional social support). SEM can test whether this mediation pathway actually exists in your data.

Common Misconceptions to Avoid

"Correlation equals causation": This is the classic mistake. Just because two things correlate doesn't mean one causes the other. People who own more books tend to live longer, but books don't cause longevity—education level and socioeconomic status probably explain both.

"Negative correlations are weaker than positive ones": Wrong. A correlation of -.80 is just as strong as +.80; the sign just indicates direction. Both show strong relationships.

"A low correlation means no relationship": Not necessarily. It might mean no linear relationship. There could be a strong curved relationship that a linear correlation missed.

"You can use any correlation coefficient for any data": The type of data matters. Using the wrong coefficient can give you misleading results or, in some cases, meaningless numbers.

"If the correlation is significant, it's important": Statistical significance just means the relationship probably isn't due to chance. A correlation of .15 might be statistically significant with a large sample but still explain only 2% of the variance—not very useful for prediction.

Practice Tips for Remembering

For choosing correlation types: Create a simple decision tree. Start with "Are both variables continuous?" If yes, use Pearson r (assuming linearity). If no, ask "Are both ranks?" and so on.

For the subscript rule: Remember "different letters = different variables, same letters = reliability." Make a flashcard with r_xy on one side and r_xx on the other.

For assumptions: Use the acronym LRH: Linear relationships, Unrestricted range, Homoscedasticity.

For multivariate techniques: Focus on the criterion variable first. How many criteria do you have? What scale are they measured on? That narrows down your choices quickly.

For coefficient of determination: Practice squaring common correlation values: .30² = .09 (9%), .50² = .25 (25%), .70² = .49 (49%). This helps you quickly interpret the practical significance of correlations you encounter.

For SEM vocabulary: Remember that "observed" means "directly measured" (like observing through a microscope), while "latent" means "hidden beneath the surface" (like latent fingerprints that need special tools to reveal).

Key Takeaways

  • Correlation measures association between variables, ranging from -1.0 to +1.0, with the sign indicating direction and the absolute value indicating strength.

  • Three critical assumptions must be met: linear relationships, unrestricted range, and homoscedasticity. Violations lead to underestimating the true relationship.

  • Choose your correlation coefficient based on measurement scales: Pearson r for continuous variables, Spearman rho for ranks, point biserial for continuous and true dichotomy, biserial for continuous and artificial dichotomy, and contingency coefficient for nominal variables.

  • The coefficient of determination (r²) tells you the percentage of shared variability—how much of the outcome is explained by the predictor.

  • Regression analysis creates equations for making predictions, with accuracy increasing as the correlation strengthens.

  • Multiple regression combines several predictors to predict a single continuous outcome. Avoid multicollinearity by ensuring predictors aren't too highly correlated with each other.

  • Other multivariate techniques include canonical correlation (multiple predictors, multiple continuous criteria), discriminant function analysis (multiple predictors, one nominal criterion), and logistic regression (an alternative when assumptions aren't met).

  • SEM tests complex theoretical models involving both observed and latent variables, using a five-step process from specification to modification.

  • Correlation never proves causation, but it's an essential tool for understanding relationships and making predictions in psychological research and practice.

Understanding these concepts deeply will help you not only pass the EPPP but also critically evaluate research throughout your career. Every time you read that a study found a "significant relationship" between variables, you'll know the right questions to ask: How strong was it really? What type of correlation did they use? Were the assumptions met? How much variance was actually explained? This critical thinking is what separates competent psychologists from exceptional ones.

Ready to practice? Get started in the app.