Employee Selection – Evaluation of Techniques

3, 5, 6: Organizational Psychology

Why This Matters: Making Better Hiring Decisions

Imagine you're swiping through dating profiles. Some profiles catch your eye, you go on dates, and some work out while others don't. Now imagine if there were a way to predict which dates would actually turn into successful relationships before you invested all that time and emotional energy. That's essentially what employee selection techniques do for companies: they try to predict who will succeed in a job before the company invests months or years of training and salary.

But here's the catch: How do you know if your prediction method actually works? If you kept choosing the same type of person to date and it never worked out, you'd need a different approach. Companies face the same challenge, and they need to evaluate whether their hiring methods are actually helping them find good employees or just wasting everyone's time.

This is where evaluation of selection techniques comes in. For the EPPP, you'll need to understand how psychologists determine if a hiring test, interview, or assessment is worth using. Think of this as quality control for hiring decisions—we're making sure the tools actually do what they claim to do.

The Four Big Questions About Selection Techniques

Before a company uses any new hiring method—whether it's a personality test, skills assessment, or interview format—they need to answer four crucial questions:

  1. Does it work consistently? (Reliability and Validity)
  2. Does it improve our decisions? (Incremental Validity)
  3. Is it fair to everyone? (Adverse Impact)
  4. Is it worth the money? (Utility Analysis)

Let's break down each of these questions in detail.

Question 1: Does It Work Consistently? (Reliability and Validity)

Understanding Reliability

Reliability is about consistency. Think about your bathroom scale. If you step on it three times in a row and get readings of 150 lbs, 175 lbs, and 140 lbs, that scale isn't reliable—it's all over the place. But if you get 150 lbs every single time, it's highly reliable. It measures consistently, even if it's measuring the wrong thing (maybe it's actually 10 pounds off!).

The same principle applies to hiring tests. A reliable test gives consistent results. If someone takes a typing test on Monday and gets 60 words per minute, then takes an equivalent version on Wednesday and gets 55 words per minute, that's pretty consistent. But if they get 60 on Monday and 20 on Wednesday, something's wrong with the test—not necessarily the person.

Reliability is measured with a reliability coefficient that ranges from 0 to 1.0. The closer to 1.0, the more consistent the test is. Think of it like a restaurant's consistency rating: 1.0 means the food is exactly the same delicious quality every time, while 0 means it's completely unpredictable—sometimes amazing, sometimes terrible.

Understanding Validity

Here's where it gets interesting. Just because a test is reliable doesn't mean it's measuring what you think it's measuring. Remember that bathroom scale? It could consistently give you the same reading but still be 10 pounds off. That scale is reliable but not valid.

Validity means the test actually measures what it claims to measure. There are three main types you need to know:

Content Validity

This is about whether the test covers the right material. Imagine you're hiring someone to be a restaurant server. A test with good content validity would include things like taking orders, handling customer complaints, and carrying multiple plates—the actual skills they'd use on the job.

A test with poor content validity might ask them to solve complex math problems or write essays about fine dining history. Those things might be interesting, but they don't reflect what servers actually do day-to-day.

Content validity is established by:

  • Basing the test on a job analysis (a detailed study of what the job actually involves)
  • Having experts review the test content

Construct Validity

A construct is a hypothetical trait—something we can't directly see or touch, like intelligence, extraversion, or leadership ability. Construct validity asks: "Does this test actually measure the trait it claims to measure?"

Let's say you're using a test that claims to measure "leadership potential." To establish construct validity, you'd need to show that:

  • People who score high on your test also score high on other established leadership tests (measuring the same thing)
  • People who score high on your test score low on tests of, say, social anxiety (measuring different things that shouldn't overlap)
  • The test predicts leadership-related outcomes (people who score high actually become effective leaders)

This is like when you're dating someone new and trying to figure out if they're actually "emotionally available" like they claim. You'd look for evidence across different situations—how they talk about past relationships, whether they make time for you, how they handle conflict. You're building a case for whether that trait is really there.

Criterion-Related Validity

This is the gold standard for hiring tests. Criterion-related validity answers the question: "Do scores on this test predict actual job performance?"

You measure this by giving the test to a group of people, then checking how well their test scores correlate with their actual job performance later. This produces a validity coefficient ranging from -1.0 to +1.0.

Here's what those numbers mean:

  • +1.0: Perfect positive relationship (high test scores always mean high job performance)
  • 0: No relationship at all (test scores tell you nothing about job performance)
  • -1.0: Perfect negative relationship (high test scores always mean poor job performance)

In reality, you'll never see +1.0 or -1.0. Most good selection tests have validity coefficients between .30 and .50—which might sound low, but it's actually useful for prediction.

Validity Coefficient    Strength     What It Means
.50+                    Strong       Test is a good predictor of job performance
.30 to .49              Moderate     Test provides useful information
.10 to .29              Weak         Test has limited predictive value
Below .10               Very Weak    Test is barely better than guessing
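
To make the coefficient concrete, here is a minimal Python sketch that correlates test scores with later performance ratings and labels the result using the bands in the table above. The data are invented for illustration, and the function names (pearson_r, validity_band) are hypothetical rather than part of any standard tool.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def validity_band(r):
    """Label a validity coefficient using the bands from the table above."""
    if r >= 0.50:
        return "Strong"
    if r >= 0.30:
        return "Moderate"
    if r >= 0.10:
        return "Weak"
    return "Very Weak"

# Hypothetical data: selection test scores and supervisor ratings one year later.
test_scores = [62, 75, 58, 90, 70, 83, 66, 79]
performance = [3.1, 3.4, 3.6, 4.2, 2.9, 3.5, 3.8, 3.3]

r = pearson_r(test_scores, performance)
print(f"validity coefficient = {r:.2f} ({validity_band(r)})")
```

With only eight people the estimate would be very unstable; real validation studies use much larger samples.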

Question 2: Does It Improve Our Decisions? (Incremental Validity)

Okay, so you've found a test with decent validity. But here's the real question: Will adding this test to your current hiring process actually make your decisions better? That's what incremental validity is all about—the improvement in accuracy you get from adding something new.

Think about it like upgrading your smartphone. Sure, the new model has better features, but if your current phone already does everything you need, is the upgrade really worth it? Same logic applies to hiring tests.

A new test is most likely to improve decisions when three things line up:

The Selection Ratio is Low

The selection ratio is the percentage of applicants you plan to hire. It's calculated by dividing positions available by total applicants.

  • Low selection ratio (0.10): Hiring 1 out of 10 applicants—you're being picky
  • High selection ratio (0.90): Hiring 9 out of 10 applicants—you're hiring almost everyone

Low selection ratios are better because you have options. It's like when you're apartment hunting in a market with tons of available units—you can be choosy and find exactly what you want. But if there's only one apartment available, all the fancy checklists in the world won't help you much because you don't have choices.
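
The calculation itself is a single division. A small sketch, with a hypothetical function name, using the two examples above:

```python
def selection_ratio(openings, applicants):
    """Proportion of applicants who will be hired (openings / applicants)."""
    if applicants <= 0:
        raise ValueError("applicants must be a positive number")
    return openings / applicants

print(selection_ratio(1, 10))  # 0.1 -> low ratio, you can be picky
print(selection_ratio(9, 10))  # 0.9 -> high ratio, almost everyone is hired
```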

The Base Rate is Moderate

The base rate is the percentage of employees hired using your current method who are successful. This is crucial for understanding incremental validity.

  • High base rate (0.80): 80% of people you hire are already successful—your current method works great
  • Moderate base rate (0.50): 50% are successful—there's room for improvement
  • Low base rate (0.20): Only 20% are successful—something is seriously wrong

Here's the key insight: A moderate base rate (around .50) gives you the most room for improvement.

If your base rate is already high (.80 or .90), adding a new test probably won't help much. It's like if you're already eating healthy and exercising regularly—adding a fancy fitness tracker might give you minor improvements, but you're already doing well.

If your base rate is very low (.20 or .30), a new selection test probably isn't your main problem. Something else is going wrong—maybe your training program is terrible, management is toxic, or the job description doesn't match reality. It's like if every relationship you've ever had failed after two months—the problem probably isn't your method for choosing partners; something else needs to be addressed.

The Taylor-Russell Tables

Psychologists use Taylor-Russell tables to predict how much improvement a new test will provide. These tables combine three factors: the test's validity coefficient, the base rate, and the selection ratio.

Here's a practical example: Your company currently has a 50% success rate with new hires (base rate = .50). You're hiring 1 out of every 10 applicants (selection ratio = .10). You're considering adding a test with a validity coefficient of .30—which is fairly modest.

Looking at the Taylor-Russell tables, this combination would result in 71% of new hires being successful. That's an improvement of 21 percentage points (71% - 50% = 21), which is substantial! Even a test with moderate validity can make a big difference under the right conditions.
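
If you're curious where a table entry like that comes from, the sketch below approximates it numerically. It assumes, as the tables do, that test scores and job performance follow a bivariate normal distribution. The function name and the integration settings are illustrative choices, not a standard implementation, and you won't need to compute this on the EPPP.

```python
from math import sqrt
from statistics import NormalDist

def taylor_russell(validity, base_rate, selection_ratio, steps=4000):
    """Approximate the proportion of selected applicants who succeed.

    Assumes test score X and performance Y are standard bivariate normal
    with correlation `validity`. Applicants with X above a cutoff are
    hired; "success" means Y is above its own cutoff.
    """
    std = NormalDist()
    x_cut = std.inv_cdf(1 - selection_ratio)  # hire the top `selection_ratio`
    y_cut = std.inv_cdf(1 - base_rate)        # `base_rate` of all applicants would succeed
    resid_sd = sqrt(1 - validity ** 2)        # spread of Y around its predicted value

    # Midpoint-rule integration of P(success | X = x) over the hired region.
    upper = x_cut + 10.0                      # density beyond this point is negligible
    width = (upper - x_cut) / steps
    joint = 0.0
    for i in range(steps):
        x = x_cut + (i + 0.5) * width
        p_success = 1 - std.cdf((y_cut - validity * x) / resid_sd)
        joint += std.pdf(x) * p_success * width

    return joint / selection_ratio            # condition on having been hired

# The example above: validity .30, base rate .50, selection ratio .10.
print(f"{taylor_russell(0.30, 0.50, 0.10):.2f}")
```

The result for the example should land close to the 71% quoted from the tables. Setting the validity to 0 returns the base rate, which matches the intuition that a useless test leaves your success rate unchanged.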

Question 3: Is It Fair to Everyone? (Adverse Impact)

This is where legal and ethical considerations come in. Adverse impact (also called disparate impact) occurs when a hiring method unintentionally discriminates against a legally protected group—like women, racial minorities, or older workers.

The key word here is "unintentionally." We're not talking about deliberate discrimination. We're talking about tests that seem neutral but end up screening out certain groups at higher rates.

Two Ways Adverse Impact Happens

Test Unfairness

Test unfairness occurs when members of one group consistently score lower on a test, but those score differences don't reflect actual differences in job performance.

Imagine a company uses a test that requires reading complex technical manuals quickly. Women score lower on this test and get hired less often. But once hired, women and men perform equally well on the job. That's test unfairness—the test is screening out qualified women based on something that doesn't actually matter for job success.

It's like if a dating app filtered people by how quickly they respond to messages, and this ended up excluding people who are thoughtful and deliberate about their communication. If those thoughtful people would actually make great partners, the filter is unfair—it's screening out good matches for the wrong reasons.

Differential Validity

Differential validity occurs when a test has significantly different validity coefficients for different groups.

For example, a test might have a validity coefficient of .70 for men (strongly predicts job performance) but only .20 for women (barely predicts job performance). This means the test works well for predicting men's success but doesn't work for women—it's a valid tool for one group but not the other.

The 80% Rule (Four-Fifths Rule)

The Uniform Guidelines on Employee Selection Procedures give us a specific way to detect adverse impact: the 80% rule.

Here's how it works: Calculate the hiring rate for each group. If the hiring rate for a protected group is less than 80% of the rate for the group with the highest hiring rate (typically the majority group), adverse impact is indicated.

Example calculation:

  • 100 White applicants, 70 hired = 70% hiring rate
  • 100 African American applicants, ? hired

What's the minimum acceptable hiring rate?

  • 70% × 0.80 = 56%

If fewer than 56 African American applicants are hired, adverse impact is occurring.
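
The same check can be sketched in a few lines of Python. The function names are hypothetical, and exact fractions are used so that a result sitting right at the 80% boundary isn't thrown off by floating-point rounding.

```python
from fractions import Fraction

def impact_ratio(protected_hired, protected_applicants,
                 comparison_hired, comparison_applicants):
    """Protected group's hiring rate divided by the comparison group's rate."""
    protected_rate = Fraction(protected_hired, protected_applicants)
    comparison_rate = Fraction(comparison_hired, comparison_applicants)
    return protected_rate / comparison_rate

def shows_adverse_impact(ratio, threshold=Fraction(4, 5)):
    """True when the ratio falls below the four-fifths (80%) threshold."""
    return ratio < threshold

# The example above: 70 of 100 White applicants hired, and either
# 56 or 55 of 100 African American applicants hired.
for hired in (56, 55):
    ratio = impact_ratio(hired, 100, 70, 100)
    print(hired, f"{float(ratio):.3f}", shows_adverse_impact(ratio))
```

Hiring 56 sits exactly at the threshold and is not flagged; hiring 55 falls below it and is.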

What Happens If There's Adverse Impact?

If a court determines a test has adverse impact, the employer has three options:

  1. Replace it: Find a different test that doesn't have adverse impact
  2. Modify it: Change the test so it no longer has adverse impact
  3. Justify it: Prove the test is necessary by showing it's:
    • Valid: Has proven criterion-related, content, or construct validity
    • A business necessity: Required for safe and efficient operations
    • A bona fide occupational qualification (BFOQ): Necessary for normal business operations

Here's an important distinction: BFOQs can apply to gender, age, religion, and national origin, but never to race.

For example, a religious school can require teachers to be members of their faith (religion is a BFOQ), or a film production can specify they need a female actor to play a female character (gender is a BFOQ). But a company can't justify racial discrimination as a BFOQ under any circumstances.

Question 4: Is It Worth the Money? (Utility Analysis)

Finally, we get to the practical business question: Does this test provide enough benefit to justify its cost?

Utility analysis calculates the economic return on investment for selection procedures. It's like calculating whether a gym membership pays for itself in reduced healthcare costs and increased productivity.

The most common formula is the Brogden-Cronbach-Gleser formula, which considers:

  • Number of people hired: More hires = more potential savings
  • Validity coefficient: Better prediction = more value
  • Standard deviation of job performance in dollars: How much variation exists in employee value
  • Cost of testing: What you're paying for the test itself

This produces a dollar amount showing the expected financial benefit of using the test. If a test costs $50 per applicant but saves the company $500,000 per year in reduced turnover and increased productivity, that's excellent utility.

Think of it like those "cost per wear" calculations for expensive clothing. A $300 coat seems pricey until you realize you'll wear it 100 times over five years—that's $3 per wear. Similarly, an expensive selection test might seem costly until you calculate how much it saves in hiring mistakes.
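
For readers who like to see the arithmetic, here is a rough sketch of the Brogden-Cronbach-Gleser estimate. The usual statement of the formula also includes two inputs not listed above: the average standardized test score of the people hired and their average tenure. Every number in the example is hypothetical, and the function name is illustrative only.

```python
def bcg_utility(n_hired, validity, sd_performance_dollars, mean_z_of_hired,
                cost_per_applicant, n_applicants, tenure_years=1.0):
    """Estimated dollar gain from using a selection test.

    gain = hires x tenure x validity x SD of performance ($) x mean z-score of hires
    cost = cost per applicant x number of applicants tested
    """
    gain = (n_hired * tenure_years * validity
            * sd_performance_dollars * mean_z_of_hired)
    cost = cost_per_applicant * n_applicants
    return gain - cost

# Hypothetical inputs: 20 hires from 200 applicants, validity .40,
# performance SD of $10,000, hires averaging 1 SD above the mean on the test,
# and a test that costs $50 per applicant.
net = bcg_utility(n_hired=20, validity=0.40, sd_performance_dollars=10_000,
                  mean_z_of_hired=1.0, cost_per_applicant=50, n_applicants=200)
print(f"estimated net benefit: ${net:,.0f}")
```

Notice how the pieces interact: doubling the validity doubles the gain, while testing costs scale with the number of applicants rather than the number of hires.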

Real-World Applications: Putting It All Together

Let's walk through a realistic scenario to see how all these concepts work together.

Scenario: A hospital is hiring nurses and wants to add a new personality test to their selection process.

Step 1 - Check Reliability and Validity: They review research showing the test has a reliability coefficient of .85 (very consistent) and a criterion-related validity coefficient of .40 with job performance ratings (moderate validity). The test also has good construct validity for measuring conscientiousness and emotional stability—traits important for nursing.

Step 2 - Assess Incremental Validity: Currently, 60% of their new nurses are rated as successful after one year (base rate = .60). They receive 200 applications for 20 positions (selection ratio = .10). Using the Taylor-Russell tables with their validity coefficient of .40, low selection ratio, and moderate base rate, they estimate adding this test would improve success rates to about 85%, a meaningful improvement of 25 percentage points.

Step 3 - Check for Adverse Impact: They pilot the test and find that the hiring rate for minority candidates is 57% while the hiring rate for White candidates is 65%. Using the 80% rule: 65% × 0.80 = 52%. Since 57% > 52%, the test doesn't show adverse impact.

Step 4 - Calculate Utility: They determine that the cost of training and potentially replacing an unsuccessful nurse is approximately $75,000. With 20 new hires per year, a 25-point improvement in success rates means 5 fewer failures per year, saving roughly $375,000 annually. The test costs $100 per applicant ($20,000 total for 200 applicants), resulting in a net benefit of $355,000 per year. Excellent utility!

Decision: The hospital adopts the personality test as part of their selection process.

Common Misconceptions

Misconception 1: "High reliability means a test is good."

  • Reality: Reliability only tells you the test is consistent. A test could consistently measure the wrong thing. You need validity, not just reliability.

Misconception 2: "Validity coefficients need to be above .70 to be useful."

  • Reality: In employee selection, coefficients between .30 and .50 are common and useful, especially with low selection ratios and moderate base rates.

Misconception 3: "If a test shows group differences, it's automatically unfair."

  • Reality: Group differences in scores or hiring rates can signal adverse impact, but they indicate unfairness only if those differences aren't reflected in actual job performance differences. If the test accurately predicts performance and the performance differences are real, that's not unfairness; that's validity.

Misconception 4: "The 80% rule is a legal requirement that must be met."

  • Reality: The 80% rule is a guideline for detecting potential adverse impact, not an absolute legal standard. Courts consider it along with other evidence.

Misconception 5: "Adding more selection tests always improves decisions."

  • Reality: Tests only improve decisions if they add new information. Adding a second test that measures the same thing as the first test won't help—it's redundant.

Practice Tips for Remembering

For Reliability vs. Validity: Remember "Reliable but not Valid" with the bathroom scale example. Your scale could consistently say you weigh 150 lbs (reliable) but be 10 lbs off (not valid). Reliability = consistency; Validity = accuracy.

For the Three Types of Validity: Use the mnemonic "C-C-C":

  • Content = Coverage (does it cover the job content?)
  • Construct = Concept (does it measure the psychological concept?)
  • Criterion = Correlation (does it correlate with job performance?)

For Incremental Validity: Think "Low SR, Moderate BR, Makes Better":

  • Low Selection Ratio + Moderate Base Rate = Makes your decisions Better (when you add a valid test)

For the 80% Rule: Remember it as the "80% of the majority" rule. The protected group's hiring rate must be at least 80% of what the majority group's hiring rate is. If White candidates have a 70% hiring rate, minority candidates need at least 56% (70% × .80).

For BFOQ: Remember the acronym "GARN" for what BFOQ can apply to:

  • Gender
  • Age
  • Religion
  • National origin
  • But NEVER race

Memory Palace Technique: Imagine walking through a hiring process:

  1. Enter through Reliability Door (consistent, solid door that works every time)
  2. Walk down Validity Hallway with three rooms: Content (full of job tasks), Construct (abstract art representing traits), Criterion (scoreboard showing job performance)
  3. Enter Incremental Office where you're choosing from many applicants (low SR) with about half currently succeeding (moderate BR)
  4. Pass the Fairness Checkpoint with the 80% sign
  5. End at the Money Counter (utility analysis)

Key Takeaways

  • Reliability measures consistency; validity measures accuracy. You need both, but validity is more important.

  • The three types of validity each serve different purposes:

    • Content validity: Does the test sample job-relevant content?
    • Construct validity: Does the test measure the intended psychological trait?
    • Criterion-related validity: Does the test predict job performance?

  • Incremental validity is maximized when you have: a valid test, a low selection ratio, and a moderate base rate.

  • Adverse impact occurs when a selection method has a disproportionate negative effect on protected groups. Use the 80% rule to detect it. It can happen in two ways:

    • Test unfairness: Protected group scores lower on test but not on job performance.
    • Differential validity: Test has different validity coefficients for different groups.

  • When adverse impact is found, employers can replace/modify the procedure or prove it's valid, a business necessity, or a BFOQ (but race can never be a BFOQ).

  • Utility analysis calculates the financial return on investment for selection procedures using factors like validity coefficient, number hired, and testing costs.

  • The Taylor-Russell tables help predict how much a new selection test will improve hiring success rates.

For the EPPP, focus on understanding the relationships between these concepts rather than memorizing formulas. You'll likely see scenario-based questions asking you to identify which evaluation method is appropriate, whether adverse impact is occurring, or what factors affect incremental validity. Keep the big picture in mind: all of these techniques exist to help organizations make better, fairer, and more cost-effective hiring decisions.
