Tag: causality

  • Berkson’s Paradox: Why Your Data Might Be Lying to You

    Have you ever felt like there’s a strange trade-off in the people you date — the kinder they are, the less attractive they seem? Or maybe doctors notice that patients tend to have either diabetes or high blood pressure, but rarely both?

    That’s Berkson’s Paradox in action. It’s a statistical illusion that happens when we only look at a filtered or biased sample. The relationships we see inside that sample can be completely different (even reversed) from what’s true in the broader population.

    The Hospital Example

    Imagine a hospital that only admits patients who have either:

    • a high diabetes score, or
    • a high blood pressure score.

    In the general population, those two health issues are positively correlated: people with one often have the other.

    But in the hospital’s dataset, something strange happens: if a patient doesn’t have high diabetes, they’re likely admitted because of high blood pressure, and vice versa.

    This makes the two look negatively correlated, even though they’re not.

    Let’s illustrate that with code.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    # Population parameters (mean, n and the seed are assumed here to make the example runnable)
    np.random.seed(42)
    n = 5000
    mean = [0, 0]
    cov = [[1, 0.4], [0.4, 1]]

    # Generate data
    diabetes_bp = np.random.multivariate_normal(mean, cov, n)
    diabetes, bp = diabetes_bp[:, 0], diabetes_bp[:, 1]

    # Define admission rule
    admitted = (diabetes > 1.2) | (bp > 1.5)

    # Create the DataFrame
    df = pd.DataFrame({
        'Diabetes': diabetes,
        'BloodPressure': bp,
        'Admitted': admitted
    })

    # Plot
    plt.figure(figsize=(6, 6))
    plt.scatter(df[~df['Admitted']]['Diabetes'], df[~df['Admitted']]['BloodPressure'],
                color='pink', label='Not Admitted', alpha=0.6)
    plt.scatter(df[df['Admitted']]['Diabetes'], df[df['Admitted']]['BloodPressure'],
                color='blue', label='Admitted', alpha=0.7)

    # Threshold lines for admission
    plt.axvline(x=1.2, color='gray', linestyle='--')
    plt.axhline(y=1.5, color='gray', linestyle='--')

    plt.xlabel("Diabetes Score")
    plt.ylabel("Blood Pressure Score")
    plt.title("Berkson's Paradox: Biased Hospital Admission (Weaker True Correlation)")
    plt.legend()
    plt.grid(True)
    plt.tight_layout()
    plt.show()

    What You’ll See

    In the full population, there’s a mild positive correlation: as diabetes goes up, so does blood pressure.

    But when you only look at admitted patients, the correlation flips. It looks like people with high diabetes are less likely to have high blood pressure, and vice versa.
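
    To make the flip concrete, here is a quick check that reuses the df built above and compares the correlation in the full sample with the correlation among admitted patients only (the exact numbers depend on the random draw, but the admitted-only value comes out much weaker and usually negative):

    # Correlation in the full population vs. among admitted patients only
    full_corr = df['Diabetes'].corr(df['BloodPressure'])
    admitted_corr = df.loc[df['Admitted'], 'Diabetes'].corr(
        df.loc[df['Admitted'], 'BloodPressure'])

    print(f"Full population correlation: {full_corr:.2f}")    # close to the true 0.4
    print(f"Admitted-only correlation:   {admitted_corr:.2f}")  # much weaker, often negative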

    That’s Berkson’s Paradox. When you filter your data (whether it’s based on admissions, hiring decisions, or dating preferences), you might create patterns that don’t exist in reality.

    Why It Happens

    Berkson’s Paradox is a result of conditioning on a collider: a variable that is influenced by two other variables. In this case:

    Diabetes → Admission ← Blood Pressure

    When we only look at patients who were admitted (filtering on the collider), it introduces a spurious negative correlation between diabetes and blood pressure, even if they are independent or positively correlated in the full population.
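
    The effect does not even need a real link between the two variables. Here is a minimal sketch (with an assumed admission rule similar to the one above) showing that two completely independent scores look negatively correlated once you condition on the collider:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=100_000)   # e.g. diabetes score
    y = rng.normal(size=100_000)   # e.g. blood pressure score, independent of x

    # Collider: admission depends on both scores
    admitted = (x > 1.0) | (y > 1.0)

    print(np.corrcoef(x, y)[0, 1])                      # ~0.0 in the full population
    print(np.corrcoef(x[admitted], y[admitted])[0, 1])  # clearly negative after filtering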

    This is a known pitfall in causal inference, especially when working with observational data.

    When It Matters

    Berkson’s Paradox isn’t just academic. It can creep into real-world decisions:

    • Healthcare: Hospitals analyzing only admitted patients might draw false conclusions about disease relationships.
    • Dating apps: If you only swipe on people who are either attractive or kind, you may think those traits never come together.
    • Hiring: Companies screening only the top of the resume pile might think strong technical skills and good communication never overlap.
    • Product feedback: Analyzing only users who contact support may show misleading patterns about product problems.

    In each case, you’re filtering the data in a way that distorts the truth.

    Real-World Origin

    Berkson’s Paradox was first described in a 1946 paper by Joseph Berkson, a statistician at the Mayo Clinic.

    He noticed that certain diseases seemed negatively correlated in hospital data, even though there was no such relationship in the broader population.

    You can find the original paper here:

    Berkson, J. (1946). Limitations of the Application of Fourfold Table Analysis to Hospital Data. Biometrics Bulletin, 2(3), 47–53.

    It’s one of the earliest documented examples of how selection bias can warp statistical conclusions.

    Key Takeaway

    Be careful with filtered data. If you’re only seeing a slice of the population, the relationships in your data might be misleading. Berkson’s Paradox is a good reminder that how you collect your data can shape what you think it says.

  • How Google Used Causal ML to Optimize Gmail Search (Without A/B Testing)

    Machine learning models typically predict outcomes based on what they’ve seen — but what about what they haven’t?

    Google tackled this issue by integrating causal reasoning into its ML training, optimizing when to show Google Drive results in Gmail search.

    The result?

    A 9.15% increase in click-through rates without costly A/B tests.

    Let’s break it down.

    The problem: biased observational data

    Traditional ML models train on historical user behavior, assuming that past actions predict future outcomes.

    But this approach is inherently biased because it only accounts for what actually happened — not what could have happened under different conditions.

    Example: Gmail sometimes displays Google Drive results in search. If a user clicks, does that mean they needed the result? If they don’t click, would they have clicked if Drive results were presented differently?

    Standard ML models can’t answer these counterfactual questions.

    Google’s approach: Causal ML in action

    Instead of treating all users the same, Google’s model categorized them into four response types based on their likelihood to click:

    1. Compliers — Click only if Drive results are shown.
    2. Always-Takers — Click regardless of whether results are shown.
    3. Never-Takers — Never click Drive results.
    4. Defiers — Click only if Drive results are not shown (a rare edge case).

    The challenge? You can’t directly observe these categories — a user only experiences one version of reality.

    Google solved this by estimating counterfactual probabilities, essentially asking: how likely would a user have been to click if the result had been shown, given that it wasn’t?
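
    As a rough, simplified sketch (not Google’s actual pipeline, and using assumed synthetic data), one way to approximate this is to fit two click models, one on impressions where Drive results were shown and one where they were not, and compare each user’s two predicted probabilities:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Synthetic logs: user/query features, whether Drive results were shown, clicks.
    # In this sketch "shown" is assigned at random, which keeps the two models
    # comparable; real observational logs would need care about confounding.
    n = 10_000
    X = rng.normal(size=(n, 5))
    shown = rng.integers(0, 2, size=n).astype(bool)
    p_click = 1 / (1 + np.exp(-(X[:, 0] + 1.5 * shown - 1)))  # assumed data-generating process
    clicked = rng.random(n) < p_click

    # One click model per condition
    model_shown = LogisticRegression().fit(X[shown], clicked[shown])
    model_not_shown = LogisticRegression().fit(X[~shown], clicked[~shown])

    # Counterfactual click probabilities for every user
    p_if_shown = model_shown.predict_proba(X)[:, 1]
    p_if_not_shown = model_not_shown.predict_proba(X)[:, 1]

    # Bucket users into the four response types (0.5 is an arbitrary cutoff here)
    compliers     = (p_if_shown >= 0.5) & (p_if_not_shown < 0.5)
    always_takers = (p_if_shown >= 0.5) & (p_if_not_shown >= 0.5)
    never_takers  = (p_if_shown < 0.5) & (p_if_not_shown < 0.5)
    defiers       = (p_if_shown < 0.5) & (p_if_not_shown >= 0.5)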

    The key insight: optimizing for the right users

    Instead of optimizing blindly for clicks, the model focused on:

    • Prioritizing Compliers (since they benefit the most from Drive results).
    • Accounting for Always-Takers (who don’t need Drive suggestions to click).
    • Avoiding Never-Takers (who won’t click regardless).

    This logic was embedded into the training objective function, ensuring that the model learned from causal relationships rather than just surface-level patterns.
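
    Continuing the sketch above (the specific weights are an assumption for illustration, not Google’s objective), one simple way to express this preference is to reweight training examples so the loss emphasizes Compliers:

    # Turn the estimated response types into per-example weights
    weights = np.where(compliers, 3.0,             # prioritize Compliers
              np.where(always_takers, 1.0,         # Always-Takers: neutral weight
              np.where(never_takers, 0.5, 0.1)))   # down-weight Never-Takers and Defiers

    # Any estimator that accepts sample weights can use them
    click_model = LogisticRegression()
    click_model.fit(X, clicked, sample_weight=weights)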

    The Results: Smarter Personalization Without Experiments

    By integrating causal logic into ML training, Google achieved:

    • +9.15% increase in click-through rate (CTR)
    • Only +1.4% increase in resource usage (not statistically significant)
    • No need for costly A/B testing

    This shows that causal modeling can reduce bias in implicit feedback, making machine learning models more adaptive, efficient, and user-friendly, all without disrupting the user experience.

    Why This Matters

    Most companies rely on A/B testing to optimize product features, but that approach can be expensive or, in some cases, simply not feasible.

    Causal ML offers a way to refine decisions without running thousands of real-world experiments.

    Google’s work shows that the future of ML isn’t just about better predictions — it’s about understanding why users behave the way they do and making decisions accordingly.

    Source

    Training Machine Learning Models With Causal Logic