Luisa Perez Lacera

Graduate Data Analyst | Education & Social Data Research

District-Level Analysis of MCAS Achievement Gaps Across Demographic Groups (2020–2024)


I cleaned and analyzed district-level MCAS data (2020–2024) to quantify achievement gaps across demographic subgroups and subjects. Using ANOVA and Tukey post-hoc testing, I identified statistically significant performance disparities across subjects, with especially large differences in ELA and Math. I communicated findings through distribution plots and significance visualizations to support equity-focused policy discussion.
Data Analysis Project | Education Policy & Equity
Tools: Python (pandas, numpy, matplotlib, seaborn, scipy, statsmodels)

Project Overview

Standardized assessments are frequently used to evaluate student learning and identify achievement gaps across demographic groups. The Massachusetts Comprehensive Assessment System (MCAS) provides a longitudinal dataset that allows for district-level comparisons across subjects and student subgroups.
 
This project asks whether statistically significant differences in MCAS performance exist across demographic subgroups and subject areas, using district-level aggregated results. The goal is to identify persistent inequities and demonstrate how assessment data can inform policy and targeted interventions.

Research Questions

  1. Are there statistically significant differences in MCAS performance across demographic groups?
  2. Do achievement gaps vary by subject area (ELA, Math, Science and related subjects)?

Data Source

  • MCAS Achievement Results (Massachusetts DESE / E2C Hub)
  • Years in raw file: 2017–2024
  • Level of analysis in this project: District-level only (excluding school- and state-level records)

Full Analysis Code

1) Imports

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

2) Load Data + Basic Inspection 

df = pd.read_csv("/content/drive/MyDrive/MCAS_Achievement_Results_20250223.csv")

df.info()
df.head()
What I looked for here (a quick sketch of these checks follows the list):
  • row count and number of columns
  • data types (especially percent fields and categorical fields)
  • initial sanity checks on key variables (SY, SUBJECT_CODE, STUGRP)
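
A minimal sketch of those checks, using only fields that appear later in the analysis:
# Quick sanity checks on the key fields used below
print(df.shape)                                       # row and column counts
print(df.dtypes[["SY", "SUBJECT_CODE", "STUGRP"]])    # types of the key variables
print(df["SUBJECT_CODE"].unique())                    # subjects present in the file
print(df["STUGRP"].value_counts().head(10))           # most common subgroup labels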

3) Filter Years for Comparability

MCAS formats and testing context differ across time. To reduce confounds from earlier test formats and pre-pandemic conditions, I exclude SY 2017–2019.
df["SY"].value_counts()

df_cleaned = df[~df["SY"].isin([2017, 2018, 2019])].copy()
df_cleaned["SY"].value_counts()

4) Drop Columns Not Needed for This Analysis

I focus on achievement outcomes and grouping variables relevant to the research questions.
columns_to_drop = [
    "DISTRICT_AND_SCHOOL",   # redundant (district/school names already present elsewhere)
    "M_CNT", "M_PCT",        # Meeting + Exceeding is the main outcome of interest
    "E_CNT", "E_PCT",
    "PM_CNT", "PM_PCT",      # partially meeting not needed for pass-style outcome framing
    "AVG_SGP_INCL",          # not needed for district-level gap comparisons
    "ACH_PERCENTILE"         # only relevant at school level in this dataset
]

df_cleaned = df_cleaned.drop(columns=columns_to_drop, errors="ignore")
df_cleaned = df_cleaned.reset_index(drop=True)

df_cleaned.info()

5) Keep District-Level Records Only

District-level records are indicated where DIST_CODE == ORG_CODE (school-level records do not match).
df_cleaned = df_cleaned[df_cleaned["DIST_CODE"] == df_cleaned["ORG_CODE"]].copy()
df_cleaned = df_cleaned.reset_index(drop=True)

df_cleaned.info()

6) Remove State-Level Aggregates

df_cleaned["DIST_NAME"].value_counts().head(10)

df_cleaned = df_cleaned[df_cleaned["DIST_NAME"] != "State"].copy()
df_cleaned = df_cleaned.reset_index(drop=True)

df_cleaned.info()

7) Remove Redundant Grade Aggregate

The dataset includes TEST_GRADE == "ALL (03-08)" rows in addition to individual grade records. I remove the aggregate so the same students are not counted twice.
df_cleaned["TEST_GRADE"].value_counts()

df_cleaned = df_cleaned[df_cleaned["TEST_GRADE"] != "ALL (03-08)"].copy()
df_cleaned = df_cleaned.reset_index(drop=True)

df_cleaned.info()

Research Question 1

Are there statistically significant differences in performance across demographic groups and subjects?

8) Summary Table by Subject × Subgroup

summary = (
    df_cleaned
    .groupby(["SUBJECT_CODE", "STUGRP"])[["AVG_SCALED_SCORE", "M_PLUS_E_PCT"]]
    .mean()
    .reset_index()
)

summary
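
The long table can be hard to scan side by side; one option is to pivot it into a subgroup × subject grid. A minimal sketch (the summary_wide name is mine):
# Pivot to a subgroup x subject grid of mean Meeting+Exceeding percentages
summary_wide = summary.pivot(
    index="STUGRP", columns="SUBJECT_CODE", values="M_PLUS_E_PCT"
).round(1)
summary_wide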

9) Remove “All Students” for Group Comparisons

df_q1 = df_cleaned[df_cleaned["STUGRP"] != "All Students"].copy()
df_q1 = df_q1.reset_index(drop=True)

df_q1["STUGRP"].value_counts().head(20)

10) Visualization: Distribution of M_PLUS_E_PCT by Subject and Subgroup

plt.figure(figsize=(15, 6))
sns.boxplot(data=df_q1, x="SUBJECT_CODE", y="M_PLUS_E_PCT", hue="STUGRP")

plt.title("Distribution of MCAS Meeting/Exceeding % by Subject and Demographic Group")
plt.xlabel("Subject")
plt.ylabel("Meeting + Exceeding Expectations Percent")
plt.xticks(rotation=45)

plt.legend(title="Demographic Group", bbox_to_anchor=(1.05, 1), loc="upper left")
plt.tight_layout()
plt.show()
Quick interpretation (what this plot suggests): 
  • Some groups show consistently lower distributions across multiple subjects.
  • Spread differs across groups, pointing to persistent gaps rather than random variation (quantified in the sketch below).
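
To put a number on that spread, one option is a per-subject, per-group dispersion table; a minimal sketch (the spread name is mine):
# Standard deviation and IQR of M_PLUS_E_PCT by subject and subgroup
spread = (
    df_q1
    .groupby(["SUBJECT_CODE", "STUGRP"])["M_PLUS_E_PCT"]
    .agg(
        std="std",
        iqr=lambda s: s.quantile(0.75) - s.quantile(0.25),
    )
    .reset_index()
)
spread.head(10)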

11) One-Way ANOVA by Subject (Across Demographic Groups)

For each subject, I test whether subgroup mean differences in M_PLUS_E_PCT could plausibly be due to chance.
anova_results = {}

for subject in df_q1["SUBJECT_CODE"].unique():
    subject_data = df_q1[df_q1["SUBJECT_CODE"] == subject]
    groups = [
        group["M_PLUS_E_PCT"].dropna().values
        for _, group in subject_data.groupby("STUGRP")
    ]
    anova_results[subject] = stats.f_oneway(*groups) if len(groups) > 1 else None

anova_summary = pd.DataFrame({
    "Subject": list(anova_results.keys()),
    "F_statistic": [res.statistic if res else np.nan for res in anova_results.values()],
    "p_value": [res.pvalue if res else np.nan for res in anova_results.values()],
}).sort_values("p_value")

anova_summary
Interpretation:
If a subject's p-value is below .05, subgroup mean differences in that subject are statistically significant at the 5% level and warrant post-hoc analysis to identify which pairs of groups differ.
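
One-way ANOVA also assumes roughly equal variances across groups. As a quick check beyond the original analysis, Levene's test from scipy.stats can be run per subject; a minimal sketch:
# Levene's test per subject: a small p-value suggests unequal variances,
# in which case a Welch-type test may be more appropriate than standard ANOVA
for subject in df_q1["SUBJECT_CODE"].unique():
    subject_data = df_q1[df_q1["SUBJECT_CODE"] == subject]
    groups = [
        g["M_PLUS_E_PCT"].dropna().values
        for _, g in subject_data.groupby("STUGRP")
    ]
    if len(groups) > 1:
        stat, p = stats.levene(*groups)
        print(f"{subject}: Levene W = {stat:.2f}, p = {p:.4f}")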

12) Post-Hoc Testing: Tukey’s HSD (Per Subject)

tukey_results = {}

for subject in df_q1["SUBJECT_CODE"].unique():
    # Drop missing outcome values, mirroring the ANOVA step above
    subject_data = df_q1[df_q1["SUBJECT_CODE"] == subject].dropna(subset=["M_PLUS_E_PCT"])

    if subject_data["STUGRP"].nunique() > 1:
        tukey = pairwise_tukeyhsd(
            endog=subject_data["M_PLUS_E_PCT"],
            groups=subject_data["STUGRP"],
            alpha=0.05
        )
        # The summary table's first row holds the column headers
        tukey_results[subject] = pd.DataFrame(
            data=tukey.summary().data[1:],
            columns=tukey.summary().data[0]
        )

# Display results
for subject, result in tukey_results.items():
    print(f"\nTukey's HSD Test Results - {subject}")
    print(result.to_string(index=False))
The full set of Tukey tables is long; the output above shows only a sample, and the next section visualizes just the significant comparisons.
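
As a compact check on how many comparisons survive correction, the significant rows can be tallied per subject; a minimal sketch:
# Count significant vs. total pairwise comparisons per subject
for subject, result in tukey_results.items():
    n_sig = (result["p-adj"] < 0.05).sum()
    print(f"{subject}: {n_sig} of {len(result)} comparisons significant at alpha = 0.05")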

13) Visualize Statistically Significant Tukey Differences

This filters to only significant comparisons (p-adj < .05) and plots mean differences by subject.
significant_results = []

for subject, result in tukey_results.items():
    sig = result[result["p-adj"] < 0.05].copy()
    if not sig.empty:
        sig["Subject"] = subject
        significant_results.append(sig)

significant_df = pd.concat(significant_results, ignore_index=True)

# Keep core fields for plotting
significant_df = significant_df[["Subject", "group1", "group2", "meandiff", "p-adj"]].copy()
significant_df["Comparison"] = significant_df["group1"] + " vs " + significant_df["group2"]

# Sort for readability
significant_df = significant_df.sort_values(by=["Subject", "meandiff"])

significant_df.head(10)

subjects = significant_df["Subject"].unique()

fig, axes = plt.subplots(nrows=len(subjects), figsize=(12, 6 * len(subjects)), sharex=False)

# If there's only one subject, axes won't be an array
if len(subjects) == 1:
    axes = [axes]

for ax, subject in zip(axes, subjects):
    subject_data = significant_df[significant_df["Subject"] == subject].copy()
    subject_data = subject_data.sort_values("meandiff")

    sns.barplot(
        data=subject_data,
        x="meandiff",
        y="Comparison",
        ax=ax,
        errorbar=None
    )

    ax.axvline(0, color="black", linestyle="--")
    ax.set_title(f"Significant Tukey Differences in M_PLUS_E_PCT — {subject}")
    ax.set_xlabel("Mean Difference (M_PLUS_E_PCT)")
    ax.set_ylabel("Group Comparison")

plt.tight_layout()
plt.show()
How to read the plot (a labeling sketch follows the list):
  • statsmodels computes meandiff as group2 minus group1, so a negative mean difference indicates the second group listed performed lower than the first.
  • A positive mean difference indicates the second group listed performed higher.
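
To make the direction unambiguous, each row can be labeled with the higher-scoring group; a minimal sketch, assuming statsmodels' meandiff convention (group2 minus group1), with Higher_Group as a hypothetical column name:
# Label each significant comparison with the group that scored higher
significant_df["Higher_Group"] = np.where(
    significant_df["meandiff"] > 0,
    significant_df["group2"],   # positive: group2 outperformed group1
    significant_df["group1"],   # negative: group1 outperformed group2
)
significant_df[["Subject", "Comparison", "meandiff", "Higher_Group"]].head(10)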

Key Findings

  • Across subjects, subgroup differences in the percent meeting/exceeding expectations are statistically significant (ANOVA p-values < .05).
  • The magnitude of gaps varies by subject, with ELA and Math often showing the widest differences.
  • Science subjects also show disparities, though the gaps are often narrower, depending on the subgroup comparison.

Limitations

  • District-level aggregation can mask within-district variation.
  • The analysis is correlational and cannot identify causal drivers (funding, access, staffing, etc.).
  • Some subgroup categories have small sample sizes in certain districts, which affects the stability of estimates (a quick diagnostic is sketched below).
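
As a rough diagnostic for that last point, one could count district-level records per subject × subgroup cell; a minimal sketch (the cell_counts name is mine):
# Records per subject x subgroup; small counts suggest less stable estimates
cell_counts = (
    df_q1
    .groupby(["SUBJECT_CODE", "STUGRP"])["M_PLUS_E_PCT"]
    .count()
    .reset_index(name="n_records")
    .sort_values("n_records")
)
cell_counts.head(10)   # the thinnest cells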

References

Department of Elementary and Secondary Education. (2025). MCAS Achievement Results. E2C Hub.
Hemphill, F. C., Vanneman, A., & Rahman, T. (2011). Achievement gaps (NAEP). NCES.
Jimenez, L., & Modaffari, J. (2021). Future of testing in education. Center for American Progress.
Kim, K. H., & Zabelina, D. L. (2015). Cultural bias in assessment. International Journal of Critical Pedagogy.
Reeves, R. V., & Halikias, D. (2017). Race gaps in SAT scores. Brookings.
Wicks, A. (2020). Standardized tests are essential for equity. RealClearEducation. 