Predicting Supreme Court decisions is an inherently high-stakes task: outcomes shape law, policy, and lives, yet the decision process is complex, opaque, and influenced by legal, procedural, and human factors.
Machine learning promises predictive leverage, but it risks learning spurious correlations, amplifying bias, or offering performance gains that collapse under scrutiny.
This work addresses three research questions:
RQ1. How do machine learning models compare to a baseline in predicting successful or unsuccessful appeals?
RQ2. Do hearing length and vote unanimity affect the performance of the models?
RQ3. Does the political orientation of SCOTUS justices introduce bias and affect predictive performance?
Why it matters
If machine learning is to be used in legal contexts, whether directly or in an advisory capacity, it must demonstrably outperform naïve baselines without relying on sensitive or ethically questionable features. Understanding where ML adds value and where it fails is essential for trust, fairness, and responsible deployment.
Approach
Benchmarked multiple machine learning models against a strong majority-class baseline to establish whether meaningful predictive signal exists
Carefully preprocessed and engineered legal, temporal, and textual features, balancing model performance against interpretability and bias risk
Systematically evaluated the effect of procedural variables (hearing length, vote unanimity) and judicial political orientation on predictive performance
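The benchmarking setup above can be sketched as follows. This is a minimal illustration with a synthetic, imbalanced stand-in dataset, not the actual SCOTUS features; the models and the majority-class (Zero-R) baseline mirror the ones described.

```python
# Minimal benchmarking sketch: several models vs. a Zero-R
# (majority-class) baseline. Data is a synthetic placeholder.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                                 # placeholder feature matrix
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0.5).astype(int)   # imbalanced labels (~1/3 positive)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Zero-R baseline": DummyClassifier(strategy="most_frequent"),
    "KNN": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    macro_f1 = f1_score(y_te, model.predict(X_te), average="macro")
    print(f"{name}: macro-F1 = {macro_f1:.3f}")
```

Macro-averaged F1 is used here because, under class imbalance, it penalizes a model that ignores the minority class, which plain accuracy does not.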
Key Insights
Framed legal outcome prediction as a signal-vs-bias problem, where outperforming a baseline is necessary but not sufficient for meaningful insight
Demonstrated that high-dimensional feature spaces can degrade generalization, particularly for instance-based models like KNN
Showed that ensemble methods marginally improve performance, but remain constrained by class imbalance and correlated features
Found that sensitive or intuitively relevant features (ideology, unanimity, hearing length) do not materially improve predictive power, challenging common assumptions about judicial bias in outcome prediction
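The degradation of instance-based models like KNN in high-dimensional feature spaces is the usual distance-concentration effect. A small illustrative sketch on random uniform data (not the paper's features):

```python
# As dimensionality grows, pairwise distances concentrate: the nearest
# neighbour is barely closer than the farthest, so KNN loses its signal.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 20, 200):
    X = rng.random((500, d))                       # 500 random points in [0, 1]^d
    dists = np.linalg.norm(X[1:] - X[0], axis=1)   # distances from one query point
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:3d}: relative contrast = {contrast:.2f}")
```

The relative contrast between the nearest and farthest point shrinks sharply as dimensions are added, which is why feature selection mattered more for KNN than for the tree-based models.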
Results
Random Forest marginally outperformed the Zero-R baseline, achieving the best overall F1-score, while KNN underperformed due to noise sensitivity and dimensionality
All models struggled to correctly predict successful appeals, revealing a strong bias toward the majority class
Adding procedural and ideological features did not improve performance, indicating limited predictive value and supporting fairness considerations
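The majority-class bias in the second result can be made concrete with a toy example: under imbalanced labels, accuracy looks healthy while F1 on the minority "successful appeal" class collapses to zero. The class ratio below is illustrative, not taken from the paper.

```python
# Why accuracy hides the failure on successful appeals: a classifier
# that always predicts the majority class ("unsuccessful") still scores
# high accuracy on imbalanced labels, but minority-class F1 is 0.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

y_true = np.array([0] * 85 + [1] * 15)   # 15% successful appeals (illustrative)
y_pred = np.zeros(100, dtype=int)        # always predict "unsuccessful"

print(f"accuracy = {accuracy_score(y_true, y_pred):.2f}")   # 0.85
print(f"F1 (successful appeals) = "
      f"{f1_score(y_true, y_pred, pos_label=1, zero_division=0):.2f}")  # 0.00
```

This is why the evaluation reports F1 rather than accuracy, and why per-class scores are needed to expose the failure on successful appeals.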
We were required to submit our papers anonymously to prevent bias in grading and assessment.