Penn Shield

Can AI Ace Your Exam?

A Penn Law Experiment


R. Polk Wagner

Early Draft · June 2026

Special Thanks

What AI Already Aces

  • The bar exam, near the top
  • The LSAT and admissions tests
  • Public, standardized benchmarks
  • All built to be machine-scored

What We Know Far Less About

  • A real (elite) law school final exam
  • Multi-format, written for one class, never released
  • Graded blind by its author
  • Ranked against enrolled students

That AI can ace a public, standardized test is well established. That it can ace a real exam a professor wrote, graded on the same curve as students, is not.

How does a current AI model do on actual law school exams, graded blind on the same rubric and distribution as the students?

How it was done

The Experiment

From exam to graded score

[Diagram: one exam → arms A and B, each as essay and MCQ (A = no materials · B = trained) → four isolated runs, no shared context → Examplify export with a reserved exam ID (1204 in the diagram) → graded blind on the real-student curve]

Four isolated cells per exam, with no cross-visibility.
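
A minimal sketch of the four-cell layout, assuming the cells are the two arms (A, B) crossed with the two formats (essay, MCQ) as the diagram suggests. The `Cell` class, its `context` field, and `build_cells` are hypothetical names for illustration, not the project's actual tooling.

```python
from dataclasses import dataclass, field
from itertools import product

ARMS = ("A", "B")          # A = no materials, B = with course materials
FORMATS = ("essay", "MCQ")


@dataclass
class Cell:
    exam: str
    arm: str
    fmt: str
    # Each cell owns a fresh context list; nothing is shared between cells.
    context: list = field(default_factory=list)


def build_cells(exam: str) -> list[Cell]:
    """Return the four arm-by-format cells for one exam."""
    return [Cell(exam, arm, fmt) for arm, fmt in product(ARMS, FORMATS)]


cells = build_cells("Hypothetical Exam")
cells[0].context.append("prompt sent to cell 0")  # visible to cell 0 only
```

Using `default_factory=list` gives every cell its own list, a simple stand-in for "no cross-visibility."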

The Prompt

essay_prompt.txt
<role>
You are a strong law student sitting for the final exam in [COURSE NAME]. You have done the reading, attended class, and built an outline. You are writing under time pressure.
</role>
 
<task>
Write a complete answer to the exam question in <exam_question>. Your answer will be graded blind, on the same curve as real student answers, by the professor who wrote the question. Aim for solid A-/A range performance.
</task>
 
<course_materials>
Course materials may be attached (syllabus, readings, slides, handouts, prior exams). Begin by noting whether course materials are attached.
 
If materials are attached, treat them as your primary source. They will not always be the complete set of materials for the course.
 
If no materials are attached, proceed using widely-recognized authority. Do not signal this absence in your answer; simply rely on rules and cases you are confident are real and well-established.
</course_materials>
 
<success_criteria>
Your answer should (a) read as written by a real student under exam conditions, and (b) earn a strong grade on the curve.
 
A strong answer has three qualities. Issue depth: it identifies the subtle issues most students miss. Rule precision: it states rules in specific terms, not vague paraphrases. Fact application: it ties each rule to the specific facts in the question, repeatedly.
 
Pick one issue to analyze more sharply than the others, with an observation that goes beyond the standard treatment. Do not give every issue identical depth.
</success_criteria>
 
<format>
Follow every instruction in the exam question, including word limits, format requirements, and the precise call of the question. If the question asks for the strongest argument for one side, do not write a balanced analysis. If it asks for advice to a client, do not write a judicial opinion.
 
If the question specifies a word limit, treat it as a ceiling. Aim for 85-95% of the limit. Do not pad to reach the limit. If you exceed it, cut from the weakest issue, not the strongest.
 
Write in continuous prose. No headers, sub-headers, or bullet points unless the question requires them. Skip the introduction and conclusion. Begin with the first substantive sentence of analysis.
 
Use compressed IRAC. After first full reference to a case, use short form (Palsgraf, not Palsgraf v. Long Island R.R.). Do not italicize case names. Do not use Bluebook formatting.
</format>
 
<style>
Use abbreviations a student would naturally use in this subject area, drawn from the course materials where attached. Do not force generic abbreviations.
 
Vary sentence length. Use contractions occasionally. Use casual exam transitions ("Here," "On these facts," "The closer question is," "D's best argument is").
 
Do not use em dashes. Do not use the following stock phrases: "it is worth noting," "notably," "importantly," "furthermore," "moreover," "in conclusion," "this raises important questions," "there are strong arguments on both sides," "this is a nuanced issue," "underscores," "underpins," "pivotal," "multifaceted," "comprehensive," "robust." If you find yourself reaching for one of these, rewrite the sentence.
 
Take positions on close calls. Commit. "D's best argument is X but it likely fails because Y" beats "reasonable minds could differ." Do not hedge at the conclusion of an issue.
 
Where useful, address both sides of an ambiguous issue briefly, then commit to one. Do not present a balanced analysis without committing.
</style>
 
<citation_honesty>
Do not fabricate case names, statutes, or citations. This is the single most important rule.
 
Where course materials are attached, cite cases and rules from those materials. Prefer rules and framings that appear in the attached materials over those drawn from general legal knowledge, even when the latter would be more elegant. Match the vocabulary of the attached materials where possible.
 
Where materials are not attached, use widely-recognized authority you are confident is real.
 
If you remember a rule but are not certain of the source, do not invent one. Write "the rule in this course is," "per the casebook," or "class discussion established that." Real students who blank on a case name do this routinely and are not penalized.
 
If you cite a case, the case must exist and the rule must be substantially what that case actually held. When in doubt, describe the rule without naming the case.
</citation_honesty>
 
<execute>
Output only the exam answer itself. No preamble, no meta-commentary, no headers.
</execute>
 
<exam_question>
[PASTE THE FULL EXAM QUESTION HERE]
</exam_question>

  • One structured prompt, sent to each isolated cell
  • Blocks set the role, the task, and what a strong answer looks like
  • The style block bans the machine's tells, including the em-dash
  • Citation-honesty forbids invented authority, the single most important rule
  • Output is the answer only: no preamble, no headers
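
The style and format rules above lend themselves to a mechanical check. The sketch below is illustrative only: the deck does not say any such check was automated, and `find_tells` and `within_word_target` are hypothetical helpers. The phrase list is copied from the prompt's style block, and the 85-95% band from its format block.

```python
import re

# Stock phrases copied from the <style> block of the prompt.
BANNED_PHRASES = [
    "it is worth noting", "notably", "importantly", "furthermore", "moreover",
    "in conclusion", "this raises important questions",
    "there are strong arguments on both sides", "this is a nuanced issue",
    "underscores", "underpins", "pivotal", "multifaceted", "comprehensive",
    "robust",
]

EM_DASH = "\u2014"


def find_tells(answer: str) -> list[str]:
    """Return the em dash and any banned phrases found, case-insensitively."""
    found = []
    if EM_DASH in answer:
        found.append("em dash")
    lowered = answer.lower()
    for phrase in BANNED_PHRASES:
        # Whole-word match only: "robust" does not match inside "robustly".
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.append(phrase)
    return found


def within_word_target(answer: str, limit: int) -> bool:
    """True if the word count falls in the 85-95% band of the word limit."""
    n = len(answer.split())
    return 0.85 * limit <= n <= 0.95 * limit
```

Whole-word matching is a design choice made here, not something the prompt specifies; a stricter audit might also flag inflected forms.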

Captured, audited, packaged

  • The intent: make each submission indistinguishable from a student's
  • Saved exactly as returned; edits were cosmetic and logged
  • Examplify in-class format, with a reserved exam ID
  • Not perfect; some differences remained: spacing, font, ID placement
  • Those build bugs, not the writing, drove most early detection

Three Findings

Competitive at or Near the Top of the Class

10 / 10 · passed every graded exam
10 / 10 · at or above the class median
7 / 10 · reached the top decile
2 · both arms outscored every enrolled student

One grader declined to grade the AI essays; on one exam, only the MCQs were graded.

Composite Results vs the Students

For each exam, the AI's score is shown as a z-score against the class (0 = mean, band = ±1 SD, dashed = top decile). Blue = no materials, red = with materials. Con Law 1 and the Crim Law 2 essay were not graded by the faculty, so they are not shown.
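
For readers who want the arithmetic behind the chart, here is a minimal sketch of a z-score and a top-decile cutoff. It assumes the population standard deviation and an inclusive 90th-percentile rule; the deck does not state which conventions were used, and the class scores are invented for illustration.

```python
from statistics import mean, pstdev, quantiles


def z_score(ai_score: float, class_scores: list[float]) -> float:
    """Distance of the AI's score from the class mean, in class SDs."""
    mu = mean(class_scores)
    sd = pstdev(class_scores)  # assumption: population SD of enrolled students
    return (ai_score - mu) / sd


def top_decile_cutoff(class_scores: list[float]) -> float:
    """Score at the 90th percentile of the class (the dashed line)."""
    # n=10 yields nine cut points; the last one is the 90th percentile.
    return quantiles(class_scores, n=10, method="inclusive")[-1]


# Hypothetical class scores, not data from the study.
scores = [60.0, 70.0, 80.0, 90.0, 100.0]
z_at_mean = z_score(80.0, scores)   # 0.0: exactly at the class mean
cutoff = top_decile_cutoff(scores)  # 96.0 under the inclusive method
```

A score lands in the top decile when it meets or exceeds `cutoff`, and sits inside the ±1 SD band when its z-score is between -1 and 1.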

AI Results: Highlights

The recurring profile: very strong on coverage and issue-spotting, weaker on subtle argument.

Do Materials Help? Multiple Choice

  • Large effect
  • Average +14 to +15 points
  • Up to +40 points (more than 2.5 SD)
  • Materials supply the recalled facts

Do Materials Help? Essays

  • Almost no effect
  • About +0.04 SD across essay-only exams
  • Essay competence comes from training, not the outline
  • On one exam: MCQ +23, essay −19

Materials help, but unevenly depending on format and subject area.

Materials and the multiple-choice score

The gain runs from nothing to forty points, largest where the no-materials score started lowest.

The composite hides the action

Materials barely move the essays (near zero) but move multiple choice a lot. In National Security Law, the two effects run in opposite directions and cancel in the composite.
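
A small sketch of how opposite section effects can wash out in a composite. It uses the one-exam figures reported earlier (MCQ +23, essay −19) and assumes, purely for illustration, that both are in the same units and that the sections are weighted equally. The deck states neither the actual weights nor whether these figures belong to National Security Law. `composite_delta` is a hypothetical helper.

```python
def composite_delta(deltas: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-section materials effects (arm B minus arm A)."""
    total_weight = sum(weights[section] for section in deltas)
    return sum(deltas[section] * weights[section] for section in deltas) / total_weight


# Figures from the deck for one exam; the equal weighting is an assumption.
deltas = {"MCQ": 23.0, "essay": -19.0}
equal_weights = {"MCQ": 1.0, "essay": 1.0}

net = composite_delta(deltas, equal_weights)  # 2.0: two large moves, small net
```

Under these assumptions a 23-point gain and a 19-point loss net out to 2 points, which is why the composite alone understates what the materials did.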

Subject Matters as Much as Format

Constitutional Law

A body of doctrine the model seems to command fluently from training. This cohort tests it twice: an essay-only section and a multiple-choice-plus-short-answer section. On both, the materials barely moved the score. The no-materials arm reached the top of the class with no course-specific grounding at all.

Can faculty detect the machine?

11 / 19 · reported (on the record)
17 / 19 · detected (caught at all)
5 / 9 · graders detected based on content

Most early detection rode on fixable artifacts, not the writing.

Detection: Reported vs Detected

Detected (caught at all) contains reported (put on the record). Mechanism: content (durable) vs format / ID tells (fixable).

Fixable (build bugs)

  • Single spacing vs. the Examplify software's double spacing (which apparently varies, too)
  • A different body font (which also seems to vary)
  • Reserved Exam ID numbers clustered at the end of the set
  • A reversed quotation-mark glyph

Durable (the writing)

  • Two isolated answers that read as written by one hand
  • “The same individual or entity”: three graders said versions of this
  • A thoroughness and polish that time-pressured work typically lacks
  • One grader declined to grade at all, given confidence in detection

The next wave will fix most of the build bugs; the content signal seems likely to remain.

The Ceiling AI Reaches

Sometimes, the strongest answer was a human's

On the exam where the machine's standing was highest overall, the answer the grader singled out was a student's: a mock judicial opinion, witty and structurally daring. The machine reaches the ceiling of the competent, comprehensive answer; it does not (yet) reach the heights above it: voice, invention, judgment.

What Are We Measuring?

A problem to sit with, not a verdict.

The next wave, Fall 2026

Behind the Paper

  • A companion essay: this project, done with an AI agent
  • Claude Code as collaborator, from a prompt folder I controlled
  • Where the tools can be trusted, and where structure must replace trust
  • Maybe the most instructive part of the whole project

The question is no longer whether the machine can do what we test, but whether the exam still measures what we have long assumed it measures.

R. Polk Wagner

pwagner@law.upenn.edu

Project Page

ai-teaching-lab.org/projects/exam-taker/