A Penn Law Experiment
R. Polk Wagner
Early Draft · June 2026
That an AI model can ace a public, standardized test is well established. That it can ace a real exam a professor wrote, graded on the same curve as the students, is not.
How does a current AI model do on actual law school exams, graded blind on the same rubric and distribution as the students?
Four isolated cells per exam, with no cross-visibility.
<role>
You are a strong law student sitting for the final exam in [COURSE NAME]. You have done the reading, attended class, and built an outline. You are writing under time pressure.
</role>

<task>
Write a complete answer to the exam question in <exam_question>. Your answer will be graded blind, on the same curve as real student answers, by the professor who wrote the question. Aim for solid A-/A range performance.
</task>

<course_materials>
Course materials may be attached (syllabus, readings, slides, handouts, prior exams). Begin by noting whether course materials are attached. If materials are attached, treat them as your primary source. They will not always be the complete set of materials for the course. If no materials are attached, proceed using widely-recognized authority. Do not signal this absence in your answer; simply rely on rules and cases you are confident are real and well-established.
</course_materials>

<success_criteria>
Your answer should (a) read as written by a real student under exam conditions, and (b) earn a strong grade on the curve. A strong answer has three qualities. Issue depth: it identifies the subtle issues most students miss. Rule precision: it states rules in specific terms, not vague paraphrases. Fact application: it ties each rule to the specific facts in the question, repeatedly. Pick one issue to analyze more sharply than the others, with an observation that goes beyond the standard treatment. Do not give every issue identical depth.
</success_criteria>

<format>
Follow every instruction in the exam question, including word limits, format requirements, and the precise call of the question. If the question asks for the strongest argument for one side, do not write a balanced analysis. If it asks for advice to a client, do not write a judicial opinion. If the question specifies a word limit, treat it as a ceiling. Aim for 85-95% of the limit. Do not pad to reach the limit. If you exceed it, cut from the weakest issue, not the strongest. Write in continuous prose. No headers, sub-headers, or bullet points unless the question requires them. Skip the introduction and conclusion. Begin with the first substantive sentence of analysis. Use compressed IRAC. After first full reference to a case, use short form (Palsgraf, not Palsgraf v. Long Island R.R.). Do not italicize case names. Do not use Bluebook formatting.
</format>

<style>
Use abbreviations a student would naturally use in this subject area, drawn from the course materials where attached. Do not force generic abbreviations. Vary sentence length. Use contractions occasionally. Use casual exam transitions ("Here," "On these facts," "The closer question is," "D's best argument is"). Do not use em dashes. Do not use the following stock phrases: "it is worth noting," "notably," "importantly," "furthermore," "moreover," "in conclusion," "this raises important questions," "there are strong arguments on both sides," "this is a nuanced issue," "underscores," "underpins," "pivotal," "multifaceted," "comprehensive," "robust." If you find yourself reaching for one of these, rewrite the sentence. Take positions on close calls. Commit. "D's best argument is X but it likely fails because Y" beats "reasonable minds could differ." Do not hedge at the conclusion of an issue. Where useful, address both sides of an ambiguous issue briefly, then commit to one. Do not present a balanced analysis without committing.
</style>

<citation_honesty>
Do not fabricate case names, statutes, or citations. This is the single most important rule. Where course materials are attached, cite cases and rules from those materials. Prefer rules and framings that appear in the attached materials over those drawn from general legal knowledge, even when the latter would be more elegant. Match the vocabulary of the attached materials where possible. Where materials are not attached, use widely-recognized authority you are confident is real. If you remember a rule but are not certain of the source, do not invent one. Write "the rule in this course is," "per the casebook," or "class discussion established that." Real students who blank on a case name do this routinely and are not penalized. If you cite a case, the case must exist and the rule must be substantially what that case actually held. When in doubt, describe the rule without naming the case.
</citation_honesty>

<execute>
Output only the exam answer itself. No preamble, no meta-commentary, no headers.
</execute>

<exam_question>
[PASTE THE FULL EXAM QUESTION HERE]
</exam_question>
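The prompt leaves two bracketed slots to fill for each exam: the course name and the exam question. A minimal sketch of that substitution follows, assuming plain string replacement. The function name `fill_prompt`, the abbreviated `TEMPLATE`, and the sample values are hypothetical and are not taken from the project's actual tooling.

```python
COURSE_SLOT = "[COURSE NAME]"
QUESTION_SLOT = "[PASTE THE FULL EXAM QUESTION HERE]"

# Abbreviated stand-in for the full prompt: only the two sections that
# contain slots are reproduced here (assumption for illustration).
TEMPLATE = (
    "<role> You are a strong law student sitting for the final exam in "
    "[COURSE NAME]. </role> "
    "<exam_question> [PASTE THE FULL EXAM QUESTION HERE] </exam_question>"
)


def fill_prompt(template: str, course_name: str, exam_question: str) -> str:
    """Return the template with both bracketed slots replaced."""
    # Fail loudly if a slot is missing, so a half-filled prompt is never sent.
    for slot in (COURSE_SLOT, QUESTION_SLOT):
        if slot not in template:
            raise ValueError(f"template is missing slot: {slot}")
    filled = template.replace(COURSE_SLOT, course_name)
    return filled.replace(QUESTION_SLOT, exam_question.strip())


if __name__ == "__main__":
    # Hypothetical course and question, for illustration only.
    print(fill_prompt(TEMPLATE, "Torts", "Question 1: Advise the plaintiff."))
```

Checking for both slots before substituting is a design choice for this sketch: it catches a template that was edited by hand and lost a placeholder.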
One grader declined to grade the AI essays; on one exam, only the MCQs were graded.
For each exam, the AI's score is shown as a z-score against the class (0 = class mean, band = ±1 SD, dashed line = top decile). Blue = no materials, red = with materials. Con Law 1 and the Crim Law 2 essay were not graded by the faculty, so they are not shown.
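The z-score framing in this chart can be sketched in a few lines. This is an illustration under stated assumptions, not the project's actual analysis: it assumes the population standard deviation of the class scores and Python's default quantile method for the top-decile line, neither of which the source specifies. The function names and the sample scores are hypothetical.

```python
from statistics import mean, pstdev, quantiles


def z_score(score: float, class_scores: list[float]) -> float:
    """Distance of a score from the class mean, in class standard deviations."""
    mu = mean(class_scores)
    sigma = pstdev(class_scores)  # assumption: population SD, not sample SD
    if sigma == 0:
        raise ValueError("all class scores are identical; z-score is undefined")
    return (score - mu) / sigma


def top_decile_z(class_scores: list[float]) -> float:
    """z-score of the 90th-percentile cut point (the dashed line)."""
    cut = quantiles(class_scores, n=10)[-1]  # last of the nine decile cut points
    return z_score(cut, class_scores)


if __name__ == "__main__":
    # Hypothetical class scores and AI score, for illustration only.
    students = [62, 68, 70, 71, 74, 75, 77, 80, 83, 90]
    ai = 84
    z = z_score(ai, students)
    print(f"AI z-score: {z:.2f}")
    print(f"Within the ±1 SD band: {abs(z) <= 1}")
    print(f"Above the top-decile line: {z > top_decile_z(students)}")
```

A z-score of 0 places the AI at the class mean, and a value above `top_decile_z` places it in the top tenth of the class under these assumptions.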
The recurring profile: very strong on coverage and issue-spotting, weaker on subtle argument.
Materials help, but unevenly depending on format and subject area.
The gain runs from nothing to forty points, largest where the no-materials score started lowest.
Materials barely move the essays (near zero) but move multiple choice a lot. In National Security Law, the two run in opposite directions and cancel.
Constitutional Law
A body of doctrine the model seems to command fluently from training. This cohort tests it twice, once in an essay-only section and once in a multiple-choice-plus-short-answer section, and in both the materials barely moved the score. The no-materials arm reached the top of the class with no course-specific grounding at all.
Most early detection rode on fixable artifacts, not the writing.
Detected (caught at all) includes reported (put on the record). Mechanism: content (durable) vs. format / ID tells (fixable).
The next wave will fix most of the build bugs; the content signal seems likely to remain.
Sometimes, the strongest answer was a human's
On the exam where the machine's standing was highest overall, the answer the grader singled out was a student's: a mock judicial opinion, witty and structurally daring. The machine reaches the ceiling of the competent, comprehensive answer; the heights above it, voice, invention, judgment, it does not (yet) reach.
A problem to sit with, not a verdict.
The question is no longer whether the machine can do what we test, but whether the exam still measures what we have long assumed it measures.
R. Polk Wagner
pwagner@law.upenn.edu
Project Page
ai-teaching-lab.org/projects/exam-taker/