
Posted by sethbannon 4 days ago

Fighting Fire with Fire: Scalable Oral Exams (www.behind-the-enemy-lines.com)
219 points | 277 comments | page 2
acbart 4 days ago|
I have a lot of complicated feelings and thoughts about this, but one thing that immediately jumps to my mind: was the IRB (Institutional Review Board) consulted on this experiment? If so, I would love to know more details about the protocol used. If not, then yikes!
xmddmx 4 days ago|
Turns out that under the US Code of Federal Regulations, there's a pretty big exemption from IRB review for research on pedagogy:

CFR 46.104 (Exempt Research):

46.104.d.1 "Research, conducted in established or commonly accepted educational settings, that specifically involves normal educational practices that are not likely to adversely impact students' opportunity to learn required educational content or the assessment of educators who provide instruction. This includes most research on regular and special education instructional strategies, and research on the effectiveness of or the comparison among instructional techniques, curricula, or classroom management methods."

https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-...

So while this may have been a dick move by the instructors, it was probably legal.

acbart 4 days ago||
I'm afraid you misunderstand what it means to be "exempt" under the IRB. It doesn't mean "you don't have to talk to the IRB", it means "there's a little less oversight but you still need to file all the paperwork". Here's one university's explanation[1]:

> Exempt human subjects research is a specific sub-set of “research involving human subjects” that does not require ongoing IRB oversight. Research can qualify for an exemption if it is no more than minimal risk and all of the research procedures fit within one or more of the exemption categories in the federal IRB regulations. *Studies that qualify for exemption must be submitted to the IRB for review before starting the research. Pursuant to NU policy, investigators do not make their own determination as to whether a research study qualifies for an exemption — the IRB issues exemption determinations.* There is not a separate IRB application form for studies that could qualify for exemption – the appropriate protocol template for human subjects research should be filled out and submitted to the IRB in the eIRB+ system.

Most of my research is in CS Education, and I have often been able to get my studies under the Exempt status. This makes my life easier, but it's still a long arduous paperwork process. Often there are a few rounds to get the protocol right. I usually have to plan studies a whole semester in advance. The IRB does NOT like it when you decide, "Hey I just realized I collected a bunch of data, I wonder what I can do with it?" They want you to have a plan going in.

[1] https://irb.northwestern.edu/submitting-to-the-irb/types-of-...

xmddmx 4 days ago||
The CFR is pretty clear, and I have experience with this (as an IRB reviewer, a faculty member, and a researcher). When it says "is exempt" it means "is exempt".

Imagine otherwise: a teacher who wants to change their final exam from a 50-item Scantron using A-D choices to a 50-item Scantron using A-E choices, because they think having 5 choices per item is better than 4, would need to ask for IRB approval. That's not feasible, and it's not what happens in the real world of academia.

It is true that local IRBs may try to add additional rules, but the NU policy you quote talks about "studies". Most IRBs would disagree that "professor playing around with grading procedures and policies" constitutes a "study".

It would be presumed exempt.

Are you a teacher or a student? If you are a teacher, you have wide latitude that a student researcher does not.

Also, if you are a teacher, doing "research about your teaching style", that's exempted.

By contrast, if you are a student, or a teacher "doing research" beyond your own teaching, that's probably not exempt and must go through the IRB.

acbart 3 days ago||
You would be correct, except that this is a published blog post. It may not be in an academic journal, but this person has still conducted human subjects research that led to a published artifact. It was just "playing around" until they started posting their students' (summarized, anonymized) data to the internet.
viccis 4 days ago||
>0.42 USD per student (15 USD total)

Reminder: This professor's school costs $90k a year, with over $200k total cost to get an MBA. If that tuition isn't going down because the professor cut corners to do an oral exam of ~35 students for literally less than a dollar each, then this is nothing more than a professor valuing getting to slack off higher than they value your education.

>And here is the delicious part: you can give the whole setup to the students and let them prepare for the exam by practicing it multiple times. Unlike traditional exams, where leaked questions are a disaster, here the questions are generated fresh each time. The more you practice, the better you get. That is... actually how learning is supposed to work.

No, students are supposed to learn the material and have an exam that fairly evaluates this. Anyone who has spent time on those old terrible online physics coursework sites like Mastering Physics understands that grinding away practicing exams doesn't improve your understanding of the material; it just improves your ability to pass the arbitrary evaluation criteria. It's the same with practicing leetcode before interviews. Doing yet another dynamic programming practice problem doesn't really make you a better SWE.

Minmaxing grades and other external rewards is how we got to the place we're at now. Please stop enshittifying education further.

itissid 2 days ago||
A colleague of mine raised a very important point here. The class is being taught at NYU's business school (co-taught with Konstantinos Rizakos, AI/ML Product Management). The fees are pretty high: $60,000/year ($2,000+/credit at 15 credits/sem). How much of an ask is it on the business model to spend, say, 25% of that cost (~$15,000 per student) on human evaluation: have their exams evaluated orally by a TA, or just do the damn exam in a controlled class environment?
Panos 2 days ago|
Not an issue of cost, at all.

Absolutely the easiest solution would have been to have a written exam on the cases and concepts that we discussed in class. It would take a few hours to create and grade the exam.

But at a university you should experiment and learn. What better class to experiment and learn than the “AI Product Management”. Students were actually intrigued by the idea themselves.

The key goal: we wanted to ensure that the projects that students submitted were actually their own work, not "outsourced" (in a general sense) to teammates or to an LLM.

Gemini 3 and NotebookLM with slide generation were released in the middle of the class, and we realized that it is feasible for a student to give a flawless presentation in front of the class without understanding deeply what they are presenting.

We could schedule oral exams during finals week, which would be a major disruption for the students, or schedule exams during the break, violating university rules and ruining students' vacations.

But as I said, we learned that AI-driven interviews are more structured and better than human-driven ones, because humans do get tired, and they do have biases based on who they are interviewing. That's why we decided to experiment with voice AI for running the oral exam.

rpcope1 4 days ago||
Oral quals were OK and even kind of fun with faculty who I knew and who knew me, especially in the context of grad school, where it was more of a "we know you know this but want to watch you think and haze you a little bit." Having an AI do its poor simulacrum of this sounds like absolute hell on earth, and I can't believe this person thinks it's a good idea.
bagrow 4 days ago||
If you can use AI agents to give exams, what is stopping you from using them to teach the whole course?

Also, with all the progress in video gen, what does recording the webcam really do?

SoftTalker 4 days ago|
What's stopping you from just using the AI to directly accomplish the ultimate goal, rather than taking the very indirect route of educating humans to do it?
semilin 4 days ago||||
What's the end vision here? A society of useless, catatonic humans taken care of by a superintelligence? Even if that's possible, I wouldn't call that desirable. Education is fundamental for raising competent adults.
baq 4 days ago||
Great question about what adults can be more competent about than an artificial superintelligence. ‘How to be a human’ comes to mind and not much more.
jimbokun 4 days ago||||
Yes, I feel like we still don't have a good explanation for why AI is superhuman at standalone assessments but falls down when asked to perform long-term tasks.
bagrow 4 days ago|||
Well, yes, but, perhaps shortsightedly, I assumed the goal of the professor was to teach the course.
djoldman 3 days ago||
> Gemini lowered its grades by an average of 2 points after seeing Claude's and OpenAI's more rigorous assessments. It couldn't justify giving 17s when Claude was pointing to specific gaps in the experimentation discussion.

This is to be expected. The big commercial LLMs generally respond with text that agrees with the user.

> But here's what's interesting: the disagreement wasn't random. Problem Framing and Metrics had 100% agreement within 1 point. Experimentation? Only 57%.

> Why? When students give clear, specific answers, graders agree. When students give vague hand-wavy answers, graders (human or AI) disagree on how much partial credit to give. The low agreement on experimentation reflects genuine ambiguity in student responses, not grader noise.

The disagreement between the LLMs is interesting. I would hesitate to conclude that "low agreement on experimentation reflects genuine ambiguity in student responses." It could be that it reflects genuine ambiguity on the part of the graders/LLMs as to how a response should be graded.
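The "agreement within 1 point" statistic the post reports is straightforward to compute mechanically; a minimal sketch of one way to do it (the grader names and scores below are hypothetical, not the post's actual data):

```python
from itertools import combinations

def agreement_within(scores_by_grader, tolerance=1):
    """Fraction of students for whom every pair of graders
    differs by at most `tolerance` points on this rubric category."""
    graders = list(scores_by_grader.values())
    n_students = len(graders[0])
    agree = 0
    for i in range(n_students):
        # A student counts as "agreed on" only if ALL grader pairs are close.
        if all(abs(a[i] - b[i]) <= tolerance
               for a, b in combinations(graders, 2)):
            agree += 1
    return agree / n_students

# Hypothetical scores from three LLM graders on one rubric category.
scores = {
    "claude": [17, 14, 12, 18],
    "gemini": [18, 15, 11, 18],
    "openai": [17, 13, 12, 17],
}
print(agreement_within(scores))  # 0.75: graders disagree on student 1 (14/15/13)
```

Note that this metric cannot by itself distinguish the two explanations above: a low value is consistent with both vague student answers and graders applying different partial-credit policies.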

Yossarrian22 4 days ago||
I predict that by the very next semester students will be weaponizing Reasonable Accommodation requests against any further attempts at this
jimbokun 4 days ago|
Universities are rapidly becoming useless as a signal of knowledge and competency of their graduates.
ziofill 3 days ago||
> 36 students examined over 9 days, 25 minutes average

I could accept this for a 300-student class, but 36? When I got my degree, ALL exams had an oral component, usually more than 30 minutes long. The prof and one or two TAs would take a couple of days and just do it. For 36 students it's more than doable. If I were a student being examined by an LLM I would feel like the professor didn't care enough to do the work.

siscia 3 days ago|
In general when you try a new tool or methodology you tend to start with a small class to see the results first.
bsenftner 3 days ago||
Lots of emotional commenting here. This guy, Panos Ipeirotis, is seriously on to the way university testing and corporate seminar testing will be done in the immediate future and beyond. Complain all you want; this is inevitable. This initial version will improve. In time, more complex and multi-modal voice agents will do the teaching too, entirely individualized as well.
fn-mote 3 days ago||
Did you make it far enough to find out about his "Docent" system for AI exams? If it's not a startup yet, he's thinking about it.

[1]: https://get-docent.com/

bsenftner 3 days ago||
Does it implement the voice assessment agent?
halestock 3 days ago||
You know AI is a great solution that will succeed on its own merits when people need to be told it's "inevitable".
philipallstar 4 days ago|
> I had prepared thoroughly and felt confident in my understanding of the material, but the intensity of the interviewer's voice during the exam unexpectedly heightened my anxiety and affected my performance. The experience was more triggering than I anticipated, which made it difficult to fully demonstrate my knowledge. Throughout the course, I have actively participated and engaged with the material, and I had hoped to better demonstrate my knowledge in this interview.

This sounds as though it was written by an LLM too.
