
Posted by sethbannon 4 days ago

Fighting Fire with Fire: Scalable Oral Exams (www.behind-the-enemy-lines.com)
219 points | 277 comments | page 2
acbart 4 days ago|
I have a lot of complicated feelings and thoughts about this, but one thing that immediately jumps to my mind: was the IRB (Institutional Review Board) consulted on this experiment? If so, I would love to know more details about the protocol used. If not, then yikes!
xmddmx 4 days ago|
Turns out that under the US Code of Federal Regulations, there's a pretty big exemption from IRB review for research on pedagogy:

CFR 46.104 (Exempt Research):

46.104.d.1 "Research, conducted in established or commonly accepted educational settings, that specifically involves normal educational practices that are not likely to adversely impact students' opportunity to learn required educational content or the assessment of educators who provide instruction. This includes most research on regular and special education instructional strategies, and research on the effectiveness of or the comparison among instructional techniques, curricula, or classroom management methods."

https://www.ecfr.gov/current/title-45/subtitle-A/subchapter-...

So while this may have been a dick move by the instructors, it was probably legal.

acbart 4 days ago||
I'm afraid you misunderstand what it means to be "exempt" under the IRB. It doesn't mean "you don't have to talk to the IRB", it means "there's a little less oversight but you still need to file all the paperwork". Here's one university's explanation[1]:

> Exempt human subjects research is a specific sub-set of “research involving human subjects” that does not require ongoing IRB oversight. Research can qualify for an exemption if it is no more than minimal risk and all of the research procedures fit within one or more of the exemption categories in the federal IRB regulations. *Studies that qualify for exemption must be submitted to the IRB for review before starting the research. Pursuant to NU policy, investigators do not make their own determination as to whether a research study qualifies for an exemption — the IRB issues exemption determinations.* There is not a separate IRB application form for studies that could qualify for exemption – the appropriate protocol template for human subjects research should be filled out and submitted to the IRB in the eIRB+ system.

Most of my research is in CS Education, and I have often been able to get my studies under the Exempt status. This makes my life easier, but it's still a long arduous paperwork process. Often there are a few rounds to get the protocol right. I usually have to plan studies a whole semester in advance. The IRB does NOT like it when you decide, "Hey I just realized I collected a bunch of data, I wonder what I can do with it?" They want you to have a plan going in.

[1] https://irb.northwestern.edu/submitting-to-the-irb/types-of-...

xmddmx 4 days ago||
The CFR is pretty clear, and I have experience with this (as an IRB reviewer, a faculty member, and a researcher). When it says "is exempt" it means "is exempt".

Imagine otherwise: a teacher who wants to change their final exam from a 50-item Scantron using A-D choices to a 50-item Scantron using A-E choices, because they think having 5 choices per item is better than 4, would need to ask for IRB approval. That's not feasible, and it's not what happens in the real world of academia.

It is true that local IRBs may try to add additional rules, but the NU policy you quote talks about "studies". Most IRBs would disagree that "professor playing around with grading procedures and policies" constitutes a "study".

It would be presumed exempt.

Are you a teacher or a student? If you are a teacher, you have wide latitude that a student researcher does not.

Also, if you are a teacher, doing "research about your teaching style", that's exempted.

By contrast, if you are a student, or a teacher "doing research" beyond your own teaching, that's probably not exempt and must go through the IRB.

acbart 3 days ago||
You would be correct, except that this is a published blog post. It may not be in an academic journal, but this person has still conducted human subjects research that led to a published artifact. It was just "playing around" until they started posting their students' (summarized, anonymized) data to the internet.
viccis 4 days ago||
>0.42 USD per student (15 USD total)

Reminder: This professor's school costs $90k a year, with over $200k total cost to get an MBA. If that tuition isn't going down because the professor cut corners to do an oral exam of ~35 students for literally less than a dollar each, then this is nothing more than a professor valuing getting to slack off higher than they value your education.

>And here is the delicious part: you can give the whole setup to the students and let them prepare for the exam by practicing it multiple times. Unlike traditional exams, where leaked questions are a disaster, here the questions are generated fresh each time. The more you practice, the better you get. That is... actually how learning is supposed to work.

No, students are supposed to learn the material and have an exam that fairly evaluates this. Anyone who has spent time on those old terrible online physics coursework sites like Mastering Physics understands that grinding away practicing exams doesn't improve your understanding of the material; it just improves your ability to pass the arbitrary evaluation criteria. It's the same with practicing leetcode before interviews. Doing yet another dynamic programming practice problem doesn't really make you a better SWE.

Minmaxing grades and other external rewards is how we got to the place we're at now. Please stop enshittifying education further.

itissid 2 days ago||
A colleague of mine raised a very important point here. The class is being taught at NYU's business school (co-taught with Konstantinos Rizakos, AI/ML Product Management). The fees are pretty high: $60,000/year ($2,000+/credit at 15 credits/sem). How much of an ask is it on the business model to spend, say, 25% of that cost (~$15,000 per student) on human evaluation: have their exams evaluated orally by a TA, or just do the damn exam in a controlled class environment?
Panos 2 days ago|
Not an issue of cost, at all.

Absolutely the easiest solution would have been to have a written exam on the cases and concepts that we discussed in class. It would take a few hours to create and grade the exam.

But at a university you should experiment and learn. What better class to experiment and learn than the “AI Product Management”. Students were actually intrigued by the idea themselves.

The key goal: we wanted to ensure that the projects that students submitted were actually their own work, not "outsourced" (in a general sense) to teammates or to an LLM.

Gemini 3 and NotebookLM with slide generation were released in the middle of the class, and we realized that it is feasible for a student to give a flawless presentation in front of the class without understanding deeply what they are presenting.

We could schedule oral exams during finals week, which would be a major disruption for the students, or schedule exams during the break, violating university rules and ruining students' vacations.

But as I said, we learned that AI-driven interviews are more structured and better than human-driven ones, because humans do get tired, and they do have biases based on who they are interviewing. That's why we decided to experiment with voice AI for running the oral exam.

rpcope1 4 days ago||
Oral quals were OK and even kind of fun with faculty who I knew and who knew me, especially in the context of grad school, where it was more of a "we know you know this but want to watch you think and haze you a little bit." Having an AI do its poor simulacrum of this sounds like absolute hell on earth, and I can't believe this person thinks it's a good idea.
bagrow 4 days ago||
If you can use AI agents to give exams, what is stopping you from using them to teach the whole course?

Also, with all the progress in video gen, what does recording the webcam really do?

SoftTalker 4 days ago|
What's stopping you from just using the AI to directly accomplish the ultimate goal, rather than taking the very indirect route of educating humans to do it?
semilin 4 days ago||||
What's the end vision here? A society of useless, catatonic humans taken care of by a superintelligence? Even if that's possible, I wouldn't call that desirable. Education is fundamental for raising competent adults.
baq 4 days ago||
Great question about what adults can be more competent about than an artificial superintelligence. ‘How to be a human’ comes to mind and not much more.
jimbokun 4 days ago||||
Yes, I feel like we still don't have a good explanation for why AI is superhuman at standalone assessments but falls down when asked to perform long-term tasks.
bagrow 4 days ago|||
Well, yes, but, perhaps shortsightedly, I assumed the goal of the professor was to teach the course.
djoldman 3 days ago||
> Gemini lowered its grades by an average of 2 points after seeing Claude's and OpenAI's more rigorous assessments. It couldn't justify giving 17s when Claude was pointing to specific gaps in the experimentation discussion.

This is to be expected. The big commercial LLMs generally respond with text that agrees with the user.

> But here's what's interesting: the disagreement wasn't random. Problem Framing and Metrics had 100% agreement within 1 point. Experimentation? Only 57%.

> Why? When students give clear, specific answers, graders agree. When students give vague hand-wavy answers, graders (human or AI) disagree on how much partial credit to give. The low agreement on experimentation reflects genuine ambiguity in student responses, not grader noise.

The disagreement between the LLMs is interesting. I would hesitate to conclude that "low agreement on experimentation reflects genuine ambiguity in student responses." It could be that it reflects genuine ambiguity on the part of the graders/LLMs as to how a response should be graded.
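The "agreement within 1 point" statistic the post reports is straightforward to compute mechanically; a minimal sketch of one way to do it (the grader names and scores below are hypothetical, not the post's actual data):

```python
from itertools import combinations

def agreement_within(scores_by_grader, tolerance=1):
    """Fraction of students for whom every pair of graders
    differs by at most `tolerance` points on this rubric category."""
    graders = list(scores_by_grader.values())
    n_students = len(graders[0])
    agree = 0
    for i in range(n_students):
        # A student counts as "agreed on" only if ALL grader pairs are close.
        if all(abs(a[i] - b[i]) <= tolerance
               for a, b in combinations(graders, 2)):
            agree += 1
    return agree / n_students

# Hypothetical scores from three LLM graders on one rubric category.
scores = {
    "claude": [17, 14, 12, 18],
    "gemini": [18, 15, 11, 18],
    "openai": [17, 13, 12, 17],
}
print(agreement_within(scores))  # 0.75: graders disagree on student 1 (14/15/13)
```

Note that this metric cannot by itself distinguish the two explanations above: a low value is consistent with both vague student answers and graders applying different partial-credit policies.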

Yossarrian22 4 days ago||
I predict that by the very next semester students will be weaponizing Reasonable Accommodation requests against any further attempts at this
jimbokun 4 days ago|
Universities are rapidly becoming useless as a signal of knowledge and competency of their graduates.
ziofill 3 days ago||
> 36 students examined over 9 days, 25 minutes average

I could accept this for a 300-student class, but 36? When I got my degree, ALL exams had an oral component, usually more than 30 minutes long. The prof and one or two TAs would take a couple of days and just do it. For 36 students it's more than doable. If I were a student being examined by an LLM I would feel like the professor didn't care enough to do the work.

siscia 3 days ago|
In general when you try a new tool or methodology you tend to start with a small class to see the results first.
bsenftner 3 days ago||
Lots of emotional commenting here. This guy, Panos Ipeirotis, is seriously on to the way university testing and corporate seminar testing will be done in the immediate future and beyond. Complain all you want; this is inevitable. This initial version will improve. In time, more complex and multi-modal voice agents will do the teaching too, entirely individualized as well.
fn-mote 3 days ago||
Did you make it far enough to find out about his "Docent" system for AI exams? If it's not a startup yet, he's thinking about it.

[1]: https://get-docent.com/

bsenftner 3 days ago||
Does it implement the voice assessment agent?
halestock 3 days ago||
You know AI is a great solution that will succeed on its own merits when people need to be told it's "inevitable".
philipallstar 4 days ago|
> I had prepared thoroughly and felt confident in my understanding of the material, but the intensity of the interviewer's voice during the exam unexpectedly heightened my anxiety and affected my performance. The experience was more triggering than I anticipated, which made it difficult to fully demonstrate my knowledge. Throughout the course, I have actively participated and engaged with the material, and I had hoped to better demonstrate my knowledge in this interview.

This sounds as though it was written by an LLM too.
