Posted by sethbannon 2 days ago
[...]
> Take-home exams are dead. Reverting to pen-and-paper exams in the classroom feels like a regression.
Yeah, not sure the conclusion of the article really matches the data.
Students were invited to talk to an AI. They did so, and having done so they expressed a clear preference for written exams - which can be taken under exam conditions to prevent cheating, something universities have hundreds of years of experience doing.
I know some universities started using the square wheel of online assessment during covid and I can see how this octagonal wheel seems good if you've only ever seen a square wheel. But they'd be even better off with a circular wheel, which really doesn't need re-inventing.
And they didn't even bother to test the most important thing. Were the LLM evaluations even accurate! Have graders manually evaluate them and see if the LLMs were even close or were wildly off.
This is clearly someone who had a conclusion to promote regardless of what the data was going to show.
This is not true; the professor and the TAs graded every student submission. See this paragraph from the article:
(Just in case you are wondering, I graded all exams myself and I asked the TA to also grade the exams; we mostly agreed with the LLM grades, and I aligned mostly with the softie Gemini. However, when examining the cases when my grades disagreed with the council, I found that the council was more consistent across students and I often thought that the council graded more strictly but more fairly.)
https://elevenlabs.io/app/talk-to?agent_id=agent_8101k9d1pq4...
Also, given that there's so many ways for LLMs to go off the rails (it just gave me the student id I was supposed to say, for example), it feels a bit unprofessional to be using this to administer real exams.
That does not resemble any good professor I've ever heard. It's very aggressive and stern, which is not generally how oral exams are conducted. Feels much more like I'm being cross examined in court.
https://i.imgur.com/EshEhls.png
When someone at that level pretends to not understand it, there is no way to mince words.
This is malice.
Having said that, LLMs can be good tutors if used correctly.
Some universities and professors have tried to move to a take-home exam format, which allows for more comprehensive evaluation with easier logistics than a too-brief in-class exam or an hours-long outside-of-class sitting where unreasonable expectations for mental and sometimes physical stamina are factors. That "take-home exams are dead" is self-evident, not a result of the experiment in the article. There used to be only a limited number of ways to cheat at a take-home exam, and most of them involved finding a second person who also lacked a moral conscience. Now, it's trivial to cheat at a take-home exam all by yourself.
You also mentioned the hundreds of years of experience universities have at traditional written exams. But the type and manner of knowledge and skills that must be tested for vary dramatically by discipline, and the discipline in question (computer science / software engineering) is still new enough that we can't really say we've matured the art of examining for it.
Lastly, I'll just say that student preference is hardly the way to measure the quality of an exam, or much of anything about education.
Did I say "conclusion" ? Sorry, I should have said the section just before the acknowledgements, where the conclusion would normally be, entitled "The bigger point"
That is, the author concluded that AI tools provide viable alternatives to the other available options, and which solve many of their problems.
When I was a student, I would have been quite vocal with my clear preferences for all exams being open-book and/or being able to amend my answers after grading for a revised score.
What I'm saying is, "the students would prefer..." isn't automatically case closed on what's best. Obviously the students would prefer a take-home because you can look up everything you can't recall / didn't show up to class to learn, and yes, because you can trivially cheat with AI (with a light rewrite step to mask the "LLM voice").
But in real life, people really will ask you to explain your decisions and to be able to reason about the problem you're supposedly working on. It seems clear from reading the revised prompts that the intent is to force the agent to be much fairer and easier to deal with than this first attempt was, so I don't think this is a bad idea.
Finally, (this part came from my reading of the student feedback quotes in the article) consider that the current cohort of undergrads is accustomed to communicating mainly via texting. To throw in a further complication, they were around 13-17 when COVID hit, decreasing human contact even more. They may be exceedingly nervous about speaking to anyone who isn't a very close friend. I'm sympathetic to them, but helping them overcome this anxiety with relatively low stakes is probably better than just giving up on them being able to communicate verbally.
How do you expect that to work? After the exam, you talk to your friends (and to ChatGPT) and know the correct answers even if you could have never produced them during the exam.
This was pre-LLM, but you could cheat back then too. LLMs make it a bit easier by showing you the work to "show" on your corrections.
Such classes do not have the luxury of pen-and-paper exams, and asking people to go to testing centers is a huge overkill.
Take home exams for such settings (or any other form of written exam) are becoming very prone to cheating, just because the bar to cheating is very low. Oral exams like that make it a bit harder to cheat. Not impossible, but harder.
Have exams ever been about humanity and the optics of it ?
In addition to being non-deterministic LLMs can product vastly different output from very slightly different input.
That’s ignoring how vulnerable LLMs are to prompt injection, and if this becomes common enough that exams aren’t thoroughly vetted by humans, I expect prompt attacks to become common.
Also if this is about avoiding in person exams, what prevents students from just letting their AI talk to test AI.
They mention getting 100% agreement between the LLMs on some questions and lower rates on other, so if an exam was composed of only questions where there is near 100% convergence, we'd be pretty close to a stable state.
I agree it would be reassuring to have a human somewhere in the loop, or perhaps allow the students to appeal the evaluation (at cost?) if they is evidence of a disconnect between the exam and the other criteria. But depending on how the questions and format is tweaked we could IMHO end up with something reliable for very basic assessments.
PS:
> Also if this is about avoiding in person exams, what prevents students from just letting their AI talk to test AI.
Nothing indeed. The arms race hasn't started here, and will keep going IMO.
So the whole thing is a complete waste of time then as an evaluation exercise.
>council of AIs
This only works if the errors and idiosyncrasies of different models are independent, which isn’t likely to be the case.
>100% agreement
When different models independently graded tests 0% of grades matched exactly and the average disagreement was huge.
They only reached convergence on some questions when they allowed the AIs to deliberate. This is essentially just context poisoning.
1 model incorrectly grading a question will make the other models more likely to incorrectly grade that question.
If you don’t let models see each other’s assessments, all it takes is one person writing an answer in a slightly different way that causes disagreement among models to vastly alter the overall scores by tossing out a question.
This is not even close to something you want to use to make consequential decisions.
Appeals aren't a solution either, because students won't appeal (or possibly even notice) a small bias given the variability of all the other factors involved, nor can it be properly adjucated in a dispute.
If the goal is to assess whether a student properly understood the work they submitted or more generally if they assimilated most concepts of a course, the evaluation can have a bar low enough for let's say 90% of the student to easily pass. That would give enough of margin of error to account for small biases or misunderstandings.
I was comparing to mark sheet tests as they're subject to similar issues, like students not properly understanding the wording (and usually the questions and answer have to be worded in pretty twisted ways to properly) or straight checking the wrong lines or boxes.
To me this method, and other largely scalable methods, shouldn't be used for precise evaluations, and the teachers proposing it also seem to be aware of these limitations.
Humans are incredibly good at solving problems, but while one person is solving 'how do we prevent students from cheating' a student is thinking 'how I bypass this limitation preventing me from cheating'. And when these problems are digital and scalable, it only takes one student to solve that problem for every other student to have access to the solution.
Right on point. I find particularly striking how little is said about whether the best students achieve the best grades. Authors are even candid that different LLMs asses differently, but seem to conclude that LLMs converging after a few rounds of cross reviews indicate they are plausible so who cares. The apparences are safe.
In 1789 there were 1,000 enrolled college students total, in a country of 2.8M. In 2025, it is 19M students in a country of 340M. https://educationalpolicy.org/wp-content/uploads/2025/11/251...
In 1950, 5.5% of adults ages 25-34 had completed a 4 year college degree. In 2018, it was 39%. https://www.highereddatastories.com/2019/08/changes-in-educa...
With attendance increasing at this rate (not to mention the exploding costs of tuition), it seems possible that the methods need to change as well.
The objections to SFH exist and are strikingly similar to objections to WFH, but the economics are different. Some universities already see value in offering that option, and they (of course) leave it to the faculty to deal with the consequences.
Even for distance education though, proctored testing centers have been around longer than the internet.
It is about a third of the students I teach, which amounts to several hundreds per term. It may be niche, but it is not insignificant, and definitely a problem for some of us.
> Even for distance education though, proctored testing centers have been around longer than the internet.
I don't know how much experience you have with those. Mine is extensive enough that I have a personal opinion that they are not scalable (which is the focus of the comment I was replying to). If you have hundreds of students disseminated around the world, organising a proctored exam is a logistical challenge.
It is not a problem at many universities yet, because they haven't jumped on the bandwagon. However domestic markets are becoming saturated, visas are harder to get for international students, and there is a demand for online education. I would be surprised that it doesn't develop more in the near future.
I think the end result though is that schools either limit their students to a smaller number of locations where they can have proctored exams, or they don’t and they effectively lose their credentialing value.
If you are going to set an exam that can be graded in 5-10 min, you are not getting a lot of signal out of it.
I wanted to do oral exams, but they are much more exhausting for the prof. Nominally, each student is with you for 30 min, but (1) you need to think of slightly different question for each student (2) you need to squeeze all the exams in only a couple of days to avoid giving later students too much extra time to prepare.
That's entirely false; this is why we have multiple-choice tests.
- They have a base marks of 20-25% (by random guessing) instead of 0.
- You never see the working. So you can't check if students are thinking correctly. Slightly wrong thinking can get you right answers.
- They don't even remotely reflect real life at all. Written worked through problems on the other hand - I still do those in my professional life as a scientist all the time. It's just that I am setting the questions for myself.
- The format doesn't allow for extended thought questions.
In my undergrad, I had some excellent profs who would set long work through exam question in such a way that you learned something even in the exams. Simply a joy taking those exams that gave a comprehensive walk through of the course. As a prof, I have always tried to replicate that.
Thinking deeper, though, multiple choice tests require SIGNIFICANTLY more preparation. I would go so far as to say almost all individual professors are completely unqualified to write valid multiple choice tests.
The time investment in multiple choice comes at the start - 12 hours writing it instead of 12 hours grading it - but it’s still a lot of time and frankly there is only very general feedback on student misunderstandings.
I don't believe that your argument is more than an ad-hoc value judgment lacking justification. And it's obvious that if you think so little of your colleagues, that they would also struggle to implement AI tests.
At least in Germany, if there are only 36 students in a class, usually oral exams are used because in this case oral exams are typically more efficient. For written exams, more like 200-600 students in a class is the common situation.
Some people dream that technology (preferably duly packaged by for-profit SV concerns) can and will eventually solve each and every problem in the world; unfortunately what education boils down to is good, old-fashioned teaching. By teachers. Nothing whatsoever replaces a good, talented, and attentive teacher, all the technologies in the world, from planetariums to manim, can only augment a good teacher.
Grading students with LLMs is already tone-deaf, but presenting this trainwreck of a result and framing it as any sort of success... Let's just say it reeks of 2025.
If a student is willing and desire to learn, an LLM is better than a bad teacher.
If a student doesn't want to learn, and is instead being forced to (either as a minor, or via certification required to obtain work & money), then they have every incentive to cheat. An LLM is insufficient in this case - a teacher is both the enforcer and the tutor in this case.
There's also nothing wrong with a teacher using an LLM to help with the grading imho.
One of these is not like the others.
Work study and TA jobs were abundant when I was in college. It wasn't a problem in the past and shouldn't be a problem now.
> In our new "AI/ML Product Management" class, the "pre-case" submissions (short assignments meant to prepare students for class discussion) were looking suspiciously good. Not "strong student" good. More like "this reads like a McKinsey memo that went through three rounds of editing," good...Many students who had submitted thoughtful, well-structured work could not explain basic choices in their own submission after two follow-up questions. Some could not participate at all...Oral exams are a natural response. They force real-time reasoning, application to novel prompts, and defense of actual decisions. The problem? Oral exams are a logistical nightmare. You cannot run them for a large class without turning the final exam period into a month-long hostage situation.
Written exams do not do the same thing. You can't say 'just do a written exam'. So sure, the students may prefer them, but so what? That's apples and oranges.
I went to school long before LLMs were even a Google Engineer's brianfart for the transformer paper and the way I took exams was already AI proof.
Everything hand written in pen in a proctored gymnasium. No open books. No computers or smart phones, especially ones connected to the internet. Just a department sanctioned calculator for math classes.
I wrote assembly and C++ code by hand, and it was expected to compile. No, I never got a chance to try to compile it myself before submitting it for grading. I had three hours to do the exam. Full stop. If there was a whiff of cheating, you were expelled. Do not pass go. Do not collect $200.
Cohorts for programs with a thousand initial students had less than 10 graduates. This was the norm.
You were expected to learn the gd material. The university thanks you for your donation.
I feel like i'm taking crazy pills when I read things about trying to "adapt" to AI. We already had the solution.
And why is this a flex exactly? Almost sounds like fraud. Get sold on how you'll be taught well and become successful. Pay. Then be sent through an experience that filters so severely, only 1% of people pass. Receive 100% of the blame when you inevitably fail. Repeat for the other 990 students. The "university thanks you for your donation" slogan doesn't sound too hot all of a sudden.
It's like some malicious compliance take on both teaching and studying. Which shouldn't even be surprising, considering the circumstances of the professors e.g. where I studied, as well as the students'.
Mind you, I was (for some classes) tested the same way. People still cheated, and grading stringency varied. People still also forgot everything shortly after wrapping up their finals on the given subjects and moved on. People also memorized questions and compiled a solutions book, and then handed them down to next year's class. Because this method does jack against that on its own. You still need to keep crafting novel questions, vary them more than just by swapping key values, etc.
Perhaps lifetimerubyist means "1000 students took the mandatory philosophy and ethics 101 class, but only 10 graduated as philosophy majors"
If it is, I'd be fascinated to learn more.
I mean, the logistics would be pretty wild - even a large university's largest lecture theatres might only have 500 seats. And they'd only have one or two that large. It'd be expensive as hell to build a university that could handle multiple subjects each admitting over a thousand students.
That's quite a high non-completion rate - but it's nowhere near 99%.
[1] https://nieuws.kuleuven.be/en/content/2023/42-6-of-new-stude...
Do you think you're just purchasing a diploma? Or do you think you're purchasing the opportunity to gain an education and potential certification that you received said education?
It's entirely possible that the University stunk at teaching 99% of it's students (about as equally possible that 99% of the students stunk at learning), but "fraud" is absolute nonsense. You're not entitled to a diploma if you fail to learn the material well enough to earn it.
You could easily raise the bar without sacrificing quality of education (and likely you'd improve it just from the improvement in student:teacher ratio).
In another European country, schools get paid for students that passed.
If teaching was so simple that you could just tell people to go RTFM then recite it from memory, I don't know why people are bothering with pedagogy at all. It'd seem that there's more to teaching and learning than the bare minimum, and that both parties are culpable. Doesn't sound like you disagree on that either.
> you're purchasing the opportunity to
We can swap out fraud for gambling if you like :) Sounds like an even closer analogy now that you mention!
Jokes aside though, isn't it a gamble? You gamble with yourself that you can [grow to] endure and succeed or drop out / something worse. The stake is the tuition, the prize is the diploma.
Now of course, tuition is per semester (here at least, dunno elsewhere), so it's reasonable to argue that the financial investment is not quite in such jeopardy as I painted it. Not sure about the emotional investment though.
Consider the Chinese Gaokao exam, especially in its infamous historical context between the 70s and 90s. The number of available seats was way lower than the number of applications [0]. The exams grueling. What do you reckon, was it the people's fault for not winning an essentially unspoken lottery? Who do you think received the blame? According to a cursory search, the individual and their families (wasn't there, cannot know) received the blame. And no, I don't think in such a tortured scheme it is the students' fault for not making the bar.
If there are fewer seats than what there is demand for, then that's overbooking, and you the test authoring / conducting authority are biased to artificially induce test failures. It is no longer a fair assessment, nor a fair dynamic. Conversely, passing is no longer an honest signal of qualification. Or rather, not passing is no longer an honest signal of unqualification. And this doesn't have to come from a single test, it can be implemented structurally too, so that you shed people along the way. Which is what I'm actually alluding to.
[0] ~4.8%, so ~95% of people failed it by design: https://en.wikipedia.org/wiki/Class_of_1977%E2%80%931978_%28...
I do not! A situation where roughly 1% of the class is passing suggests that some part of the student group is failing, and also that there is likely a class design issue or a failure to appropriately vet incoming students for preparedness (among, probably, numerous other things I'm not smart enough to come up with).
And I did take issue with the "fraud" framing; apologies for not catching your tone! I think there is a chronic issue of students thinking they deserve good grades, or deserve a diploma simply for showing up, in social media and I probably read that into your comment where I shouldn't have.
> Jokes aside though, isn't it a gamble?
Not at all. If you learn the material, you pass and get a diploma. This is no more a gamble than your paycheck. However, I think that also presumes that the university accepts only students it believes are capable of passing it's courses. If you believe universities are over-accepting students (and I think the evidence says they frequently are not, in an effort to look like luxury brands, though I don't have a cite at hand), then I can see thinking the gambling analogy is correct.
Yeah, that's fine, I can definitely appreciate that angle too.
As you can probably surmise, I've had quite some struggles during my college years specifically, hence my angle of concern. It used to be the other way around, I was doing very well prior to college, and would always find people's complaints to be just excuses. But then stuff happened, and I was never really the same. The rest followed.
My personal sob story aside, what I've come to find is that while yes, a lot of the things slackers say are cheap excuses or appeals to fringe edge-cases, some are surprisingly valid. For example, if this aforementioned 99% attrition rate is real, that is very very suspect. Worse still though, I'd find things that people weren't talking about, but were even more problematic. I'll have to unfortunately keep that to myself though for privacy reasons [0] [1].
Regarding grading, I find grade inflation very concerning, and I don't really see a way out. What affects me at this point though is certifications, and the same issue is kind of present there as well. I have a few colleagues who are AWS Certified xyz Engineers for example, but would stare at the AWS Management Console like a deer in the headlights, and would ask exceedingly stupid questions. The "fee extraction" practice wouldn't be too unfamiliar for the certification industry either - although that one doesn't bother me much, since I don't have to pay for these out of my own pocket, thankfully.
> If you learn the material, you pass and get a diploma. This is no more a gamble than your paycheck
I'd like to push back on this just a little bit. I'm sure it depends on where one lives, but here you either get your diploma or tough luck. There are no partial credentials. So while you can drop out (or just temporarily suspend your studies) at the end of semester, there's still stuff on the line. Not so much with a paycheck. I guess maybe a promotion is a closer analog, depending on how a given company does it (vibes vs something structured). This is further compounded by the social narrative, that if you don't get a degree then xyz, which is also not present for one's next monthly paycheck.
[0] What I guess I can mention is that I generally found the usual cycle of study season -> exam season to be very counter-productive. In general, all these "building up hype and then releasing it all at once" type situations were extremely taxing, and not for the right reasons. I think it's pretty agreeable at least that these do not result in good knowledge retention, do not inspire healthy student engagement, nor are actually necessary. Maybe this is not even a thing in better places, I don't know.
[1] I have absolutely no training in psychology or pedagogy, so take this with a mountain of salt, but I've found that people can be not just uninterested in learning, but grow downright hostile to it, often against their own self-recognized best interests. I've experienced it on myself, as well as seen it with others. It can be very difficult to snap someone out of such a state, and I have a lingering suspicion that it kind of forms a pipeline, with the lack of interest preceding it. I'm not sure that training and evaluating people in such a state results in a reasonable assessment, not for them, nor for the course they're taking.
Colleges exist to collect tuition, especially from international students who pay more. Teaching anything at all, or punishing cheating, just isn’t that important.
For comparison we had lengthy sessions in a jailed terminal, week after week, writing C programs covering specific algorithms, compiling and debugging them within these sessions and assistants would follow our progress and check we're getting it. Those not finishing in time get additional sessions.
Last exam was extremely simple and had very little weight in the overall evaluation.
That might not scale as much, but that's definitely what I'd long for, not the Chuck Norris style cram school exam you are drawing us.
> I wrote assembly and C++ code by hand, and it was expected to compile. No, I never got a chance to try to compile it myself before submitting it for grading.
Do you, like, really think this is the best way to assess someone's ability? Can't we find a place between the two extremes?
Personally, I'd go with a school-provided computer with a development environment and access to documentation. No LLMs, except maybe (but probably not) for very high-level courses.
Lots of my tests involved writing pseudocode, or "Just write something that looks like C or Java". Don't miss the semicolon at the end of the line, but if you write "System.print()" rather than "System.out.printLn()" you might lose a single point. Maybe.
If there were specific functions you need to call, it would have a man page or similar on the test itself, or it would be the actual topic under test.
I hand wrote a bunch of SQL queries. Hand wrote code for my Systems Programming class that involved pointers. I'm not even good with pointers. I hand wrote Java for job interviews.
It's pretty rare that you need to actually test someone can memorize syntax, that's like the entire point of modern development environments.
But if you are completely unable to function without one, you might not know as much as you would hope.
The first algorithms came before the first programming languages.
Sure, it means you need to be able to run the code in your head and be able to mentally "debug" it, but that's a feature
If you could not manage these things, you washed out in the CS101 class that nearly every STEM student took. The remaining students were not brilliant, but most of them could write code to solve problems. Then you got classes that could actually teach and test that problem solving itself.
The one class where we built larger apps more akin to actual jobs, that could have been done entirely in the lab with locked down computers if need be, but the professor really didn't care if you wanted to fake the lab work, you still needed to pass the book learning for "Programming Patterns" which people really struggled with and you still needed to be able to give a "Demo" and presentation, and you still needed to demonstrate that you understood how to read some requests from a "Customer" and turn it into features and requirements and UX
Nobody cares about people sabotaging their own education except in programming because no matter how much MBAs insist that all workers are replaceable, they cannot figure out a way to actually evaluate the competency of a programmer without knowing programming. If an engineer doesn't actually understand how to evaluate static stresses on a structure, they are going to have a hard time keeping a job. Meanwhile in the world of programming, hopping around once a year is "normal" somehow, so you can make a lot of money while literally not knowing fizzbuzz. I don't think the problem is actually education.
Computer Science isn't actually about using a laptop.
This applies to prose as much as code. A computer completely changes the experience of writing, for the better.
Yes, obviously people made do with analog writing for hundreds of years, yadda yadda, I still think it's a stupid restriction.
It is a sad world we live in.
My experience is the same except I think ~50% or so graduated[0].
[0]: Disclaimer that my programme was pretty competitive to get into, which is an earlier filter. Statistics looked worse for programmes at similar level with less people applying.
Also, IMO oral examinations are quite powerful for detecting who is prepared and who isn't. On the down side they also help the extroverts and the confident, and you have to be careful about preventing a bias towards those.
This is true, but it is also why it is important to get an actual expert to proctor the exam. Having confidence is good and should be a plus, but if you are confident about a point that the examiner knows is completely incorrect, you may possibly put yourself in an inescapable hole, as it will be very difficult to ascertain that you actually know the other parts you were confident (much less unconfident) in.
The old ways do not scale well once you pass a certain number of students.
You have a very weird idea of education if a teaching method that results in a 99% failure rate is seen as good by yourself. Do you imagine a professional turning out work that was 99% suboptimal?
Don't tell me about GenZ. I had oral exams in calculus as undergrad, and our professor was intimidating. I barely passed each time when I got him as examiner, though I did reasonably well when dealing with his assistant. I could normally keep my emotions in check, but not with my professor. Though, maybe in that case the trigger was not just the tone of professor, but the sheer difference in the tone he used normally (very friendly) and at the exam time. It was absolutely unexpected at my first exam, and the repeated exposure to it didn't help. I'd say it was becoming worse with each time. Today I'd overcome such issues easily, I know some techniques today, but I didn't when I was green.
OTOH I wonder, if an AI could have such an effect on me. I can't treat AI as a human being, even if I wanted to, it is just a shitty program. I can curse a compiler refusing to accept a perfectly valid borrow of a value, so I can curse an AI making my life difficult. Mostly I have another emotional issue with AI: I tend to become impatient and even angry at AI for every small mistake it does, but this one I could overcome easily.
I wish that wasn't a thing.
Interviews are similar, but different: I'm presenting myself.
In my graduate studies in Germany, most of my courses used oral exams. It's fine, and it's battle-tested.
Just like vote-counting, testing students is perfectly scalable without anything but teachers. But: In Europe, I have witnessed oral exams at the Matura, and at the final Diploma test. In the US, I understand all PhDs need a oral defense session.
To me, this mindset of delegating to AI because of laziness is perfectly embodied in "Experimenta Felicitologica" (sp?) By Stanislaw Lem.
AI is great when performing somewhat routine tasks, but for anything inherently adversarial, I'm skeptical we'll soon see good solutions. Building defeating AIs is just too inexpensive.
I wonder what that means for AI warfare.
This is a summary of sorts:
"Trurl, having decided to make the entire Universe happy, first sat down and developed a General Theory of All-Possible Happiness... Eventually, however, Trurl grew weary of the work. To speed things up, he built a great computer and provided it with a programmatic duplicate of his own mind, that it might conduct the necessary research in his stead.
But the machine, instead of setting to work, began to expand. It grew new stories, wings, and outbuildings, and when Trurl finally lost his patience and commanded it to stop building and start thinking, the machine—or rather, the Trurl-within-the-machine—replied that it couldn't possibly think yet, for it still didn't have enough room. It claimed it was currently housing the Sub-Trurls—specialized programs for General Felicitology, Experimental Hedonistics, and Happiness-Machine-Building—who were currently occupied with their quarterly reports.
The 'Clone-Trurl' told him marvelous tales of the results these sub-Trurls had already achieved in their digital simulations. Trurl, however, soon discovered that these were all cut from the same cloth of lies; not a single sub-Trurl existed, no research had been done, and the machine had simply been using its processing power to enjoy itself and expand its own architecture. In a fit of rage, Trurl took a hammer to the machine and for a long time thereafter gave up all thought of universal happiness."
It's a great allegory. A real shame there is no english translation.
When I was doing a lot of hiring we offered the option (don’t roast me, it was an alternative they could choose if they wanted) of a take-home problem they could do on their own. It was reasonably short, like the kind of problem an experienced developer could do in 10-15 minutes and then add some polish, documentation, and submit it in under an hour.
Even though I told candidates that we’d discuss their submission as part of the next step, we would still get candidates submitting solutions that seemed entirely foreign to them a day later. This was on the cusp of LLMs being useful, so I think a lot of solutions were coming from people’s friends or copied from something on the internet without much thought.
Now that LLMs are both useful and well known, the temptation to cheat with them is huge. For various reasons I think students and applicants see using LLMs as not-cheating in the same situations where they wouldn’t feel comfortable copying answers from a friend. The idea is that the LLM is an available tool and therefore they should be able to use it. The obvious problem with that argument is that we’re not testing students or applicants on their abilities to use an LLM, we’re using synthetic problems to explore their own skills and communication.
Even some of the hiring managers I know who went all in on allowing LLMs during interviews are changing course now. The LLM-assisted interviewed were just turning into an exercise of how familiar the candidate was with the LLM being used.
I don’t really agree with some of the techniques they’re using in this article, but the problem they’re facing is very real.
You've piqued my interest!
Where do we go from there? At some point soon I think this is going to have to come firmly back to real people.
Next steps are bone conduction microphones, smart glasses, earrings...
And the weeding out of anyone both honest and with social anxiety.
As an aside, I'm surprised oral exams aren't possible at 36 students. I feel like I've taken plenty of courses with more participants and oral exams. But the break even point is probably very different from country to country.
> And here is the delicious part: you can give the whole setup to the students and let them prepare for the exam by practicing it multiple times. Unlike traditional exams, where leaked questions are a disaster, here the questions are generated fresh each time. The more you practice, the better you get. That is... actually how learning is supposed to work.
If you're looking for suggestions, I'd love for you to start with a problem that isn't trivially fixable.
this is also known as 'logistical nightmare', but yeah it's the only reasonable way if you want to avoid being questioned by robots.
I think the most I experienced at the physics department in Aarhus was 70ish students. 200 sounds like a big undertaking.
They're even more possible if you do an oral exam only on the highest grades. That's the purpose, isn't it? To see if a good, very good, or excellent student actually knows what they're talking about. You can't spare 10 minutes to talk to each student scoring over 80% or something? Please
It depends on how frequent and how in-depth you want the exams to be. How much knowledge can you test in an oral exam that would be similar to a two-hour written exam? (Especially when I remember my own experience where I would have to sketch ideas for 3/4th of the time alloted before spending the last 1/4th writing frenetically the answer I found _in extremis_).
If I were a teacher, my experience would be to sample the students. Maybe bias the sample towards students who give wrong answers, but then it could start either a good feedback loop ("I'll study because I don't want to be interrogated again in front of the class") or a bad feedback loop ("I am being picked on, it is getting worse than I can improve, I hate this and I give up")
If this is the only way to keep the existing approach working, it feels like the only real solution for education is something radically different, perhaps without assessment at all
I did, however, pepper my answers with statements like "it is widely accepted that the industry standard for this concept is X". I would feel bad lying to a human, but I feel no such remorse with an AI.
No, we do not want to eliminate the pen and paper exam. It works well. We use it.
The oral exam is yet another tool. Not a solution for everything.
In our case, we wanted to ensure that the students who worked on the team project: (a) contributed enough to understand the project, (b) actually understood their own project and did not rely solely on an LLM. (We do allow them to use LLMs, it would be stupid not to.)
The students who did badly in the oral exam were exactly the students who we expected to do badly in the exam, even though they aced their (team) project presentations.
Could we do it in person? Sure, we could schedule personalized interviews for all the 36 students. With two instructors, it would have taken us a couple of days to go through. Not a huge deal. At 100 students and one instructor, we would have a problem doing that.
But the key reason was the following: research has shown that human interviewers are actually worse when they get tired, and that AI is actually better for conducting more standardized and more fair interviews. That result was a major reason for us to trust a final exam on a voice agent.
I'm not sure why you're saying this so confidently. Using LLMs on school work is like using a forklift at the gym. You'll technically finish the task you set out to do, and it will be much easier. So why not use a forklift at the gym?
>But the key reason was the following: research has shown that human interviewers are actually worse when they get tired, and that AI is actually better for conducting more standardized and more fair interviews. That result was a major reason for us to trust a final exam on a voice agent.
I think that in an "AI class" for MBA students, the material is probably not complex enough to require much more than a Zork interpreter, but if you tried this on something in which nuance is required, that comparison would change dramatically. For something like this, which is likely going to be little more than knowledge spot checks to catch the most blatant cheaters, why not just have students do multiple choice questions at a kiosk?
For the use of LLM in classes: I understand the reasoning, but I found LLMs to be extremely educational for parsing through dense material (eg parsing an NTSB report for an Uber self-driving crash). Prohibiting students from using LLMs would be counterproductive.
But I still want students to use LLMs responsibly, hence the oral exam.
In my BSc and MSc we were all basically locals who are in all aspects about the same except from the aptitude to study. In the university where I did my PhD there were much more divisions (aka diversity) in which every oral examiner would need to navigate so one group does not feel to be made preferential over another.