HackerRank open sourced its ATS. My resume scored 90/100. Oh wait 74. No

Posted by sambellll 18 hours ago

HackerRank open sourced its ATS. My resume scored 90/100. Oh wait 74. No – 88 (danunparsed.com)

902 points | 386 comments | page 3

achalxyz 4 hours ago|

If I know the truth value of p and I also know p=>q, then an LLM would be able to deduce the truth value of q - even if the statements aren’t exactly in this form. Generally, LLMs are good with logical inference.

But logical inference itself is limited. You still have to find out if p is true or not - the ground truth.

How do you find that? You would be able to define in the prompt that if the resume has p, infer q and do this. But determining the truth value of p is something an LLM cannot do.

It’s not a limitation of the LLM. It’s the limitation of logic itself. You take 10 humans and give them the resumes with the same rubrics as the LLM. You’ll get a similar range of scores because everyone would assign different values.

The issue is not in logical inference. It’s in determining the value of p, which takes much more than logic. And current LLMs are limited to being logical.

rsanek 7 hours ago||

It's fair to call out issues with the tool. But I think for individuals searching for jobs, using LLMs as the scapegoat for why it's hard to find a role is not terribly helpful.

In my experience, cold-applying has always worked essentially as a black hole, and LLMs haven't changed that much. The reality is that alternative avenues are always necessary to get the job you want. That could be a third-party recruiter; reaching out to a hiring manager on LinkedIn; or using your network to get referrals. Those continue to work whether the company is using a bone-headed tool like this or not.

us-merul 7 hours ago|

I entered an interview with a hiring manager where they had received a "summary" of my resume that contained information blatantly not in my experience. The recruiter claimed they mixed my name up with another applicant, but the summary the hiring manager showed me had parts that were correct.

tasuki 13 hours ago||

> Sometimes my projects “lack architectural complexity”

Well done you! It is difficult to avoid architectural complexity, but imho well worth it.

bartread 10 hours ago||

The takeaway from this for me is that using an LLM to score anything takes multiple (maybe even many) runs, and the result you’ll get is, at best, a sane-ish distribution.

Which sort of sounds workable until you scale it up to larger datasets, where at some point compute/time/energy costs will render it non-viable.

I am sure there’s some reasonable rule of thumb estimation on distribution that could be applied based off fewer runs per data artifact, but you’re always going to be trading off against confidence by doing this.

Beyond this, I’d bet that almost no implemented systems that use LLMs for scoring, ranking, or decision making use such a multi-run approach. Partly because people don’t understand their behaviour is stochastic, perhaps because a lot of people without a background in statistics don’t understand what stochastic actually means, and no doubt partly because of budget concerns: if you have to ask an LLM to do the same thing 10, 50, 100 times to get a sufficiently good result, then the cost saving argument is either weakened or completely destroyed.
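A minimal sketch of the multi-run idea, assuming a hypothetical `noisy_score` function that stands in for a single LLM scoring call (it is not a real API, and its noise level is made up). It scores the same resume several times and reports the spread rather than a single number:

```python
import random
import statistics


def noisy_score(resume: str, rng: random.Random) -> float:
    """Hypothetical stand-in for one LLM scoring call: base score plus noise."""
    base = 60 + (len(resume) % 30)  # arbitrary deterministic part
    return max(0.0, min(100.0, rng.gauss(base, 6.0)))


def score_distribution(resume: str, runs: int, seed: int = 0) -> dict:
    """Score the same resume `runs` times and summarise the distribution."""
    rng = random.Random(seed)
    scores = [noisy_score(resume, rng) for _ in range(runs)]
    stdev = statistics.stdev(scores)
    return {
        "runs": runs,
        "mean": statistics.fmean(scores),
        "stdev": stdev,
        "stderr": stdev / runs ** 0.5,  # uncertainty of the mean
        "min": min(scores),
        "max": max(scores),
    }


summary = score_distribution("x" * 10, runs=20)
print(summary)
```

The cost side is just multiplication: 1,000 resumes at 20 runs each is 20,000 calls, which is where the cost-saving argument starts to weaken.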

There is at least one more aspect worth considering in the specific case of resumes/CVs: is the inconsistency of scoring by LLM worse than the inconsistency of scoring by a human following a similar process?

Because the reality is that, even for an experienced recruiter, reviewing hundreds or thousands of resumes or CVs gets pretty fatiguing. People get hungry, bored, tired, restless, irritable, etc.

That inevitably leads to inconsistencies creeping in, so there’s always an element of “luck” (or, perhaps better, uncertainty) as to whether your resume/CV passes screening.

So is that inconsistency better or worse with LLM screening? I don’t know. But, at least, if it’s not worse maybe it doesn’t matter for this specific use case. And if it’s notably better then maybe it’s raised the bar on what “good enough” screening looks like?

(And I’m sure other use cases warrant similar, “does it matter?”, questions, with the answers no doubt landing differently.)

CuriouslyC 9 hours ago|

My experience with benchmarks and evals is that it can take ~20 runs of a problem for the distribution of answers to start to converge. Ideally you'd know the convergence properties of your algorithm ahead of time and make a Bayesian solution that makes the uncertainty explicit.
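One way to make the uncertainty explicit, sketched under strong assumptions: scores are treated as normally distributed around a "true" score with a known noise standard deviation, and the prior is a guess. All the numbers here (prior mean 50, prior sd 25, noise sd 6) are illustrative, not measured:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Belief:
    mean: float
    var: float


def update(belief: Belief, observed: float, noise_var: float) -> Belief:
    """Conjugate normal update with known observation noise variance."""
    precision = 1.0 / belief.var + 1.0 / noise_var
    mean = (belief.mean / belief.var + observed / noise_var) / precision
    return Belief(mean=mean, var=1.0 / precision)


def posterior(scores, prior=Belief(50.0, 25.0 ** 2), noise_sd=6.0) -> Belief:
    """Fold a sequence of run scores into the prior, one run at a time."""
    belief = prior
    for score in scores:
        belief = update(belief, score, noise_sd ** 2)
    return belief


def credible_interval(belief: Belief, z: float = 1.96):
    """Approximate 95% interval for the true score under the normal model."""
    half = z * belief.var ** 0.5
    return belief.mean - half, belief.mean + half


belief = posterior([70.0, 72.0, 68.0, 71.0])
print(belief, credible_interval(belief))
```

The interval narrows as runs accumulate, so you can stop once it is tight enough for the decision at hand instead of fixing the run count up front.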

saidnooneever 12 hours ago||

Count to three, no more, no less. Four shalt thou not count, neither count thou two—excepting that thou then proceed to three. Five is right out.

Arch-TK 6 hours ago||

The list of "bonus" criteria and how they come about makes me feel sick.

I am not currently looking for employment, nor am I currently particularly worried about future prospects if I was suddenly in the position of looking for employment.

But if I ended up in a position with nothing to lean on but scattering my CV everywhere, well…

A lot of my major contributions are littered across the internet, private, or even just verbal/consultancy. They're things I did for free, in my spare time.

I also avoid GitHub. If you just look at my GitHub page for extra context, you would likely miss that delivering that very GitHub page likely involved a few bits of code I wrote.

Now, I could do a better job of trying to document this stuff, so it could be easier to find… But also I can't quite imagine how that would work.

bsoles 6 hours ago||

> 35 points for open source contributions

> 30 for personal projects

These are insane weights for scoring a software engineer's resume.

morphology 5 hours ago|

Insane how? I would expect more points for open source contributions. It is trivial to create a personal project, but that does not carry with it any indicator of quality. Having your work accepted by other maintainers is one indicator at least.

morphology 5 hours ago||

It's funny that even after all these years and all this money invested in technology, we still haven't come up with anything better than word-of-mouth for hiring great people. Many serial founders have said that, despite the most stringent interview processes and the most sophisticated filtering pipelines, they still have a higher hit rate with people they've worked with in the past.

This isn't to diminish the whispernet. Rather, it shows just how many important signals cannot be quantized.

makeavish 2 hours ago|

True, I have found it to be valid as well

jedimastert 8 hours ago||

The blog post itself has pretty strong un-copy-edited ChatGPT vibes.

passivepinetree 1 hour ago|

Yeah, this type of thing makes me sad. It's a good idea and the work behind it is interesting, but there's something magic about a human voice. It deadens me a little bit reading this type of writing, which I'm seeing increasingly in both my work and personal life.

davidpapermill 13 hours ago|

A better way to reformulate this problem is for the LLM to be tasked with making a _comparative_ judgement between two CVs. This should prove much more reliable, especially if you give it a third “too close to call” option. You can also ask for clear justifications of preference.
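A rough sketch of what that interface could look like. The `toy_judge` function is a hypothetical stand-in for the LLM call (a keyword count, purely so the example runs); the point is the three-way verdict plus a justification:

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    FIRST = "first"
    SECOND = "second"
    TOO_CLOSE = "too close to call"


@dataclass(frozen=True)
class Judgement:
    verdict: Verdict
    justification: str


def toy_judge(cv_a: str, cv_b: str, margin: int = 2) -> Judgement:
    """Hypothetical stand-in for an LLM comparison: counts keyword mentions."""
    keywords = ("python", "rust", "kubernetes", "postgres", "golang")
    hits_a = sum(k in cv_a.lower() for k in keywords)
    hits_b = sum(k in cv_b.lower() for k in keywords)
    detail = f"keyword hits {hits_a} vs {hits_b}"
    if abs(hits_a - hits_b) < margin:
        # Differences smaller than the margin are not treated as a preference.
        return Judgement(Verdict.TOO_CLOSE, detail + f", within margin {margin}")
    winner = Verdict.FIRST if hits_a > hits_b else Verdict.SECOND
    return Judgement(winner, detail)


print(toy_judge("Python, Rust, Kubernetes", "Excel"))
print(toy_judge("Python", "Rust"))
```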

srdjanr 12 hours ago|

That's a good idea.

The only drawback I see is that you should compare every pair of CVs for best results, and that grows quadratically with the number of CVs. Of course you can settle for fewer comparisons and not perfect results. But then I'm not sure if you can hit a good ratio of quality and token spend.
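To put numbers on the quadratic growth: n CVs give n(n-1)/2 distinct pairs. A quick check, assuming each pair is compared exactly once:

```python
from math import comb


def pairwise_comparisons(n: int) -> int:
    """Number of distinct pairs among n CVs, i.e. n * (n - 1) / 2."""
    return comb(n, 2)


for n in (10, 100, 1000):
    print(n, pairwise_comparisons(n))
# 10 45
# 100 4950
# 1000 499500
```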

skribb 12 hours ago|||

Could probably do an elo system and sample pairs. E.g.

1. Set the Elo rating of all CVs to 1000

2. Randomly pair up CVs and compare. Winners gain elo, losers lose elo.

3. Repeat #2 for a few iterations, then remove bottom X% of CVs.

4. Repeat 2-3 until the number of remaining CVs is small enough to do an exhaustive comparison.

I don't have a mathematical proof, but I suspect that this is a decent cost-effective approximation of comparing every pair (depending on the parameters)
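The steps above could be sketched roughly like this. The K-factor of 32, the 25% cut, five rounds per cut, and the `compare` callback (standing in for the LLM comparison) are all assumptions picked for illustration:

```python
import random


def expected(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of a against b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_shortlist(cvs, compare, rounds=5, cut=0.25, keep=4, k=32.0, seed=0):
    """compare(a, b) returns 1.0 if a wins, 0.0 if b wins, 0.5 for a tie.

    CV identifiers must be unique and hashable.
    """
    rng = random.Random(seed)
    ratings = {cv: 1000.0 for cv in cvs}  # step 1
    pool = list(cvs)
    while len(pool) > keep:  # step 4
        for _ in range(rounds):  # step 3: a few iterations of step 2
            rng.shuffle(pool)
            # Step 2: random pairs; with an odd pool one CV sits the round out.
            for a, b in zip(pool[::2], pool[1::2]):
                score_a = compare(a, b)
                exp_a = expected(ratings[a], ratings[b])
                ratings[a] += k * (score_a - exp_a)
                ratings[b] += k * ((1.0 - score_a) - (1.0 - exp_a))
        pool.sort(key=ratings.__getitem__, reverse=True)
        drop = max(1, int(len(pool) * cut))  # remove bottom X%
        pool = pool[: max(keep, len(pool) - drop)]
    return pool, ratings


cvs = [f"cv{i:02d}" for i in range(16)]
# Toy comparison: the higher-numbered CV always wins.
shortlist, ratings = elo_shortlist(cvs, lambda a, b: 1.0 if a > b else 0.0)
print(shortlist)
```

With 16 CVs this runs far fewer comparisons per CV than the 120 pairs an exhaustive pass needs at larger sizes, and the final shortlist is small enough to compare exhaustively. Whether the approximation is good enough depends on the parameters and on how noisy `compare` is.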

swiftcoder 11 hours ago||||

> you should compare every pair of CVs for best results

Or compare each one to a reference set? Take 5 resumes of existing employees, rank all candidates against that set, maybe you get some useful level prediction into the bargain
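A small sketch of the reference-set variant, assuming a hypothetical `prefers(a, b)` callback in place of the LLM comparison. Each candidate is compared only against the fixed references, so the cost is linear in the number of candidates rather than quadratic:

```python
def level_against_references(candidate, references, prefers) -> int:
    """Count how many reference resumes the candidate is judged stronger than."""
    return sum(1 for ref in references if prefers(candidate, ref))


def rank_candidates(candidates, references, prefers):
    """Return (level, candidate) pairs, strongest first."""
    scored = [
        (level_against_references(c, references, prefers), c) for c in candidates
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Toy data: numbers stand in for resumes, and "stronger" is just "larger".
references = [2, 4, 6, 8, 10]
ranking = rank_candidates([3, 9, 5], references, lambda a, b: a > b)
print(ranking)  # [(4, 9), (2, 5), (1, 3)]
```

The level (0 to 5 here) is the rough "level prediction": a candidate who beats four of five existing employees' resumes lands near the top of the reference range.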

davidpapermill 9 hours ago|||

I'd just do a quick filter, probably deterministic, then perform a deeper comparison on the selected few.
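As a sketch of that two-stage shape: a deterministic keyword filter first, then every pair among the survivors goes to the expensive comparison. The required-terms rule is only an assumed example of a "quick filter", not a recommendation for what the filter should check:

```python
import re
from itertools import combinations


def quick_filter(cvs: dict, required: set) -> list:
    """Deterministic pass: keep CV ids whose text mentions every required term."""
    kept = []
    for cv_id, text in cvs.items():
        words = set(re.findall(r"[a-z0-9+#]+", text.lower()))
        if required <= words:
            kept.append(cv_id)
    return kept


cvs = {
    "a": "Python and Postgres, 5 years",
    "b": "Java only",
    "c": "python postgres rust",
}
selected = quick_filter(cvs, {"python", "postgres"})
deep_pairs = list(combinations(selected, 2))  # pairs for the deeper comparison
print(selected, deep_pairs)  # ['a', 'c'] [('a', 'c')]
```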
