Posted by tigerlily 14 hours ago

Google's AI is being manipulated. The search giant is quietly fighting back (www.bbc.com)
246 points | 171 comments | page 3
JKCalhoun 12 hours ago|
Yeah, the internet seems like a big poison pill. Training on the whole internet feels like citing the National Enquirer (or the Daily Mail?) for a school essay.

Having an archive of "curated" training data seems like it is going to be important. Otherwise you need "AS" (artificial skepticism) introduced into future models. ("But I read it on the internet!", ha ha.)

Or perhaps there are ways to bucket training data such that the model is aware of which data leans factual (quantifiable) and which data leans opinion (fuzzy, qualifiable?).
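A minimal sketch of what that bucketing could look like, assuming a hand-maintained map from source domain to a bucket label. The domains, the `SOURCE_BUCKETS` table, and the `[bucket=...]` marker format are all hypothetical, made up for illustration; this is not how any particular lab tags its data.

```python
from dataclasses import dataclass

# Hypothetical, hand-curated map from source domain to a reliability bucket.
SOURCE_BUCKETS = {
    "example-journal.org": "factual",
    "example-tabloid.com": "opinion",
}


@dataclass
class Document:
    domain: str
    text: str


def tag_document(doc: Document, buckets: dict[str, str] = SOURCE_BUCKETS) -> str:
    """Prefix the text with a bucket marker; unknown domains get 'unlabeled'."""
    bucket = buckets.get(doc.domain, "unlabeled")
    return f"[bucket={bucket}] {doc.text}"
```

The idea is only that the marker travels with the text, so whatever consumes the data can tell which bucket a passage came from. Deciding who assigns the labels is the hard part and is not addressed here.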

(I recently asked Claude about the existence of ball lightning, spontaneous human combustion. I got replies that ultimately did not leave me satisfied. It's probably just as well that I read this article though—I now have an even stronger degree of skepticism with regard to their replies—specifically, I suppose, with topics that are likely to be biased.)

(I'm not quite convinced from the article though that Google is "fighting back". In fact, this feels like another moment where a "player" could try to establish their LLM as more factual. Is that the row Grok is trying to hoe? Or is Grok just trying to be anti-woke?)

dijksterhuis 12 hours ago||
> Having an archive of "curated" training data seems like it is going to be important

the justification for not doing that is probably "prohibitively expensive given the amount of data involved". they'd need a bunch of human reviewers combing through massive troves of data. it's probably cheaper to "sort of fix" it after the fact.

> perhaps there's ways to bucket training data such that the model is aware of which data leans factual (quantifiable) and which data leans opinion (fuzzy, qualifiable)

as a lecturer once said to me about my idea for a masters dissertation project that would classify news sites based on right/left tendencies -- "that sounds dangerously political". especially given the current let's all shout at each other political climate.

aside: someone built this and it was a fully fledged company, which has always annoyed me.

JKCalhoun 10 hours ago||
"…they'd need a bunch of human reviewers combing through massive troves of data…"

Yeah, I concede that. It doesn't need to be done overnight, though. You could keep a static repo of data and work through it over time (years), removing some data and adding pre-curated data. After enough years you could have a pretty good "reference dataset".

gowld 10 hours ago||
I think some of the thousands of people working on training LLMs have tried some of the low-hanging-fruit ideas we can brainstorm off the top of our heads 5 years later.
ajross 11 hours ago||
> Training on the whole internet feels like citing the National Enquirer

It's not, though, because the refutations are in the training data too. This isn't actually the problem being described.

The weights in the LLM are fine. It's that the task the LLM is being asked to do is to search and summarize new content that isn't in its training data. And it does it too much like a naive reader and not enough like a cynical HN commenter.

But that's a problem with prompt writing, not training. It's also of a piece with most of the other complaints about current AI solutions, really: AI still lacks the "context" that an experienced human is going to apply, so it doesn't know when it's supposed to reason and when it's supposed to repeat.

If you were to ask it "Is this site correct or is it just spin?" it will probably get it right. But it doesn't know to ask itself that question if it's not in the prompt somewhere.
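A rough sketch of putting that question into the prompt for every retrieved page, assuming the search-and-summarize step is just string assembly before a model call. `build_summary_prompt` and the prompt wording are hypothetical; no real search or model API is called here.

```python
# The question quoted above, asked once per retrieved source.
SKEPTIC_QUESTION = "Is this site correct or is it just spin?"


def build_summary_prompt(query: str, pages: list[tuple[str, str]]) -> str:
    """Assemble a prompt from (url, text) pairs, adding the skeptic question."""
    lines = [f"User question: {query}", ""]
    for i, (url, text) in enumerate(pages, start=1):
        lines.append(f"Source {i} ({url}):")
        lines.append(text)
        lines.append(f"Before using source {i}, answer: {SKEPTIC_QUESTION}")
        lines.append("")
    lines.append("Summarize only what survives that check.")
    return "\n".join(lines)
```

Whether a model actually behaves more skeptically with this kind of wrapper is an empirical question; the sketch only shows where the question would go.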

JKCalhoun 10 hours ago||
"…the LLM is being asked to do is to search and summarize new content that isn't in its training data…"

If it fails at that, then it is a pretty significant problem. If, as you said earlier, "the refutations are in the training data too", then the LLM should in fact be able to use "both sides" and land with a little more confidence when presented with new data.

(Hopefully your point regarding prompting issues is resolved then.)

ajross 9 hours ago||
Well, yeah, "should be" and "does" are different and this is new technology and has bugs and misfeatures and different limitations than what came before, and the market will have a learning curve as we all adapt.

I was just refuting your contention that this is somehow inherent in the idea of "training", and it's not.

sva_ 9 hours ago||
Creative ways of dropping your site's pagerank
tencentshill 12 hours ago||
It's all over the place. It's the new SEO. Marketing scumbags don't care.

https://www.hubspot.com/aeo-grader

https://enterprise.semrush.com/solutions/ai-optimization/

NoSalt 9 hours ago||
Whose AI isn't being manipulated???
BrenBarn 3 hours ago||
> Google and other AI companies are now trying to fix the problem.

There is one simple way to do that and that is to JUST GET RID OF THE AI CRAP.

nonameiguess 9 hours ago||
This feels like a basic critical thinking/epistemology thing that you (hopefully) pick up at some point in life, usually from experience finding reliable, canonical primary sources for data. You can't do that for everything. Being wrong about trivial factoids isn't the end of the world. You should, however, at least be capable of doing further investigation, realizing that Major League Eating has its own website, and that there is no event in South Dakota sanctioned by them. If you look at actual results, or even just think for a few seconds, you'd also realize that 7.5 hot dogs in 10 minutes is bush-league level nonsense that would not win a local church contest, let alone an international championship. That may not be obvious to all users of the Internet, but it would be if you've ever watched a real contest, looked at the results for a real contest, or tried yourself to eat a high volume of hot dogs rapidly. You only need to do it once in your life and a basic smell alarm should go off in your head forever if someone puts out a claim that is very far from something you know to be true.

This is what human reasoning is and we're supposed to be good at it. At its best, this is what any reasonable education should do for you if you take it at all seriously, arming you with some capacity for doing prima facie sanity checks of poorly sourced claims.

csomar 8 hours ago||
I wrote about this a few months ago: https://codeinput.com/blog/google-seo

The tl;dr is: if you can rank within the top 20 results for the grounding query, you can poison the LLM “overview”, provided you convince it your information is legitimate.
