Individual humans don't care about a proof-of-work challenge if the information is valuable to them - many websites already load slowly through a combination of poor coding and spyware ad-tech. But companies care, because it turns scraping from a modest cost of doing business into a money pit.
In the earlier periods of the web, scraping wasn't necessarily adversarial, because search engines and aggregators were serving some public good. In the AI era it's become belligerent - a form of raiding content and repackaging it without credit. Proof of work as a deterrent was proposed to fight spam decades ago (Hashcash), but it's only now that it has really needed to be weaponized.
If you make it more expensive to request documents at scale, you make this type of crawling prohibitively expensive. On a small scale it really doesn't matter, but if you're casting an extremely wide net and re-fetching the same documents hundreds of times, yeah, it really does matter. Even if you have a big VC budget.
Anubis helps combat this because even if the scrapers upgrade to running automated copies of full-featured web browsers that are capable of solving the challenges (which means it costs them a lot more to scrape than it currently does), their server costs would balloon even further because each time they load a page, it requires them to solve a new challenge. This means they use a ton of CPU and their throughput goes way down. Even if they solve a challenge, they can't share the cookie between bots because the IP address of the requestor is used as part of the challenge.
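A minimal sketch of that general idea, not Anubis's actual scheme: the challenge is derived from a server secret plus the requester's IP, so a solution computed by one bot is worthless to a bot behind another address. The function names and difficulty parameter are illustrative.

```go
// Minimal sketch (not Anubis's actual scheme): the challenge is derived from a
// server secret plus the requester's IP, so a solved challenge from one
// address cannot be shared with a bot behind another address.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// challengeSeed ties the puzzle to the client's IP address.
func challengeSeed(serverSecret, clientIP string) string {
	sum := sha256.Sum256([]byte(serverSecret + "|" + clientIP))
	return hex.EncodeToString(sum[:])
}

// solve brute-forces a nonce whose hash starts with `difficulty` zero hex
// digits; difficulty controls the expected CPU cost per page load.
func solve(seed string, difficulty int) int {
	prefix := strings.Repeat("0", difficulty)
	for nonce := 0; ; nonce++ {
		h := sha256.Sum256([]byte(fmt.Sprintf("%s:%d", seed, nonce)))
		if strings.HasPrefix(hex.EncodeToString(h[:]), prefix) {
			return nonce
		}
	}
}

func main() {
	seed := challengeSeed("server-secret", "203.0.113.7") // example address
	fmt.Println("solved with nonce", solve(seed, 4))
}
```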
Saving and re-using the JWT cookie isn't that helpful, as you can effectively rate limit using the cookie as identity, so to reach the same request rates you see now they'd still need to solve hundreds or thousands of challenges per domain.
Regardless of how they solve the challenges, creating an incentive to be efficient is a victory in itself. GPUs aren't cheap either, especially not if you're renting them via a browser farm.
You can do more underneath Anubis using the JWT as a sort of session token, though, like rate limiting on a per-proof-of-work basis: if a client using token X makes more than Y requests in a period of time, invalidate the token and force them to generate a new one. This would force them to either crawl slowly or use many times more resources to crawl your content.
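A rough sketch of what that per-token rate limit could look like, assuming an in-memory counter keyed by the JWT's ID. This is not a feature Anubis ships; the type name, method name, and limits are made up for illustration.

```go
// Sketch of per-token rate limiting on top of the proof-of-work JWT.
// TokenLimiter, Allow, and the limits are illustrative, not Anubis features.
package main

import (
	"fmt"
	"sync"
	"time"
)

type TokenLimiter struct {
	mu      sync.Mutex
	limit   int           // max requests per token per window
	window  time.Duration // length of the counting window
	counts  map[string]int
	started map[string]time.Time
	revoked map[string]bool
}

func NewTokenLimiter(limit int, window time.Duration) *TokenLimiter {
	return &TokenLimiter{
		limit:   limit,
		window:  window,
		counts:  map[string]int{},
		started: map[string]time.Time{},
		revoked: map[string]bool{},
	}
}

// Allow returns false once a token exceeds its budget; the caller should then
// reject the JWT and serve a fresh proof-of-work challenge, forcing the client
// to burn CPU again before it can keep crawling.
func (l *TokenLimiter) Allow(tokenID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	if l.revoked[tokenID] {
		return false
	}
	now := time.Now()
	if start, ok := l.started[tokenID]; !ok || now.Sub(start) > l.window {
		l.started[tokenID] = now
		l.counts[tokenID] = 0
	}
	l.counts[tokenID]++
	if l.counts[tokenID] > l.limit {
		l.revoked[tokenID] = true
		return false
	}
	return true
}

func main() {
	l := NewTokenLimiter(100, time.Minute) // Y = 100 requests per minute
	fmt.Println(l.Allow("some-jwt-id"))    // true until the budget runs out
}
```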
It works in the short term, but the more people that use it, the more likely it is that scrapers start running full browsers.
Proof-of-work selects for those with the computing power and resources to do it. Bitcoin and all the other cryptocurrencies show what happens when you place value on that.
> Your visit has been flagged. Please select: Login, PoW, Cloudflare, Google.
The goal is to make web scraping infeasible because of the computational costs of OCR. It's a cat-and-mouse game right now and I want to change the odds a little. The HTML source would be effectively void without the user session, meaning OTP-like behavior could also make web pages unreadable once the assets go uncached.
This would effectively allow creating a captcha that modifies the local seed window until the user can read a specified word. "Move the slider until you can read the word Foxtrott", for example.
I sure would love to hear your input, Xe. Maybe we can combine our efforts?
My tech stack is Go, though, because it was the only language where I could easily change the webfont files directly without issues.
With the enigma webfont idea you can even just select a random seed for each user/cache session. If you map the URLs via e.g. SHA-512 hashes using the Web Crypto API, there's no cheap way of finding them out without going full-on cracking mode or full-on OCR/Tesseract mode.
And cracking everything first, wasting gigabytes of storage for every combination of rotations and seeds... well, you can try, but at that point just ask the admin for the HTML or dataset instead of trying to scrape it, you know.
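For the URL-mapping part, here's a small sketch of how the server side might derive per-session asset URLs from a seed. This assumes a design where the client recomputes the same SHA-512 digest via the Web Crypto API; the "/a/" prefix and the 16-byte truncation are arbitrary choices of mine.

```go
// Sketch of per-session URL mapping (an assumed design, not a published spec):
// asset paths are only reachable through a SHA-512 digest of the session seed
// plus the real path, so cached HTML becomes useless once the session expires.
package main

import (
	"crypto/sha512"
	"encoding/hex"
	"fmt"
)

// obfuscatedPath derives the public URL for a real asset path under a given
// session seed; a browser client could compute the same digest with the Web
// Crypto API (crypto.subtle.digest("SHA-512", ...)) to request the asset.
func obfuscatedPath(sessionSeed, realPath string) string {
	sum := sha512.Sum512([]byte(sessionSeed + "|" + realPath))
	return "/a/" + hex.EncodeToString(sum[:16]) // truncated to keep URLs short
}

func main() {
	fmt.Println(obfuscatedPath("per-user-seed", "/fonts/body.woff2"))
}
```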
Regarding accessibility: that's sadly the compromise I am willing to make if it's a technology that makes my specific projects human-eyes-only (literally). I am done bearing the costs for hundreds of idiots that are too damn stupid to clone my website from GitHub, let alone the fact that they violate every license in each of their jurisdictions. If 99% of traffic is bots, it's essentially DDoSing on purpose.
We have standards for data communication, it's just that none of these vibe coders gives a damn about building semantically correct HTML and parsers for RDF, microdata etc.
The HTML is garbage without a correctly rendered webfont that is specific to the shifts and replacements in the source code itself. The source code does not contain the correct text, only the already-shifted text.
Inside the TTF/OTF files themselves, each letter is shifted, meaning that the letters only make sense once you know the seed for the multiple shifts, and without it you cannot map the glyphs in the font 1:1 to anything in the HTML.
The web browser here is pretty easy to trick, because it will just render the glyphs available in the font and fall back to the default font for those that aren't. By design this also allows partial replacements and shifts for further obfuscation if needed; additionally, you can replace whole glyph sequences with embedded ligatures, too.
The seed can therefore be used as an instruction mapping, instead of only functioning as a byte sequence for a single static rotation (hence the Enigma reference).
How would control points in the webfont files be able to map it back?
If you use multiple rotations like in Enigma, then that is essentially the seed (e.g. 3, 74, 8, 627, whatever - shifts applied one after another). The only attack I know of would be statistical analysis of the alphabet, but that won't work once the characters include special characters outside the ASCII range, because you won't know where words start or end.
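To make that concrete, here is a toy Go sketch of a seed-driven, Enigma-style encoder over a mixed alphabet. The alphabet, the seed values, and the stepping rule are placeholders I chose for illustration; the real scheme would bake the matching inverse mapping into the generated webfont so a browser with that font still renders the original text.

```go
// Toy sketch of the "enigma" idea: the seed is a list of rotations applied in
// sequence, and the first rotor steps after every character, so identical
// letters in the plaintext map to different codepoints in the HTML.
package main

import "fmt"

// alphabet deliberately mixes ASCII letters with codepoints outside the ASCII
// range, so word boundaries are not recoverable by frequency analysis alone.
var alphabet = []rune("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz 0123456789ÆØÅäöüßαβγδ€†‡")

func encode(text string, seed []int) string {
	n := len(alphabet)
	index := map[rune]int{}
	for i, r := range alphabet {
		index[r] = i
	}
	rotors := append([]int(nil), seed...) // copy so the caller's seed is untouched
	out := make([]rune, 0, len(text))
	for _, r := range text {
		i, ok := index[r]
		if !ok {
			out = append(out, r) // leave runes outside the alphabet untouched
			continue
		}
		for _, shift := range rotors {
			i = (i + shift) % n
		}
		out = append(out, alphabet[i])
		rotors[0] = (rotors[0] + 1) % n // step the first rotor, Enigma-style
	}
	return string(out)
}

func main() {
	fmt.Println(encode("HELLO WORLD", []int{3, 74, 8, 627}))
}
```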
Let's assume that "HELLO" is remapped under your scheme. You would have a base font that is used to dynamically generate the mangled fonts, and it surely has at least four glyphs, which I'll refer to as gH, gE, gL and gO (let's ignore advanced features and ligatures for now). Your scheme, for example, would instead map, say, the decimal digit 1 to gH, 2 to gE, 3 to gL and 4 to gO, so that the HTML contains "12334" instead of "HELLO". Now consider which attacks are possible.
The most obvious attack, as you have considered, is to ignore the HTML and only deal with the rendered page. This is indeed costly compared to the other attacks, but not very expensive either, because the base font should have been neutral enough in the first place. Neutral, regular typefaces are the ideal input for OCR, and this has already been exploited massively in fax documents (search keyword: JBIG2). So I don't think this ultimately poses a blocker for crawlers, even though it will indeed be very annoying.
But if the attacker does know the webfonts are generated dynamically, they can look at the font itself and directly derive the mapping instead. As I've mentioned, the glyphs therein would be very regular and can easily be recognized, because single-glyph OCR (search keyword: MNIST) is much simpler than full-text OCR, where you first have to detect letter-like areas. The attacker will render each glyph to a small virtual canvas, run OCR, and generate a mapping to undo your substitution cipher.
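Structurally, that attack is small. A hedged sketch in Go: the font parsing uses golang.org/x/image/font/sfnt (a real package), while recognizeGlyph is a hypothetical stand-in for whatever single-glyph classifier the attacker plugs in, and the file name is illustrative.

```go
// Sketch of the glyph-level attack: for each codepoint that appears in the
// scraped HTML, look up the glyph it maps to in the mangled webfont, classify
// that glyph, and build a table that undoes the substitution.
package main

import (
	"fmt"
	"os"

	"golang.org/x/image/font/sfnt"
)

// recognizeGlyph is hypothetical: this is where the attacker would rasterize
// the glyph outline to a small canvas and run a single-glyph classifier.
func recognizeGlyph(f *sfnt.Font, gi sfnt.GlyphIndex) rune {
	return '?' // placeholder result
}

func buildReverseMap(f *sfnt.Font, htmlText string) map[rune]rune {
	var buf sfnt.Buffer
	reverse := map[rune]rune{}
	for _, cp := range htmlText {
		if _, done := reverse[cp]; done {
			continue // cache: each codepoint only has to be recognized once
		}
		gi, err := f.GlyphIndex(&buf, cp)
		if err != nil || gi == 0 {
			continue // codepoint not covered by the mangled font
		}
		reverse[cp] = recognizeGlyph(f, gi)
	}
	return reverse
}

func main() {
	data, err := os.ReadFile("mangled.otf") // path is illustrative
	if err != nil {
		panic(err)
	}
	f, err := sfnt.Parse(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(buildReverseMap(f, "12334"))
}
```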
Since the cost of this attack is proportional to the number of glyphs, the next countermeasure would be adding more glyphs to make it a polyalphabetic cipher: both 3 and 5 would map to gL and the HTML would contain "12354" instead. But it doesn't scale well, especially because OpenType has a limit of 65,535 glyphs. Furthermore, you have to make each of them unique so that the attacker has to run OCR on every glyph (say, 3 maps to gL and 5 maps to gL', which is only slightly different from gL); otherwise they can cache the previously seen glyph. So the generated font would have to be much larger than the original base font! I have seen multiple such fonts in the wild and almost all of them are for CJKV scripts, and those fonts are harder to deploy as webfonts for exactly the same reason. Even Hangul, with only ~12,000 letters, poses a headache for deployment.
This attack also applies to ligatures, by the way, because OpenType ligatures are just composite glyphs plus substitution rules. So you have the same 65,535-glyph limit [1], and it is trivial to segment two or more letters out of those composite glyphs anyway. The only countermeasure would therefore be describing and mangling each glyph independently, and that would take even more bytes to deploy.
[1] This is the main reason Hangul needlessly suffers in this case too. Hangul syllables can be generated from a very simple algorithm, so you need fewer than 1,000 glyphs to make a functional Hangul font, but OpenType requires one additional glyph for each composite glyph, so all Hangul fonts need many more glyphs even though the composite glyphs would be algorithmically simple.
I don't think mangling the text would help you; they will just hit you anyway. The traffic patterns seem to indicate that whoever programmed these bots just... <https://www.youtube.com/watch?v=ulIOrQasR18>
> I sure would love to hear your input, Xe. Maybe we can combine our efforts?
From what I've gathered, they need help making this project more sustainable for the near and far future, not with adding more features. Anubis seems to be doing an excellent job already.
It is really sad that the worldwide web has been taken to the point where this is needed.
Seems like a good solution to badly behaved scrapers, and I feel like the web needs to move away from the client-server model towards a swarm model like BitTorrent anyway.
* the server appears on the outside as an HTTPS server/reverse proxy
* the server supports self-signed certificates or Let's Encrypt
* when a client goes to a certain (sub)site or route, HTTP auth can be used
* after HTTP auth, all traffic tunneled over that subsite/route is protected against traffic analysis, for example like obfsproxy does it
Does anyone know something like that? I am tempted to ask xeiaso to add such features, but I do not think his tool is meant for that...
> his
I believe it's their.
In any situation, you're going to need some custom client code to route your traffic through the tunnel you opened, so I'm not sure why the login page that opens the tunnel needs to be browser-compatible?
What is the problem with bots asking for traffic, exactly?
Context of my perspective: I am a contractor for a team that hosts thousands of websites on a Kubernetes cluster. All of the websites are on a storage cluster (combination of ZFS and Ceph) with SATA and NVMe SSDs. The machines in the storage cluster and also the machines the web endpoints run on have tons of RAM.
We see a lot of traffic from what are obviously scraping bots. They haven't caused any problems.
So the point is not to be faster than the bear. It’s to be faster than your fellow campers.
$ mkdir -p ./tmp/anubis/static && anubis --extract-resources=./tmp/anubis/static