Posted by todsacerdoti 4/19/2025
If instead we had a content addressed model, we could drop the uniqueness constraint. Then these AI scrapers could be gossiping the data to one another (and incidentally serving it to the rest of us) without placing any burden on the original source.
Having other parties interested in your data should make your life easier (because other parties will host it for you), not harder (because now you need to work extra hard to host it for them).
I know that, as far as possible, it's a good idea to have content-immutable URLs. But at some point I need to make www.myexamplebusiness.com show new content. How would that work?
But as for updating, you just format your URLs like so: {my-public-key}/foo/bar
And then you alter the protocol so that the {my-public-key} part resolves to the merkle-root of whatever you most recently published. So people who are interested in your latest content end up with a whole new set of hashes whenever you make an update. In this way, it's not 100% immutable, but the mutable payload stays small (it's just a bunch of hashes) and since it can be verified (presumably there's a signature somewhere) it can be gossiped around and remain available even if your device is not.
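To make that a bit more concrete, here's a rough sketch of what such a mutable pointer record might look like, assuming Ed25519 signatures via the pyca/cryptography package; the record shape and field names are just illustrative, not any existing protocol:

```python
# Hypothetical sketch (not an existing protocol): a tiny signed record mapping
# a public key to the latest merkle root, so {my-public-key}/foo/bar can resolve
# to fresh hashes while the content itself stays content-addressed.
import json
import time

from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
    Ed25519PublicKey,
)
from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat


def publish_pointer(private_key: Ed25519PrivateKey, merkle_root: str) -> dict:
    """The author signs a small payload naming the current merkle root."""
    payload = json.dumps(
        {"merkle_root": merkle_root, "seq": int(time.time())}, sort_keys=True
    ).encode()
    return {
        "payload": payload,
        "signature": private_key.sign(payload),
        "public_key": private_key.public_key().public_bytes(
            Encoding.Raw, PublicFormat.Raw
        ),
    }


def resolve_pointer(record: dict) -> dict:
    """Any node gossiping the record can verify it without contacting the author."""
    Ed25519PublicKey.from_public_bytes(record["public_key"]).verify(
        record["signature"], record["payload"]  # raises InvalidSignature if tampered
    )
    return json.loads(record["payload"])


key = Ed25519PrivateKey.generate()
record = publish_pointer(key, merkle_root="ab" * 32)  # placeholder root hash
print(resolve_pointer(record))
```

The signed payload is tiny, so any interested node can cache and re-serve it, and consumers only trust it after checking the signature against the key baked into the URL.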
You can soft-delete something just by updating whatever pointed to it to not point to it anymore. Eventually most nodes will forget it. But you can't really prevent a node from hanging on to an old copy if they want to. But then again, could you ever do that? Deleting something on the web has always been a bit of a fiction.
True in the absolute sense, but the effect size is much worse under the kind of content-addressable model you're proposing. Currently, if I download something from you and you later delete that thing, I can still keep my downloaded copy; under your model, if anyone ever downloads that thing from you and you later delete that thing, with high probability I can still acquire it at any later point.
As you say, this is by design, and there are cases where this design makes sense. I think it mostly doesn't for what we currently use the web for.
It's the same functionality you get with permalinks and sites like archive.org--forgotten unless explicitly remembered by anybody, dynamic unless explicitly a permalink. It's just built into the protocol rather than a feature to be inconsistently implemented over and over by many separate parties.
AI scrapers aren't trying to find things they already know exist, they're trying to discover what they didn't know existed.
But there's a lot of middle ground to explore here. Loading a modern web page involves making dozens of requests to a variety of different servers, evaluating some JavaScript, and then doing it all again a few times, potentially moving several MB of data. The part people want, the thing you don't already know exists, is hidden behind that rather heavy door. It doesn't have to be that way.
If you already know about one thing (by its cryptographic hash, say) and you want to find out which other hashes it's now associated with--associations that might not have existed yesterday--that's much easier than we've made it. It can be done:
- by moving kB, not MB: we're just talking about a tuple of hashes here, maybe a public key and a signature
- without placing additional burden on whoever authored the first thing; they don't even have to be the ones who published the pair of hashes your scraper is interested in
Once you have the second hash, you can then reenter immutable-space to get whatever it references. I'm not sure if there's already a protocol for such things, but if not then we can surely make one that's more efficient and durable than what we're doing now.
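For a sense of scale, here's a toy sketch (made-up record layout, not an existing protocol) of the kind of index a gossiping node could keep to answer "which other hashes is this hash now associated with?"; each association is on the order of a hundred bytes, not megabytes:

```python
# Toy sketch with a made-up record layout: each announcement is just
# (known_hash, new_hash, author_pubkey, signature), small enough to gossip freely.
from collections import defaultdict


class LinkIndex:
    def __init__(self):
        # known content hash -> set of (new_hash, author_pubkey, signature)
        self._links = defaultdict(set)

    def announce(self, known_hash: bytes, new_hash: bytes,
                 author_pubkey: bytes, signature: bytes) -> None:
        # A real node would verify the signature before accepting or re-gossiping.
        self._links[known_hash].add((new_hash, author_pubkey, signature))

    def lookup(self, known_hash: bytes) -> set:
        return self._links[known_hash]


index = LinkIndex()
index.announce(b"\x11" * 32, b"\x22" * 32, b"\x33" * 32, b"\x44" * 64)

for new_hash, pubkey, sig in index.lookup(b"\x11" * 32):
    size = 32 + len(new_hash) + len(pubkey) + len(sig)
    print(f"associated hash found; the whole record is ~{size} bytes")  # ~160 bytes
```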
It is entirely possible to serve a fully cached response that says "you already have this". The problem is...people don't implement this well.
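For reference, the plumbing for this already exists in plain HTTP as conditional requests: an ETag plus If-None-Match yields a 304 with no body. A minimal client-side example using the requests library (the URL is a placeholder):

```python
# "You already have this" in plain HTTP: conditional GET with an ETag.
import requests

url = "https://example.com/some/page"  # placeholder

first = requests.get(url)
etag = first.headers.get("ETag")

if etag:
    second = requests.get(url, headers={"If-None-Match": etag})
    if second.status_code == 304:
        print("Not Modified: keep using the cached copy, nothing re-downloaded")
    else:
        print(f"Content changed: {len(second.content)} new bytes")
```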
If content were handled independently of server names, anyone who cares to distribute metadata for content they care about can do so. One doesn't need write access, or even to be on the same network partition. You could just publish a link between content A and content B because you know their hashes. Assembling all of this can happen in the browser, subject to the user's configs re: who they trust.
"Content-addressable" has a broader meaning than what you seem to be thinking of -- roughly speaking, it applies if any function of the data is used as the "address". E.g., git commits are content-addressable by their SHA1 hashes.
It's a legit limitation on what content addressing can do, but it's one we can overcome by just not having everything be content addressed. The web we have now is like if you did a `git pull` every time you opened a file.
The web I'm proposing is like how we actually use git--periodically pulling new hashes as a separate action, but spending most of our time browsing content that we already have hashes for.
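To illustrate the git analogy: a blob's ID is just a SHA-1 over a short header plus the content, which is what makes it safe to fetch the object from anyone who already has it. A quick sketch that reproduces `git hash-object` for a blob:

```python
# Git is content-addressable: a blob's ID is SHA-1 over "blob <size>\0" plus the bytes.
import hashlib


def git_blob_hash(content: bytes) -> str:
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_hash(b"hello world\n"))
# 3b18e512dba79e4c8300dd08aeb37f8e728b8dad -- identical on every machine,
# which is what lets you fetch the object from anyone who has it.
```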
See https://arxiv.org/abs/1905.11880 [Hydras and IPFS: A Decentralised Playground for Malware]
That's not to say that it is a ready replacement for the web as we know it. If you have hash-linked everything then you wind up with problems trying to link things together, for instance. Once two pages exist, you can't after-the-fact create a link between them, because if you update them to contain that link then their hashes change, so now you have to propagate the new hash to people. This makes it difficult to do things like have a comments section at the bottom of a blog post. So you've got to handle metadata like that in some kind of extra layer--a layer which isn't hash linked and which might be susceptible to all the same problems that our current web is--and then the browser can build the page from immutable pieces, but the assembly itself ends up being dynamic (and likely sensitive to the user's preferences, e.g. dark mode as a browser thing, not a page thing).
But I still think you could move maybe 95% of the data into an immutable hash-linked world (think of these as nodes in a graph), the remaining 5% just being tuples of hashes and public keys indicating which pages are trusted by which users, which ought to be linked to which others, which are known to be the inputs and outputs of various functions, and you know... structure stuff (these are our graph's edges).
The edges, being smaller, might be subject to different constraints than the web as we know it. I wouldn't propose that we go all the way to a blockchain where every device caches every edge, but it might be feasible for my devices to store all of the edges for the 5% of the web I care about, and your devices to store the edges for the 5% that you care about... the nodes only being summoned when we actually want to view them. The edges can be updated when our devices contact other devices (based on trust, like you know that device's owner personally) and ask "hey, what's new?"
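A toy sketch of that edges-only sync, with made-up record fields and the signature check elided, just to show how small the mutable part stays:

```python
# Toy sketch of the edges-only sync described above: devices keep signed edges
# for the slice of the web they care about and ask trusted peers "what's new?"
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: bytes        # hash of one immutable node
    dst: bytes        # hash of another
    author: bytes     # public key of whoever asserted the link
    signature: bytes  # signature over (src, dst)
    seq: int          # author's publication counter


class EdgeStore:
    def __init__(self, trusted_authors: set):
        self.trusted = trusted_authors
        self.edges = set()

    def whats_new(self, since_seq: int) -> set:
        """Answer a peer's 'hey, what's new?' with edges past their cursor."""
        return {e for e in self.edges if e.seq > since_seq}

    def merge(self, incoming: set) -> None:
        """Keep only edges asserted by authors this device trusts (sig check omitted)."""
        self.edges |= {e for e in incoming if e.author in self.trusted}
```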
I've sort of been freestyling on this idea in isolation, probably there's already some projects that scratch this itch. A while back I made a note to check out https://ceramic.network/ in this capacity, but I haven't gotten down to trying it out yet.
We are working on an open-source fraud prevention platform [1], and detecting fake users coming from residential proxies is one of its use cases.
[1] https://jan.wildeboer.net/2025/02/Blocking-Stealthy-Botnets/
Regarding the first post, it's rare to see both datacenter network IPs and mobile proxy IP addresses used simultaneously. This suggests the involvement of more than one botnet. The main idea is to avoid using IP addresses as the sole risk factor. Instead, they should be considered as just one part of the broader picture of user behavior.
Both are pretty easy to mitigate with a geoip database and some smart routing. One "residential proxy" vendor even has session tokens so your source IP doesn't randomly jump between each request.
Trying to understand your product: where is it intended to sit in a network? Is it a standalone tool that you use to identify these IPs and feed into something else for blockage, or is it intended to be integrated into your existing site, or is it supposed to proxy all your web traffic? The reason I ask is that it has fairly heavyweight install requirements, and Apache and PHP are kind of old school at this point, especially for new projects and companies. It's not what they would commonly be using for their site.
Thank you for your question. tirreno is a standalone app that needs to receive API events from your main web application. It can work perfectly well with 512GB of RAM for Postgres, or even less; however, in most cases we're talking about millions of events, which do require resources.
It's much easier to write a stable application without dependencies, based on mature technologies. tirreno is fairly 'boring software'.
Finally, as mentioned earlier, there is no silver bullet that works for every type of online fraudster. For example, in some applications a Tor connection might be considered a red flag. However, if we are talking about HN visitors, many of them use Tor on a daily basis.
I've found Tor browsing to be OK, but logins via Tor are just a great alternative to snowshoeing for credential stuffing.
Not sure how this could work for browsers, but the other 99% of apps I have on my phone should work fine with just a single permitted domain.
It should also do something similar for apps making chatty background requests to domains not specified at app review time. The legitimate use cases for that behaviour are few.
And, AFAIK, you already need special permission for anything other than HTTPS to specific domains on the public Internet. That's why apps ping you about permissions to access "local devices".
They should need special permission for that too.
That's how it works with other permissions most applications should not have access to, like accessing user locations. (And private entitlements third party applications can't have are one way Apple makes sure nobody can compete with their apps, but that's a separate issue.)
You mean, good bye using my bandwidth without my permission? That's good. And if I install a bittorrent client on my phone, I'll know to give it permission.
> such as companion apps for watches and other peripherals
That's just Apple abusing their market position in phones to push their watch. What does it have to do with p2p?
What are you talking about?
> What does it have to do with p2p?
It's an example of how, when you design sandboxes/firewalls, it's very easy to assume all apps are one big homogeneous blob doing REST calls and that everything else is malicious or suspicious. You often need strange permissions to do interesting things. Apple gives themselves these perms all the time.
> What are you talking about?
That's the main use case for p2p in an application, isn't it? Reducing the vendor's bandwidth bill…
The equivalent would be to say that running local workloads or compute is there to reduce the vendor's bill. It's a very centralized view of the internet.
There are many reasons to do p2p, such as improving bandwidth and latency, circumventing censorship, improving resilience, and more. WebRTC is a good example of p2p used by small and large companies alike. None of this is any more "without permission" than a standard app phoning home and tracking your fingerprint and IP.
Great respect for the user's resources.
I just brought it up as a technology that at the very least is both legitimate and common.
The system may have some such functions built in, and asking permission might be a reasonable thing to include by default.
I've used all of them, and it's a deluge: it is too much information to reasonably react to.
Your broad options are either deny or accept, but there's no sane way to reliably know what you should do.
This is not and cannot be an individual problem: the easy part is building high fidelity access control, the hard part is making useful policy for it.
> it is too much information to reasonably react to.
Even if it asks, that does not necessarily mean it has to ask every time, if the user lets it keep the answer (either for the current session or until the user deliberately deletes this data). Also, if it asks too much because it tries to access too many remote servers, then it might be spyware, malware, etc. anyway, and is worth investigating in case that is what it is.
> the hard part is making useful policy for it.
What the default settings should be is a significant issue. However, changing the policies in individual cases for different uses is also something that a user might do, since the default settings will not always be suitable.
If whoever manages the package repository, app store, etc. is able to check for malware, then that is a good thing to do (although it should not prohibit the user from installing their own software and modifying the existing software), but security on the computer itself is also helpful; neither of these is a substitute for the other, and they work together.
Except the platform providers hold the trump card. Fuck around, if they figure it out you'll be finding out.
I am waiting for Apple to enable /etc/hosts or something similar on iOS devices.
If you are being bombarded by suspicious IP addresses, please consider using our free service and blocking IP addresses by ASN or Country. I think ASN is a common parameter for malicious IP addresses. If you do not have time to explore our services/tools (it is mostly just our CLI: https://github.com/ipinfo/cli), simply paste the IP addresses (or logs) in plain text, send it to me and I will let you know the ASNs and corresponding ranges to block.
In cybersecurity, decisions must be guided by objective data, not assumptions or biases. When you’re facing abuse, you analyze the IPs involved and enrich them with context — ASN, country, city, whether it’s VPN, hosting, residential, etc. That gives you the information you need to make calculated decisions: Should you block a subnet? Rate-limit it? CAPTCHA-challenge it?
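For illustration, enrichment can be as little as one lookup per address; a minimal sketch using the ipinfo.io JSON endpoint (IPINFO_TOKEN is a placeholder for your own API token):

```python
# Enriching one IP with ASN/country/org context before deciding how to treat it.
import os

import requests


def enrich_ip(ip: str) -> dict:
    token = os.environ.get("IPINFO_TOKEN", "")  # placeholder for your API token
    resp = requests.get(
        f"https://ipinfo.io/{ip}/json", params={"token": token}, timeout=5
    )
    resp.raise_for_status()
    return resp.json()  # typically includes country, city, and org (ASN + name)


info = enrich_ip("8.8.8.8")
print(info.get("org"), info.get("country"))  # e.g. "AS15169 Google LLC" "US"
```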
Here’s a small snapshot from my own SSH honeypot:
Summary of 1,413 attempts:
- Hosting IPs: 981 (69%)
- VPNs: 35
- Top ASNs:
  - AS204428 (SS-Net): 152
  - AS136052 (PT Cloud Hosting Indonesia): 83
  - AS14061 (DigitalOcean): 76
- Top Countries:
  - Romania: 238 (16.8%)
  - United States: 150 (10.6%)
  - China: 134 (9.5%)
  - Indonesia: 115 (8.1%)
One single /24 from Romania accounts for over 10% of the attacks. That's not about nationality or ethnicity — it's about IP space abuse from a specific network. If a network or country consistently shows high levels of hostile traffic and your risk tolerance justifies it, blocking or throttling it may be entirely reasonable. Security teams don't block based on "where people come from" — they block based on where the attacks are coming from.
We even offer tools to help people explore and understand these patterns better. But if someone doesn’t have the time or resources to do that, I'm more than happy to assist by analyzing logs and suggesting reasonable mitigations.
I hope nobody does cybersecurity in 2025 by analysing and enriching IP addresses. Not in a market where a single residential proxy provider (which you fail to identify) offers 150M+ exit nodes. Even JA3 fingerprinting could be more useful than looking at IP addresses. I bet you the Romanian IPs were not operated by Romanians, yet you're banning all Romanians?
Cybersecurity is a probabilistic game. You build a threat model based on your business, audience, and tolerance for risk. Blocking combinations of metadata — such as ASN, country, usage type, and VPN/proxy status — is one way to make informed short-term mitigations while preserving long-term accessibility. For example:
- If an ASN is a niche hosting provider in Indonesia, ask: “Do I expect real users from here?”
- If a /24 from a single provider accounts for 10% of your attacks, ask: “Do I throttle it or add a CAPTCHA?”
The point isn’t to permanently ban regions or people. It’s to reduce noise and protect services while staying responsive to legitimate usage patterns.
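To make that concrete, here is a hypothetical policy sketch; the thresholds and field names are illustrative only, not a recommendation:

```python
# Hypothetical policy sketch: turn enrichment data into a short-term mitigation.
def mitigation(ip_info: dict, attack_share_by_subnet: dict) -> str:
    subnet = ip_info["subnet"]                      # e.g. "203.0.113.0/24"
    share = attack_share_by_subnet.get(subnet, 0.0)

    if share > 0.10:                                # one /24 sending 10%+ of attacks
        return "block"
    if ip_info.get("is_hosting") and not ip_info.get("expect_real_users"):
        return "captcha"                            # niche hosting ASN, unlikely customers
    if ip_info.get("is_vpn"):
        return "rate_limit"                         # don't punish privacy users outright
    return "allow"


print(mitigation(
    {"subnet": "203.0.113.0/24", "is_hosting": True, "is_vpn": False},
    {"203.0.113.0/24": 0.12},
))  # -> block
```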
As for IP enrichment — yes, it's still extremely relevant in 2025. Just like JA3, TLS fingerprinting, or behavioral patterns — it's one more layer of insight. But unlike opaque “fraud scores” or black-box models, our approach is fully transparent: we give you raw data, and you build your own model.
We intentionally don’t offer fraud scoring or IP quality scores. Why? Because we believe it reduces agency and transparency. It also risks penalizing privacy-conscious users just for using VPNs. Instead, we let you decide what “risky” means in your own context.
We’re deeply committed to accuracy and evidence-based data. Most IP geolocation providers historically relied on third-party geofeeds or manual submissions — essentially repackaging what networks told them. We took a different route: building a globally distributed network of nearly 1,000 probe servers to generate independent, verifiable measurements for latency-based geolocation. That’s a level of infrastructure investment most providers haven’t attempted, but we believe it's necessary for reliability and precision.
Regarding residential proxies: we’ve built our own residential proxy detection system (https://ipinfo.io/products/residential-proxy) from scratch, and it’s maturing fast. One provider may claim 150M+ exit nodes, but across a 90-day rolling window, we’ve already observed 40,631,473 unique residential proxy IPs — and counting. The space is noisy, but we’re investing heavily in research-first approaches to bring clarity to it.
IP addresses aren't perfect, but nothing is! With the right context, though, they're still one of the most powerful tools available for defending services at the network layer. We provide the context, and you build the solution.
Why jump to that conclusion?
If a scraper clearly advertises itself, follows robots.txt, and has reasonable backoff, it's not abusive. You can easily block such a scraper, but then you're encouraging stealth scrapers because they're still getting your data.
I'd block the scrapers that try to hide and waste compute, but deliberately allow those that don't. And maybe provide a sitemap and API (which besides being easier to scrape, can be faster to handle).
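Roughly the kind of scraper I'd deliberately allow, sketched with Python's stdlib robots.txt parser; the site URL and user agent are placeholders:

```python
# A well-behaved scraper: identifies itself, honors robots.txt, backs off.
import time
import urllib.robotparser

import requests

USER_AGENT = "ExampleResearchBot/1.0 (+https://example.org/bot-info)"  # placeholder
SITE = "https://example.com"                                           # placeholder

robots = urllib.robotparser.RobotFileParser()
robots.set_url(f"{SITE}/robots.txt")
robots.read()

delay = robots.crawl_delay(USER_AGENT) or 5  # respect Crawl-delay, default to 5s

for path in ["/", "/sitemap.xml", "/some/article"]:
    url = f"{SITE}{path}"
    if not robots.can_fetch(USER_AGENT, url):
        continue  # explicitly disallowed: skip it rather than sneak around
    requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    time.sleep(delay)  # reasonable backoff between requests
```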
If the app isn't a web browser, none are legit?
Is the premise that users should not be allowed to use vpns in order to participate in ecommerce?
What good is all the app vetting and sandbox protection in iOS (dunno about Android) if it doesn't really protect me from those crappy apps...
If you treat platforms like they are all-powerful, then that's what they are likely to become...
Network access settings should really be more granular for apps that have a legitimate need.
App store disclosure labels should also add network usage disclosure.
https://krebsonsecurity.com/?s=infatica
https://krebsonsecurity.com/tag/residential-proxies/
https://bright-sdk.com/ <- way bigger than infatica