Posted by coloneltcb 4 days ago
I have 1,542,766 domains. Might not be much, but it is honest work.
It is available as a GitHub repo, so anybody who wants to start crawling has some initial data to kick things off.
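For anyone who does want to use the list as a crawl seed, a minimal Python sketch might look like this; the domains.txt filename is only my assumption about how the list is stored, not necessarily what the repo ships:

```python
# Minimal crawl-seed sketch: read a one-domain-per-line list and try
# fetching each front page. "domains.txt" is a guessed filename.
import urllib.request


def load_seed_domains(path: str, limit: int = 100) -> list[str]:
    """Return the first `limit` non-empty lines of a plain-text domain list."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()][:limit]


def fetch_homepage(domain: str) -> bytes | None:
    """Fetch a domain's front page over HTTPS; return None on any failure."""
    try:
        with urllib.request.urlopen(f"https://{domain}", timeout=10) as resp:
            return resp.read()
    except Exception:
        return None


if __name__ == "__main__":
    for domain in load_seed_domains("domains.txt"):
        page = fetch_homepage(domain)
        print(domain, "ok" if page else "unreachable")
```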
FYI there's a broken link in your readme:
https://rumca-js.github.io/internet ("full internet search")
I know that you cannot enumerate and visit every domain, so the list will never be finished, but I am happy with the results.
This is why we have computer variants of Library Science, Archeology, Forensic Science, and a bunch of other advanced disciplines (not AI, mind you).
I hope this guy succeeds and becomes another reference in the community like the Marginalia dude. This makes me want to give my project another go...
While the index is currently not open source, it should be at some point, maybe when they get out of the beta stage? The details are still unclear.
I'll add it to the mile-long list of things that should exist and be online public goods.
"An error has occurred building the search results."
He can then exhaust the remaining server heat through the dryer vent stack.
However, the exhausted hot air never had the same feel as a sauna; it left the air stale and dry.
Some bits and pieces:
> his new search engine, the robust Search-a-Page <https://searcha.page>, which has a privacy-focused variant called Seek Ninja <https://seek.ninja>
> The secret to making it all happen? Large language models. “What I’m doing is actually very traditional search,” Pearce says. “It’s what Google did probably 20 years ago, except the only tweak is that I do use AI to do keyword expansion and assist with the context understanding
> Fellow ambitious hobbyist Wilson Lin, who on his personal blog <https://blog.wilsonl.in/search-engine/> recently described his efforts to create a search engine of his own, took the opposite approach from Pearce.
> And then there’s the concept of doing a small-site search, along the lines of the noncommercial search engine Marginalia <https://marginalia-search.com>, which favors small sites over Big Tech
And the obvious answer to the title: "Why the laundry room? Two reasons: Heat and noise." It runs on a 32-core AMD EPYC 7532, half a terabyte of RAM, and "all in, cost $5,000, with about $3,000 of that going toward storage".
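For the curious, the keyword-expansion trick Pearce describes could be sketched roughly like this (Python). This is not his code; expand_with_llm() is a hard-coded stand-in for whatever model call he actually makes, so the sketch runs without any API key:

```python
# LLM-assisted keyword expansion feeding a traditional keyword search.
# expand_with_llm() is a placeholder for a real model call.

def expand_with_llm(query: str) -> list[str]:
    """Stand-in for an LLM call that suggests related terms."""
    canned = {
        "laundry room server": ["homelab", "home server", "basement rack"],
    }
    return canned.get(query, [])


def build_search_terms(query: str) -> list[str]:
    """Combine the user's own words with the expanded terms, deduplicated,
    ready to be fed into an ordinary inverted-index lookup."""
    terms = query.lower().split()
    for phrase in expand_with_llm(query):
        terms.extend(phrase.lower().split())
    seen, ordered = set(), []
    for term in terms:  # preserve order, drop duplicates
        if term not in seen:
            seen.add(term)
            ordered.append(term)
    return ordered


if __name__ == "__main__":
    print(build_search_terms("laundry room server"))
    # ['laundry', 'room', 'server', 'homelab', 'home', 'basement', 'rack']
```

The traditional index still does all the actual ranking; the model only widens the net before the lookup.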
Why do I never get deals like that when I am shopping for the homelab on eBay?
I see this for pretty much all hardware out on eBay: just go back 5 years and watch the price fall 10x.
I feel like there was a five year span where everyone I talked to said buying or selling electronics on eBay was a nightmare, so I'm a little curious if I need to re-evaluate my priors.
A 7532 CPU is now e-waste for all the datacenters out there, so 1/10 of the original price is reasonable; the latest Nvidia GPU for 200 bucks, though, is obviously a scam.
The real issue is being a seller and solving the "and then the customer claims I shipped them a box of rocks" problem.
I've personally never had that problem after over a decade and hundreds of purchases on eBay. I've had some defective parts, but never outright fraud. IME eBay favors buyers.
I've daydreamed about how I'd create my own search engine so, so many times. But I always run into an impassable wall: The internet now isn't at all the same as the internet in 1999.
Discovery isn't really that useful. If you find someone's self-hosted blog about dinosaurs, it probably hasn't been updated since 2004, all the links and images are broken, and it's just thoroughly upstaged by Wikipedia and the Smithsonian. Sure, it's fun to find these quirky sites, but they aren't as valuable as they once were.
We've basically come full circle to the AOL model, where there are "hubs" of content that cater to specific categories. YouTube has ALL the long-form essays. Tiktok has ALL the humorous videos. Medium has ALL the opinion pieces. Reddit has ALL the flame wars. Mayo Clinic has ALL the drug side-effects. Amazon has ALL the shopping. Ebay has ALL the collectables.
None of these big companies want nasty little web crawlers poking and prodding their site. But they accept Google crawlers, because Google brings them users. Are they going to be that friendly to your crawler?
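One quick way to answer that for any particular site is to compare its robots.txt verdict for Googlebot against an unknown crawler. A minimal check in Python (the hobby-crawler user-agent string and the example domain are made up; swap in any of the hubs above and results depend entirely on that site's actual robots.txt):

```python
# Compare robots.txt permissions for Googlebot vs. an unknown crawler.
from urllib.robotparser import RobotFileParser


def allowed(site: str, user_agent: str, path: str = "/") -> bool:
    """Return True if robots.txt lets `user_agent` fetch `path` on `site`."""
    rp = RobotFileParser()
    rp.set_url(f"https://{site}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, f"https://{site}{path}")


if __name__ == "__main__":
    site = "www.example.com"  # placeholder; try any big hub here
    for ua in ("Googlebot", "MyHobbyCrawler/0.1"):
        print(ua, "->", allowed(site, ua))
```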
Of course, I still dream. Maybe a hub-based internet needs a hub-aware search engine?
Thank you to those who tried it, and I'm sorry if you were one of the people it didn't perform for. As far as load goes, this was the first day it truly had a "trial by fire".