Posted by coloneltcb 4 days ago

Guy running a Google rival from his laundry room (www.fastcompany.com)
245 points | 149 comments
renegat0x0 4 days ago|
Well, I created my own domain index. I have not crawled every page inside domains, but it is not my goal.

I have 1,542,766 domains. It might not be much, but it is honest work.

It is available as a GitHub repo, so anybody who wants to start crawling has some initial data to kick off with.

Links

https://github.com/rumca-js/Internet-Places-Database

raybb 4 days ago||
What a nice project. What inspired this initially?

FYI there's a broken link in your readme:

    https://rumca-js.github.io/internet full internet search
renegat0x0 4 days ago||
Thanks, I replaced it with another demo link.
hobs 4 days ago|||
Can't you just request ICANN's zone files and have the canonical list of the day?
renegat0x0 4 days ago|||
Any link list or domain list is not worth much without ratings or metadata. I run a hobby project, and I am not an expert, so I provide ratings based on what kind of data pages provide (title, social, description), and my own manual voting system. It is not ideal, but it is something. I also provide tags, so it is easy to see what a domain offers, and domains can be filtered by tags.

I know that you cannot count and visit every domain, so the list will never be finished, but I am happy with the results.
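
A minimal sketch of the kind of rating described above, assuming one point per metadata signal present (title, description, social links) plus a sum of manual votes. The `DomainEntry` fields, the `rate` and `filter_by_tag` helpers, and the weights are all hypothetical illustrations, not the repository's actual schema or code.

```python
from dataclasses import dataclass, field


@dataclass
class DomainEntry:
    """One row of a hypothetical domain index (field names are assumptions)."""
    domain: str
    title: str = ""
    description: str = ""
    has_social_links: bool = False
    manual_votes: int = 0  # sum of hand-cast up/down votes
    tags: set[str] = field(default_factory=set)


def rate(entry: DomainEntry) -> int:
    """Score a domain: one point per metadata signal present, plus manual votes."""
    score = 0
    if entry.title:
        score += 1
    if entry.description:
        score += 1
    if entry.has_social_links:
        score += 1
    return score + entry.manual_votes


def filter_by_tag(entries: list[DomainEntry], tag: str) -> list[DomainEntry]:
    """Return the entries carrying the tag, highest-rated first."""
    tagged = [e for e in entries if tag in e.tags]
    return sorted(tagged, key=rate, reverse=True)


# Made-up sample data using reserved example domains.
entries = [
    DomainEntry("example.org", title="Example", description="Docs", tags={"reference"}),
    DomainEntry("example.net", title="Blog", manual_votes=3, tags={"blog", "reference"}),
    DomainEntry("example.com", tags={"blog"}),
]

for e in filter_by_tag(entries, "reference"):
    print(e.domain, rate(e))
```

Keeping the automatic signals and the manual votes as separate fields means the votes can override a thin-metadata page without re-crawling it.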

hobs 4 days ago||
Well, if you are curating every link then it's a different story, and it looks like a more classic webring. I missed that part of the work; I thought it looked like a big set of crawler data that wasn't as manually curated.
egberts1 4 days ago|||
Avoiding GIGO (Garbage In, Garbage Out).

This is why we have computer-variants of Library Science and Archeology, Forensic Science and a bunch of other advanced knowledge (not AI, mind you).

hobs 4 days ago||
I don't see how this applies, as it's aggregating a bunch of stuff from random crawlers. If you want to crawl a list of actual domains, that's generally considered the list of things that could resolve, so it seems like a good starting place.
didip 4 days ago|||
This is amazing. Thanks for sharing!
bufferoverflow 4 days ago||
[dead]
luizfelberti 4 days ago||
I was trying to do this in 2023! The hardest part about building a search engine is not the actual searching though, it is (like others here have pointed out), building your index and crawling the (extremely adversarial) internet, especially when you're running the thing from a single server in your own home without fancy rotating IPs.

I hope this guy succeeds and becomes another reference in the community like the marginalia dude. This makes me want to give my project another go...

mhitza 4 days ago||
You might want to bookmark https://openwebsearch.eu/open-webindex/

While the index is currently not open source, it should be at some point. Maybe when they get out of the beta stage (?) details are yet unclear.

3RTB297 4 days ago||
You know, it's possible the cure to an adversarial internet is to just have some non-profit serve as a repo for a universal clearnet index that anyone can access to build their own search engine. That way we don't have endless captchas and anubis and Cloudflare tests every time I try and look for a recipe online. Why send AI scrapers to crawl literally everything when you're getting the data for free?

I'll add it to the mile-long list of things that should exist and be online public goods.

moduspol 4 days ago|||
Is the common crawl usable for something like this?

https://commoncrawl.org

chiefsearchaco 4 days ago|||
I'm the creator of searcha.page and seek.ninja, and Common Crawl is the basis of my index. The biggest problem with ONLY using that is freshness. I've started my own crawling too, but for sure Common Crawl will backfill a TON of good pages. It's priceless, and I would say Common Crawl should be any search engine's starting point. I have 2 billion pages from Common Crawl! There were a lot more, but I had to scrub them out due to resources. My native crawling is much more targeted and I'd be lucky to pull 100k, but as long as my heuristics for choosing the right targets work, they will be very high-value pulls.
giancarlostoro 4 days ago|||
Most likely it is, the issue then becomes being able to store and afford the storage for all the files.
moduspol 4 days ago||
Sure, and that's not easy, but it's a lot easier than having to crawl the entire public Internet yourself.
wordpad 4 days ago|||
Why can't crawling be crowd sourced? It would solve ip rotation and spread the load
6510 4 days ago|||
https://yacy.net
catlikesshrimp 4 days ago||
Too bad it doesn't support android. It is much more energy efficient than anything else I can spare (for 100% uptime contribution)
Poomba 4 days ago||||
That’s how residential proxies work, in a perverse way
chiefsearchaco 4 days ago|||
Common crawl sort of serves this function. I use it. It's a really good foundation.
6510 4 days ago|||
The crawl seems hard, but the difference between having something and not having it is very obvious. Ordering the results is not. What should go on page 200, and do those results still count as having them?
ge96 4 days ago||
The IP thing is interesting, I was trying to make this CSGO bot one time to scrape steam's prices and there are proxy services out there you rent, tried at least one and it was blocked by steam. So I wonder if people buy real IPs.
kccqzy 4 days ago||
Yeah people buy residential IPs on the black market. They are essentially infected home PCs and botnets.
Bratmon 4 days ago|||
Not just the black market anymore!

https://www.proxyrack.com/residential-proxies/

immibis 4 days ago|||
you can get paid about $0.10/GB in cryptocurrency (at a few GB per month) to run one on your PC. Apparently they also just buy actual connections sometimes. It's not even unethical - it's just two groups of equally bad businesspeople trying to spend money to block the other one.
typpilol 4 days ago||
I've heard a few horror stories... Since the people using residential proxies aren't necessarily always good people
cheema33 4 days ago||
I tried the search site at https://searcha.page/ by searching for something random and got the following message:

"An error has occurred building the search results."

authnopuz 4 days ago||
hug of death? I fear the temperature will get very high in his laundry room
DannyBee 4 days ago|||
I'm sure it depends on how much laundry he is doing - his dryer is probably heated entirely by servers.

He can then exhaust the remaining server heat through the dryer vent stack.

debo_ 4 days ago|||
Keep going. I love dry humor.
egberts1 4 days ago|||
Its dryer sheets soften the soul.
ArekDymalski 4 days ago|||
Until the exhaust starts "Feeling leaky" I guess.
robofanatic 4 days ago||||
Might not even need a dryer :-)
ape4 4 days ago|||
Change it to a sauna?
doublerabbit 4 days ago||
I thought of this a while ago when I was a Datacentre monkey. In the winter it was pleasant to walk down the hot aisles.

However the exhausted hot air never had the same feel of a sauna. It left the air stale and dry.

chiefsearchaco 4 days ago|||
Yep, my usage increased 20x week over week. It was actually the context expansion that was my bottleneck, not the search itself. My usage graph looks almost vertical. Not sure if this counts as a good week or a bad week.
HelloUsername 4 days ago|||
Yup; same at https://seek.ninja/s?q=beatles
eschulz 4 days ago||
Before this happened to me, my first search returned an impressive SERP.
lucb1e 4 days ago||
It claims I reached the article limit. The last time I saw a fastcompany link must have been a decade ago! I was nostalgically looking forward to reading another article of theirs. Alas...

https://archive.is/HA7y4

Some bits and pieces:

> his new search engine, the robust Search-a-Page <https://searcha.page>, which has a privacy-focused variant called Seek Ninja <https://seek.ninja>

> The secret to making it all happen? Large language models. “What I’m doing is actually very traditional search,” Pearce says. “It’s what Google did probably 20 years ago, except the only tweak is that I do use AI to do keyword expansion and assist with the context understanding

> Fellow ambitious hobbyist Wilson Lin, who on his personal blog <https://blog.wilsonl.in/search-engine/> recently described his efforts to create a search engine of his own, took the opposite approach from Pearce.

> And then there’s the concept of doing a small-site search, along the lines of the noncommercial search engine Marginalia <https://marginalia-search.com>, which favors small sites over Big Tech

And the obvious answer to the title: "Why the laundry room? Two reasons: Heat and noise." It runs on a 32-core AMD EPYC 7532, half a terabyte of RAM, and "all in, cost $5,000, with about $3,000 of that going toward storage"

udkl 4 days ago||
I absolutely devoured Wilson Lin's articles recently... they are very high quality and informative for any amateur interested in search engines and LLMs! - https://blog.wilsonl.in/search-engine/
wvenable 4 days ago||
Reader mode in Firefox (plus sometimes a page refresh) gets me past most paywalls -- including this article.
ofrzeta 4 days ago||
"The beefy CPU running this setup, a 32-core AMD EPYC 7532, underlines just how fast technology moves. At the time of its release in 2020, the processor alone would have cost more than $3,000. It can now be had on eBay for less than $200"

why do I never get deals like that when I am shopping for the homelab on eBay?

progval 4 days ago||
You need to spend a lot of time looking through badly labeled offers, and be willing to buy from sellers with no reputation.
_fat_santa 4 days ago|||
Not for a CPU but earlier this year I bought a Thinkpad workstation off eBay for $500. It's a machine from 2020 and when it was new cost $5,700.

I see this for pretty much all hardware out on eBay, just go back 5 years and watch the price fall 10x.

saalweachter 4 days ago||
Has eBay fixed their "and then they ship you a box of rocks" problem?

I feel like there was a five year span where everyone I talked to said buying or selling electronics on eBay was a nightmare, so I'm a little curious if I need to re-evaluate my priors.

buildbot 4 days ago|||
Yes, it’s extremely rare to be stuck with a broken/wrong/missing item as a buyer on eBay. Selling is quite risky in some ways because eBay will nearly always side with a buyer. Every missing or broken thing I have purchased has been refunded or replaced. On the other hand, 3 things I have sold were claimed to not arrive. The only case where eBay decided in my favor was when the buyer had signed for the package in a literal USPS office :)
throwawayffffas 4 days ago||||
You don't get that with used old stuff, you get it with unrealistically low prices for new stuff.

A 7532 CPU is now e-waste for all the datacenters out there, so 1/10 of the original price is reasonable, but the latest Nvidia GPU for 200 bucks is obviously a scam.

apetresc 4 days ago||||
My understanding is that eBay sides with the buyer on all disputes, to the point of ridiculousness. So you should be fine.

The real issue is being a seller and solving the "and then the customer claims I shipped them a box of rocks" problem.

buildbot 4 days ago||
Yep, selling is way more risky. eBay might be the safest (refund-wise) marketplace for buyers… I have more trouble with Amazon.
accrual 4 days ago||||
> Has eBay fixed their "and then they ship you a box of rocks" problem?

I've personally never had that problem after over a decade and hundreds of purchases on eBay. I've had some defective parts, but never outright fraud. IME eBay favors buyers.

mjh2539 4 days ago|||
Every single laptop I've bought off of ebay (all of which were used) over the past ten years has functioned perfectly and flawlessly. You just pay attention to the number of recent sales the account has had and their overall rating.
robrtsql 4 days ago|||
I searched "AMD EPYC 7532" and there are a ton of listings for $150-$200. Are you just regretful that it wasn't like this when you were shopping parts for your homelab?
throwawayffffas 4 days ago||
I got a 7551p plus motherboard and ram for about 600 bucks from China this January. I may have overpaid but it works great, and gets the job done.
Gormo 4 days ago|||
TheServerStore.com often has good deals. I actually bought a brand new 64-core EPYC 7702 server with 256 GB RAM and 8TB NVMe storage for about $3K fully assembled earlier this year.
ThatMedicIsASpy 4 days ago|||
Epyc 7000 + MB + 256GB-512GB RAM (from China) usually starts at 800 euros + import tax
chiefsearchaco 4 days ago|||
Get a QC type chip and roll the dice, that's how I got mine. The biggest cost for me is disk and to a lesser extent ram, the chip itself was relatively cheap.
renewiltord 4 days ago||
AliExpress broseph. You'll get it in no time. I've gotten. Go do QS if you have some risk tolerance and ES if you also have time tolerance.
phendrenad2 4 days ago||
This is a cool project, and I hope he has fun with it.

I've daydreamed about how I'd create my own search engine so, so many times. But I always run into an impassable wall: The internet now isn't at all the same as the internet in 1999.

Discovery isn't really that useful. If you find someone's self-hosted blog about dinosaurs, it probably hasn't been updated since 2004, all the links and images are broken, and it's just thoroughly upstaged by Wikipedia and the Smithsonian. Sure, it's fun to find these quirky sites, but they aren't as valuable as they once were.

We've basically come full circle to the AOL model, where there are "hubs" of content that cater to specific categories. YouTube has ALL the long-form essays. TikTok has ALL the humorous videos. Medium has ALL the opinion pieces. Reddit has ALL the flame wars. Mayo Clinic has ALL the drug side-effects. Amazon has ALL the shopping. eBay has ALL the collectables.

None of these big companies want nasty little web crawlers poking and prodding their site. But they accept Google crawlers, because Google brings them users. Are they going to be that friendly to your crawler?

Of course, I still dream. Maybe a hub-based internet needs a hub-aware search engine?

chiefsearchaco 4 days ago||
Well I can't respond to everyone - I am the one running the search engine. And yes, it did crash today from load. Usage increased 20x this week vs last and I was totally unprepared. I don't know if that counts as a good launch or a bad one. For some reason in my head I imagined usage would be some slow steady ramp.

Thank you for those who tried it, and I'm sorry if you were one of the people it didn't perform for. As far as load goes this was the first day it truly had a "trial by fire".

OJFord 4 days ago||
'Google rival' is quite a stretch, surely 'search engine' is not just more accurate, but clearer too with all that Google does today, as if that's new.
amelius 4 days ago||
https://archive.ph/HA7y4
BLKNSLVR 4 days ago|
Great innovation plus cloud-skeptic self-hosting. There should be much much more of this!