Posted by mpweiher 2 days ago
If you have a router that lets you inspect data flowing out, you'll be astonished at what your little Nanit cam exfiltrates from your home network. Even if you don't pay for their subscription service, they still attempt to exfil all of the video footage caught on your camera to their servers. You can block it and it will still work, but you shouldn't have to do that in the first place if you don't pay for their cloud service.
Stay away if you value your privacy.
The end of the article has this:
> Consider custom infrastructure when you have both: sufficient scale for meaningful cost savings, and specific constraints that enable a simple solution. The engineering effort to build and maintain your system must be less than the infrastructure costs it eliminates. In our case, specific requirements (ephemeral storage, loss tolerance, S3 fallback) let us build something simple enough that maintenance costs stay low. Without both factors, stick with managed services.
Seems they were well aware of the tradeoffs.
Without cloud, saving a file is as simple as "with open(...) as f: f.write(data)" + adding a record to DB. And no weird network issues to debug.
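For illustration, a minimal sketch of that pattern, assuming a local uploads/ directory and a SQLite metadata table (both names hypothetical):

    import sqlite3
    import uuid
    from pathlib import Path

    # Hypothetical layout: an uploads/ directory plus a SQLite table for metadata.
    UPLOAD_DIR = Path("uploads")
    UPLOAD_DIR.mkdir(exist_ok=True)

    db = sqlite3.connect("files.db")
    db.execute("CREATE TABLE IF NOT EXISTS files (id TEXT PRIMARY KEY, path TEXT, size INTEGER)")

    def save_file(data: bytes) -> str:
        """Write the bytes to local disk, then record the location in the DB."""
        file_id = uuid.uuid4().hex
        path = UPLOAD_DIR / file_id
        with open(path, "wb") as f:
            f.write(data)
        db.execute("INSERT INTO files (id, path, size) VALUES (?, ?, ?)",
                   (file_id, str(path), len(data)))
        db.commit()
        return file_id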
Save where? With what redundancy? With what access policies? With what backup strategy? With what network topology? With what storage equipment and file system and HVAC system and...
Without on-prem, saving a file is as simple as s3.put_object()!
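For comparison, a sketch of the S3 version, assuming boto3 and a hypothetical bucket name (credentials/region configuration not shown):

    import boto3

    # Assumes AWS credentials are already configured (env vars, ~/.aws, or a role);
    # the bucket name here is hypothetical.
    s3 = boto3.client("s3")

    def save_file(key: str, data: bytes) -> None:
        # One call uploads the object; durability, replication and access policy
        # are handled on the S3 side (bucket policy, IAM, lifecycle rules).
        s3.put_object(Bucket="my-app-uploads", Key=key, Body=data)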
> Save where? With what redundancy? With what access policies? With what backup strategy? With what network topology? With what storage equipment and file system and HVAC system and...
Most of these concerns can be addressed with ZFS[0] provided by FreeBSD systems hosted in triple-A data centers.
See also iSCSI[1].
> Save where? With what redundancy? With what access policies? With what backup strategy? With what network topology? With what storage equipment and file system and HVAC system and...
Wow, that's a lot to learn before using S3... I wonder how much it costs in salaries.
> With what network topology?
You don't need to care about this when using SSDs/HDDs.
> With what access policies?
Whichever is defined in your code; no restrictions, unlike in S3 (see the sketch after this comment). No need to study complicated AWS documentation and navigate through multiple consoles (this also costs you salaries, by the way). No risk of leaking files due to misconfigured cloud services.
> With what backup strategy?
Automatically backed up with the rest of your server data, no need to spend time on this.
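To make the access-policy point concrete, a minimal sketch of enforcing the rule purely in application code (the ownership rule and directory name are hypothetical, not anyone's actual setup):

    from pathlib import Path

    UPLOAD_DIR = Path("uploads")  # hypothetical storage directory

    def read_user_file(requesting_user_id: str, owner_id: str, file_id: str) -> bytes:
        # The access policy lives entirely in application code: only the owner may read.
        if requesting_user_id != owner_id:
            raise PermissionError("not allowed")
        # file_id is assumed to be an opaque generated ID, so no path traversal here.
        return (UPLOAD_DIR / file_id).read_bytes()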
You do need to care when you move beyond a single server in a closet that runs your database, webserver and storage.
> No risk of leaking files due to misconfigured cloud services.
One misconfigured .htaccess file, for example, could result in leaking files.
First, I hope nobody is using Apache anymore; second, you typically store files outside of the web directory.
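A sketch of that approach, assuming a Flask app and a hypothetical /srv/app-data/files directory that no web server exposes directly, so there is no .htaccess-style config to get wrong:

    from pathlib import Path
    from flask import Flask, abort, send_file

    app = Flask(__name__)
    # Hypothetical directory outside any web-served document root.
    STORAGE_DIR = Path("/srv/app-data/files").resolve()

    @app.route("/files/<file_id>")
    def download(file_id: str):
        # Access control happens in application code, not in web-server config;
        # whatever auth check the app already has would go here.
        path = (STORAGE_DIR / file_id).resolve()
        if STORAGE_DIR not in path.parents or not path.is_file():
            abort(404)
        return send_file(path)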
> One misconfigured .htaccess file for example, could result in leaking files.
I don't think you are making a compelling case here, since both scenarios result in an undesirable exposure. Unless your point is that both cloud services and local file systems can be equally exploited?
There may be some additional features that S3 has over a direct filesystem write to an SSD in your closet. The people paying for cloud spend are paying for those features.
Question: How do you save a small fortune in cloud savings?
Answer: First start with a large fortune.
I think you mean a small fraction of 3 engineers. And small fractions aren't that small.
But then you also have to think about file uploads and file downloads. You cannot have a single server fulfilling all the roles; otherwise you have a bottleneck.
So this file storage became a private backend service that end-users never access directly. I added upload services whose sole purpose is to accept files from users and then pass them on to this central file store, essentially creating a distributed file upload queue (there is also a bit more logic around file ID creation and validation).
Secondly, my own CDN was needed for downloads, but only because I use custom access handling and could not use any of the commercial services (though they do support access via tokens, it just was not working for me). This was tricky because I wanted the nodes to distribute files between themselves rather than always fetch them from the origin, to avoid network costs on the origin server. So they had to find each other, talk to each other, and know who has which file.
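Roughly, the peer-lookup part looks like this (a simplified sketch; the node addresses, endpoints, and requests-based HTTP calls are just placeholders, not the real code):

    import requests

    # Placeholder node addresses and endpoints; this only sketches the idea of
    # asking peer nodes before falling back to the origin.
    PEERS = ["http://cdn-node-2.internal:8080", "http://cdn-node-3.internal:8080"]
    ORIGIN = "http://origin.internal:8080"

    def fetch_file(file_id: str) -> bytes:
        # Ask peers first to avoid traffic (and network costs) on the origin.
        for peer in PEERS:
            try:
                resp = requests.get(f"{peer}/files/{file_id}", timeout=2)
                if resp.status_code == 200:
                    return resp.content
            except requests.RequestException:
                continue  # peer is down or does not have the file
        # Fall back to the origin server.
        resp = requests.get(f"{ORIGIN}/files/{file_id}", timeout=10)
        resp.raise_for_status()
        return resp.content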
In short, rolling your own is not as hard as it might seem and should be preferable. Maybe use the cloud at the beginning to save time, but once you are up and running and your business idea is validated by having customers, move to your own infra to avoid the astronomical costs of cloud services.
btw, i also do video processing like mentioned in the blog post :)
Have you ever thought of using a PostgreSQL DB (also on AWS) to store those files and using CDC to publish messages about those files to a Kafka topic? In your original setup, you need 3 AWS services: S3, Lambda and SQS. This way, you need 2: PostgreSQL and Kafka. I'm not sure how well this method works though :-)
Up to 1GB with the bytea data type (https://www.postgresql.org/docs/current/datatype-binary.html) and up to 4TB with large objects (https://wiki.postgresql.org/wiki/BinaryFilesInDB).
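For what it's worth, a minimal sketch of the storage half with psycopg2 (the table name and connection string are hypothetical; the CDC/Kafka side, e.g. via Debezium, is not shown):

    import psycopg2

    # Hypothetical table and connection string. bytea values are capped at 1GB;
    # larger files would need the large object facility instead.
    conn = psycopg2.connect("dbname=files user=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS video_files (
                id   TEXT PRIMARY KEY,
                data BYTEA NOT NULL
            )
        """)

    def store_file(file_id: str, data: bytes) -> None:
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO video_files (id, data) VALUES (%s, %s)",
                (file_id, psycopg2.Binary(data)),
            )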
So if you have experience with this and it did work well, I'm curious to hear about it! That's why i asked about if it worked well, not about the maximum size postgres allowed in various data types.
If you have no experience with it, but are just posting advice based on what AI tells you about max sizes of data allowed by pg that I can get from the same source too, then okay, fair enough, and certainly no need to give me any more of that!
Why hesitant? Just ask AI. It'll tell you how to do it and then you can experiment with it yourself.
just sounded less attractive
We could just use something like that
Or there is that other object storage solution from Cloudflare, called R2.