Posted by mfrw 4 days ago
My mental model:
With 1 trust (the default), any trusted machine with credentials is granted access, so each such machine carries one full unit of trust. With 2-trust, we’d need to combine at least two units of trust, so two machines. Equivalently, each credential-bearing machine is half trusted (think ssh bastion hosts or 2FA / mobikeys for 2-trust).
This generalizes to 1/N, so for zero trust, we place 1/0 = infinite units of trust in every machine that has a credential. In other words, if we provision any one machine for access, we necessarily provision an unbounded number of other machines for the same level of access.
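Writing that arithmetic out explicitly (just a formalization of the model above, nothing from the article):

    % trust carried by each credentialed machine in an N-trust scheme
    t = \frac{1}{N}
    % access is granted once the combined trust reaches one full unit,
    % i.e. k machines suffice only when k \cdot \tfrac{1}{N} \ge 1, so k \ge N
    % at N = 0 this gives t = 1/0: unbounded trust in every credentialed machine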
As snarky as this math is, I’ve yet to see a more accurate formulation of what zero trust architectures actually provide.
YMMV.
The point of Zero Trust (TM) is to authenticate and authorize the human being behind the machine, not the machine itself.
(Clearly, that doesn't work for all kinds of automated access, and it comes with a lot of questions in terms of implementation details (e.g., do we trust the 2FA device?), but that's the gist.)
Getting there is a long walk through the woods on a moonless night.
> With 2-trust, we’d need to combine at least two units of trust, so two machines. Equivalently, each credential-bearing machine is half trusted (think ssh bastion hosts or 2FA / mobikeys for 2-trust).
You might be interested in OpenPubkey[0, 1] which was developed at BastionZero. It has 1/2 trust for OpenID Connect and can be used for SSH.
> As snarky as this math is, I’ve yet to see a more accurate formulation of what zero trust architectures actually provide.
I prefer the term epsilon-trust to reflect the nature of security and trust reduction as an iterative process. The trust in a system approaches but never fully reaches zero.
[0]: OpenPubkey: Augmenting OpenID Connect with User held Signing Keys https://eprint.iacr.org/2023/296
Zero Trust just means you stop inherently trusting your private network and verify every user/device/request regardless. If you opt in to using Cloudflare to do this then it requires running Cloudflare software.
Zero trust done correctly does not have those same drawbacks.
On the one hand, entirely trusting Cloudflare isn't really zero trust.
On the other hand, not trusting any network is one narrow definition.
I'll give you SSH keys when you pry them from my cold, dead FDE SSDs.
TLS relying on trusted CA certificate publishers is not what “Zero Trust” is about.
Would this mean that a PostgreSQL listening on localhost and always asking for user and password is considered Zero Trust, but peer authentication is not?
This part is always a bit confusing for me because there's already been authentication (OS login) creating a session for a specific user (OS user) accessing a specific service (through a unix domain socket) with the specific connection being validated (the unix domain socket permissions).
And from my limited knowledge: the OS login looks like an IdP (Identity Provider); the OS session looks like a JWT already validated by a middleware (the OS vs some API Gateway); connecting to a service uses this "token" (OS session vs JWT); and the specific connection (the connection to the socket) is only allowed if the token is valid (OS session has permissions vs JWT has a good signature) and has permissions to the application itself (PostgreSQL checking the connecting user has access to this resource, vs the application checking the connecting user has access to this resource).
So I can see this as Zero Trust because the pattern is kinda matching ("the letter"), but also as Not Zero Trust because I feel like this would still be considered a "trusted context" by what the term tries to convey ("the spirit").
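For reference, the two setups being contrasted look roughly like this in pg_hba.conf (a sketch; the database/user wildcards are just for illustration):

    # Peer auth over the local Unix socket: the connecting OS user must match
    # the database user, so trust is delegated to the OS login session.
    local   all   all                  peer

    # Password auth on loopback: PostgreSQL re-authenticates every connection
    # itself, regardless of which OS user opened it.
    host    all   all   127.0.0.1/32   scram-sha-256
    host    all   all   ::1/128        scram-sha-256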
The truthiness of "zero trust" really depends on who's trusting whom.
Would you like to explain what you mean by this?
Provide command logs and session recordings to allow administrators to audit and replay their developers’ interactions with the organization’s infrastructure.
The only way they can do this is if they record and store the session text, effectively a keylogger between you and the machine you are SSH'ing into.
Of course, CloudFlare is making it their business to be and convince others that they are that trusted third-party.
I think the sales pitch for these sorts of service is: "Get an SSH-like experience, but it integrates with your corporate single-sign-on system, has activity logs that can't be deleted even if you're root on the target, sorts out your every-ephemeral-cloud-instance-has-a-different-fingerprint issues, and we'll sort out all the reverse-tunnelling-through-NAT and bastion-host-for-virtual-private-cloud stuff too"
Big businesses pursuing SOC2 compliance love this sort of thing.
https://fly.io/blog/soc2-the-screenshots-will-continue-until...
The second-order problem I've found is that when you dig in, there are plenty of people who ask for certs, but when push comes to shove they really want functionality where, when a user's access is cancelled, all of their active sessions get torn down immediately as well.
I don't mean to detract from this, because short-lived creds are always better, but for my money I hope I never have sshd running on any machine again
Not generally. In one particular class of deployments, allowing ssh access to root-enabled accounts without auditing may be a problem... but this is an exceptionally narrow definition.
> I hope I never have sshd running on any machine again
Sounds great for production and ridiculous for development and testing.
I believe in “practice how you're going to play” to get devs into the habit of not using a crutch to treat deployments like they are their local machine. The time to anticipate failures is in the thinking time, not "throw it over the wall and we'll think later".
Disclaimer: I work at Hashicorp.
Stuff I work on is write heavy, so spawning dozens of app copies doesn't make sense if I just hog the db with write locks.
But, seriously, Teleport (back before they did a licensing rug-pull) is great at that and no SSH required. I'm super positive there are a bazillion other "don't use ssh as a poor person's VPN" solutions
Among other tech, it involves the founding of a megacorp that exploits the discovery and monopolization of wormhole technology for profit, causing a rift between the two founders, who each remind me of Steve Jobs and Steve Wozniak in their cooperation and divergence.
In that specific case, one can also have "systemd for normal people" via its support for static Pod definitions, so one can run containerized toys on boot even without being a formal member of a kubernetes cluster
AWS SSM provides auditing of what a person might normally type via ssh, and kubelet similarly, just at a different abstraction level. For clarity, I am aware that it's possible via some sshd trickery one could get similar audit and log egress, but I haven't seen one of those in practice whereas kubelet and AWS SSM provide it out of the box
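For anyone who hasn't used it, the SSM flow looks something like this (the instance ID is a placeholder; shipping session content to CloudWatch Logs or S3 is configured separately in Session Manager preferences):

    # Open an interactive shell on an instance with no sshd and no open port 22
    aws ssm start-session --target i-0123456789abcdef0

    # Review who has been opening sessions, for audit purposes
    aws ssm describe-sessions --state History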
You can use it to tunnel arbitrary traffic inside your VPC.
Allow SSH in non-prod environments and reproduce the issue there?
In prod you are aiming for "not broken" rather than "do whatever I want as admin".
https://etcha.dev/docs/guides/shell-access/
It works well and I can "expose" servers using reverse proxies since the entire shell session is over HTTP using SSE.
- Sane, secure defaults
- HTTP-based--no fingerprinting, requires the correct path (which can be another secret), plays nicely with reverse proxies and forwarders (no need for jump boxes)
- Rate limited by default
- Only works with PKI auth
- Clients verify/validate HTTPS certificates, no need for SSHFP records.
Do you know how many times a few packets can be replayed in 5 seconds?
We are still stuck there, but we're striving to get to the place where we can turn off sshd on Prod and rely on the CI/CD pipeline to blow away and reprovision instances, and be 100% confident we can test and troubleshoot in dev and stage and by looking at off-instance logs from Prod.
How important it is to get there is something I ponder (and question my motivations for) - it's clearly not worthwhile if your project is one or two prod servers perhaps running something like HA WordPress, but it's obvious that at Netflix-type scale nobody is sshing into individual instances to troubleshoot. We are a long way (a long long long long way) from Netflix scale, and are unlikely to ever get there. But somewhere between dozens and hundreds of instances is about where I reckon the work required to get close to there starts paying off.
Have you worked at Netflix?
I haven't, but I have worked with large scale operations, and I wouldn't hesitate to say that the ability to ssh (or other ways to run commands remotely, which are all either built on ssh or likely not as secure and well tested) is absolutely crucial to running at scale.
The more complex and heterogeneous the environments you have, the more likely you are to encounter strange flukes. Handshakes that only fail a fraction of a percent of the time, and so on. Multiple products and providers interacting. Tools like tcpdump and eBPF become essential.
Why would you want to deploy on a mature operating system such as Linux and not use tools such as eBPF? I know the modern way is just to yolo it and restart stuff that crashes, but as a startup or small scale you have other things to worry about. When you are at scale you really want to understand your performance profile and iron out all the kinks.
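As a concrete example of the kind of thing you reach for at that point (a sketch; it assumes bpftrace is installed, the interface name is a placeholder, and the kernel probe name can vary between versions):

    # Count TCP retransmissions per process, to chase handshakes that only
    # fail a fraction of a percent of the time
    bpftrace -e 'kprobe:tcp_retransmit_skb { @retransmits[comm] = count(); }'

    # Meanwhile, capture only the SYNs for offline inspection
    tcpdump -ni eth0 'tcp[tcpflags] & tcp-syn != 0' -w syns.pcap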
Achieving this is difficult, but we have the tools to do it. The hurdles are often organizational rather than technical.
Yeah. And in my opinion "organizational" reasons can (and should) include "we are just not at the scale where achieving that makes sense".
If you have single-digit numbers of machines, the whole solid observability / automated node replacement / self-healing setup overhead is unlikely to pay off. Especially if the SLAs don't require 2am weekend hair-on-fire platform recovery. For a _lot_ of things, you can almost completely avoid on-call incidents with straightforward redundant (over-provisioned) HA architectures, no single points of failure, and sensible office-hours-only deployment rules (and never _ever_ deploy to Prod on a Friday afternoon).
Scrappy startups, and web/mobile platforms for anything where a few hours of downtime is not going to be an existential threat to the money flow or a big story in the tech press - probably have more important things to be doing than setting up log aggregation and request tracing. Work towards that, sure, but probably prioritise the dev productivity parts first. Get your CI/CD pipeline rock solid. Get some decent monitoring of the redundant components of your HA setup (as well as the Prod load balancer monitoring) so you know when you're degraded but not down (giving you some breathing space to troubleshoot).
And aspire to fully resilient systems, and have a plan for what they might look like in the future, to avoid painting yourself into a corner that makes it harder than necessary to get there one day.
But if you've got a guy spending 6 months setting up chaos monkey and chaos doctor for your WordPress site that's only getting a few thousand visits a day, you're definitely doing it wrong. Five nines are expensive. If your users are gonna be "happy enough" with three nines or even two nines, you've probably got way better things to do with that budget.
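To put rough numbers on those nines (standard availability arithmetic, nothing specific to this thread):

    % allowed downtime per year at availability A
    \text{downtime} = (1 - A) \times 365.25 \times 24\,\text{h}
    % A = 0.99    (two nines)   \approx 88\,\text{h}   (about 3.7 days)
    % A = 0.999   (three nines) \approx 8.8\,\text{h}
    % A = 0.99999 (five nines)  \approx 5.3\,\text{min}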
For a lot of things, the simplicity inherent in a single VPS will mean you have better availability than any of those bizarrely complex autoscaling/recovery setups.
The thing is that all companies regardless of their scale would benefit from these good practices. Scrappy startups definitely have more important things to do than maintaining their infra, whether that involves setting up observability and automation or manually troubleshooting and deploying. Both involve resources and trade-offs, but one of them eventually leads to a reduction of required resources and stability/reliability improvements, while the other leads to a hole of technical debt that is difficult to get out of if you ever want to improve stability/reliability.
What I find more harmful is the prevailing notion that "complexity" must be avoided at smaller scales, and that somehow copying a binary to a single VPS is the correct way to deploy at this stage. You see this in the sibling comment from Aeolun here.
The reality is that doing all of this right is an inherently complex problem. There's no getting around that. It's true that at smaller scales some of these practices can be ignored, and determining which is a skill on its own. But what usually happens is that companies build their own hodgepodge solutions to these problems as they run into them, which accumulate over time, and they end up having to maintain their Rube Goldberg machines in perpetuity because of sunk costs. This means that they never achieve the benefits they would have had they just adopted good practices and tooling from the start.
I'm not saying that starting with k8s and such is always a good idea, especially if the company is not well established yet, but we have tools and services nowadays that handle these problems for us. Shunning cloud providers, containers, k8s, or any other technology out of an irrational fear of complexity is more harmful than beneficial.
that's great and completely correct when you are one of the very few places in the universe where everything is fully mature and stable. the rest of us work on software. :)
Moreover, if you are in the cloud, part of your infrastructure is not under your control, making it even harder to reproduce a problem.
I've worked with companies at Netflix's scale and they still have last-resort ssh access to their systems.
Even as I type all these answers out, I'm super cognizant that there's not one hammer for all nails, and I am for sure guilty of yanking Nodes out of the ASG in order to figure out what the hell has gone wrong with them, but I try very very hard not to place my Nodes in a precarious situation to begin with so that such extreme troubleshooting becomes a minor severity incident and not Situation Normal
I agree that you should set a maximum lifetime for a node on the order of a few weeks.
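On AWS that's a one-liner against the Auto Scaling group (the group name and lifetime are placeholders; the value is in seconds, so 1209600 is 14 days):

    # Recycle every instance after at most two weeks
    aws autoscaling update-auto-scaling-group \
        --auto-scaling-group-name my-asg \
        --max-instance-lifetime 1209600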
I also agree that you shouldn't be giving randos access to production infra, but at the end of the day there need to be some people at the company who have the keys to the kingdom, because you don't know what you don't know and you need to be able to deal with unexpected faults or outages of the telemetry and logging systems.
I once bootstrapped an entire datacenter with tens of thousands of nodes from an SSH terminal after an abrupt power failure. It turns out infrastructure has lots of circular dependencies and we had to manually break that dependency.
Personal ssh access is always better (from a security standpoint) than bot tokens and keys.
And even with SSH you can have special keys authorised for only specific commands, so a service account would be better than a personal one in that case.
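That restriction lives in authorized_keys itself; a minimal sketch (the script path and key are placeholders):

    # Only the named command runs with this key; "restrict" also disables
    # PTY allocation, port/agent/X11 forwarding, etc.
    command="/usr/local/bin/backup.sh",restrict ssh-ed25519 AAAAC3Nz...example backup-bot@ci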
This seems like it’s conceding the point since SSM also allows you to run commands on nodes - I use it interchangeably with SSH to have Ansible manage legacy servers. Maybe what you’re trying to say is that it shouldn’t be routine and that there should be more of a review process so it’s not just a random unrestricted shell session? I think that’s less controversial, and especially when combined with some kind of “taint” mode where your access to a server triggers a rebuild after the dust has settled.
> combined with some kind of “taint” mode where your access to a server triggers a rebuild after the dust has settled.
Oh, I love that idea: thanks for bringing it to my attention. I'll for sure incorporate that into my process going forward
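One way to wire that up on AWS (a hypothetical sketch, not an established tool; the tag name, the profile hook, and the assumed reaper job that later terminates tagged instances are all made up for illustration):

    # /etc/profile.d/taint-on-login.sh -- runs for interactive logins
    TOKEN=$(curl -s -X PUT http://169.254.169.254/latest/api/token \
      -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
    INSTANCE_ID=$(curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
      http://169.254.169.254/latest/meta-data/instance-id)
    # Mark this node as hand-touched; a scheduled job terminates tainted
    # instances later so the ASG replaces them with freshly built ones.
    aws ec2 create-tags --resources "$INSTANCE_ID" --tags Key=tainted,Value=true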
Actually not every problem is solved with the "have you tried turning it off and back on again" trick.
I did start the whole thread by saying "and then I grew up," and not everyone is at the same place in their organizational maturity model. So, if you're happy with the process you have now, keep using it. I was unhappy, so I studied hard, incorporated supporting technology, and lobbied my heart out for change. Without maturity levels we'd all still be using telnet and tar based version control
Fixing bad design is an art, not an organizational discipline. (Sadly.)
this just moves the trusted component from the SSH key to Cloudflare, and you still must trust something implicitly. except now it's a company that has agency and a will of its own instead of just some files on a filesystem.
I'll stick to forced key rotation, thanks.
Some keys on a file system on a large number of user endhosts is a security nightmare. At big companies user endhosts are compromised hourly.
When you say forced key rotation, how do you accomplish that and how often do you rotate? What if you want to disallow access to a user on a faster tempo than your rotation period? How do you ensure that you are giving out the new keys to only authorized people?
My experience has been, when you really invest in building a highly secure key rotation system, you end up building something similar to our system.
1. You want SSO integration with policy to ensure only the right people get the right keys, and that the right keys end up on the right hosts. This is a hard problem.
2. You end up using an SSH CA with short-lived certificates, because "key expires after 3 minutes" is far more secure than "key rotated every 90 days" (see the sketch after this list).
3. Compliance requirements typically require session recording and logging; do you end up building a MITM SSH proxy to do this?
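For point 2, a minimal sketch with stock OpenSSH (the CA path, principal, and lifetime are illustrative):

    # On the CA host: mint a certificate for alice that expires in 3 minutes
    ssh-keygen -s /etc/ssh/user_ca -I alice@example.com -n alice -V +3m id_ed25519.pub

    # On each server (sshd_config): trust certificates signed by that CA
    TrustedUserCAKeys /etc/ssh/user_ca.pub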
Building all this stuff is expensive and it needs to be kept up to date. Instead of building it in-house and hoping you build it right, buy a zero trust SSH product.
For many companies the alternative isn't key rotation, it's just an endlessly growing set of keys that never expire. To quote Tatu Ylonen, the inventor of SSH:
> "In analyzing SSH keys for dozens of large enterprises, it has turned out that in many environments 90% of all authorized keys are no longer used. They represent access that was provisioned, but never terminated when the person left or the need for access ceased to exist. Some of the authorized keys are 10-20 years old, and typically about 10% of them grant root access or other privileged access. The vast majority of private user keys found in most enviroments do not have passphrases."
Challenges in Managing SSH Keys – and a Call for Solutions https://ylonen.org/papers/ssh-key-challenges.pdf
I would think even a simple "sorry, this change does not align with the project's goals" -> closed would help the submitter (and others) have some clarity versus the PR limbo it's currently in
That aside, thanks so much for pointing this out: it looks like good fun, especially the Asciicast support!
In practice, I get it - a network zone shouldn’t require a lower authn/z bar on the implicit assumption that admission to that zone must have required a higher bar.
But all these systems are built on trust, and if it isn’t based on network zoning, it’s based on something else. Maybe that other thing is better, maybe not. But it exists and it needs to be understood.
An actual zero trust system is the proverbial unpowered computer in a bunker.
I also like how that makes it easier to understand how variation is normal: for example, authentication comes in various flavors and that's okay, whereas some of these zero trust vendors will try to claim that something is or isn't ZT based on feature gaps in their competitors' products, and it's just so tedious to play that game.
The gain here is minimal.
If that alone weren't reason enough to dismiss this, the article has marketing BS throughout. For instance, "SSH access to a server often comes with elevated privileges". Ummm... Every authentication system ever grants whatever privileges come with that authentication system. This is the kind of bull you say / write when you want to snow someone who doesn't know any better. To those of us who do understand this, it's almost AI-level bullshit.
The same is true of their supposed selling points:
> Author fine-grained policy to govern who can SSH to your servers and through which SSH user(s) they can log in as.
That's exactly what ssh does. You set up precisely which authentication methods you accept, you set up keys for exactly that purpose, and you set up individual accounts. Do Cloudflare really think we're setting up a single user account and giving access to lots of different people, and we need them to save us? (now that I think about it, I bet some people do this, but this is still a ridiculous selling point)
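For instance, with nothing but sshd_config you already get something like this (user names and paths are placeholders):

    # Keys only, and only for named users
    PasswordAuthentication no
    AllowUsers alice bob deploy

    # A constrained account that can only run one thing
    Match User deploy
        ForceCommand /usr/local/bin/run-deploy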
> Monitor infrastructure access with Access and SSH command logs
So they're MITMing all of our connections? We're supposed to trust them, even though they have a long history of not only working with scammers and malicious actors, but protecting them?
I suppose there's a sucker born every minute, so Cloudflare will undoubtedly sell some people on this silliness, but to me it just looks like yet another way that Cloudflare wants to recentralize the Internet around them. If they had their way, then in a few years, were they to go down, a majority of the Internet would literally stop working. That should scare everyone.
We (BastionZero) recently got bought by Cloudflare and it is exciting bringing our SSH ideas to Cloudflare.
So far my experience with joining and working at Cloudflare has been fantastic. Coming from a background of startups and academia, the size and scope of what Cloudflare is building and currently runs is overwhelming.
In academia I've seen lots of excellent academic computer science papers that never benefit anyone because they never get turned into a tool that someone can just pick up and use. Ideas have inherent value, even useless ideas, but it feels good to see great ideas have impact. What appealed to me the most about getting acquired by Cloudflare is seeing research applied directly to products and used by people. Cloudflare does an excellent job both inventing innovative ideas and then actually making them real. There used to be a lot of companies that did this 10 years ago, but Cloudflare now seems rare in that respect.