
Posted by mfrw 10/23/2024

Fearless SSH: Short-lived certificates bring Zero Trust to infrastructure(blog.cloudflare.com)
151 points | 160 comments
edelbitter 10/23/2024|
Why does the title say "Zero Trust", when the article explains that this only works as long as every involved component of the Cloudflare MitM keylogger and its CA can be trusted? If host keys are worthless because you do not know in advance what key the proxy will have, then this scheme is back to trusting servers merely because they are in Cloudflare address space, no?
hedora 10/24/2024||
Every zero trust architecture ends up trusting an unbounded set of machines. Like most marketing terms, it’s probably easier to assume it does the inverse of what it claims.

My mental model:

With 1-trust (the default) any trusted machine with credentials is provided access and therefore gets one unit of access. With 2-trust, we’d need at least two units of trust, so two machines. Equivalently, each credential-bearing machine is half trusted (think ssh bastion hosts or 2FA / mobikeys for 2-trust).

This generalizes to 1/N, so for zero trust, we place 1/0 = infinite units of trust in every machine that has a credential. In other words, if we provision any one machine for access, we necessarily provision an unbounded number of other machines for the same level of access.
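In symbols (just restating the same back-of-envelope arithmetic):

```latex
% trust placed in each credential-bearing machine under an N-trust scheme
t(N) = \frac{1}{N}
% so ``zero trust'' (N = 0) assigns unbounded trust per machine:
\lim_{N \to 0^{+}} t(N) = \infty
```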

As snarky as this math is, I’ve yet to see a more accurate formulation of what zero trust architectures actually provide.

YMMV.

choeger 10/24/2024|||
I think your model is absolutely right. But there's a catch: Zero Trust (TM) is about not giving any machine any particular kind of access. So it's an infinite number of machines with zero access.

The point of Zero Trust (TM) is to authenticate and authorize the human being behind the machine, not the machine itself.

(Clearly, that doesn't work for all kinds of automated access and it comes with a lot of questions in terms of implementation details (e.g., do we trust the 2FA device?) but that's the gist.)

glitchc 10/24/2024||||
That's not the intention of zero-trust. As others have said, it's about authenticating the user and associated privilege, not the machine itself. Simply put, zero trust means machines on the intranet must undergo a user-centric authentication and authorization step prior to accessing any resource. Additionally, once authenticated, a distinct secure channel can be established between the specific endpoint and the resource that cannot be observed or manipulated by others on the same network.
EthanHeilman 10/24/2024|||
In my view the eventual goal of security is to reduce all excess trust to zero. Excess trust is all trust which is not fundamental to the thing you are trying to do. If you want a feature that lets Alice update policy, you need to trust Alice to update policy. I believe that a system without any excess trust is worth building; that's why I founded BastionZero and why I joined Cloudflare to work on this.

Getting there is a long walk through the woods on a moonless night.

> With 2-trust, we’d need at least two units of trust, so two machines. Equivalently, each credential-bearing machine is half trusted (think ssh bastion hosts or 2FA / mobikeys for 2-trust).

You might be interested in OpenPubkey[0, 1], which was developed at BastionZero. It has 1/2 trust for OpenID Connect and can be used for SSH.

> As snarky as this math is, I’ve yet to see a more accurate formulation of what zero trust architectures actually provide.

I prefer the term epsilon-trust to reflect the nature of security and trust reduction as an iterative process. The trust in a system approaches but never fully reaches zero.

[0]: OpenPubkey: Augmenting OpenID Connect with User held Signing Keys https://eprint.iacr.org/2023/296

[1]: https://github.com/openpubkey/openpubkey/

varenc 10/24/2024|||
https://www.cloudflare.com/learning/security/glossary/what-i...

Zero Trust just means you stop inherently trusting your private network and verify every user/device/request regardless. If you opt in to using Cloudflare to do this then it requires running Cloudflare software.

PLG88 10/24/2024|||
That's one interpretation... ZT also posits assuming the network is compromised and hostile, and that applies to CF and their cloud/network too. It blows my mind that so many solutions claim ZT while mandating TLS to their infra/cloud, trusting them to decrypt your data, and worst of all IMHO, MITMing your OIDC/SAML key to ensure the endpoint can authenticate and access services... that is a hell of a lot of implicit trust in them, not least because they can be served a court order to decrypt your data.

Zero trust done correctly does not have those same drawbacks.

sshine 10/24/2024||
One element is buzzword inflation, and another is raising the bar.

On the one hand, entirely trusting Cloudflare isn't really zero trust.

On the other hand, not trusting any network is one narrow definition.

I'll give you SSH keys when you pry them from my cold, dead FDE SSDs.

michaelt 10/24/2024||||
Zero Trust means you stop trusting your private network, and start trusting Cloudflare, and installing their special root certificate so they can MITM all your web traffic. To keep you safe.
gobip 10/24/2024||
Same thing with their "serverless" servers where you host everything there.
bdd8f1df777b 10/24/2024|||
But with public key auth I'm already distrusting everyone on my private network.
resoluteteeth 10/24/2024||
Technically I guess that's "zero trust" in the sense of not trusting internal connections more than external ones, but in practice "zero trust" also typically entails making every connection go through the same user-based authentication system, which uploading specific keys to specific servers manually definitely doesn't achieve.
fs111 10/24/2024|||
Zero Trust is a marketing label that executives can seek out and buy a thing for because it is super-hot thing to have these days. That's mostly it.
ozim 10/24/2024|||
“Zero Trust” means not assuming a user has access or is somehow trusted just because they are in a trusted context. So you always check the user's access rights.

TLS having a trusted CA cert publisher is not what “Zero Trust” is about.

quectophoton 10/24/2024||
Question. Not specifically for you, but related to this comment.

Would this mean that a PostgreSQL listening on localhost and always asking for user and password is considered Zero Trust, but peer authentication is not?

This part is always a bit confusing for me because there's already been authentication (OS login) creating a session for a specific user (OS user) accessing a specific service (through a unix domain socket) with the specific connection being validated (the unix domain socket permissions).

And from my limited knowledge, the OS login looks like an IdP (Identity Provider), the OS session looks like a JWT already validated by a middleware (the OS vs some API Gateway), connecting to a service using this "token" (OS session vs JWT), and only allowing access to this specific connection (the connection to the socket) if the token is valid (OS session has permissions vs JWT has good signature) and has permissions to the application itself (PostgreSQL checking the connecting user has access to this resource vs the application checking the connecting user has access to this resource).

So I can see this as Zero Trust because the pattern is kinda matching ("the letter"), but also as Not Zero Trust because I feel like this would still be considered a "trusted context" by what the term tries to convey ("the spirit").
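For concreteness, the two modes I'm contrasting are one line each in pg_hba.conf (values illustrative):

```
# TYPE  DATABASE  USER  ADDRESS       METHOD
# localhost + password: every connection must present credentials
host    all       all   127.0.0.1/32  scram-sha-256
# unix socket + peer: trust the OS-level identity of the connecting user
local   all       all                 peer
```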

znpy 10/24/2024|||
> Why does the title say "Zero Trust", when the article explains that this only works as long as every involved component of the Cloudflare MitM keylogger and its CA can be trusted?

The truthiness of "zero trust" really depends on who's trusting whom.

pjc50 10/24/2024||
> Cloudflare MitM keylogger

Would you like to explain what you mean by this?

jdbernard 10/24/2024|||
Different responder, but I imagine they are referring to CloudFlare's stated ability to:

Provide command logs and session recordings to allow administrators to audit and replay their developers’ interactions with the organization’s infrastructure.

The only way they can do this is if they record and store the session text, effectively a keylogger between you and the machine you are SSH'ing into.

acdha 10/24/2024||
Keylogger has a specific meaning which doesn’t refer to audit logging. Trying to scare people by misusing loaded terms has the opposite effect from what you intend.
be_erik 10/24/2024|||
Keyloggers are absolutely used for audit logging. I've implemented these MiTM patterns specifically so we could log all keystrokes. The addition of a keylogger is only an issue if you don't trust Cloudflare, but usually a checklist item for these kinds of bastion hosts in certain compliance environments.
acdha 10/24/2024||
Yes, but it’s not a man in the middle attack when it’s monitoring your own servers any more than it’s a privacy breach when HR looks at your file. My intent was simply that trying to make things sound scary by using language normally used in adversarial contexts really isn’t helpful when talking about things companies need to do. There isn’t an expectation of privacy for what you do on company servers.
michaelt 10/24/2024|||
"mitm keylogger" has a specific meaning and refers to a party in the middle of a connection, logging the keystrokes.
acdha 10/24/2024||
Both terms are used to refer to attacks, not oversight of your own systems.
hiatus 10/24/2024||
No, here's a counter example: https://www.interguardsoftware.com/keylogger-software/
acdha 10/24/2024||
I’ll concede that keylogger is sometimes used in a corporate workstation monitoring context but it isn’t really the same as session monitoring on servers. The main thrust of my comment was simply that using loaded language to make common needs sound scary is distracting from rather than helping matters.
jdbernard 10/24/2024||
I think the original poster's intention was to be somewhat inflammatory as a way to draw attention to the very high level of trust you are granting to CloudFlare in this model. You are effectively giving them whatever privileges you yourself have on those boxes.

Of course, CloudFlare is making it their business to be and convince others that they are that trusted third-party.

debarshri 10/24/2024|||
In privileged access management platforms (including ours [1]), every operation that a user does is multiplexed via (stdout/stdin) and captured for auditing. This is a compliance requirement for SOX, PCI, etc.

[1] https://adaptive.dev

tptacek 10/23/2024||
I'm a fan of SSH certificates and cannot understand why anyone would set up certificate authentication with an external third-party CA. When I'm selling people on SSH CA's, the first thing I usually have to convince them of is that I'm not saying they should trust some third party. You know where all your servers are. External CAs exist to solve the counterparty introduction problem, which is a problem SSH servers do not have.
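For the unfamiliar, standing up your own SSH CA is just a few ssh-keygen calls; the file names and the one-hour validity below are illustrative, not prescriptive:

```shell
# generate the CA keypair (keep the CA private key offline / locked down)
ssh-keygen -t ed25519 -f ca -N '' -C 'example ssh ca'

# generate a user keypair, then sign its public key with the CA,
# producing a short-lived certificate (-V +1h) for principal 'alice'
ssh-keygen -t ed25519 -f user -N ''
ssh-keygen -s ca -I alice-laptop -n alice -V +1h user.pub

# servers trust the CA once, via sshd_config:
#   TrustedUserCAKeys /etc/ssh/ca.pub
ssh-keygen -L -f user-cert.pub   # inspect the issued certificate
```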
michaelt 10/24/2024||
> I'm a fan of SSH certificates and cannot understand why anyone would set up certificate authentication with an external third-party CA.

I think the sales pitch for these sorts of service is: "Get an SSH-like experience, but it integrates with your corporate single-sign-on system, has activity logs that can't be deleted even if you're root on the target, sorts out your every-ephemeral-cloud-instance-has-a-different-fingerprint issues, and we'll sort out all the reverse-tunnelling-through-NAT and bastion-host-for-virtual-private-cloud stuff too"

Big businesses pursuing SOC2 compliance love this sort of thing.

tptacek 10/24/2024||
We've been SOC2 Type 2 for several years and I'd push back on that.

https://fly.io/blog/soc2-the-screenshots-will-continue-until...

kevin_nisbet 10/24/2024|||
I'm with you; I imagine it's mostly people just drawing parallels: they can figure out how to get a web certificate, so they think SSH is the same thing.

The second order problem I've found is when you dig in there are plenty of people who ask for certs but, when push comes to shove, really want functionality where, when user access is cancelled, all active sessions get torn down immediately as well.

xyst 10/23/2024||
Same reasons for companies still buying “CrowdStrike” and installing that crapware. It’s all for regulatory checkboxes (ie, fedramp cert).
tptacek 10/23/2024||
I do not believe you in fact need any kind of SSH CA, let alone one run by a third party, to be FedRAMP-compliant.
mdaniel 10/23/2024||
I really enjoyed my time with Vault's ssh-ca (back when it had a sane license) but have now grown up and believe that any ssh access is an antipattern. For context, I'm also one of those "immutable OS or GTFO" chaps because in my experience the next thing that happens after some rando ssh-es into a machine is they launch vi or apt-get or whatever and now it's a snowflake with zero auditing of the actions taken to it

I don't mean to detract from this, because short-lived creds are always better, but for my money I hope I never have sshd running on any machine again

akira2501 10/24/2024||
> any ssh access is an antipattern.

Not generally. In one particular class of deployments allowing ssh access to root enabled accounts without auditing may be.. but this is an exceptionally narrowed definition.

> I hope I never have sshd running on any machine again

Sounds great for production and ridiculous for development and testing.

mdaniel 10/24/2024||
> Sounds great for production and ridiculous for development and testing.

I believe in "practice how you're going to play" to get devs into the habit of not using a crutch to treat deployments like they are their local machine. The time to anticipate failures is in the thinking time, not "throw it over the wall and we'll think later"

akira2501 10/24/2024||
Your devs should not be managing deployments. Making deployable software and actually worrying about the deployment environment don't need particularly tight coordination. I'd also worry that you're overfitting your software to whatever third party deployment target you've selected.
advael 10/24/2024|||
Principle of least privilege trivially prevents updating system packages. Like if you don't want people using apt, don't give people root on your servers?
blueflow 10/24/2024|||
Even for immutable OSes, SSH is a great protocol for bidirectionally authenticated data / file transfer.
ashconnor 10/24/2024|||
You can audit if you put something like hoop.dev, Tailscale, Teleport or Boundary in between the client and server.

Disclaimer: I work at Hashicorp.

LtWorf 10/24/2024||
But I avoid hascicorp stuff whenever I can!
ozim 10/23/2024|||
How do you handle the db?

Stuff I work on is write heavy so spawning dozens of app copies doesn’t make sense if I just hog the db with write locks.

mdaniel 10/23/2024||
I must resist the urge to write "users can access the DB via the APIs in front of it" :-D

But, seriously, Teleport (back before they did a licensing rug-pull) is great at that and no SSH required. I'm super positive there are a bazillion other "don't use ssh as a poor person's VPN" solutions

zavec 10/23/2024||
This led me to google "teleport license," which sounds like a search from a much more interesting world.
aspenmayer 10/24/2024|||
You might be interested in Peter F. Hamilton's Commonwealth Saga sci-fi series, then.

Among other tech, it involves the founding of a megacorp that exploits the discovery and monopolization of wormhole technology for profit, causing a rift between the two founders, who each remind me of Steve Jobs and Steve Wozniak in their cooperation and divergence.

https://en.wikipedia.org/wiki/Commonwealth_Saga

Hikikomori 10/24/2024||
Yo, dudes, how’s it hanging?
aspenmayer 10/24/2024||
Is this a reference to the books? It's been a while since I read them.
Hikikomori 10/24/2024||
Its what Ozzie or Nigel say over the radio after they landed.
aspenmayer 10/24/2024||
Ah yeah, that's a great scene! The bravado and hubris of gatecrashing an interplanetary livestream to launch your startup out of stealth is just chef's kiss.
mdaniel 10/24/2024|||
To save others the search: https://github.com/gravitational/teleport/pull/35259 Apache to AGPLv3
namxam 10/23/2024|||
But what is the alternative?
mdaniel 10/23/2024|||
There's not one answer to your question, but here's mine: kubelet and AWS SSM (which, to the best of my knowledge will work on non-AWS infra it just needs to be provided creds). Bottlerocket <https://github.com/bottlerocket-os/bottlerocket#setup> comes batteries included with both of those things, and is cheaply provisioned with (ahem) TOML user-data <https://github.com/bottlerocket-os/bottlerocket#description-...>

In that specific case, one can also have "systemd for normal people" via its support for static Pod definitions, so one can run containerized toys on boot even without being a formal member of a kubernetes cluster

AWS SSM provides auditing of what a person might normally type via ssh, and kubelet similarly, just at a different abstraction level. For clarity, I am aware that it's possible via some sshd trickery one could get similar audit and log egress, but I haven't seen one of those in practice whereas kubelet and AWS SSM provide it out of the box
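For reference, a static Pod is just a manifest dropped into the kubelet's watched directory; the path shown is the stock-kubelet default (Bottlerocket takes these through its settings API instead), and the image and args are placeholders:

```yaml
# /etc/kubernetes/manifests/node-exporter.yaml
# kubelet launches this at boot; no cluster membership required
apiVersion: v1
kind: Pod
metadata:
  name: node-exporter
spec:
  hostNetwork: true
  containers:
    - name: node-exporter
      image: quay.io/prometheus/node-exporter:latest
      args: ["--web.listen-address=:9100"]
```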

cyberax 10/23/2024|||
Be careful with SSM, it can provide pretty much unlimited access: https://github.com/Cyberax/gimlet

You can use it to tunnel arbitrary traffic inside your VPC.

_hyn3 10/23/2024|||
[dead]
ndndjdueej 10/24/2024||||
IaC, send out logs to Splunk, health checks, slow rollouts, feature flags etc?

Allow SSH in non prod environments and reproduce issue there?

In prod you are aiming for "not broken" rather than "do whatever I want as admin".

candiddevmike 10/24/2024|||
I built a config management tool, Etcha, that uses short lived JWTs. I extended it to offer a full shell over HTTP using JWTs:

https://etcha.dev/docs/guides/shell-access/

It works well and I can "expose" servers using reverse proxies since the entire shell session is over HTTP using SSE.

artificialLimbs 10/24/2024|||
I don’t understand why this is more secure than limiting SSH to local network only and doing ‘normal’ ssh hardening.
candiddevmike 10/24/2024||
None of that is required here? Etcha can be exposed on the Internet with a smaller risk profile than SSH:

- Sane, secure defaults

- HTTP-based--no fingerprinting, requires the correct path (which can be another secret), plays nicely with reverse proxies and forwarders (no need for jump boxes)

- Rate limited by default

- Only works with PKI auth

- Clients verify/validate HTTPS certificates, no need for SSHFP records.

g-b-r 10/24/2024|||
“All JWTs are sent with low expirations (5 seconds) to limit replayability”

Do you know how many times a few packets can be replayed in 5 seconds?

candiddevmike 10/24/2024||
Sure, but this is all happening over HTTPS (Etcha only listens on HTTPS), it's just an added form of protection/expiration.
riddley 10/23/2024||
How do you troubleshoot?
bigiain 10/23/2024|||
I think ssh-ing into production is a sign of not fully mature devops practices.

We are still stuck there, but we're striving to get to the place where we can turn off sshd on Prod and rely on the CI/CD pipeline to blow away and reprovision instances, and be 100% confident we can test and troubleshoot in dev and stage and by looking at off-instance logs from Prod.

How important it is to get there is something I ponder about my motivations for - it's clearly not worthwhile if your project is one or 2 prod servers perhaps running something like HA WordPress, but it's obvious that at Netflix type scale nobody is sshing into individual instances to troubleshoot. We are a long way (a long long long long way) from Netflix scale, and are unlikely to ever get there. But somewhere between dozens and hundreds of instances is about where I reckon the work required to get close to there starts paying off.

xorcist 10/24/2024|||
> at Netflix type scale that nobody is sshing into individual instances to troubleshoot

Have you worked at Netflix?

I haven't, but I have worked with large scale operations, and I wouldn't hesitate to say that the ability to ssh (or other ways to run commands remotely, which are all either built on ssh or likely not as secure and well tested) is absolutely crucial to running at scale.

The more complex and heterogeneous environments you have, the more likely you are to encounter strange flukes. Handshakes that only fail a fraction of a percent of the time, and so on. Multiple products and providers interacting. Tools like tcpdump and eBPF become essential.

Why would you want to deploy on a mature operating system such as Linux and not use tools such as eBPF? I know the modern way is just to yolo it and restart stuff that crashes, but as a startup or small scale you have other things to worry about. When you are at scale you really want to understand your performance profile and iron out all the kinks.

Hikikomori 10/24/2024||
Can also use stuff like Datadog NPM/APM that uses eBPF to pick up most of what you need. It's been a long time since I've needed anything else.
xorcist 11/1/2024||
Yes, there are numerous other ways to run remote commands than ssh, all of them less secure. (Running commands via your monitoring system can even be a very handy back door in a pinch.)

The argument here was that remote commands was less useful at scale, not that ssh was a particularly bad way of implementing it. Which doesn't make sense. You tend to have more complex system interactions at scale, not less.

imiric 10/23/2024||||
Right. The answer is having systems that are resilient to failure, and if they do fail being able to quickly replace any node, hopefully automatically, along with solid observability to give you insight into what failed and how to fix it. The process of logging into a machine to troubleshoot it in real-time while the system is on fire is so antiquated, not to mention stressful. On-call shouldn't really be a major part of our industry. Systems should be self-healing, and troubleshooting done during working hours.

Achieving this is difficult, but we have the tools to do it. The hurdles are often organizational rather than technical.

bigiain 10/23/2024|||
> The hurdles are often organizational rather than technical.

Yeah. And in my opinion "organizational" reasons can (and should) include "we are just not at the scale where achieving that makes sense".

If you have single digit numbers of machines, the whole solid observability/ automated node replacement/self-healing setup overhead is unlikely to pay off. Especially if the SLAs don't require 2am weekend hair-on-fire platform recovery. For a _lot_ things, you can almost completely avoid on-call incidents with straightforward redundant (over provisioned) HA architectures, no single points of failure, and sensible office hours only deployment rules (and never _ever_ deploy to Prod on a Friday afternoon).

Scrappy startups, and web/mobile platforms for anything where a few hours of downtime is not going to be an existential threat to the money flow or a big story in the tech press - probably have more important things to be doing than setting up log aggregation and request tracing. Work towards that, sure, but probably prioritise the dev productivity parts first. Get your CI/CD pipeline rock solid. Get some decent monitoring of the redundant components of your HA setup (as well as the Prod load balancer monitoring) so you know when you're degraded but not down (giving you some breathing space to troubleshoot).

And aspire to fully resilient systems and have a plan for what they might look like in the future to avoid painting yourself into a corner that makes it harder then necessary to get there one day.

But if you've got a guy spending 6 months setting up chaos monkey and chaos doctor for your WordPress site that's only getting a few thousand visits a day, you're definitely doing it wrong. Five nines are expensive. If your users are gonna be "happy enough" with three nines or even two nines, you've probably got way better things to do with that budget.

Aeolun 10/23/2024|||
> For a _lot_ things, you can almost completely avoid on-call incidents with straightforward redundant (over provisioned) HA architectures, no single points of failure, and sensible office hours only deployment rules (and never _ever_ deploy to Prod on a Friday afternoon).

For a lot of things the lack of complexity inherent in a single VPS server will mean you have better availability than any of those bizarrely complex autoscaling/recovery setups

imiric 10/24/2024|||
I'm not so sure about all of that.

The thing is that all companies regardless of their scale would benefit from these good practices. Scrappy startups definitely have more important things to do than maintaining their infra, whether that involves setting up observability and automation or manually troubleshooting and deploying. Both involve resources and trade-offs, but one of them eventually leads to a reduction of required resources and stability/reliability improvements, while the other leads to a hole of technical debt that is difficult to get out of if you ever want to improve stability/reliability.

What I find more harmful is the prevailing notion that "complexity" must be avoided at smaller scales, and that somehow copying a binary to a single VPS is the correct way to deploy at this stage. You see this in the sibling comment from Aeolun here.

The reality is that doing all of this right is an inherently complex problem. There's no getting around that. It's true that at smaller scales some of these practices can be ignored, and determining which is a skill on its own. But what usually happens is that companies build their own hodgepodge solutions to these problems as they run into them, which accumulate over time, and they end up having to maintain their Rube Goldberg machines in perpetuity because of sunk costs. This means that they never achieve the benefits they would have had they just adopted good practices and tooling from the start.

I'm not saying that starting with k8s and such is always a good idea, especially if the company is not well established yet, but we have tools and services nowadays that handle these problems for us. Shunning cloud providers, containers, k8s, or any other technology out of an irrational fear of complexity is more harmful than beneficial.

LtWorf 10/24/2024|||
If you don't know why they failed, replacing them is pointless.
naikrovek 10/24/2024||||
> I think ssh-ing into production is a sign of not fully mature devops practices.

that's great and completely correct when you are one of the very few places in the universe where everything is fully mature and stable. the rest of us work on software. :)

otabdeveloper4 10/24/2024||||
A whole lot of words to say "we don't troubleshoot and just live with bugs, #yolo".
sleepydog 10/24/2024|||
It's a good mindset to have, but I think ssh access should still be available as a last resort on prod systems, and perhaps trigger some sort of postmortem process, with steps to detect the problem without ssh in the future. There is always going to be a bug, that you cannot reproduce outside of prod, that you cannot diagnose with just a core dump, and that is a show stopper. It's one thing to ignore a minor performance degradation, but if the problem corrupts your state you cannot ignore it.

Moreover, if you are in the cloud, part of your infrastructure is not under your control, making it even harder to reproduce a problem.

I've worked with companies at Netflix's scale and they still have last-resort ssh access to their systems.

mdaniel 10/23/2024||||
In my world, if a developer needs access to the Node upon which their app is deployed to troubleshoot, that's 100% a bug in their application. I am cognizant that being whole-hog on 12 Factor apps is a journey, but for my money get on the train because "let me just ssh in and edit this one config file" is the road to ruin when no one knows who edited what to set it to what new value. Running $(kubectl edit) allows $(kubectl rollout undo) to put it back, and also shows what was changed from what to what
megous 10/24/2024|||
Your world is very narrow and limited. Some devs also have to deal with customer provisioned HW infrastructure, with buggy interactions between HW/virtualization solutions that every 5 minutes duplicate all packets for a few seconds; with applications that interact with customer only onsite HW you only have remote access to via production deployment; with quirky virtualization like vmware stopping the vCPU on you for hundreds of ms if you load it too much which you'll not replicate locally; with things you can't predict you'll need to observe ahead of time, etc. And it does not involve editing any configs. It's just troubleshooting.
yjftsjthsd-h 10/23/2024|||
How do you debug the worker itself?
mdaniel 10/23/2024|||
Separate from my sibling comment about AWS SSM, I also believe that if one cannot know that a Node is sick by the metrics or log egress from it, that's a deployment bug. I'm firmly in the "Cattle" camp, and am getting closer and closer to the "Reverse Uptime" camp - made easier by ASG's newfound "Instance Lifespan" setting to make it basically one-click to get onboard that train

Even as I type all these answers out, I'm super cognizant that there's not one hammer for all nails, and I am for sure guilty of yanking Nodes out of the ASG in order to figure out what the hell has gone wrong with them, but I try very very hard not to place my Nodes in a precarious situation to begin with so that such extreme troubleshooting becomes a minor severity incident and not Situation Normal

__turbobrew__ 10/24/2024|||
If accidentally nuking a single node while debugging causes issues you have bigger problems. Especially if you are running kubernetes any node should be able to fall off the earth at any time without issues.

I agree that you should set a maximum lifetime for a node on the order of a few weeks.

I also agree that you shouldn’t be giving randos access to production infra, but and the end of the day there needs to be some people at the company who have the keys to the kingdom because you don’t know what you don’t know and you need to be able to deal with unexpected faults or outages of the telemetry and logging systems.

I once bootstrapped an entire datacenter with tens of thousands of nodes from an SSH terminal after an abrupt power failure. It turns out infrastructure has lots of circular dependencies and we had to manually break that dependency.

ramzyo 10/24/2024||
Exactly this. Have heard it referred to as "break glass access". Some form of remote access, be it SSH or otherwise, in case of serious emergency.
viraptor 10/24/2024||||
Passive metrics/logs won't let you debug all the issues. At some point you either need a system for automatic memory dumps and submitting bpf scripts to live nodes... or you need SSH access to do that.
otabdeveloper4 10/24/2024||
This "system for automatic dumps" 100 percent uses ssh under the hood. Probably with some eternal sudo administrator key.

Personal ssh access is always better (from a security standpoint) than bot tokens and keys.

viraptor 10/24/2024||
There's a thousand ways to do it without SSH. It can be built into the app itself. It can be a special authenticated route to a suid script. It can be built into the current orchestration system. It can be pull-based using the a queue for system monitoring commands. It can be part of the existing monitoring agent. It can be run through AWS SSM. There's really no reason it has to be SSH.

And even with SSH you can have special keys authorised to run only specific commands, so a service account would be better than a personal one in that case.
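For reference, that restricted-key pattern is a single authorized_keys entry; the command path here is a made-up example:

```
# ~/.ssh/authorized_keys on the target host: this key may only run the
# pinned command, with no pty, forwarding, or tunneling
command="/usr/local/bin/collect-memdump.sh",no-pty,no-port-forwarding,no-agent-forwarding,no-X11-forwarding ssh-ed25519 AAAA... svc-debug@example
```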

acdha 10/24/2024|||
> Separate from my sibling comment about AWS SSM,

This seems like it’s conceding the point since SSM also allows you to run commands on nodes - I use it interchangeably with SSH to have Ansible manage legacy servers. Maybe what you’re trying to say is that it shouldn’t be routine and that there should be more of a review process so it’s not just a random unrestricted shell session? I think that’s less controversial, and especially when combined with some kind of “taint” mode where your access to a server triggers a rebuild after the dust has settled.

mdaniel 10/24/2024||
Yes, you nailed it with "it shouldn't be routine," and there for sure should be a review process. My primary concern with the audit logs actually isn't security; it's lowering the cowboy factor in the software lifecycle

> combined with some kind of “taint” mode where your access to a server triggers a rebuild after the dust has settled.

Oh, I love that idea: thanks for bringing it to my attention. I'll for sure incorporate that into my process going forward

acdha 10/24/2024||
The first time I heard it was a very simple idea: they had a wrapper for the command which installed SSH keys on an EC2 instance which also set a delete-after tag which CloudCustodian queried.
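That wrapper-plus-tag pattern is simple to sketch. A minimal version of the tag bookkeeping (the `delete-after` tag name comes from the comment above; the helper names and timestamp format are hypothetical, and the actual EC2/CloudCustodian calls are left out):

```python
from datetime import datetime, timedelta, timezone

TAG_FORMAT = "%Y-%m-%dT%H:%M:%SZ"  # assumed timestamp convention

def delete_after_tag(grace: timedelta) -> dict:
    """Tag the key-install wrapper would attach to the instance:
    'rebuild this machine once the grace period is over'."""
    expiry = datetime.now(timezone.utc) + grace
    return {"Key": "delete-after", "Value": expiry.strftime(TAG_FORMAT)}

def is_expired(tag_value: str, now: datetime) -> bool:
    """The query side (what CloudCustodian would evaluate):
    has the tagged deadline passed?"""
    deadline = datetime.strptime(tag_value, TAG_FORMAT).replace(tzinfo=timezone.utc)
    return now >= deadline
```

The nice property is that access itself taints the machine: installing a key schedules the rebuild, so no one has to remember to clean up.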
from-nibly 10/24/2024|||
You don't. You shoot it in the head and get a new one. If you need logging/telemetry, bake it into the image.
otabdeveloper4 10/24/2024||
Are you from techsupport?

Actually not every problem is solved with the "have you tried turning it off and back on again" trick.

mdaniel 10/24/2024||
No, what we're talking about is (to extend your very condescending tech support analogy) shipping the customer a new PC from the factory, and telling them to throw the old one away because it doesn't matter. It will only start to matter if they have 3 bad PCs in a row, at which time it becomes (a) a demonstrable failure and not just stray neutron rays (b) an incident which will carry a postmortem of how the organization could have prevented that failure for next time

I did start the whole thread by saying "and then I grew up," and not everyone is at the same place in their organizational maturity model. So, if you're happy with the process you have now, keep using it. I was unhappy, so I studied hard, incorporated supporting technology, and lobbied my heart out for change. Without maturity levels we'd all still be using telnet and tar based version control

otabdeveloper4 10/25/2024||
Once you reach the next maturity level you realize that some problems and bugs are from bad design, and cannot be fixed by restarting the server.

Fixing bad design is an art, not an organizational discipline. (Sadly.)

LtWorf 10/24/2024|||
He asks the senior developer to do it.
TechnicalVault 10/24/2024||
The whole MITM thing just makes me deeply uncomfortable: it's introducing a single point of trust with the keys to the kingdom. If I want to log what someone is doing, I do it server side, e.g. some kind of rsyslog. That way I can leverage existing log anomaly detection systems to pick up and isolate the server if we detect any bad behaviour.
naikrovek 10/24/2024|
yeah the MITM thing is ... concerning.

this just moves the trusted component from the SSH key to Cloudflare, and you still must trust something implicitly. except now it's a company that has agency and a will of its own instead of just some files on a filesystem.

I'll stick to forced key rotation, thanks.

EthanHeilman 10/24/2024||
> you still must trust something implicitly. except now it's a company that has agency and a will of its own instead of just some files on a filesystem.

Some keys on a file system on a large number of user endhosts is a security nightmare. At big companies user endhosts are compromised hourly.

When you say forced key rotation, how do you accomplish that and how often do you rotate? What if you want to disallow access to a user on a faster tempo than your rotation period? How do you ensure that you are giving out the new keys to only authorized people?

My experience has been, when you really invest in building a highly secure key rotation system, you end up building something similar to our system.

1. You want SSO integration with policy to ensure the right people get the right keys and the right keys end up on the right hosts. This is a hard problem.

2. You end up using a SSH CA with short lived certificates because "key expires after 3 minutes" is far more secure than "key rotated every 90 days".

3. Compliance requirements typically mandate session recording and logging, so you end up creating a MITM SSH proxy to do this.

Building all this stuff is expensive and it needs to be kept up to date. Instead of building it in-house and hoping you build it right, buy a zero trust SSH product.
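The short-lived-certificate part of point 2 is the one piece you can sketch with stock OpenSSH. A minimal wrapper around `ssh-keygen` certificate signing (the paths, principal, and helper name are hypothetical, and this says nothing about how any vendor's product is actually built):

```python
import subprocess

def issue_short_lived_cert(ca_key: str, user_pub: str, principal: str,
                           validity: str = "+3m") -> str:
    """Sign user_pub with the SSH CA private key, valid for e.g. 3 minutes.

    ssh-keygen writes the certificate next to the public key as
    <name>-cert.pub. Servers trust the CA via a TrustedUserCAKeys
    line in sshd_config, so no per-user key distribution is needed.
    """
    subprocess.run(
        ["ssh-keygen",
         "-s", ca_key,     # CA private key used to sign
         "-I", principal,  # certificate identity (shows up in server logs)
         "-n", principal,  # principals (login names) the cert is valid for
         "-V", validity,   # validity interval, e.g. "+3m" = 3 minutes from now
         user_pub],
        check=True,
    )
    return user_pub[: -len(".pub")] + "-cert.pub"
```

The hard parts this sketch skips are exactly the expensive ones: gating the signing step behind SSO, choosing principals from policy, and logging every issuance.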

For many companies the alternative isn't key rotation, it's just an endless growing set of keys that never expire. To quote Tatu Ylonen, the inventor of SSH:

> "In analyzing SSH keys for dozens of large enterprises, it has turned out that in many environments 90% of all authorized keys are no longer used. They represent access that was provisioned, but never terminated when the person left or the need for access ceased to exist. Some of the authorized keys are 10-20 years old, and typically about 10% of them grant root access or other privileged access. The vast majority of private user keys found in most environments do not have passphrases."

Challenges in Managing SSH Keys – and a Call for Solutions https://ylonen.org/papers/ssh-key-challenges.pdf

antoniomika 10/23/2024||
I wrote a system that did this >5 years ago (luckily was able to open source it before the startup went under[0]). The bastion would record ssh sessions in asciicast v2 format and store those for later playback directly from a control panel. The main issue that still isn't solved by a solution like this is user management on the remote (ssh server) side. In a more recent implementation, integration with LDAP made the most sense and allows for separation of user and login credentials. A single integrated solution is likely the holy grail in this space.

[0] https://github.com/notion/bastion

mdaniel 10/23/2024|
Out of curiosity, why ignore this PR? https://github.com/notion/bastion/pull/13

I would think even a simple "sorry, this change does not align with the project's goals" -> closed would help the submitter (and others) have some clarity versus the PR limbo it's currently in

That aside, thanks so much for pointing this out: it looks like good fun, especially the Asciicast support!

antoniomika 10/23/2024||
Honestly never had a chance to merge it/review it. Once the company wound down, I had to move onto other things (find a new job, work on other priorities, etc) and lost access to be able to do anything with it after. I thought about forking it and modernizing it but never came to fruition.
shermantanktop 10/23/2024||
I didn’t understand the marketing term “zero trust” and I still don’t.

In practice, I get it - a network zone shouldn’t require a lower authn/z bar on the implicit assumption that admission to that zone must have required a higher bar.

But all these systems are built on trust, and if it isn’t based on network zoning, it’s based on something else. Maybe that other thing is better, maybe not. But it exists and it needs to be understood.

An actual zero trust system is the proverbial unpowered computer in a bunker.

athorax 10/24/2024||
It means there is zero trust of a device/service/user on your network until they have been fully authenticated. It is about having zero trust in something just because it is inside your network perimeter.
shermantanktop 10/24/2024||
Maybe it should be called "zero trust of a device/service/user on your network until they have been fully authenticated." But that wouldn't sell high-dollar consulting services.
wmf 10/23/2024|||
The something else is specifically user/service identity. Not machine identity, not IP address. It is somewhat silly to have a buzzword that means "no, actually authenticate users" but here we are.
ngneer 10/23/2024|||
With you there. The marketing term makes Zero Sense to me.
acdha 10/24/2024||
Yeah, it’s not a great name. Twenty years ago we called it “end to end authentication” and I think that’s better because it focuses on the most important aspect, but it probably doesn’t sound as cool for marketing purposes.

I also like how that makes it easier to understand how variation is normal: for example, authentication comes in various flavors and that’s okay whereas some of that zero trust vendors will try to claim that something is or isn’t ZT based on feature gaps in their competitors’ and it’s just so tedious to play that game.

blueflow 10/24/2024||
Instead of stealing your password/keypair, the baddies will now have to spoof your authentication with cloudflare. If thats just a password, you gained nothing. If you have 2FA set up for that, you could equally use that for SSH directly, using a ssh key on a physical FIDO stick. OpenSSH already has native support for that (ecdsa-sk and ed25519-sk key formats).

The gain here is minimal.

keepamovin 10/24/2024||
Does this give CloudFlare a backdoor to all your servers? That would not strictly be ZT, as some identify in the comments here.
udev4096 10/24/2024||
For Cloudflare, all their fancy ZT excludes themselves. It's just like the well-known MITM they perform using their CA.
megous 10/24/2024||
Sounds like their modus operandi for most of their products, incl. the original one.
keepamovin 10/24/2024||
And if China hacks CloudFlare? I guess we're all fucked.
knallfrosch 10/24/2024|||
Everything rests on CloudFlare's key.
ChoHag 10/24/2024||
[dead]
johnklos 10/23/2024||
So... don't trust long lived ssh keys, but trust Cloudflare's CA. Why? What has Cloudflare done to earn trust?

If that alone weren't reason enough to dismiss this, the article has marketing BS throughout. For instance, "SSH access to a server often comes with elevated privileges". Ummm... Every authentication system ever has whatever privileges that come with that authentication system. This is the kind of bull you say / write when you want to snow someone who doesn't know any better. To those of us who do understand this, this is almost AI level bullshit.

The same is true of their supposed selling points:

> Author fine-grained policy to govern who can SSH to your servers and through which SSH user(s) they can log in as.

That's exactly what ssh does. You set up precisely which authentication methods you accept, you set up keys for exactly that purpose, and you set up individual accounts. Do Cloudflare really think we're setting up a single user account and giving access to lots of different people, and we need them to save us? (now that I think about it, I bet some people do this, but this is still a ridiculous selling point)

> Monitor infrastructure access with Access and SSH command logs

So they're MITMing all of our connections? We're supposed to trust them, even though they have a long history of not only working with scammers and malicious actors, but protecting them?

I suppose there's a sucker born every minute, so Cloudflare will undoubtedly sell some people on this silliness, but to me it just looks like yet another way that Cloudflare wants to recentralize the Internet around them. If they had their way, then in a few years, were they to go down, a majority of the Internet would literally stop working. That should scare everyone.

EthanHeilman 10/23/2024|
I'm a member of the team that worked on this happy to answer any questions.

We (BastionZero) recently got bought by Cloudflare and it is exciting bringing our SSH ideas to Cloudflare.

lenova 10/23/2024||
I'd love to hear about the acquisition story with Cloudflare.
EthanHeilman 10/24/2024|||
Any particular questions?

So far my experience with joining and working at Cloudflare has been fantastic. Coming from a background of startups and academia, the size and scope of what Cloudflare is building and currently runs is overwhelming.

In academia I've seen lots of excellent academic computer science papers that never benefit anyone because they never get turned into a tool that someone can just pick up and use. Ideas have inherent value, even useless ideas, but it feels good to see great ideas have impact. What appealed to me the most about getting acquired by Cloudflare is seeing research applied directly to products and used by people. Cloudflare does an excellent job both inventing innovative ideas and then actually making them real. There used to be a lot of companies that did this 10 years ago, but Cloudflare now seems rare in that respect.

FlyingSnake 10/24/2024|||
You can read the details here: https://blog.cloudflare.com/cloudflare-acquires-bastionzero/
mdaniel 10/23/2024||
I just wanted to offer my congratulations on the acquisition. I don't know any details about your specific one, but I have been around enough to know that it's still worth celebrating o/
More comments...