Top
Best
New

Posted by thorel 1 day ago

Finding and fixing Ghostty's largest memory leak(mitchellh.com)
529 points | 117 commentspage 2
dangoodmanUT 21 hours ago|
waiting for someone to say "this wouldn't have happen if you chose rust"
woodruffw 21 hours ago|
You’ll probably be waiting a long time, since Rust very explicitly doesn’t have “leak safety” as a constructive property. Safe Rust programs are allowed to leak memory, because memory leaks themselves don’t cause safety issues.

There’s even a standard, non-unsafe API for leaking memory[1].

(What Rust does do is make it harder to construct programs that leak memory unintentionally. It’s possible but not guaranteed that a similar leak would be difficult to express idiomatically in Rust.)

[1]: https://doc.rust-lang.org/std/boxed/struct.Box.html#method.l...

tialaramex 18 hours ago||
The specific language feature you want if you insist that you don't want this kind of leak is Linear Types.

Rust has Affine Types. This means Rust cares that for any value V of type T, Rust can see that we did not destroy V twice (or more often).

With Linear Types the compiler checks that you destroyed V exactly once, not less and not more.

However, one reason I don't end up caring about Leak Safety of this sort is that in fact users do not care that you didn't "leak" data in this nerd sense. In this nerd sense what matters is only leaks where we lost all reference to the heap data. But from a user's perspective it's just as bad if we did have the reference but we forgot - or even decided explicitly not - to throw it away and get back the RAM.

The obvious way to make this mistake "by accident" in Rust is to have two things which keep each other alive via reference counting and yet have been disconnected and forgotten by the rest of the system. A typical garbage collected language would notice that these are garbage and destroy them both, but Rust isn't a GC language of course. Calling Box::leak isn't likely to happen by accident (though you might mistakenly believe you will call it only once but actually use it much more often)

I think the main part of Ghostty's design mentioned here that - as a Rust programmer - I think is probably a mistake is the choice to use a linked list. To me this looks exactly like it needs VecDeque, a circular buffer backed by a growable array type. Their "clever" typical case where you emit more text and so your oldest page is scrapped and re-used to form your newest page, works very nicely in VecDeque, and it seems like they never want the esoteric fast things a linked list can do, nor do they need multi-writer concurrency like the guts of an OS kernel, they want O(1) pop & push from opposite ends. Zig's Deque is probably that same thing but in Zig.

vlovich123 17 hours ago|||
The issue isn’t linked list vs dequeue but type confusion about what was in the container. They didn’t forget to drop it - they got confused about which type was in the list when popping and returned it to the pool instead of munmap.

The way to solve this in Rust would be to put this logic in the drop and hide each page type in an enum. That way you can’t ever confuse the types or what happens when you drop.

hardwaresofton 13 hours ago||
Was going to say this, but I don't think anyone actually wants to hear that Rust actually would have helped here.

As you're saying, the bug was the equivalent of an incorrectly written Drop implementation.

Nothing against Zig, and people not using Rust is just fine, but this is what happens when you want C-like feel for your language. You miss out on useful abstractions along with the superfluous ones.

"We don't need destructors, defer/errdefer is enough" is Zig's stance, and it was mostly OK.

Impossible to predict this kind of issue when choosing a project language (and it's already been discussed why Zig was chosen over Rust for Ghostty, which is fine!), so it's not a reason to always choose Rust over Zig, but sometimes that slightly annoying ceremony is useful!

Maybe some day I'll be smart enough to write Zig as a default over Rust, but until that day I'm going to pay the complexity price to get more safety and keep more safety mechanisms on the shotgun aimed at my foot. I've got plenty of other bugs I can spend time writing.

Another good example is the type vs type alias vs wrapper type debate. It's probably not reasonable to use a wrapper type every single time (e.g. num_seconds probably can probably be a u32 and not a Seconds type), but it's really a Rorschach test because some people lean towards one end versus the other for whatever reason, and the plusses/minuses are different depending on where you land on the spectrum.

[EDIT] also some good discussion here

https://ziggit.dev/t/zig-what-i-think-after-months-of-using-...

weebull 10 hours ago|||
> "We don't need destructors, defer/errdefer is enough" is Zig's stance, and it was mostly OK.

There's more than that. Zig has leak detecting memory allocators as well, but they only detect the leak if it happens. Nobody had a reliable reproduction method until recently.

dnautics 12 hours ago||||
I don't know if this particular error would have been findable with zig-clr, but you don't need RAII. Errdefer/defer is enough, if you have an alogrithm checking your work.
hardwaresofton 6 hours ago||
It’s not that you NEED RAII (or any other language abstraction), it’s that this case would have been avoided with that usage.

Clearly, the current state of things was not enough.

AndyKelley 12 hours ago|||
If you wanted to match Ghostty's performance in Rust, you'd need to use unsafe in order to use these memory mapping APIs, then you'd be in the exact same boat. Actually you'd be in a worse boat because Zig is safer than unsafe Rust.
hardwaresofton 7 hours ago|||
> If you wanted to match Ghostty's performance in Rust, you'd need to use unsafe in order to use these memory mapping APIs, then you'd be in the exact same boat.

Yea, but not for all the parts — being able to isolate the unsafe and build abstractions that ensure certain usage parts of the unsafe stuff is a key part of high quality rust code that uses unsafe.

In this case though I think the emphasis is on the fact that there is a place where that code should have been in Rust land, and writing that function would have made it clear and likely avoided the confusion.

Less about unsafe and more about the resulting structure of code.

> Actually you'd be in a worse boat because Zig is safer than unsafe Rust

Other people have mentioned it but I disagree with this assertion.

Its a bit simplistic but I view it this way — every line of C/Zig is unsafe (lots of quibbling to do about what “unsafe” means of course) while some lines of rust are unsafe. Really hard for that assertion to make sense under that world view.

That said, I’m not gonna miss this chance to thank you and the Zig foundation and ecosystem for creating and continuously improving Zig! Thanks for all the hard work and thoughtful API design that has sparked conversation and progress.

tialaramex 8 hours ago||||
I don't buy your theory that a Rust terminal would need to directly use mmap to deliver matching performance. In fact I doubt Ghostty's author would endorse this claim either, they've never tried any alternatives, they tried this and it works for their purpose which is a long way from other ways wouldn't work or would all be slower or whatever.
Zakis1 9 hours ago||||
[dead]
vlovich123 12 hours ago|||
Please don’t get defensive and spread silly FUD. You can be proud of what you’ve accomplished without feeling sad that a different language has strengths that yours doesn’t.

Calling unsafe mmap APIs not only is unlikely to run into the corner cases where unsafe Rust is tricky to get right, there’s “millions” of crates that offer safe APIs to do so and it’s fundamentally not hard to write it safely (it would be very hard to write it to have any issues).

And fundamentally I think Rust is much more likely to be easier to get high performance because the vast majority of safe code you write is amenable to the compiler performing safe optimizations that Zig just can’t do regarding pointer aliasing (or if it does brings all the risks of of unsafe Rust when the user annotates something incorrectly).

uecker 11 hours ago||
I don't think this is silly FUD. The article describes a scenario where the low-level abstractions itself was buggy in a subtle way, the comparison to "unsafe" Rust seems entirely fair to me. (edited for typos)
tialaramex 9 hours ago|||
With Rust you always could unsafely do whatever went wrong in somebody's C or Zig or whatever, but the question is whether you would. Rust's technical design reinforces a culture where the answer is usually "No".

I don't find the claim that weird low level mmap tricks here are perf critical at all persuasive. The page recycling makes sense - I can see why that's helping performance, but the bare metal mmap calls smell to me like somebody wanted to learn about mmap and this was their excuse. Which is fine - I need to be clear about that - but it's not actually crucial to end users being happy with this software.

vlovich123 5 hours ago|||
I think we can agree that Mitchell knows what he’s doing and isn’t playing around with mmap just because. It’s probably quite important to ensure a low memory footprint. But mmap in rust is not extra risky in some weird mystical way. It’s just a normal FFI function to get a pointer back and you can trivially build safe abstractions around it to ensure the lifetime of a slice doesn’t exceed the lifetime of the underlying map. It’s rust 101 and there’s nothing weird here that can cause the unsafe bits here to be extra dangerous (in general unsafe rust can be difficult to get right with certain constructs, but it doesn’t apply here).
uecker 9 hours ago|||
Also in C or Zig you do not need to create your own memory management using mmap. Whether this is necessary in this case or not is a different question.

In the end, if the Rust advantage is that "Rust's technical design reinforces a culture" where one tries to avoid this, then this is a rather weak argument. We will see how this turns out in the long run though.

vlovich123 5 hours ago|||
The long run has already spoken. Go look at the reports out of Microsoft and Android. It’s screamingly clear that the philosophy of Rust that most code can be written in safe with small bits in unsafe is inherently safer. The defect rate plummets by one or two orders of magnitude if I recall correctly. C is an absolute failure (since it’s the baseline) and Zig has no similar adoption studies. You could argue it will be similar if you always compile releasesafe, but then performance will be worse than C or Rust due to all the checks and it’s unclear how big a while the places that aren’t dynamically checked are.

Oh and of course rust is inherently slightly faster because no reference aliasing is allowed and automatically annotated everywhere which allows for significant aggressive compiler optimizations that neither C nor Zig can do automatically and is risky to do by hand.

uecker 16 minutes ago||
I don't put too much wait on the self-reporting by Microsoft or Google. I agree though that the strategy to write safe bits and abstractions is good. What I know not to be true is the idea that similar strategies would not work also in C.
tialaramex 8 hours ago||||
> We will see how this turns out in the long run though.

Rust 1.0 was in 2015. This is the long run. And I disagree that safety culture is a "weak argument". It's foundational, this is where you must start, adding it afterwards is a Herculean task, so no surprise that people aren't really trying.

uecker 38 minutes ago||
I am not saying that safety culture is irrelevant, not at all. I am saying that if the advantage of Rust is the culture that emphasizes safety (or rather memory safety, if the Rust community cared about safety in general cargo would not exist in this form) then that is a weak argument.

I don't think 10 years ago there was a lot of Rust used, so I am not sure how relevant it is that 1.0 was released at this time.

vacuity 4 hours ago|||
Of course, culture and technical design are both important for any language, but be specific. Despite the prevalence of tools that improve C's safety, writing C safely generally requires a culture of using those tools and other techniques. For better or worse, Rust's borrow checker is a clear demonstration of where Rust lies on the safety-freedom spectrum.
vlovich123 5 hours ago|||
The low level abstraction was buggy because they forgot to free memory because they confused types, not because of mmap.

Thats completely orthogonal to the question and less likely in Rust because you would generally use an enum with Drop implemented for the interior of the variants to guarantee correct release.

And mmap is no more difficult to call in Rust nor more magically unsafe - that’s the FUD. The vast majority of Ghostty wouldn’t even need unsafe meaning the vast majority of code gets optimized more due to no aliasing being automatic everywhere and why the argument that “zig is safer than unsafe rust” is disingenuous about performance or safety of the overall program.

aw1621107 17 hours ago|||
> I think the main part of Ghostty's design mentioned here that - as a Rust programmer - I think is probably a mistake is the choice to use a linked list. To me this looks exactly like it needs VecDeque, a circular buffer backed by a growable array type.

This comment [0] by mitchellh on the corresponding lobste.rs submission discusses the choice of data structure a bit more:

> Circular buffer is a pretty standard approach to this problem. I think it's what most terminal emulators do.

> The reason I went with this doubly linked list approach with Ghostty is because architecturally it makes it easier for us to support some other features that either exist or are planned.

> As an example of planned, one of the most upvoted feature requests is the ability for Ghostty to persist scroll back across relaunch (macOS built-in terminal does this and maybe iTerm2). By using a paged linked list architecture, we can take pages that no longer contain the active area (and therefore are read-only) and archive them off the IO thread during destroy when we need to prune scroll back. We don't need to ever worry that the IO thread might circle around and produce a read/write data race.

> Or another example that we don't do yet, we can convert the format of scroll back history into a much more compressed form (maybe literally compressed memory using something like zstd) so we can trade off memory for cpu if users are willing to pay a [small, probably imperceptible] CPU time cost when you scroll up.

[0]: https://lobste.rs/s/vlzg2m/finding_fixing_ghostty_s_largest_...

liveoneggs 6 hours ago||
claude code also has a weird thing in ghostty where it breaks copy-paste after exiting. `reset` fixes it but it's annoying
drob518 21 hours ago||
Why not just use a circular buffer for the scroll back? Why use blocks at all if you’re just going to recycle them anyway? That said, great write-up.
mitchellh 21 hours ago|
It started that way, and that's a common way to do this. One of the reasons is to avoid large pre-allocations OR large copies. A few other notes over on lobsters: https://lobste.rs/s/vlzg2m/finding_fixing_ghostty_s_largest_...
drob518 19 hours ago||
Cool, thanks for the link.
Neywiny 21 hours ago||
Edit: I'm getting a lot of down votes for this but nobody is saying why I'm wrong. If you think I'm wrong enough to down vote, please reply why.

I don't understand why that is the preferred fix. I would have solved it other ways:

1. When resizing the page, leave some flag of how it was allocated. This tagging is commonly done as the always 0 bits in size or address fields to save space.

2. Since the pool is a known size of contiguous memory, check if the memory to be freed is within that range

3. Make the size immutable. If you want to realloc, go for it, and have the memory manager handle that boundary for you.

Both of those not only maintain functionality which seems to have been lost with the feature reduction but also are more future proof to any other changes in size.

bastawhiz 19 hours ago||
I didn't downvote, but I suspect it's an easy answer: the fix was like four lines.

At the end of the day, #1 and #3 both probably add a fairly significant amount of code and complexity that it's not clear to me adds robustness or clarity. From the fix:

``` // If our first node has non-standard memory size, we can't reuse // it. This is because our initBuf below would change the underlying // memory length which would break our memory free outside the pool. // It is easiest in this case to prune the node. ```

https://github.com/ghostty-org/ghostty/commit/17da13840dc71b...

#3, it seems, would require making a broader change. The size effectively is immutable now (assuming I'm understanding your comment correctly): non-standard pages never change size, they get discarded without trying to change their size.

#2 is interesting, but I think it won't work because the implementation of MemoryPool doesn't seem like it would make it easy to test ownership:

https://github.com/ghostty-org/ghostty/blob/17da13840dc71ba3...

You'd have to make some changes to be able to check the arena buffers, and that check would be far slower than the simple comparison.

Neywiny 18 hours ago||
Thank you. I think each of my options are pretty trivial in C. I guess what I'm not understanding for #3 is if size is immutable, how the size changed which caused the issue? The post said they changed the size of the page without changing the underlying size of the allocated memory. To me this is the big issue. There was a desync in information where the underlying assumption is that size tells you where the data came from and that the size of the metadata and the size of the allocation move in tandem across that boundary.

#1 and #2 are fixes for breaking that implicit trust. #1 still trusts the metadata, #2 is what I'd consider the most robust solution is that not only is it ideally trivial (just compare if a pointer is within a range, assuming zig can do that) but it doesn't rely on metadata being correct. #3 prevents the desync.

I really don't understand the code base enough to say definitively that my ways work, which is I guess what I'm really looking for feedback on. Looking at the memorypool, I think you're right that my assumption of it being a simple contiguous array was incorrect.

ETA: I think I'm actually very wrong for #2. Color me surprised that the zig memory pool allocated each item separately instead of as one big block. Feels like a waste, but I'm sure they have their reasons. That's addCapacity in memory_pool.zig

bastawhiz 17 hours ago||
I'm not 100%, but my understanding was that the non standard pages are always larger than the standard pages. If you need more than a standard page, you always get a freshly allocated non standard page. But when one was released, it was being treated as though it was standard sized. The pool would then reuse that memory, but only at a standard size. So every released non standard page leaked the difference between what was allocated and what was standard.

Which is to say, I don't think it was actually being resized. I think it was the metadata for the page saying it had the (incorrect) standard size (and the incorrect handling after the metadata was changed).

Neywiny 16 hours ago||
Yes that last point was what I meant. I see no reason that the metadata's size field should get updated without some realloc of the memory it points to. I think I'll need to look into the actual code to see what's going on there, though, because we may both be misunderstanding. It just seems very error prone to categorize how you free memory based on a field that by the time you get to `free` has no guaranteed relationship with where the memory came from. I think that should be fixed. What was done in the blog is more of a band-aid imo.
hotpotat 20 hours ago|||
I upvoted you because I would like to know the response to these approaches
Neywiny 20 hours ago||
Thank you. Sometimes I get to like -4 or even -7 before it starts going up. It might be nice to graph it at some point to see my most varied comments. I'm at -2 right now

23 minutes later I'm at +2

6 minutes after, +5 +4min now +6, another 20 minutes +8. I think I'm in the clear

yakaccount4 20 hours ago|||
I just stopped caring about votes. It's often driven by inertia, and it can't differentiate a vote from someone who doesn't know anything vs a domain expert. Life is better once you stop caring about karma points.
Neywiny 20 hours ago||
While very true and sound advice, fake internet points make dopamine go brrrrr
hotpotat 22 hours ago||
speaking of claude code in Ghostty, I’ve noticed I can’t drag and drop images into the prompt when the session is within a tmux pane. I miss that, coming from the mac terminal app, which allowed me to do so. I’d be willing to look into this myself, but mention it in case someone already knows where to start looking.
sean_pedersen 16 hours ago||
Would this kind of bug have been catched by the Rust compiler?
autarch 16 hours ago|
I was wondering about this myself. My guess is no, since AFAIK the only way to do this sort manual memory management is to use unsafe code. But there's also things like the (bumpalo)[https://docs.rs/bumpalo/latest/bumpalo] crate in Rust, so maybe you wouldn't need to do this sort of thing by hand, in which case you're as leak-free as the bumpalo crate.
cyh555 14 hours ago|
I wonder how a Rust-based terminal implements this without sacrificing performance.
0xbrayo 12 hours ago|
Someone check out alacritty source code and answer it for us
rrgok 12 hours ago||
Ask Claude Code about it using Ghostty without this fix ;)
More comments...