Posted by elashri 4 days ago
> NumPy doesn’t offer a way to store data outside of the array buffer—there’s no concept of “sidecar storage” in NumPy.
But then it goes on to say the strings are stored on the heap (which clearly is also possible with dtype=object) with an arena allocator. Reading the NEP now.
EDIT: looking back at the NEP, I'm not sure it does a great job explaining exactly how the per-array descriptor works. Ultimately it's powered by a hook in the DType API: https://github.com/numpy/numpy/pull/24988. There is only one spot in NumPy where array buffers are allocated, so we hooked there and made sure any arrays with newly allocated buffers get a new DType instance.
Beyond "just" better string arrays, my favorite side effect of this is efficient NaN support in string arrays. The article talks about this a lot, but I had already started this comment before fully reading the article :p
I mean, sure, the old approach was object arrays, and you can do it there because each element is an independent object, but they're super inefficient. This both makes things efficient _and_ has a really cool side effect of supporting something that had become common partly as an accident of the old object array approach - NaNs in arrays of strings.
This is really really really useful work and it's _super_ cool!!
I agree, I couldn't really figure out how the new NumPy string data type makes it work though.
> So, the actual array buffer doesn’t contain any string data—it contains pointers to the actual string data.
"The idea /we/ came up with"?? :)
One issue with using Arrow directly in NumPy is PyArrow exposes an immutable 1D array, while NumPy exposes a mutable ND array.
See also https://numpy.org/neps/nep-0055-string_dtype.html#related-wo...
That said, there is a branch that gets most of the way there: https://github.com/pandas-dev/pandas/pull/58578. The remaining challenges are mostly around reaching consensus on how to introduce this change.
If NumPy had StringDType in 2019 instead of 2024 I think Pandas might have had an easier time. Sadly the timing didn’t quite work out.
This is not a discussion about Tim Peters, and your own link proves that this topic has already been aired in this community.
I want to hear more about this project. I don't want to see it hijacked to rehash an irrelevant contentious issue.