What Category Theory Teaches Us About DataFrames

Posted by mchav 5 days ago

What Category Theory Teaches Us About DataFrames (mchav.github.io)

162 points | 52 comments | page 2

hermitcrab 7 hours ago|

I guess this article is an interesting exercise from a pure maths point of view. But as someone developing a drag-and-drop data wrangling tool, the important thing is creating a set of composable operations/primitives that are meaningful and useful to your end user. We have ended up with 73 distinct transforms in Easy Data Transform. Sure, they overlap to an extent, but we feel they are at the right semantic level for our users, who are not category theorists.

mrlongroots 7 hours ago||

Algebras are also nice for implementations. If you can decompose a domain into a few algebraic primitives you can write nice SIMD/CUDA kernels for those primitives.

To your point, I wonder if the 73 distinct transforms were just different defaults/usability wrappers over these. And you may also get into situations where kernels can be fused together, or where other batching constraints enable optimizations that nice algebraic primitives don't capture. But that's just systems. Theory is useful in helping rethink API bloat and keeping us all honest.
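To make the fusion point concrete, here is a minimal sketch of the map fusion law (mapping `g` and then `f` gives the same result as mapping their composition once). The names `map_column`, `compose`, `unfused`, and `fused` are made up for illustration and do not come from the article or any real library; plain Python lists stand in for columns.

```python
from functools import reduce
from typing import Any, Callable, List

Column = List[Any]


def compose(*fns: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Combine per-element functions into one, applied left to right."""
    return lambda x: reduce(lambda acc, f: f(acc), fns, x)


def map_column(f: Callable[[Any], Any], column: Column) -> Column:
    """Hypothetical primitive: apply f to every element in a single pass."""
    return [f(x) for x in column]


def unfused(column: Column) -> Column:
    # Two passes over the data, with one intermediate list in between.
    doubled = map_column(lambda x: x * 2, column)
    return map_column(lambda x: x + 1, doubled)


def fused(column: Column) -> Column:
    # One pass and no intermediate list; the result is the same
    # because map f . map g == map (f . g).
    return map_column(compose(lambda x: x * 2, lambda x: x + 1), column)
```

The law is what lets an implementation rewrite the two-pass version into the one-pass version without changing results. Whether a real SIMD or CUDA backend could exploit it this directly is a separate systems question.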

hermitcrab 6 hours ago||

They are effectively high-level wrappers over the most primitive operations. High enough level that they can be used from a GUI, rather than code.

It is a balance. Too few transforms and they become too low-level for my users. Too many and you struggle to find the transform you want.

jimbokun 5 hours ago||

You don’t have to limit the transforms you offer users to just the core ones. But for your own sanity you can implement the non-core ones in terms of the core ones.

tikhonj 5 hours ago|||

You can have both: you start with a small, mathematically inspired algebraic core, then you express the higher-level more user-friendly operations in terms of the algebraic core.

As long as your core primitives are well designed (easier said than done!), this accomplishes two things: it makes your implementation simpler, and it helps guide and constrain your user-facing design. This latter aspect is a bit unintuitive (why would you want more constraints to work around?), but I've seen it lead to much better interface designs in multiple projects. By forcing yourself to express user-level affordances in terms of a small conceptual core, you end up with a user design that is more internally consistent and composable.
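A rough sketch of what that layering could look like, assuming a table is just a list of dicts. The two core primitives (`filter_rows`, `map_rows`) and the three user-facing transforms built on them are hypothetical, chosen only to illustrate the idea; they are not the API of Easy Data Transform or of the library in the article.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Table = List[Row]

# --- Small core (hypothetical primitives) ---

def filter_rows(pred: Callable[[Row], bool], table: Table) -> Table:
    """Keep only the rows for which pred returns True."""
    return [row for row in table if pred(row)]


def map_rows(f: Callable[[Row], Row], table: Table) -> Table:
    """Replace every row with f(row)."""
    return [f(row) for row in table]


# --- User-facing transforms, written only in terms of the core ---

def select(columns: List[str], table: Table) -> Table:
    """Keep only the named columns, in the given order."""
    return map_rows(lambda row: {c: row[c] for c in columns}, table)


def rename(old: str, new: str, table: Table) -> Table:
    """Rename one column, leaving the others untouched."""
    return map_rows(
        lambda row: {(new if k == old else k): v for k, v in row.items()},
        table,
    )


def drop_blank(column: str, table: Table) -> Table:
    """Drop rows whose value in `column` is None or an empty string."""
    return filter_rows(lambda row: row[column] not in (None, ""), table)
```

Because every user-level transform bottoms out in the same two primitives, an optimization or bug fix in the core reaches all of them at once, and a new transform that cannot be expressed this way is a signal that either the core or the transform needs rethinking.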

jimbokun 5 hours ago||

For one thing it gives users of your library fewer concepts to learn.

hermitcrab 4 hours ago||

Yes, but fewer concepts may not be simpler in practice. E.g. assembler is simpler than C++, but I wouldn't want to write a big program in assembler.

whattheheckheck 6 hours ago||

Have you heard of the book Mathematics of Big Data?

https://github.com/Accla/d4m

The author himself says the ideas are more important than the software package.

hermitcrab 6 hours ago||

D4M seems to be a library, not a book. Or am I missing something?