Posted by jordigh 4 days ago
But I tried the group's "CSV magician" program and was not impressed
https://github.com/medialab/xan
The table output seems very similar to sqlite3
It's a 20MB musl dynamically-linked executable (cf. 1.7MB statically-linked sqlite3 executable)
Most of the "suite" of subcommands seemed to be aimed at accomplishing the same stuff I can do with sqlite3, and, if necessary, flex
There seemed to be no obvious way to disable color
The last straw was it messed up the console
First it froze, and the pid was not visible with ps, so I had to kill the shell
Then leaving me with no up/down or tab action
Yes I can fix this but I should not have to
I really want to keep an open mind and to believe in these rust console programs
But every time I try one it is a huge executable, an assault of superfluous color and generally no functionality that cannot be achieved with existing non-rust software
Even assuming I could get used to the annoying design, I like software that is reliable
Compared to the software I normally use, I cannot say these rust programs are as reliable
I also like software where I can easily edit the source to change what I do not like
These rust programs would require more resources and time to compile, messing around with a package manager and heaps of dependencies, plus learning a new language, with no clear benefit... b/c TBH I am not losing any sleep worrying that the small text-processing programs I use are written in C; after all, the operating systems I use are written in C, and that is not changing anytime soon
I wouldn't write it a love letter though. There's a reason that parquet exists.
Without more specifics, I disagree with your take.
Going from tool to tool will leave you with widely different representations of the original data you put in, because, as you said yourself, none of this data has any intrinsic meaning. It's just strings. The CSV and the tools do not care if one column was ms-epoch and another was floating point in mathematical notation. It will all just go through that specific tool's deserialization-serialization mangle, and you'll have completely different data on the other end.
It clearly means CSV must be doing something right.
I used to work on payment infrastructure and whenever a vendor offered us the choice between CSV and some other format we always opted for that other format (often xlsx). This sounds a bit weird, but when using xlsx and a good library for handling it, you never have to worry about encoding, escaping and number formatting.
This is one of these things that sound absolutely wrong from an engineering standpoint (xlsx is abhorrently complex on the inside), but works robustly in practice.
Slightly related: This was a German company, with EU and US payment providers. Also note that Microsoft Excel (and therefore a lot of other tools) produces "semicolon-separated values" files when started on a computer with the locale set to German...
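To see why that locale quirk bites, here's a toy Python sketch (data invented for illustration): the exact same bytes parse cleanly as either comma-separated or semicolon-separated, both yield a perfectly rectangular table, and nothing in the file tells you which one the producer meant.

```python
import csv
import io

# The same bytes, read under two different dialect assumptions.
raw = "a;b,c\n1;2,3\n"

as_comma = list(csv.reader(io.StringIO(raw), delimiter=","))
as_semi = list(csv.reader(io.StringIO(raw), delimiter=";"))

# Both parses succeed and look plausible; the ambiguity is unresolvable
# from the file contents alone.
assert as_comma == [["a;b", "c"], ["1;2", "3"]]
assert as_semi == [["a", "b,c"], ["1", "2,3"]]
```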
I work a lot with time series data, and excel does not support datetimes with timezones, so I have to figure out the timezone every time to align with other sources of data.
Reading and writing them is much slower than csv, which is annoying when datasets get larger.
And most importantly, xlsx are way more often fucked up in some way than any other format. Usually, because somebody manually did something to them and sometimes because the library used to write them had some hiccup.
So I guess a hot take indeed.
Which good libraries did you find? That's been my pain point when dealing with xlsx.
I tried using them once after what felt like an aeon of quoting issues, and the first customer file I had had them randomly appearing in their fields.
Indeed, a lie only a lover would believe,,,
It's unkillable, like many eldritch horrors.
> The specification of CSV holds in its title: "comma separated values". Okay, it's a lie, but still, the specification holds in a tweet and can be explained to anybody in seconds: commas separate values, new lines separate rows. Now quote values containing commas and line breaks, double your quotes, and that's it. This is so simple you might even invent it yourself without knowing it already exists while learning how to program.
Except that's just one way people do it. It's not universal and so you cannot take arbitrary CSV files in and parse them like this. You can't take a CSV file constructed like this and pass it into any CSV accepting program - many will totally break.
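For what it's worth, the "tweet spec" quoted above is roughly what Python's stdlib csv module implements, so a round-trip through one parser does work; the trouble starts when a different tool uses different rules. A quick sketch with my own example data:

```python
import csv
import io

# A field containing a comma, a quote, and a newline -- the three cases
# the "tweet spec" covers.
rows = [["id", "note"], ["1", 'said "hi", then\nleft']]

buf = io.StringIO()
csv.writer(buf).writerows(rows)  # quotes and doubles quotes as needed
text = buf.getvalue()

# Round-trip back through the same parser recovers the data exactly.
parsed = list(csv.reader(io.StringIO(text)))
assert parsed == rows
```

The round trip only proves writer and reader agree with *each other*; a consumer with different quoting rules can still mangle this file.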
> Of course it does not mean you should not use a dedicated CSV parser/writer because you will mess something up.
Yes, implementers often have.
> No one owns CSV. It has no real specification
Yep. So all these monstrosities in the real world are all... maybe valid? Lots of totally broken CSV files can be parsed as CSV but the result is wrong. Sometimes subtly.
> This means, by extension, that it can both be read and edited by humans directly, somehow.
One of the very common ways they get completely fucked up, yes. Someone goes and sorts some rows and boom broken, often unrecoverable data loss. Someone doesn't correctly add or remove a comma. Someone mixes two files that actually have differently encoded text.
> CSV can be read row by row very easily without requiring more memory than what is needed to fit a single row.
CSV must be parsed row by row.
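The streaming claim itself is true, at least: a CSV reader is an iterator, so only the current record has to be in memory. A minimal sketch with made-up data:

```python
import csv
import io

# Five small rows; in real life this would be a file handle to
# something much larger than RAM.
data = "a,b\n" + "\n".join(f"{i},{i * i}" for i in range(5)) + "\n"

reader = csv.reader(io.StringIO(data))
header = next(reader)
# Stream row by row; never materialize the whole table.
total = sum(int(row[1]) for row in reader)
```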
> By comparison, column-oriented data formats such as parquet are not able to stream files row by row without requiring you to jump here and there in the file or to buffer the memory cleverly so you don't tank read performance.
Sort of? Yes, if you're building your own parser, but who is doing that? It's also not hard with formats like parquet.
> But of course, CSV is terrible if you are only interested in specific columns because you will indeed need to read all of a row only to access the part you are interested in.
Or if you're interested in a specific row, because you're going to have to be careful about parsing out every row until you get there.
CSV does not have a row separator. Or rather it does, but it also lets that same character appear inside a field without meaning "separate these rows", so you can't simply trust it.
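To make that concrete, here's a small sketch (my own toy data): at the byte level a quoted newline is indistinguishable from a record separator, so naive line splitting miscounts the rows and only a quote-state-tracking parser gets it right.

```python
import csv
import io

text = 'name,comment\nalice,"line one\nline two"\nbob,ok\n'

# Naive line splitting sees 4 "rows" because the quoted newline looks
# exactly like a record separator.
naive = text.strip().split("\n")
assert len(naive) == 4

# A real CSV parser tracks quote state and finds 3 records.
rows = list(csv.reader(io.StringIO(text)))
assert len(rows) == 3
assert rows[1] == ["alice", "line one\nline two"]
```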
> But critics of CSV coming from this set of pratices tend to only care about use-cases where everything is expected to fit into memory.
Parquet uses row groups which means you can stream chunks easily, those chunks contain metadata so you can easily filter rows you don't need too.
I much more often need to keep the whole thing in memory working with CSV than parquet. With parquet I don't even need to be able to fit all the rows on disk I can read the chunk I want remotely.
> CSV can be appended to
Yeah, that's easier. Row groups mean you can still do this, though granted it's not as easy. *However* I will point out that absolutely nothing stops someone from completely borking things by appending something that's not exactly the right format.
> CSV is dynamically typed
Not really. Everything is strings. You can do that with anything else if you want to. JSON can have numbers of any size if you just store them as strings.
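The "everything is strings" point is easy to demonstrate (example data mine): a CSV reader hands back str for every cell, and any typing is an interpretation the consumer layers on top.

```python
import csv
import io

rows = list(csv.reader(io.StringIO("id,price,when\n1,2.50,2003-02-01\n")))

# Every cell is a string; "2.50" is not a number until *you* decide it is.
assert rows[1] == ["1", "2.50", "2003-02-01"]
assert all(isinstance(cell, str) for cell in rows[1])
```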
> CSV is succinct
Yes, more so than jsonl, but not really more than (you guessed it) parquet. Also it's horrific for compression.
> Reverse CSV is still valid CSV
Get a file format that doesn't absolutely suck and you can parse things in reverse if you want. More usefully you can parse just sections you actually care about!
> Excel hates CSV
Helpfully this just means that the most popular way of working with tabular data in the world doesn't play that nicely with it.
Partially due to the bloat, but also partially because the format doesn't allow for speed.
And because CSV is untyped, you have to either trust the producer or put in mountains of guards to ensure you can handle the weird garbage that might come through.
My company deals with a lot of CSV, and we literally built tools and hired multiple full-time employees whose entire job is handling CSV sucking in new and interesting ways.
Parquet literally eliminates half of our data ingestion pipeline simply by being typed, consistent, and fast to query.
One example of a problem we constantly run into is that nobody likes to format numbers the same way. Scientific notation, no scientific notation, commas or periods, sometimes mixed formats (scientific notation when a number is big enough, for example).
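To illustrate the number-format mess (these specific values are invented, but they're all formats you see in the wild): plain and scientific notation parse fine with a naive float(), while European grouping/decimal conventions blow up and need a locale-aware or hand-rolled parser.

```python
# Three producers writing "one and a half million":
cells = ["1500000", "1.5E6", "1.500.000,00"]

parsed = []
for c in cells:
    try:
        parsed.append(float(c))
    except ValueError:
        parsed.append(None)  # needs locale-aware handling

# Scientific notation is fine; German-style grouping is not.
assert parsed[:2] == [1500000.0, 1500000.0]
assert parsed[2] is None
```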
Escaping is also all over the board.
CSV SEEMS simple, but the lack of a real standard means it's anything but.
I'd take xml over CSV.
Truly shocking how many ways people manage to construct these files. I don't think people really get this if they've mostly been moving files from one system to another, rather than dealing with basically the union of horrors from all the various systems that write CSV.
> Parquet literally eliminates half of our data ingestion pipeline simply by being typed, consistent, and fast to query.
Parquet has been a huge step forwards. It's not perfect, but it is really good. Most improvements I'd like actually are served well stepping up from that to groups of them in larger tables.
> One example of a problem we constantly run into is that nobody likes to format numbers the same way. Scientific notation, no scientific notation, commas or periods, sometimes mixed formats (scientific notation when a number is big enough, for example).
That's a new one on me, but makes loads of sense. Dates were my issue - you hit 13/2/2003 and 5/16/2001 in the same file. What date is 1/2/2003?
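The ambiguity is mechanical, not hypothetical: both interpretations of 1/2/2003 parse successfully, so nothing in the file can tell you which was meant. A quick sketch:

```python
from datetime import datetime

s = "1/2/2003"

us = datetime.strptime(s, "%m/%d/%Y")  # month-first: January 2nd
eu = datetime.strptime(s, "%d/%m/%Y")  # day-first: February 1st

# Both parses succeed; the bytes alone can't disambiguate.
assert (us.month, us.day) == (1, 2)
assert (eu.month, eu.day) == (2, 1)
assert us != eu
```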
For anyone that's never dealt with this before, let me paint a picture -
You have a programming language you're working in. You import new packages every single day, written by people you start to consider adversarial after a few weeks in the job.
You need to keep your system running, with new imports added every day.
There are only string types. Nothing more. You must interpret them correctly.
This is an understatement for what CSV files coming from random customers actually means. I did it for a decade and was constantly shocked at what new and inventive ways people had to mess up a simple file.
> I'd take xml over CSV.
Not to rag on xml, but since it feels like you're in or have been in the same boat as me and it's nice to share horror stories: I've had to manually dig through a multi-gig xml file to deal with parsing issues, because somewhere, somehow, someone managed to combine files of different encodings, and we had control characters embedded in it. Just a random ^Z here and there. It's been years, so I don't remember exactly how we reconstructed what had happened, but something about encodings and mixing things together messed it up.
This isn't XML's fault, and it was a smaller example, but since then I've had a strong mistrust of anything that lets humans manipulate files outside of a tool that can validate them as parsable.
Also would take XML over CSV.
Particularly for tabular data, parquet is really good. Even a SQLite database isn't a terrible way to send that sort of data.
At least with XML, all the problems of escaping are effectively solved. And since it's (usually) tool generated it's likely valid. That means I can feed it into my favorite xml parser and pound the data out of it (usually). There's still a lot of the encoding issues I mentioned with CSV.
Also, commas in quoted strings are quite mainstream csv, but csvs with quoted strings containing unescaped newlines are extremely baroque. Criticism of csv based on the assumption that strings will contain newlines is not realistic.
This was in the context of having it somewhere humans can edit it directly, so the case here is sorting rows by sorting lines. CSV has this wonderful property when editing: anything that doesn't parse it in and back out to verify it's still a valid file lets you write out a broken file if you mess up, and on top of that, the record delimiter is an exceptionally common character in text.
So to answer your question, sure: take a csv file with newlines in some entries and sort the lines. You can restore it if you don't have two entries with newlines in the same field, and even then only if you know it was exactly valid to start with (no stray commas anywhere, etc.).
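Here's a small sketch of that failure mode (toy data of my own): the file parses to three records, but sorting the physical lines, which is exactly what a human or a text tool does, tears a quoted field in half, and the file still "parses", just to something else.

```python
import csv
import io

text = 'name,bio\nzoe,"likes\ncats"\nann,quiet\n'

# Header + 2 records when parsed correctly.
assert len(list(csv.reader(io.StringIO(text)))) == 3

# The "human" sort: treat every physical line as a row and sort them.
sorted_text = "\n".join(sorted(text.strip().split("\n"))) + "\n"
rows = list(csv.reader(io.StringIO(sorted_text)))

# The quoted field is torn apart and the parse silently changes shape:
# now 4 "records" of inconsistent width, with no error raised.
assert len(rows) == 4
```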
> csvs with quoted strings containing unescaped newlines are extremely baroque
No, they're all over the place. If you don't think so I don't believe you've worked with lots of real world csvs. Also, how do you know? How do you know your file doesn't contain them? Here's a fun fact - you can get to the point very easily where you *cannot programmatically tell*.
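A tiny illustration of the "cannot programmatically tell" point (data invented): the same bytes are a valid parse under quotes-as-escapes rules and a valid parse under quotes-as-literal-characters rules, and both results look internally consistent.

```python
import csv
import io

raw = 'x,"1,2"\n'

# Quotes treated as escaping: 2 columns.
quoted = list(csv.reader(io.StringIO(raw)))
# Quotes treated as literal characters: 3 columns.
literal = list(csv.reader(io.StringIO(raw), quoting=csv.QUOTE_NONE))

assert quoted == [["x", "1,2"]]
assert literal == [["x", '"1', '2"']]
```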
> Criticism of csv based on the assumption that strings will contain newlines is not realistic.
It's a very common thing to happen though.
Let's imagine something. CSV doesn't exist. I'm proposing it to you. I tell you that the byte used to split records is a very commonly occurring character in text. But don't worry! You can escape this by wrapping it in another commonly used character. Oh, and to escape that one, use two of them :)
Would you tell me to use something else? That you could foresee this causing problems?