Posted by jordigh 4 days ago
Some genius at Microsoft decided that exporting to CSV should follow the locale convention. Which means I get a "semicolon-separated value" file instead of a comma-separated one, unless I change my locale to US.
Line breaks are also fun...
(TSV FTW)
Tools definitely make it faster and more reliable.
Commas and quotation marks suddenly make it complicated.
YAML is a pain because it has ever so slightly different versions that sometimes don't play nice.
CSV or TSV files are almost always portable.
Unicode has rendered analogs, U+241C-U+241F, but they take more bytes to encode, which can significantly increase file size in large USV files.
So my ideal would be to use ASV files rendered as USV in editors.
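The size difference is easy to verify: in UTF-8, the ASCII control separators encode as one byte each, while their rendered analogs take three. A quick Python check:

```python
# ASCII unit separator (US) vs. its rendered analog U+241F (␟) in UTF-8
print(len("\x1f".encode("utf-8")))    # 1 byte
print(len("\u241f".encode("utf-8")))  # 3 bytes
```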
Here's an example of what a USV looks like:
Folio1␟␞ Sheet1␟␞ a␟b␟␞ c␟d␟␞ ␝ Sheet2␟␞ e␟f␟␞ g␟h␟␞ ␝␜ Folio2␟␞ Sheet3␟␞ a␟b␟␞ c␟d␟␞ ␝ Sheet4␟␞ e␟f␟␞ g␟h␟␞ ␝␜
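Parsing such a stream is mostly nested splitting. Here's a minimal Python sketch, assuming each unit/record/group ends with its separator (as in the example above) and covering only the group/record/unit levels for brevity; the helper names are made up:

```python
# ASCII control separators: file, group, record, unit
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"

def split_term(s: str, sep: str) -> list[str]:
    """Split on sep, treating it as a terminator rather than a separator."""
    parts = s.split(sep)
    if parts and parts[-1] == "":
        parts.pop()
    return parts

def parse_groups(data: str) -> list[list[list[str]]]:
    """Parse a separator-terminated stream into groups -> rows -> cells."""
    return [
        [split_term(row, US) for row in split_term(group, RS)]
        for group in split_term(data, GS)
    ]

data = f"a{US}b{US}{RS}c{US}d{US}{RS}{GS}e{US}f{US}{RS}{GS}"
print(parse_groups(data))
# -> [[['a', 'b'], ['c', 'd']], [['e', 'f']]]
```

The same two helpers extend to folios by adding one more level of `split_term` on FS.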
Back in the early 2000s I designed and built a custom data collector for an Air Force project. It saved data at 100 Hz to an SD card. The project manager loved it! He could pop the SD card out or use the handy USB mass storage mode to grab the CSV files.
The only problem... Why did the data cut off after about 10 minutes?? I couldn't see the actual data collected since it was secret, but I had no issue on my end, assuming there was space on the card and battery life was good.
Turns out, he was using Excel 2003 to open the CSV file. Excel 2003 has a 65,536-row limit (does that number look familiar? It's 2^16). That took a while to figure out!!
The first data release I did, Excel couldn't open the CSV file, because it started with a capital I (the first column was ID). Excel looks at this file, with commas in the header and text and a ".csv" extension, and says
I KNOW WHAT THIS IS
THIS IS A SYLK FILE
BECAUSE IT STARTS WITH "I"
NO OTHER POSSIBLE FILE COULD START WITH THE LETTER "I"
then reads some more and says
THIS SYLK FILE LOOKS WRONG
IT MUST BE BROKEN
ERROR
[1]https://i.kym-cdn.com/entries/icons/facebook/000/027/691/tum...
On the technical side libraries like pandas have undergone extreme selection pressure to be able to read in Excel's weird CSV choices without breaking. At that point we have the luxury of writing them out as "proper" CSV, or as a SQLite database, or as whatever else we care about. It's just a reasonable crossing-over point.
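A hedged sketch of that crossing-over point, assuming pandas is installed; the column names and the locale-style input are invented for illustration:

```python
# Normalize a quirky, locale-flavored CSV (';' fields, ',' decimals) with
# pandas, then re-emit it as "proper" CSV and as a SQLite table.
import io
import sqlite3
import pandas as pd

messy = "id;price\n1;1,50\n2;2,75\n"  # semicolon fields, comma decimals

df = pd.read_csv(io.StringIO(messy), sep=";", decimal=",")
print(df["price"].sum())              # decimals parsed correctly: 4.25

print(df.to_csv(index=False))         # "proper" CSV: ',' fields, '.' decimals

with sqlite3.connect(":memory:") as conn:
    df.to_sql("prices", conn, index=False)
    total = conn.execute("SELECT SUM(price) FROM prices").fetchone()[0]
    print(total)                      # 4.25
```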
https://medium.com/@ManueleCaddeo/understanding-jsonl-bc8922...
708 points | 5 months ago | 698 comments (https://news.ycombinator.com/item?id=43484382)
Yes, it does. When Excel is installed, it installs a file type association for CSV and Explorer sets Excel as the default handler.
COL1,COL2,COL3
5,"+A2&C1","+A2*8&B1"
(there are some limitations)
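Cells starting with `=`, `+`, `-`, or `@` are the classic CSV-injection vector. A common (if imperfect) mitigation is to prefix such cells with a single quote so spreadsheet apps treat them as text; a hypothetical Python sketch:

```python
import csv
import io

RISKY = ("=", "+", "-", "@", "\t", "\r")

def defuse(cell: str) -> str:
    """Prefix cells a spreadsheet might evaluate as formulas."""
    return "'" + cell if cell.startswith(RISKY) else cell

rows = [["COL1", "COL2"], ["5", "+A2&C1"]]
buf = io.StringIO()
csv.writer(buf).writerows([[defuse(c) for c in r] for r in rows])
print(buf.getvalue())
```

The leading quote does mutate the data, which is one of the limitations alluded to above.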
"Hi, I'm sending you a two-line statement in a Word document. It's 10kB."
"Thanks, I took a screenshot of it and forwarded it. It's now 10MB."
"Great! That's handy!"
Use tabs as the delimiter and Excel interoperates with the format as if it were native.
Everyone uses , or ; as delimiters and then uses either . or , for decimals, depending on the source.
It shouldn't be so hard to auto-detect these different formats, but somehow, in 2025, Excel still cannot do it.
If they don't, you could create a simple script that just adds that line, and Excel will open the files without you having to make sure it interprets them correctly. Of course, that's a bit more challenging if they use different separators, but you might find an easy adaptation for your use case, like deciding which delimiter to declare based on the filename. Or you could analyze the header row to figure out which delimiter to use.
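The "just add that line" idea can be sketched like this in Python, using `csv.Sniffer` for delimiter detection; file names are illustrative, and note that the `sep=` hint is an Excel-specific extension that other tools may not understand:

```python
import csv
import os
import tempfile

def add_sep_hint(src: str, dst: str) -> str:
    """Copy src to dst with a leading 'sep=X' line; return the detected delimiter."""
    with open(src, newline="", encoding="utf-8") as f:
        dialect = csv.Sniffer().sniff(f.read(4096), delimiters=",;\t|")
        f.seek(0)
        body = f.read()
    with open(dst, "w", newline="", encoding="utf-8") as f:
        f.write(f"sep={dialect.delimiter}\n")
        f.write(body)
    return dialect.delimiter

# tiny self-demo with a temporary semicolon-delimited file
tmp = tempfile.mkdtemp()
src, dst = os.path.join(tmp, "in.csv"), os.path.join(tmp, "out.csv")
with open(src, "w", encoding="utf-8") as f:
    f.write("id;name\n1;ann\n2;bob\n")
delim = add_sep_hint(src, dst)
print(delim)
```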
For whatever reason, pipe seems to be a common delimiter in health care data.
Most arguments for or against one apply to all.
It is A CSV tool, readily available in the business world, that often works quite well.
And your argument about comma separators is wrong; the string
1,234
in a CSV file SHOULD mean "two values: 1 and 234", regardless of the local decimal separator. The number one thousand two hundred thirty-four is represented as
"1,234"
> 1,234
> in a CSV file SHOULD mean "two values: 1 and 234", regardless of the local decimal separator.
Yes, I agree, it SHOULD mean that, but that is NOT what Excel does when the decimal separator is set to "," in the regional settings. Excel wrongly applies the comma as the decimal separator and reads that number as 1 unit and 234 thousandths.
Locale MUST NOT be used for data formats, but Excel does it anyway.
This problem doesn't manifest itself when you're using a locale which matches the CSV's separators. Consider yourself lucky if you're in that situation.
Smart people (that have been burned once too many times) put quotes around fields in csv if they aren’t 100% positive the field will be comma-free, and escape quotes in such fields.
Add cout or printf lines, which on each iteration print out relevant intermediate values separated by commas, with the first cell being a constant tag. Provided you don't overdo it, the software will typically still run in real-time. Pipe stdout to a file.
After the fact, you can then use grep to filter tags to select which intermediate results you want to analyse. This filtered data can be loaded into a spreadsheet, or read into a higher level script for analysis/debugging/plotting/... In this way you can reproducibly visualise internal operation over a long period of time and see infrequent or subtle deviations from expected behaviour.
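The tagged-logging pattern (the comment uses cout/printf, but it's language-agnostic) might look like this in Python; the tag names are made up:

```python
def log(tag: str, *values) -> str:
    """Emit one CSV line: a constant tag, then the intermediate values."""
    line = ",".join([tag, *map(str, values)])
    print(line)
    return line

# inside the real-time loop, log sparingly so timing is unaffected
for i in range(3):
    x = i * i
    log("SQUARE", i, x)

# run:  ./program > run.log
# then: grep '^SQUARE,' run.log   # select one tag for analysis
```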
If the CSV is not written by me, it's always an exercise in making things as difficult as possible. The format might be a tad smaller, but I find the parsing so painful that you need a really good reason to use it.
Edit: Oh yeah, and some have a header, others don't. And CSV always seems to come from some machine where the techs can come over to do an update and just reorder everything, because fuck your parsing. Then either you get lucky and the parser dies, or, since you don't really have much info, the types just happen to align and you start saving garbage data to your database until a domain expert notices something isn't quite right, and you have to find out when someone last touched the machines and roll back/reparse everything.