Posted by jordigh 4 days ago
Some genius at Microsoft decided that exporting to CSV should follow the locale convention. Which means I get a "semicolon-separated value" file instead of a comma-separated one, unless I change my locale to US.
Line breaks are also fun...
(TSV FTW)
Tools definitely make it faster and more reliable.
Commas and quotation marks suddenly make it complicated.
YAML is a pain because it has ever so slightly different versions that sometimes don't play nice.
CSV or TSV files are almost always portable.
Unicode has rendered analogs, U+241C-U+241F, but they take more bytes to encode, which can significantly increase file size in large USV files.
So my ideal would be to use ASV files rendered as USV in editors.
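The size difference is easy to verify: in UTF-8, the ASCII control separators encode as one byte each, while their rendered analogs take three. A quick Python check:

```python
# ASCII unit separator (US) vs. its rendered analog U+241F (␟) in UTF-8
print(len("\x1f".encode("utf-8")))    # 1 byte
print(len("\u241f".encode("utf-8")))  # 3 bytes
```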
Here's an example of what a USV looks like:
Folio1␟␞ Sheet1␟␞ a␟b␟␞ c␟d␟␞ ␝ Sheet2␟␞ e␟f␟␞ g␟h␟␞ ␝␜ Folio2␟␞ Sheet3␟␞ a␟b␟␞ c␟d␟␞ ␝ Sheet4␟␞ e␟f␟␞ g␟h␟␞ ␝␜
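Parsing such a stream is mostly nested splitting. Here's a minimal Python sketch, assuming each unit/record/group ends with its separator (as in the example above) and covering only the group/record/unit levels for brevity; the helper names are made up:

```python
# ASCII control separators: file, group, record, unit
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"

def split_term(s: str, sep: str) -> list[str]:
    """Split on sep, treating it as a terminator rather than a separator."""
    parts = s.split(sep)
    if parts and parts[-1] == "":
        parts.pop()
    return parts

def parse_groups(data: str) -> list[list[list[str]]]:
    """Parse a separator-terminated stream into groups -> rows -> cells."""
    return [
        [split_term(row, US) for row in split_term(group, RS)]
        for group in split_term(data, GS)
    ]

data = f"a{US}b{US}{RS}c{US}d{US}{RS}{GS}e{US}f{US}{RS}{GS}"
print(parse_groups(data))
# -> [[['a', 'b'], ['c', 'd']], [['e', 'f']]]
```

The same two helpers extend to folios by adding one more level of `split_term` on FS.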
Back in the early 2000s I designed and built a custom data collector for an Air Force project. It saved data at 100 Hz to an SD card. The project manager loved it! He could pop the SD card out or use the handy USB mass storage mode to grab the CSV files.
The only problem... Why did the data cut off after about 10 minutes?? I couldn't see the actual data collected since it was secret, but I had no issue on my end, assuming there was space on the card and battery life was good.
Turns out, he was using Excel 2003 to open the CSV file. Excel 2003 has a 65,536-row limit (does that number look familiar? It's 2^16). That took a while to figure out!!
The first data release I did, Excel couldn't open the CSV file, because it started with a capital I (the first column was ID). Excel looks at this file, with commas in the header and text and a ".csv" extension, and says
I KNOW WHAT THIS IS
THIS IS A SYLK FILE
BECAUSE IT STARTS WITH "I"
NO OTHER POSSIBLE FILE COULD START WITH THE LETTER "I"
then reads some more and says
THIS SYLK FILE LOOKS WRONG
IT MUST BE BROKEN
ERROR
[1]https://i.kym-cdn.com/entries/icons/facebook/000/027/691/tum...
On the technical side libraries like pandas have undergone extreme selection pressure to be able to read in Excel's weird CSV choices without breaking. At that point we have the luxury of writing them out as "proper" CSV, or as a SQLite database, or as whatever else we care about. It's just a reasonable crossing-over point.
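A hedged sketch of that crossing-over point, assuming pandas is installed; the column names and the locale-style input are invented for illustration:

```python
# Normalize a quirky, locale-flavored CSV (';' fields, ',' decimals) with
# pandas, then re-emit it as "proper" CSV and as a SQLite table.
import io
import sqlite3
import pandas as pd

messy = "id;price\n1;1,50\n2;2,75\n"  # semicolon fields, comma decimals

df = pd.read_csv(io.StringIO(messy), sep=";", decimal=",")
print(df["price"].sum())              # decimals parsed correctly: 4.25

print(df.to_csv(index=False))         # "proper" CSV: ',' fields, '.' decimals

with sqlite3.connect(":memory:") as conn:
    df.to_sql("prices", conn, index=False)
    total = conn.execute("SELECT SUM(price) FROM prices").fetchone()[0]
    print(total)                      # 4.25
```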
https://medium.com/@ManueleCaddeo/understanding-jsonl-bc8922...
708 points | 5 months ago | 698 comments (https://news.ycombinator.com/item?id=43484382)
Yes, it does. When Excel is installed, it installs a file type association for CSV and Explorer sets Excel as the default handler.
COL1,COL2,COL3
5,"+A2&C1","+A2*8&B1"
(there are some limitations)
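Cells starting with `=`, `+`, `-`, or `@` are the classic CSV-injection vector. A common (if imperfect) mitigation is to prefix such cells with a single quote so spreadsheet apps treat them as text; a hypothetical Python sketch:

```python
import csv
import io

RISKY = ("=", "+", "-", "@", "\t", "\r")

def defuse(cell: str) -> str:
    """Prefix cells a spreadsheet might evaluate as formulas."""
    return "'" + cell if cell.startswith(RISKY) else cell

rows = [["COL1", "COL2"], ["5", "+A2&C1"]]
buf = io.StringIO()
csv.writer(buf).writerows([[defuse(c) for c in r] for r in rows])
print(buf.getvalue())
```

The leading quote does mutate the data, which is one of the limitations alluded to above.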
"Hi, I'm sending you a two-line statement in a Word document. It's 10kB."
"Thanks, I took a screenshot of it and forwarded it. It's now 10MB."
"Great! That's handy!"
Use tabs as the delimiter and Excel interoperates with the format as if it were native.
Everyone uses , or ; as delimiters and then uses either . or , for decimals, depending on the source.
It shouldn't be so hard to auto-detect these different formats, but somehow, in 2025, Excel still cannot do it.
If they don't, you could create a simple script that just adds that line, and Excel will open the files without you having to make sure it interprets them correctly. Of course, that's a bit more challenging if they use different separators, but you might find an easy adaptation for your use case, like deciding which delimiter to declare based on the filename. Or you could analyze the header row to figure out which delimiter to use.
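The "just add that line" idea can be sketched like this in Python, using `csv.Sniffer` for delimiter detection; file names are illustrative, and note that the `sep=` hint is an Excel-specific extension that other tools may not understand:

```python
import csv
import os
import tempfile

def add_sep_hint(src: str, dst: str) -> str:
    """Copy src to dst with a leading 'sep=X' line; return the detected delimiter."""
    with open(src, newline="", encoding="utf-8") as f:
        dialect = csv.Sniffer().sniff(f.read(4096), delimiters=",;\t|")
        f.seek(0)
        body = f.read()
    with open(dst, "w", newline="", encoding="utf-8") as f:
        f.write(f"sep={dialect.delimiter}\n")
        f.write(body)
    return dialect.delimiter

# tiny self-demo with a temporary semicolon-delimited file
tmp = tempfile.mkdtemp()
src, dst = os.path.join(tmp, "in.csv"), os.path.join(tmp, "out.csv")
with open(src, "w", encoding="utf-8") as f:
    f.write("id;name\n1;ann\n2;bob\n")
delim = add_sep_hint(src, dst)
print(delim)
```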
For whatever reason, pipe seems to be a common delimiter in health care data.
Most arguments for or against one apply to all.
It is A CSV tool, readily available in the business world, that often works quite well.
And your argument about comma separators is wrong; the string
1,234
in a CSV file SHOULD mean "two values: 1 and 234", regardless of the local decimal separator. The number one thousand two hundred thirty-four is represented as
"1,234"
> 1,234
> in a CSV file SHOULD mean "two values: 1 and 234", regardless of the local decimal separator.
Yes, I agree, it SHOULD mean that, but that is NOT what Excel does when the decimal separator is set to "," in the regional settings. Excel wrongly applies the comma as the decimal separator and reads that number as 1 unit and 234 thousandths.
Locale MUST NOT be used for data formats, but Excel does it anyway.
This problem doesn't manifest itself when you're using a locale which matches the CSV's separators. Consider yourself lucky if you're in that situation.
Smart people (that have been burned once too many times) put quotes around fields in csv if they aren’t 100% positive the field will be comma-free, and escape quotes in such fields.
Add cout or printf lines, which on each iteration print out relevant intermediate values separated by commas, with the first cell being a constant tag. Provided you don't overdo it, the software will typically still run in real-time. Pipe stdout to a file.
After the fact, you can then use grep to filter tags to select which intermediate results you want to analyse. This filtered data can be loaded into a spreadsheet, or read into a higher level script for analysis/debugging/plotting/... In this way you can reproducibly visualise internal operation over a long period of time and see infrequent or subtle deviations from expected behaviour.
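The tagged-logging pattern (the comment uses cout/printf, but it's language-agnostic) might look like this in Python; the tag names are made up:

```python
def log(tag: str, *values) -> str:
    """Emit one CSV line: a constant tag, then the intermediate values."""
    line = ",".join([tag, *map(str, values)])
    print(line)
    return line

# inside the real-time loop, log sparingly so timing is unaffected
for i in range(3):
    x = i * i
    log("SQUARE", i, x)

# run:  ./program > run.log
# then: grep '^SQUARE,' run.log   # select one tag for analysis
```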
If the CSV is not written by me, it's always an exercise in making things as difficult as possible. The format might be a tad smaller, but I find the parsing so painful that you need a really good reason to use it.
Edit: Oh yeah, and some have a header, others don't. And CSV always seems to come from some machine where the techs can come over to do an update and just reorder everything, because fuck your parsing. Then either you get lucky and the parser dies, or, since you don't really have much info, the types just happen to align and you start saving garbage data to your database until a domain expert notices something isn't quite right, and you have to find out when someone last touched the machines and roll back/reparse everything.