A staggering amount of unnecessary and counterproductive scope creep in just 4 i...

chipx86 · 2025-06-04T07:23:30 1749021810

We build a code review product that interfaces with over a dozen SCMs. In about 20 years of writing diff parsers, we've encountered all kinds of problems and limitations in SCM-generated diff files (which we have to process) that we would never have expected to think about beforehand. This all comes from the pain points and lessons learned in that work, and has been a huge help in solving these for us.

Hopefully these aren't problems end users should ever need to worry about, but they're problems that tools need to worry about and work around, especially for SCMs that don't have a diff format of their own, have one that is missing data (in some, not all changes can be represented, e.g. deleted files), or don't include enough information for another tool to identify the file in a repository.

HelloNurse · 2025-06-04T07:55:03 1749023703

Better file formats cannot, by themselves, improve an inferior SCM tool that, for instance, processes files with the wrong text encoding or forgets deleted and renamed files: they would only have helped you for the purpose of developing your code review tool.

Standards are meant for interchange, like (as mentioned in other comments) producing a patch file by any means and having someone else apply it regardless of what they use for version control.

tankenmate · 2025-06-04T07:22:55 1749021775

Not so. Obviously it is less common these days, but I still use patch(1) and friends enough to run into problems from time to time. This is especially true when you have devs on different platforms (don't even get me started on filename mangling / case-folding issues).

Borg3 · 2025-06-04T07:38:25 1749022705

Oh, then this is a management issue, not a tooling one. You need to sit down and analyze where your stuff will be developed. Some very basic rules to start with: file names need to be all lower case (they are case-insensitive), and use 7-bit ASCII encoding for source code files. And voilà :)

NavinF · 2025-06-04T09:03:07 1749027787

Poe's law at work. Replies are taking you literally, but I'm almost certain that you're joking. Very few large projects exclusively have lowercase filenames.

bawolff · 2025-06-04T07:50:18 1749023418

What exactly is the lowest common denominator platform we are trying to target here where we need 7-bit ASCII? MS-DOS?

theamk · 2025-06-05T04:36:28 1749098188

Any system which uses encodings, including Windows and Linux in a non-UTF-8 locale.

keybored · 2025-06-04T07:55:33 1749023733

Could just be Linux. Filenames are just bytes so two equivalent Unicode filenames that have been normalized differently could be confusing. I guess?

I guess since I’m too afraid to use non-ASCII in filenames much.
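
A minimal sketch of how that confusion could be detected, assuming Python's standard `unicodedata` module. The helper name `find_normalization_collisions` is made up for illustration and is not part of any tool mentioned in this thread. It groups names that are different strings but become identical once normalized to NFC:

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(names):
    """Group names that differ as strings but are equal after NFC normalization.

    Hypothetical helper for illustration only.
    """
    groups = defaultdict(set)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].add(name)
    # Keep only the groups that contain more than one distinct spelling.
    return {key: sorted(spellings) for key, spellings in groups.items() if len(spellings) > 1}


# "n" followed by U+0303 (combining tilde) versus the precomposed U+00F1.
names = ["Espan\u0303ol", "Espa\u00f1ol", "README"]
collisions = find_normalization_collisions(names)
print(collisions)
```

For a real directory, the list could come from something like `os.listdir()`; how to treat names that are not valid text in the current locale is left out of this sketch.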

bawolff · 2025-06-04T08:02:44 1749024164

I guess that is fair. If I remember right, Mac uses NFD where literally everyone else in the world uses NFC (Linux might not normalize, but in practice it usually ends up being NFC).

That said, I feel like this is something most tooling could just handle, and not really an issue.

Certainly it's not a problem DiffX is going to solve, since it appears to only store charset and not filename normalization rules.

dotancohen · 2025-06-04T08:54:33 1749027273

I had this condition a few years ago. A folder shared with Dropbox was then renormalized either by Dropbox or by another system, then when it was synced back to the original machine I had two folders with identical names, normalized differently.

I still have some ls and hd output that I stored in my notes files, if anybody is interested.

dotancohen · 2025-06-04T13:59:13 1749045553

Here, found it:

  $ ls
  Español  Español  Français  Français
  $ ls | hexdump -C
  00000000  45 73 70 61 6e cc 83 6f  6c 0a 45 73 70 61 c3 b1  |Espan..ol.Espa..|
  00000010  6f 6c 0a 46 72 61 6e 63  cc a7 61 69 73 0a 46 72  |ol.Franc..ais.Fr|
  00000020  61 6e c3 a7 61 69 73 0a                           |an..ais.|
  00000028

bawolff · 2025-06-04T19:17:06 1749064626

The first one (6e cc 83) is NFD which is used by mac, the second one (c3 b1) is NFC which is used by everyone else.
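
For anyone who wants to reproduce those byte sequences, here is a small sketch using Python's standard `unicodedata` module (the variable names are just for illustration). It normalizes the same name both ways and prints the UTF-8 bytes:

```python
import unicodedata

name = "Espa\u00f1ol"  # written with the precomposed ñ (U+00F1)

# Same visible text, two different byte sequences on disk.
nfc = unicodedata.normalize("NFC", name).encode("utf-8")
nfd = unicodedata.normalize("NFD", name).encode("utf-8")

print(nfc.hex(" "))  # 45 73 70 61 c3 b1 6f 6c
print(nfd.hex(" "))  # 45 73 70 61 6e cc 83 6f 6c
```

The NFD form spells ñ as a plain `n` (6e) followed by the combining tilde U+0303 (cc 83), which is why the two directory entries look identical but compare as different names.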

dotancohen · 2025-06-04T19:20:28 1749064828

Thanks. I did have a company Mac in 2017, and it was connected to that account.

Joker_vD · 2025-06-04T17:03:39 1749056619

> I’m too afraid to use non-ASCII in filenames much.

I suggest installing a fresh Linux distribution with e.g. bg_BG.UTF-8 locale and playing with it, especially with XDG directories like "Плот", "Свалени" and "Документи", and apps that should use them by default. Everything should Just Work™.

Although I admit that when reporting bugs for apps that can't handle non-ASCII paths, the responses from the developers (unless they're themselves from non-English speaking countries, but sometimes even then) quite often seem to be very thinly veiled "I can't be bothered to figure out where I botch things, why can't you just speak English like all reasonable people".

bawolff · 2025-06-04T19:06:20 1749063980

To be fair, as far as Unicode goes, cryllic is kind of the easy case (no combining characters, no RTL, etc.). In some ways it's even easier than (non-English) Latin scripts: with Latin text you can easily get confused by windows-1252, where things sort of work, whereas if you are accidentally using a legacy 8-bit encoding with cryllic you are more likely to figure that out quickly.
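
A rough illustration of that last point, as a sketch in Python. The helper names `misread_as_cp1252` and `ascii_fraction` are made up for this example. It decodes UTF-8 bytes with the wrong codec and measures how much of each name is plain ASCII, which is the part that survives such a mix-up:

```python
def misread_as_cp1252(text):
    """Encode as UTF-8, then decode with the wrong (windows-1252) codec."""
    # errors="replace" guards against the few byte values cp1252 leaves undefined.
    return text.encode("utf-8").decode("cp1252", errors="replace")


def ascii_fraction(text):
    """Fraction of characters that are ASCII and so survive the wrong codec."""
    return sum(ch.isascii() for ch in text) / len(text)


for name in ["Espa\u00f1ol", "Документи"]:
    print(name, "->", misread_as_cp1252(name), f"({ascii_fraction(name):.0%} ASCII)")
```

The Spanish name keeps six of its seven characters, so the damage is easy to overlook. The Bulgarian one contains no ASCII at all, so every character comes out wrong and the problem is obvious immediately.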

HelloNurse · 2025-06-05T06:51:16 1749106276

It's "Cyrillic", named after St. Cyril.