A staggering amount of unnecessary and counterproductive scope creep in just 4 items:
A single diff can’t represent a list of commits
There’s no standard way to represent binary patches
Diffs don’t know about text encodings (which is more of a problem than you might think)
Diffs don’t have any standard format for arbitrary metadata, so everyone implements it their own way.
Of these, only a notation for binary patches would be a reasonable generalization of diff files. Everything else is the internal data structure or protocol of some specific revision control system, only exchanged between its clients and servers and backups.
We build a code review product that interfaces with over a dozen SCMs. In about 20 years of writing diff parsers, we've encountered all kinds of problems and limitations in SCM-generated diff files (which we have to process) that we never would have expected to think about. This all comes from the pain points and lessons learned in that work, and has been a huge help in solving these for us.
Hopefully these aren't problems end users ever need to worry about, but they're problems that tools need to worry about and work around, especially for SCMs that don't have a diff format of their own, have one that is missing data (in some, not all changes can be represented, e.g. deleted files), or don't include enough information for another tool to identify the file in a repository.
Better file formats cannot, by themselves, improve an inferior SCM tool that, for instance, processes files with the wrong text encoding or forgets deleted and renamed files: they would only have helped you for the purpose of developing your code review tool.
Standards are meant for interchange, like (as mentioned in other comments) producing a patch file by any means and having someone else apply it regardless of what they use for version control.
Not so. Obviously it is less common these days, but I still use patch(1) and friends enough to run into problems from time to time. This is especially true when you have devs on different platforms (don't even get me started on filename mangling / case-folding issues).
Oh, then this is a management issue, not a tooling one. You need to sit down and analyze where your stuff will be developed. Some very basic rules to start with: file names need to be all lower case (they are case-insensitive), and source code files should use 7-bit ASCII encoding. And voilà :)
Poe's law at work. Replies are taking you literally, but I'm almost certain that you're joking. Very few large projects exclusively have lowercase filenames.
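For anyone who wants to see the case-folding problem from this subthread concretely, here is a minimal sketch that flags paths in a patch that would collide on a case-insensitive filesystem. The function name `case_collisions` and the sample paths are made up for illustration, and `str.casefold()` is only an approximation: real filesystems apply their own folding rules.

```python
from collections import defaultdict


def case_collisions(paths):
    """Group paths that become identical once case is ignored.

    Uses str.casefold() as a stand-in for filesystem case folding,
    which is an assumption; actual rules differ per filesystem.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[path.casefold()].append(path)
    # Only groups with more than one member are real collisions.
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # Hypothetical file list taken from a patch.
    touched = ["Makefile", "makefile", "README", "src/Util.c", "src/util.c"]
    for group in case_collisions(touched):
        print("collides on a case-insensitive filesystem:", group)
```

A patch touching both `Makefile` and `makefile` applies cleanly on a case-sensitive filesystem and clobbers one file on a case-insensitive one, which is why this is hard to solve by convention alone.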
I guess that is fair. If I remember right, Mac uses NFD where literally everyone else in the world uses NFC (Linux might not normalize, but it basically ends up being NFC in practice).
That said, I feel like this is something most tooling could just handle, and not really an issue.
Certainly it's not a problem DiffX is going to solve, since it appears to only store the charset and not filename normalization rules.
I had this condition a few years ago. A folder shared with Dropbox was then renormalized either by Dropbox or by another system, then when it was synced back to the original machine I had two folders with identical names, normalized differently.
I still have some ls and hd output that I stored in my notes files, if anybody is interested.
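The two-identical-folders situation can be reproduced in a few lines. This is a rough sketch using Python's standard `unicodedata` module; the helper `normalization_collisions` and the sample names are hypothetical, and it says nothing about what Dropbox or any particular filesystem actually does.

```python
import unicodedata
from collections import defaultdict

# The same visible name, spelled two ways.
composed = "caf\u00e9"     # "é" as a single code point (NFC form)
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent (NFD form)


def normalization_collisions(names):
    """Group names that differ only in Unicode normalization form."""
    groups = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # The strings render identically but compare unequal and have different bytes.
    print(composed == decomposed)
    print(composed.encode("utf-8"), decomposed.encode("utf-8"))
    print(normalization_collisions([composed, decomposed, "notes"]))
```

A filesystem that stores names as raw bytes will happily keep both entries side by side, which matches the `ls` output described above.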
> I’m too afraid to use non-ASCII in filenames much.
I suggest installing a fresh Linux distribution with e.g. bg_BG.UTF-8 locale and playing with it, especially with XDG directories like "Плот", "Свалени" and "Документи", and apps that should use them by default. Everything should Just Work™.
Although I admit that when reporting bugs for apps that can't handle non-ASCII paths, the responses from the developers (unless they're themselves from non-English speaking countries, but sometimes even then) quite often seem to be very thinly veiled "I can't be bothered to figure out where I botch things, why can't you just speak English like all reasonable people".
To be fair, as far as Unicode goes, Cyrillic is kind of the easy case (no combining characters, no RTL, etc.). In some ways it's even easier than (non-English) Latin scripts: with Latin text, an accidental mix-up with Windows-1252 can go unnoticed because things sort of work, whereas if you are accidentally using a legacy 8-bit encoding with Cyrillic, you are more likely to figure that out quickly.
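A quick way to see the "sort of works" effect is to write text in one encoding, read it back in another, and count how many of the original characters survive. This is only an illustrative sketch: `surviving_fraction` is a made-up helper, and the encoding pairs (UTF-8 read as Windows-1252, Windows-1251 read as Windows-1252) are just two plausible mix-ups.

```python
def surviving_fraction(text, written_as, read_as):
    """Encode text one way, decode it another, and report what survives.

    Returns the garbled string and the fraction of original characters
    that still appear somewhere in it.
    """
    garbled = text.encode(written_as).decode(read_as, errors="replace")
    kept = sum(1 for ch in text if ch in garbled)
    return garbled, kept / len(text)


if __name__ == "__main__":
    # Latin text: only the accented letter breaks, so the bug is easy to miss.
    print(surviving_fraction("café", "utf-8", "cp1252"))
    # Cyrillic text in a legacy 8-bit encoding: every letter changes at once.
    print(surviving_fraction("Документи", "cp1251", "cp1252"))
```

With mostly-ASCII Latin text the damage is limited to the occasional accented character, while the Cyrillic sample comes back with none of its original letters, so the mistake is hard to overlook.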