
I'm thinking even bog-standard European umlauts, cedillas, etc. go multi-byte in UTF-8? (Take a string of ÅÄÖåäöÜü and chop it off at various byte limits and see.)
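A rough sketch of that experiment in Python, assuming the string is precomposed (NFC) and encoded as UTF-8. The helper name `truncate_utf8` is made up for illustration. Each of these letters takes two bytes, so every odd byte limit lands in the middle of a character.

```python
import unicodedata

# Assumption: force the precomposed (NFC) form so each letter is one code point.
sample = unicodedata.normalize("NFC", "ÅÄÖåäöÜü")


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without leaving a partial character."""
    raw = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops the incomplete trailing sequence, if any.
    return raw.decode("utf-8", errors="ignore")


for limit in range(6):
    naive = sample.encode("utf-8")[:limit]
    try:
        naive_text = naive.decode("utf-8")
    except UnicodeDecodeError:
        naive_text = "<invalid UTF-8>"
    print(limit, naive, repr(naive_text), repr(truncate_utf8(sample, limit)))
```

The naive slice is invalid UTF-8 at every odd limit, while the helper just returns one character fewer.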


This is just the general behavior of truncating strings by code point when they contain decomposed characters (a base letter followed by a combining mark). It can also affect accents, etc.
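To make that concrete, here is a small Python sketch, assuming NFD-decomposed input. The helper name `truncate_code_points` is hypothetical, and it only backs off over combining marks; it is not full grapheme-cluster segmentation (emoji sequences and the like need more than this).

```python
import unicodedata

composed = unicodedata.normalize("NFC", "Åäö")       # 3 code points
decomposed = unicodedata.normalize("NFD", composed)  # 6: each letter + combining mark

# A plain code point slice keeps the "a" but loses its diaeresis.
naive = decomposed[:3]
print(repr(unicodedata.normalize("NFC", naive)))


def truncate_code_points(text: str, limit: int) -> str:
    """Cut to at most `limit` code points, never separating a base letter from its marks."""
    if limit >= len(text):
        return text
    end = limit
    # If the first dropped code point is a combining mark, back up to before its base letter.
    while end > 0 and unicodedata.combining(text[end]):
        end -= 1
    return text[:end]


safe = truncate_code_points(decomposed, 3)
print(repr(unicodedata.normalize("NFC", safe)))
```

With a limit of 3, the naive slice turns "Åä" into "Åa", while the helper stops after "Å".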


I don't remember the details, only that it was a bigger deal than with umlauts. I'll see if I can find the talk again.



