Hacker Timesnew | past | comments | ask | show | jobs | submitlogin

Ropes are great when constructing a large string, but can be much slower and more memory-heavy for comparison, iteration, indexing, and matching. In a typical program, where many of the strings are small literals being matched against or tokenized from a large input, this is often a poor trade-off.

My personal ideal representation of strings, if I were ever doing a new language:

Length-prefixed (in characters) UTF-8, with a redundant null terminator so that the buffer can be passed directly to C libraries without being copied or re-encoded. Assume that heap objects have their size in bytes automatically prefixed (GC requires this), so that it's also possible to get the byte length of the string. Small strings (<7 bytes of UTF-8 on 64-bit, or <3 bytes on 32-bit) are stored immediately; the assumption is the language runtime would use tagged values much like Lisp, so a small string gets a tag value of 0b10 and uses 4 bits or so in that last byte to indicate the length. No null terminator on immediate strings; when they're that small, the cost of copying them to C is negligible (although hmm, the memory management kinda sucks. It'll suck anyway though, as some C libraries expect to take ownership of the string and some don't...and for ones that don't, you could just zero out the tag byte and pass the word directly, then reset the tag when the call completes).

Ropes are available as a standard library data structure with an API identical to strings, but different performance characteristics. All of these are immutable; concatenation and modification result in a copy (in a rope's case, with much shared structure).

Give up on O(1) indexing and slicing, but indexing should always return the complete codepoint at that position in the string and never a corrupted character. Most of the time, indexing or slicing usually involves looking a small number of characters from the beginning or the end, so you can use linear search of UTF-8 codepoints from the nearest endpoint and N ~= 3-10. The exception is finding the midpoint, which should be provided as an API function (bisect?) that takes the midpoint in bytes and then uses UTF-8's self-synchronizing property to find the nearest legal codepoint. indexOf uses Boyer-Moore on bytes, startsWith/endsWith/equals are also byte-wise (well, word-wise, you can significantly speed up string equality tests by comparing by word and then using duff's device on the remainder) with some special casing to handle immediate values above. Split() is basically repeated indexOf with a copy, join() allocates the full string length and then copies in the data. Regexps use a DFA on bytes.

I'm wondering if it's worth introducing a separate data structure (same API) that is analogous to a Slice in Go - a pointer and index into an existing buffer - but my experience is that these often result in subtle memory leaks in GC'd languages. You hold onto a reference to 3 characters in a 300K buffer, and suddenly your program needs to keep all 300K live. It definitely makes split and slice very fast, though.

Also, there should be a separate standard-library type for []bytes, and all I/O should operate on that, with explicit encoding/decoding operations. Encoding happens at program boundaries, all strings inside the language should be UTF-8. Would be cool to allow vectored I/O directly from ropes, too, it'd make for great templating engines & web frameworks.



You can just fall back to a "rope" consisting of a single node for small strings.

One of the things I really like about what I call "pseudo-immutable" ropes (ropes that are not truly immutable, but to all appearances are - when you take a slice it modifies the original rope to maximize structure-sharing.) is that you get immunity to those subtle memory leaks on slicing "for free".

I agree fully on length-prefixed strings. They are a lot better than null-terminated strings, in a bunch of ways. Regexes using DFAs though can be... problematic. You really want a NFA that's converted to a DFA on-the-fly, with caching, or else you can end up with exponential blowup when constructing the DFA (a[<some large character class>]{1000}a, for example.). A DFA-based regex also doesn't handle a number of things.

As for slicing, I had a thought on that matter. Make slices reference the main array weakly, and have (weak) backreferences to any slices made of an array. When an array comes up for GC, if it has any slices made of it, copy out the slices to new arrays. There are a bunch of heuristics that can be tuned here, of course (don't do weakref / backref at all on smaller arrays, don't copy out slices if it won't reduce memory usage, etc.), but it prevents the worst case at least.

The big advantage of slices is that it means that programmers don't need to sacrifice readability for efficiency. It's much easier to be able to pass in arr[7:10000] to a function than to either have to modify the function to take index parameters (which you can't always do) or have to do a temporary copy. It also prevents the O(n^2) behavior associated with doing simple tokenizing (the change to Java's substring behaviour, for example)

I disagree strongly with "Encoding happens at program boundaries". The (major) problem with that is that there becomes no good way to pass strings through the language without going through an encoding pass. Try to implement cat in Python 3, for example. Encoding should happen when you ask for it.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: