Arc Forum | no's comments
4 points by no 6171 days ago | link | parent | on: Clarification about Character Sets

Okay pg, it's time to call you out. Unicode support is not trivial, like you make it out to be, and it's not a waste of time. It's a critical piece of infrastructure for any runtime. You fail.

-----

5 points by pg 6157 days ago | link

I'm glad this is preserved for posterity, since it did turn out to be trivial and in fact got added with about -2 lines of code a few days after this comment was posted...

-----

4 points by maxwell 6171 days ago | link

So, how would you implement it?

-----

6 points by kirkeby 6171 days ago | link

I think the Py3K solution sounds right: two different types of strings, 8-bit byte strings and full Unicode strings, with encode and decode functions to convert between the two.

Without this strict separation you get the wart that is Python's current string support.

-----

9 points by olavk 6171 days ago | link

Or just one type of string: Unicode character strings (which are sequences of Unicode code points). Then a separate type for byte arrays. Byte arrays are not character strings, but can easily be translated into a string (and back), given an encoding.
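
A minimal sketch of that separation, written in Python 3 only for illustration. The sample text and the choice of UTF-8 are assumptions, not anything from the thread:

```python
# One string type (code points) and one byte-array type, with explicit
# conversion between them. The sample text and UTF-8 are illustrative choices.
text = "na\u00efve caf\u00e9"        # str: a sequence of Unicode code points

data = text.encode("utf-8")          # str -> bytes, encoding named explicitly
restored = data.decode("utf-8")      # bytes -> str, same encoding

n_code_points = len(text)            # counts code points
n_bytes = len(data)                  # counts bytes; the two accented letters take 2 bytes each

print(n_code_points, n_bytes, restored == text)
```

The two lengths differ because the byte array measures encoded octets while the string measures code points, which is the reason for keeping the types apart.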

-----

3 points by olavk 6171 days ago | link

...and this seems to be exactly what MzScheme provides :-) Strings in MzScheme are sequences of Unicode code points. "bytes" is a separate type which is a sequence of bytes. There are functions to translate between the two, given an encoding.

Python 3000 is close to this, but I think Python muddles the issue by providing character-related operations like "capitalize" on byte arrays. This is bound to lead to confusion. (The reason seems to be that the byte array is really the old 8-bit string type renamed. Will it never go away?) MzScheme does not have that issue.
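
A small sketch of the kind of confusion meant here, using Python 3's `bytes.capitalize`, which treats each byte as an ASCII character. The sample word is made up for illustration:

```python
# capitalize() exists on both str and bytes, but the bytes version only
# understands ASCII, so it silently does nothing useful for non-ASCII text.
word = "\u00e9lan"                       # "élan" as a str of code points
raw = word.encode("utf-8")               # the same word as UTF-8 bytes

as_text = word.capitalize()              # operates on characters
as_bytes = raw.capitalize()              # operates on ASCII bytes only

ascii_ok = b"hello".capitalize()         # plain ASCII works as expected

print(as_text, as_bytes.decode("utf-8"), ascii_ok)
```

On the string, the first letter is upper-cased. On the byte array, the leading byte is not an ASCII letter, so the result is unchanged and no error is raised.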

-----