Arc Forum
Are strings useful?
4 points by sacado 6087 days ago | 15 comments
Be warned, this might sound strange.

I was wondering: couldn't strings be removed from the language? If you accept making strings immutable (and who needs mutable strings? Besides, if I'm right, the only thing you can currently mutate in a string is individual characters: you cannot append or remove substrings) and removing the character type (as in Python or Lua, accessing the ith character of a string could return a string of length 1), then you don't need strings at all. I mean, they could be unified with symbols.

Aren't strings just needlessly complicating the language spec, and the programs where you have to manipulate symbols? You know, when you have to coerce syms to strings, append them, and then coerce the result back to a symbol one more time?
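For example, building one symbol out of two currently means a round trip like this (a rough sketch, using only coerce and +, which concatenates strings):

  (coerce (+ (coerce 'foo 'string) (coerce 'bar 'string)) 'sym)
  ; => foobar -- instead of never leaving the symbol world at all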

That's probably stupid and wrong, but I can't see why. Help me!



8 points by kens 6087 days ago | link

Strings vs symbols has caused me problems; using symbols as immutable interned strings seems like a hack (in the bad sense).

Some of my complaints: I never know if something expects a string or a symbol, e.g. keys into the req structure. Symbols sometimes require quotes and sometimes not, interacting confusingly with macros. (To make a hash with key foo and value bar: (obj foo 'bar) - bar needs a quote and foo needs no quote.) Symbols have their own strange syntax, e.g. to define a home page, you use the mysterious (defop || ...) because || is the empty symbol. Using symbols as strings makes the code more confusing; is passwd being used as a variable or as a string-like token? I think it's nice that strings have a consistent syntax: if it's got quotes around it, you know it's a string.
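A tiny example of the quoting asymmetry:

  (obj foo 'bar)          ; the key foo is quoted implicitly by the macro, the value bar is not
  ((obj foo 'bar) 'foo)   ; => bar -- yet the key does need a quote when you look it up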

The big advantage of symbols over strings in Lisp is that they are interned, so you can compare two symbols for equality in constant time, unlike strings, which take O(n) time. But it seems you could get the same benefit from interned immutable strings.
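For instance (if I'm reading ac.scm right):

  (is 'foo 'foo)     ; interned symbols: effectively one pointer comparison
  (is "foo" "foo")   ; also t in Arc, but only via a character-by-character string compare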

I agree with you that Arc's marginally mutable strings aren't very useful; being able to only modify individual characters gives you the worst of both worlds.

While I'm complaining about strings, it would be very nice to have something like Python's triple-quote """ that can be used to define a string that contains single and double quotes. To generate my documentation, I'm constantly creating strings that contain double quotes, and it's a big pain to go through and make sure all the backslashes are correct.
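Something like this, where the triple-quoted form is of course hypothetical:

  "<a href=\"foo\">bar</a>"        ; today: every double quote escaped by hand
  """<a href="foo">bar</a>"""      ; hypothetical Python-style triple-quoted string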

Overall, I'd like to see symbols used as symbols and strings used as strings, rather than symbols sometimes used as strings. But maybe I'm missing the goodness of using symbols as strings.

-----

6 points by drcode 6087 days ago | link

I would frame the question differently.

"Strings" are used for communication with a human- Human readable text is an important tool that software uses for interacting with a user.

Inside the actual logic of a program, though, text really has no useful role. It's unfortunate that a byte holds a text character reasonably efficiently, because as a result almost all software uses strings as primary data structures, even though they don't belong in the body of a program.

In my view, strings should only be used in two places in any application: 1. Acquiring raw data from the user, to be parsed into more meaningful internal data structures. 2. The final step of rendering text for the human to read, generated from those internal data structures.

I think it would be interesting (though perhaps not practical) if arc had only two commands with string support: (string-to-sexp fmt str) and (sexp-to-string fmt sexp)... In this case, the 'fmt parameter would be some kind of definition of the string syntax, something like a BNF or regex description. The only other things you could do with a string would be to somehow acquire it from the user and somehow display it to the user. Any internal manipulation of strings would be impossible and unnecessary.
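Usage might look something like this (every name here is made up; 'date-fmt stands in for whatever a BNF/regex-style description would look like):

  (string-to-sexp date-fmt "2008-04-30")    ; => (2008 4 30)
  (sexp-to-string date-fmt '(2008 4 30))    ; => "2008-04-30"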

Once a string is parsed into an sexp, the sexp would contain only symbols and numbers. The symbols would be standard lower-case and contain no weird characters. All strings would need to be translated into such an sexp using the 'fmt description. There would be no string or character data types. (Clearly, completely excising strings from a language like this seems to impose lots of limitations, but I wonder whether they could be addressed with the right approach.)

-----

0 points by bitcirkel 6063 days ago | link

I strongly agree. After all, there are no particular accommodations for sounds, images, and what have you. Why would there be one for verbal data?

Strings have no place in the core logic of a program. It is only because of inertia that we have this remnant from the early terminal days.

And strings bring their own set of problems (not what you want in a hundred-year language): translation, ASCII vs. the latest Unicode version, and coupling (the web is finally beginning to get this right by separating behaviour from content and presentation).

-----

1 point by absz 6063 days ago | link

In fact, many languages (or at least their large libraries) do have accommodations for sounds, images, etc. This is because most languages have some sort of "class" mechanism, and create classes for those things (in their standard library or in a very common one). And sometimes, what you are working on is a string parsing/manipulating/generating program, where strings do belong.

-----

4 points by nlavine 6086 days ago | link

Let me get at two points separately.

First, I have always taken symbols to be things that are conceptually different from strings, but just happen to be represented as strings. After all, symbols really exist because they are part of the interpreter/compiler data structures: they are a way of referencing an identifier, which you might use as, for instance, a variable name. Yes, they can be displayed, and they print as the string associated with their identifier, but I've always considered that a rather uninteresting side effect. Showing a symbol to a user would seem to break layers of abstraction.

Strings, on the other hand, always seemed like the right way to store user-readable text. They have these nice escape sequences, and you can isolate them and change them without affecting your entire program. They can also include spaces and nice stuff like that.

However, it is completely possible that my gut reaction is wrong. Having interned strings can be very useful. I would suggest, however, that we do it with a real interned string facility, and then make symbols a special case of that. This facility should include strings that would not make valid identifiers, including things with spaces and special characters in them.

To your second point, though, you're right: strings are pointless. After all, they're just a special case of a list (or an array, depending on how you like to think). If you have the general case, then having the specific case too is pointless and feels like bloat. You could argue that characters, too, are just a special case of numbers, since that's what they really are underneath. In fact, you could say that the current string type is just a vector of numbers, with cool special syntax.

To which I would say, yes, let's go for the more general stuff. Practically speaking, we should add that in before we take out the specific string case, which after all is used a lot. But in general, yeah, let's make it possible to do this with anything.
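And indeed, if I'm not mistaken, Arc's coerce already treats the two representations as more or less interchangeable:

  (coerce "abc" 'cons)             ; => (#\a #\b #\c)
  (coerce '(#\a #\b #\c) 'string)  ; => "abc"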

If you do allow this abstraction, you also get the interesting benefit that internationalization and Unicode come for free. As PG says, they're just stupid bloat in the core of the language. To keep them out of the core, then, there should be some way for people to implement them without sacrificing anything that a built-in implementation would have.

This means that there needs to be a way to make a new type which is a vector of uniform type (oddly enough, this might be easier in arc2c than in arcn or anarki). It also means that there should be a way to define nice readable syntax for things like strings in double quotes, and isolated single characters.

And it still doesn't address the issue of how you communicate with the development system, which after all does have one preferred string representation - the one that you're writing in.

This definitely wouldn't be an easy or simple thing to do, and it might prove to be a bad idea. But I think it's worth considering.

-----

2 points by absz 6086 days ago | link

You make some really good points, but I have to disagree. Because of Unicode, characters aren't numbers. They have different semantic properties. However, I think arc.arc has a good idea sitting on line 2521 (of the Anarki version): "; could a string be (#\a #\b . "") ?" That is a novel idea, and probably (if more things work with improper lists) a good one. The downside would be that (isa "abc" 'string) wouldn't be true, unless we make strings (annotate "abc" 'string), but then we lose the ability to treat them as lists. Maybe we should have a retype operator, so that (retype "abc" 'string) would return an object str for which (isa str 'string) would be true, but for which all list operations (car, cdr, map, etc.) would still work (so it would probably have to return true for (isa str 'cons), too).
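You could sketch retype on top of Arc's existing annotate/rep, purely as an illustration; it doesn't give you the transparent car/cdr behaviour, which would need support in the core:

  (def retype (x typ)
    (annotate typ x))

  (= str (retype '(#\a #\b #\c) 'string))
  (isa str 'string)   ; => t
  (car (rep str))     ; => #\a -- list operations only work after unwrapping with rep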

-----

2 points by almkglor 6086 days ago | link

> (#\a #\b . "")

Well, Arc lists are (e e . nil). Why do we have a different terminator for strings?

This is where things get hairy. In Scheme lists are (e e . ()), so I would say that having strings be (#\a #\b . "") would make sense there, but Arc specifically tries to pretend that () is nil. Why should "" be different from nil, when () is nil?

As for [isa _ 'string], I propose instead the following function:

  (def astring (e)
    ((afn (e e2)
       (if
         (no e)
           t
         ; tortoise/hare check for circularity
         (is e e2)
           nil
         ; note: Arc's character type is named 'char, not 'character
         (and (acons e) (isa (car e) 'char))
           (self (cdr e) (cddr e2))
         ; else
           nil))
      e (cdr e)))
Edit:

Hmm. I suppose someone will complain that you might want to differentiate between the empty string and nil. To which I respond: how is nil different from the empty list? Arc attempts to unify them; why wouldn't we want to unify the empty string with nil as well? If someone says http://example.com/foobar?x= , how different is that from http://example.com/foobar ? x is still empty/nil in both cases.

As another example: table values are deleted by simply assigning nil to the value field. And yet in practice it hasn't hurt any of my uses of tables; has anyone here found a real use for having a present-but-value-is-nil key in a table?

-----

2 points by almkglor 6086 days ago | link

I personally think that strings and symbols should be separate, largely because of their different uses.

That said, I did recently write an app which used symbols as de facto strings, and the text file format used as a configuration/task-description file by the user was just an s-expr format. The app wasn't written in Lisp or anything like it, which was why I settled for using symbols as strings (to make it easier on my reader function - I just didn't have a separate readable string type).

Given that experience, well, I really must insist that having separate string and symbol types is better. In fact, in the app I wrote it for, the config/taskdesc file was just a glorified association list (where the value is the 'cdr of the list, not the 'cadr)!

As for strings being lists/arrays of characters, yes, that's a good idea. We might hack into the writer and have it scan through each list it finds, checking whether all elements are characters, and if so just print it as a string. We might add an 'astring function which does this checking (obviously with circular-list protection) instead of [isa _ 'string].
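At the user level (rather than in the real writer), a rough sketch, assuming the astring predicate above:

  ; print a char list as a string, everything else as usual
  (def disp-maybe-string (x)
    (if (astring x)
        (disp (coerce x 'string))
        (disp x)))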

-----

2 points by bogomipz 6078 days ago | link

I think the strongest reason for separate strings and symbols is that you don't want all strings to be interned - that would just kill performance.

About lists of chars: rather than analyzing lists every time to see if they are strings, what about tagging them? I've mentioned before that I think Arc needs better support for user-defined types built from cons cells. Strings would be one such specialized, typed use of lists.

Also, how do you feel about using symbols of length 1 to represent characters? The number one reason I can see not to is if you want chars to be Unicode and symbols to be ASCII-only.

-----

2 points by sacado 6077 days ago | link

Symbols, ASCII-only? No way. I write my code in French, and I'm now used to naming things the right way, i.e. with accents. "modifié" means "modified" and "modifie" means "modifies"; they're not the same thing, and I want to be able to distinguish between the two. Without accents, you can't.

Furthermore, that would mean coercing between symbols and strings would become impossible (or at least the 1:1 mapping would no longer be guaranteed).

-----

2 points by stefano 6077 days ago | link

From the implementation point of view representing characters as symbols is a real performance issue, because you would have to allocate every character on the heap, and a single character would then take more than 32 bytes of memory.

-----

2 points by sacado 6077 days ago | link

I think that's an implementation detail. You could still keep something like the character type in the implementation, but write characters as "x" (or 'x) instead of #\x and make (type c) return 'string (or 'sym).

Or, if you take the problem the other way, you could say "length-1 symbols are quite frequent and shouldn't take too much memory -- let's represent them in a special way where they only take 4 bytes".

-----

1 point by stefano 6076 days ago | link

This would require some kind of automatic type conversions (probably at runtime), but characters-as-symbols seems doable without the overhead I thought it would lead to.

-----

2 points by sacado 6075 days ago | link

OK, I think I've made up my mind about this now. Thanks for your comments. Well, after fighting one more time yesterday against things that were strings when I expected them to be symbols, I would really love to see them merged; the string/symbol distinction feels like an onion to me. That is, as long as any string can be translated into one and only one symbol, and vice versa. That means case sensitivity, UTF-8 in symbols, and so on: everything Arc already has.

I don't know whether interning strings would really kill performance. Lua does it, if I remember correctly, and it is quite fast. Anyway, mutable strings are a real performance killer, since you must compare them with Scheme's 'equal? instead of 'eq?.

Now, you are right too: they don't represent exactly the same thing, so simply removing one of them would feel quite wrong, but sometimes you have to make the two worlds communicate. That's a matter of syntax, too. After all, "Hello, Arc world" could be special syntax meaning the quotation of the symbol |Hello,\ Arc\ world|. A triple-quote syntax could be added, too.
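For what it's worth, the 1:1 mapping mostly exists already, and if I'm not mistaken the underlying mzscheme reader already accepts the |...| form:

  (coerce "Hello, Arc world" 'sym)       ; => |Hello, Arc world|
  (coerce '|Hello, Arc world| 'string)   ; => "Hello, Arc world"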

It can be put the other way around, too: every time you need a string and are given a symbol (or the opposite), an automatic coercion could be performed.
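In other words, roughly what these two hypothetical helpers do by hand, but done implicitly by the language:

  (def as-sym (x)    (if (isa x 'sym)    x (coerce x 'sym)))
  (def as-string (x) (if (isa x 'string) x (coerce x 'string)))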

Well, anyway, I'm just thinking out loud, since I'm not here to design the language (and since its designers don't seem to show up very often anymore) :)

-----

1 point by absz 6087 days ago | link

This has come up before, but I don't like it: symbols and strings feel distinct to me. http://tinyurl.com/6e9fv5 makes some interesting points about (Ruby) symbols, some of which apply here; I think that Ruby 1.9 will make Symbol a subclass of String, and Symbols will effectively be frozen strings (though that might have gone away). It might be nice if there were some relationship like that between strings and symbols, but I do think keeping them separate is useful.

-----