Arc Forum | Binary Search Trees in Arc

Arc Forum

Binary Search Trees in Arc

20 points by pg 6696 days ago | 45 comments

I put an example implementation of (nondestructive) binary search trees at

  http://ycombinator.com/arc/bst.arc

I'd appreciate it if people could tell me (a) if there are any bugs, (b) if there are bits that could be rewritten to be shorter, and (c) what the equivalent program would look like in other languages. Thanks!

6 points by rkts 6695 days ago | link

Bug:

  (def other (side) (case side 'l 'r 'r 'l))

should be

  (def other (side) (case side l 'r r 'l))

This was causing bst-rem to behave incorrectly when removing a node with two children.

-----

3 points by pg 6695 days ago | link

Thanks! I updated the code.

-----

4 points by almkglor 6696 days ago | link

Wouldn't it be better to have f< as a slot in (deftem node ...) ? That way we don't have to keep specifying it, although it does complicate how to specify the null tree.

-----

2 points by rkts 6696 days ago | link

Not to mention this is a cause of potential bugs: you have to supply the right f< with every call, or you'll get weird behavior.

-----

2 points by pg 6696 days ago | link

This code is derived from the bst code I wrote for ACL. I remember when I wrote it thinking people might want to use different ordering functions when operating on the same tree. Now that doesn't seem so likely. But are there conceivable cases when one might want to?

-----

4 points by almkglor 6696 days ago | link

By definition a binary search tree has its l link as < r link. If a different f< suddenly causes the r link to be < l link, then your bst suddently becomes incorrect and weird stuff happens.

-----

3 points by pg 6695 days ago | link

Suppose you have a bst that persists for a long time. Suppose the things you're sorting are not mere integers, but a structures that represent something in the world, ranked according to some scoring function. Suppose you're able to figure out a slightly more accurate scoring function. If you're passing f< as an argument on each bst operation, you can just start using the new ranking fn. Whereas if you've been storing the comparison fn in the tree as you've been building it, you're probably going to have to rebuild the tree if the comparison fn changes.

I'm not saying this is going to be common enough to drive the design of a bst library. But it's by no means logically impossible to change f<.

-----

4 points by almkglor 6695 days ago | link

Suppose we have two f<, f<old and f<new, to be used in a binary search tree b.

f<new may safely be used to replace f<old iff:

  all x in b: all y in b: (f<old x y) == (f<new x y)

This stems from the binary sort tree invariants:

  all x in b!l: (f<old x b!v)
  all x in b!r: (f<old b!v x)

And the definition of all x of b:

  all x of b = {all x in b!l, b!v, all x in b!r}
  all x of nil = {}

This is because if even a single case is wrong, that part of the binary search tree concerned with keeping track of the correct order becomes wrong. In such a case, it is necessary to rebuild the tree.

Of course, note that the above condition is necessary only for all current existing nodes, i.e. if we have a new m:

  all x in b: m != x

then it is okay that:

  all x in b: (f<old m x) != (f<new m x)

While your style allows such an edge case, it is arguably such an edgy edge case that you might as well rebuild the tree if f<old != f<new; this is largely because you have to check that f<new == f<old for all current members of the tree, and that checking is O(N^2), whereas rebuilding the tree is O(N) taking O(2N) space (and your nondestructive bst takes O(N) space for each operation anyway).

-----

3 points by pg 6695 days ago | link

I was thinking about cases where you'd be willing to tolerate a small amount of misordering in the tree (since you were till then using the old scoring function, which presumably misordered things or you wouldn't have improved it). But deletion wouldn't work if you changed the scoring fn, and the number of cases where you both want to change the ordering function and never need to do deletions must be small enough that a general-purpose bst lib doesn't need to support that.

-----

3 points by almkglor 6694 days ago | link

I don't think you quite understand. A misordered binary search tree will fail lookups - (bst-find ...) will fail, even if that object is returned by (bst-elts ...). If you're going to create a general-purpose bst lib with "you can change the sort function without rebuilding the tree!" then you'd better make plenty damn sure there's a big warning saying "...but if you do (bst-find ...) might fail." Personally, I'd rebuild the tree anyway, because if the change is significant enough to change the sort function, it's big enough to completely destroy the meaning of the binary search tree.

-----

3 points by pg 6694 days ago | link

I get that. When I think of using a bst I think of what News.YC needs: the top-ranked 100 or so stories out of 100,000, and no need for find or delete. (Though currently News.YC just truncates the list and sorts it, which would stop working at very high submission rates.)

-----

1 point by almkglor 6694 days ago | link

c/ref Mythical Man-Month, Frederick P. Brooks.

A programming systems product takes about nine times as much effort as the component programs written separately for private use. I estimate that productizing imposes a factor of three; and that designing, integrating, and testing components into a coherent system imposes a factor of three; and that these cost components are essentially independent of each other.

-----

4 points by jonnytran 6695 days ago | link

You know... there's another amazing new Lisp that is getting better by the day and flying under the radar.

Clojure version of BSTs: http://paste.lisp.org/display/55915

"The Clojure version is shorter, is in an enumerable namespace, and has a lazy bst/seq function that interoperates with the rest of Clojure."

I didn't do the node count myself, but it is fewer lines.

-----

5 points by pg 6694 days ago | link

The Clojure version is shorter

It doesn't look to me like it's obviously shorter. I haven't tried counting nodes in both either, but in nonwhitespace chars, which should be a reasonable approximation of nodes in two languages so similar, the Arc version is slightly shorter: 755 chars vs 774.

-----

4 points by lojic 6696 days ago | link

I would love to see a Haskell version of this.

-----

7 points by raymyers 6695 days ago | link

Ask and ye shall receive. http://ray.codezen.org/wiki/doku.php?id=binary-search-tree

-----

4 points by lojic 6695 days ago | link

Very nice - thanks!

  Haskel:
  36 total lines; 1,167 chars
  Arc:
  53 total lines; 1,215 chars

I don't know the formula for a "code tree".

-----

4 points by gregwebs 6695 days ago | link

This doesn't tell the whole story. The haskell version is doing some balancing. For an AVL tree with rotations implemented through pattern matching see here. http://www.nabble.com/Re%3A-Why-functional-programming-matte...

The line deriving(Eq,Show) means that you can now do this:

  Prelude BST> Tree 'a' Empty Empty
  Tree 'a' Empty Empty
  Prelude BST> let a = Tree 'a' Empty Empty
  Prelude BST> a
  Tree 'a' Empty Empty
  Prelude BST> let b = Tree 'b' Empty Empty
  Prelude BST> b
  Tree 'b' Empty Empty
  Prelude BST> a == b
  False
  Prelude BST> let c = Tree 'c' a b
  Prelude BST> let d = Tree 'c' a b
  Prelude BST> c == d
  True

All this is guaranteed at compile time. Moreover, the haskell version is much more readable. The use of modules removes unnecessary function prefixing, and there is no doubt about what the arguments to functions are when they are pattern matched.

Curiously, though, the arc version actually has less tokens by my measurement

  ruby -e 'p File.read(ARGV[0]).split(/\s|\.|!/).reject{|s| s==""}.map{|tok| tok.gsub(/\(|\)/,"")}.flatten.uniq.size'

  bst.arc 205
  bst.hs  253

but here is the interesting thing- now lets count just the unique tokens

  bst.arc 41
  bst.hs   47

This is in part due to haskell's pattern matching semantics where the same function definition is repeated multiple times, and the '|' character is repeatedly used. Also, there are more functions in the arc version, and in both pieces of code, variable names are usually one character and function names have multiple characters.

-----

2 points by rkts 6695 days ago | link

> The haskell version is doing some balancing.

Balancing that causes remove to be O(n). Probably not a good idea.

-----

2 points by raymyers 6695 days ago | link

A more balanced tree will speed up 'find' and 'insert'. If those actions are performed far more often than 'remove', it would indeed be a good idea. Without knowing the use case, I cannot say which I would prefer.

-----

3 points by pg 6695 days ago | link

It's fairly easy to do by hand. Just count the number of tokens-- in the sense of things whose first character a parser would not treat as merely another character in the name it was previously reading-- that are not closing delimiters of a pair.

-----

2 points by vrk 6695 days ago | link

If by that you mean counting the nodes, you need to parse the source first. You can either do it by hand (if you know the grammar) or tweak the compiler/interpreter, if it doesn't have any statistics option.

-----

4 points by lojic 6695 days ago | link

Is there a bug in bstMin ? Shouldn't it reference bstMin on the rhs?

-----

2 points by raymyers 6695 days ago | link

Fixed. Thanks!

-----

1 point by pg 6695 days ago | link

It's missing bst-trav. (The functions that are supposed to be part of the external interface are the ones that begin bst-...)

Also, is numerical < hard-wired as the comparison function?

-----

2 points by gregwebs 6695 days ago | link

no need for bst-trav since there is an elements function. You can just map over the elements.

in the data declaration 'a' is a generic type, but Haskell will infer that the type must be an instance or class Ord (i.e. implement <,<=,>=,>,max,min,==). (If the type is not an instance of Ord it will tell you so at compile time)

-----

2 points by pg 6695 days ago | link

no need for bst-trav since there is an elements function. You can just map over the elements

Mapping over elements is not the same if you're looking for the 4th element of a huge tree; which is probably what you have if you're using a bst.

-----

7 points by raymyers 6695 days ago | link

Actually yes, in Haskell it is the same. Lazy evaluation means you could take the 4th element of a map over an infinite list. For example: "map (*2) [1..] !! 4"

-----

2 points by rkts 6695 days ago | link

I would argue that relying on overloaded comparison functions is a bad idea. Not all things have a 'natural ordering,' and even for those that do you might want to override it at some point.

-----

2 points by raymyers 6695 days ago | link

I think reasonable people can differ on this. Passing comparators as arguments is an approach more typical of dynamically-typed languages (e.g. CL and Scheme 'sort'). In a statically-typed language, it is more common to use a comparison operator that specializes on type (e.g. Haskell Prelude 'sort'). If a set of objects don't have a 'natural ordering', they might not belong in a Binary Search Tree in the first place.

-----

3 points by pg 6695 days ago | link

What do you do if you want to maintain a list of strings sorted according to their length?

-----

4 points by raymyers 6695 days ago | link

If I wanted to keep using the Ord type-class for comparison, I might use a container type, like this:

    data Elem = Elem String deriving (Show, Eq)
    instance Ord Elem where
        Elem x < Elem y = length x < length y
    
    tree1 = insert (Elem "Some string") Empty
    tree2 = remove (Elem "Some string") tree1

If I really wanted to pass in the comparison function, I would have to change my code slightly. See: "A More Faithful Translation" (http://ray.codezen.org/wiki/doku.php?id=binary-search-tree).

-----

0 points by gregwebs 6695 days ago | link

You have just seen a bunch of code that gets the job done, whereas there is still an argument about whether the lisp code should be passed a comparison function. If you are going to make this argument you should show some actual code, because it is obvious from looking at this example which way is better.

-----

2 points by pg 6695 days ago | link

You have just seen a bunch of code that gets the job done,

It seems to me more that you have redefined the job to be what the code does.

-----

3 points by raymyers 6694 days ago | link

Or maybe we just interpreted the job as "a non-destructive binary search tree". Sorry if we misread the syllabus :)

-----

3 points by gregwebs 6694 days ago | link

Sorry, that comment doesn't sound nice and is dumb because I didn't realize what the previous commenter was getting at.

-----

2 points by lojic 6696 days ago | link

Would someone be so kind as to explain the following line (1007) from within the definition of copy in arc.arc from arc1.tar?

  (apply (fn args args) x)

-----

7 points by shiro 6696 days ago | link

apply copies its argument list (guaranteed by Scheme spec). So this line lets the apply copy x, then passes it to a function that just returns the copied list. It may be faster than traversing the list in arc, for apply may be able to take advantage of underlying Scheme.

-----

2 points by akkartik 6696 days ago | link

Worth encapsulating in a macro perhaps?

  (copy x)

-----

1 point by lojic 6696 days ago | link

Cool - thx. Probably worth a small comment in the code.

-----

2 points by okplus 6696 days ago | link

To me, this code is illustrating that "." and "!" should have their meanings reversed. "b!l" and "b!r" in arc would map to "b.l" or "b.r" in python, while "b.side" in arc, is something more akin to "b.getattr(side)"

Ofcourse, swapping the meanings of ! and . puts a cramp in syntax like "b.0"...

-----

1 point by rkts 6695 days ago | link

Here's a sloppy version 0: http://benstoker.com/code/arcbst.ml

I implemented all the bst- functions except bst-rem-edge, which seems redundant given bst-rem and bst-edge.

-----

2 points by rkts 6694 days ago | link

Update: I've rewritten my solution above to be as close as possible to the original Arc code. All the function interfaces should be the same. Equality is determined with (=), which is the closest thing OCaml has to Arc's is.

I've also created a separate version that's closer to how I would prefer to implement it:

http://benstoker.com/code/mybst.ml

The main differences are:

1. The comparison function is specified using a functor. This allows the user to specify it only once instead of in every call to insert, find, and so on. (I'd like to see something like ML's functors in Lisp someday.)

2. The comparison function handles equality as well as order.

3. edge is replaced by min_elt and max_elt.

4. rem-edge is gone.

5. The output of to_list (aka elts) isn't backwards.

-----

2 points by pg 6695 days ago | link

Are you sure it works the same? Bst-rem-edge is simpler than bst-rem; it doesn't rebalance the tree.

-----

2 points by rkts 6695 days ago | link

Maybe I'm not understanding your code (it's kind of cryptic...) but it does seem the same to me. When bst-rem removes a node with one child, it just grabs the other child. This is the same thing bst-rem-edge does.

Your 'bubble' function calls bst-edge to find the left/rightmost node and then calls bst-rem-edge to remove it. You could just as well call bst-edge and then bst-rem on the result; either way you traverse the tree twice. I can't imagine a situation where you'd just want bst-rem-edge by itself.

Come to think of it, I don't know why you need bst-edge either. It's simpler to just have min and max functions that return an element, as in the Haskell solution.

-----