I haven't figured out how to measure CPU time in Java; the profiler just calls System.nanoTime() at entry and exit of each function invocation and notes the difference.
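(For comparison, the same entry/exit bookkeeping is easy to sketch in MzScheme. This is only an illustration with made-up names, not the profiling macro itself, and like the nanoTime approach it measures wall-clock time rather than CPU time.)

#lang scheme
; wrap a function so every call adds its elapsed wall-clock time
; to a per-function running total
(define times (make-hash))

(define (instrument name f)
  (lambda args
    (let ([start (current-inexact-milliseconds)])
      (begin0 (apply f args)
        (hash-set! times name
                   (+ (hash-ref times name 0)
                      (- (current-inexact-milliseconds) start)))))))

; usage (fib is just a stand-in): (set! fib (instrument 'fib fib)), then inspect times after a run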
Welcome :) And thanks for the profiling macro! I'd love to get to the point where we can press a button and have the computer say, "ding! 90% of your CPU time is being spent in foo"
The general variable assignment procedure in Scheme is set!, as in (set! x 3). But this is an error when x is undefined.
I just noticed compile-allow-set!-undefined the other day:
Welcome to MzScheme v4.2.1 [3m], Copyright (c) 2004-2009 PLT Scheme Inc.
> (set! a 3)
set!: cannot set undefined identifier: a
> (compile-allow-set!-undefined #t)
> (set! a 3)
>
They're to macros what first-class functions are to functions. That is, the ability to create anonymous macros and pass them around as first-class objects. Imagine you could do something like
(let m (macro (x) `(foo ,x))
...)
or
(map (macro (x) `(= ,x 'blah)) xs)
It's like the difference between having fn and just having def. Right now we only have mac and no way to reference macros as their own entity.
arc> ((annotate 'mac (fn (x) x)) 'x)
Error: "Function call on inappropriate object #(tagged mac #<procedure>) (x)"
Whenever the topic resurfaces, the word seems to be the same: they still make compilation difficult (the most recent research seems to be from around 2000: http://lambda-the-ultimate.org/node/2987) and pg has yet to comment on them. So to implement first-class macros currently, you'd probably have to effectively turn ac into an interpreter so that macros could expand at run-time.
That example specifically? It was nonsense. My point was that you could, say, map macros (anonymous or not). Take macros used to inline functions, for instance: you want to use one like a function, but it's a macro for efficiency. Currently, you have to wrap it in a function, which is kludgey.
Silly example:
; instead of
(map complement list-of-fns)
; you have to do
(map [complement _] list-of-fns)
Again, macros being first-class intuitively works like functions being first-class. So, with first-class macros you could have expressions return macros.
(mac m ...)
((traced m) ...) ; traced is a higher-order function
; since m is a macro, (traced m) returns a macro
More popularly, Bawden's paper (http://people.csail.mit.edu/alan/mtt/) demonstrates writing a module system with first-class macros in which the modules are also first-class.
I think the example was supposed to set all of the variable names contained in the list xs to contain the value 'blah.
The example wasn't supposed to do anything. I was just foo-bar-baz-ing something. I don't think the example would actually do that anyway, for the same reason the following doesn't work.
arc> (mac m (x) `(= ,x 'blah))
#(tagged mac #<procedure: m>)
arc> (m 'a)
Error: "reference to undefined identifier: _quote"
I guess it would've alleviated confusion had I said
Well, first class macros don't work right now anyway, so I wasn't expecting your example to work. I was just trying to figure out the idea behind the example.
And I think that first-class macros would be required to set an arbitrary list of variables to an arbitrary list of values via map. Obviously with or let would work in a local scope, but if you want global variables you'd need something like that, or at least a function that wraps the macro.
Oh, do you mean you want access to the lexical environment at run time? That sounds to me more like you want the lexical environment to be a first-class object.
The lexical environment is (mostly) accessible at run time. The problem is when you're returning a macro from a function, instead of just renaming it.
Suppose you implement packages as tables of symbols. Suppose also that you have a macro stored in said package. There are two ways to use the macro:
1.
(= newmac package!newmac)
(newmac args)
2.
(package!newmac args)
I think that most people would like to use the second form, because it doesn't require importing the macro into the current namespace. Unfortunately, the second version requires first class macros.
Obviously, since the compiler can't easily determine that an expression is going to be a macro, it won't have the advantage of compile time optimization. I'm not sure how important that is. Given that arc is already a "slow" language, why not give it more expressive power at the cost of slightly lower speed? Especially since the user can choose whether or not to use them.
So, as I see it, first-class macros give you several features:
1. easy to make a package system that can contain macros
2. macros fit in better with functions; no wrapping of macros in functions
3. anonymous macros created for mapping across expressions, etc.
In a call (expr1 expr2 expr3), evaluation of expr2 and expr3 can be delayed until after expr1 is evaluated. If expr1 turns out to evaluate to a macro, then the literal expr2 and expr3 can be passed to the macro.
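Something like this toy evaluator shows the shape of it (just a sketch, nothing to do with ac itself; all the names are invented):

#lang scheme
; a "first-class macro" here is just a tagged expander function
(define-struct macro (expander))

; evaluate the operator first; only then decide whether the remaining
; forms get evaluated (function call) or handed over literally (macro)
(define (eval-expr expr env)
  (cond
    [(symbol? expr) (cdr (assq expr env))]
    [(not (pair? expr)) expr]
    [else
     (let ([op (eval-expr (car expr) env)])
       (if (macro? op)
           (eval-expr (apply (macro-expander op) (cdr expr)) env)
           (apply op (map (lambda (e) (eval-expr e env)) (cdr expr)))))]))

; e.g. a "macro" that reports the length of its literal argument form
(define env (list (cons '+ +)
                  (cons 'form-length (make-macro (lambda (form) (length form))))))
(eval-expr '(form-length (+ 1 2 3)) env)   ; => 4, not 6: it received the literal form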
I don't understand your comment that caching isn't an option because it's picking one story at a time to render. You have a bunch of feeds containing a bunch of stories. Some other background thread / process / program / box can fetch stories, render them, and store the rendered story to disk. Now all your user interface has to do is choose which story to present to the user.
In most programs that are too slow, it's a small percentage of the code that is causing the slowness. If you work on making all of your code faster, you're wasting 95% of your effort, because it's only 5% of the code which is causing a problem. So it's worthwhile to run timing tests to figure out what exactly is the slow part, and to apply techniques such as caching to make those parts fast enough.
If you can't think of any algorithms or methods to handle the parts that are too slow, then you can rewrite your program (either the whole program or just the slow parts) in a faster language. But this will only take you so far. If you rewrite your program in a language which is 20 times faster, now you can support 60 users instead of just 3, which is great, but how then do you scale beyond 60 users? At some point you need to figure out what your strategy is for handling lots of users, and rewriting in a faster language is only a temporary fix for that issue.
"If you rewrite your program in a language which is 20 times faster, now you can support 60 users instead of just 3, which is great, but how then do you scale beyond 60 users?"
We're talking about 60 concurrent users. If I could do 60 requests per second I'd be a happy man. 60 requests per second is 216k requests/hr. Assuming 30 page views per happy readwarp user, that's 7200 uniques per hour. Let's say daily uniques at that point is 6 times that, 40k uniques per day. With 40k uniques a day I'd be a happy man. More importantly, 40k uniques will take me n years to get to. So you see, I don't need to "scale beyond 60 users".
Right now readwarp isn't even at 3 requests per second because requests can take several seconds with contention. But that's not ideal either. And it's largely because of the language and the platform.
What you describe in the first paragraph is not caching, it's precomputation. It may help, but it's a completely different performance calculation. The reason caching works for HN is that even if it gets 10k users in 5 minutes, it only has to generate a dozen pages, once, that they all see. When I have to manage 100x as many pages in the 'cache' I have to be a lot more careful not to waste time on stories that never get shown.
Precomputation doesn't actually save very much work at all. All it does is move latencies around, often by throwing hardware at the problem. And it can easily go off the rails and end up doing a lot of wasted work without helping latency.
Precomputation does help sometimes. In fact I use it in readwarp in a couple of places. But it's not as straightforward a cost-benefit analysis as caching would be. When caching works it's just brain-dead to reason about. With precomputation I have to constantly monitor whether I'm getting the bang for my buck. And I have to be conservative.
The bigger question is why arc behaves like it takes 3-5s to render one page. All I'm doing is selecting a story and rendering it.
Hmm, I probably haven't fully explored all these ideas. Every now and then my server just hangs for a few seconds and becomes completely unresponsive, and I haven't been able to figure out what it's doing at that point, to get a stack trace or something. For any timing instrumentation I insert, times often vary wildly, by an order of magnitude. I'm starting to suspect PLT's timings shoot through the roof because of contention between threads.
I should just try to offload all my state in a way that can handle multiple coexisting arc interpreters. I'll probably do that. Arc seems quite memory-heavy, though. I'm not sure how many interpreters can coexist on a machine.
You're right that I'll still have to scale in another language, but language features can make my life easier or harder when I have to scale to multiple machines. Having a global interpreter lock perhaps makes it hard to avoid thread contention, and to balance responsiveness with throughput. I know the python folks have had similar issues. OTOH, having just one request per interpreter makes arc far more inefficient in RAM than say mongrel.
"The bigger question is why arc behaves like it takes 3-5s to render one page. All I'm doing is selecting a story and rendering it."
Actually there are 3 stages: selecting a story (A), rendering it (B), and updating user data structures when the vote on it returns (C). Each user request is seeing the total time for C+A+B.
When I eliminate rendering (the server does all the computations, but doesn't send the story to the client) I still see timeouts with 3 concurrent requests.
---
Ah, I got an improvement. Remember those 'couple of places' where I was doing precomputation? I was trying to batch up stage-A computations 5 at a time in a precompute thread when a user's running low on stories. I got rid of them, and the average latency for everyone went up from 1s to 3-4s even with just one concurrent request at a time. But I can now handle 6 concurrent requests without timing out.
When I eliminate stage C, the number of concurrent requests I can handle inches up to 10. At 10, median request time is 13s. But at least stuff is getting processed.
My lessons:
1. When the interpreter is single-threaded, try to minimize use of threads. Contention gets expensive fast. I am now used to seeing processing latency double when the number of concurrent requests doubles.
2. When things are slow, resist the temptation to precompute. It always makes hot spots harder to see in the future. It's a last resort. And it'll always cost you on a single-threaded platform. I'm now going to rerun my profiling.
3. When you're stuck, come blather about your problems on the arc forum. Thanks guys.
So I can vary the number of assignments each op does in the URL.
Experiment 1: just a hundred sequential requests.
$ ab -n 100 "http://127.0.0.1:8080/?iters=10"
On my laptop 98% of requests complete in 3ms.
Experiment 2: let's inch up to 2 concurrent requests at a time.
$ ab -n 100 -c 2 "http://127.0.0.1:8080/?iters=10"
Now 50% of requests require at least 4ms. For 98% it's 10-15ms.
Experiment 3:
$ ab -n 100 -c 10 "http://127.0.0.1:8080/?iters=10"
Now 50% of requests require 20-40ms.
In general the 50% latency number that ApacheBench returns seems to increase linearly with the number of concurrent requests. This rule of thumb has held in my app as well over several major changes. Right now readwarp requests take 800ms on localhost and 2-3s from the internet at large. Not much room for those latencies to grow linearly.
I can't compare your first and last example because one says 98% and the other 50%.
If it were an apples-to-apples comparison, would you agree that it wouldn't be surprising for each request to take 10x as long when the server is handling 10x the number of simultaneous requests?
The only apples-to-apples comparison is going from one simultaneous request to two simultaneous requests, which shows an increase from 3ms to 10-15ms; that's about double what we'd expect if using threads had no overhead at all.
When I gave just the 98% number, I meant the 50% number stays the same (+/- 1ms). We're at the limits of ab's time resolution here.
Yes, none of this is surprising given that PLT is single-threaded. I'm trying to remember my class in queuing theory... If each request takes m ms to process, then with n concurrent requests the last one takes mn ms: (n-1)m ms of queuing delay plus m ms of processing time. The average latency seen would be roughly mn/2, which scales linearly with n. As the number of concurrent requests grows, the curve of latency against number of concurrent requests is linear.
If you now add a second queue, the curve stays flat until 2 requests and then moves linearly up. Each queue is only n/2 requests long, so the average queuing delay is (n-1)m/4 ms. With q queues, there's on average only (n-1)m/(2q) ms of queuing delay.
Summary: if you can have multiple truly-concurrent threads, you'll be able to overlap that many concurrent requests without any increase in latency, and the slope of the curve past that point will be inversely proportional to the number of threads. Does this make sense?
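To put rough numbers on that using my own figures: with m = 800ms per request and n = 10 concurrent requests on a single queue, the last request in line waits (n-1)m = 7.2s and average latency is about mn/2 = 4s. With q = 4 truly-concurrent threads, average queuing delay drops to (n-1)m/(2q) = 900ms, so average latency would be closer to 1.7s.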
---
After I wrote my previous comment I started looking in greater depth into why I'm taking 800ms to process a request. I'm really not doing that much. I'll post an update with what I find. Perhaps I haven't spent enough time attacking m before looking to expand q.
When you say that PLT is "single-threaded", are you referring to the fact that PLT only uses one CPU?
Well, yes, if you have 10 CPUs then you can handle 10 simultaneous requests in the same elapsed time as 1 CPU can handle 1 request, assuming you don't have any overhead in communication between CPUs, locks, etc.
I think it's clearer to talk about having one CPU or multiple CPUs instead of "truly-concurrent threads" or "thread contention".
You can have multiple threads interleaving on a CPU, and in some systems that support more than one CPU, a thread can run for a while on one CPU and then get moved to running on a different, less busy CPU.
Now if you said that on one CPU 100 requests one at a time took a total of 300ms to process, but 100 requests running interleaved on one CPU took a total of 800ms to process, then we would have evidence that there's too much overhead in thread context switching or some other problem.
I think the quality of the thread implementation has a role to play. When a process waits on IO the OS will find something else to do. I don't know if user-level threads can be that smart, and I don't know if PLT threads are that smart.
Without casting aspersions, it's not just about the number of processors. Very few programs avoid IO altogether, and if there's IO some programs will be more clever than others in moving work around.
In a properly working thread system, all threads that aren't waiting for something should be given a fair slice of the available CPU time. In particular, no thread that is ready to run should be blocked or delayed because some other thread is waiting on I/O.
If you are wondering whether PLT threads work correctly, isn't this rather easy to test? Do an A/B test: run some threads doing some computation, with and without other threads waiting on I/O, and see if the runtimes are the same.
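Something along these lines would do it (a sketch; the loop size and names are arbitrary):

#lang scheme
; time a CPU-bound loop by itself, then again while another thread
; sits blocked on I/O (reading from an empty pipe), and compare
(define (busy-loop n)
  (unless (zero? n) (busy-loop (- n 1))))

(define (time-it thunk)
  (let ([start (current-inexact-milliseconds)])
    (thunk)
    (- (current-inexact-milliseconds) start)))

(printf "alone: ~a ms~n" (time-it (lambda () (busy-loop 10000000))))

(define-values (in out) (make-pipe))
(thread (lambda () (read-char in)))   ; this thread just waits on I/O

(printf "with blocked thread: ~a ms~n"
        (time-it (lambda () (busy-loop 10000000))))

If the two times come out roughly equal, the scheduler is doing its job; if the second is much larger, something is wrong.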
The main reason is that I already have a shared host, so I figured I might as well run arc on it, instead of paying for a separate VPS. Especially if I can get better performance/reliability than running it locally.
Also, I figured that there might be some other people interested in doing the same thing.
By "manage", do you mean to download a page? Has curl been giving you trouble downloading https files?
For downloading http or https files or pages, both curl and wget work well, and I'm surprised to hear that curl would be giving you trouble. What exactly is the problem that you're seeing?
For looking through a downloaded HTML page for links that match a particular pattern, I often find that regexes work well.
I've tried a dozen options in curl... but I couldn't get it working... curl: (35) error:140773F2:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert unexpected message
However, wget worked for me!!! And I should be able to write code to follow the links from there.
Thanks.
For your second question, we could work on getting your macro-defining macro to work, if you wanted. For this particular task, personally I'd do something simpler: I'd make the link like this:
Not sure if I'm quite following you. I think you're saying that you're getting segfaults on your new server even though your sample regex test is working; but if that's true, how do you know that it's bug C, that the segfaults are coming from the regexes?
It's dying in the same phase as before. I emit prints before and after the regex-heavy phase. That's where it's dying. The error is still "SIGSEGV fault on 0x10000084"[1]
I mean, clearly I'm not capturing some aspect of the bug since it's not showing up in the sample test. It seems simpler to assume there's one bug in PLT Scheme rather than two, and that we just don't fully understand it.
[1] update: I suppose that's just some error handler
If you want to pursue this, here are some steps...
If your server dies when running live, you can get it to die yourself by feeding it the same input as it gets from your browser and/or your users.
Now you have a large and complex program that you can get to segfault by feeding it some large and complex set of data.
Then you can run the same program with the same data in the same version of MzScheme on your personal computer, and see if it segfaults there as well. (If not, perhaps it is an OS library problem).
If it also segfaults on your computer, that's good news, you're on your way to track it down.
You can ask on the PLT mailing list what debugging information they'd like to see to track down a segfault.
And, if you feel like it, you can try removing chunks of your program to find a smaller test case that triggers the segfault.
If you can get it to segfault with chunks of code that are non-proprietary to you, that you don't mind publishing, then you can even post a link to a zip or a tar containing the code and the input data and instructions on how to trigger the segfault. Then it would be possible for one of the PLT people to find their bug, even if you have to apologize that your test case is so giant because you couldn't find a smaller test case to trigger the bug.
Alternatively, if this regex code is something you could push off to do in another process (such as by shelling out to Perl to do the regex stuff for you), you could try that.
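From MzScheme that could be as simple as something like this (a sketch; munge.pl and the file names are placeholders for whatever your regex pass does):

#lang scheme
(require scheme/system scheme/port)

; write the text out, let a Perl script do the regex-heavy work in a
; separate process, and read the result back
(define (perl-regex-pass text)
  (with-output-to-file "in.txt" (lambda () (display text)) #:exists 'replace)
  (system "perl munge.pl in.txt > out.txt")
  (call-with-input-file "out.txt" port->string))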
You know, those steps were approximately what I've been trying to follow :) I thought I had boiled it down to a test case, but now I'm betrayed by it. Betrayed! :)
I don't really have a problem posting the code, but it'll be a huge ball of mud since it includes arc as well. I'm not sure the PLT folks will want to mess with that to find a bug that only occurs sometimes and may well be arc's fault.
It doesn't seem as easy as replaying a request. I encounter the error during a periodic data import phase in my server. It's not being triggered by a specific request. And it doesn't happen every time the program goes into that phase. Sometimes I see the segfault 6 times a day (the phase runs once an hour), and sometimes everything's peachy for 3 days.
Shelling out to perl is a good idea. I think I'll try using your hack for that.
If you do the exact same data import with the program in the same initial state, will it segfault again? (You may need to add code to record e.g. the contents of your data files prior to the import and what exactly is being imported).
Or, if you start in a known state and run fifty data imports in a row, can you get it to segfault?
Arc is not supposed to be able to segfault MzScheme. To MzScheme, Arc is just a big program written in Scheme. And MzScheme running Scheme programs is not supposed to segfault.
(If you aren't using the C foreign function interface, of course. A messed up pointer can cause unrelated code to blow up later. So the above paragraph is true if you are only using libraries which are either written in Scheme [and/or Arc, since Arc is written in Scheme] or part of the official MzScheme distribution).
It's true that the PLT folks certainly don't want to be debugging your Arc program, but they do want to be able to get MzScheme to work. And the way to enable them to do that is to give them a test case that they can run to see the segfault. If you can. Even if it is thousands of lines of code and megabytes of data. If you give them a shell script "diedie" that runs a MzScheme program (that runs Arc and loads your program and imports your data) that segfaults, then you make it possible for them to find out what is causing the segfault.
"If you aren't using the C foreign function interface, of course."
Yeah, that phase actually performs stemming as well, using the FFI. For a long time that was my prime suspect, but I have never been able to get stemming to segfault in isolation, either within arc or from a C program. And I have been able to get my program to segfault without the stemming.
You're right, I'll take another stab at a test case. And I'll not try so hard to make it tiny.
You know, that was one reason I was inclined to move to v372: version 4 is a huge change. If PLT seems less mature now because of these errors, I wanted to treat pre-4.0 and post-4.0 as separate beasts.
But I gave up on porting json.ss to v372 (it's probably something really simple; I'll mail you a file just FYI) and now I'm going to go fix non-stability issues on readwarp for a couple of days.