Arc Forumnew | comments | leaders | submitlogin
2 points by aw 5386 days ago | link | parent

I don't understand your comment that caching isn't an option because it's picking one story at a time to render. You have a bunch of feeds containing a bunch of stories. Some other background thread / process / program / box can fetch stories, render them, and store the rendered story to disk. Now all your user interface has to do is choose which story to present to the user.

In most programs that are too slow it's a small percentage of the code which is causing the slowness. If you work on making all of your code faster, you're wasting 95% of your effort, because it's only 5% of the code which is causing a problem. So it's worthwhile to run timing tests and figure out what exactly is the slow part, and work on algorithms such as caching to make those parts fast enough.

If you can't think of any algorithms or methods to handle the parts that are too slow, then you can rewrite your program (either the whole program or just the slow parts) in a faster language. But this will only take you so far. If you rewrite your program in a language which is 20 times faster, now you can support 60 users instead of just 3, which is great, but how then do you scale beyond 60 users? At some point you need to figure out what your strategy is for handling lots of users, and rewriting in a faster language is only a temporary fix for that issue.



2 points by akkartik 5385 days ago | link

"If you rewrite your program in a language which is 20 times faster, now you can support 60 users instead of just 3, which is great, but how then do you scale beyond 60 users?"

We're talking about 60 concurrent users. If I could do 60 requests per second I'd be a happy man. 60 requests per second is 216k requests/hr. Assuming 30 page views per happy readwarp user, that's 7200 uniques per hour. Let's say daily uniques at that point is 6 times that, 40k uniques per day. With 40k uniques a day I'd be a happy man. More importantly, 40k uniques will take me n years to get to. So you see, I don't need to "scale beyond 60 users".

Right now readwarp isn't even at 3 requests per second because requests can take several seconds with contention. But that's not ideal either. And it's largely because of the language and the platform.

-----

1 point by akkartik 5386 days ago | link

What you describe in the first paragraph is not caching, it's precomputation. It may help, but it's a completely different performance calculation. The reason caching works for HN is that even if it gets 10k users in 5 minutes it has to generate a dozen pages once that they all see. When I have to manage 100x as many pages in the 'cache' I have to be a lot more careful not to waste time on stories that never get shown.

Precomputation doesn't actually save very much work at all. All it does is move latencies around, often by throwing hardware at the problem. And it can easily go off the rails and end up doing a lot of wasted work without helping latency.

Precomputation does help sometimes. In fact I use it in readwarp in a couple of places. But it's not as straightforward a cost-benefit analysis as caching would be. When caching works it's just brain-dead to reason about. With precomputation I have to constantly monitor whether I'm getting the bang for my buck. And I have to be conservative.

The bigger question is why arc behaves like it takes 3-5s to render one page. All I'm doing is selecting a story and rendering it.

Hmm, I probably haven't fully explored all these ideas. Every now and then my server just hangs for a few seconds and becomes completely unresponsive, and I haven't been able to figure out what it's doing at that point, to get a stack trace or something. For any timing instrumentation I insert, times often vary wildly, by an order of magnitude. I'm starting to suspect PLT's timings shoot through the roof because of contention between threads.

I should just try to offload all my state in a way that can handle multiple coexisting arc interpreters. I'll probably do that. Arc seems quite memory-heavy, though. I'm not sure how many interpreters can coexist on a machine.

You're right that I'll still have to scale in another language, but language features can make my life easier or harder when I have to scale to multiple machines. Having a global interpreter lock perhaps makes it hard to avoid thread contention, and to balance responsiveness with throughput. I know the python folks have had similar issues. OTOH, having just one request per interpreter makes arc far more inefficient in RAM than say mongrel.

-----

1 point by akkartik 5385 days ago | link

"The bigger question is why arc behaves like it takes 3-5s to render one page. All I'm doing is selecting a story and rendering it."

Actually there's 3 stages. Selecting a story (A), rendering it (B), and updating user data structures when the vote on it returns (C). Each user request is seeing the total time for C+A+B.

When I eliminate rendering (the server does all the computations, but doesn't send the story to the client) I still see timeouts with 3 concurrent requests.

---

Ah, I got an improvement. Remember those 'couple of places' I was doing precomputation? I was trying to batch up stage-A computations 5 at a time in a precompute thread when a user's running low on stories. I got rid of them, and the average latency for everyone went up from 1s to 3-4s even with just one concurrent request at a time. But I can now handle 6 concurrent requests without timing out.

When I eliminate stage C the number starts inching up to 10. At 10 median request time is 13s. But at least stuff is getting processed.

My lessons:

1. When the interpreter is single-threaded try to minimize use of threads. Contention gets expensive fast. I am now used to seeing processing latency double when #concurrent requests doubles.

2. When things are slow resist the temptation to precompute. It always makes hot spots harder to see in the future. It's a last resort. And it'll always cost you on a single-threaded platform. I'm now going to rerun my profiling.

3. When you're stuck come blather about your problems on the arc forum. Thanks guys.

-----

1 point by aw 5385 days ago | link

Contention gets expensive fast

Can you provide an example that shows the problem with thread contention?

-----

1 point by akkartik 5384 days ago | link

  (= tab* (table))

  (defop || req
    (repeat (int:arg req "iters")
      (= (tab*:rand) (obj 1 2 3 4))))

  (new-thread asv)
So I can vary the number of assignments each op does in the URL.

Experiment 1: just a hundred sequential requests.

  $ ab -n 100 "http://127.0.0.1:8080/?iters=10"
On my laptop 98% of requests complete in 3ms.

Experiment 2: let's inch up to 2 concurrent requests at a time.

  $ ab -n 100 -c 2 "http://127.0.0.1:8080/?iters=10"
Now 50% of requests require at least 4ms. For 98% it's 10-15ms.

Experiment 3:

  ab -n 100 -c 10 "http://127.0.0.1:8080/?iters=10"
Now 50% of requests require 20-40ms.

In general the 50% latency number that apacheBench returns seems to increase linearly with the number of concurrent requests. This rule of thumb has held in my app as well over several major changes. Right now readwarp requests take 800ms on localhost and 2-3s from the internet at large. Not much room for those latencies to grow linearly.

-----

1 point by aw 5384 days ago | link

I can't compare your first and last example because one says 98% and the other 50%.

If it were an apples to apples comparison would you agree that it wouldn't be surprising that if the server were handling 10x the number of simultaneous requests that each request would take 10x as long?

The only apples to apples comparison is going from one simultaneous request to two simultaneous requests which has the increase from 3ms to 10-15ms, which seems like about double what we'd expect if using threads had no overhead at all.

-----

1 point by akkartik 5384 days ago | link

When I give just the 98% number I meant the 50% number stays the same (+/- 1ms). We're at the limits of ab's time resolution here.

Yes, none of this is surprising given that PLT is single-threaded. I'm trying to remember my class in queuing theory.. If each request takes m ms to process, n concurrent requests will take mn ms to process, (n-1)m ms of queuing delay and m ms of processing time. The average latency seen would be mn/2 which scales by n. As the number of concurrent requests grows, the curve of latency to number of concurrent requests would be linear.

If you now add a second queue, the curve stays flat until 2 requests and then moves linearly up. Each queue is only n/2 long, so average queuing delay is (n-1)m/4 ms. With q queues, on average there's only (n-1)m/2q ms of queuing delay.

Summary: if you can have multiple truly-concurrent threads, you'll be able to overlap that many concurrent requests without any increase in latency, and the slope of the curve past that point will be inversely proportional to the number of threads. Does this make sense?

---

After I wrote my previous comment I started looking in greater depth into why I'm taking 800ms to process a request. I'm really not doing that much. I'll post an update with what I find. Perhaps I haven't spend enough time attacking m before looking to expand q.

-----

1 point by aw 5384 days ago | link

When you say that PLT is "single-threaded", is what you're referring to is that PLT only uses one CPU?

Well, yes, if you have 10 CPU's then you can handle 10 simultaneous requests in the same elapsed time as 1 CPU can handle 1 request. If you don't have any overhead in communication between CPU's, locks, etc.

I think it's clearer to talk about having one CPU or multiple CPU's instead of "truly-concurrent threads" or "thread contention".

You can have multiple threads interleaving on a CPU, and in some systems that support more than one CPU, a thread can run for a while on one CPU and then get moved to running on a different, less busy CPU.

Now if you said that that on one CPU 100 requests one at a time took a total of 300ms to process but 100 requests running interleaved on one CPU took a total of 800ms to process, then we would have evidence that there's too much overhead in thread context switching or some other problem.

-----

1 point by akkartik 5382 days ago | link

I think the quality of the thread implementation has a role to play. When a process waits on IO the OS will find something else to do. I don't know if user-level threads can be that smart, and I don't know if PLT threads are that smart.

Without casting aspersions, it's not just about number of processors. Very few programs avoid IO altogether, and if there's IO some programs will be more clever than others in moving work around.

-----

1 point by aw 5382 days ago | link

In a properly working thread system, all threads that aren't waiting for something should be given a fair slice of the available CPU time. In particular, no thread that is ready to run should be blocked or delayed because some other thread is waiting on I/O.

If you are wondering if PLT threads work correctly, isn't this rather easy to test? Do an A/B test with some threads that do some computation with and without some other threads waiting on I/O and see if the runtimes are the same.

-----