Friday, January 4, 2013

Android: running native code on multiple CPU cores

Recently I've been trying to speed up some C++ code running on my dual-core Android device. The problem was that two threads that I used were not necessarily running on two cores in parallel (some insight here). I'm going to briefly describe my situation and how I managed to fix the problem.

My device is an Ainol Aurora II tablet. I don't know whether the methods described here will work on yours. It probably depends on the CPU model and/or OS version. Maybe you won't even have such problems in the first place.

Problem

The application is real-time and needs a high, consistent framerate, so I need predictable thread behaviour. If the app runs faster thanks to multithreading 90% of the time, but for the remaining 10% the entire computation is squeezed onto one core and the framerate drops by one third, then this kind of multithreading is useless to me.

One thread is the default thread that is called through JNI from the onDrawFrame method. The other threads (only one in the case of my dual-core device, but possibly more) are spawned as worker threads when the application is brought to the foreground. Such a thread spends part of its time waiting on a condition variable (I'll write CV from now on) for signals from the main thread. A signal tells it to start working. When finished, it signals back through another CV that the work is completed and starts waiting for the next piece of work.
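The handshake described above can be sketched roughly as follows. This is only an illustration, not the application's real code: the Worker struct, workerMain, runChunk and stopWorker are hypothetical names, and the "work" is a placeholder multiplication.

```cpp
#include <pthread.h>

// Hypothetical state shared between the main thread and one worker.
struct Worker {
    pthread_mutex_t mutex;
    pthread_cond_t  workReady;  // main -> worker: "start working"
    pthread_cond_t  workDone;   // worker -> main: "work is completed"
    bool hasWork;
    bool finished;
    bool quit;
    int  input;
    int  result;

    Worker() : hasWork(false), finished(false), quit(false), input(0), result(0) {
        pthread_mutex_init(&mutex, nullptr);
        pthread_cond_init(&workReady, nullptr);
        pthread_cond_init(&workDone, nullptr);
    }
    ~Worker() {
        pthread_cond_destroy(&workDone);
        pthread_cond_destroy(&workReady);
        pthread_mutex_destroy(&mutex);
    }
};

// Worker thread body: sleep on the CV, do one chunk, signal back, repeat.
void* workerMain(void* arg) {
    Worker& w = *static_cast<Worker*>(arg);
    pthread_mutex_lock(&w.mutex);
    for (;;) {
        while (!w.hasWork && !w.quit)            // loop guards against spurious wakeups
            pthread_cond_wait(&w.workReady, &w.mutex);
        if (w.quit)
            break;
        w.hasWork = false;
        int in = w.input;
        pthread_mutex_unlock(&w.mutex);

        int out = in * 2;                        // placeholder for the real chunk of work

        pthread_mutex_lock(&w.mutex);
        w.result = out;
        w.finished = true;
        pthread_cond_signal(&w.workDone);
    }
    pthread_mutex_unlock(&w.mutex);
    return nullptr;
}

// Main-thread side: hand over one chunk and block until the worker is done.
int runChunk(Worker& w, int input) {
    pthread_mutex_lock(&w.mutex);
    w.input = input;
    w.finished = false;
    w.hasWork = true;
    pthread_cond_signal(&w.workReady);
    while (!w.finished)
        pthread_cond_wait(&w.workDone, &w.mutex);
    int r = w.result;
    pthread_mutex_unlock(&w.mutex);
    return r;
}

// Asks the worker to leave its loop so it can be joined.
void stopWorker(Worker& w) {
    pthread_mutex_lock(&w.mutex);
    w.quit = true;
    pthread_cond_signal(&w.workReady);
    pthread_mutex_unlock(&w.mutex);
}
```

In a real frame the main thread would do its own share of the work between signalling the worker and waiting for it; runChunk blocks immediately only to keep the sketch short.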

When the work pieces are reasonably big (taking 5-10 milliseconds), the two threads seem to run on separate cores most of the time. Not ALL the time (every few seconds there is an increase in frame processing time, indicating that one core is idle and the other is overloaded), but it might be acceptable.

In general, it seems that when a thread spends a significant part (~50%) of its time on a CV and doesn't do huge chunks of work, but rather multiple small chunks separated by waits on a CV, the scheduler doesn't bother keeping it on a separate core all (or at least most) of the time.

Unfortunately, the work pieces that I have happen to be small, more like half a millisecond. The most important part of the processing is an iterative algorithm, and after each iteration all threads need to be synchronized (they wait for each other). When they wait using CVs, the problem described above happens.

Solution 1 - busy waiting

The first solution, which seems to work most of the time, is very simple. Instead of waiting on condition variables between subsequent chunks of work, I made the threads do the following:

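// Spin until the main thread sets WorkToDo, instead of sleeping on a CV.
while(true)
{
    pthread_mutex_lock(&t.MutexSignal);
    if(t.WorkToDo)
        break; // note: leaves the loop with MutexSignal still locked
    pthread_mutex_unlock(&t.MutexSignal);
}
// MutexSignal is held here and has to be unlocked by the code that follows.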

(I'm not sure the locks are necessary here, but if I wanted to avoid them I would probably need to dive deep into CPU memory models to make sure everything is ordered properly.)
I think it's usually advised to use CVs for this kind of stuff, but well... now it works. I'm very lame at concurrency and so rather hesitant to draw conclusions, but it seems that in this case the scheduler perceives the thread as working all the time (as opposed to the situation where the thread waits on a CV for another piece of work), so it considers it appropriate to give it a separate core.

Solution 2 - thread affinity

OK, but busy-waiting is not always that good, I suppose. Additionally, this solution may not be very general. Another solution is to assign threads to separate cores manually. Given that sched_setaffinity() apparently is still not working correctly on Android, I successfully used the code given in this Stack Overflow topic to set thread affinities.

It seems like a fast and easy solution, but there is a tricky part here. Initially, I only called this code when the threads were spawned. It did not work out very well and I was close to dumping the idea, but after closer examination I noticed that it was slightly better than before (the second core was used a little more often).
I then modified the affinity-setting code to run EVERY FRAME. Apparently that's important, because from then on both cores were used consistently all the time.

Update

I'm working on a new application and it turns out that setting thread affinity doesn't always work. I got my hands on a Nexus 7 with a quad-core CPU. Two of the four cores are powered off most of the time, and the syscall that is supposed to set the affinity returns with an error claiming that there are only two cores available! (Well, it doesn't say exactly that, but you can infer it from the error code and the circumstances.) I have a procedure that takes ~17ms (so most of the frame time), and even though the execution is split between four threads with their affinities updated every frame, only two cores are actually used (I tested it with Usemon). This still gives a nice speedup: the procedure now takes ~9ms.

Using busy waiting as described above powers up all of the cores, and the execution time drops to ~5ms. It's not a perfect solution, since every core's usage is now 100%.

I am going to multithread a couple more functions and see if running more code in four threads will convince the scheduler to give the app more power. My last app uses the affinity setting solution and it runs on four Nexus cores very well, probably because things are executed on multiple threads basically all the time (which also results in almost 100% CPU usage, but at least it's time well spent :) ).

Otherwise, I might just let the user choose at runtime whether to use all available cores or to save the battery.