I noticed previously that Tomcat's WebappClassLoader is heavily serialized. In fact, the entire loadClass entry point is marked synchronized, so for libraries that hit the class loader frequently the impact on scalability is remarkable. Of course, ideally you would not hit the ClassLoader hundreds of times per second, but sometimes that's out of your control.
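To make the bottleneck concrete, here is a minimal sketch of the shape of the problem (the class name is mine, not Tomcat's code): when the whole loadClass entry point is synchronized, every caller queues on the loader's monitor, even for classes that were loaded long ago.

```java
// Hypothetical illustration of a fully serialized loader, in the style
// described above. Both the already-loaded fast path and the delegation
// to the parent happen under the same lock, so concurrent lookups of
// long-since-loaded classes are still serialized.
public class FullySynchronizedLoader extends ClassLoader {

    public FullySynchronizedLoader(ClassLoader parent) {
        super(parent);
    }

    @Override
    public synchronized Class<?> loadClass(String name) throws ClassNotFoundException {
        // Every thread contends on 'this' before it can even check
        // whether the class is already loaded.
        return super.loadClass(name);
    }
}
```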
I decided to play some more with JMH and run some trials to compare the impacts of various strategies to break the serialization.
I trialled four implementations:
1) GuavaCaching - a decorator on WebappCL which uses a Guava cache
2) ChmCaching - a decorator on WebappCL which uses a ConcurrentHashMap (no active eviction)
3) ChmWebappCL - a modified WebappCL using a ConcurrentHashMap, so that loadClass only synchronizes when it delegates up to the parent loader; classes loaded through the current loader are found in a local map
4) Out-of-the-box Tomcat 8.0.0-RC1 WebappClassLoader - loadClass fully synchronized
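The decorator idea behind ChmCaching can be sketched roughly as follows (class and field names are my own, not the actual benchmark code): wrap the delegate loader and memoise successful lookups in a ConcurrentHashMap, so repeat lookups never touch the delegate's lock.

```java
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch of the ChmCaching decorator: the delegate would be the
// WebappClassLoader in the benchmark; here it can be any ClassLoader.
public class ChmCachingClassLoader extends ClassLoader {

    private final ConcurrentHashMap<String, Class<?>> cache = new ConcurrentHashMap<>();
    private final ClassLoader delegate;

    public ChmCachingClassLoader(ClassLoader delegate) {
        super(delegate);
        this.delegate = delegate;
    }

    @Override
    public Class<?> loadClass(String name) throws ClassNotFoundException {
        Class<?> cached = cache.get(name);
        if (cached != null) {
            return cached;                       // lock-free fast path
        }
        Class<?> loaded = delegate.loadClass(name); // may block on the delegate's lock
        cache.putIfAbsent(name, loaded);
        return loaded;
    }
}
```

Note that, as the observations below mention, this caches classes resolved by the parent and system loaders too, since the map is populated with whatever the delegate returns.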
The results, in operations per microsecond, where an operation is a lookup of java.util.ArrayList and java.util.ArrayDeque:
Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16
GuavaCaching | 3687 | 978 | 1150 | 1129 | 1385 | 1497 | 1607 | 1679 | 1777 | 1733 | 1834 | | | | |
ChmCaching | 27241 | 51062 | 81678 | 107376 | 134798 | 162125 | 188192 | 213007 | 208034 | 210812 | 200231 | 211744 | 214431 | 215283 | 209782 | 212297
ChmWebappCL | 185 | 48 | 81 | 81 | 83 | 81 | 84 | 85 | 85 | 85 | 80 | 84 | 82 | 83 | 83 | 84
WebappCL | 181 | 69 | 91 | 92 | 91 | 92 | 100 | 98 | 95 | 94 | 94 | 95 | 102 | 102 | 95 | 98
And the explanation -
- GuavaCaching seems remarkably slow compared to CHM; that might be worth investigating further. I also hit significant issues with the Guava implementation: some tests ran for an extremely long time, and there appears to be an issue in the dequeue (at a quick look, it seems to stall in a while loop).
- ChmCaching seems very effective, although it also caches classes loaded from the parent and system loaders. This seems OK per the API, but it is unusual; I will have to check the API in more detail. It scales roughly linearly with cores (this is an 8-core machine).
- ChmWebappCL had less of an effect. This is likely because I am loading core java.util.* classes rather than classes from a JAR added to the class loader, so every lookup delegates to the system loader, which means entering the synchronized block. I expect ChmWebappCL could approach ChmCaching's speed if the JARs were attached directly to the class loader instead of passing through to the system loader.
- WebappCL - very slow performance.
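The ChmWebappCL idea described above can be sketched like this (my reconstruction, not the actual Tomcat patch): classes this loader defined itself are served from a ConcurrentHashMap without locking, and only the delegation to the parent/system loader takes the synchronized slow path.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the ChmWebappCL approach: lock-free lookups for locally
// loaded classes, synchronization only when reaching up to the parent.
public class ChmLocalFirstLoader extends ClassLoader {

    private final ConcurrentHashMap<String, Class<?>> localClasses = new ConcurrentHashMap<>();

    public ChmLocalFirstLoader(ClassLoader parent) {
        super(parent);
    }

    // In a real loader, findClass/defineClass would record each locally
    // defined class here so later lookups hit the lock-free map.
    protected void recordLocal(String name, Class<?> clazz) {
        localClasses.put(name, clazz);
    }

    @Override
    public Class<?> loadClass(String name) throws ClassNotFoundException {
        Class<?> local = localClasses.get(name);
        if (local != null) {
            return local;                 // lock-free: class loaded by this loader
        }
        synchronized (this) {             // slow path: delegate to parent loader
            // java.util.* classes always take this path in the benchmark,
            // which is why ChmWebappCL showed no gain over the stock loader there.
            return getParent().loadClass(name);
        }
    }
}
```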
And pretty pictures. You can see that CHM caching is far and away the best of this bunch.
Same picture, at log scale -
In your bug post you say you think this slowness was an artifact of YourKit - the JMH test case seems to clearly show a major performance difference.
I'm seeing a lot of concurrency backed up in the new Java Mission Control app, tied to the class loader (I found your blog post while searching about it) - I was about to throw in a caching classloader based on one of yours to see if it fixes it...
Hi Ryan - totally agree with you. TC classloader is definitely a source of contention, but I was unable to come up with a real-world scenario where it affected throughput.
If you have enough of a test bench you could probably re-run the tests and see how it goes - I'd be keen to hear about it. I assume you are looking at Tomcat?
What I found is that when I used the default Tomcat class loader in a real-world scenario (about 2000 req/sec including business logic, DB queries, output transforms, etc.), changing the class loader did not significantly affect the req/sec rate. It only made a difference under the profiler or in contrived examples - e.g. calling loadClass in a tight loop. I don't have the test bench to drive enough load onto the server - I was able to saturate the NIC before running out of CPU.
I'll play with it more - I currently have a test bench capable of generating that load (two machines with 10Gb NICs, fast CPUs, etc.) - I'll try it in an A/B scenario and see if there is an improvement.
The app has plenty of other bottlenecks, so this one might be small in comparison - but JMC seems to indicate that contention at the class loader could be one of them (it's not clear whether the times JMC reports as blocked waiting are 'real times' or 'profiler times' - I'm still getting familiar with the tool).
I have one tweak for the Guava-based one: there is a "concurrencyLevel" parameter you can set on the CacheBuilder - I have a fork where I set it to 16 to see if it improves things in the benchmark.
What are you passing to JMH in your tests? (I'm a JMH noob - I'm guessing "java -jar target/microbenchmarks.jar ".*" -r 10 -t 8 -tc" might do the trick?)