I noticed previously that Tomcat's WebappClassLoader is heavily serialized. In fact, the entire loadClass entry point is marked synchronized, so for libraries that hit the class loader frequently the impact on scalability is remarkable. Of course, ideally you would not hit the ClassLoader hundreds of times per second, but sometimes that's out of your control.
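To make the bottleneck concrete, here is a minimal sketch of the shape of the problem (the class name is mine, not Tomcat's code): when the whole loadClass entry point is synchronized, every caller queues on the loader's monitor, even for classes that were loaded long ago.

```java
// Hypothetical illustration of a fully serialized loader, in the style
// described above. Both the already-loaded fast path and the delegation
// to the parent happen under the same lock, so concurrent lookups of
// long-since-loaded classes are still serialized.
public class FullySynchronizedLoader extends ClassLoader {

    public FullySynchronizedLoader(ClassLoader parent) {
        super(parent);
    }

    @Override
    public synchronized Class<?> loadClass(String name) throws ClassNotFoundException {
        // Every thread contends on 'this' before it can even check
        // whether the class is already loaded.
        return super.loadClass(name);
    }
}
```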
I decided to play some more with JMH and run some trials to compare the impacts of various strategies to break the serialization.
I trialled four implementations:
1) GuavaCaching - a decorator on WebappCL which uses a Guava cache
2) ChmCaching - a decorator on WebappCL which uses a ConcurrentHashMap (no active eviction)
3) ChmWebappCL - a modified WebappCL using a ConcurrentHashMap, so that loadClass only synchronizes when it delegates up to the parent loader; classes loaded through the current loader are found in a local map
4) Out-of-the-box Tomcat 8.0.0-RC1 WebappClassLoader - loadClass fully synchronized
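The decorator idea behind ChmCaching can be sketched roughly as follows (class and field names are my own, not the actual benchmark code): wrap the delegate loader and memoise successful lookups in a ConcurrentHashMap, so repeat lookups never touch the delegate's lock.

```java
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch of the ChmCaching decorator: the delegate would be the
// WebappClassLoader in the benchmark; here it can be any ClassLoader.
public class ChmCachingClassLoader extends ClassLoader {

    private final ConcurrentHashMap<String, Class<?>> cache = new ConcurrentHashMap<>();
    private final ClassLoader delegate;

    public ChmCachingClassLoader(ClassLoader delegate) {
        super(delegate);
        this.delegate = delegate;
    }

    @Override
    public Class<?> loadClass(String name) throws ClassNotFoundException {
        Class<?> cached = cache.get(name);
        if (cached != null) {
            return cached;                       // lock-free fast path
        }
        Class<?> loaded = delegate.loadClass(name); // may block on the delegate's lock
        cache.putIfAbsent(name, loaded);
        return loaded;
    }
}
```

Note that, as the observations below mention, this caches classes resolved by the parent and system loaders too, since the map is populated with whatever the delegate returns.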
The results, in operations per microsecond, where an operation is a lookup of java.util.ArrayList and java.util.ArrayDeque:
Threads | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16
GuavaCaching | 3687 | 978 | 1150 | 1129 | 1385 | 1497 | 1607 | 1679 | 1777 | 1733 | 1834 | | | | |
ChmCaching | 27241 | 51062 | 81678 | 107376 | 134798 | 162125 | 188192 | 213007 | 208034 | 210812 | 200231 | 211744 | 214431 | 215283 | 209782 | 212297
ChmWebappCL | 185 | 48 | 81 | 81 | 83 | 81 | 84 | 85 | 85 | 85 | 80 | 84 | 82 | 83 | 83 | 84
WebappCL | 181 | 69 | 91 | 92 | 91 | 92 | 100 | 98 | 95 | 94 | 94 | 95 | 102 | 102 | 95 | 98
And the explanation -
- GuavaCaching seems remarkably slow compared to CHM; that might be worth investigating further. I also hit significant issues with the Guava implementation: some tests ran for an extremely long time, and there appears to be an issue in the dequeue (at a quick look, it seems to stall in a while loop).
- ChmCaching seems very effective, although it also caches classes loaded from the parent and system loaders. This seems OK per the API, but it is unusual; I will have to check the API in more detail. It scales roughly linearly with cores (this is an 8-core machine).
- ChmWebappCL had less of an effect. This is likely because I am loading core java.util.* classes rather than classes from a JAR added to the class loader, so every lookup delegates to the system loader, which means entering the synchronized block. I expect ChmWebappCL could approach ChmCaching's speed if the JARs were attached directly to the class loader instead of passing through to the system loader.
- WebappCL - very slow performance.
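The ChmWebappCL idea described above can be sketched like this (my reconstruction, not the actual Tomcat patch): classes this loader defined itself are served from a ConcurrentHashMap without locking, and only the delegation to the parent/system loader takes the synchronized slow path.

```java
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the ChmWebappCL approach: lock-free lookups for locally
// loaded classes, synchronization only when reaching up to the parent.
public class ChmLocalFirstLoader extends ClassLoader {

    private final ConcurrentHashMap<String, Class<?>> localClasses = new ConcurrentHashMap<>();

    public ChmLocalFirstLoader(ClassLoader parent) {
        super(parent);
    }

    // In a real loader, findClass/defineClass would record each locally
    // defined class here so later lookups hit the lock-free map.
    protected void recordLocal(String name, Class<?> clazz) {
        localClasses.put(name, clazz);
    }

    @Override
    public Class<?> loadClass(String name) throws ClassNotFoundException {
        Class<?> local = localClasses.get(name);
        if (local != null) {
            return local;                 // lock-free: class loaded by this loader
        }
        synchronized (this) {             // slow path: delegate to parent loader
            // java.util.* classes always take this path in the benchmark,
            // which is why ChmWebappCL showed no gain over the stock loader there.
            return getParent().loadClass(name);
        }
    }
}
```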
And pretty pictures. You can see that CHM caching is far and away the best of this bunch.
Same picture, at log scale -
In your bug post you say you think this slowness was an artifact of YourKit - the JMH test case seems to clearly show a major performance difference.
I'm seeing a lot of concurrency backed up in the new Java Mission Control app, tied to the class loader (I found your blog post while searching about it) - I was about to throw in a caching classloader based on one of yours to see if it fixes it...
Hi Ryan - totally agree with you. TC classloader is definitely a source of contention, but I was unable to come up with a real-world scenario where it affected throughput.
If you have enough of a test bench you could probably re-run the tests and see how it goes - I'd be keen to hear about it. I assume you are looking at Tomcat?
What I found is that when I used the default Tomcat class loader in a real-world scenario (about 2000 req/sec including business logic, DB queries, output transforms, etc.), changing the class loader did not significantly affect the req/sec rate. It only made a difference under the profiler or in contrived examples - e.g. calling loadClass in a tight loop. I don't have the test bench to drive enough load onto the server - I was able to saturate the NIC before running out of CPU.
I'll play with it more - I currently have a test bench capable of generating that load (two machines with 10Gb NICs, fast CPUs, etc.) - I'll try it in an A/B scenario and see if there is an improvement.
The app has plenty of other bottlenecks, so this one might be small in comparison - but JMC seems to indicate that contention at the class loader could be one of them (it's not clear whether the times JMC reports as blocked waiting are 'real times' or 'profiler times' - I'm still getting familiar with the tool).
I have one tweak for the Guava-based one: there is a "concurrencyLevel" parameter you can set on the CacheBuilder - I have a fork where I set it to 16 to see if it improves things in the benchmark.
What are you passing to JMH in your tests? (I'm a JMH noob - I'm guessing "java -jar target/microbenchmarks.jar ".*" -r 10 -t 8 -tc" might do the trick?)