Sunday, July 12, 2009

Charset, continued

I wrote about a concurrency issue in an earlier post, and while everything in that post is true, there's more to that story.

While this form

new String(byte[] data, String charsetName)

does perform better than the overload without the character set, the above invocation is not concurrent in all circumstances. It turns out that resolving a character set name to a Charset object is also a synchronized call (a JVM-wide lock). You can read all about it in Sun's humbly-named FastCharsetProvider class.

Basically, their implementation is fine, as long as a particular thread never deals with more than 2 character sets (see the ThreadLocal cache). But it should be noted that you count character sets by their aliases, not by unique character sets. For example, there are lots of ways to name the US-ASCII character set, and if you use more than 2 of them, you defeat the cache. And it's more than aliases that are a problem: lots of textual data (e.g., HTML) uses the distinct encodings UTF-8, Cp1252, and ISO-8859-1 almost interchangeably.
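
To see the alias issue concretely, here's a tiny demonstration (it needs java.nio.charset.Charset; "ASCII" and "us" are both registered JDK aliases of US-ASCII):

// Three distinct alias strings, one underlying charset. Per the cache
// described above, each distinct alias counts separately.
Charset a = Charset.forName("US-ASCII");
Charset b = Charset.forName("ASCII");
Charset c = Charset.forName("us");
System.out.println(a == b && b == c);  // prints "true"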

In JDK 1.6, many of these charset methods take a Charset object instead of just the name, which avoids this hurdle. But some of us are still in JDK 1.5 land and may be for some time. In that case, there's a concurrent approach to dealing with multiple character sets.


Java NIO

Yes, in Matrix-like fashion, NIO saves the day. Now, the problem with Java NIO is that it's "Not I/O", meaning you're going to need to change your code to use these new APIs. And even if you're able and willing to do that in your own code, that says nothing about all of the 3rd-party software your system likely uses.

But, let's assume we have control over our little universe... how is Java NIO more concurrent? Because it lets us do things using Charset objects (like JDK 1.6), without requiring us to use JDK 1.6.

For example, converting from a byte[] to a String using a character set in JDK 1.6 looks like this:

private static final Charset UTF8 = Charset.forName("UTF-8");

public String treatAsUTF8(byte[] data) {
    return new String(data, UTF8);  // String(byte[], Charset) was added in JDK 1.6
}

However, there is a way to get there with Java NIO alone:

public String treatAsUTF8(byte[] data) {
    return UTF8.decode(ByteBuffer.wrap(data)).toString();
}

What was that about? (And who let this guy chain that many invocations on one line?!)
This approach may look painful: creating a ByteBuffer out of the byte[], decoding it into a CharBuffer, and then toString()ing the CharBuffer to get our result. And it is, in a lot of ways. But the above invocation is fully concurrent! So, if you're dealing with multiple charsets and multiple threads, huge gains can be had.
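
If it helps, here's that one-liner unrolled, step by step (a sketch for illustration, not the JDK's exact internals; the explicit decoder is configured to replace bad input, which is what Charset.decode() does). It needs java.nio.ByteBuffer, java.nio.CharBuffer, and java.nio.charset.*:

public String treatAsUTF8Unrolled(byte[] data) {
    ByteBuffer bytes = ByteBuffer.wrap(data);      // wraps the array, no copy
    CharsetDecoder decoder = UTF8.newDecoder()     // a fresh decoder per call -- no shared lock
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
    try {
        CharBuffer chars = decoder.decode(bytes);  // bytes -> chars
        return chars.toString();
    } catch (CharacterCodingException e) {
        throw new IllegalStateException(e);        // can't happen with REPLACE actions
    }
}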

But, the above code isn't very multi-charset friendly, so let's beef it up a bit:

private static final ConcurrentMap<String, Charset> charsetsByAlias =
    new ConcurrentHashMap<String, Charset>();

public static String decode(byte[] data, String charsetName) {
    Charset cs = lookup(charsetName);
    return cs.decode(ByteBuffer.wrap(data)).toString();
}

private static Charset lookup(String alias) {
    Charset cs = charsetsByAlias.get(alias);
    if (cs == null) {
        // Benign race: two threads may resolve the same alias concurrently,
        // but Charset.forName returns the same instance either way.
        cs = Charset.forName(alias);
        charsetsByAlias.putIfAbsent(alias, cs);
    }
    return cs;
}
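
Using it is then a one-liner; for example, with some hypothetical bytes from an HTTP response:

String html = decode(responseBytes, "ISO-8859-1");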

So, what does this decode() routine buy us over new String()? When I run tests with more than 2 character sets and a dozen or more threads in a JDK 1.5 environment, the difference is about an order of magnitude. This isn't some 30% gain sort of thing; it's a total game-changer if your app is in this situation (say, a proxy server on the internet).

Parting comments

The above discussion and implementation are limited to the "new String" scenario, but that's just for brevity and illustration. The same problem can manifest when calling String.getBytes, creating an InputStreamReader or OutputStreamWriter, and many others, and the same approach works to resolve them.

If you're wondering why the cached implementation doesn't pre-load the map with all charsets and their aliases (see Charset.availableCharsets()), it's because in my testing, it just doesn't matter. That is, the gain is on subsequent requests, not the initial one. Furthermore, the lazy approach keeps the cache limited to the aliases your app actually encounters, not the huge pile found in most JDKs.
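
For the curious, the eager alternative would look something like this (a sketch; it works, it just didn't measure any better in my testing):

// Pre-load every available charset under its canonical name and all aliases.
static {
    for (Charset cs : Charset.availableCharsets().values()) {
        charsetsByAlias.put(cs.name(), cs);
        for (String alias : cs.aliases()) {
            charsetsByAlias.put(alias, cs);
        }
    }
}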

Saturday, June 06, 2009

Default Charset Contention

It's no secret that if you want your multi-threaded application to run (and scale) well, the less shared data and resources the better.

Well, I ran smack into a shared piece of data in the JDK libraries that I would never have guessed was there. It turns out there is a significant difference between the following method implementations:

String convert(byte[] data) {
    return new String(data);
}

and
String convert(byte[] data) throws UnsupportedEncodingException {
    return new String(data, "UTF-8");  // this constructor throws a checked exception
}


The difference is that the second is more concurrent than the first one.

The reason is that in order for java.lang.String to get the default character set, it needs a lock on the Charset.class object. Since there's only one of those per JVM, no 2 threads can get the default character set concurrently, and therefore the first implementation can be a bottleneck.

To demonstrate just how bad the bottleneck can be, I wrote a tiny test program. It creates 50 threads and each thread creates 100,000 strings, and then dies.
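
The original source isn't included here, but the test was essentially this shape (a reconstruction for illustration; flip WITH_CHARSET to compare the two variants):

public class ContentionTest {
    private static final boolean WITH_CHARSET = true;
    private static final byte[] DATA = "hello, world".getBytes();

    public static void main(String[] args) throws Exception {
        Thread[] threads = new Thread[50];
        long start = System.currentTimeMillis();
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(new Runnable() {
                public void run() {
                    try {
                        for (int j = 0; j < 100000; j++) {
                            String s = WITH_CHARSET
                                ? new String(DATA, "UTF-8")
                                : new String(DATA);
                        }
                    } catch (java.io.UnsupportedEncodingException e) {
                        throw new RuntimeException(e);
                    }
                }
            });
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println((System.currentTimeMillis() - start) + " ms");
    }
}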

On my laptop, the results for this contrived scenario are dramatic.

Not specifying the character set, the program takes about 35 seconds to complete.
When the character set is specified, the program takes about 4 seconds to complete.

The moral of the story is that in concurrent environments, you can't make any assumptions about what data is being shared (and therefore contended for). Instead, you should rely on thread dumps, profilers, and source code analysis.

Saturday, May 16, 2009

Overbuilding

I ran into a situation where I was creating a testing framework abstraction layer for my coworkers to use. The primary goal was to end up with something that was as easy to use as possible, especially given how complex the code to be tested was. To set the stage, JUnit was in place, but we needed mock objects to help with our large, legacy codebase, since it not only lacked unit tests, but also lacked classes designed with unit testing in mind (read: lots of coupling with no means for injection). An uphill road if ever there was one.

Mockito seemed like a good fit, since most of our classes were coupled to a wide number of concrete classes and there was very little in the way of interfaces anywhere. So, Mockito's ability to mock concrete classes (via CGLIB trickery) really saved my unit-test-writing bacon. Requiring the team to introduce interfaces or injection everywhere would have made this effort DOA.

The trouble with mocking a complex class (many methods and many references) is that setting up all the stubs is not just tedious, it can be downright frightful. The astute reader has been chanting "time to refactor" for several paragraphs now, to which I would agree. But refactoring means development time, which means there are powers-that-be who must approve such an effort, and the battle to convince them of such benefits involves careful reasoning about long-term costs, technical debt, and things of this nature. So, let's invite Ward Cunningham to dinner and discuss ways to get that done some other time.

(Take note - I'm all about refactoring. So, if you're like me, you're already grabbing your monitor and screaming "refactor this crap!". But, this is a situation where refactoring is off the table, and if you've never encountered such a situation in commercial software development before, I have nothing but envy for you).

So, back to mocking a really complex class. The first/naive approach is to use the Mockito API to create our mock, and then use all of its slick JDK 1.5 methods to stub out the values we're going to need. The problem is that we certainly don't want every unit test writer to have to do all this "setup code". Remember, we've got tons of methods. That means there's probably redundancy (suppose one method returns a URL as a String and another returns it as a URL -- not only does the writer have to mock both, he has to know, and remember, to mock both). In short, if tests are painful to write, you can bet that tests won't get written (or worse, maintained).

The second approach is to use a Factory. This design pattern is probably in competition with Singleton for most used. I say that because sometimes it's OK, and sometimes it's just developers giving a knee-jerk response when the words "object creation" and "pattern" appear in the same sentence. The trouble with a Factory in this situation is that our unit test abstraction layer is meant to serve the needs of all unit test writers (present and future). With 50 methods and members on a class, there is an obscene number of object instance permutations to be handled. Factories aren't very good at obscene permutations, so a Factory won't cut down the unit test setup code very much.

This led me to my third approach -- the Builder pattern. Builders are a great fit here, since one of their strengths is providing the client code a great deal of control over object creation, without being involved in object construction. In other words, the ability to specify in great detail how an object gets built, while not seeing a single constructor, or any other part of that class' definition.

So, great, I'll have a Builder for every class that I need to mock. I'll let the unit test writer grab a builder, configure it, and then build their mock. Even better, they don't even need to know we're using Mockito. Shoot - they don't even know whether we're using "real" objects, spies, mocks, dynamic proxies, subclasses, etc. All they know is that they asked for an instance of type X that has the specifications they provided. I'll be an even nicer guy and ensure that the built objects have all kinds of very friendly default values, so most of the time, things "just work".
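
Here's the flavor of it, using a hypothetical Account class (the names, getters, and defaults are made up for illustration; the real builders wrapped our own classes):

import static org.mockito.Mockito.*;

public class AccountBuilder {
    private String name = "test-account";      // friendly default
    private String url = "http://localhost/";  // friendly default

    public AccountBuilder withName(String name) { this.name = name; return this; }
    public AccountBuilder withUrl(String url) { this.url = url; return this; }

    // The test writer never sees Mockito, just "build me an Account".
    public Account build() {
        Account account = mock(Account.class);
        when(account.getName()).thenReturn(name);
        when(account.getUrl()).thenReturn(url);
        return account;
    }
}

A test then only specifies what it cares about: new AccountBuilder().withName("bob").build().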

So, I was all happy and danced a jig. Clean, simple design - check. No coupling between unit tests and mocking library - check. Unit tests get "set up" with the absolute minimum involvement of the test writer (they only write as much as they deviate from the defaults) - check. Time to write it up and notify my coworkers. But wait, just to sort of "check myself" (I've got that tendency - it's some combination of personal paranoia and having written code for a while), I pull down a design pattern book off the shelf and look up the Builder pattern.

Oh, yuck.

What the heck is all this? Director, AbstractBuilder, ConcreteBuilder, the product, a Product interface, etc., etc. Part of me was saying, "dude, you screwed this up", but the rest was asking, "what is all this stuff?" Where is Keep It Simple Stupid when you need him?

My takeaway (and the impetus of this post), is that sometimes even the seminal design patterns need a little trimming. Here's my justification with respect to the Builder pattern:

  1. Product. Obviously, need to build something. So, I keep this one.
  2. ConcreteBuilder. Similarly, need something to build it with. It stays.
  3. AbstractBuilder. Don't need it. I've got a ConcreteBuilder for every class I need, and those ConcreteBuilders know about their references, so the Products that get built are composites and life is good. Also, the codebase I'm dealing with has almost no inheritance, so the Builders map one-for-one to the types of Products I need. (Yes, I know, refactor!)
  4. Marker interface (for Product). Not appropriate. The users (unit test writers) do want to be coupled directly to the class being built. That's the point of the test.
  5. Director. In some sense, the unit tests themselves are in this role. But the point is that there isn't a new class being introduced to support it.

The point is not that the Builder design pattern is all crazy-bloated and the GoF are nutjobs. No, the point is that when building software, layers of abstraction and complexity should always be taken into account. While it's great to follow the Builder pattern to the letter (because you can just say "Builder" and some other dev either knows what you mean or goes and looks it up), it may be a fancier tool than you need. And the worst thing you can do in software is exceed necessary complexity. I'll take maintaining software without a single pattern over software with excessive complexity any day.

I also have no truck with the notion that "if you don't have all those layers, then it won't be flexible enough down the road". My philosophy is that I'd rather have just enough layers of abstraction to get through today. If I knew what tomorrow held, I'd own more stocks. Taken from a different angle, and borrowing someone else's phrasing (I couldn't pin down the original source):

There's no problem in Computer Science that can't be solved by adding another layer of abstraction to it. Except for the problem of too many layers of abstraction.