Sunday, July 12, 2009

Charset, continued

I wrote about a concurrency issue in an earlier post, and while everything in that post is true, there's more to that story.

While this form

new String(byte[] data, String charsetName)

does perform better than the overload without the character set, the invocation is not concurrent in all circumstances. It turns out that resolving a character set name to a character set object is also a synchronized call (a JVM-wide lock). You can read all about it in Sun's humbly-named FastCharsetProvider class.

Basically, their implementation is fine as long as a particular thread never deals with more than 2 character sets (see the ThreadLocal cache). But note that you count character sets by their aliases, not by unique character set. For example, there are lots of ways to specify the US-ASCII character set, and if you use more than 2 of them, you defeat the cache. And aliases aren't the only problem: lots of textual data (e.g., HTML) uses the distinct encodings UTF-8, Cp1252 and ISO-8859-1 almost interchangeably.
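To get a feel for how many names one character set can go by, here is a minimal sketch that resolves a few spellings of US-ASCII and checks that they all land on the same Charset. The class name AliasDemo and the helper sameCharset are made up for illustration, and the exact alias list printed will vary by JDK.

```java
import java.nio.charset.Charset;

public class AliasDemo {
    // Hypothetical helper: true when both names resolve to the same Charset.
    static boolean sameCharset(String a, String b) {
        return Charset.forName(a).equals(Charset.forName(b));
    }

    public static void main(String[] args) {
        Charset ascii = Charset.forName("US-ASCII");
        // The alias set differs between JDKs, so no output is promised here.
        System.out.println(ascii.name() + " aliases: " + ascii.aliases());
        System.out.println(sameCharset("US-ASCII", "ISO646-US"));
        System.out.println(sameCharset("ISO646-US", "cp367"));
    }
}
```

Each of those names is a different string, which is what matters when a cache is keyed by name.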

In JDK 1.6, many of these charset methods now take a Charset object, instead of just the name, so that avoids this hurdle. But, some of us are still in JDK 1.5 land and may be for some time. In that case, there's a concurrent approach to dealing with multiple character sets.


Java NIO

Yes, in Matrix-like fashion, NIO saves the day. Now, the problem with Java NIO is that it's "Not I/O", meaning you're going to need to change your code to use these new APIs. And, even if you are able/willing to do that in your code, that says nothing about all of the 3rd party software your system likely uses.

But, let's assume we have control over our little universe... how is Java NIO more concurrent? Because it lets us do things using Charset objects (like JDK 1.6), without requiring us to use JDK 1.6.

For example, converting from a byte[] to a String using a character set in JDK 1.6 looks like this:

// Resolve the name once, then reuse the Charset object.
private static final Charset UTF8 = Charset.forName("UTF-8");

public String treatAsUTF8(byte[] data) {
    return new String(data, UTF8);
}

However, there is a way to get there with Java NIO alone:

public String treatAsUTF8(byte[] data) {
    return UTF8.decode(ByteBuffer.wrap(data)).toString();
}

What was that about? (and who let this guy chain that many invocations on one line?!)
This approach may look painful... creating a ByteBuffer out of the byte[], decoding it into a CharBuffer, and then toString()ing the CharBuffer to get our result. And, it is, in a lot of ways. But, the above invocation is fully concurrent! So, if you're dealing with multiple charsets and multiple threads, huge gains can be had.

But, the above code isn't very multi-charset friendly, so let's beef it up a bit:

private static final ConcurrentMap<String,Charset> charsetsByAlias =
    new ConcurrentHashMap<String,Charset>();

public static String decode(byte[] data, String charsetName) {
    Charset cs = lookup(charsetName);
    return cs.decode(ByteBuffer.wrap(data)).toString();
}

private static Charset lookup(String alias) {
    Charset cs = charsetsByAlias.get(alias);
    if (cs == null) {
        cs = Charset.forName(alias);
        charsetsByAlias.putIfAbsent(alias, cs);
    }
    return cs;
}

So, what does this decode() routine buy us over new String()? When I run tests with more than 2 character sets, and a dozen or more threads in a JDK 1.5 environment, the difference is about an order of magnitude. Yes, this isn't some 30% gain sort of thing, it's a total game-changer if your app is in this situation (say, a proxy server on the internet).
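If you want to try a comparison like this yourself, here is a rough harness. It is a sketch, not the test I ran: the class name DecodeBench, the three charset names, the tiny payload, and the thread and iteration counts are all arbitrary choices for illustration. Timings will depend heavily on your JDK version and hardware, so no numbers are promised.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class DecodeBench {
    // Three distinct names, one more than the 2 described above.
    static final String[] NAMES = { "UTF-8", "ISO-8859-1", "US-ASCII" };
    static final byte[] DATA = { 'h', 'e', 'l', 'l', 'o' };

    // Each worker decodes DATA `iterations` times, rotating through NAMES.
    // byName=true uses new String(byte[], String); false uses Charset.decode.
    // Returns the total number of chars decoded across all threads.
    static long decodeAll(int threads, final int iterations, final boolean byName)
            throws Exception {
        final Charset[] sets = new Charset[NAMES.length];
        for (int i = 0; i < NAMES.length; i++) {
            sets[i] = Charset.forName(NAMES[i]);
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> results = new ArrayList<Future<Integer>>();
        for (int t = 0; t < threads; t++) {
            results.add(pool.submit(new Callable<Integer>() {
                public Integer call() throws Exception {
                    int chars = 0;
                    for (int i = 0; i < iterations; i++) {
                        int k = i % NAMES.length;
                        String s = byName
                            ? new String(DATA, NAMES[k])
                            : sets[k].decode(ByteBuffer.wrap(DATA)).toString();
                        chars += s.length();
                    }
                    return chars;
                }
            }));
        }
        long total = 0;
        for (Future<Integer> f : results) {
            total += f.get(); // waits for the worker to finish
        }
        pool.shutdown();
        return total;
    }

    public static void main(String[] args) throws Exception {
        for (boolean byName : new boolean[] { true, false }) {
            long start = System.nanoTime();
            decodeAll(12, 200000, byName);
            long ms = (System.nanoTime() - start) / 1000000;
            System.out.println((byName ? "by name: " : "by Charset: ") + ms + " ms");
        }
    }
}
```

A single pass like this includes JIT warm-up, so run it a few times before reading much into the numbers.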

Parting comments

The above discussion and implementation are limited to the "new String" scenario, but that's just for brevity and illustration. The same problems can manifest when calling String.getBytes, creating an InputStreamReader or OutputStreamWriter, and in many other places, and the same approach works to resolve them.
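As an illustration of that last point, here is a minimal sketch of the encoding direction and of a reader built from a Charset object rather than a name. The class CharsetEncodeSketch and its encode and readAll helpers are hypothetical names for this example. Charset.encode(String) and the InputStreamReader(InputStream, Charset) constructor are standard API.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class CharsetEncodeSketch {
    static final Charset UTF8 = Charset.forName("UTF-8");

    // Stand-in for text.getBytes(charsetName): encode via the Charset object.
    static byte[] encode(String text, Charset cs) {
        ByteBuffer buf = cs.encode(text);
        byte[] out = new byte[buf.remaining()];
        buf.get(out); // copy only the encoded bytes, not the buffer's spare capacity
        return out;
    }

    // Stand-in for new InputStreamReader(in, charsetName).
    static String readAll(byte[] data, Charset cs) throws IOException {
        Reader reader = new InputStreamReader(new ByteArrayInputStream(data), cs);
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[256];
        int n;
        while ((n = reader.read(chunk)) != -1) {
            sb.append(chunk, 0, n);
        }
        reader.close();
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "h\u00e9llo"; // one non-ASCII char, two bytes in UTF-8
        byte[] bytes = encode(text, UTF8);
        System.out.println(bytes.length);
        System.out.println(readAll(bytes, UTF8).equals(text));
    }
}
```

In real code the Charset would come from a cache like the one above instead of a single constant.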

If you're wondering why the cached implementation does not pre-load the map with all charsets and their aliases (see Charset.availableCharsets()), it's because in my testing, it just doesn't matter. That is, the gain is in the subsequent requests, not the initial one. Furthermore, this lazy approach limits the cache to the aliases your app actually encounters, not the huge pile that is found in most JDKs.