Whacky latin1 to UTF8 conversion in JDBC

JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters. This behaviour is different from what MySQL’s internal functions do.

Character encoding is a rabbit hole that I’ve been stuck in for the last week, so in the interest of not generating a hundred obvious answers I’ll demonstrate what’s happening with a couple of code examples.

MySQL:

$ echo 'SELECT CONVERT(UNHEX("81") using latin1);' | mysql --init-command='set names latin1' | tail -1 | hexdump -C
00000000  81 0a                                             |..|
00000002
$ echo 'SELECT CONVERT(UNHEX("81") using latin1);' | mysql --init-command='set names utf8' | tail -1 | hexdump -C
00000000  c2 81 0a                                          |...|
00000003

This is pretty obvious and works exactly as expected. 0x81 is an undefined latin1 codepoint. It is represented as U+0081 in UTF8, or c2 81 in hex “on disk”.
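
That mapping is easy to verify outside MySQL. Here is a minimal Groovy sketch of the same round trip using only the JDK’s string handling (nothing driver-specific is assumed):

byte[] encoded = "\u0081".getBytes("UTF-8")          // U+0081 encoded as UTF8...
assert encoded.encodeHex().toString() == "c281"      // ...really is the two bytes c2 81

assert new String("c281".decodeHex(), "UTF-8") == "\u0081"   // and c2 81 decodes straight back to U+0081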

Now the weirdness comes from JDBC. Take this Groovy example:

@GrabConfig(systemClassLoader=true)
@Grab(group='mysql', module='mysql-connector-java', version='5.1.6')
import groovy.sql.Sql
sql = Sql.newInstance( 'jdbc:mysql://localhost/test', 'root', '', 'com.mysql.jdbc.Driver' )
sql.eachRow( 'SELECT CONVERT(UNHEX("C281") using utf8) as a;' ) { println "$it.a --" }

The output of this query is two bytes, c2 81, as expected. It’s pretty easy to understand what’s happening here. The MySQL connection is defaulting to UTF8, and the unhexed column is also cast to UTF8 (without re-encoding, since the source is binary, so the data after CONVERT() is still c2 81).
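
As an aside, if relying on that default makes you nervous, the connection character set can be pinned explicitly. Here is a sketch of the same newInstance call using Connector/J’s useUnicode/characterEncoding URL properties (the examples below stick with the default, so this is just for illustration):

@GrabConfig(systemClassLoader=true)
@Grab(group='mysql', module='mysql-connector-java', version='5.1.6')
import groovy.sql.Sql
// Pin the connection character set instead of relying on the driver default.
sql = Sql.newInstance( 'jdbc:mysql://localhost/test?useUnicode=true&characterEncoding=UTF-8', 'root', '', 'com.mysql.jdbc.Driver' )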

Now consider this case. The connection is still in UTF8, as is the default with JDBC. We cast our 0x81 byte as latin1, so hopefully MySQL will convert it to c2 81 like it did in the bash example above.

@GrabConfig(systemClassLoader=true)
@Grab(group='mysql', module='mysql-connector-java', version='5.1.6')
import groovy.sql.Sql
sql = Sql.newInstance( 'jdbc:mysql://localhost/test', 'root', '', 'com.mysql.jdbc.Driver' )
sql.eachRow( 'SELECT CONVERT(UNHEX("81") using latin1) as a;' ) { println "$it.a --" }

Running this with groovy latin1_test.groovy | hexdump -C yields this:

00000000  ef bf bd 0a                                       |....|
00000004

ef bf bd is the utf8-encoded Unicode replacement character, the character used when a utf8 conversion has failed.
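
That substitution is easy to reproduce outside JDBC. A small Groovy sketch (not the driver’s actual code path, just the same JDK behaviour):

String s = new String("81".decodeHex(), "UTF-8")     // a lone 0x81 is not valid UTF8
assert s == "\ufffd"                                 // the default decoder quietly substitutes U+FFFD
assert s.getBytes("UTF-8").encodeHex().toString() == "efbfbd"   // which re-encodes as ef bf bd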

Answer

JDBC seems to insert a utf8 replacement character when asked to read from a latin1 column containing undefined latin1 codepage characters

Yes, this is the default behavior of CharsetDecoder instances: when the (byte) input is malformed or unmappable, they substitute the offending byte sequence with Unicode’s replacement character, U+FFFD.

Examples of methods that use this behavior include all Readers, as well as the String constructors that take a byte array as an argument. This is also the reason why you should never use String to store binary data!
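
A quick Groovy illustration of the Reader case, fed the same lone 0x81 byte as in the question:

def reader = new InputStreamReader(new ByteArrayInputStream("81".decodeHex()), "UTF-8")
assert reader.text == "\ufffd"   // the Reader substitutes U+FFFD just as quietly as the String constructor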

The only way to make that an error is to grab the raw byte input, create your own decoder and tell it to fail in that situation…
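
A sketch of what that looks like with the NIO charset API, assuming you can get at the raw bytes in the first place (for example via ResultSet.getBytes(), or by casting the column to binary in SQL so the driver never decodes it for you):

import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.Charset
import java.nio.charset.CodingErrorAction

byte[] raw = "81".decodeHex()                         // pretend these are the raw column bytes

def decoder = Charset.forName("UTF-8").newDecoder()
decoder.onMalformedInput(CodingErrorAction.REPORT)    // throw instead of substituting U+FFFD
decoder.onUnmappableCharacter(CodingErrorAction.REPORT)

try {
    println decoder.decode(ByteBuffer.wrap(raw)).toString()
} catch (CharacterCodingException e) {
    println "not valid for this charset: $e"          // a MalformedInputException ends up here
}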
