Unicode is tricky in Java and might be impossible in C++

Here’s a challenge for you. Ready?

In Java and C++ on OS X, output to the console the following string:

I have €100 to my name.

You’ll be surprised how hard this is.

I’ll start with a control: something that works correctly. I created a text file in TextMate, encoded as UTF-8:

blake-ramsdells-macbook-pro:~/Source/test/JavaUnicode blake$ cat ~/Documents/Unicode.txt
I have €100 to my name.
blake-ramsdells-macbook-pro:~/Source/test/JavaUnicode blake$ od -c ~/Documents/Unicode.txt
0000000    I       h   a   v   e     342 202 254   1   0   0       t   o
0000020        m   y       n   a   m   e   .  \n
0000032

Note the perfect output when I run “cat” on it. The octal string 342 202 254 in the file is the UTF-8 encoding of the Euro character which is \u20ac. Rock.
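For anyone who wants to double-check those three bytes without od, here is a minimal Java sketch. The class and method names (EuroBytes, utf8Octal) are made up for illustration; the only library call doing real work is String.getBytes with an explicit UTF-8 charset.

```java
import java.nio.charset.StandardCharsets;

class EuroBytes
{
    // Returns the UTF-8 bytes of the text as space-separated octal values,
    // the same notation od -c uses for bytes it cannot print.
    static String utf8Octal(String text)
    {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8))
        {
            if (sb.length() > 0)
            {
                sb.append(' ');
            }
            // Mask to 0..255 so bytes above 0x7F are not sign-extended.
            sb.append(Integer.toOctalString(b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args)
    {
        System.out.println(utf8Octal("\u20ac")); // prints: 342 202 254
    }
}
```

Because the charset is passed explicitly, this does not depend on any default encoding setting.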

So fine, I know Java pretty well, I’ll take a whack at it. This is the source code:

class JavaUnicode
{
    public static void main(String[] args)
    {
        System.out.println("I have \u20ac100 to my name.");
    }
}

Easy enough — let’s see what it outputs:

blake-ramsdells-macbook-pro:~/Source/test/JavaUnicode blake$ java JavaUnicode
I have ?100 to my name.

Neat. Not exactly a Euro character. I’m a pretty smart Java guy, and I know that the general MO of Java is to take Unicode characters and re-encode them as required. So my guess is that there’s some Java concept of the charset of the console, and that’s not set correctly. It’s pretty clear that the OS X terminal understands UTF-8, so we just need to tell someone that the console is UTF-8.
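The question mark fits that guess. When Java encodes a string into a charset that has no mapping for a character, String.getBytes substitutes the charset's replacement byte, which is '?' for the common single-byte charsets. A minimal sketch, using US-ASCII as a stand-in for "a charset with no euro sign"; whether the MacRoman table in this particular JVM lacks the euro is an assumption here, not something checked. The names UnmappableDemo and roundTripAscii are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class UnmappableDemo
{
    // Encodes the text as US-ASCII (stand-in for any charset without a euro
    // sign) and decodes it again, showing what a console would be handed.
    static String roundTripAscii(String text)
    {
        byte[] encoded = text.getBytes(StandardCharsets.US_ASCII);
        return new String(encoded, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args)
    {
        // prints: I have ?100 to my name.
        System.out.println(roundTripAscii("I have \u20ac100 to my name."));
    }
}
```

No exception is raised along the way; the substitution is silent, which is why the only symptom is the stray question mark.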

I Googled for a bit, and found that file.encoding will alter the charset of the console output. I modified my test program to add the line:

System.out.println("file.encoding=" +
    System.getProperty("file.encoding"));

That showed the default encoding was “MacRoman”. Nice try. Let’s override it manually on the command line:

blake-ramsdells-macbook-pro:~/Source/test/JavaUnicode blake$ java -Dfile.encoding=utf-8 JavaUnicode
file.encoding=utf-8
I have €100 to my name.

Ahhh. Very nice. All fixed. Now the question is “where the hell does Java use file.encoding besides here, and does that concern me?” It looks like the places where file.encoding is used in Java are the places where you can omit a charset name for encoding / decoding. Which I’m not sure you should ever do. So I think this issue is fixed.
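One way to sidestep file.encoding entirely is to never omit the charset. A minimal sketch, assuming the terminal really does expect UTF-8: wrap System.out in a PrintStream built with an explicit encoding. The class and method names (ExplicitCharset, utf8) are invented for illustration.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class ExplicitCharset
{
    // Wraps any OutputStream in an auto-flushing PrintStream that always
    // encodes characters as UTF-8, whatever file.encoding happens to be.
    static PrintStream utf8(OutputStream out)
    {
        try
        {
            return new PrintStream(out, true, "UTF-8");
        }
        catch (UnsupportedEncodingException e)
        {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args)
    {
        PrintStream console = utf8(System.out);
        console.println("I have \u20ac100 to my name.");
    }
}
```

The same idea applies to other calls that fall back on the default charset when none is given, such as String.getBytes() and new InputStreamReader(in): passing the charset explicitly takes file.encoding out of the picture.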

DANGEROUS OPINION #1: Java on OS X is stupid and uses MacRoman as the file.encoding. I have no idea what kind of prior practice they are trying to be compatible with, but I don’t have any text files or consoles that use MacRoman as the charset on OS X. It’s all UTF-8. Let’s get into the 90’s shall we and make the default file.encoding UTF-8?

Moving on to C++ now.

#include <iostream>

int main(void)
{
    std::wcout << L"I have \u20ac100 to my name.\n";
    return 0;
}

I’ll skip the voyage of discovery to determine how to make wide strings and the difference between cout and wcout. Let’s see what this outputs:

blake-ramsdells-macbook-pro:~/Source/test/CPPUnicode blake$ ./CPPUnicode
I have blake-ramsdells-macbook-pro:~/Source/test/CPPUnicode blake$

Say what? The output just dropped dead right at the Euro character. No question mark, no nothing. Hm. Strange. Did it really just terminate the program? I tried changing it to the following:

std::wcout << L"I have \u20ac100 to my name.\n";
std::wcout << L"I lived!\n";

Run it again. Same output. You’ve got to be shitting me — the application *terminated* because of a character that wouldn’t encode? No way!

OK, let me try adding a try block around it to see if it threw an exception. So now I’ve got:

try
{
    std::wcout << L"I have \u20ac100 to my name.\n";
}
catch (...)
{
}
std::wcout << L"I lived!\n";

Nope. Same output. So you mean to tell me that because of a character that wouldn’t encode, my application terminated without raising a signal or throwing an exception? Outstanding!

DANGEROUS OPINION #2: C++ terminating your application because a character destined for stdout couldn’t be encoded is the most pedantic, worthless behavior I’ve ever seen.

OK, so let’s quit whining and get it to work. I’m presuming it’s the same problem as with Java: the default encoding is something other than UTF-8, and it needs to be reset to UTF-8.

I tried doing the following:

blake-ramsdells-macbook-pro:~/Source/test/CPPUnicode blake$ LANG=en_US.UTF-8 ./CPPUnicode

Nope.

I tried adding the following at the start of the code:

std::locale loc("en_US.UTF-8");
std::locale::global(loc);

That got me a freakout that at least made me feel better that exceptions were turned on:

blake-ramsdells-macbook-pro:~/Source/test/CPPUnicode blake$ ./CPPUnicode
terminate called after throwing an instance of 'std::runtime_error'
what():  locale::facet::_S_create_c_locale name not valid
Abort trap

So huh? Why don’t I have en_US.UTF-8?

I tried checking my locale:

blake-ramsdells-macbook-pro:~/Source/test/CPPUnicode blake$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

Man, that’s just crap. But OK, maybe I’m on the right track with this locale thing.

I dig a little more and I find /usr/share/locale and note that it has the following:

blake-ramsdells-macbook-pro:~/Source/test/CPPUnicode blake$ ls -alR /usr/share/locale/en_US.UTF-8
total 40
drwxr-xr-x     8 root  wheel   272 Jan 13 14:06 .
drwxr-xr-x   236 root  wheel  8024 Feb  3 20:10 ..
lrwxr-xr-x     1 root  wheel    28 Feb  3 20:09 LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x     1 root  wheel    17 Feb  3 20:09 LC_CTYPE -> ../UTF-8/LC_CTYPE
drwxr-xr-x     3 root  wheel   102 Jan 13 14:06 LC_MESSAGES
lrwxr-xr-x     1 root  wheel    30 Feb  3 20:09 LC_MONETARY -> ../en_US.ISO8859-1/LC_MONETARY
lrwxr-xr-x     1 root  wheel    29 Feb  3 20:09 LC_NUMERIC -> ../en_US.ISO8859-1/LC_NUMERIC
lrwxr-xr-x     1 root  wheel    26 Feb  3 20:09 LC_TIME -> ../en_US.ISO8859-1/LC_TIME

/usr/share/locale/en_US.UTF-8/LC_MESSAGES:
total 8
drwxr-xr-x   3 root  wheel  102 Jan 13 14:06 .
drwxr-xr-x   8 root  wheel  272 Jan 13 14:06 ..
lrwxr-xr-x   1 root  wheel   45 Feb  3 20:09 LC_MESSAGES -> ../../en_US.ISO8859-1/LC_MESSAGES/LC_MESSAGES

So I think that en_US.UTF-8 is a splendid thing. But how the hell do I tell C++ about it? That std::locale fiasco didn’t encourage me.

I found a blog entry where some guy got “less” to work with UTF-8, so I figured I’d look at that some more. He just set LC_CTYPE to en_US.UTF-8. I tried this with my program, and it didn’t work, but with “less” it worked great, so I at least found one useful tip in this whole process.

But I’m still miffed at why mine won’t work. I try something more 70’s — good ol’ printf:

printf("%ls", L"From printf, I have \u20ac100 to my name.\n");

Yeah, right. With LC_CTYPE set to en_US.UTF-8 this didn’t output anything. I even tried:

printf("From wprintf, I have %lc100 to my name.\n", L'\u20ac');

Nope. Not even if I set LANG also. I changed my code to:

printf("From wprintf, I have %lc100 to my name.\n", L'\u20ac');
printf("At least I lived.\n");

And got:

blake-ramsdells-macbook-pro:~/Source/test/CPPUnicode blake$ ./CPPUnicode
At least I lived.

So at least the process didn’t terminate.

So I’m a complete failure at console output with Unicode right now in C++. My locale info is set as follows:

blake-ramsdells-macbook-pro:~/Source/test/CPPUnicode blake$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"

With that locale, the program has the following behaviors:

  • Terminates the process if I execute
    std::wcout << L"I have \u20ac100 to my name.\n";
  • Outputs nothing (but at least continues to run) if I execute
    printf("%ls", L"From printf, I have \u20ac100 to my name.\n");

DANGEROUS OPINION #3: Man, the state of the art for Unicode in g++ under OS X sucks pretty hard.
