Unicode in Action
December 12, 2004 – 8:54 am
From Pat Smith and Jeffrey Jia, who’ve been working on the Psiphon project:
One of the more challenging parts of getting Psiphon up and running
was the internationalization and localization of the interface. The
problem was Unicode. As we discovered,
Unicode (or rather, pretending it does not exist) is a big problem in
software. In order to ensure that Psiphon would work with “all”
languages, we needed to get a language such as Chinese or Persian
working. Chinese and Persian text is commonly stored in the UTF-8
encoding, and neither language can be written in the ASCII character
set we are used to dealing with. Chinese and Persian were two ideal
languages to try because Chinese characters require more bytes each
to encode, and Persian is also a “right-to-left” language.
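To make the byte-count point concrete, here is a minimal sketch that measures how many bytes UTF-8 needs for one character from each script. The sample characters and the helper name utf8_length are chosen for illustration only and are not taken from Psiphon's translation files.

```python
# Illustrative sample characters (assumptions, not Psiphon data).
samples = [
    ("ASCII letter", u"a"),
    ("Persian letter", u"\u0633"),     # the letter "seen"
    ("Chinese character", u"\u4e2d"),  # the character "zhong"
]


def utf8_length(ch):
    """Return the number of bytes UTF-8 uses to encode the given text."""
    return len(ch.encode("utf-8"))


for name, ch in samples:
    print(name, utf8_length(ch))
```

An interface that assumes one byte per character will mishandle both scripts, which is why the two languages made good test cases.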
The first big problem was that, of the development tools we were
using to build Psiphon, not one was able to properly handle
Unicode characters. During initial tests on the Persian translation
file we were unable to tell if the problem was with our data file, or
with wxPython and PythonCard. After four to five hours of working
alongside Michelle Levesque, comparing how our different tools
handled the Unicode, we determined that it was our development tools
that were causing the problem.
Of Komodo, Cygwin, Windows Command Prompt, and IDLE, it appeared
(at first) that IDLE was the best at handling Unicode. Komodo could
properly display the data file when it was opened, but its Python
shell interpreter would print gibberish when it attempted to handle
the data. We determined it was a codec problem with Komodo’s Python
shell by observing patterns in the gibberish that was returned and
using Google to see if the results were a common occurrence when
dealing with Unicode. We had moderate success with IDLE; however,
after further tests with more Unicode, it seemed that IDLE also had
problems dealing with larger amounts of Unicode. We were not able to find any
tool that was capable of handling all Unicode correctly, so we decided
to continue working on the internationalized interface without a
satisfactory tool.
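The kind of patterned gibberish that points to a codec problem can be reproduced with a short sketch. It assumes the tool decodes UTF-8 bytes as Latin-1. That is only one common mismatch, and which codec Komodo's shell actually used is not known here.

```python
# A Persian word stored as UTF-8 (sample text chosen for illustration).
original = u"\u0633\u0644\u0627\u0645"
raw = original.encode("utf-8")

# Decoding with the wrong codec (Latin-1 is an assumption) turns every
# byte into its own character, giving a recognizable repeating pattern.
garbled = raw.decode("latin-1")

# Because Latin-1 maps each byte to exactly one character, the damage
# can be reversed by undoing the wrong step and decoding correctly.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(original), len(garbled))
print(repaired == original)
```

The garbled string has twice as many characters as the original, and the same few lead characters keep recurring, which is the sort of pattern that is easy to search for.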
We soon realized that the Unicode was not being displayed properly
because a non-Unicode build of wxPython was installed. After
uninstalling it and installing the correct Unicode build of wxPython,
the problem was still not fixed. wxPython includes a demo application
with a Unicode example; with the Unicode build installed, this demo
should have run without a problem, but it did not. After
many uninstalls/reinstalls, we discovered that an old
partial install was still on the path, left over from an unsuccessful
Unix install under Cygwin. Removing that allowed the wxPython demo to run
smoothly, but Psiphon’s Persian translation was still causing the
program to crash.
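A stale copy like this can be hunted down by listing every location where the package is visible. The sketch below assumes that "the path" means Python's module search path (sys.path) and that the package directory is named wx. The helper find_package_copies is hypothetical and not part of wxPython.

```python
import os
import sys


def find_package_copies(package_name, search_paths=None):
    """Return each directory on the search path that contains package_name."""
    if search_paths is None:
        search_paths = sys.path
    hits = []
    for entry in search_paths:
        # An empty sys.path entry stands for the current directory.
        candidate = os.path.join(entry or os.curdir, package_name)
        if os.path.isdir(candidate) and candidate not in hits:
            hits.append(candidate)
    return hits


copies = find_package_copies("wx")
for path in copies:
    print(path)
if len(copies) > 1:
    print("Warning: more than one copy of the package is visible")
```

Python imports the first match it finds, so an old partial install earlier on the path can shadow a correct one installed later.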
Reading the error trace, we were able to determine that the problem
was now within PythonCard’s handling of Unicode. The problem was
traced to PythonCard’s multicolumnlist module. A multi-column list is
used in Psiphon to track the connected users, and the headers of each
column require internationalization. By reading through the
multicolumnlist source code and the Python 2.3 documentation
for the types module, we found that PythonCard was using
StringType, which according to the Python 2.3 documentation does not
include Unicode strings. The PythonCard source would produce an error
if Unicode was given to a multi-column list. By replacing all
occurrences of StringType with StringTypes
(which includes Unicode strings), we were able to avoid the
multi-column list error. However, this fix requires users to change
their own copy of multicolumnlist.py if they choose to run Psiphon
from the source. We reported the bug to the PythonCard author, Kevin
Altis, and he posted on the PythonCard mailing list that the bug would
be fixed in the next release.
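The difference between the two names can be shown with a small sketch. This is not PythonCard's code: the two check functions are hypothetical. On Python 3 the types module no longer has these names, so the fallback stand-ins below are an assumption that preserves the same relationship.

```python
try:
    from types import StringType, StringTypes  # Python 2 only
except ImportError:
    # Assumption: stand-ins for Python 3, where both names were removed.
    # StringType covered byte strings only; StringTypes also covered
    # Unicode strings.
    StringType = bytes
    StringTypes = (bytes, str)


def is_text_strict(value):
    """Hypothetical check in the style of the original bug."""
    return isinstance(value, StringType)


def is_text_inclusive(value):
    """Hypothetical check in the style of the fix."""
    return isinstance(value, StringTypes)


header = u"\u06a9\u0627\u0631\u0628\u0631"  # a Persian column header
print(is_text_strict(header))     # the Unicode header is rejected
print(is_text_inclusive(header))  # the Unicode header is accepted
```

Any code path that raises an error when the strict check fails will break as soon as a translated header is passed in, which matches the crash described above.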
Despite all this, we were still getting errors when Psiphon tried
to display Unicode. Since we were sure (well, pretty sure) that the
Unicode wxPython wasn’t the problem, we determined that the PythonCard
resource file was the problem. This led to many hours of considering
the possibility of redoing the interface entirely in wxPython, which
would have meant many more hours of work to migrate the existing GUI
away from PythonCard and into plain wxPython.
Back to Kevin Altis, who informed us of a bug: Unicode was handled
in resource files in the same way it was handled in the multi-column
list component. Unfortunately, the suggested fix only worked if the
resource file itself was written in Unicode, and not when a resource
file was created dynamically while the program was running. Robin Dunn of
wxPython gave much appreciated help in examining the resource file
generating code, and with his help we were able to get Psiphon working
with Unicode languages. However, we were unable to get
“right-to-left” display working for Persian. Since we needed Psiphon to be
multi-platform, we could not simply use the GTK wxPython build. GTK
itself is able to handle “right-to-left” easily, but wxPython does not
have its own related options.
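Deciding whether a given translation needs right-to-left treatment at all is a separate question from rendering it, and that part can be sketched with the standard library. The helper is_right_to_left is hypothetical. It uses a simplified first-strong-character rule and does not perform any layout.

```python
import unicodedata

# Bidirectional classes that mark strongly right-to-left characters.
RTL_CLASSES = ("R", "AL")


def is_right_to_left(text):
    """Guess direction from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in RTL_CLASSES:
            return True
        if bidi == "L":
            return False
    # No strong character found (for example digits only): assume LTR.
    return False


print(is_right_to_left(u"\u0633\u0644\u0627\u0645"))  # Persian sample
print(is_right_to_left(u"\u4e2d\u6587"))              # Chinese sample
```

A check like this only flags which strings need mirrored layout; the toolkit still has to provide the actual right-to-left rendering.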
The Unicode problem took a lot of hours to fix, but it forced us to
understand Unicode, and why we should never write a program which does
not handle Unicode properly. We also gained experience in sorting out
problems in open source software. Tracking down a new bug in an
existing tool gave more insight into how the tool itself works. It
also taught us how to better debug code, and to realize that sometimes
the bug may not be in the code you are writing, but rather in the code
from a tool that you are using. Overall, the time invested in fixing
the Unicode problem was worth it for the knowledge that was gained.