Character encoding in Subversion

A character encoding defines how the characters of a character set are represented as numeric codes in memory. Each character is assigned a unique pattern of numeric codes. If encodings are not handled consistently, Subversion operations can fail or produce garbled names.

A locale is a string that names the encoding, the language, and possibly the country. For example, a Korean locale setting can be ko_KR.UTF-8.
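As an illustration of how a locale string is put together, the following minimal Python sketch splits a POSIX-style locale name into its parts. The names parse_locale and LocaleParts are hypothetical helpers made up for this example. They are not part of Subversion or CollabNet, and real systems accept more variants than this handles.

```python
from typing import NamedTuple, Optional


class LocaleParts(NamedTuple):
    language: str
    territory: Optional[str]
    encoding: Optional[str]


def parse_locale(name: str) -> LocaleParts:
    """Split a POSIX-style locale name such as 'ko_KR.UTF-8' into its parts."""
    # Drop an optional '@modifier' suffix (for example '@euro').
    base = name.split("@", 1)[0]
    # The encoding, if present, follows the first dot.
    lang_territory, _, encoding = base.partition(".")
    # The territory (country), if present, follows the underscore.
    language, _, territory = lang_territory.partition("_")
    return LocaleParts(language, territory or None, encoding or None)


print(parse_locale("ko_KR.UTF-8"))
```

The encoding part (here UTF-8) is the piece that matters for the file name issues described in the rest of this page.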

For full details of how to choose and set a locale, consult your system's documentation.

Some of the character encodings used in settings are:

If you name a file using Korean characters and your system uses the EUC-KR character encoding, the version control system stores the name in the repository. When another user whose system also uses EUC-KR checks the file out, the filename reads the same.

If the person who checks out the file has the EUC-JP encoding set on their system, and the project uses Subversion, Subversion reports an error and refuses to do the check-out. This is because the Korean characters used in the filename are simply not part of the EUC-JP encoding, which covers Japanese characters but not Korean ones.
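The mismatch can be reproduced outside Subversion. The sketch below uses Python's built-in codecs to check whether a name can be represented in a given encoding. The helper can_encode and the sample file name are assumptions made for this example, and the check only approximates what Subversion does internally.

```python
def can_encode(name: str, encoding: str) -> bool:
    """Return True if every character of the name exists in the encoding."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


korean_name = "한글.txt"  # hypothetical file name containing Korean characters

print(can_encode(korean_name, "euc_kr"))  # Korean encoding
print(can_encode(korean_name, "euc_jp"))  # Japanese encoding
print(can_encode(korean_name, "utf-8"))   # Unicode
```

The Korean name can be encoded in EUC-KR and in UTF-8, but not in EUC-JP, which is the situation in which the check-out described above fails.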

CollabNet uses the Unicode encoding known as UTF-8 throughout the user interface. Unicode allows multiple languages to co-exist in the same file, filename, string, and so on. Subversion can transcode between UTF-8 and other encodings, for example from EUC-KR to UTF-8 and back. However, if you have multiple users and machines located in different places, you are very likely to have problems, because users will set their locales to different values, which typically means different encodings. If filenames appear in hyperlinks, the result is garbled link text and links that do not work.

Note: If you are a single user using a single machine, then you will have no encoding issues.

If the project uses Subversion, file names are transcoded from the original encoding into UTF-8 by Subversion during check-in, and they are then presented as UTF-8 when CollabNet publishes the web site. Similarly, URLs inside the web pages are presented in UTF-8 by the browser. This could work, except for a developer who uses an encoding other than UTF-8. At that developer's desk, the file names have some non-UTF-8 encoding. If the developer enters URLs in the web pages that match the files on the local disk, nothing on that system will transcode them into UTF-8 (the browser assumes they already are UTF-8 and does not transcode). The result is that either this developer cannot use the pages or, if the developer encodes the URLs so that they work locally, no one else can use them.

If files and directories with non-ASCII names are part of the "remote published" document tree of a CollabNet project, that is, if they are checked in anywhere below the www/ directory, then CollabNet will publish them as UTF-8.

If a developer working in Japan names a file using Japanese characters, Subversion stores the name in the repository encoded in UTF-8. When another developer in the US later checks out the same file, Subversion transcodes the name from UTF-8 into that developer's encoding, provided that encoding can represent it, and the name is displayed in Japanese characters again.

Note: Transcoding will not make the name readable to the American developer, because transcoding only converts a name from one encoding to another without changing its language: Japanese > UTF-8 > Japanese. UTF-8 does not translate. The checked-out filename will therefore contain the same Japanese name that was originally checked in.
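The round trip can be sketched with Python's built-in codecs. The function transcode and the sample name are hypothetical, and the sketch only mimics the conversion of file names, not Subversion's actual implementation.

```python
def transcode(raw: bytes, source: str, target: str) -> bytes:
    """Re-express the same characters in a different encoding."""
    return raw.decode(source).encode(target)


japanese_name = "日本語.txt"  # hypothetical file name

# Bytes of the name on the Japanese developer's disk (EUC-JP locale).
local_bytes = japanese_name.encode("euc_jp")

# Bytes of the name as stored in the repository (UTF-8).
stored_bytes = transcode(local_bytes, "euc_jp", "utf-8")

# Bytes after another user with an EUC-JP locale checks the file out.
checked_out = transcode(stored_bytes, "utf-8", "euc_jp")

print(local_bytes != stored_bytes)  # the byte patterns differ
print(checked_out == local_bytes)   # the round trip is lossless
print(stored_bytes.decode("utf-8") == japanese_name)  # still the Japanese name
```

Only the byte pattern changes at each step. The characters, and therefore the language of the name, stay the same.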

Solutions and best practices

In general, you first create file and directory names on your own computer. When you commit the files using the "commit" process of the Version Control tools, or an "upload" process of some web page, most services will detect the encoding you have set, interpret the file and directory names by that encoding, and either transcode them immediately into UTF-8 or ensure that this eventually happens when the names are displayed in the other ways discussed below.

Once a file is stored within CollabNet, a second place where its name matters is in the various browsing features of CollabNet: the Documents and Files area, and similar places. All of these areas display their names using UTF-8. In order to support legacy data from before CollabNet Enterprise Edition 3.0.0, these areas will attempt to convert names from one of the traditional encodings into UTF-8 if they can determine that this is needed. Unfortunately, there is no completely certain way to make that determination. If your site has been migrated from an earlier version of CEE, where some traditional encoding was the default, then there is a very good chance that this determination will work out right, and the displayed names will look right. But the most certain way to get it right is, of course, not to leave the system to guess at all, but simply to use UTF-8 from the beginning.
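To see why such a determination can only be a guess, consider the following minimal sketch. It is not CollabNet's actual detection logic. The function decode_name and the choice of EUC-KR as the legacy encoding are assumptions made for this example.

```python
def decode_name(raw: bytes, legacy_encoding: str = "euc_kr") -> str:
    """Guess the encoding of a stored name and return it as text."""
    try:
        # Strict decoding: most legacy multi-byte names are not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Fall back to the site's assumed legacy encoding.
        return raw.decode(legacy_encoding, errors="replace")


name = "한글.txt"  # hypothetical file name
print(decode_name(name.encode("utf-8")) == name)   # UTF-8 data is recognized
print(decode_name(name.encode("euc_kr")) == name)  # legacy data falls back

# The guess can fail: these two bytes are a valid EUC-KR character and
# also happen to be valid UTF-8 for a different character.
ambiguous = b"\xc7\xa1"
print(decode_name(ambiguous) == ambiguous.decode("euc_kr"))
```

The last case is misread as UTF-8, which is why names stored as UTF-8 from the beginning are the only ones that need no guessing.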

A third place where names matter is when you check files out of Version Control onto your computer. As above, if Subversion is the Version Control tool in use, it will detect the encoding setting for you and transcode the names into that encoding, so that they look right. This means that project members using Subversion can actually cooperate successfully even if they have different encodings set (but see the next paragraph for a critical exception to this freedom).

Finally, many of the files you commit to your project may actually be web pages, and these web pages will have URLs in them that refer to other web pages, that is, other files that you also commit into your project. This is the fourth area of interface. It is important to understand, at this point, that no Version Control system will transcode the characters within your files. Subversion fixes the encoding of the names of the files, but it does not change the encoding of the contents of the files. This means that the collection of bits that makes up the URLs inside your files has to match the collection of bits that makes up the file names on disk, in the area from which CollabNet serves up the pages.

As mentioned above, CollabNet serves up pages in UTF-8, and so the URLs that refer to your web pages must be in UTF-8 as well. But when you build a web site, you ordinarily test it locally, on your own machine, before you publish it. That means that the collection of bits that makes up the URLs in your pages must also match the collection of bits that makes up the file names on your own computer: you must be configured to use UTF-8. Similar things can happen with other kinds of files. For example, most programming languages have some sort of mechanism for logically "including" one file into another; these file references must also match bit-for-bit, and so the encoding inside the files must match the encoding of the file names. Even with Subversion, all project members who work with such files must use the same encoding setting, and it must be UTF-8.

Now, what if you have some need to work in a traditional encoding, instead of UTF-8? Most likely, this would be because you already have files named in the traditional encoding.
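The bit-for-bit mismatch in URLs can be made visible by percent-encoding the same file name under two encodings. This is a minimal sketch using Python's standard urllib.parse module. The file name is a hypothetical example, and the sketch says nothing about how a particular browser or server behaves.

```python
from urllib.parse import quote, unquote

file_name = "한글.html"  # hypothetical page name

# URL path as written by a developer whose files are named in UTF-8.
url_utf8 = quote(file_name, encoding="utf-8")
# URL path as written by a developer whose files are named in EUC-KR.
url_euckr = quote(file_name, encoding="euc_kr")

print(url_utf8)   # %ED%95%9C%EA%B8%80.html
print(url_euckr)  # %C7%D1%B1%DB.html

# A reader that assumes UTF-8 recovers the name only from the first form.
print(unquote(url_utf8, encoding="utf-8") == file_name)
print(unquote(url_euckr, encoding="utf-8") == file_name)
```

The two URLs name the same characters but consist of different bytes, so a link written in one form does not match a file stored under the other.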

The easiest way to deal with this possibility is to stick to the 95 printable characters of the basic ASCII alphabet (English letters, digits, the space, and a few punctuation marks). All of the encodings discussed here use the same collection of bits to represent all of these characters, and so it does not matter which encoding you have set, so long as you stick to these. Of course, this is severely limiting: you can only represent English in this way. Still, until the advent of Unicode, this was the only way to achieve multi-lingual systems.
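A quick way to check whether a name is safe in this sense is sketched below. The helper names and the list of encodings are assumptions made for this example. The list contains only the encodings named on this page.

```python
ENCODINGS = ("utf-8", "euc_kr", "euc_jp")  # encodings named on this page


def is_ascii_name(name: str) -> bool:
    """True if the name uses only printable ASCII characters."""
    return name.isascii() and name.isprintable()


def same_bytes_everywhere(name: str) -> bool:
    """True if the name encodes to identical bytes in every listed encoding."""
    encoded = set()
    for enc in ENCODINGS:
        try:
            encoded.add(name.encode(enc))
        except UnicodeEncodeError:
            # The name cannot even be represented in this encoding.
            return False
    return len(encoded) == 1


for candidate in ("index.html", "한글.html"):
    print(is_ascii_name(candidate), same_bytes_everywhere(candidate))
```

An ASCII-only name such as index.html produces the same bytes under every encoding in the list, while the Korean name does not.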

If that is too restrictive (and it probably is), and yet use of UTF-8 is still not practical for you, you can achieve much of what you need by very carefully ensuring that all your users have the same encoding set. This may not be as difficult as it sounds: when you buy a computer configured for a particular country or language, that configuration includes some encoding suitable for that language, and probably one of the ones discussed here. If all your users use the same operating system, that may be enough.

If you mix operating systems (or even operating system versions), you may find some users get or produce garbled names and text; these users probably need to change their settings to use the project's encoding. The most important thing that cannot be solved this way is the problem of URLs, mentioned above. This is why it is standard practice, throughout the web, to use ASCII names for the files and directories of a web site, even when the site pages are all in some language that cannot be represented in ASCII. With ASCII file and directory names, and one of the traditional encodings discussed here, you can still provide ASCII URLs and local-language pages.
