Notes on Russian: overview


Over the years, I have found that there is a general interest in making one's computer speak Russian. This interest comes partly from non-native speakers who are interested in our culture and want to read Russian documents and exchange e-mail with Russians. It also comes from Russians whose fate has taken them abroad and who, sitting at a computer far away from Russia, would like to have Russian support.
This is one of the sections maintained by Dazhdbog's Grandchildren. Here I try to explain what is involved in this procedure, "russification", as we Russians call it. Depending on what you have on your desk, it might be very simple, or it might turn out to be a nightmare. In any case, my goal is to make your computer as Russian-friendly as possible: you need normal access to Russian networks, newsgroups, and documents.
Some information here is of a general nature. It is relevant to any language that uses 8-bit encoding schemes (see below): many other European languages do, although there are some striking differences because we use a different alphabet. Historically, a huge chunk of the Internet has turned out to be very much English-biased, although the situation is changing quite rapidly, mainly because of the Web. However, it is still very far from what one would like it to be. I personally believe that the Web and most of the software people use should be as language-symmetrical as possible. Many people all around the world have contributed to bringing national flavors to the Internet.
There is, however, one generic rule that does not depend on which language we are talking about: how a computer recognizes characters. Characters are coded according to some scheme, and each character corresponds to some number. One such scheme is called ASCII, or American Standard Code for Information Interchange. It has only 128 characters in it: some control characters, special characters like ".", ";" or "!", and letters. You need only 7 bits to code all of them: 2 to the 7th power gives you 128. If you want to code something else -- national characters, for example -- you have to tell your computer to operate in 8-bit mode. In this mode it is possible to compose a table of 256 characters. Usually, ASCII characters occupy positions 0 to 127, and all others are located from 128 to 255. The Russian version of the Cyrillic alphabet has only 33 letters, so even with both upper and lower case a 256-character grid (8 bits) is enough. If you wish to code some Eastern characters you obviously need more numbers; this translates to more bits per character. Japanese characters, for example, require 16 bits (or 2 bytes) to be coded.
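To make the 7-bit versus 8-bit point concrete, here is a minimal Python sketch. The sample strings are my own, and KOI8-R is used only as one example of an 8-bit table; any of the Cyrillic tables would show the same thing.

```python
# ASCII characters fit in 7 bits: every code is below 2**7 = 128.
ascii_text = "Hi!;"
ascii_codes = list(ascii_text.encode("ascii"))
assert all(code < 2**7 for code in ascii_codes)

# Russian letters live in the upper half of an 8-bit table (128..255).
# KOI8-R is an assumption here, chosen only as a convenient example table.
russian_text = "Привет"
russian_codes = list(russian_text.encode("koi8_r"))
assert all(128 <= code <= 255 for code in russian_codes)

print(ascii_codes)    # every number needs at most 7 bits
print(russian_codes)  # every number has the 8th bit set
```

Each Russian letter still takes exactly one byte, which is why an 8-bit table is sufficient for our alphabet.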
One of the international projects that should lead to some uniformity is the Unicode project. It will take a long time to achieve this, although some software is already Unicode-aware.
When you are interacting with the computer you are using many programs that ease your life, and if you wish them to speak Russian you ALWAYS need several things:
These three points are the most general way to formulate your goal. Achieving them, as I mentioned before, could be a matter of minutes, or it might require substantial hacking. If you are working in an environment where there is some sort of system administration, you might even run into problems of another nature: the sysadmin's close-mindedness, laziness, and unwillingness to help you. Believe me, I have met such people, and sometimes it can be very hard to win them over, yet you depend on them. My advice is to handle it with care and try to explain to these people what you need -- you will not prove anything by yelling at an administrator who simply does not understand what you want. Sometimes a system administrator will give you a wonderful lecture about security problems and violations that takes much more time than actually getting the 8-bit thing working. You must realize, however, that managing many stations can be a tedious job, so... bear with him.

Encoding issues

Now that we are done with the introductions, let us discuss the Russian aspects of internationalization. The first surprise is that, purely for historical reasons, we have several coding tables for our alphabet. Why? Well... when mid-size machines came (those PDP-like boxes), we invented a table called KOI-8. It has a wonderful property (which was important at that time): Russian letters are laid out in such a way that if you strip the 8th bit, you will still be able to read Russian words, because each Russian letter turns into a phonetically similar Latin letter. The disadvantage of this layout is, of course, that the letters are not in alphabetical order (adding 1 to the code of a letter does not necessarily give you the next letter of the alphabet). This is sometimes inconvenient for programming.
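Both properties of KOI-8 are easy to see in a few lines of Python. This is only a sketch: the helper name strip_high_bit is made up for illustration, and I assume the KOI8-R variant of the table, which Python ships as the codec "koi8_r".

```python
def strip_high_bit(text: str) -> str:
    """Encode text in KOI8-R, clear the 8th bit of every byte, read it as ASCII."""
    stripped = bytes(byte & 0x7F for byte in text.encode("koi8_r"))
    return stripped.decode("ascii")

# Lowercase Russian letters come out as uppercase Latin ones.
print(strip_high_bit("привет"))

# The price: the table is not in alphabetical order.
# The code right after "б" belongs to "ц", not to "в".
code_after_b = "б".encode("koi8_r")[0] + 1
print(bytes([code_after_b]).decode("koi8_r"))
```

The first line prints PRIWET, a readable if slightly odd transliteration, and the second prints ц, which sits where the Latin "C" would be.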
Then microcomputers came. The hardware was already more advanced (no need for phonetic correspondence after stripping the 8th bit), it ran MS-DOS, and the so-called "Alternative" code page emerged. It served its purpose all right.
Apart from these, there are two more schemes. Neither of them was designed in Russia. The ISO-8859-5 standard is part of an attempt to bring some order to the international diversity. But... when it was created, no one asked Russians what they thought about it, so the fact is that it is almost never used. Another monster is CP1251, created by Bill Gates and his gang for his precious Windows (I hate it and think it is the worst thing a man can create). However, its fate has been more successful than ISO's because there are so many Windows boxes around. In fact, CP1251 is accepted as the standard for Russian in the Bill Gates kingdom.
So, what we have in the end is the following. The Unix world uses KOI-8, the dying MS-DOS world has its "Alternative" scheme, and, as I said, Bill Gates is loyal to his CP1251. Macs use their own encoding, but very frequently you will find KOI-8 there.
You also have to take into account that KOI-8 is the de facto standard for Russian news and e-mail. Jerks who use something else there have to be beaten. Much more information on this net standard, which we call KOI8-R, can be found on Andrei Chernov's page. KOI8-R is also officially registered as a valid MIME charset (RFC-1700). Notice that KOI8-R is a Russian table only. There is also a beast called KOI8-U for the Ukrainian language (there are slight differences).
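As an illustration of what "MIME charset" means in practice, here is a small sketch that builds a mail message labeled as KOI8-R with Python's standard email package. The subject-free message and the sample text are my own; a real mailer adds many more headers.

```python
from email.mime.text import MIMEText

text = "Привет из России"

# Ask for the koi8-r charset explicitly; the library encodes the body in
# KOI8-R and labels it in the Content-Type header.
msg = MIMEText(text, "plain", "koi8-r")

print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])

# The receiving side reads the charset label and decodes the bytes with it.
charset = msg.get_content_charset()
received = msg.get_payload(decode=True).decode(charset)
print(received)
```

The point is that the charset label travels with the message, so the recipient's software does not have to guess which table was used.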
If you have a file in one encoding and want to recode it, there are zillions of utilities that will do it for you. Under Unix I use a small C program called recode. I don't remember where I got it. It can easily be compiled anywhere. Because DOS or Windows users usually do not have compilers (and I do not work in that environment), you will have to browse around to find a tool that suits you. Kiarchive would be a good place to start.
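The idea behind any such converter is simple: turn the bytes into characters using the source table, then turn the characters back into bytes using the target table. Here is a minimal Python sketch of that idea. It is not the recode program mentioned above; the function name is my own, and I assume that Python's "cp866" codec corresponds to the "Alternative" table and "mac_cyrillic" to the Mac one.

```python
def recode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source table and encode them in the target table."""
    return data.decode(source).encode(target)

word = "Привет"
koi8 = word.encode("koi8_r")

# The same word has different byte values in every table.
for name in ("cp866", "cp1251", "iso8859_5", "mac_cyrillic"):
    converted = recode(koi8, "koi8_r", name)
    assert converted.decode(name) == word
    print(name, converted.hex(" "))
```

Be aware that this sketch raises an error if the text contains a character that the target table lacks (the pseudographics of one table may have no counterpart in another), so a real tool has to decide what to do in that case.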
Finally, we are there... I think all the preliminary words that had to be said have been said. Now we go to the examples.

X Window System MacOS MS-Windows MS-DOS
Netscape Mosaic Internet Explorer USENET
Unix shells Emacs TeX and LaTeX
PostScript Electronic Mail Russian Spellers Dictionaries
HTTP Java


Please, if you find something incorrect or would like to contribute, drop me a line. I can be reached at serge@metalab.unc.edu.