Over the years, I have found that there is a general interest in
making one's computer speak Russian. This interest comes partly
from non-native speakers who happen to be interested in our culture
and want to read Russian documents and exchange e-mails with Russians.
Partly, it comes from Russians whom fate has taken
abroad and who, sitting before a computer far away from Russia,
would like to have Russian support.
One of the sections that
Dazhdbog's Grandchildren
supports is this one, where I try to explain what is involved
in this procedure, "russification", as we Russians call it.
Depending on what you have on your desk, it might be very simple, or it might
turn out to be a nightmare. In any case, my goal is to make your
computer as Russian-friendly as possible: you need normal access
to Russian networks, newsgroups, and documents.
Some information here is of a general nature. It is relevant to any
language that uses 8bit (see below) encoding schemes: a lot of other
European languages do, although there are some striking differences
because we use a different alphabet.
Historically, it has turned out that a huge chunk of the Internet
is very much English-biased, although the situation is changing quite
rapidly, mainly because of the Web. However, it is still very far
from what one would like it to be. I personally believe that the Web
and most of the software people use should be as language-symmetrical
as possible. Many people all around the world
have contributed to bringing national flavor to the Internet.
There is, however, one generic rule that is independent of what language we
are talking about: how a computer recognizes characters.
Characters are coded according to some scheme, and each character
corresponds to some number. One such scheme is called
ASCII, or American Standard Code for Information Interchange.
It has only 128 characters in it: some control characters, special
characters like ".", ";" or "!", digits, and letters. You need only 7 bits
to code all of them: 2 to the 7th power gives you 128. If you want to
code something else -- national characters, for example -- you have to
tell your computer that it has to operate in 8 bit mode. In this mode
it's possible to compose a table of 256 characters. Usually, ASCII
characters sit from 0 to 127, and all others are located from 128 up to 255.
The Cyrillic alphabet (Russian version) has only 33 letters (66 if you
count upper and lower case separately), so a
256-character grid (8 bits) is enough.
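The 7-bit versus 8-bit split described above can be checked with a few lines of Python. This is only a minimal sketch: KOI8-R is picked here as one example of an 8-bit table, and the sample strings are arbitrary.

```python
# Sketch: ASCII codes fit in 7 bits, Russian letters need the upper
# half (128..255) of an 8-bit table. The sample strings are arbitrary.
ascii_text = "Hello; world!"
russian_text = "Привет"

ascii_codes = list(ascii_text.encode("ascii"))
# KOI8-R is used only as one example of an 8-bit table.
russian_codes = list(russian_text.encode("koi8_r"))

print("7-bit table size:", 2 ** 7)
print("8-bit table size:", 2 ** 8)
print("ASCII codes:  ", ascii_codes)
print("Russian codes:", russian_codes)
print("all ASCII codes below 128:   ", all(c < 128 for c in ascii_codes))
print("all Russian codes in 128-255:", all(128 <= c <= 255 for c in russian_codes))
```

Each Russian letter still occupies exactly one byte, which is why an 8-bit clean program is enough to handle it.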
If you wish to code some Eastern characters you obviously need
more numbers; this translates to more bits per character.
Japanese characters, for example, require 16 bits (or 2 bytes)
to be coded.
One of the international projects that should lead to some uniformity is the
Unicode project. It will
take a long time to get there, although some software is already
Unicode aware.
When you are interacting with the computer you are using many programs
that ease your life, and if you wish them to speak Russian you
ALWAYS need several things:
Software that is 8bit aware
Appropriate fonts installed
A keyboard layout, so you can type
These three points are the most general way to formulate your goal.
Achieving them, as I mentioned before, could be a matter of minutes,
or it might require substantial hacking. If you are working in
an environment where there is some sort of system administration, you
might even run into problems of another nature: a sysadmin's close-mindedness,
laziness, and unwillingness to help you. Believe me, I have
met such people, and sometimes it can be very hard to win them over, yet
you depend on them. My advice is to handle it with care and try to explain
to these people what you need -- you will not prove anything by yelling at
them: they simply do not understand what you want. Sometimes a system
administrator will give you a wonderful lecture about security problems
and violations that takes much more time than actually getting the 8bit
thing working. You must realize, however, that managing many stations
can be a tedious job, so... bear with them.
Encoding issues
Now that we are done with the introductions, let us discuss
the Russian aspects of internationalization. The first surprise comes from the
fact that, purely for historical reasons, we have several
coding tables for our alphabet. Why? See... when mid-size machines
came (those PDP-like boxes), we invented a table called KOI-8. It has
a wonderful property (which was important at that time): Russian
letters are laid out in such a way that
if you strip the 8th bit, you will still be able to read Russian words,
because each Russian letter turns into its phonetic Latin counterpart.
The disadvantage of this
layout is, of course, that the letters are not in alphabetical order
(adding 1 to 'A' you might not get 'B'). It is sometimes
inconvenient for programming.
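Both properties of the KOI-8 layout can be seen in a short Python sketch. It assumes that Python's koi8_r codec corresponds to the table discussed here; the helper name strip_high_bit is made up for this illustration.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear the 8th bit of every byte, as a 7-bit channel would."""
    return bytes(b & 0x7F for b in data)

word = "Привет"
koi8_bytes = word.encode("koi8_r")
stripped = strip_high_bit(koi8_bytes).decode("ascii")
print(stripped)  # pRIWET

# The price of the phonetic layout: byte order is not alphabetical order.
first_letters = "абвгд"
codes = [letter.encode("koi8_r")[0] for letter in first_letters]
print([hex(c) for c in codes])
print("codes follow the alphabet:", codes == sorted(codes))
```

Note that the case comes out flipped: capital Russian letters become small Latin ones and vice versa, but the word stays readable.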
Then microcomputers came. The hardware was already more advanced (no
need for phonetic correspondence after stripping the 8th bit), it operated
under MS-DOS, and the so-called "Alternative" code page emerged. It served
its purpose all right.
Apart from these there are two more schemes. Neither of them was designed
in Russia. The ISO-8859-5 standard is part of an attempt to bring
some order to the international diversity. But... when it was created,
nobody asked Russians what they thought about it, so the fact is that it is
almost never used. Another monster is CP1251, created by Bill Gates and his
gang for his precious Windows (I hate it and think it is the worst a man
can create). However, its fate has been more successful than ISO's because of
so many Windows boxes around. In fact, CP1251 is accepted as the standard
for Russian in the Bill Gates kingdom.
So, what we have in the end is the following. The Unix world uses KOI-8,
the dying MS-DOS world has its "Alternative" scheme, and, as I said,
Bill Gates is loyal to his CP1251. Macs use their own encoding, but very
frequently you'll find KOI-8 there.
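To see how differently these tables place the same letters, here is a small Python sketch. The codec names are Python's own; treating cp866 as the "Alternative" table and mac_cyrillic as the Mac encoding is an assumption made for this illustration.

```python
# Python codec names for the tables mentioned above. Mapping the
# "Alternative" table to cp866 and the Mac one to mac_cyrillic is
# an assumption of this sketch.
CODECS = {
    "KOI8-R": "koi8_r",
    "Alternative (cp866)": "cp866",
    "ISO-8859-5": "iso8859_5",
    "CP1251": "cp1251",
    "Mac": "mac_cyrillic",
}

word = "Русский"
encoded = {name: word.encode(codec) for name, codec in CODECS.items()}

for name, raw in encoded.items():
    print(f"{name:22}", " ".join(f"{b:02x}" for b in raw))

# Reading KOI8-R bytes as if they were CP1251 gives the familiar garbage.
garbled = word.encode("koi8_r").decode("cp1251")
print(garbled)
```

Every table gives a different byte sequence for the same word, so a program has to know which table a text is in before it can show it properly.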
You also have to take into account that KOI-8 is the de facto
standard for Russian news and e-mail. Jerks who use something else
there have to be beaten. Much more information on this net standard
that we call KOI8-R can be found on
Andrei Chernov's page.
KOI-8 is also officially registered as a valid MIME charset
(RFC-1700). Notice
that KOI8-R is a Russian table only. There is
also a beast called
KOI8-U for the Ukrainian language
(there are slight differences).
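As an illustration of what the MIME charset label looks like in practice, here is a minimal sketch that builds a message body with Python's standard email package. The message text is made up, and the transfer encoding is left to the library to choose.

```python
from email.message import EmailMessage

body = "Привет из России"

msg = EmailMessage()
# Ask the library to encode the body in KOI8-R and label it as such.
msg.set_content(body, charset="koi8-r")

print(msg["Content-Type"])
print(msg["Content-Transfer-Encoding"])
print(msg.get_content_charset())
```

The Content-Type header carries the charset name, which is what lets the receiving mail or news reader pick the right table.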
If you have a file in one encoding and want to recode it, there are
zillions of utilities that will do it for you. Under Unix I use a small
C program called
recode.
I don't remember where I got it. It can easily be compiled anywhere.
Because DOS or Windows users usually do not have compilers (and I do
not work in this environment), you will have to look around to find
a tool that suits you.
Kiarchive
would be a good place.
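If no ready-made tool is at hand, recoding is simple enough to sketch in a few lines of Python. This is not the recode program mentioned above, just an illustration; the function names and file names are hypothetical.

```python
def recode_bytes(data: bytes, source: str, target: str) -> bytes:
    """Convert raw text bytes from one 8-bit table to another."""
    return data.decode(source).encode(target)

def recode_file(src_path: str, dst_path: str, source: str, target: str) -> None:
    """Read src_path in the source encoding, write dst_path in the target one."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(recode_bytes(data, source, target))

# In-memory example: CP1251 bytes turned into KOI8-R bytes.
windows_bytes = "Привет".encode("cp1251")
koi8_bytes = recode_bytes(windows_bytes, "cp1251", "koi8_r")
print(koi8_bytes.decode("koi8_r"))
```

A call such as recode_file("letter.win", "letter.koi", "cp1251", "koi8_r") would do the same for a file on disk. Characters that the target table lacks raise an error here rather than being silently dropped.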
Finally, we got there... I think all the preliminary words that had to
be said have been said. Now we go to the examples.