Lekha Project

Ramana Juvvadi (juvvadi@allegra.att.com)
Mon, 30 Oct 95 10:13:30 EST

With all the talk of Telugu Virtual library, let me release here a
README of the Lekha project. Since completing all the aims of the project
is going to take a long time, I am planning to release even incomplete
versions like Lekha-0.1, Lekha-0.2 etc. Once it is open to the public,
I am hoping some enthusiasts will add their own improvements.

I am planning to release Lekha-0.1 in the near future. It will not be a great
improvement over RIT-3.0 in functionality. Specifically, I am planning to

add LaTeX2e support for potana
settle the issue of ISCII

No, Suresh, I wasn't successful in getting any useful documentation on ISCII.
If you can get hold of it, please do so.

******************** Lekha ***************** Lekha ************************

The emergence of free software in the computer industry turns upside
down the conventional wisdom that one gets what one pays for. A chief
example of this is Linux, a Unix operating system which runs on
the Intel platform.

Despite the fact that there is a sizable population of Indian programmers,
very little free software is available for Indian languages. Mostly,
the available free software consists of some PostScript and TeX fonts
and programs for converting ASCII text to TeX and PostScript.

In spite of the fact that there is nearly a one-to-one mapping between
the alphabets of different Indian languages, there has been very little
co-ordination between different programmers. With the exception of
itrans, all the software is designed to work for only a single
language.

The chief aim of the Lekha project is to co-ordinate and document
the work of different programmers working on different languages
and enhance the portability of their work to other languages.

ISC representation
---------------------

I propose a representation for Indian languages called ISC (Indian
Standard Character Set). ISC is not meant to be easily readable by
human beings. Instead it is designed to be easily readable by computer
programs. The basic scheme for representation is this:

0-127 ASCII characters
128-255 Indian characters

Since the representation of ASCII characters does not conflict with
Indian characters, it is possible to mix Indian characters and ASCII
characters freely in one file. As an example, consider the
representation of 'moorkha': 228 176 230 205 171

m     228   code for 'm'
U     176   code for 'U'
r     230   code for 'r'
kh    205   code for 'kh'
a     171   code for 'a'
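
The layout above can be sketched in a few lines of Python. This is only
an illustration: the table holds just the code values quoted in this
README, and the names ISC_CODES, encode_units and is_indian are made up
for the example.

```python
# Partial code table: only the values quoted in this README.
# A complete ISC table is not given here, so this is not the full mapping.
ISC_CODES = {"m": 228, "U": 176, "r": 230, "kh": 205, "a": 171}


def encode_units(units):
    """Encode a list of transliteration units as ISC bytes (hypothetical helper)."""
    return bytes(ISC_CODES[u] for u in units)


def is_indian(byte):
    """Bytes 128-255 are Indian characters; bytes 0-127 are plain ASCII."""
    return byte >= 128


moorkha = encode_units(["m", "U", "r", "kh", "a"])
print(list(moorkha))  # [228, 176, 230, 205, 171]

# ASCII and Indian characters can sit side by side in one byte string.
mixed = b"word: " + moorkha + b"\n"
print([is_indian(b) for b in mixed])
```

Because the two ranges never overlap, a program can decide byte by byte
whether it is looking at ASCII or at an Indian character.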

One immediate advantage of the isc representation is that it is easy to
write converter programs from one format to another. For example, there
are two popular schemes for transliteration in Telugu: the 'rts' scheme
used by the RIT software and the Rachana scheme used by Rachana, a
commercial software package for typesetting Telugu. Several other
transliteration schemes, such as itrans, are also popular on the net.

ISC files normally have a file extension of .isc. RIT files
normally have an extension of .rts and itrans files have an extension of
.itx. Similarly, let us assume that Rachana files normally have a file
extension of .rcn.

I have already written the following two programs

rts2isc --> converts a .rts file to .isc file
isc2rts --> converts a .isc file to .rts file

Suppose I want to write a program which converts a Rachana file to
RIT file and vice-versa. One way is to write programs rcn2rts and
rts2rcn. Another way is to write rcn2isc and isc2rcn. Let us say now
you have a file called kavita.rcn. You want to convert it to
kavita.rts. You would convert kavita.rcn to kavita.isc with rcn2isc
and then convert kavita.isc to kavita.rts with isc2rts. At this stage
let us say you want to add support for itrans. Simple: write two
programs, itx2isc and isc2itx.
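
The routing idea can be sketched as below. This is a hypothetical helper,
not part of Lekha: it only works out which converter programs would be
chained, assuming every converter follows the <source>2<target> naming
pattern used above.

```python
def route(src, dst, hub="isc"):
    """Return the converter program names needed to go from src to dst.

    Assumes every format has a converter to and from the hub format,
    named <src>2<dst> as in this README.
    """
    if src == dst:
        return []
    if src == hub:
        return [f"{hub}2{dst}"]
    if dst == hub:
        return [f"{src}2{hub}"]
    return [f"{src}2{hub}", f"{hub}2{dst}"]


print(route("rcn", "rts"))  # ['rcn2isc', 'isc2rts']
print(route("rts", "isc"))  # ['rts2isc']
print(route("itx", "rcn"))  # ['itx2isc', 'isc2rcn']
```

Adding a new scheme never touches the existing converters; it only adds
one program in each direction.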

Mathematically oriented people might like to put it as "Instead of
writing n(n-1) programs for converting between n transliteration
schemes, you would just write 2n programs". This is correct but there
is something more to be said also. Because of the simplicity of isc
representation it is simpler to write two programs 'rcn2isc' and 'isc2rts'
instead of a direct 'rcn2rts'.
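
To make the arithmetic concrete, here is a small sketch that tabulates
both counts (the function names are chosen just for this example).

```python
def direct_converters(n):
    """Programs needed when every scheme converts directly to every other."""
    return n * (n - 1)


def hub_converters(n):
    """Programs needed when every scheme converts only to and from isc."""
    return 2 * n


for n in (2, 3, 4, 10):
    print(n, direct_converters(n), hub_converters(n))
# 2 2 4
# 3 6 6
# 4 12 8
# 10 90 20
```

On the count alone the two approaches break even at three schemes and
isc wins only beyond that, which is why the simplicity of each
individual converter matters as well.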

LaTeX support for isc
-----------------------

It is very easy to use isc effectively with LaTeX. I have
written a program called 'isc2tex' which converts Indian character
syllables to LaTeX macros. For example, when isc2tex reads the string
232 173 221 229 171 (which stands for v i d y a) it outputs the string

\cvow{v}{i}\ccvow{d}{y}{a}

All the ASCII characters in 0-127 are output without any change.
In total there are four macros:

\vow       vowel and no consonants
\cvow      consonant+vowel
\ccvow     consonant+consonant+vowel
\cccvow    consonant+consonant+consonant+vowel

I am assuming that none of the Indian languages contain ligatures
with more than three consonants.
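
The grouping into syllables can be sketched roughly as follows. This is
not the actual isc2tex source. It uses only the code values that appear
in this README, the split into vowels and consonants is inferred from the
examples, and it assumes every syllable ends in an explicit vowel. The
\ind and \eng mode switches are left out.

```python
# Partial tables built only from codes quoted in this README.
# Which codes are vowels and which are consonants is inferred from the examples.
VOWELS = {171: "a", 172: "A", 173: "i", 176: "U", 185: "O"}
CONSONANTS = {204: "k", 205: "kh", 219: "t", 221: "d",
              228: "m", 229: "y", 230: "r", 232: "v"}
MACROS = {0: r"\vow", 1: r"\cvow", 2: r"\ccvow", 3: r"\cccvow"}


def isc_to_tex(data):
    """Turn ISC bytes into LaTeX macro calls, one per syllable (sketch only)."""
    out = []
    pending = []  # consonants waiting for their vowel
    for byte in data:
        if byte in CONSONANTS:
            pending.append(CONSONANTS[byte])
        elif byte in VOWELS:
            if len(pending) > 3:
                raise ValueError("more than three consonants in one syllable")
            args = pending + [VOWELS[byte]]
            out.append(MACROS[len(pending)] + "".join("{%s}" % a for a in args))
            pending = []
        elif byte < 128 and not pending:
            out.append(chr(byte))  # ASCII passes through unchanged
        else:
            raise ValueError("cannot handle byte %d here" % byte)
    if pending:
        raise ValueError("consonants without a closing vowel")
    return "".join(out)


print(isc_to_tex(bytes([232, 173, 221, 229, 171])))
# \cvow{v}{i}\ccvow{d}{y}{a}
```

The number of consonants collected before the vowel picks the macro, so
the three-consonant limit shows up as the largest key in MACROS.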

As an example, consider a file kavita.rts. You convert it to a tex file
in the following manner:

kavita.rts ---> kavita.isc ---> kavita.tex

------ kavita.rts ---------------------
\documentstyle[telugu]{article}

\begin{document}

This is a test. #kavitA! O kavitA!#
\end{document}
------------- kavita.rts ---------------

------ kavita.isc ---------------------
\documentstyle[telugu]{article}

\begin{document}

This is a test. \204 \171 \232 \173 \219 \172 ! \185 \204 \171 \232 \173 \219 \172 !
\end{document}
------------- kavita.isc ---------------

------ kavita.tex ---------------------
\documentstyle[telugu]{article}

\begin{document}

This is a test. \ind \cvow{k}{a}\cvow{v}{i}\cvow{t}{A}\eng ! \ind \vow{O} \cvow{k}{a}\cvow{v}{i}\cvow{t}{A}\eng !
\end{document}
------------- kavita.tex ---------------

\ind indicates that Indian mode is on and \eng indicates that
English mode is on.

The next step consists of making the macros \vow, \cvow, \ccvow and
\cccvow work with PostScript or TeX fonts. I have already done this
for 'potana', a Telugu PostScript font. I have partly done it for
Frans Velthuis' Devanagari TeX font.

Writing these macros is a nontrivial task. Some
commitment of time is involved.

Postscript support
-------------------

It is possible to typeset Indian languages with LaTeX, but
many people find LaTeX too complex. For them, I am planning to
write a simple converter from isc files to PostScript. There is
a nice program called 'genscript' which converts an ASCII file
to PostScript. It has several options and supports rudimentary
formatting. One option is to modify the 'genscript' program to
work with isc files. The other option is to generate an input file
for 'genscript'.

Viewer for ISC
----------------

One disadvantage of isc files is that they cannot be viewed
directly. One can think of writing a viewer which works on the
X, Windows, and Mac platforms. Of course, one can always
convert a file to PostScript and view it. But a direct viewer
with a better response time would be very convenient. It would
then be possible to put isc files on the network and connect the
viewer to Netscape or Mosaic. Right now, importing
PostScript files over the network is very slow.

Editor for ISC
----------------

This is probably the most ambitious and time-consuming,
but probably also the most useful, of all the projects. There is
an emacs editor called 'mule' which can work with multiple
languages. Making 'mule' work with isc files is not trivial
but if it can be done it would be wonderful. Personally,
I would like to concentrate on this once the smaller projects
are done.

Word Processor for ISC
------------------------

Again, this is not easy to write. But we can start with a
rudimentary version and add capabilities slowly.

Spell Checker for ISC
-----------------------

This is language dependent, and a full-fledged spelling checker for
any Indian language is not easy to write. But a rudimentary
version that checks for unpronounceable conjuncts like 'gk' is
certainly possible.
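
A rudimentary check of this kind might look like the sketch below. The
list of bad pairs is hypothetical: only 'gk' comes from this README, and
a real list would have to be compiled separately for each language. The
input is assumed to be already split into syllables of consonants plus a
vowel.

```python
# Hypothetical list of consonant pairs that should never form a conjunct.
# Only ("g", "k") comes from this README; a real list would be per language.
BAD_CONJUNCTS = {("g", "k")}


def find_bad_conjuncts(syllables):
    """Return (syllable index, pair) for each suspicious adjacent consonant pair.

    Each syllable is a (consonants, vowel) tuple, e.g. (["d", "y"], "a").
    """
    problems = []
    for index, (consonants, _vowel) in enumerate(syllables):
        for pair in zip(consonants, consonants[1:]):
            if pair in BAD_CONJUNCTS:
                problems.append((index, pair))
    return problems


word = [(["v"], "i"), (["g", "k"], "a")]
print(find_bad_conjuncts(word))  # [(1, ('g', 'k'))]
```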

Why not unicode?
---------------

One advantage of isc is that no characters are assigned in the ranges
0-31 and 128-160. This is important for sending data freely across the
network. This is not possible with Unicode. At present, not much
software supports Unicode. If Unicode does gain popularity, it will
be easy to write converter programs from ISC to Unicode. I feel it is
more important not to confuse the existing mailers and news readers than
to support Unicode.
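
As a sketch of how a tool could enforce this, the hypothetical helper
below reports any byte that falls in the unassigned range 128-160. It
deliberately does not check 0-31, since ordinary text files contain
newlines and tabs from that range.

```python
def unassigned_bytes(data):
    """Return (offset, value) for every byte in the unassigned range 128-160."""
    return [(i, b) for i, b in enumerate(data) if 128 <= b <= 160]


good = b"test " + bytes([228, 176, 230, 205, 171]) + b"\n"
bad = good + bytes([130])
print(unassigned_bytes(good))  # []
print(unassigned_bytes(bad))   # [(11, 130)]
```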

Why not ISCII?
-------------

I haven't found it easy to get documentation on ISCII. It needs to be
carefully looked at whether ISCII will serve the purpose.

Summary
-------

We seem to be duplicating a lot of work in trying to bring Indian
languages into the computer age. My belief is that it is possible to
work together and avoid duplication of work while accommodating everyone's
concerns. A transliteration scheme that is natural for
Tamil may not be so natural for Devanagari. It is possible to support
multiple transliteration schemes and there is no need to insist on a
single transliteration scheme. I propose a standard representation
called ISC for this purpose. I have already written some
programs which are useful and also illustrate how to use isc files.

Program          Status   Purpose
----------------------------------------------------------------------------

rts2isc          done     Converts a RIT file to an isc file

isc2rts          done     Converts an isc file to a RIT file

isc2tex          done     Converts an isc file to a tex file

rcn2isc          due      Converts a Rachana file to an isc file

isc2rcn          due      Converts an isc file to a Rachana file

itx2isc          due      Converts an itrans file to an isc file

isc2itx          due      Converts an isc file to an itrans file

isc2ps           due      Converts an isc file to a PostScript file,
                          most probably with the help of genscript

isc viewer       due      A simple viewer for isc files

isc editor       due      Either make multilingual emacs work
                          with isc files or write an editor

word processor   due      A generic word processor which
                          can work with Indian fonts