The Multilingual Web: Anything Missing ?

Introduction

This document is a Chair Memo for the panel "The Multilingual Web: Anything Missing ?", organized in the context of the 9th International World Wide Web Conference, Amsterdam, 15-19 May 2000.

This memo has a double intention:

This memo will probably change in the next few weeks.

Internationalization and multilingualism

Internationalization (I18N) means creating culture-neutral systems that are apt for Localization (L10N) into different cultures; for example, Greece, French-speaking Canada or French-speaking Belgium. Language is one of the aspects, but there are others, such as date representation. In the context of I18N, language is mostly concerned with basic mechanisms; for example, a large character repertoire. Essentially, one is talking about monolingual systems (e.g., Spanish or Greek). In new developments, I18N is usually taken into account from day one. The existing problems are mostly due to legacy systems.

Multilingualism is concerned with processing several languages together (e.g., Spanish and Greek); for example, facilities for multilingual sites, translation or multilingual parallel texts. I18N could be considered a layer below multilingualism.

I18N and multilingualism aspects are often interwoven. For example, one could need a single document containing both Greek and Spanish, yet have no need for multilingual parallel texts.

The emphasis is on mechanisms that could be standardized and hence used by many vendors.

Existing facilities

Many of the basic mechanisms for I18N have been implemented. For example, HTML supports many different character encodings, and this allows the representation of most languages.

In general, implementing the internal representation is easy and the external representation is hard. For example, one could have a file in the computer with a Greek text but be unable to view it, because no application supports the character encoding and/or the necessary glyphs are not available.

Characters

This has been largely solved. In other words, if one does not see the expected characters, it is probably a question of configuration, loading the glyphs or similar factors.

Some concepts associated with characters are: glyph, character, character repertoire, transformation, character encoding, single byte encoding and multibyte encoding.
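
As a rough illustration of the difference between a character, a single byte encoding and a multibyte encoding, the following Python sketch encodes the same character in two ways. The choice of the Greek letter alpha and the helper name encoded_bytes are assumptions made only for this example.

```python
def encoded_bytes(char: str, encoding: str) -> bytes:
    """Return the byte sequence that represents `char` in `encoding`."""
    return char.encode(encoding)


alpha = "\u03b1"  # the character GREEK SMALL LETTER ALPHA

single = encoded_bytes(alpha, "iso-8859-7")  # single byte Greek encoding
multi = encoded_bytes(alpha, "utf-8")        # multibyte Unicode encoding

print(len(single), single.hex())  # 1 e1
print(len(multi), multi.hex())    # 2 ceb1
```

The character is the same in both cases; only its encoded form differs. The glyph used to draw it is a separate matter again and depends on the available fonts.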

Transparent Content Negotiation (TCN)

With transparent content negotiation, a user visiting the URL http://europa.eu/roma could receive in the browser the Treaty of Rome in French, and a second user visiting the same URL could receive the Treaty in Spanish. This depends on the language preferences set in the browser.

Though these facilities have existed for years, few sites use them. One of the main reasons is that people (webmasters and users) are not even aware that they exist.
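
The language selection described above could be sketched as follows. This is only a minimal illustration of matching the browser's Accept-Language preferences against the available linguistic versions; it is not the full TCN protocol, and the function names, the list of variants and the default language are assumptions made for the example.

```python
def parse_accept_language(header: str) -> list[tuple[str, float]]:
    """Parse an Accept-Language header into (language, quality) pairs, best first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        lang, _, params = item.partition(";")
        quality = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0  # treat a malformed quality as "not acceptable"
        prefs.append((lang.strip().lower(), quality))
    # sorted() is stable, so equal qualities keep the order of the header
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_variant(header: str, available: list[str], default: str) -> str:
    """Pick the linguistic version that best matches the user's preferences."""
    for lang, quality in parse_accept_language(header):
        if quality <= 0:
            continue
        if lang in available:
            return lang
        primary = lang.split("-")[0]  # e.g. "fr-be" falls back to "fr"
        if primary in available:
            return primary
    return default


variants = ["en", "es", "fr"]  # hypothetical versions held by the server
print(choose_variant("fr-BE, fr;q=0.9, en;q=0.5", variants, "en"))  # fr
print(choose_variant("es, en;q=0.8", variants, "en"))               # es
```

Two users requesting the same URL with different browser settings would thus receive different linguistic versions, and a user whose languages are all unavailable would receive the default.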

Missing facilities

Few basic mechanisms are missing; e.g., tags for transliteration. Most of the missing mechanisms are at a higher level; e.g., parallel texts.

Glyph Servers

These proxy servers substitute inline bitmaps for non-ASCII characters in the current page. The glyph server retrieves a document, and then parses the HTML to replace non-displayable characters with an <IMG> element. Each <IMG> element points to a bitmap image of the glyph. The client eventually receives the edited HTML along with all the new images. The resulting display is fairly accurate, but retrieval time is long, and the text can no longer be treated as text since it is now stored graphically.
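
The substitution step could look roughly like the sketch below. It works on plain text content rather than on a parsed HTML page, and the image location http://glyphs.example/ and the file naming scheme are purely hypothetical.

```python
import html

GLYPH_BASE = "http://glyphs.example/"  # hypothetical location of the glyph bitmaps


def substitute_glyph_images(text: str, base_url: str = GLYPH_BASE) -> str:
    """Replace every non-ASCII character in `text` with an <IMG> element."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(html.escape(ch))  # ASCII is passed through (escaped)
        else:
            code = f"{ord(ch):04X}"      # code point as hexadecimal, e.g. 00F1
            out.append(f'<IMG SRC="{base_url}{code}.png" ALT="U+{code}">')
    return "".join(out)


print(substitute_glyph_images("Espa\u00f1a"))
# Espa<IMG SRC="http://glyphs.example/00F1.png" ALT="U+00F1">a
```

The output shows the drawback mentioned above: the word is no longer text, so it cannot be searched, copied or indexed as such.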

Embedding Fonts

This approach allows you to send fonts with individual web pages. Unfortunately, different browsers supply varying levels of support for embedded fonts. Some browsers control font display themselves, while others rely more heavily on the operating system's font display handling.

URI

It is not possible to have a URI such as http://España.com as the ñ is not allowed.
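
The restriction can be made visible with a small check for characters outside ASCII. The second function shows one possible ASCII-compatible form of the host name, using the idna codec of the Python standard library; that codec implements IDNA, which was standardized only after this memo was written, so it is shown here as an assumption about how the gap could be bridged and not as an existing facility. The function names are hypothetical.

```python
def non_ascii_characters(uri: str) -> list[str]:
    """Return the characters of `uri` that fall outside the ASCII range."""
    return [ch for ch in uri if ord(ch) > 127]


def ascii_host(host: str) -> str:
    """Convert a host name to an ASCII-only form with Python's idna codec."""
    return host.encode("idna").decode("ascii")


print(non_ascii_characters("http://Espa\u00f1a.com"))  # ['ñ']

encoded = ascii_host("Espa\u00f1a.com")
print(encoded)                                 # an ASCII-only label with the "xn--" prefix
print(encoded.encode("ascii").decode("idna"))  # back to the lower-cased Unicode form
```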

Language Conversion

Language Conversion refers to transliteration, transcription and similar natural language transformations. There is no mechanism for tagging a document (or element) as Greek transliterated for French speakers.

One approach is to extend Tags for the Identification of Languages (RFC 1766). This approach has the advantage that little would have to be changed in the present standards and software. In the worst case, the new semantics could be ignored. An example of this type of approach is the Codes for language transformation (lapsed Internet-Draft).

Language and character encoding are separate mechanisms. Software that infers the encoding from the language will break; e.g., Greek does not automatically mean that one could use ISO-8859-7, because if the text is transliterated for French speakers one would have to use ISO-8859-1. With character encodings that include both character repertoires, this is not an issue; e.g., Unicode.
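
The point can be checked with a short Python sketch. The Greek word and its Latin-script rendering are illustrative only (the rendering does not follow any particular transliteration standard), and can_encode is a hypothetical helper.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Report whether every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


greek_script = "\u0391\u03b8\u03ae\u03bd\u03b1"  # a Greek word in Greek script
transliterated = "Ath\u00edna"                   # illustrative Latin-script rendering

for label, text in (("Greek script", greek_script), ("transliterated", transliterated)):
    for encoding in ("iso-8859-7", "iso-8859-1", "utf-8"):
        print(f"{label:15} {encoding:11} {can_encode(text, encoding)}")
```

Both texts would carry the language Greek, yet only one of them fits ISO-8859-7; the Unicode encoding accepts both.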

Multilingual Parallel Texts

Multilingual parallel texts are translations of each other; for example, the Treaty of Rome in eleven languages. Each individual linguistic version is an object in itself that could be part of a set of parallel texts. There are also metadata to be considered at set level.

Some linguistic versions could be partial. For example, a linguistic version could be incomplete or there could be several fragments of text corresponding to one fragment in the source document; the partial document could be the result of pre-processing (e.g., extraction of a translation memory) for translation.

The texts could be aligned at different levels. For example, they could be aligned at document, paragraph, term or even word level.

Monolingual texts, partial parallel texts and full parallel texts should be considered a continuum. They correspond to different phases of the Author, Translation and Publication chain (ATP-chain) and there should be one single model.

All the translation types should be considered: human translation, machine translation and any type in between.

What is needed is a standard for multilingual parallel texts that takes into account all these aspects: a standard for the Multilingual Dossier. Such a data object could then be processed by many tools; for example, there could be browsers that display parallel texts side by side. Such a standard would allow a faster expansion of the language industry.
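
To make the idea more concrete, a data object of this kind could be sketched as below. No such standard exists; the class names, the fields and the choice of aligning by paragraph position are all assumptions for the sake of illustration.

```python
from dataclasses import dataclass, field
from itertools import zip_longest
from typing import Optional


@dataclass
class LinguisticVersion:
    """One linguistic version; it is also usable on its own."""
    language: str
    paragraphs: list[str]
    translation_type: str = "human"  # e.g. "human", "machine" or anything in between


@dataclass
class Dossier:
    """A set of parallel texts, with metadata held at set level."""
    title: str
    versions: dict[str, LinguisticVersion] = field(default_factory=dict)

    def add(self, version: LinguisticVersion) -> None:
        self.versions[version.language] = version

    def side_by_side(self, left: str, right: str) -> list[tuple[Optional[str], Optional[str]]]:
        """Pair paragraphs by position; a partial version yields None."""
        return list(zip_longest(self.versions[left].paragraphs,
                                self.versions[right].paragraphs))


dossier = Dossier(title="Sample document")
dossier.add(LinguisticVersion("en", ["First paragraph.", "Second paragraph."]))
dossier.add(LinguisticVersion("fr", ["Premier paragraphe."], translation_type="machine"))

for english, french in dossier.side_by_side("en", "fr"):
    print(english, "|", french)
```

A monolingual text is then simply a dossier with one version, and a partial translation is a version with fewer paragraphs, which fits the single model argued for above.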

Multilingual site

A standard for multilingual sites is needed, or at least a guideline.

Neutral templates

Templates for generating parallel texts; i.e., a file without linguistic texts that defines how the documents in the different languages should be generated.
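
A neutral template of this kind could be sketched with the Template class of the Python standard library. The markup, the placeholder names and the sample strings are assumptions chosen only for the example.

```python
from string import Template

# The template contains structure only, no linguistic text.
NEUTRAL_TEMPLATE = Template("<H1>$title</H1>\n<P>$greeting</P>")

# The linguistic texts are kept apart, one set per language.
TEXTS = {
    "es": {"title": "Tratado", "greeting": "Bienvenido"},
    "fr": {"title": "Trait\u00e9", "greeting": "Bienvenue"},
}


def generate(template: Template, texts: dict[str, dict[str, str]]) -> dict[str, str]:
    """Generate one document per language from the same neutral template."""
    return {lang: template.substitute(strings) for lang, strings in texts.items()}


documents = generate(NEUTRAL_TEMPLATE, TEXTS)
print(documents["es"])
print(documents["fr"])
```

Adding a language then means adding one more set of strings; the template itself does not change.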

Language Negotiation

Language negotiation can be done using TCN, but guidelines and implementations are needed for:

Other links

Suggestions of Patrice Husson:

Acknowledgment

Suzanne Topping suggested most of the font issues. Her original text is Fonts Role in Web-based Character display.

Author

M.T. Carrasco Benitez
Send comments to ca{AT}dragoman{DOT}org

Last updated: 17 May 2000