The Multilingual Web: Anything Missing ?

Introduction

This document is a Chair Memo for the panel The Multilingual Web: Anything Missing ? organized in the context of the 9th International World Wide Web Conference, Amsterdam, 15-19 May 2000.

This memo has two aims:

Existing facilities: give an overview.
Missing facilities: list new candidate facilities.

This memo will probably change in the next few weeks.

Internationalization and multilingualism

Internationalization (I18N) means creating culture-neutral systems suitable for Localization (L10N) into different cultures; for example, Greece, French-speaking Canada or French-speaking Belgium. Language is one aspect, but there are others, such as date representation. In the context of I18N, language is mostly a matter of basic mechanisms; for example, a large character repertoire. Essentially, one is talking about monolingual systems (e.g., Spanish or Greek). In new developments, I18N is usually taken into account from day one. The existing problems are mostly due to legacy systems.

Multilingualism is concerned with processing several languages together (e.g., Spanish and Greek); examples are facilities for multilingual sites, translation or multilingual parallel texts. I18N could be considered a layer below multilingualism.

I18N and multilingualism aspects are often interwoven. For example, one could need a single document containing both Greek and Spanish, without any need for multilingual parallel texts.

The emphasis is on mechanisms that could be standardized and hence used by many vendors.

Link:

LISA

Existing facilities

Many of the basic mechanisms for I18N have been implemented. For example, HTML supports many different character encodings, and this allows the representation of most languages.

In general, implementing the internal representation is easy and the external representation is hard. For example, one could have a file on the computer containing a Greek text but be unable to view it, because no application supports the character encoding and/or the necessary glyphs are not available.

Link:

Internationalization/Localization, W3C.

Characters

This has been largely solved. In other words, if one does not see the expected characters, it is probably a question of configuration, loading the glyphs or similar factors.

Some concepts associated with characters are: glyph, character, character repertoire, transformation, character encoding, single byte encoding and multibyte encoding.
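The difference between single byte and multibyte encodings can be seen with a small sketch. The sample word is an assumption chosen for illustration; it does not come from the memo.

```python
# Illustrative sample text (an assumption, not from the memo):
# "Greece" written in Greek, six characters long.
text = "Ελλάδα"

# ISO-8859-7 is a single byte encoding: one byte per Greek character.
single_byte = text.encode("iso-8859-7")

# UTF-8 is a multibyte encoding of Unicode: two bytes per Greek character.
multi_byte = text.encode("utf-8")

print(len(text), len(single_byte), len(multi_byte))

# Decoding the bytes with the wrong encoding raises no error here;
# it silently produces different characters.
misread = single_byte.decode("iso-8859-1")
print(misread == text)
```

This is the kind of mismatch behind "unexpected characters": the bytes are intact, but the application interprets them with a different character encoding.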

Link:

Transparent Content Negotiation (TCN)

With transparent content negotiation, a user visiting the URL http://europa.eu/roma could receive the Treaty of Rome in French, and a second user visiting the same URL could receive the Treaty in Spanish. This depends on the language preference set up in the browser.

Though these facilities have existed for years, few sites use them. One of the main reasons is that people (webmasters and users) are not even aware that they exist.
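As a rough illustration of how a server might pick a linguistic version from the browser's language preference, here is a minimal sketch. It only handles the Accept-Language request header in a simplified way; the function names and the set of available languages are assumptions, and real transparent content negotiation is more elaborate than this.

```python
def parse_accept_language(header):
    """Return the language tags of an Accept-Language value, best first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        q = 1.0  # a tag without an explicit quality value counts as 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed quality value as "not wanted"
        prefs.append((tag.strip().lower(), q))
    # sorted() is stable, so equal quality values keep the header order
    ordered = sorted(prefs, key=lambda pref: -pref[1])
    return [tag for tag, q in ordered if q > 0]


def choose_version(header, available, default):
    """Pick the first preferred language that the site can serve."""
    for tag in parse_accept_language(header):
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # e.g. "es-ES" falls back to "es"
        if primary in available:
            return primary
    return default


# Hypothetical set of linguistic versions held by the site
available = {"fr", "es", "el"}
print(choose_version("es-ES,es;q=0.9,fr;q=0.5", available, "fr"))
```

With the header above, the Spanish version is selected; a browser sending only `fr` would receive the French one from the same URL.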

Links:

Transparent Content Negotiation in HTTP.
Content Negotiation Explained.
Linux Mandrake (example of TCN).
Europa (example of no TCN).

Missing facilities

Few basic mechanisms are missing; e.g., tags for transliteration. Most of the missing mechanisms are at a higher level; e.g., parallel texts.

Glyph Servers

These proxy servers substitute inline bitmaps for non-ASCII characters in the current page. The glyph server retrieves a document, and then parses the HTML to replace non-displayable characters with an <IMG> element. Each <IMG> element points to a bitmap image of the glyph. The client eventually receives the edited HTML along with all the new images. The resulting display is fairly accurate, but retrieval time is long, and the text can no longer be treated as text since it is now stored graphically.
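The substitution step can be sketched as follows. This is a simplification under stated assumptions: the glyph URL pattern is hypothetical, and the sketch applies a plain text substitution instead of parsing the HTML, so it would mishandle non-ASCII characters inside tags or attributes.

```python
import re

# Hypothetical location of the glyph bitmaps, one image per code point
GLYPH_URL = "http://glyphs.example/{:04X}.gif"


def replace_non_ascii(html_text):
    """Replace every non-ASCII character with an <IMG> element."""
    def to_img(match):
        code_point = ord(match.group(0))
        return '<IMG SRC="{}" ALT="U+{:04X}">'.format(
            GLYPH_URL.format(code_point), code_point
        )
    # Any character outside the 7-bit ASCII range is substituted
    return re.sub(r"[^\x00-\x7F]", to_img, html_text)


print(replace_non_ascii("<P>Espa\u00f1a</P>"))
# prints: <P>Espa<IMG SRC="http://glyphs.example/00F1.gif" ALT="U+00F1">a</P>
```

The output shows the drawback mentioned above: the word is no longer text, so it cannot be searched, copied or indexed as such.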

Embedding Fonts

This approach allows fonts to be sent with individual web pages. Unfortunately, different browsers supply varying levels of support for embedded fonts. Some browsers control font display themselves, while others rely more heavily on the operating system's font display handling.

Link:

Typography on the Web (Microsoft)

URI

It is not possible to have a URI such as http://España.com as the ñ is not allowed.
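The restriction can be illustrated with a minimal check of host names against the ASCII letter, digit and hyphen rule. The function name is an assumption, and the sketch ignores length limits and other details of host name syntax.

```python
import re

# One host name label: ASCII letters, digits and hyphens,
# not starting or ending with a hyphen.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def is_ascii_hostname(host):
    """Return True if every dot-separated label uses only allowed characters."""
    return all(LDH_LABEL.fullmatch(label) for label in host.split("."))


print(is_ascii_hostname("Espa\u00f1a.com"))  # the ñ is rejected
print(is_ascii_hostname("Espana.com"))
```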

Link:

URIs and other identifiers

Language Conversion

Language Conversion refers to transliteration, transcription and similar natural language transformations. There is no mechanism for tagging a document (or element) as Greek transliterated for French speakers.

One approach is to extend Tags for the Identification of Languages (RFC1766). This approach has the advantage that little would have to be changed in the present standards and software. In the worst case, the new semantics could be ignored. An example of this type of approach is the Codes for language transformation (lapsed Internet-Draft).

Language and character encoding are separate mechanisms. Software that infers the encoding from the language will break; e.g., Greek does not automatically mean that one can use ISO-8859-7, since if the text is transliterated for French speakers one would have to use ISO-8859-1. With a character encoding whose repertoire covers both, e.g., Unicode, this is not an issue.
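To illustrate why an extension of the language tag could be ignored safely, here is a sketch of software that only understands the primary tag. The tag `el-translit-fr` is purely hypothetical: it is not registered anywhere and is not taken from RFC1766 or from the Internet-Draft.

```python
# Hypothetical table of the primary language tags this software knows
KNOWN_LANGUAGES = {"el": "Greek", "fr": "French", "es": "Spanish"}


def describe(tag):
    """Split a language tag into a known language name and its other subtags."""
    subtags = tag.lower().split("-")
    primary, rest = subtags[0], subtags[1:]
    return KNOWN_LANGUAGES.get(primary, "unknown"), rest


# Software unaware of the extension still recognizes the text as Greek;
# software aware of it could additionally interpret the remaining subtags.
print(describe("el-translit-fr"))
print(describe("el"))
```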
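The separation between language and encoding can be checked with a short sketch. The sample strings are assumptions: "Athens" in Greek script and an illustrative Latin-script transliteration with an accented letter.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


greek = "Αθήνα"        # Greek language, Greek script
translit = "Athína"    # Greek language, illustrative Latin transliteration

for sample in (greek, translit):
    for encoding in ("iso-8859-7", "iso-8859-1", "utf-8"):
        print(sample, encoding, can_encode(sample, encoding))
```

Both strings are Greek as far as the language is concerned, yet each one fits only one of the two ISO-8859 encodings; only the Unicode encoding accepts both.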

Multilingual Parallel Texts

Multilingual parallel texts are translations of each other; for example, the Treaty of Rome in eleven languages. Each individual linguistic version is an object in itself that could be part of a set of parallel texts. There are also metadata to be considered at the set level.

Some linguistic versions could be partial. For example, a linguistic version could be incomplete, or there could be several fragments of text corresponding to one fragment in the source document; the partial document could be the result of pre-processing for translation (e.g., extraction from a translation memory).

The texts could be aligned at different levels; for example, at document, paragraph, term or even word level.

Monolingual texts, partial parallel texts and full parallel texts should be considered a continuum. They correspond to different phases of the Author, Translation and Publication chain (ATP-chain) and there should be one single model.

All the translation types should be considered: human translation, machine translation and any type in between.

What is needed is a standard for multilingual parallel texts that takes all these aspects into account: a standard for a Multilingual Dossier. Such a data object could then be processed by many tools; for example, there could be browsers that display parallel texts side by side. Such a standard would allow a faster expansion of the language industry.
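A possible shape for such a data object is sketched below. This is only an assumption about what a Multilingual Dossier could look like: the class, field and method names are hypothetical, alignment is shown at paragraph level only, and a missing paragraph is represented by None to model a partial linguistic version.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MultilingualDossier:
    title: str             # metadata at set level
    source_language: str   # language of the source document
    # language tag -> paragraphs; None marks a paragraph not yet translated
    versions: Dict[str, List[Optional[str]]] = field(default_factory=dict)

    def aligned_rows(self, languages):
        """Return paragraph-aligned rows, one column per requested language."""
        length = max(len(self.versions[lang]) for lang in languages)
        rows = []
        for i in range(length):
            row = []
            for lang in languages:
                paragraphs = self.versions[lang]
                row.append(paragraphs[i] if i < len(paragraphs) else None)
            rows.append(tuple(row))
        return rows

    def is_complete(self, lang):
        """A version is complete if it covers every source paragraph."""
        source = self.versions[self.source_language]
        target = self.versions.get(lang, [])
        return len(target) == len(source) and all(p is not None for p in target)


# Placeholder paragraphs, not actual treaty text
dossier = MultilingualDossier(
    title="Sample document",
    source_language="fr",
    versions={
        "fr": ["Premier paragraphe.", "Second paragraphe."],
        "es": ["Primer párrafo."],  # partial version
    },
)

# A side-by-side display only needs the aligned rows
for row in dossier.aligned_rows(["fr", "es"]):
    print(row)
print(dossier.is_complete("es"))
```

Keeping monolingual, partial and full parallel texts in the same structure reflects the idea of a single model across the whole chain: a monolingual text is simply a dossier with one version.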

Multilingual site

A standard for multilingual sites is needed, or at least a guideline.

Neutral templates

Templates for generating parallel texts are needed; i.e., a file without linguistic text that defines how the documents in the different languages should be generated.
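A neutral template can be sketched with the standard string.Template class. The template, the placeholder names and the per-language catalogs are assumptions made for illustration.

```python
from string import Template

# The neutral template holds structure only, no linguistic text
NEUTRAL_TEMPLATE = Template("<H1>$title</H1>\n<P>$greeting</P>")

# Hypothetical catalogs: all linguistic text lives here, one per language
CATALOGS = {
    "es": {"title": "Tratado", "greeting": "Bienvenido"},
    "fr": {"title": "Traité", "greeting": "Bienvenue"},
}


def generate(lang):
    """Generate the document for one language from the neutral template."""
    return NEUTRAL_TEMPLATE.substitute(CATALOGS[lang])


for lang in sorted(CATALOGS):
    print(generate(lang))
```

Because every linguistic version comes from the same template, the generated documents are parallel by construction.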

Language Negotiation

Language negotiation can be done using TCN, but guidelines and implementations are needed for:

Response to a request without a language preference. The server could send a multilingual page from which to select a language.
Return of the available linguistic versions in the body (example).
Return of the available linguistic versions in the header. This could be used to implement a language button in the browser, similar to File, Edit or View at the top of the browser.
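The three items above could be sketched as follows. The file names and function names are hypothetical; the header value is modelled on the Alternates header of Transparent Content Negotiation in HTTP, with a fixed quality of 1.0 chosen for illustration, and the use of the 300 Multiple Choices status is one possible choice, not a recommendation from this memo.

```python
# Hypothetical linguistic versions of one resource: language tag -> file name
VERSIONS = {"es": "roma.es.html", "fr": "roma.fr.html", "el": "roma.el.html"}


def alternates_header(versions):
    """List the available versions in a response header."""
    parts = [
        '{"' + name + '" 1.0 {type text/html} {language ' + lang + "}}"
        for lang, name in sorted(versions.items())
    ]
    return "Alternates: " + ", ".join(parts)


def language_links(versions):
    """List the available versions in the body, one HTML link per language."""
    return [
        '<A HREF="{}" HREFLANG="{}">{}</A>'.format(name, lang, lang)
        for lang, name in sorted(versions.items())
    ]


def respond_without_preference(versions):
    """Response to a request that carries no language preference."""
    status = "300 Multiple Choices"
    headers = [alternates_header(versions)]
    body = "\n".join(language_links(versions))
    return status, headers, body


status, headers, body = respond_without_preference(VERSIONS)
print(status)
print(headers[0])
print(body)
```

A browser that understood such a header could populate a language button from it, without having to parse the body.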

Acknowledgment

Suzanne Topping suggested most of the font issues. Her original text is Fonts Role in Web-based Character display.

Author

M.T. Carrasco Benitez
Send comments to ca{AT}dragoman{DOT}org

Last updated: 17 May 2000
