Big Multilingual Linked Data (BigMu)
M.T. Carrasco Benitez, European Commission, January 2014

BigMu is the confluence of three streams with their own peculiarities and traditions:
- Linked Data
- Big Data
- Multilingual parallel corpora

Standards should be applied in the simplest possible fashion. Back to basics: it is about using URIs and content negotiation for the language, format and similar items. For simplicity, one service can supply both the human-readable and the machine-readable versions using XHTML with appropriate markup, though separate XML and HTML outputs could also be arranged. The same mechanisms must work for small data (one record) and Big Data (terabyte-sized databases), and for tabular and prose data.
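
As an illustration of content negotiation on language and format, the following is a minimal Python sketch, not a description of any existing service. The function names, the sets of available languages and formats, and the defaults are all assumptions made for the example. It handles q-values and language prefixes only; wildcards such as `*/*` simply fall back to the default.

```python
# Hypothetical variants offered for one URI (assumptions for this sketch).
LANGUAGES = {"en", "fr", "es"}
FORMATS = {"application/xhtml+xml", "text/html", "application/xml"}


def parse_accept(header):
    """Parse an Accept or Accept-Language header into (value, q) pairs, best first."""
    items = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        fields = [f.strip() for f in part.split(";")]
        value = fields[0].lower()
        q = 1.0  # q defaults to 1 when absent
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        items.append((value, q))
    # sorted() is stable, so equal weights keep their header order.
    return sorted(items, key=lambda item: -item[1])


def negotiate(header, available, default):
    """Return the best available variant for the header, or the default."""
    for value, q in parse_accept(header):
        if q <= 0:
            continue
        if value in available:
            return value
        # Language ranges: "fr-BE" is satisfied by a plain "fr" variant.
        prefix = value.split("-")[0]
        if prefix in available:
            return prefix
    return default


def select_variant(accept_language, accept):
    """Choose the language and format for one request to the same URI."""
    language = negotiate(accept_language, LANGUAGES, "en")
    media_type = negotiate(accept, FORMATS, "application/xhtml+xml")
    return language, media_type
```

For example, `select_variant("fr-BE, en;q=0.8", "application/xml;q=0.9, text/html")` selects the French HTML variant, while a request for an unavailable language falls back to the defaults. The same selection logic applies whether the URI identifies one record or a whole dataset.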

Multilingual data often requires cleaning, hard-to-process data might be discarded, and there is a bias toward bilingual data. The challenge is to end up with clean, complete, n-lingual aligned data. To put the problem in perspective, there is nothing better than trying to process a large corpus, such as about ten years of the Official Journal of the European Union (OJ).
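
What "complete n-lingual aligned data" means can be sketched in a few lines of Python. This is only an illustration under assumed conventions, not the cleaning procedure applied to the OJ: the data layout (a mapping from a hypothetical unit identifier to its text per language code) and the function name are invented for the example.

```python
def complete_alignments(units, languages):
    """Keep only the units that have non-empty text in every required language.

    units: dict mapping a unit id to a dict of {language code: text}
           (an assumed layout for this sketch).
    languages: list of language codes that every kept unit must cover.
    """
    required = set(languages)
    kept = {}
    for unit_id, versions in units.items():
        # Minimal cleaning: collapse runs of whitespace and drop empty versions.
        cleaned = {
            lang: " ".join(text.split())
            for lang, text in versions.items()
            if text and text.strip()
        }
        # A unit aligned in only some of the languages is discarded.
        if required.issubset(cleaned):
            kept[unit_id] = {lang: cleaned[lang] for lang in languages}
    return kept
```

With three required languages, a unit that exists only in English and French is dropped even though it would be perfectly usable as a bilingual pair, which is where the bias toward bilingual data tends to come from.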
 
The presentation will combine both:
- Theory: using current web technologies
- Practice: the experience of cleaning a large corpus