Web Data

Nature of the resources

The nature of the resources must be taken into consideration. This section illustrates some of those resources; it is not exhaustive.

Ultrapersistent URI

URI creation has to take into account all identification scenarios: original site, archival sites, and offline data; they might exist in parallel. One has to consider aspects related to the data, such as data organisation and long-term data preservation [[DOCP]] [[RFC4810]], though the data itself is out of scope.

Scheme processing capability is the property of a scheme that allows data to be processed before delivery. The processing capability could be in the server, the client, or both. For example, a browser might or might not have a JavaScript engine.

Examples:

http: server and browser side
file: browser side

Official data server

An official data server is a web server maintained by an organisation, such as a government, for the publication of authoritative datasets. Often the second or third level domain is called data, as in data.gov or data.gov.uk.

http://data.example.com

The boundary between what should go into an official data server and what should go into a separate domain server is fuzzy, though clearly not everything can go into official data servers. For example, the significant dataset (or group, collection, concept, type, etc.) foo could be in the official data server or in a separate domain server.

http://data.example.com/foo              # in official data server - longer URI
http://foo.example.com                   # separate domain server - shorter URI

Comuri governance

Centralised governance rules must exist only for interoperability. Delegate as much as possible and let each party follow the Comuri philosophy, similar to subsidiarity or Auftragstaktik.

The DNS administrator assigns the appropriate level domain, and the web site and other administrators manage paths: it is up to the administrators at the proper level to follow reasonable mnemonics and rules that comply with the Comuri philosophy. For example, a web site could delegate every path starting with 10 to a department, and the departmental administrator must then follow the Comuri philosophy.
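
The delegation of paths can be sketched as a longest-prefix lookup. This is only an illustration: the table DELEGATIONS, the administrator labels, and the function name are hypothetical, and only the prefix 10 comes from the example above.

```python
# Hypothetical delegation table: path prefix -> responsible administrator.
DELEGATIONS = {
    "": "web site administrator",      # default: the whole site
    "10": "department administrator",  # every path starting with 10
}


def responsible_administrator(path: str) -> str:
    """Return the administrator for a path, using the longest matching prefix."""
    relative = path.lstrip("/")
    # The empty prefix always matches, so there is always a candidate.
    best = max(
        (prefix for prefix in DELEGATIONS if relative.startswith(prefix)),
        key=len,
    )
    return DELEGATIONS[best]


print(responsible_administrator("/1023/report"))
print(responsible_administrator("/20/report"))
```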

The first segment administrator is the entity with the mandate to allocate the first segments of the path, and similarly for the second and subsequent segments. The role is similar to that of the DNS administrator.

Legacy URIs

URIs are forever: a legacy URI should be redirected to the new URI. Redirection services are out of scope.

The appropriate status codes, such as 301 Moved Permanently, should be considered.
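
As a minimal sketch of such a redirect, assuming a server that keeps a lookup table of legacy paths: the table LEGACY_TO_NEW, its single entry, and the function name are hypothetical and only illustrate the idea.

```python
from typing import Optional, Tuple

# Hypothetical mapping from legacy paths to their new URIs.
LEGACY_TO_NEW = {
    "/old/foo": "http://example.com/foo",
}


def redirect_for(path: str) -> Optional[Tuple[int, str]]:
    """Return (status code, Location value) for a legacy path, else None."""
    new_uri = LEGACY_TO_NEW.get(path)
    if new_uri is None:
        return None  # not a known legacy path; handle it normally
    return 301, new_uri  # 301 Moved Permanently


print(redirect_for("/old/foo"))
```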

Data archival

The following data archiving techniques are considered:

Online archival sites
Offline archival
Pack

http://example.com                       # original site
http://example.org/example.com           # online archival
file:///example.com                      # offline archival
http://example.org/example.com.zip       # online pack
file:///example.com.zip                  # offline pack

http://example.com/foo                   # original site
http://example.org/example.com/foo       # online archival
file:///example.com/foo                  # offline archival

The appropriate status codes, such as 301 Moved Permanently, should be considered.
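
The URI patterns listed above can be derived mechanically from the original URI. The sketch below assumes the archive host example.org from the examples and a pack that covers the whole site; the function name and the dictionary keys are hypothetical.

```python
from urllib.parse import urlsplit


def archival_uris(original: str, archive_host: str = "example.org") -> dict:
    """Derive archival URIs from an original http URI, following the patterns above."""
    parts = urlsplit(original)
    host_and_path = parts.netloc + parts.path  # e.g. "example.com/foo"
    return {
        "online archival": f"http://{archive_host}/{host_and_path}",
        "offline archival": f"file:///{host_and_path}",
        # Assumption: a pack holds the whole site, so the path is dropped.
        "online pack": f"http://{archive_host}/{parts.netloc}.zip",
        "offline pack": f"file:///{parts.netloc}.zip",
    }


for kind, uri in archival_uris("http://example.com/foo").items():
    print(kind, uri)
```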

Data packing

Data packing techniques that work with both schemes (file and http) should be considered [[XDOSSIER]].

Data granularity

One must be able to directly identify both a whole database (e.g., 1 TB) and a single record (e.g., 1 kB).

Online and offline data

Online data (http) and offline data (file) refer to the schemes used to access the data. Online data can be processed by the server and the client; offline data can be processed only by the client. One requires techniques that work for both. The simplest and safest is to use static data. There are two approaches:

Same data structure for network and offline data
Data structure transformation between online and offline data – import/export techniques

The simplest approach is to have the same data structure for both. The negative side is that one has to dumb down to the file scheme. For example, for http://example.com a server would usually send the file index.html, whereas the file scheme would produce a directory listing; hence anchors have to be in the form http://example.com/index.html.
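
A minimal sketch of making such anchors explicit follows. It assumes the index file is named index.html and that only URIs whose path is empty or ends in a slash refer to directories; the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def explicit_anchor(uri: str, index_name: str = "index.html") -> str:
    """Append the index file name to a URI whose path is empty or ends in '/'."""
    parts = urlsplit(uri)
    path = parts.path or "/"  # treat an empty path as the root directory
    if path.endswith("/"):
        path += index_name
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(explicit_anchor("http://example.com"))
print(explicit_anchor("http://example.com/foo/"))
```

A path without a trailing slash is left untouched, because the URI alone does not say whether it names a directory.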

Online data

Online data is data accessed with the http scheme.

Online data can also use HTTP mechanisms to request variants, for example TCN, header fields and server configuration.

If other techniques are used, the URI should take priority. For example, if the appropriate header field requests the German variant of the resource and the URI requests the Spanish variant, the server should send the Spanish variant.
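
The priority rule can be sketched as follows, assuming the language variant is requested with a dot extension in the URI path and with the Accept-Language header field. The set AVAILABLE, the default language, and the function name are hypothetical, and the header parsing is deliberately simplified (q-values are ignored).

```python
import posixpath

# Hypothetical set of language variants the server can deliver.
AVAILABLE = {"en", "de", "es"}


def select_language(path: str, accept_language: str = "", default: str = "en") -> str:
    """Pick a language variant; the dot extension in the URI wins over the header."""
    extension = posixpath.splitext(path)[1].lstrip(".")
    if extension in AVAILABLE:
        return extension
    # Simplified header handling: take tags in listed order, ignore q-values.
    for item in accept_language.split(","):
        tag = item.split(";")[0].strip().lower()
        if tag in AVAILABLE:
            return tag
    return default


# The header asks for German, the URI asks for Spanish: Spanish is selected.
print(select_language("/foo.es", "de"))
```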

Offline data

Offline data is data accessed with the file scheme.

Static and dynamic data

It is much easier to keep the same data structure online and offline with static data. Static data is also safer for data preservation; for example, it might not be possible to run some programs in the future if the appropriate environment is not available. There are also negative aspects; for example, the generated static data might be big and some functionality might be lost.

Static data

Static data is data delivered without modification.

file:///foo                    # latest
file:///foo3                   # version 3

Dynamic data

Dynamic data is data resulting from the output of a program. The program could run on the server side, the client side, or both; for example, a CGI program, JavaScript, or a combination of both.

http://example.com/foo
http://example.com/foo3

Multilingual data

The particular case of multilingual data must be resolved in the wider context of web data: language must be treated as a variant in the same fashion as the format [[!MEDIA-TYPES]].

Language neutral URI

A language neutral URI is a URI with little or no natural language content. Language neutral URIs are important for multilingual web data.

The language neutral aspect mostly concerns the path component. The ultimate language neutral URI uses only numbers; the negative side is poor mnemonics.

http://example.com/1234        # language neutral and opaque

http://example.com/london      # URI using the English word london
http://example.com/london.es   # Spanish variant
http://example.com/londres     # BAD Spanish variant

Language identification in URIs

Language variants can be identified in URIs as:

Dot extension
Domain

http://example.com/foo.en      # dot extension
http://en.example.com/foo      # domain - à la Wikipedia
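
Both forms can be recognised with a few lines of code. The sketch below is illustrative only: the set KNOWN_LANGUAGES and the function name are hypothetical, and a real deployment would need its own list of language tags.

```python
import posixpath
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical set of language tags used by the site.
KNOWN_LANGUAGES = {"en", "es", "de"}


def language_of(uri: str) -> Optional[str]:
    """Return the language identified by the URI, or None if there is none."""
    parts = urlsplit(uri)
    # Dot extension, as in http://example.com/foo.en
    extension = posixpath.splitext(parts.path)[1].lstrip(".")
    if extension in KNOWN_LANGUAGES:
        return extension
    # Domain, as in http://en.example.com/foo
    first_label = parts.hostname.split(".")[0] if parts.hostname else ""
    if first_label in KNOWN_LANGUAGES:
        return first_label
    return None


print(language_of("http://example.com/foo.en"))
print(language_of("http://en.example.com/foo"))
```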

Mnemonic

Overview

A mnemonic is a technique to assist the memory.

Mnemonics depend heavily on many aspects: language, culture, profession, age, group, etc. Something might be very mnemonic for one person and totally unmnemonic for another. There are some general rules: short strings are usually more mnemonic; for example, most people should be able to remember 3. One can tailor mnemonics for a certain group at the cost of penalising larger audiences. Targeting larger audiences usually means lowering the mnemonic value for everybody. It is recommended to use standard abbreviations; this might be challenging in a multilingual context. URI guessing is along the same lines.

For a multilingual audience, mnemonics should be as language neutral as possible. Numbers are arguably the most language neutral characters: 3 should be acceptable to Arabic speakers, though some people might insist on ٣. In countries with a less widely known character repertoire, there is a tendency to use numbers for things such as wifi passwords; for example, in Armenia. Numbers can backfire in other ways; for example, http://example.com/mon (for Monday) could be transformed into http://example.com/1 or http://example.com/2, depending on which day is taken as the first day of the week.

Base 36 should be a good compromise for a large number of people around the world. Using base 36 words from widely spoken languages should be acceptable, particularly if the words are spelled the same across languages; for example, exit. One can always use Internationalized Resource Identifiers (IRIs) [[!RFC3987]] with a richer character repertoire, at a price.

http://example.com/中国                   # For Chinese speakers: good mnemonic - non-Chinese speakers: bad mnemonic
http://example.com/china                 # For Chinese speakers: acceptable - non-Chinese speakers: better
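
As an illustration of base 36, the sketch below converts between integers and strings over the digits 0-9 followed by the letters a-z. The function names are hypothetical; the decoding relies on Python's built-in int, which accepts bases up to 36.

```python
import string

DIGITS = string.digits + string.ascii_lowercase  # 0-9 then a-z: 36 symbols


def to_base36(number: int) -> str:
    """Encode a non-negative integer as a lowercase base 36 string."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    if number == 0:
        return "0"
    symbols = []
    while number:
        number, remainder = divmod(number, 36)
        symbols.append(DIGITS[remainder])
    return "".join(reversed(symbols))


def from_base36(text: str) -> int:
    """Decode a base 36 string into an integer."""
    return int(text, 36)


print(to_base36(1234))      # the opaque number 1234 as a shorter token
print(from_base36("exit"))  # a word that is also a valid base 36 numeral
```

Any string made only of ASCII letters and digits is a valid base 36 numeral, which is why words such as exit fit this repertoire.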

Token

Mnemonic tokens should be composed of short pronounceable syllables combined into short words that are pronounceable in most languages; having many consonants together tends to be a hindrance. Numbers with letters acting as separators also work.
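
One rough way to screen candidate tokens is to measure the longest run of consecutive consonants. This is only a heuristic sketch under stated assumptions: it considers ASCII letters only, treats a, e, i, o and u as the vowels, and the function name is hypothetical. It says nothing about pronounceability in any particular language.

```python
import string

VOWELS = set("aeiou")
CONSONANTS = set(string.ascii_lowercase) - VOWELS  # assumption: y counts as a consonant


def longest_consonant_run(token: str) -> int:
    """Return the length of the longest run of consecutive ASCII consonants."""
    longest = current = 0
    for character in token.lower():
        if character in CONSONANTS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # vowels, digits and other characters end the run
    return longest


print(longest_consonant_run("london"))
print(longest_consonant_run("strengths"))
print(longest_consonant_run("12a34"))
```

A site could reject tokens whose run exceeds a chosen threshold; the threshold itself would be a local policy decision.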