Compact Uniform Resource Identifier (COMURI)

COMURI and DEURI are being written in parallel. COMURI is a general approach and DEURI is a subset for http://data.europa.eu.

Comuri is a compact mnemonic URI: it is human and machine friendly; it allows direct identification of variants and of URI metadata; and it covers the full data life cycle, including the archival phase (ultrapersistency).

It is a return to roots: A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource [[!RFC3986]]. Comuri avoids cluttering the URI with metadata, taxonomy, semantics and similar: these are not functions of a URI, though mnemonics are strongly encouraged. Only a limited amount of metadata can be encoded into a URI before it becomes too cumbersome: a better approach is to provide the metadata in human and machine readable format.

This best practice guide explores many challenges concerning URIs. If the proposals put forward are not valid in some circumstances, the guide should still help in finding other solutions.

Introduction

Rationale

The intention is to have compact mnemonic URIs that are easy for the users (humans and machines); URI patterns should be intuitive to facilitate URI guessing. The most common URI pattern should be similar to a shortened URI [[SHORT-URI]]:

http://example.com/foo                     # second level domain and one segment path

Indeed, the existence of URI shortening services is a symptom that something is wrong and that native short URIs are needed: Comuri should not require a shortening service.

Unwarranted complexity must be avoided. For example, only use longer URIs (third level domain, multisegment paths) when it cannot be avoided; many web sites can get by without language and format variants, so avoid this mechanism where it is not needed. Comuri is not intended as an academic exercise: on the contrary, it is a very practical, down-to-earth technique.

Direct identification of the language and format variants in the URI is straightforward using dot extensions. This is current practice and has been available in Apache for over 15 years: again, nothing new, just a recommendation of current practice. Other variants can be addressed using mechanisms such as variant identification with query or Transparent Content Negotiation (TCN) [[!RFC2295]].

The whole approach is to make life easier for the users, so developers might have to work harder. For example, mapping from the internal data structure to URIs might need more work [[APACHE-RW]].

This work is in the context of the Data on the Web Best Practices Working Group, and hence data requirements are very much taken into account. In particular, URI ultrapersistency accompanies the requirement of data preservation, though data preservation itself is out of scope.

The approach is syntactic and it does not specify the semantics of the URI components (domains and path segments); for example, it does not impose on the first path segment the concepts of collection, type, or similar.

Comuri does not break any of the existing standards and it works with existing software, as its rules are only conventions. Only the direct identification of metadata would require finer parameterisation (in some servers) or new development, where the principle of at least two independent implementations must be respected.

Design goals

Compact mnemonic URI
Human and machine friendly
URI guessing should work
Simplest possible characters such as number and lower case letters; base 36 recommended
One path segment, if possible - minimise hierarchy, which is harder to manage and often leads to long URIs
Consider only the most common variants: format and language
Direct identification of variants and metadata
Ultrapersistent URI
Capability to identify online and offline resources
Capability to identify packed data
Granularity - capability to identify different data sizes – whole database and record
Multilingual capability - language neutral URI
Guidelines as opposed to imperative rules

Scope

Comuri focuses on specifying a compact mnemonic URI for the schemes http and file, though it may also be applied to similar schemes such as ftp. The driving considerations are in the design goals.

Identification (i.e., having a URI) is the first step to access data. One has to take into account the nature of the resources identified; in particular, original site, archive site, and offline data. This does not imply that resource creation, management, and associated aspects are in scope; in particular, the following are out of scope: URI preservation, data preservation, redirection, central URI registry, and taxonomy.

Any part of this document stating http must be read as stating http and https, unless otherwise stated. The same applies to ftp.

The following strings are placeholder names (metasyntactic variables): foo, bar, and qux.

Examples

# most comuris should be in this form - avoid unnecessary dot extensions such as "html" and "php"
http://example.com/roma                    # mnemonic for something about Rome
http://example.com/140710                  # mnemonic for 10 July 2014 as YYMMDD
http://eur-lex.europa.eu/2014L205          # EU Official Journal 2014 L205 - "L" folded to "l"

# direct identification of variants
http://example.com/roma.de                 # resource variant German - one extension is the language
http://example.com/140710.xml              # resource variant, XML
http://eur-lex.europa.eu/2014l205.en.html  # EU Official Journal 2014 L205, English, HTML

# longer path to consider data packing
http://eur-lex.europa.eu/2014/l205         # path with two segments
file:///2014/l205                          # two segment path to avoid a large top directory

# URI metadata
http://example.com/roma?                   # URI metadata
http://example.com/roma.de?                # URI metadata of variant German
http://example.com/roma.de.pdf?            # URI metadata of variant German in PDF
file:///2014/l205.comuri.html              # using "metadata file"

Comuri Syntax

Comuri syntax components

Example:

         http://ec.example.com:8080/203040?key=value#foo
         \__/  \__________________/\_____/ \_______/ \_/
          |              |            |        |      |
       scheme        authority       path    query  fragment

This illustration closely follows the one in [[!RFC3986]].
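
As an illustration only, the five components can be recovered with a generic URI parser. The sketch below uses urlsplit from Python's standard urllib.parse module; the function name comuri_components is hypothetical and not part of Comuri.

```python
from urllib.parse import urlsplit


def comuri_components(uri: str) -> dict:
    """Split a URI into the five generic components named in RFC 3986."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,   # host and optional port
        "path": parts.path,
        "query": parts.query,        # without the leading "?"
        "fragment": parts.fragment,  # without the leading "#"
    }


components = comuri_components("http://ec.example.com:8080/203040?key=value#foo")
for name, value in components.items():
    print(f"{name:10} {value}")
```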

Comuri schemes

http

Use for online data. It has server-side and browser-side processing capability, which makes it easier to generate dynamic data. For the URI metadata request, the empty query can be used.

file

Use for offline data, mainly for archival. It lacks server-side processing capability but has browser-side processing capability, so it is better to restrict it to static data. For the URI metadata request, only the metadata file can be used.

Comuri authority

The authority should be at most a third level domain, otherwise comuris would be too long.

http://example.com            # second level domain - preferred - very compact
http://foo.example.com        # third level domain  - acceptable
http://bar.foo.example.com    # fourth level domain - too long

Fourth level domains and beyond should be avoided as they make URIs too long. The use of fourth level domains is mostly due to misplaced ideologies that want URIs to reflect the organisational hierarchy: this is not the function of a URI, and the hierarchy should be in the metadata. One should be particularly careful in the case of URIs for the general public. For example, the URI of a ship register maintained by the European Commission does not need to show the hierarchy of the department.

http://ship.europa.eu                # good - compact
http://ship.dgt-foo.ec.europa.eu     # bad - unnecessarily long
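
The domain level rule above can be checked mechanically. The following is a minimal sketch that simply counts the dot-separated labels of the host. The function names are hypothetical, and the count is naive: it ignores registries that delegate under two labels (for example co.uk) and does not treat IP addresses specially.

```python
from urllib.parse import urlsplit


def domain_level(uri: str) -> int:
    """Return the number of dot-separated labels in the host, 0 if there is none."""
    host = urlsplit(uri).hostname or ""
    return len(host.split(".")) if host else 0


def authority_verdict(uri: str) -> str:
    """Classify the authority length as in the examples of this section."""
    level = domain_level(uri)
    if level == 0:
        return "no authority"   # e.g. file:///2014/l205
    if level <= 2:
        return "preferred"
    if level == 3:
        return "acceptable"
    return "too long"


for uri in ("http://example.com",
            "http://foo.example.com",
            "http://ship.dgt-foo.ec.europa.eu"):
    print(uri, "->", authority_verdict(uri))
```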

Comuri path

Reduced character set are the characters "0-9" "a-z" "-_."

Base 36 character set are the characters "0-9" "a-z". Folding the upper case, the result is a case-insensitive character set.

Visual separator character set are the characters "-_"

Dot separator character is the character "."

Unnecessary trailing strings are strings such as /, php, jsp, asp or cgi.

Language tag is a tag from Tags for Identifying Languages [[!BCP47]].

Language code in two characters is a code from [[!ISO639-1]]. This subset is included in BCP47.

Format tag is a string representing the most commonly used file extension [[LFORMAT]] for the associated media-type [[!MEDIA-TYPES]]; there is no formal specification for file extensions.

Dot extensions are the end strings in the last path segment separated by the dot separator character.

Language extension is the dot extension that indicates the language with language tag or language code in two characters.

Format extension is the dot extension that indicates the format with format tag.

Comuri path should have only one segment; comuri paths of more than one segment are discouraged. With the http scheme, the internal data structure can be mapped to a one-segment URI. More than one segment might be appropriate for offline data with the file scheme, so that data can be grouped into directories to avoid an overly large top directory. It is recommended to use the base 36 character set for the path.
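
The definitions above lend themselves to a simple check. The sketch below, with hypothetical names, folds a path to lower case, counts its segments, and reports whether the path stays within the reduced character set and whether it stays within the base 36 character set once dot extensions are set aside. It is meant as a convenience for authors, not as a normative validator.

```python
import re

BASE36 = re.compile(r"[0-9a-z]+")        # base 36 character set
REDUCED = re.compile(r"[0-9a-z\-_.]+")   # reduced character set


def describe_path(path: str) -> dict:
    """Fold a URI path to lower case and summarise how it fits the Comuri path advice."""
    folded = path.lower()
    segments = [s for s in folded.split("/") if s]
    # Dot extensions live in the last segment only, so set them aside for base 36.
    stems = segments[:-1] + [segments[-1].split(".", 1)[0]] if segments else []
    return {
        "folded": folded,
        "segments": len(segments),
        "reduced": all(REDUCED.fullmatch(s) for s in segments),
        "base36": all(BASE36.fullmatch(s) for s in stems),
    }


print(describe_path("/Roma"))
print(describe_path("/2014/l205.en.html"))
print(describe_path("/2014-09"))
```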

Visual separator

The visual separator character set should be avoided. It is recommended to use a character from the base 36 character set as the visual separator instead, so that the whole string remains a base 36 number; otherwise it becomes unclear whether a visual separator is a blank to be ignored when reading the number or a character that is part of the string. Upper case letters are folded to lower case, hence the comuri path is case-insensitive; in a generic URI only the scheme and authority are case-insensitive.

http://example.com/2014-09               # visual separator "-" 
http://example.com/2014x09               # base 36 number "x" used as a visual separator
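
One practical benefit of using a base 36 character as the separator is that the whole string converts to and from an integer without special cases. A minimal sketch, with hypothetical helper names, follows. Python's built-in int() also tolerates underscores between digits, so the sketch checks the characters explicitly before converting.

```python
import string

DIGITS = string.digits + string.ascii_lowercase   # "0-9" followed by "a-z"


def from_base36(text: str) -> int:
    """Fold to lower case and read the string as a base 36 number."""
    folded = text.lower()
    if not folded or any(ch not in DIGITS for ch in folded):
        raise ValueError(f"not a base 36 string: {text!r}")
    return int(folded, 36)


def to_base36(number: int) -> str:
    """Write a non-negative integer using the base 36 character set."""
    if number < 0:
        raise ValueError("negative numbers are not supported")
    if number == 0:
        return "0"
    digits = []
    while number:
        number, remainder = divmod(number, 36)
        digits.append(DIGITS[remainder])
    return "".join(reversed(digits))


value = from_base36("2014x09")       # "x" acts as a visual separator and as a digit
print(to_base36(value))              # round trip back to the path string
```

With the "-" separator the same conversion fails, which is the complication the recommendation avoids.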

Without dot extensions

The path must not contain unnecessary trailing strings.

http://example.com/1020                  # one segment - preferred
http://example.com/1020/30               # two segments - acceptable, but try to avoid

http://example.com/roma                  # lower case
http://example.com/Roma                  # capital "R" folded into "r", "Roma" = "roma"
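
Case folding and the removal of unnecessary trailing strings can be sketched as follows. The function name is hypothetical, and the list of extensions simply mirrors the definition of unnecessary trailing strings given earlier. A server could apply such a rule when mapping a requested path to the canonical comuri.

```python
UNNECESSARY_EXTENSIONS = {"php", "jsp", "asp", "cgi"}


def tidy_path(path: str) -> str:
    """Fold the path to lower case and drop a trailing slash or technology extension."""
    folded = path.lower()
    if len(folded) > 1:
        folded = folded.rstrip("/") or "/"
    stem, dot, extension = folded.rpartition(".")
    if dot and extension in UNNECESSARY_EXTENSIONS:
        folded = stem
    return folded


for path in ("/Roma", "/roma/", "/roma.php", "/1020/30"):
    print(path, "->", tidy_path(path))
```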

With dot extensions

Dot extensions are reserved for the direct identification of language and format variants as per the pattern:

http://example.com/foo.language.format

Where

language: language tag
format: format tag

For the direct identification of the version variant it is recommended to use variant identification with query. It is not recommended to use a dot extension because it is not a common practice, there would be too many extensions, and there is a lack of standardisation. There is little in HTTP [[!RFC2616]] and related specifications; one has to use the deprecated x- mechanism. Examples of other approaches are: Memento [[RFC7089]] and Wikipedia Page history [[WIKIHIS]].

http://example.com/palma.es              # resource variant, Spanish, format according to negotiation
http://example.com/palma.xml             # resource variant XML
http://example.com/palma.de.pdf          # resource variant, German, PDF

file:///2014/l205.en.html                # two segment path with "file" - acceptable so the top directory is not too big

If there is only one dot extension, servers should be capable of telling whether it is a language or a format. Servers must respond with both extensions, as determined by the negotiation.

http://example.com/palma.es              # request Spanish variant
http://example.com/palma.es.html         # response Spanish-HTML variant

http://example.com/palma.xml             # request variant XML
http://example.com/palma.en.xml          # response variant English-XML
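
The disambiguation of a single dot extension, and the completion of the response with both extensions, can be sketched as below. The tables of known languages and formats are small illustrative subsets, the default language and format stand in for the outcome of a real negotiation, and all names are hypothetical.

```python
KNOWN_LANGUAGES = {"en", "de", "es"}            # illustrative subset of language tags
KNOWN_FORMATS = {"html", "xml", "pdf", "rdf"}   # illustrative subset of format tags


def split_variants(segment: str) -> tuple:
    """Split 'foo.language.format' into (name, language, format); missing parts are None."""
    name, *extensions = segment.lower().split(".")
    language = fmt = None
    if len(extensions) > 2:
        raise ValueError(f"too many dot extensions: {segment!r}")
    if len(extensions) == 2:
        language, fmt = extensions
    elif len(extensions) == 1:
        # A lone extension is a format if it is a known format tag, else a language.
        if extensions[0] in KNOWN_FORMATS:
            fmt = extensions[0]
        else:
            language = extensions[0]
    if language is not None and language not in KNOWN_LANGUAGES:
        raise ValueError(f"unknown language tag: {language!r}")
    if fmt is not None and fmt not in KNOWN_FORMATS:
        raise ValueError(f"unknown format tag: {fmt!r}")
    return name, language, fmt


def response_segment(segment: str, language: str = "en", fmt: str = "html") -> str:
    """Return the segment with both extensions, filling gaps from the given defaults."""
    name, requested_language, requested_format = split_variants(segment)
    return f"{name}.{requested_language or language}.{requested_format or fmt}"


print(response_segment("palma.es"))    # Spanish requested, format filled in
print(response_segment("palma.xml"))   # XML requested, language filled in
```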

Dot character in archival

When archiving, one might transform the dots separating the domain levels into dashes: both approaches have advantages and disadvantages.

http://example.com                       # original site
http://example.org/example.com           # archival site with dot - good: same domain name - bad: dot separator not used for variants
http://example.org/example-com           # archival site with dash - good: no dot separator - bad: different domain name
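
Both archival conventions are a mechanical rewrite of the original host name. The sketch below uses hypothetical names and carries only the host across, as in the examples above. It also exposes a limitation of the dash convention: for a host that already contains a dash, such as eur-lex.europa.eu, the original dots can no longer be told apart from the original dashes after the rewrite.

```python
from urllib.parse import urlsplit


def archive_uri(original: str, archive_base: str, use_dash: bool = False) -> str:
    """Build the archival URI for the site at `original` under `archive_base`."""
    host = urlsplit(original).hostname or ""
    segment = host.replace(".", "-") if use_dash else host
    return f"{archive_base.rstrip('/')}/{segment}"


print(archive_uri("http://example.com", "http://example.org"))
print(archive_uri("http://example.com", "http://example.org", use_dash=True))
print(archive_uri("http://eur-lex.europa.eu", "http://example.org", use_dash=True))
```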

Comuri query

Empty query is a query consisting only of the character "?"; i.e., without data. Its function is to obtain the URI metadata.

This mechanism is a new convention that does not exist in server software and has to be implemented; this only means changing server software such as Apache, it does not mean any change to HTTP [[!RFC2616]]. At present, the response is the same for http://example.com/foo and http://example.com/foo?: the resource.

http://example.com/foo                   # resource
http://example.com/foo?                  # empty query - return the URI metadata for http://example.com/foo
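
An implementation needs to tell http://example.com/foo apart from http://example.com/foo?. The following sketch, with hypothetical function names, works on the raw URI string, because a generic parser such as urlsplit in Python's urllib.parse reports an empty query string in both cases.

```python
def is_metadata_request(uri: str) -> bool:
    """True if the URI carries an empty query, i.e. a "?" followed by no data."""
    without_fragment = uri.split("#", 1)[0]
    _, mark, query = without_fragment.partition("?")
    return mark == "?" and query == ""


def described_resource(uri: str) -> str:
    """Return the URI whose metadata is requested: the part before the "?"."""
    return uri.split("#", 1)[0].partition("?")[0]


for uri in ("http://example.com/foo",
            "http://example.com/foo?",
            "http://example.com/foo?key=value"):
    print(uri, "->", is_metadata_request(uri))
```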

A non-empty query should only be used when a dot extension is not sufficient. The key comuri is reserved to indicate a Comuri query. An empty comuri key indicates a Comuri query. The value no indicates a non-Comuri query. Other values are undefined.

The following query keys are specified; this might be extended:

lang: language variant
format: media-type variant [[!MEDIA-TYPES]]
version: previous version variant (history)

Where values must be:

lang: language tag
format: format tag
version: a string indicating the previous version

http://example.com/hello.rdf                         # dot extension - recommended
http://example.com/hello?format=rdf                  # query parameters - avoid

http://example.com/hello?                            # empty query - return the URI metadata for http://example.com/hello

http://example.com/hello.rdf?                        # empty query, dot extension to indicate the format - recommended
http://example.com/hello?comuri;format=rdf           # query parameters - empty comuri indicates a Comuri query

http://ec.example.com/hello?comuri=no;key=Bar        # "B" is not folded - "comuri" set to no
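
The query conventions above can be sketched as a small parser. The names are hypothetical. The sketch assumes the semicolon separator used in the examples, omits percent-decoding, and assumes that values are folded to lower case only in a Comuri query, as the comuri=no example suggests. A query without the comuri key is reported as "unmarked" because this document does not specify that case.

```python
def parse_comuri_query(query: str) -> dict:
    """Parse a query string (without the leading "?") into its kind and parameters."""
    pairs = {}
    for item in query.split(";"):
        if item:
            key, _, value = item.partition("=")
            pairs[key] = value
    marker = pairs.pop("comuri", None)
    if marker is None:
        kind = "unmarked"      # assumption: case not specified by this document
    elif marker == "":
        kind = "comuri"        # empty comuri key
    elif marker == "no":
        kind = "non-comuri"
    else:
        kind = "undefined"     # other values are undefined
    if kind == "comuri":
        # Assumption: folding applies to Comuri queries only.
        pairs = {key: value.lower() for key, value in pairs.items()}
    return {"kind": kind, "parameters": pairs}


print(parse_comuri_query("comuri;format=rdf"))
print(parse_comuri_query("comuri=no;key=Bar"))
```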

Comuri fragment

Fragment should be used as per the URI specification, though it should be avoided as much as possible.

Terminology

- Terms in the existing standards must be followed
- New terms must be in harmony with the existing terms
- Definitions are considered new terms

New terms

Some are new terms and some are just rewritings of existing terms.

COMURI: Abbreviation of Compact Uniform Resource Identifier. A noun that follows the appropriate language morphology. As a proper noun it must be written as Comuri, as a common noun as comuri. For example, Comuri, Comuris, comuri, comuris.
URI guessing: From a pattern, it should be easy to guess other URIs. For example, if http://example.com/mon gives the Monday weather, http://example.com/tue should give the Tuesday weather.

Terms from URI

[[!RFC3986]]

Uniform Resource Identifier (URI)

A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource.

Uniform Resource Locator (URL)

the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location").

Syntax Components

The generic URI syntax consists of a hierarchical sequence of components referred to as the scheme, authority, path, query, and fragment.

  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
   |   _____________________|__
  / \ /                        \
  urn:example:animal:ferret:nose

Resource

This specification does not limit the scope of what might be a resource; rather, the term "resource" is used in a general sense for whatever might be identified by a URI. Familiar examples include an electronic document, an image, a source of information with a consistent purpose (e.g., "today's weather report for Los Angeles"), a service (e.g., an HTTP-to-SMS gateway), and a collection of other resources. A resource is not necessarily accessible via the Internet; e.g., human beings, corporations, and bound books in a library can also be resources. Likewise, abstract concepts can be resources, such as the operators and operands of a mathematical equation, the types of a relationship (e.g., "parent" or "employee"), or numeric values (e.g., zero, one, and infinity).

Scheme

Each URI begins with a scheme name that refers to a specification for assigning identifiers within that scheme.

Path segment

A path consists of a sequence of path segments separated by a slash ("/") character.

Query

The query component contains non-hierarchical data that, along with data in the path component (Section 3.3), serves to identify a resource within the scope of the URI's scheme and naming authority (if any). The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.

Fragment

The fragment identifier component of a URI allows indirect identification of a secondary resource by reference to a primary resource and additional identifying information. The identified secondary resource may be some portion or subset of the primary resource, some view on representations of the primary resource, or some other resource defined or described by those representations. A fragment identifier component is indicated by the presence of a number sign ("#") character and terminated by the end of the URI.

Terms from HTTP

[[!RFC2616]]

resource: A network data object or service that can be identified by a URI. Resources may be available in multiple representations (e.g. multiple languages, data formats, size, resolutions) or vary in other ways.
content negotiation: The mechanism for selecting the appropriate representation when servicing a request, as described in section 12. The representation of entities in any response can be negotiated (including error responses).
variant: A resource may have one, or more than one, representation(s) associated with it at any given instant. Each of these representations is termed a 'variant'. Use of the term 'variant' does not necessarily imply that the resource is subject to content negotiation.

variant list

A list containing variant descriptions, which can be bound to a transparently negotiable resource.

variant description

A machine-readable description of a variant resource, usually found in a variant list. A variant description contains the variant resource URI and various attributes which describe properties of the variant.

variant resource

A resource from which a variant of a negotiable resource can be retrieved with a normal HTTP/1.x GET request, i.e. a GET request which does not use transparent content negotiation.

list response

A list response returns the variant list of the negotiable resource, but no variant data. It can be generated when the server does not want to, or is not allowed to, return a particular best variant for the request.

Term from Linked Data

[[LINKED-DATA]]

Useful information: When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)

[[RDF-PRIMER]] [[sparql11-overview]]

Terms from Web Data

[[WEBDATA]]

Mnemonic: See the section Mnemonic.

Echelons

Compact mnemonic

Direct identification of variant

Direct identification of metadata

Ultrapersistent URI