COMURI and DEURI are being written in parallel. COMURI is a general approach and DEURI is a subset for http://data.europa.eu.
Comuri is a compact mnemonic URI: human and machine friendly; it allows direct identification of variants and URI metadata; it covers the full data life-cycle, including the archival phase (ultrapersistency).
It is a return to roots:
A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource [[!RFC3986]].
Comuri avoids cluttering with metadata, taxonomy, semantics and similar: these are not the URI functions, though mnemonics are strongly encouraged.
The amount of metadata that can be encoded into a URI is limited before the URI becomes too cumbersome: a better approach is to get the metadata in human and machine readable format.
This best practice guide explores many challenges about URIs. If the proposals put forward are not valid for some circumstances, it should help to find other solutions.
The intention is to have compact mnemonic URIs easy for the users (humans and machines); URI patterns should be intuitive to facilitate URI guessing. The most common URI pattern should be similar to shortened URI [[SHORT-URI]]:
http://example.com/foo # second level domain and one segment path
Indeed, the existence of URI shortening services is a symptom that something is wrong and that native short URIs are needed: Comuri should not require a shortening service.
Unwarranted complexity must be avoided. For example, only use longer URIs (third level domain, multisegment paths) when it cannot be avoided; many web sites can get by without language and format variants, so avoid this mechanism. It is not intended as an academic exercise: on the contrary, it is a very practical down to earth technique.
Direct identification in the URI of the language and format variants is straightforward using dot extensions. It is current practice and has been in Apache for over 15 years: again, nothing new, just a recommendation of current practice. Other variants can be addressed using mechanisms such as variant identification with query or Transparent Content Negotiation (TCN) [[!RFC2295]].
The whole approach is to make life easier for the users, so developers might have to work harder. For example, mapping the internal data structure to URIs might need more work [[APACHE-RW]].
This work is in the context of the Data on the Web Best Practices Working Group and hence data requirements are very much taken into account. In particular, URI ultrapersistency accompanies the requirement of data preservation, though data preservation itself is out of scope.
The approach is syntactic and it does not specify the semantics of the URI components (domains and path segments); for example, it does not impose on the first path segment the concepts of collection, type, or similar.
Comuri does not break any of the existing standards and it works with existing software, since it consists only of conventions. Only the direct identification of metadata would require finer parameterisation (in some servers) or new development, where the principle of at least two independent implementations must be respected.
Comuri focuses on specifying a compact mnemonic URI for the schemes http and file, though it may also be applied to similar schemes such as ftp.
The driving considerations are in the design goals.
Identification (i.e., having a URI) is the first step to access data. One has to take into account the nature of the resources identified; in particular, original site, archive site, and offline data. This does not imply that resource creation, management, and associated aspects are in scope; in particular, the following are out of scope: URI preservation, data preservation, redirection, central URI registry, and taxonomy.
Any part of this document stating http must be read as stating http and https, except if otherwise stated. Similarly for ftp.
The following strings are placeholder names (metasyntactic variables): foo, bar, and qux.
# most comuris should be in this form - avoid unnecessary dot extensions such as "html" and "php"
http://example.com/roma # mnemonic for something about Rome
http://example.com/140710 # mnemonic for 10 July 2014 as YYMMDD
http://eur-lex.europa.eu/2014L205 # EU Official Journal 2014 L205 - "L" folded to "l"
# direct identification of variants
http://example.com/roma.de # resource variant German - one extension is the language
http://example.com/140710.xml # resource variant, XML
http://eur-lex.europa.eu/2014l205.en.html # EU Official Journal 2014 L205, English, HTML
# longer path to consider data packing
http://eur-lex.europa.eu/2014/l205 # path with two segments
file:///2014/l205 # two segment path to avoid a large top directory
# URI metadata
http://example.com/roma? # URI metadata
http://example.com/roma.de? # URI metadata of variant German
http://example.com/roma.de.pdf? # URI metadata of variant German in PDF
file:///2014/l205.comuri.html # using "metadata file"
The echelons are in ascending order:
One should aim for the highest echelon, though even attaining lower echelons represents a gain.
The essence of Comuri is to be compact and mnemonic: this echelon should be attained. This can be implemented with existing software; servers doing a direct mapping between the internal data structure and the URIs might require finer parameterisation [[APACHE-RW]].
Direct identification variant is the URI syntactic mechanism to identify variants.
http://example.com/foo.en.pdf # English, PDF - supported by current software such as Apache
This is nothing new: using content negotiation or direct identification in the URI are just two complementary techniques to get variants.
Direct identification of metadata is the URI syntactic mechanism to identify variants and URI metadata.
http://example.com/foo? # metadata - requires implementation in the server software such as Apache
Direct identification refers to both: direct identification variant and direct identification of metadata.
Ultrapersistent URI covers the full life-cycle: original site, archiving into archival sites, and offline data.
Example:
http://ec.example.com:8080/203040?key=value#foo
\__/   \_________________/\_____/ \_______/ \_/
 |              |            |        |      |
scheme      authority      path     query fragment
Closely follows the illustration in [[!RFC3986]].
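The component breakdown above can be reproduced with a standard library parser. The following is a minimal sketch in Python, assuming `urllib.parse.urlsplit` is an acceptable stand-in for the generic syntax of [[!RFC3986]]; the helper name `uri_components` is only illustrative and not part of Comuri.

```python
from urllib.parse import urlsplit


def uri_components(uri: str) -> dict:
    """Split a URI into the five generic components (illustrative helper)."""
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "authority": parts.netloc,  # host plus optional port
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


components = uri_components("http://ec.example.com:8080/203040?key=value#foo")
for name, value in components.items():
    print(f"{name:10} {value}")
```

Note that the parser keeps the leading slash in the path and drops the "?" and "#" delimiters from the query and fragment.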
Use for online data. It has server-side and browser-side processing capability. It is easier to generate dynamic data. For the URI metadata request, the empty query can be used.
Use for offline data, mainly for archival. It lacks server-side processing but has browser-side processing capability. It is better to restrict it to static data. For the URI metadata request, only the metadata file can be used.
The authority should go at most to the third level domain, otherwise comuris would be too long.
http://example.com # second level domain - preferred - very compact
http://foo.example.com # third level domain - acceptable
http://bar.foo.example.com # fourth level domain - too long
Fourth level domains and beyond should be avoided as they make URIs too long. The use of fourth level domains is mostly due to misplaced ideologies that want to reflect in URIs the organisational hierarchy: this is not the function of URIs and it should be in the metadata. One should be particularly careful in the case of URIs for the general public. For example, the URI of a ship register maintained by the European Commission does not need to show the hierarchy of the department.
http://ship.europa.eu # good - compact
http://ship.dgt-foo.ec.europa.eu # bad - unnecessarily long
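The domain level rule can be checked mechanically. The sketch below is only an illustration: it counts the dot-separated labels of the host, which is a simplifying assumption (it ignores multi-label public suffixes such as "co.uk"), and the function names and verdict strings are hypothetical.

```python
from urllib.parse import urlsplit


def domain_level(uri: str) -> int:
    """Count the labels in the host, e.g. "foo.example.com" has level 3."""
    host = urlsplit(uri).hostname or ""
    return len([label for label in host.split(".") if label])


def authority_verdict(uri: str) -> str:
    """Map the domain level to the wording used in the examples above."""
    level = domain_level(uri)
    if level <= 2:
        return "preferred"
    if level == 3:
        return "acceptable"
    return "too long"


for uri in ("http://example.com", "http://ship.europa.eu",
            "http://ship.dgt-foo.ec.europa.eu"):
    print(uri, "->", authority_verdict(uri))
```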
Reduced character set consists of the characters "0-9", "a-z", and "-_.".
Base 36 character set consists of the characters "0-9" and "a-z". Folding the upper case, the result is a case-insensitive character set.
Visual separator character set consists of the characters "-_".
Dot separator character is the character ".".
Unnecessary trailing strings are strings such as /, php, jsp, asp or cgi.
Language tag is a tag from Tags for Identifying Languages [[!BCP47]].
Language code in two characters is a code from [[!ISO639-1]]. This subset is included in BCP47.
Format tag is a string representing the most commonly used file extension [[LFORMAT]] for the associated media-type [[!MEDIA-TYPES]]; there is no formal specification for file extensions.
Dot extensions are the end strings in the last path segment separated by the dot separator character.
Language extension is the dot extension that indicates the language with language tag or language code in two characters.
Format extension is the dot extension that indicates the format with format tag.
Comuri path should have only one segment. Paths of more than one segment are discouraged.
The http scheme can map the internal data structure to one segment URI.
More than one segment might be appropriate for offline data with the scheme file, so data can be grouped into directories to avoid having a too large top directory.
It is recommended to use the base 36 character set for the path.
The visual separator character set should be avoided. It is recommended to use a base 36 character as the visual separator, so that the whole string remains a base 36 number; otherwise it becomes unclear whether a visual separator is a blank to be ignored when reading the number or a character that is part of the string. Upper case letters are folded to lower case, hence the path is case-insensitive; in the generic URI syntax, only the scheme and host are case-insensitive.
http://example.com/2014-09 # visual separator "-"
http://example.com/2014x09 # base 36 number "x" used as a visual separator
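To illustrate why a base 36 separator keeps the segment a number, the following sketch checks a segment against the base 36 character set and converts it with Python's built-in `int(text, 36)`. The helper names are hypothetical, and folding upper case before the check is done here as the recommendation above suggests.

```python
import string

# The base 36 character set: "0-9" and "a-z".
BASE36 = set(string.digits + string.ascii_lowercase)


def is_base36(segment: str) -> bool:
    """True if the folded segment uses only base 36 characters."""
    folded = segment.lower()
    return bool(folded) and set(folded) <= BASE36


def base36_value(segment: str) -> int:
    """Read the folded segment as a base 36 number."""
    if not is_base36(segment):
        raise ValueError(f"not a base 36 segment: {segment!r}")
    return int(segment.lower(), 36)


print(is_base36("2014x09"))   # "x" is itself a base 36 digit
print(is_base36("2014-09"))   # "-" is outside the base 36 set
print(base36_value("2014x09"))
```

With "x" as separator the whole segment is still one number; with "-" the segment is no longer a number and needs extra rules for reading it.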
The path must not contain unnecessary trailing strings.
http://example.com/1020 # one segment - preferred
http://example.com/1020/30 # two segments - acceptable, but try to avoid
http://example.com/roma # lower case
http://example.com/Roma # capital "R" folded into "r", "Roma" = "roma"
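The folding and trailing string rules can be sketched as a small path tidying function. This is an illustration under stated assumptions: unnecessary trailing strings other than "/" are assumed to appear as a final dot extension (for example "/roma.php"), and the function name `tidy_path` is hypothetical.

```python
# Assumption: these strings appear as the last dot extension of the path.
UNNECESSARY_EXTENSIONS = ("php", "jsp", "asp", "cgi")


def tidy_path(path: str) -> str:
    """Fold to lower case and drop unnecessary trailing strings."""
    folded = path.lower()
    # Drop trailing slashes, but keep the root path "/" itself.
    while len(folded) > 1 and folded.endswith("/"):
        folded = folded[:-1]
    head, dot, last = folded.rpartition(".")
    if dot and last in UNNECESSARY_EXTENSIONS:
        folded = head
    return folded


for path in ("/Roma", "/roma.php", "/roma/", "/roma.de"):
    print(path, "->", tidy_path(path))
```

Language and format extensions such as ".de" are left untouched, since they carry variant information.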
Dot extensions are reserved for the direct identification of language and format variants as per the pattern:
http://example.com/foo.language.format
Where
language: language tag
format: format tag
For the direct identification of the version variant it is recommended to use variant identification with query. It is not recommended to use a dot extension because it is not a common practice, there would be too many extensions, and standardisation is lacking. There is little in HTTP [[!RFC2616]] and related specifications; one has to use the deprecated x- mechanism. Examples of other approaches are: Memento [[RFC7089]] and Wikipedia Page history [[WIKIHIS]].
http://example.com/palma.es # resource variant, Spanish, format according to negotiation
http://example.com/palma.xml # resource variant XML
http://example.com/palma.de.pdf # resource variant, German, PDF
file:///2014/l205.en.html # two segment path with "file" - acceptable so the top directory is not too big
If there is only one dot extension, servers should be capable of telling a language from a format. Servers must respond with the two extensions, as per the negotiation.
http://example.com/palma.es # request Spanish variant
http://example.com/palma.es.html # response Spanish-HTML variant
http://example.com/palma.xml # request variant XML
http://example.com/palma.en.xml # response variant English-XML
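The request and response pairs above can be sketched as follows. This is a minimal illustration, not a server implementation: the sets of known language and format tags are small hypothetical samples, the negotiated defaults are passed in by the caller, and the function names are invented for the example.

```python
# Hypothetical sample sets; a real server would use BCP47 and its own formats.
KNOWN_LANGUAGES = {"de", "en", "es"}
KNOWN_FORMATS = {"html", "xml", "pdf", "rdf"}


def split_variant(last_segment: str):
    """Return (base, language, format) following foo.language.format."""
    base, *extensions = last_segment.lower().split(".")
    language = fmt = None
    for ext in extensions:
        if ext in KNOWN_LANGUAGES and language is None and fmt is None:
            language = ext          # the language must come before the format
        elif ext in KNOWN_FORMATS and fmt is None:
            fmt = ext
        else:
            raise ValueError(f"unexpected dot extension: {ext!r}")
    return base, language, fmt


def response_name(last_segment: str, negotiated_language: str,
                  negotiated_format: str) -> str:
    """Complete the missing extension with the negotiated value."""
    base, language, fmt = split_variant(last_segment)
    return ".".join([base, language or negotiated_language,
                     fmt or negotiated_format])


print(response_name("palma.es", "en", "html"))
print(response_name("palma.xml", "en", "html"))
```

Keeping the two tag sets disjoint is what lets the server tell a single extension apart as either a language or a format.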
When archiving, one might transform the dots separating the domain levels into dashes: both approaches have advantages and disadvantages.
http://example.com # original site
http://example.org/example.com # archival site with dot - good: same domain name - bad: dot separator not used for variants
http://example.org/example-com # archival site with dash - good: no dot separator - bad: different domain name
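Both archival conventions can be sketched with one helper. The function name, the `use_dash` flag, and the choice to append the original path after the archived host segment are assumptions made for illustration; the text above only specifies how the domain name is carried over.

```python
from urllib.parse import urlsplit


def archival_uri(original: str, archive_host: str, use_dash: bool = False) -> str:
    """Build an archival URI whose first path segment is the original host."""
    parts = urlsplit(original)
    host = parts.hostname or ""
    segment = host.replace(".", "-") if use_dash else host
    # Assumption: the original path, if any, follows the host segment.
    return f"{parts.scheme}://{archive_host}/{segment}{parts.path}"


print(archival_uri("http://example.com", "example.org"))
print(archival_uri("http://example.com", "example.org", use_dash=True))
```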
Empty query is a query with only the character "?"; i.e., without data. Its function is to obtain the URI metadata. This mechanism is a new convention that does not exist in server software and it has to be implemented; this only means changing server software such as Apache, it does not mean any changes to HTTP [[!RFC2616]]. At present, the response is the same for http://example.com/foo and http://example.com/foo?: the resource.
http://example.com/foo # resource
http://example.com/foo? # empty query - return the URI metadata for http://example.com/foo
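Detecting an empty query needs some care, because common parsers report an empty query string both when the "?" is absent and when it is present without data (Python's `urllib.parse.urlsplit` behaves this way). The sketch below therefore inspects the raw URI string; the function name is hypothetical.

```python
def is_uri_metadata_request(uri: str) -> bool:
    """True if the URI carries an empty query, i.e. a "?" without data."""
    without_fragment = uri.split("#", 1)[0]
    _, mark, query = without_fragment.partition("?")
    return mark == "?" and query == ""


print(is_uri_metadata_request("http://example.com/foo"))
print(is_uri_metadata_request("http://example.com/foo?"))
print(is_uri_metadata_request("http://example.com/foo?key=value"))
```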
Non-empty query should only be used when dot extension is not sufficient.
The key comuri is reserved to indicate a Comuri query. An empty comuri key indicates a Comuri query. The value no indicates a non-Comuri query. Other values are undefined.
The following query keys are specified; this might be extended:
lang: language variant
format: media-type variant [[!MEDIA-TYPES]]
version: previous version variant (history)
Where values must be:
lang: language tag
format: format tag
version: a string indicating the previous version
http://example.com/hello.rdf # dot extension - recommended
http://example.com/hello?format=rdf # query parameters - avoid
http://example.com/hello? # empty query - return the URI metadata for http://example.com/hello
http://example.com/hello.rdf? # empty query, dot extension to indicate the format - recommended
http://example.com/hello?comuri;format=rdf # query parameters - empty comuri indicates a Comuri query
http://ec.example.com/hello?comuri=no;key=Bar # "B" is not folded - "comuri" set to no
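The query conventions above can be sketched with a small parser for ";"-separated pairs. The function names and the returned labels are hypothetical; treating a query with no comuri key as "undefined" is an assumption, since the text only defines the empty value and the value no.

```python
def parse_query(query: str) -> dict:
    """Split a ";"-separated query into a key/value mapping (values as-is)."""
    pairs = {}
    for item in query.split(";"):
        if item:
            key, _, value = item.partition("=")
            pairs[key] = value      # a bare key such as "comuri" maps to ""
    return pairs


def comuri_flag(query: str) -> str:
    """Classify a query as "comuri", "non-comuri" or "undefined"."""
    pairs = parse_query(query)
    if pairs.get("comuri") == "":
        return "comuri"
    if pairs.get("comuri") == "no":
        return "non-comuri"
    return "undefined"              # assumption: includes a missing key


print(comuri_flag("comuri;format=rdf"))
print(comuri_flag("comuri=no;key=Bar"))
print(parse_query("comuri=no;key=Bar")["key"])
```

Values are not folded here, which matches the last example above where "Bar" keeps its capital "B" in a non-Comuri query.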
Fragment should be used as per the URI specification, though it should be avoided as much as possible.
A resource may be available in several variants.
http://example.com/foo # resource
http://example.com/foo.de # German variant
http://example.com/foo.en # English variant
http://example.com/foo.es # Spanish variant
http://example.com/foo.de.html # German variant in HTML
http://example.com/foo.de.pdf # German variant in PDF
Variant dimensions are the types of data representation of a resource, such as language and format.
The following commonly-used variant dimensions are considered:
Other dimensions, such as screen size, have to be addressed with TCN or other mechanism.
Variant identifications are the mechanisms to directly identify variants.
The variant identification mechanisms are:
Dot extension: for language and format; example http://example.com/foo.de; version cannot use this mechanism.
Variant identification with query: example http://example.com/foo?lang=de;format=xml;version=5
Variant request must take precedence in any negotiation. For example, over the parameters in the header fields.
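The precedence rule can be sketched for the language dimension. This is a deliberately naive illustration: it takes the first language range of an Accept-Language header and ignores q-values, and the function name and the default language are assumptions.

```python
from typing import Optional


def select_language(uri_language: Optional[str],
                    accept_language: str,
                    default: str = "en") -> str:
    """Pick the language: the variant in the URI wins over the header."""
    if uri_language:
        return uri_language
    # Naive header handling: first listed range, q-values ignored.
    for item in accept_language.split(","):
        tag = item.split(";", 1)[0].strip()
        if tag and tag != "*":
            return tag.lower()
    return default


print(select_language("de", "en-GB,en;q=0.8"))   # URI variant takes precedence
print(select_language(None, "es, en;q=0.5"))     # falls back to the header
```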
URI metadata is the metadata associated with the resource, such as the set that follows the Dublin Core [[!RFC5013]].
URI metadata request is the mechanism to get the URI metadata. For http, use the empty query. For file, use the metadata file.
Metadata file is the file with the URI metadata; it is recommended to use a combined human-machine format.
The string comuri is reserved at the end of the path or when it precedes the format extension.
http://example.com/foo? # "URI metadata request" using the "empty query"
file:///foo.comuri.html # "URI metadata request" using a comuri "metadata file"
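The two request mechanisms can be combined in one helper that derives the metadata URI from a resource URI. The function name and the default "html" metadata format are assumptions for illustration.

```python
from urllib.parse import urlsplit


def metadata_uri(uri: str, metadata_format: str = "html") -> str:
    """Return the URI metadata request for an http(s) or file URI."""
    scheme = urlsplit(uri).scheme
    # Work on the URI without any existing query or fragment.
    base = uri.split("#", 1)[0].split("?", 1)[0]
    if scheme in ("http", "https"):
        return base + "?"                           # empty query
    if scheme == "file":
        return f"{base}.comuri.{metadata_format}"   # metadata file
    raise ValueError(f"scheme not covered by this sketch: {scheme!r}")


print(metadata_uri("http://example.com/foo"))
print(metadata_uri("file:///2014/l205"))
```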
To identify whether a server supports URI metadata, check whether the response contains one element with id="comuri".
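A client-side check for the id="comuri" element could look like the sketch below, assuming the metadata response is HTML. It uses the standard `html.parser` module; the class and function names are hypothetical, and requiring exactly one match is an interpretation of "one element".

```python
from html.parser import HTMLParser


class ComuriIdFinder(HTMLParser):
    """Count the elements whose id attribute equals "comuri"."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("id") == "comuri":
            self.count += 1


def has_comuri_element(html_text: str) -> bool:
    """True if the document contains exactly one element with id="comuri"."""
    finder = ComuriIdFinder()
    finder.feed(html_text)
    finder.close()
    return finder.count == 1


print(has_comuri_element('<html><body><div id="comuri">...</div></body></html>'))
print(has_comuri_element('<html><body><p id="other">...</p></body></html>'))
```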
URI metadata structure is a data structure appropriate for the URI metadata.
- Dublin Core
- Previous versions
- Language, format variants
- Terms in the existing standards must be followed
- New terms must be in harmony with the existing terms
- Definitions are considered new terms
Some are new terms and some are just rewritings of existing terms.
If http://example.com/mon gives the Monday weather, http://example.com/tue should give the Tuesday weather.
[[!RFC3986]]
A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource.
the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism (e.g., its network "location").
The generic URI syntax consists of a hierarchical sequence of components referred to as the scheme, authority, path, query, and fragment.
  foo://example.com:8042/over/there?name=ferret#nose
  \_/   \______________/\_________/ \_________/ \__/
   |           |            |            |        |
scheme     authority       path        query   fragment
   |   _____________________|__
  / \ /                        \
  urn:example:animal:ferret:nose
This specification does not limit the scope of what might be a resource; rather, the term "resource" is used in a general sense for whatever might be identified by a URI. Familiar examples include an electronic document, an image, a source of information with a consistent purpose (e.g., "today's weather report for Los Angeles"), a service (e.g., an HTTP-to-SMS gateway), and a collection of other resources. A resource is not necessarily accessible via the Internet; e.g., human beings, corporations, and bound books in a library can also be resources. Likewise, abstract concepts can be resources, such as the operators and operands of a mathematical equation, the types of a relationship (e.g., "parent" or "employee"), or numeric values (e.g., zero, one, and infinity).
Each URI begins with a scheme name that refers to a specification for assigning identifiers within that scheme.
A path consists of a sequence of path segments separated by a slash ("/") character.
The query component contains non-hierarchical data that, along with data in the path component (Section 3.3), serves to identify a resource within the scope of the URI's scheme and naming authority (if any). The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.
The fragment identifier component of a URI allows indirect identification of a secondary resource by reference to a primary resource and additional identifying information. The identified secondary resource may be some portion or subset of the primary resource, some view on representations of the primary resource, or some other resource defined or described by those representations. A fragment identifier component is indicated by the presence of a number sign ("#") character and terminated by the end of the URI.
[[!RFC2616]]
A network data object or service that can be identified by a URI. Resources may be available in multiple representations (e.g. multiple languages, data formats, size, resolutions) or vary in other ways.
The mechanism for selecting the appropriate representation when servicing a request, as described in section 12. The representation of entities in any response can be negotiated (including error responses).
A resource may have one, or more than one, representation(s) associated with it at any given instant. Each of these representations is termed a 'variant'. Use of the term 'variant' does not necessarily imply that the resource is subject to content negotiation.
A list containing variant descriptions, which can be bound to a transparently negotiable resource.
A machine-readable description of a variant resource, usually found in a variant list. A variant description contains the variant resource URI and various attributes which describe properties of the variant.
A resource from which a variant of a negotiable resource can be retrieved with a normal HTTP/1.x GET request, i.e. a GET request which does not use transparent content negotiation.
A list response returns the variant list of the negotiable resource, but no variant data. It can be generated when the server does not want to, or is not allowed to, return a particular best variant for the request.
[[LINKED-DATA]]
When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
[[RDF-PRIMER]] [[sparql11-overview]]
[[WEBDATA]]