EXTUUIDS: Encoding common identifiers into UUIDs (draft)

Alexandru Dan Corlan

alexandru@corlan.net

October 12, 2012, last change October 18, 2012

Archived by WebCite® at http://www.webcitation.org/6BVGUDscA Accessed: 2012-10-18.


Abstract

This specification defines a compatible extension of the UUID encoding
scheme detailed in RFC-4122. It folds into the UUID format numerous
establoshed identifier schemes, including but not limited to: numbers,
physiscal units and ammounts, colors, instants in time, publications,
chemicals, various catalogs-for example of astronomical objects.

Versions of UUIDs specified by RFC-4122 provide for randomly generated
identifiers that could very rarely colide. Sometimes however, the object
that is identified already has its well established identifier. Generating
a sepparate UUID for it would result into the need to maintain a
correspondence database.

This draft document describes a scheme for folding the numerous well
established digital identification schemes currently in use into a new
version of the RFC-4122 UUID. The scheme fits into a new UUID version
(could be version 6 or another version). It is currently called EXTUUID.

Besides providing a compact identification tool, the proposed scheme also
allows efficient representation, addressing and programming interfaces for
a large databases consisting of heterogenous vectors of such data.

There is a current implementation effort in the form of the corlpack
project. This proposal for a network version of the corlid (which is
architecture-specific and not as broad) is being developed in parallel
with that implementation.

The homepage of this document is:
http://dan.corlan.net/software/corlpack/extuuid.html

Contributions and comments are welcome.

  Contents

1 Introduction
2 Motivation
3 Specification
 3.1 Type
 3.2 Character encoding schemes
  3.2.1 Scheme A. 6 bit upper and lower case alphanumeric
  3.2.2 Scheme B. 6 bit upper (any) case, decimal number and punctuation
  3.2.3 Scheme C. 5 bit upper (any) case and punctuation
  3.2.4 Scheme D. Packed alphabetic
  3.2.5 Scheme E. EBCDIC-6
  3.2.6 Scheme F. Packed alphanumeric
 3.3 Symserial
 3.4 Fixed
 3.5 Vectors
 3.6 Mapping into the 128 bit UUID format
 3.7 Human readable printed representation
  3.7.1 Minimal and smart parser
  3.7.2 Symbols
  3.7.3 Serial numbers
  3.7.4 Fixed numbers
  3.7.5 Vectors
 3.8 Algorithm for determining the type given printed identifier syntax
4 General implementation
 4.1 App types vs representation types
  4.1.1 Example: a currency value
5 History of this document
6 References

  1 Introduction

Various types of objects, such as numbers, symbols, serial numbers,
ammount, country codes, chemical substances, books and periodicals,
already have one or more well established numbering schemes. Further
encoding them with randomly generated UUIDs results in unnecessary
inefficiencies including the necessity of maintaining a sepparate database
of correspondences with the well known identities.

URN names [RFC] cover the uniform specification of many of these objects.
However, they also have unlimited, variable length which results into
further efficiency issues with both computational and communication use.
There are numerous other identification schemas that are not yet covered
by URNs.

Here we propose a representation format for producing unique 128 bit
identifiers that correspond to data objects that are relatively widely
used.

  2 Motivation

The primary motivation behind adoption of this scheme is the reduction of
the complexity of application programming interfaces that are otherwise
needed to store and communicate data of the types described below.

UUIDs generated based on this proposal are entirely compatible with
version 1-5 UUIDs and may be represented syntactically in the same way.

A meaningful form for reading and writing by humans is also specified.

  3 Specification

The RFC-4122 UUIDs use 6 bits for version and variant. This specification
covers the encoding of the other 122 bits, when the variant is 10x and the
version is 110 (version 6).

The 122 bits are specified as a continuous string from 0, representing the
most significant, to 121, representing the least significant.

The mapping of the 122 bits into the UUID format is described in
section 3.6.

The 122 bit string is designated below as EXTUUID.

The section on printed representation of EXTUUIDS, section 3.7 actually
specifies what data can be encoded as EXTUUIDS, the sections below only
specify how the encoding is assembled into the EXTUUID.

 3.1 Type

Types are represented by the first 2 bits.

00: Symserial
        is a character string (symbol) followed by an optional integer
        number, for existing numbering schemes such as PMID, ISSN, ISBN,
        MAC address, etc, but also for names and other symbols (when the
        number part is not used).

01: Fixed
        is composed, like a simserial, from a character string and a
        number, but the number is a fixed point decimal.

10: Vector or quant
        a vector of numbers with a context field specifying the meaning of
        the numbers; a quant is a specific type of inhomogeneously
        represented vector, consisting of a number accompanied by a unit
        of measure of a physical measurable that can be expressed as a
        product of SI units at integer powers (from -7 to 8), at
        fractional powers with a denominator of 1 to 7.

11: Reserved

 3.2 Character encoding schemes

In order to optimize the available space, a variety of character encoding
schemes are used in EXTUUIDs. These are specified below.

   3.2.1 Scheme A. 6 bit upper and lower case alphanumeric

Each character is encoded on 6 bits, as follows (hex encoding and the
corresponding graphic character).

00 A  06 G  0c M  12 S  18 Y  1e 2  24 8  2a e  30 k  36 q  3c w  
01 B  07 H  0d N  13 T  19 Z  1f 3  25 9  2b f  31 l  37 r  3d x  
02 C  08 I  0e O  14 U  1a .  20 4  26 a  2c g  32 m  38 s  3e y  
03 D  09 J  0f P  15 V  1b _  21 5  27 b  2d h  33 n  39 t  3f z  
04 E  0a K  10 Q  16 W  1c 0  22 6  28 c  2e i  34 o  3a u  
05 F  0b L  11 R  17 X  1d 1  23 7  29 d  2f j  35 p  3b v

   3.2.2 Scheme B. 6 bit upper (any) case, decimal number and punctuation

00 A  06 G  0c M  12 S  18 Y  1e 2  24 8  2a %  30 /  36 <  3c [  
01 B  07 H  0d N  13 T  19 Z  1f 3  25 9  2b ^  31 \  37 =  3d ]  
02 C  08 I  0e O  14 U  1a .  20 4  26 !  2c &  32 ~  38 >  3e (  
03 D  09 J  0f P  15 V  1b _  21 5  27 @  2d +  33 ,  39 ?  3f )  
04 E  0a K  10 Q  16 W  1c 0  22 6  28 #  2e -  34 '  3a |  
05 F  0b L  11 R  17 X  1d 1  23 7  29 $  2f *  35 `  3b

   3.2.3 Scheme C. 5 bit upper (any) case and punctuation

00 A  06 G  0c M  12 S  18 Y  1e *  
01 B  07 H  0d N  13 T  19 Z  1f /  
02 C  08 I  0e O  14 U  1a .  
03 D  09 J  0f P  15 V  1b _  
04 E  0a K  10 Q  16 W  1c +  
05 F  0b L  11 R  17 X  1d -

   3.2.4 Scheme D. Packed alphabetic

The number of bits needed to represent the 26 any case letters in base 26
is:

m = ceil(4.7004398 * n)

Where ceil(x) is the smallest integer number larger or equal to x. The
base26 numbers are as in scheme C, up to position 0x19 (26).

For example, 10 characters take 47 bits with this scheme.

   3.2.5 Scheme E. EBCDIC-6

This is an already existing standard that fits the 6-bit needs we tried to
fill with scheme B. It is not currently supported, but it might be in the
future.

   3.2.6 Scheme F. Packed alphanumeric

The minimum number of bits needed to represent characters in base 38 is:

m = ceil(5.2479276 * n)

Where ceil(x) is the smallest integer number larger or equal to x.

The base28 numbers are as in scheme A and B, up to position 0x25 (38).

For example, 10 characters take 53 bits with this scheme.

 3.3 Symserial

Symserial representation consists of 4 fields: c (encoding) on 3 bits, l
(length) on 5 bits, s (symbol), that is l*5 or l*6 bits or another length
given by the respective formula depending on the encoding, and q, the
serial number, that is on the rest of the bits, considered as an unsigned
integer.

If q is 0, then the uuid represents just a symbol.

The smallest possible named symserial has one 5 bit character, the c+l
that make up 8 bits and a 105 bit integer that allows values of a bit
above 4**31 to be represented.

The longest name with a fixed encoding has 22 characters. Two bits are
left for q. With packed encoding, 23 characters are possible, thus there
isn't much to gain. With 6 characters encodings, up to 18 charactes are
available.

For example, to encode the Messier catalogue, a historic, fixed list of
sky objects that cannot increase, one would need 110 entries that take 7
bits. Thus, a string to identify the catalogue could be up to 21
characters with a 5 bit encoding or 17 with a 6 bit encoding.

 3.4 Fixed

The fixed representation consists of 5 fields: c (encoding) on 3 bits, l
(length) on 5 bits, e (number of decimal digits after the decimal point)
on 5 bits, s (symbol), that is l*5 or l*6 bits or another length given by
the respective formula depending on the encoding, and q, the quantity,
that is on the rest of the bits, considered as an signed integer.

The ammount represented is q * 10 ** (-e).

For example, using encoding B, and using a 4 letter string to represent a
real currency, the first three letters being the 3-letter code of the
currency (for example GBP) and the last being '$' to indicate we are
refering to a currency, this would leave 120-3-5-5-24 = 83 bits that can
represent numbers up to 9.6714066 * 10 ** 24, that is almost 25
significant digits.

 3.5 Vectors

Vectors may have a number of subtypes. The first three bits represent the
subtype, y:

000 - int: single signed integer
        sign, followed by 116 bits.

001 - q: quant

        Quants are floating point numbers that have an associated unit of
        measure for a physical quantity, that is covertible into SI units.

        It consists of 7 `xponents' for the 7 fundamental SI units, x0
        thorugh x6, one 64 bit floating point number, f (IEEE double
        precision) and a 4 bit context field. They are stored as follows:

        +------------------------+---+---------+  
        | component              |   | bits    |  
        +------------------------+---+---------+  
        +------------------------+---+---------+  
        | context field          | a | 5--8    |  
        +------------------------+---+---------+  
        | xponents               | x | 9-57    |  
        +------------------------+---+---------+  
        | value                  | v | 58-121  |  
        +------------------------+---+---------+

        The xponents correspond, each, to a fundamental SI unit, as
        follows:

        +-------------+----------------+--------+-------+  
        | xponent     | SI unit        | abbrev | bits  +  
        +-------------+----------------+--------+-------+  
        | x0          | metre          |   m    |  9-15 |  
        | x1          | kilogram       |   kg   | 16-22 |  
        | x2          | second         |   s    | 23-29 |  
        | x3          | Ampere         |   A    | 30-36 |  
        | x4          | Kelvin         |   K    | 37-43 |  
        | x5          | candela        |   cd   | 44-50 |  
        | x6          | mole           |   mol  | 51-57 |  
        +-------------+----------------+--------+-------+

        The 7 bits used are: sign (s) on 1 bit, followed by a 3 bit
        numerator, n, and a 3 bit denominator, d. It means that the power
        to which the respective unit is raised is x =  -1sn∕d. For
        example, if x0 is recorded in binary as: 1001010 it means s=1
        (that is minus), n=1, d=2. If x1-x6 are 0, then the unit of
        measure is m -1∕2 or 1∕m2.

        A Joule is represented with an x0 reading 0010001, an x1 reading
        0001001, and x2 reading 1010001 and the rest 0, meaning
        kgm2∕s2 or Nm.

        A Watt is like a Joule except x2 reads 1011001, meaning
        J∕s.

010 - f: single float
        the exponent is (the last) 16 bits. There remains one sign bit and
        an 100 bit mantissa providing for 30 significant digits.

011 - t: time
        A field, l, of 7 bits specifies the number of bits that are used
        to represent fractions of a second. The next fields consists of a
        sign, and then 109-l bits, the number of full seconds since (or
        before, if the sign is 1) october 15, 1582. The last l bits
        represent fractions of a second since the last full second, in
        units of 2 -l.

        The best accuracy that can be specified for current events is the
        number of time intervals that fit in between 1582 and the relevant
        date using 116 bits.

        Until year 2200 this is (using 4):

          ((2200 - 1582) * 3600 * 24 * 365.25) / (2^109) = approx. 3.0048508E-23

        It is a about 30 yoctoseconds, 30 millionthes of an attosecond.
        This would be achieved using

        log2((2200 - 1582) * 3600 * 24 * 365.25) = approx. 34.182947

        That is 35 bits before the binary point and 71 after the point.

        The initial and most obvious option was that the signed 116 bits
        integer represents femtoseconds since the start of the Gregorian
        calendar (15 October 1582). It identifies instants every
        femtosecond for:

        2116 3600  * 24  * 365.2515 = 2.6325433  * 1012years

        before and after that event, which means 2632 thousand billion
        years, much more than the 13 billion years that the universe is
        supposed to have existed.

        However, this didn't look fine; attosecond laser pulses are
        produced and recorded when they occur could in principle be
        desirable; we don't need this resolution more than 2 mil years
        before and after today, so even zepto-seconds would be fine; also,
        116 bits are hard to parse and handle; we could allow 64 bits in a
        second, that is about 54.21 zepto-seconds and the higher integer
        could provide for 2116 -64 that is about 100 million years. This
        does not, then, allow for representation of the mere history of
        the Cambrian period, not to speak about the emergence of galaxyes,
        and then 54.21 zepto-seconds may be quite inaccurate to fix
        attoseconds in time.

        But, then, we don't need to, and cannot map the cambrian period
        with attosecond accuracy.

        So it resulted that a some kind of float was needed.

100 - int: generic signed integer vector
        n, 7 bits, number of signed integers; l, 7 bits, size in bits of
        each integer; z, context field, 104-n*l, then the n elements of
        the vector, l bits each, the first of which is the sign.

101 - uint: generic unsigned integer vector
        n, 7 bits, number of unsigned integers; l, 7 bits, size in bits of
        each integer; z, context field, 104-n*l, then the n elements of
        the vector, l bits each.

110 - f/xx: generic float vector
        n, 7 bits, number of floating points l, 7 bits, size in bits of
        each floating point, e, 5 bits, size of each exponent; z, context
        field, 99-n*l, then the n elements of the vector, l bits each of
        which e are the exponent, 1 sign and l-e-1 mantissa.

111 - standard float vector
        f, format, on 3 bits, specifies either:

             000-ieee-80 floating point number
                     , preceded by a 24 bit context field, z, that is
                     represented as 4 scheme B characters;

             001-ieee-64 floating point number
                     , preceded by a 40 bit context field, z, as 8 scheme
                     C characters;

             010-3 x ieee-32 floating point numbers
                     , preceded by an 8 bit context field, z as a decimal
                     number between 0 and 255;

             011-2 x ieee-32 floating point numbers
                     , preceded by a 40 bit context field, z as 8 scheme C
                     characters;

             100-1 x ieee-32 floating point numbers
                     , preceded by a 72 bit context field, z, that is
                     encoded as 12 scheme B characters;

             101-6 x ieee-16 floating point numbers
                     , preceded by an 8 bit context field, z as a decimal
                     number between 0 and 255;

             110-1-5 ieee-16 floating point number
                     , followed by a 2 bit number that is n-1, the number
                     of floating point numbers, and (102 - n*16) bits
                     (that can be: 32, 48, 64 or 80 bits) that are encoded
                     as 6-bit, scheme B characters.

             111
                     reserved for other schemes with non-uniform length.

 3.6 Mapping into the 128 bit UUID format

 3.7 Human readable printed representation

For each identifier there is an explicit printed representation, that uses
the format scheme:{scheme:}*value that may produce any EXTUUID
representation representation, and an implicit printed representation that
is transformed into an UUID in a deterministic but necesarily obvious way.

The most explicit way to provide such a printed representation is to
specify the first scheme as extuuid. For example: extuuid:int:212323232423
produces the exact 117 bit signed integer; extuuid:quant:22.4*m3 means
22.4 cubic meters.

The same expressions may appear without extuuid: or even without any
qualifier if they are in a context that conducts to them being driven as
inputs into the extuuid parser. However, if presented with "22.4" the
parser would not know for sure if it is a quant, a float, a single number
float vector or standard float vector, etc. It would decide in a
deterministic way following the algorithm described in section 3.8, but to
the user this may be not obvious.

   3.7.1 Minimal and smart parser

A minimal parser will only be able to parse the full canonical syntax
described below. A smart parser may be provided by an application, that
may infer the type and representation from shorter or more convenient
forms. A smart parser can also be implemented as a filter.

   3.7.2 Symbols

A string is interpreted as an extuuid symbol if preceded by any of the
following: extuuid:sym:, extuuid:sym[A-F]:, sym: or sym[A-F]. Encodings
are tried in order, and the first that allows the representation of the
symbol is used.

should not the 5-bit appear first, as it uses the least ammount of memory?
No, because it is also more ambiguous.

   3.7.3 Serial numbers

The generic format:

[extuuid:]serial[A-F]::

where the name is encoded with the specified scheme or the first that
fits. The number must be a decimal unsigned integer that may contain any
number of dashes or underscores that are ignored.

   3.7.4 Fixed numbers

[extuuid:]fixed[A-F][0-32]::[-].

It is like a serial except that it may contain an optional explicit number
of decimal places after the optional encoding and it may be negative. If
the number of decimal places is not given then the minimum such number is
used.

   3.7.5 Vectors

[extuuid:][{int|q|t|uint|f}[/[*[*]]]:][:][,]*[[/*][]]*

(full syntax will have to be provided)

The fields have the following meaning:

int
        an integer number or signed integer vector; if  or  are
        provided then it is a signed integer vector with the specified
        number and length; also, if a name is provided, or a list of
        values-instead of a single value-then it is a signed integer
        vector instead of a single integer; the unit must not occur in
        ints; the value is an optional sign followed by a string of
        decimal digits, with optionally interspersed undescores; the first
        and the last character in a value must be a digit.

uint
        an unsigned integer vector; it may have ,  and a number of
        values, but of course no unit.

q
        quant; it may not have dimensional attributes (n,l,e), nor name,
        but may have any number of units. The unit names are case
        sensitive abbreviations of the fundamental SI units. They are
        preceded by a '*' if the power is positive and a '/' if it is
        negative. The power, if higher than 1, must be given as a single
        digit. Fractional powers are given as /.

t
        time; it may have an "n" attribute representing the number of bits
        to be used for subsecond resolution; it may not have a name; the
        value may be either a decimal number without sign, a single
        decimal dot, representing seconds from the reference time, or it
        may be of the format:

        yyyy[-/]MM[-/]dd;hh:mm:ss.n*

        where yyyy is the year, MM is the month, dd is the day, hh the
        hour (0-23), mm the minute and ss.n* represent the seconds and
        fractions of a second.

        Any of the tailing components may be missing.

        Other formats are acceptable by preceding with a format string and
        ':'. The format string is not yet specified.

f
        a vector of one or more floating point numbers; name and dimension
        fields are accepted, units are not; the shortest representation
        possible for the given constants is used.

 3.8 Algorithm for determining the type given printed identifier syntax

  4 General implementation

 4.1 App types vs representation types

The types mentioned in section 3 are called "representation types".

For a particular use, the application programmer will develop an app type.

   4.1.1 Example: a currency value

The follwing steps must be to considered.

Design. Common currencies are stored using the "Fixed" type, with four
decimals after the decimal point and with a name consisting of three
letters (ISO 4217) followed by '$', encoded with type B. A canonical
example is:

extuuid:fixedB4:GBP$:222.31

Other formats are, however, also accepted for input by a smart parser,
such as GBP222.31, GBP$222.31, 222.31$GBP or 222.31GBP.

An identifier is selected for the type, say it is "money". A restriction
can be established that, for example, only ISO 4127 valid letter
combinations are acceptable.

Implementation An unpacked type for this kind of objects is implemented.
The object is probably a record consisting of a 3-character string and a
64 bit integer number or a fixed point type.

Besides operations for the type (such as addition or multiplication with a
float), constructors, predicates (is it a valid currency?) and selectors
(what is the currency name?) must be provided.

For interfacing with the EXTUUID system, the following functions need to
be provided:

  * Is_Money(U: EXTUUID) return Boolean; tests whether an object is of
    this type.
  * Enpack(M: Money) return Extuuid; packs the money object as an extuuid.
  * Unpack(U: Extuuid) return Money; the reverse of enpack.
  * Is_Money(S: String) return Boolean; tests whether the string is a
    valid input object of the money type (this is the smart parser).
  * Parse(S: String) return Money;
  * Parse(S: String) return Extuuid;
  * Format(U: Extuuid) return String; (this may have more arguments).

The functions that do not use the Money type can be recorded as generic
readers and formatters to be used but procedures that do not directly see
this module.

  5 History of this document

12 oct. first version.

16 oct. posted online

18 oct. revised, section on implementation and reference to corlpack added

  6 References

[1] RFC 4122 A Universally Unique IDentifier (UUID) URN Namespace. P.
Leach, M. Mealling, R. Salz. 2005

[2] RFC 2141 URN Syntax. R Moats. 1997.

[3] http://www.webcitation.org

[4] http://qalculator.sourceforge.net/