alexandru@corlan.net
October 12, 2012, last change October 18, 2012
Archived by WebCite® at http://www.webcitation.org/6BVGUDscA Accessed: 2012-10-18.
Abstract This specification defines a compatible extension of the UUID encoding scheme detailed in RFC-4122. It folds into the UUID format numerous establoshed identifier schemes, including but not limited to: numbers, physiscal units and ammounts, colors, instants in time, publications, chemicals, various catalogs-for example of astronomical objects. Versions of UUIDs specified by RFC-4122 provide for randomly generated identifiers that could very rarely colide. Sometimes however, the object that is identified already has its well established identifier. Generating a sepparate UUID for it would result into the need to maintain a correspondence database. This draft document describes a scheme for folding the numerous well established digital identification schemes currently in use into a new version of the RFC-4122 UUID. The scheme fits into a new UUID version (could be version 6 or another version). It is currently called EXTUUID. Besides providing a compact identification tool, the proposed scheme also allows efficient representation, addressing and programming interfaces for a large databases consisting of heterogenous vectors of such data. There is a current implementation effort in the form of the corlpack project. This proposal for a network version of the corlid (which is architecture-specific and not as broad) is being developed in parallel with that implementation. The homepage of this document is: http://dan.corlan.net/software/corlpack/extuuid.html Contributions and comments are welcome. Contents 1 Introduction 2 Motivation 3 Specification 3.1 Type 3.2 Character encoding schemes 3.2.1 Scheme A. 6 bit upper and lower case alphanumeric 3.2.2 Scheme B. 6 bit upper (any) case, decimal number and punctuation 3.2.3 Scheme C. 5 bit upper (any) case and punctuation 3.2.4 Scheme D. Packed alphabetic 3.2.5 Scheme E. EBCDIC-6 3.2.6 Scheme F. Packed alphanumeric 3.3 Symserial 3.4 Fixed 3.5 Vectors 3.6 Mapping into the 128 bit UUID format 3.7 Human readable printed representation 3.7.1 Minimal and smart parser 3.7.2 Symbols 3.7.3 Serial numbers 3.7.4 Fixed numbers 3.7.5 Vectors 3.8 Algorithm for determining the type given printed identifier syntax 4 General implementation 4.1 App types vs representation types 4.1.1 Example: a currency value 5 History of this document 6 References 1 Introduction Various types of objects, such as numbers, symbols, serial numbers, ammount, country codes, chemical substances, books and periodicals, already have one or more well established numbering schemes. Further encoding them with randomly generated UUIDs results in unnecessary inefficiencies including the necessity of maintaining a sepparate database of correspondences with the well known identities. URN names [RFC] cover the uniform specification of many of these objects. However, they also have unlimited, variable length which results into further efficiency issues with both computational and communication use. There are numerous other identification schemas that are not yet covered by URNs. Here we propose a representation format for producing unique 128 bit identifiers that correspond to data objects that are relatively widely used. 2 Motivation The primary motivation behind adoption of this scheme is the reduction of the complexity of application programming interfaces that are otherwise needed to store and communicate data of the types described below. UUIDs generated based on this proposal are entirely compatible with version 1-5 UUIDs and may be represented syntactically in the same way. A meaningful form for reading and writing by humans is also specified. 3 Specification The RFC-4122 UUIDs use 6 bits for version and variant. This specification covers the encoding of the other 122 bits, when the variant is 10x and the version is 110 (version 6). The 122 bits are specified as a continuous string from 0, representing the most significant, to 121, representing the least significant. The mapping of the 122 bits into the UUID format is described in section 3.6. The 122 bit string is designated below as EXTUUID. The section on printed representation of EXTUUIDS, section 3.7 actually specifies what data can be encoded as EXTUUIDS, the sections below only specify how the encoding is assembled into the EXTUUID. 3.1 Type Types are represented by the first 2 bits. 00: Symserial is a character string (symbol) followed by an optional integer number, for existing numbering schemes such as PMID, ISSN, ISBN, MAC address, etc, but also for names and other symbols (when the number part is not used). 01: Fixed is composed, like a simserial, from a character string and a number, but the number is a fixed point decimal. 10: Vector or quant a vector of numbers with a context field specifying the meaning of the numbers; a quant is a specific type of inhomogeneously represented vector, consisting of a number accompanied by a unit of measure of a physical measurable that can be expressed as a product of SI units at integer powers (from -7 to 8), at fractional powers with a denominator of 1 to 7. 11: Reserved 3.2 Character encoding schemes In order to optimize the available space, a variety of character encoding schemes are used in EXTUUIDs. These are specified below. 3.2.1 Scheme A. 6 bit upper and lower case alphanumeric Each character is encoded on 6 bits, as follows (hex encoding and the corresponding graphic character). 00 A 06 G 0c M 12 S 18 Y 1e 2 24 8 2a e 30 k 36 q 3c w 01 B 07 H 0d N 13 T 19 Z 1f 3 25 9 2b f 31 l 37 r 3d x 02 C 08 I 0e O 14 U 1a . 20 4 26 a 2c g 32 m 38 s 3e y 03 D 09 J 0f P 15 V 1b _ 21 5 27 b 2d h 33 n 39 t 3f z 04 E 0a K 10 Q 16 W 1c 0 22 6 28 c 2e i 34 o 3a u 05 F 0b L 11 R 17 X 1d 1 23 7 29 d 2f j 35 p 3b v 3.2.2 Scheme B. 6 bit upper (any) case, decimal number and punctuation 00 A 06 G 0c M 12 S 18 Y 1e 2 24 8 2a % 30 / 36 < 3c [ 01 B 07 H 0d N 13 T 19 Z 1f 3 25 9 2b ^ 31 \ 37 = 3d ] 02 C 08 I 0e O 14 U 1a . 20 4 26 ! 2c & 32 ~ 38 > 3e ( 03 D 09 J 0f P 15 V 1b _ 21 5 27 @ 2d + 33 , 39 ? 3f ) 04 E 0a K 10 Q 16 W 1c 0 22 6 28 # 2e - 34 ' 3a | 05 F 0b L 11 R 17 X 1d 1 23 7 29 $ 2f * 35 ` 3b 3.2.3 Scheme C. 5 bit upper (any) case and punctuation 00 A 06 G 0c M 12 S 18 Y 1e * 01 B 07 H 0d N 13 T 19 Z 1f / 02 C 08 I 0e O 14 U 1a . 03 D 09 J 0f P 15 V 1b _ 04 E 0a K 10 Q 16 W 1c + 05 F 0b L 11 R 17 X 1d - 3.2.4 Scheme D. Packed alphabetic The number of bits needed to represent the 26 any case letters in base 26 is: m = ceil(4.7004398 * n) Where ceil(x) is the smallest integer number larger or equal to x. The base26 numbers are as in scheme C, up to position 0x19 (26). For example, 10 characters take 47 bits with this scheme. 3.2.5 Scheme E. EBCDIC-6 This is an already existing standard that fits the 6-bit needs we tried to fill with scheme B. It is not currently supported, but it might be in the future. 3.2.6 Scheme F. Packed alphanumeric The minimum number of bits needed to represent characters in base 38 is: m = ceil(5.2479276 * n) Where ceil(x) is the smallest integer number larger or equal to x. The base28 numbers are as in scheme A and B, up to position 0x25 (38). For example, 10 characters take 53 bits with this scheme. 3.3 Symserial Symserial representation consists of 4 fields: c (encoding) on 3 bits, l (length) on 5 bits, s (symbol), that is l*5 or l*6 bits or another length given by the respective formula depending on the encoding, and q, the serial number, that is on the rest of the bits, considered as an unsigned integer. If q is 0, then the uuid represents just a symbol. The smallest possible named symserial has one 5 bit character, the c+l that make up 8 bits and a 105 bit integer that allows values of a bit above 4**31 to be represented. The longest name with a fixed encoding has 22 characters. Two bits are left for q. With packed encoding, 23 characters are possible, thus there isn't much to gain. With 6 characters encodings, up to 18 charactes are available. For example, to encode the Messier catalogue, a historic, fixed list of sky objects that cannot increase, one would need 110 entries that take 7 bits. Thus, a string to identify the catalogue could be up to 21 characters with a 5 bit encoding or 17 with a 6 bit encoding. 3.4 Fixed The fixed representation consists of 5 fields: c (encoding) on 3 bits, l (length) on 5 bits, e (number of decimal digits after the decimal point) on 5 bits, s (symbol), that is l*5 or l*6 bits or another length given by the respective formula depending on the encoding, and q, the quantity, that is on the rest of the bits, considered as an signed integer. The ammount represented is q * 10 ** (-e). For example, using encoding B, and using a 4 letter string to represent a real currency, the first three letters being the 3-letter code of the currency (for example GBP) and the last being '$' to indicate we are refering to a currency, this would leave 120-3-5-5-24 = 83 bits that can represent numbers up to 9.6714066 * 10 ** 24, that is almost 25 significant digits. 3.5 Vectors Vectors may have a number of subtypes. The first three bits represent the subtype, y: 000 - int: single signed integer sign, followed by 116 bits. 001 - q: quant Quants are floating point numbers that have an associated unit of measure for a physical quantity, that is covertible into SI units. It consists of 7 `xponents' for the 7 fundamental SI units, x0 thorugh x6, one 64 bit floating point number, f (IEEE double precision) and a 4 bit context field. They are stored as follows: +------------------------+---+---------+ | component | | bits | +------------------------+---+---------+ +------------------------+---+---------+ | context field | a | 5--8 | +------------------------+---+---------+ | xponents | x | 9-57 | +------------------------+---+---------+ | value | v | 58-121 | +------------------------+---+---------+ The xponents correspond, each, to a fundamental SI unit, as follows: +-------------+----------------+--------+-------+ | xponent | SI unit | abbrev | bits + +-------------+----------------+--------+-------+ | x0 | metre | m | 9-15 | | x1 | kilogram | kg | 16-22 | | x2 | second | s | 23-29 | | x3 | Ampere | A | 30-36 | | x4 | Kelvin | K | 37-43 | | x5 | candela | cd | 44-50 | | x6 | mole | mol | 51-57 | +-------------+----------------+--------+-------+ The 7 bits used are: sign (s) on 1 bit, followed by a 3 bit numerator, n, and a 3 bit denominator, d. It means that the power to which the respective unit is raised is x = -1sn∕d. For example, if x0 is recorded in binary as: 1001010 it means s=1 (that is minus), n=1, d=2. If x1-x6 are 0, then the unit of measure is m -1∕2 or 1∕m2. A Joule is represented with an x0 reading 0010001, an x1 reading 0001001, and x2 reading 1010001 and the rest 0, meaning kgm2∕s2 or Nm. A Watt is like a Joule except x2 reads 1011001, meaning J∕s. 010 - f: single float the exponent is (the last) 16 bits. There remains one sign bit and an 100 bit mantissa providing for 30 significant digits. 011 - t: time A field, l, of 7 bits specifies the number of bits that are used to represent fractions of a second. The next fields consists of a sign, and then 109-l bits, the number of full seconds since (or before, if the sign is 1) october 15, 1582. The last l bits represent fractions of a second since the last full second, in units of 2 -l. The best accuracy that can be specified for current events is the number of time intervals that fit in between 1582 and the relevant date using 116 bits. Until year 2200 this is (using 4): ((2200 - 1582) * 3600 * 24 * 365.25) / (2^109) = approx. 3.0048508E-23 It is a about 30 yoctoseconds, 30 millionthes of an attosecond. This would be achieved using log2((2200 - 1582) * 3600 * 24 * 365.25) = approx. 34.182947 That is 35 bits before the binary point and 71 after the point. The initial and most obvious option was that the signed 116 bits integer represents femtoseconds since the start of the Gregorian calendar (15 October 1582). It identifies instants every femtosecond for: 2116 3600 * 24 * 365.2515 = 2.6325433 * 1012years before and after that event, which means 2632 thousand billion years, much more than the 13 billion years that the universe is supposed to have existed. However, this didn't look fine; attosecond laser pulses are produced and recorded when they occur could in principle be desirable; we don't need this resolution more than 2 mil years before and after today, so even zepto-seconds would be fine; also, 116 bits are hard to parse and handle; we could allow 64 bits in a second, that is about 54.21 zepto-seconds and the higher integer could provide for 2116 -64 that is about 100 million years. This does not, then, allow for representation of the mere history of the Cambrian period, not to speak about the emergence of galaxyes, and then 54.21 zepto-seconds may be quite inaccurate to fix attoseconds in time. But, then, we don't need to, and cannot map the cambrian period with attosecond accuracy. So it resulted that a some kind of float was needed. 100 - int: generic signed integer vector n, 7 bits, number of signed integers; l, 7 bits, size in bits of each integer; z, context field, 104-n*l, then the n elements of the vector, l bits each, the first of which is the sign. 101 - uint: generic unsigned integer vector n, 7 bits, number of unsigned integers; l, 7 bits, size in bits of each integer; z, context field, 104-n*l, then the n elements of the vector, l bits each. 110 - f/xx: generic float vector n, 7 bits, number of floating points l, 7 bits, size in bits of each floating point, e, 5 bits, size of each exponent; z, context field, 99-n*l, then the n elements of the vector, l bits each of which e are the exponent, 1 sign and l-e-1 mantissa. 111 - standard float vector f, format, on 3 bits, specifies either: 000-ieee-80 floating point number , preceded by a 24 bit context field, z, that is represented as 4 scheme B characters; 001-ieee-64 floating point number , preceded by a 40 bit context field, z, as 8 scheme C characters; 010-3 x ieee-32 floating point numbers , preceded by an 8 bit context field, z as a decimal number between 0 and 255; 011-2 x ieee-32 floating point numbers , preceded by a 40 bit context field, z as 8 scheme C characters; 100-1 x ieee-32 floating point numbers , preceded by a 72 bit context field, z, that is encoded as 12 scheme B characters; 101-6 x ieee-16 floating point numbers , preceded by an 8 bit context field, z as a decimal number between 0 and 255; 110-1-5 ieee-16 floating point number , followed by a 2 bit number that is n-1, the number of floating point numbers, and (102 - n*16) bits (that can be: 32, 48, 64 or 80 bits) that are encoded as 6-bit, scheme B characters. 111 reserved for other schemes with non-uniform length. 3.6 Mapping into the 128 bit UUID format 3.7 Human readable printed representation For each identifier there is an explicit printed representation, that uses the format scheme:{scheme:}*value that may produce any EXTUUID representation representation, and an implicit printed representation that is transformed into an UUID in a deterministic but necesarily obvious way. The most explicit way to provide such a printed representation is to specify the first scheme as extuuid. For example: extuuid:int:212323232423 produces the exact 117 bit signed integer; extuuid:quant:22.4*m3 means 22.4 cubic meters. The same expressions may appear without extuuid: or even without any qualifier if they are in a context that conducts to them being driven as inputs into the extuuid parser. However, if presented with "22.4" the parser would not know for sure if it is a quant, a float, a single number float vector or standard float vector, etc. It would decide in a deterministic way following the algorithm described in section 3.8, but to the user this may be not obvious. 3.7.1 Minimal and smart parser A minimal parser will only be able to parse the full canonical syntax described below. A smart parser may be provided by an application, that may infer the type and representation from shorter or more convenient forms. A smart parser can also be implemented as a filter. 3.7.2 Symbols A string is interpreted as an extuuid symbol if preceded by any of the following: extuuid:sym:, extuuid:sym[A-F]:, sym: or sym[A-F]. Encodings are tried in order, and the first that allows the representation of the symbol is used. should not the 5-bit appear first, as it uses the least ammount of memory? No, because it is also more ambiguous. 3.7.3 Serial numbers The generic format: [extuuid:]serial[A-F]:: where the name is encoded with the specified scheme or the first that fits. The number must be a decimal unsigned integer that may contain any number of dashes or underscores that are ignored. 3.7.4 Fixed numbers [extuuid:]fixed[A-F][0-32]: :[-] . It is like a serial except that it may contain an optional explicit number of decimal places after the optional encoding and it may be negative. If the number of decimal places is not given then the minimum such number is used. 3.7.5 Vectors [extuuid:][{int|q|t|uint|f}[/ [* [* ]]]:][ :] [, ]*[[/*] [ ]]* (full syntax will have to be provided) The fields have the following meaning: int an integer number or signed integer vector; if or are provided then it is a signed integer vector with the specified number and length; also, if a name is provided, or a list of values-instead of a single value-then it is a signed integer vector instead of a single integer; the unit must not occur in ints; the value is an optional sign followed by a string of decimal digits, with optionally interspersed undescores; the first and the last character in a value must be a digit. uint an unsigned integer vector; it may have , and a number of values, but of course no unit. q quant; it may not have dimensional attributes (n,l,e), nor name, but may have any number of units. The unit names are case sensitive abbreviations of the fundamental SI units. They are preceded by a '*' if the power is positive and a '/' if it is negative. The power, if higher than 1, must be given as a single digit. Fractional powers are given as / . t time; it may have an "n" attribute representing the number of bits to be used for subsecond resolution; it may not have a name; the value may be either a decimal number without sign, a single decimal dot, representing seconds from the reference time, or it may be of the format: yyyy[-/]MM[-/]dd;hh:mm:ss.n* where yyyy is the year, MM is the month, dd is the day, hh the hour (0-23), mm the minute and ss.n* represent the seconds and fractions of a second. Any of the tailing components may be missing. Other formats are acceptable by preceding with a format string and ':'. The format string is not yet specified. f a vector of one or more floating point numbers; name and dimension fields are accepted, units are not; the shortest representation possible for the given constants is used. 3.8 Algorithm for determining the type given printed identifier syntax 4 General implementation 4.1 App types vs representation types The types mentioned in section 3 are called "representation types". For a particular use, the application programmer will develop an app type. 4.1.1 Example: a currency value The follwing steps must be to considered. Design. Common currencies are stored using the "Fixed" type, with four decimals after the decimal point and with a name consisting of three letters (ISO 4217) followed by '$', encoded with type B. A canonical example is: extuuid:fixedB4:GBP$:222.31 Other formats are, however, also accepted for input by a smart parser, such as GBP222.31, GBP$222.31, 222.31$GBP or 222.31GBP. An identifier is selected for the type, say it is "money". A restriction can be established that, for example, only ISO 4127 valid letter combinations are acceptable. Implementation An unpacked type for this kind of objects is implemented. The object is probably a record consisting of a 3-character string and a 64 bit integer number or a fixed point type. Besides operations for the type (such as addition or multiplication with a float), constructors, predicates (is it a valid currency?) and selectors (what is the currency name?) must be provided. For interfacing with the EXTUUID system, the following functions need to be provided: * Is_Money(U: EXTUUID) return Boolean; tests whether an object is of this type. * Enpack(M: Money) return Extuuid; packs the money object as an extuuid. * Unpack(U: Extuuid) return Money; the reverse of enpack. * Is_Money(S: String) return Boolean; tests whether the string is a valid input object of the money type (this is the smart parser). * Parse(S: String) return Money; * Parse(S: String) return Extuuid; * Format(U: Extuuid) return String; (this may have more arguments). The functions that do not use the Money type can be recorded as generic readers and formatters to be used but procedures that do not directly see this module. 5 History of this document 12 oct. first version. 16 oct. posted online 18 oct. revised, section on implementation and reference to corlpack added 6 References [1] RFC 4122 A Universally Unique IDentifier (UUID) URN Namespace. P. Leach, M. Mealling, R. Salz. 2005 [2] RFC 2141 URN Syntax. R Moats. 1997. [3] http://www.webcitation.org [4] http://qalculator.sourceforge.net/