CORLPACK: R-like objects, Extended UUID (extuuid) support, simplified API for Ada-2005
A.D. Corlan
October 2, 2012. Last changed: April 24, 2013.
Archived by WebCite® at
http://www.webcitation.org/6DoDWzbLf
Corlpack (Common Objects of the R Language PACKage) is an Ada
package with a collection of data types and utility functions for
programmers of computational applications.
It is currently at version 0.5, an unstable alpha release. The
general roadmap for version 1.0 is outlined below.
The Wonderful World of CORLPACK 1.0
The primary purpose of Corlpack is to simplify computational application
programming by reducing the number of APIs that need to be learned,
while still maintaining reasonable implementation efficiency.
Application programmers would only need to learn a small set of
generic functions and data types. Corlpack aims to separate the systems
side of an application (for example: access to datafiles of a variety
of formats) from the application itself.
We try to do this by introducing two data types: the extUUID
as a generic, 128 bit object that can represent a broad variety of usual
application-level data types, such as identifiers, coordinates
or quantities with unit, and the table of extUUIDs as a
generic container.
The choice of fixed size UUIDs and tables is motivated by the need
for reasonable efficiency, which is frequently essential for computational
applications.
The Corlpack table
The table is a two-dimensional array the elements of which may be
addressed either by numeric indices or by key values, such as
names. Each cell of the array is either an extUUID or a pointer to
another structure that may be a large string, a table or something
else.
Named vectors, lists, matrices, data frames and structures from the
R language [2] can easily be represented as corlpack tables by
enforcing suitable restrictions on the content of cells, dimensions or
content and number of the keys. The keys are restricted to single
symbols for the dimensions that have names.
Corlpack tables are a generalisation of R named lists, in that they
may be bidimensional. They are also a generalisation of the Common
Lisp arrays of objects of arbitrary type, in that they may be indexed
with names instead of numbers. Tables also aim to allow a much faster
implementation, as addressing can be done by calculating an offset,
while still preserving the possibility of heterogeneous content.
A small number of operators will allow access and modification of tables
and the vectors of ExtUUIDs that host them, mostly by using the plain array
indexing in Ada and an offset computation function.
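The offset-based addressing described above can be sketched in plain
Ada. This is only an illustration under stated assumptions: the names
Cell, Flat_Table, Column_Key, Offset and Column_Of are hypothetical and
not part of the Corlpack API, a cell is reduced to an Integer, keys are
modelled as an enumeration, and a row-major layout is assumed.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Table_Sketch is
   --  Hypothetical stand-in for a 128-bit cell; not a Corlpack type.
   subtype Cell is Integer;

   --  The whole table lives in one flat vector, with no per-cell allocation.
   type Flat_Table is array (Positive range <>) of Cell;

   Rows : constant := 2;
   Cols : constant := 3;

   --  Column keys, modelled here as an enumeration for brevity.
   type Column_Key is (Id, Height, Weight);

   T : Flat_Table (1 .. Rows * Cols) := (1, 170, 65, 2, 182, 80);

   --  Row-major offset of cell (Row, Col); the layout is an assumption.
   function Offset (Row, Col : Positive) return Positive is
   begin
      return (Row - 1) * Cols + Col;
   end Offset;

   --  Translate a key into the numeric index of its column.
   function Column_Of (K : Column_Key) return Positive is
   begin
      return Column_Key'Pos (K) + 1;
   end Column_Of;
begin
   --  Addressing by numeric indices.
   if T (Offset (2, 2)) /= 182 then
      raise Program_Error;
   end if;

   --  Addressing by key, then modifying the cell in place.
   T (Offset (1, Column_Of (Weight))) := 66;
   Put_Line ("weight of row 1:" & Cell'Image (T (Offset (1, Column_Of (Weight)))));
end Table_Sketch;
```

Both access paths end in plain array indexing, which is what keeps
lookups cheap even when the cells hold heterogeneous content.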
The VUUID
Collections of data structures, for example one master table and all
the other tables, strings and other structures that are pointed to by
the master table, are stored as Ada variable size arrays of ExtUUIDs,
named VUUIDs. They are similar to fixed Strings from the standard
Ada library, except they are made of ExtUUIDs rather than Characters.
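The analogy with String can be made concrete with a short sketch. The
type names Ext_Uuid and Vuuid below are hypothetical stand-ins, and the
two-word record layout is an assumption made only so that the example
is self-contained; the actual Corlpack declarations may differ.

```ada
with Interfaces;  use Interfaces;
with Ada.Text_IO; use Ada.Text_IO;

procedure Vuuid_Sketch is
   --  Hypothetical 128-bit element: two 64-bit words and no pointers.
   type Ext_Uuid is record
      Hi, Lo : Unsigned_64 := 0;
   end record;

   --  Unconstrained array type, just as String is an array of Character.
   type Vuuid is array (Positive range <>) of Ext_Uuid;

   --  Like a fixed String, the bounds are chosen at declaration time.
   V : Vuuid (1 .. 4);

   --  Subprograms accept a Vuuid of any length, as they do for String.
   function Count_Used (X : Vuuid) return Natural is
      N : Natural := 0;
   begin
      for I in X'Range loop
         if X (I).Lo /= 0 then
            N := N + 1;
         end if;
      end loop;
      return N;
   end Count_Used;
begin
   V (1) := (Hi => 16#0123_4567_89AB_CDEF#, Lo => 42);
   V (3 .. 4) := V (1 .. 2);   --  slices work as they do for strings

   if Count_Used (V) /= 2 then
      raise Program_Error;
   end if;
   Put_Line ("elements:" & Integer'Image (V'Length));
end Vuuid_Sketch;
```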
Typically, a VUUID is loaded from a datafile of a specified
format (such as an image, a movie, a csv/xls table, a netcdf file, a
fits record, a file system directory structure, etc) with a simple
Load invocation and may be saved in a data file of any suitable format
with a Save. Some of the formats are specific to certain varieties
of structure (for example, images); others are generic and suit any
possible VUUID.
A registering mechanism is provided for the 'systems' programmer
to add new Load and Save methods.
VUUIDs may also be used without structuring elements, as simple
sequences of ExtUUIDs, for example to represent short expressions and
formulas or just heterogeneous vectors of data.
The Extended UUID (ExtUUID)
The extended UUID is the generic 'scalar' type of Corlpack 1.0. It always
contains exactly 128 bits, thus fast addressable aggregates such as
vectors and tables are feasible without dynamic allocation. The
ExtUUIDs are designed to fit in the RFC 4122 scheme of 128 bits.
Each ExtUUID consists of a type tag, which is 13 to 28 bits long, and
data. The format of the data bits is established by the value of the type
tag. The data bits may contain subtypes that further determine
the semantics of the remaining bits.
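As a rough illustration of the tag-plus-data idea, the sketch below
packs a 13-bit tag into the most significant bits of a 128-bit value
and reads it back. The position of the tag, the record layout and the
names Make, Tag_Of and Data_Hi_Of are all assumptions made for this
example; the text above only states that the tag is 13 to 28 bits long.

```ada
with Interfaces;  use Interfaces;
with Ada.Text_IO; use Ada.Text_IO;

procedure Tag_Sketch is
   --  Hypothetical 128-bit value held as two 64-bit words.
   type Ext_Uuid is record
      Hi, Lo : Unsigned_64;
   end record;

   Tag_Bits  : constant := 13;              --  shortest tag length in the text
   Data_Bits : constant := 64 - Tag_Bits;   --  data bits left in the high word

   --  Mask selecting the data bits of the high word.
   Mask : constant Unsigned_64 := Shift_Left (Unsigned_64'(1), Data_Bits) - 1;

   --  Assumption: the tag occupies the most significant bits of Hi.
   function Make (Tag, Data_Hi, Data_Lo : Unsigned_64) return Ext_Uuid is
   begin
      return (Hi => Shift_Left (Tag, Data_Bits) or (Data_Hi and Mask),
              Lo => Data_Lo);
   end Make;

   function Tag_Of (U : Ext_Uuid) return Unsigned_64 is
   begin
      return Shift_Right (U.Hi, Data_Bits);
   end Tag_Of;

   function Data_Hi_Of (U : Ext_Uuid) return Unsigned_64 is
   begin
      return U.Hi and Mask;
   end Data_Hi_Of;

   U : constant Ext_Uuid := Make (Tag => 5, Data_Hi => 16#FF#, Data_Lo => 7);
begin
   --  The tag is recovered first; it tells how to interpret the data bits.
   if Tag_Of (U) /= 5 or else Data_Hi_Of (U) /= 16#FF# then
      raise Program_Error;
   end if;
   Put_Line ("tag:" & Unsigned_64'Image (Tag_Of (U)));
end Tag_Sketch;
```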
From the application programmer's point of view, there is a flat list
of user types (named KINDs) of ExtUUIDs. There will be generic
operators for creating (Make_UUID), changing (Set), accessing components
of (Get), operating with ("+", "-", "/", "*", "<", ">", equality, etc.) and
converting between ExtUUIDs of different kinds. There are also two generic
operations, Read and Format, that convert to and from human readable string
representations of ExtUUIDs.
Get and Set deal with UUIDs uniformly, as sets of vectors of
integer, float, character and time objects.
For the generic operations, relatively fast registering and dispatching
mechanisms are provided for specific methods, so both the list of types
and their operations may be extended.
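One common way to get this kind of registration and dispatch in Ada is
a table of access-to-subprogram values indexed by kind. The sketch
below shows that general technique only; Kind, Format_Method, Register
and Format_Serial are hypothetical names, the payload is reduced to an
Integer, and nothing here is claimed to match the Corlpack internals.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Dispatch_Sketch is
   --  Hypothetical kind number and formatter signature.
   type Kind is range 0 .. 255;
   type Format_Method is access function (Payload : Integer) return String;

   --  One slot per kind; a null slot means no method is registered.
   Formatters : array (Kind) of Format_Method := (others => null);

   procedure Register (K : Kind; M : Format_Method) is
   begin
      Formatters (K) := M;
   end Register;

   --  Dispatch is a single array lookup followed by an indirect call.
   function Format (K : Kind; Payload : Integer) return String is
   begin
      if Formatters (K) = null then
         return "<unregistered kind>";
      end if;
      return Formatters (K).all (Payload);
   end Format;

   --  A method supplied by a plug-in for one particular kind.
   function Format_Serial (Payload : Integer) return String is
   begin
      return "serial:" & Integer'Image (Payload);
   end Format_Serial;
begin
   Register (1, Format_Serial'Access);
   Put_Line (Format (1, 42));
   Put_Line (Format (2, 42));
end Dispatch_Sketch;
```

New kinds are added by calling Register again, without touching the
generic Format operation.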
There is also a set of system ExtUUIDs named Konnectors that provide structural
representation inside a plain array of ExtUUIDs (a VUUID), for example strings,
vectors or tables.
The KINDs that should be supported in version 1.0 for UUIDs are listed below.
Where a recent release already supports a kind, its version is given in parentheses.
- the "Not Available" (missing) value (0.5)
- alphanumeric symbols (0.4.2)
- time: a specific instant in physical real time (0.4.2)
- quant: a floating point number together with SI unit (0.4.2)
- fixed: a fixed point number with a unit such as a currency (0.4.2)
- integer, floating point, fraction or complex numbers
- spherical coordinates, unbound or bound to spheres of interest
(earth, planets, sky as seen from earth)
- V1-V5 UUID as specified by RFC 4122 (0.4.2)
- serial numbers: ISBN, ISSN, EAN, PMID, UNII, LOINC, MAC address, etc. (0.4.2)
- runs and compilations
- subcomponents of a document
- formats for formatted output
- data formats (of files)
- konnectors (pointers, strings, arrays and lists) for structuring a VUUID. (0.4.2)
Other types
For user convenience, other types are provided: Int (64-bit integer),
Real (80-bit float), Text (unbounded string) and Time (Ada calendar
time), with some of the same generic functions that are also
available for UUIDs. Also, thin interfaces to libraries such as
Ada.Text_IO, Ada.Calendar, Ada.Exceptions, elementary numerical
functions and others are available, reducing the number of 'with'
and 'use' statements, as well as generic instantiations, that the user
needs to make.
The Hq
The Hq data structure was a previous attempt to design and implement
an efficient and versatile object like the table, but it proved to be
too complex. The first such attempt used niliada conses, but they proved
too slow for large objects because of the necessity of garbage collection.
Hopefully, the VUUID-based tables will not have any of these drawbacks.
While the Hq is relatively complicated to use, and its development
is currently frozen, it is still kept for potential
use in the future, for very large (of the order of gigabytes) structures
that would require even faster and more memory efficient algorithms.
How you can help
Currently, Corlpack development mostly needs:
- testing;
- development of new ExtUUID
user types. There are examples for the ISSN, ISBN, LOINC, PMID and
others in the sources. You could write one for any URN or
similar encoding scheme that fits easily in about 100 bits.
Examples are: ResearcherID, ORCID, SICI, ISAN, IETF URNs, URN:LEX,
astronomical object catalogs (GSC, USNO, NGC, etc.), things like
MAC addresses, IP addresses (IPv6 doesn't fit in an obvious way, but an
IPv6 network does);
- writing data load/save plugins for a variety of data formats
such as PNG or FITS.
Any other contribution is also welcome.
Download and Install
Version 0.5.2
This version features an executable tool, named 'corl', that
provides an interface between shells and some corlpack functions.
It also includes minor improvements and cleanups such as removal
of compilation warnings and better error support when loading vuuids
(reporting line numbers of errors).
corlpack_0.5.2.ada
released April 24, 2013.
Version 0.5.1
We added arithmetical operators for quant extuuids (numbers with
units of measurement) and between quants, integers and floats. There
is also a new mechanism to add unit names and definitions that are
recognised by the reader. You can write: Len: UUID:= 4*foot + 2.5*inch
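To show how arithmetic on numbers with units can work, here is a
self-contained sketch that does not use Corlpack at all. It models a
quantity as a value plus integer exponents of the seven SI base units,
a simplification of the quant representation described in the version
0.4 notes (the fractional denominators are left out). The type Quant
and the constants Foot and Inch are defined locally for this example
and are assumptions, not the library's declarations.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Quant_Sketch is
   type Base_Unit is (Metre, Kilogram, Second, Ampere, Kelvin, Mole, Candela);
   type Exponents is array (Base_Unit) of Integer;

   --  Hypothetical quantity: a value in SI units plus its dimension.
   type Quant is record
      Value : Long_Float;
      Dim   : Exponents;
   end record;

   Unit_Mismatch : exception;

   Length_Dim : constant Exponents := (Metre => 1, others => 0);

   --  Local unit definitions, expressed in metres.
   Foot : constant Quant := (0.3048, Length_Dim);
   Inch : constant Quant := (0.0254, Length_Dim);

   --  Scaling by a plain number keeps the dimension.
   function "*" (K : Long_Float; Q : Quant) return Quant is
   begin
      return (K * Q.Value, Q.Dim);
   end "*";

   --  Multiplying quantities adds the exponents of each base unit.
   function "*" (A, B : Quant) return Quant is
      R : Quant := (A.Value * B.Value, A.Dim);
   begin
      for U in Base_Unit loop
         R.Dim (U) := A.Dim (U) + B.Dim (U);
      end loop;
      return R;
   end "*";

   --  Addition is only defined for quantities of the same dimension.
   function "+" (A, B : Quant) return Quant is
   begin
      if A.Dim /= B.Dim then
         raise Unit_Mismatch;
      end if;
      return (A.Value + B.Value, A.Dim);
   end "+";

   Len  : constant Quant := 4.0 * Foot + 2.5 * Inch;
   Area : constant Quant := Len * Len;
begin
   if Area.Dim (Metre) /= 2 then
      raise Program_Error;
   end if;
   Put_Line ("length in metres:" & Long_Float'Image (Len.Value));
end Quant_Sketch;
```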
corlpack_0.5.1.ada
released January 24, 2013.
Version 0.5
First implementation of the corlpack tables, which are a generalisation
of all R data types. Workaround for the problem of random bits in
gaps of packed records and arrays preventing straightforward equality
testing of some UUIDs. Simplified error reporting. Load/save drivers
for key/value files.
corlpack_0.5.ada
released January 15, 2013.
Version 0.4
- Complete UUID APIs. Each UUID is now seen as a set of
integer, float and character vectors (up to 4 vectors of each
kind), and the get and set generic functions may operate on the
respective components. For example, a serial consists of the name (kind)
of the serial (a character vector) and an integer number (the serial
number). A quant is seen as a single-value float vector and two integer
vectors of 7 integers each: the first holds the numerators of the powers
to which the SI base units are raised, the second holds the denominators
of the fractional powers.
- Added pluggable UUID types (by registering make, read and format
methods for new UUID types). Corlpack.Issn is an example implementation
of a plugged-in UUID user type.
- Added VUUIDs, vectors of (extended) UUIDs.
- Added string UUIDs that can represent 7-bit ASCII strings of arbitrary type.
They do not exist independently, but as a data structure
inside a VUUID.
- Fixed some errors in reading/writing time.
- More tests.
corlpack_0.4.2.ada
released January 8, 2013. Adds matrices, pointers and lists as structuring elements
in vectors of UUIDs. Plugin mechanism for loaders and savers of files of various
structures into/from ExtUUID vectors.
corlpack_0.4.1.ada
released January 1, 2013. Includes Extended UUID support for:
PubMed ID (pmid), Unique Ingredient Identifier (unii),
Logical Observation Identifiers Names and Codes (loinc) and
webcitation URIs (wbct).
corlpack_0.4.ada
released December 27, 2012
For installation and usage, see below.
Version 0.3
Implementation of an extended scheme of 128-bit UUIDs (EXTUUID) including:
- limited length symbols with various encodings (18 to 22 characters)
- high resolution time
- physical quantities with units decomposable in SI units (for example: 22.4m/s)
- financial quantities as fixed point numbers with currency/financial instrument unit
- V1-V5 UUIDs of the RFC-4122 specification
- potential support for: serial numbers, colors, coordinates, functions, etc
Preliminary implementation of I/O of the first five varieties of UUIDs.
corlpack_0.3.ada
released October 18, 2012
To install, change directory into your Ada input path and say:
gnatchop corlpack_0.3.ada
To use, precede your application with:
with Corlpack; use Corlpack;
See the .ads file for details.
Version 0.2
Early, unstable, alpha version.
corlpack_0.2.ada released October 4, 2012
To install, change directory into your Ada input path and say:
gnatchop corlpack_0.2.ada
To use, precede your application with:
with Corlpack; use Corlpack;
See the .ads file for details.
General introduction
Corlpack currently contains:
- Unified processing of symbols and UUIDs [1], using an efficient encoding scheme
for alphanumeric identifiers (symbols) up to 20 characters in length.
Since version 0.3 a broad range of data types, besides symbols, including measurement
units, fixed point numbers, named numbers and vectors, have been
folded into the UUID schema, and called EXTUUID. Pluggable modules
to add new types as well as support for VUUIDs (vectors
of uuids) have been added in version 0.4.
Vectors of UUIDs are planned to also represent structures such as
tables, lists and arrays, covering data types currently available in
R, Qalc, Python and others. It will be less efficient to deal with a
large float array in corlpack compared to native Ada, but should prove
substantially more efficient compared to R or Python.
- A thin interface to the most frequently used functions in the
standard Ada and Gnat libraries, including numerics, I/O, random
generation, unbounded strings, that removes the need to use/with many
standard packages.
- A data type (named 'Hq') that can efficiently represent any data
structure from R [2], the statistical and analytics language. It aims
to provide a more efficient variant of VUUID for dealing with large
data structures, although it will be possible to represent the same data
with VUUIDs as well.
The Hq data type includes possibly named vectors of floating point numbers,
integers, symbols or strings as well as named lists of other lists and
such vectors. A whole tree of lists and components is represented as a
single h(un)q of memory, that can be allocated at once (even on the
stack), thus saving much overhead in heap allocation and garbage
collection. Another data type, Cursor, is provided for access into the
structure.
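The benefit of keeping a whole structure in one piece of memory can be
illustrated with an Ada discriminated record, which carries its own
storage and can be declared directly on the stack. This is a
deliberately simplified sketch: the names Block, Capacity and Append
are hypothetical, only one vector of floats is stored, and the real Hq
layout is considerably richer.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Block_Sketch is
   type Real_Array is array (Positive range <>) of Long_Float;

   --  All storage lives inside the record itself; there are no heap pointers.
   type Block (Capacity : Positive) is record
      Used : Natural := 0;
      Data : Real_Array (1 .. Capacity);
   end record;

   --  Appending only advances an index; nothing is allocated.
   procedure Append (B : in out Block; X : Long_Float) is
   begin
      if B.Used = B.Capacity then
         raise Constraint_Error with "block is full";
      end if;
      B.Used := B.Used + 1;
      B.Data (B.Used) := X;
   end Append;

   --  A single declaration reserves the whole structure on the stack.
   B : Block (Capacity => 1_000);
begin
   Append (B, 1.5);
   Append (B, 2.5);

   if B.Used /= 2 or else B.Data (2) /= 2.5 then
      raise Program_Error;
   end if;
   Put_Line ("used:" & Natural'Image (B.Used));
end Block_Sketch;
```

Passing such a record by reference (an 'in out' parameter, as above)
avoids the copying that the text warns about.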
Data types
- Real is a floating point type that is intended to be the only one used
in one's application and in more complex Corlpack types.
- Int is an integer type. (Corlpack could easily be made into a generic
package parameterised by Real and Int).
- Text is a string type, that currently renames the Unbounded_String from the
standard Ada library. Might be replaced in the future with a more cooked version
implemented via Hq.
- Time currently renames Ada.Calendar.Time, but another version
is provided with the EXTUUID (Corlid) implementation.
- Uuid is a 128-bit uuid object. However, in this particular formulation
the bits are not stored as in the uuid standard [1], for efficiency reasons.
Symbols that are composed of up to 20 alphanumeric characters plus the
'_' and the ':' characters can be converted to uuids and back (corresponding
to 1/4 of the random UUID space). Such symbols are named Corlpack identifiers.
They are in fact numbers in base 64, but trailing ':' are not displayed.
- Corlid is also a 128-bit uuid object, that
may be converted to a UUID and that supports, besides symbols,
a broad range of other identifiable objects (see above).
- Hq is a parametric record consisting of spaces of Uuid,
Character, Real, Int and Boolean objects (together named scalars) that
are referred to by one or more list elements. A Hq is a tree that
has vectors of scalars as leaves, and is semantically equivalent to
the named lists in R [2], with the following exceptions:
- the names of list and vector elements must be Corlpack identifiers
- factors are not supported as such, must be implemented as vectors of Corlpack identifiers
- there are no vectors of strings, they must be implemented as lists of strings
Missing values are supported. Hq is distinguished from other similar
implementations (like niliada) by the fact that a whole structure,
with lists, vectors and names, is stored in a single block of memory,
thus potentially reducing garbage collection overhead
substantially. If used carefully
(avoiding for example too much copying by parameter passing) it could result
in much faster computations than possible with large structures of small dynamically
allocated objects.
- Cursor is a reference inside a Hq, implemented as a pointer to the Hq
structure and indices to a current list and vector element.
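The base-64 reading of Corlpack identifiers given in the Uuid item
above can be sketched as follows. The digit order of the alphabet, the
split into two 60-bit halves and the names Alphabet, Packed, Encode and
Decode are assumptions chosen for this example; only the character set,
the 20-character limit and the hiding of trailing ':' come from the
text.

```ada
with Interfaces;  use Interfaces;
with Ada.Text_IO; use Ada.Text_IO;

procedure Symbol_Sketch is
   --  Assumed digit order; the real Corlpack ordering may differ.
   Alphabet : constant String (1 .. 64) :=
     ":_0123456789" &
     "ABCDEFGHIJKLMNOPQRSTUVWXYZ" &
     "abcdefghijklmnopqrstuvwxyz";

   Max_Len : constant := 20;

   --  Two words of ten 6-bit digits each: 120 of the 128 bits are used.
   type Packed is array (1 .. 2) of Unsigned_64;

   function Digit_Of (C : Character) return Unsigned_64 is
   begin
      for I in Alphabet'Range loop
         if Alphabet (I) = C then
            return Unsigned_64 (I - 1);
         end if;
      end loop;
      raise Constraint_Error with "character not allowed in a symbol";
   end Digit_Of;

   function Encode (S : String) return Packed is
      Padded : String (1 .. Max_Len) := (others => ':');
      P      : Packed := (0, 0);
   begin
      if S'Length > Max_Len then
         raise Constraint_Error with "symbol too long";
      end if;
      Padded (1 .. S'Length) := S;   --  pad on the right with ':'
      for I in Padded'Range loop
         declare
            W : constant Positive := (I - 1) / 10 + 1;
         begin
            P (W) := Shift_Left (P (W), 6) or Digit_Of (Padded (I));
         end;
      end loop;
      return P;
   end Encode;

   function Decode (P : Packed) return String is
      S    : String (1 .. Max_Len);
      Last : Natural := Max_Len;
   begin
      for I in S'Range loop
         declare
            W     : constant Positive := (I - 1) / 10 + 1;
            Shift : constant Natural  := 6 * (9 - (I - 1) mod 10);
         begin
            S (I) := Alphabet (Integer (Shift_Right (P (W), Shift) and 63) + 1);
         end;
      end loop;
      --  Trailing ':' digits are padding and are not displayed.
      while Last > 0 and then S (Last) = ':' loop
         Last := Last - 1;
      end loop;
      return S (1 .. Last);
   end Decode;
begin
   if Decode (Encode ("abc")) /= "abc" then
      raise Program_Error;
   end if;
   Put_Line (Decode (Encode ("Heart_Rate:bpm")));
end Symbol_Sketch;
```

An inner ':' survives the round trip; only the padding at the end is
dropped.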
Operators overview
- To_Text, To_String: rename the Unbounded String version.
- Sqrt, Log, Exp, **, trigonometric functions: rename the elementary functions for the Real type.
- Read: returns a scalar or string read from the standard input.
- Format: formatted printing of scalars into a string.
- Write, Nl: formatted printing of scalars (or a newline) into a Text buffer or standard output.
- Save, Log: save or append a Text buffer (in)to a file with interprocess locking.
- Load: load a Text buffer from a file.
- Trim: eliminate whitespace around a string.
- Moment_Random: random generator that changes with each run.
- Random: generate a random Int or Real. For Real, uniform and truncated normal
distributions are currently supported.
- Reset_Generator, Generator_State: control the sequence of random numbers.
- Clock, To_String: quick interface to Ada.Calendar.
- To_String, To_Uuid: uuid equivalent of symbols.
- Read, Format (since v0.3): conversion of the extended form of
UUIDs from internal (corlid) 128-bit format
to string/printable format and back.
- Random_Uuid_Symbol: generate a random symbol with a prefix.
- [Plain_]YYYY_Vector, Plain_String, List: where YYYY can be Real, Int, Bool or Id,
these are constructors of the simplest forms of Hq. The Plain variants do not have names,
missing values or variable length, but also have lower overhead.
- Is_[Plain_]YYYY[_Vector]: predicates for Hq objects or their components
(pointed to by a cursor).
- Data, Name, Is_Na: contents of Hq objects.
- Length, Size: current and maximum size of a Hq object or a component.
- Ascend, Descend, Move: move a cursor inside a Hq.
- Store, Append, Append_Na, Set_Na, Set_Name: modify a Hq or a component.
The separate procedures Hqtest, Corlidtest and Testuuid contain regression tests.
References
[1] RFC 4122 A Universally Unique IDentifier (UUID) URN Namespace.
P. Leach, M. Mealling, R. Salz.
[2] R Language Definition. www.r-project.org