Runes

1 Introduction

The runes interface in Closure arose from the need for Unicode characters and strings at a time when no available Lisp implementation offered them. The runes API closely mirrors the character and string interface of standard Common Lisp.

There are two implementations:

rune-is-character: This is for Unicode-aware Lisp implementations; rune is a synonym for character and rod is a synonym for string.
rune-is-integer: This is for environments that are not Unicode-aware. Runes really are (unsigned-byte 16), and rods are specialized vectors of those runes.

Note that in either model a rod is a vector of rune objects, so you can use all the standard Common Lisp sequence functions on rod objects. It is further guaranteed that eql works on rune objects.
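As a rough illustration of the rune-is-integer model, the following sketch builds a rod by hand and applies ordinary sequence functions to it. The names sketch-rune, sketch-rod, and sketch-demo are invented for this example and are not part of the runes API; the numeric codes assume ASCII.

```lisp
;; Hypothetical stand-ins for the rune-is-integer model;
;; these are not the library's own definitions.
(deftype sketch-rune () '(unsigned-byte 16))
(deftype sketch-rod () '(vector (unsigned-byte 16)))

(defun sketch-demo ()
  "Apply standard sequence functions to a hand-built rod spelling hello."
  (let ((rod (make-array 5 :element-type '(unsigned-byte 16)
                           :initial-contents '(104 101 108 108 111))))
    (list (typep rod 'sketch-rod)            ; the rod is a plain vector
          (typep (aref rod 0) 'sketch-rune)  ; each element is a 16-bit integer
          (length rod)
          (position 108 rod)                 ; index of the first l
          (count 108 rod)                    ; number of l runes
          (eql (aref rod 0) 104))))          ; eql compares runes
```

Under these assumptions, (sketch-demo) returns (T T 5 2 2 T).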

Additionally, there are two reader macros: #/… to read runes and #"…" to read rods.

Although most Common Lisp implementations these days have Unicode support, using runes is still a good idea in applications or libraries that aim to be highly portable among different implementations. For one thing, you get Unicode support on the occasional implementation that is not Unicode-aware. For another, you can be sure that certain things remain constant. Take the input syntax: is the ASCII form feed character called #\Page or #\Formfeed? Or: what code point is used to represent the end of a line? Additionally, the behavior of string-upcase and friends can vary a lot.

2 Runes

Runes are like characters. Unlike Common Lisp characters, however, a rune is specified to be a single Unicode code point, and for every Unicode code point there is a rune.

Implementation Note: In the current implementation a rune might be represented as an (unsigned-byte 16); this is a historical accident, as at the time of writing the original author was not aware that there would be Unicode code points beyond 2¹⁶.

Depending on the model chosen, a rune might be represented as an unsigned-byte, a character, or some otherwise opaque structure.

[Type] rune

[Function] code-rune code: Returns the rune which identifies the Unicode code point code.

[Function] rune-code rune: Returns the code point that rune identifies.

[Function] char-rune char: Returns the rune that corresponds to the Common Lisp character char.

[Function] rune-char rune &optional (default *invalid-rune*): Returns the Common Lisp character that corresponds to rune. If the rune is not representable as a character in the implementation at hand, default is returned. If, in this case, default is nil, an error is signaled.

[Special Variable] *invalid-rune*: Rune to use as a replacement in rune-char and rod-string for runes not representable as characters. If nil, an error is signaled instead.
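The fallback behavior described for rune-char can be sketched as follows for the rune-is-integer model. Everything here is hypothetical: sketch-rune-char, *sketch-invalid-rune*, and *sketch-char-limit* are invented names, and the artificially low limit of 256 only simulates a host Lisp with 8-bit characters so that the fallback path can be exercised.

```lisp
(defvar *sketch-invalid-rune* nil
  "Stand-in for *invalid-rune*; NIL means an error is signaled.")

(defvar *sketch-char-limit* 256
  "Hypothetical limit: pretend the host Lisp only has 8-bit characters.")

(defun sketch-rune-char (rune &optional (default *sketch-invalid-rune*))
  "Return the character for the code point RUNE, or DEFAULT if there is none."
  (cond ((< rune *sketch-char-limit*) (code-char rune)) ; representable
        (default default)                               ; use the replacement
        (t (error "Rune #x~4,'0X has no character equivalent." rune))))
```

With these definitions, (sketch-rune-char 65) returns #\A, (sketch-rune-char #x20AC #\?) returns the replacement #\?, and (sketch-rune-char #x20AC) signals an error because no default is set.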

Predicates

[Function] runep object: Returns true if object is a rune. Note that unless the rune-is-structure model is selected, we can't tell runes apart from either characters or integers, depending on the model chosen.

[Function] white-space-rune-p rune: Returns true if rune is white space. White space is defined as ASCII Space, Linefeed, Carriage Return, or Tabulator (decimal code points 32, 10, 13, and 9).

[Function] digit-rune-p rune &optional (radix 10)

If rune is a digit according to the base radix, the weight of the digit is returned; otherwise nil is returned. radix should be an integer in the range [2; 36].

Only Arabic digits and Latin letters are ever considered to be digits.
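A minimal sketch of this rule for the rune-is-integer model is shown below. The name sketch-digit-rune-p is invented, and treating upper and lower case letters alike is an assumption borrowed from the standard digit-char-p; the text above does not state it.

```lisp
(defun sketch-digit-rune-p (code &optional (radix 10))
  "Return the digit weight of the code point CODE in RADIX, or NIL."
  (let ((weight (cond ((<= 48 code 57) (- code 48))          ; 0-9
                      ((<= 65 code 90) (+ 10 (- code 65)))   ; A-Z
                      ((<= 97 code 122) (+ 10 (- code 97)))  ; a-z
                      (t nil))))                             ; anything else
    ;; A weight only counts as a digit if it is below the radix.
    (and weight (< weight radix) weight)))
```

For example, (sketch-digit-rune-p 55) returns 7, (sketch-digit-rune-p 70 16) returns 15, (sketch-digit-rune-p 70) returns NIL, and (sketch-digit-rune-p #x0663), an Arabic-Indic digit, returns NIL because it lies outside the ASCII ranges.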

[Function] rune= x y

[Function] rune<= rune &rest more

[Function] rune>= rune &rest more

[Function] rune-equal rune1 rune2: Returns true if rune1 and rune2 are the same rune or differ only by case.

3 Rods

Rods are vectors of runes. We specifically chose not to wrap rods in some further structure, so that the whole set of Common Lisp sequence functions works on rods.

[Type] rod: This type refers to a vector of runes. Depending on the implementation model chosen, it is the appropriate subtype of vector.

[Type] simple-rod: This type refers to a simple vector of runes. Depending on the implementation model chosen, it is the appropriate subtype of simple-vector.

[Function] make-rod size: Returns a freshly allocated rod of length size. The initial content of the rod returned is unspecified.

[Function] sloopy-rod-p object: Returns true, if object looks like a rod.

[Function] rune rod index: Returns the rune at position index in rod. It is an error if index is not a non-negative integer less than the length of rod.

[Function] (setf rune) new-value rod index: Modifies the rune at position index in rod to become new-value. It is an error if index is not a non-negative integer less than the length of rod. It is also an error if new-value is not a rune.

[Function] %rune rod index: Like the rune accessor, but rod is assumed to be a simple-rod and index is assumed to be a legitimate index. This is a low-safety variant for speed; use with care.

[Function] (setf %rune) new rod index
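The difference between the checked and the low-safety accessor might look roughly like this in the rune-is-integer model. All names prefixed with sketch- are invented, and the optimization settings are one plausible choice rather than what the library actually declares.

```lisp
(deftype sketch-simple-rod () '(simple-array (unsigned-byte 16) (*)))

(defun sketch-rune (rod index)
  "Checked access: rejects an INDEX outside the bounds of ROD."
  (unless (and (integerp index) (< -1 index (length rod)))
    (error "Index ~S is out of range for a rod of length ~D."
           index (length rod)))
  (aref rod index))

(declaim (inline sketch-%rune))
(defun sketch-%rune (rod index)
  "Unchecked access: trusts that ROD is a simple rod and INDEX is valid."
  (declare (optimize (speed 3) (safety 0))
           (type sketch-simple-rod rod)
           (type fixnum index))
  (aref rod index))

(defun (setf sketch-%rune) (new rod index)
  "Unchecked store of the 16-bit rune NEW at INDEX."
  (declare (optimize (speed 3) (safety 0))
           (type sketch-simple-rod rod)
           (type fixnum index)
           (type (unsigned-byte 16) new))
  (setf (aref rod index) new))
```

The unchecked variant gives the compiler enough type information to skip bounds and type checks, which is why passing anything other than a simple rod and a valid index to it has undefined consequences.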

[Function] rod object

[Function] rod-subseq rod start &optional end: Returns a freshly allocated rod that contains, in sequence, all the runes in rod from index start up to but not including index end. end defaults to the length of rod.
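Because a rod is an ordinary vector, the behavior described for rod-subseq can be approximated with the standard subseq. The name sketch-rod-subseq is invented for this illustration.

```lisp
(defun sketch-rod-subseq (rod start &optional (end (length rod)))
  "Return a fresh rod with the runes of ROD from START up to, excluding, END."
  ;; SUBSEQ always allocates a new sequence, so the result never
  ;; shares storage with ROD.
  (subseq rod start end))

(defun sketch-subseq-demo ()
  "Show that END is exclusive and that the result is a fresh copy."
  (let* ((rod (make-array 5 :element-type '(unsigned-byte 16)
                            :initial-contents '(104 101 108 108 111)))
         (part (sketch-rod-subseq rod 1 3)))
    (setf (aref part 0) 69)              ; modify the copy only
    (list (coerce part 'list)
          (coerce rod 'list)
          (coerce (sketch-rod-subseq rod 3) 'list))))
```

Here (sketch-subseq-demo) returns ((69 108) (104 101 108 108 111) (108 111)): the original rod is untouched by the change to the copy.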

[Function] rod= rod1 rod2: Returns true, if rod1 and rod2 represent the same sequence of runes.

[Function] rod< rod1 rod2

[Function] rod-equal rod1 rod2: Returns true, if the sequences of runes are rune-equal. That is rod1 and rod2 are compared to each other while ignoring the case of the runes.
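The distinction between rod= and rod-equal can be sketched for the rune-is-integer model as follows. The function names are invented, and the case folding covers only ASCII letters, which is a simplifying assumption of this example.

```lisp
(defun sketch-fold-case (code)
  "Map ASCII A-Z to a-z and leave every other code point alone."
  (if (<= 65 code 90) (+ code 32) code))

(defun sketch-rod= (rod1 rod2)
  "True if both rods hold exactly the same sequence of code points."
  (and (= (length rod1) (length rod2))
       (every #'= rod1 rod2)))

(defun sketch-rod-equal (rod1 rod2)
  "True if both rods match once ASCII case differences are ignored."
  (and (= (length rod1) (length rod2))
       (every (lambda (a b)
                (= (sketch-fold-case a) (sketch-fold-case b)))
              rod1 rod2)))
```

With RUNE spelled as #(82 85 78 69) and rune as #(114 117 110 101), sketch-rod= returns NIL while sketch-rod-equal returns T.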

4 Case Conversion

Note that, unlike string-upcase and string-downcase, we do not promise that a case-converted rod has the same length as the original.

[Function] rune-downcase rune: Returns the downcase equivalent of rune. If the rune has no case or no downcase equivalent is available, the original rune is returned.

[Function] rune-upcase rune: Returns the upcase equivalent of rune. If the rune has no case or has no upcase equivalent, the original rune is returned.

[Function] rod-downcase rod: Converts each rune from rod to downcase and returns the resulting rod. Note that the result can have a length different from the input argument.

[Function] rod-upcase rod: Converts each rune from rod to upcase and returns the resulting rod. Note that the result can have a length different from the input argument.
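To see why the length cannot be promised, consider U+00DF (ß), whose full Unicode uppercase mapping is the two-letter sequence SS. The sketch below handles only that one special case plus ASCII letters; sketch-rod-upcase is an invented name, and whether the actual rod-upcase performs this expansion is not stated here.

```lisp
(defun sketch-rod-upcase (rod)
  "Upcase ASCII letters in ROD and expand U+00DF to SS; returns a new rod."
  (let ((out (make-array (length rod)
                         :element-type '(unsigned-byte 16)
                         :adjustable t
                         :fill-pointer 0)))
    (loop for code across rod
          do (cond ((= code #xDF)               ; sharp s becomes two runes
                    (vector-push-extend 83 out)
                    (vector-push-extend 83 out))
                   ((<= 97 code 122)            ; ASCII a-z
                    (vector-push-extend (- code 32) out))
                   (t (vector-push-extend code out))))
    out))
```

Applied to the six runes of straße, #(115 116 114 97 223 101), the result holds the seven code points 83 84 82 65 83 83 69.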

5 Character and String Conversion

Some convenience functions are provided to convert between vanilla Common Lisp characters and strings and rune and rod objects.

[Function] rod-string rod &optional (default-char *invalid-rune*): Turns the rod rod into a Common Lisp string. default-char is as for rune-char.

[Function] string-rod string: Converts the Common Lisp string string into a rod.
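For the rune-is-integer model, the two conversions amount to mapping char-code and code-char over the sequence. The sketch below uses invented names, assumes every character code fits in 16 bits, and uses #\? as its fallback purely for illustration rather than *invalid-rune*.

```lisp
(defun sketch-string-rod (string)
  "Convert STRING to a rod of 16-bit code points."
  (map '(vector (unsigned-byte 16)) #'char-code string))

(defun sketch-rod-string (rod &optional (default-char #\?))
  "Convert ROD to a string, using DEFAULT-CHAR for unrepresentable runes."
  (map 'string
       (lambda (code)
         (or (and (< code char-code-limit) (code-char code))
             default-char
             (error "Rune ~D is not representable as a character." code)))
       rod))
```

A round trip such as (sketch-rod-string (sketch-string-rod "rune")) yields a string equal to the original.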

6 Syntax

[Syntax] #/…

This syntax is used to read a rune. It is similar to the Common Lisp #\… syntax.

`#/U+nnnn`	`#xnnnn`	rune with the hexadecimal code nnnn

The following semi-standard rune names are defined:

`#/Null`	`#x0000`
`#/Space`	`#x0020`
`#/Newline`	`#x000A`
`#/Return`	`#x000D`
`#/Tab`	`#x0009`
`#/Page`	`#x000C`
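How a reader might resolve the text following #/ to a code point can be sketched as below. This is not the library's reader macro: sketch-parse-rune-token and *sketch-rune-names* are invented, only the semi-standard names from the table above are included, and matching names case-insensitively is an assumption.

```lisp
(defparameter *sketch-rune-names*
  '(("Null" . #x0000) ("Space" . #x0020) ("Newline" . #x000A)
    ("Return" . #x000D) ("Tab" . #x0009) ("Page" . #x000C))
  "Hypothetical name table holding the semi-standard rune names.")

(defun sketch-parse-rune-token (token)
  "Return the code point denoted by TOKEN, the text following #/."
  (cond ((and (> (length token) 2)
              (string-equal "U+" token :end2 2))
         ;; U+nnnn form: the remainder is a hexadecimal code.
         (parse-integer token :start 2 :radix 16))
        ((= (length token) 1)
         ;; A single character stands for itself.
         (char-code (char token 0)))
        (t
         (or (cdr (assoc token *sketch-rune-names* :test #'string-equal))
             (error "Unknown rune name: ~A" token)))))
```

Under these assumptions, the tokens U+20AC, Space, and Page resolve to 8364, 32, and 12.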

The following ASCII runes are defined:

`#/nul`	`#x0000`	null character
`#/soh`	`#x0001`	start of header
`#/stx`	`#x0002`	start of text
`#/etx`	`#x0003`	end of text
`#/eot`	`#x0004`	end of transmission
`#/enq`	`#x0005`	enquiry
`#/ack`	`#x0006`	acknowledgment
`#/bel`	`#x0007`	bell
`#/bs`	`#x0008`	backspace
`#/ht`	`#x0009`	horizontal tab
`#/lf`	`#x000A`	line feed
`#/vt`	`#x000B`	vertical tab
`#/ff`	`#x000C`	form feed
`#/cr`	`#x000D`	carriage return
`#/so`	`#x000E`	shift out
`#/si`	`#x000F`	shift in
`#/dle`	`#x0010`	data link escape
`#/dc1`	`#x0011`	device control 1
`#/dc2`	`#x0012`	device control 2
`#/dc3`	`#x0013`	device control 3
`#/dc4`	`#x0014`	device control 4
`#/nak`	`#x0015`	negative acknowledgement
`#/syn`	`#x0016`	synchronous idle
`#/etb`	`#x0017`	end of transmission block
`#/can`	`#x0018`	cancel
`#/em`	`#x0019`	end of medium
`#/sub`	`#x001A`	substitute
`#/esc`	`#x001B`	escape
`#/fs`	`#x001C`	file separator
`#/gs`	`#x001D`	group separator
`#/rs`	`#x001E`	record separator
`#/us`	`#x001F`	unit separator
`#/del`	`#x007F`	delete

Additional control characters:

`#/nbsp`	`#x00A0`	no-break space
`#/shy`	`#x00AD`	soft hyphen

[Syntax] #"…": The printed representation of a rod is #" followed by the runes and a closing ". A double quote character inside the rod is escaped with a backslash (\).
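A printer following this description could be sketched as below for the rune-is-integer model. The name sketch-print-rod is invented, only the double quote is escaped because that is the only escape described here, and the runes are assumed to be representable as characters.

```lisp
(defun sketch-print-rod (rod &optional (stream *standard-output*))
  "Write ROD to STREAM in #\"...\" notation, escaping double quotes."
  (write-string "#\"" stream)
  (loop for code across rod
        do (when (= code 34)              ; 34 is the double quote
             (write-char #\\ stream))
           (write-char (code-char code) stream))
  (write-char #\" stream)
  rod)
```

For a rod holding the three runes a, double quote, and b, this writes the seven characters #"a\"b" to the stream.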