Using Serd
The serd API is declared in serd.h:
#include <serd/serd.h>
Communicating with the outside world via syntax is performed using two main types: the Reader, which reads text and fires callbacks, and the Writer, which writes text when driven by corresponding functions. Both work in a streaming fashion but still support pretty-printing, so the pair can be used to pretty-print, translate, or otherwise process arbitrarily large documents very quickly. The context of a stream is tracked by the Environment, which stores the current base URI and set of namespace prefixes.
String Views
For performance reasons, most functions in serd that take a string take a SerdStringView, rather than a bare char*. This forces code to be explicit about string measurement, which discourages common patterns of repeated measurement of the same string.
For convenience, several macros are provided for constructing string views:
SERD_EMPTY_STRING()
Constructs a view of an empty string, for example:
SerdStringView empty = SERD_EMPTY_STRING();
SERD_STATIC_STRING()
Constructs a view of a string literal, for example:
SerdStringView hello = SERD_STATIC_STRING("hello");
Note that this measures its argument with sizeof, so care must be taken to only use it with string literals, or the length may be incorrect.
SERD_MEASURE_STRING()
Constructs a view of a string by measuring it with strlen, for example:
SerdStringView view = SERD_MEASURE_STRING(string_pointer);
This can be used to make a view of any string.
SERD_STRING_VIEW()
Constructs a view of a slice of a string with an explicit length, for example:
SerdStringView slice = SERD_STRING_VIEW(string_pointer, 4);
Typically, these macros are used inline when passing parameters, and so can be thought of as syntax for the types of strings in code.
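The difference between measuring with sizeof and measuring with strlen can be seen in a small self-contained sketch. The names DemoStringView, DEMO_STATIC_STRING, demo_measure_string, and demo_string_view are hypothetical stand-ins written for this illustration; they are not serd's definitions and only mimic the behaviour described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a string view: a pointer plus a length */
typedef struct {
  const char* buf;
  size_t      len;
} DemoStringView;

/* Measures with sizeof, so the length is only right for string literals.
   Given a char pointer, sizeof yields the size of the pointer instead. */
#define DEMO_STATIC_STRING(s) ((DemoStringView){(s), sizeof(s) - 1U})

/* Measures with strlen, so it works for any null-terminated string */
DemoStringView
demo_measure_string(const char* const str)
{
  assert(str);
  return (DemoStringView){str, strlen(str)};
}

/* Views the first len bytes of str without measuring anything */
DemoStringView
demo_string_view(const char* const str, const size_t len)
{
  assert(str || !len);
  return (DemoStringView){str, len};
}
```

With a pointer such as const char* pointer = "hello world", the measuring function reports 11 bytes, while the sizeof-based macro would silently report the pointer size minus one, which is the mistake the note above warns about.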
Nodes
Nodes are the basic building blocks of data. Nodes are essentially strings, but also have a SerdNodeType, and optionally either a datatype or a language.
In RDF, a node is either a literal, URI, or blank. Serd can also represent “CURIE” nodes, or shortened URIs, which represent prefixed names often written in documents.
Fundamental Constructors
There are five fundamental node constructors, which can be used to create any node:
serd_new_plain_literal()
Creates a new string literal with an optional language tag.
serd_new_typed_literal()
Creates a new string literal with a datatype URI.
serd_new_blank()
Creates a new blank node ID.
serd_new_curie()
Creates a new shortened URI.
serd_new_uri()
Creates a new URI from a string.
Convenience Constructors
For convenience, many other constructors are also provided which construct common types of nodes:
serd_new_simple_node()
Creates a new simple blank, CURIE, or URI node.
serd_new_string()
Creates a new string literal (with no datatype or language).
serd_new_parsed_uri()
Creates a new URI from a parsed URI view.
serd_new_file_uri()
Creates a new file URI from a path.
serd_new_boolean()
Creates a new boolean literal.
serd_new_decimal()
Creates a new decimal literal.
serd_new_double()
Creates a new double literal.
serd_new_float()
Creates a new float literal.
serd_new_integer()
Creates a new integer literal.
serd_new_blob()
Creates a new binary blob literal using xsd:base64Binary encoding.
The datatype or language, if present, can be retrieved with serd_node_datatype() or serd_node_language(), respectively.
Note that no node has both a datatype and a language.
Statements
A SerdStatement is a tuple of either 3 or 4 nodes: the subject, predicate, object, and optional graph. Statements declare that a subject has some property. The predicate identifies the property, and the object is its value.
A statement is a bit like a very simple machine-readable sentence. The “subject” and “object” are as in natural language, and the predicate is like the verb, but more general. For example, we could make a statement in English about your intrepid author:
drobilla has the first name “David”
We can break this statement into 3 pieces like so:
| Subject | Predicate | Object |
|---|---|---|
| drobilla | has the first name | “David” |
To make a SerdStatement out of this, we need to define some URIs. In RDF, the subject and predicate must be resources with an identifier (for example, neither can be a string). Conventionally, predicate names do not start with “has” or similar words, since that would be redundant in this context.
So, we assume that http://example.org/drobilla is the URI for drobilla, and that http://example.org/firstName has been defined somewhere to be a property with the appropriate meaning. Writing these as the prefixed names eg:drobilla and eg:firstName (assuming the prefix eg stands for http://example.org/), we can make an equivalent SerdStatement:
SerdNode* subject = serd_new_curie("eg:drobilla");
SerdNode* predicate = serd_new_curie("eg:firstName");
SerdNode* object = serd_new_string("David");
SerdStatement* statement = serd_statement_new(
  subject, predicate, object, NULL, NULL);
The last two fields are the graph and the cursor. The graph is another node that can be used to group statements, for example by the URI of the document they were loaded from. The cursor represents the location in a document where the statement was loaded from, if applicable.
Accessing Fields
Statement fields can be accessed with serd_statement_node(), for example:
const SerdNode* s = serd_statement_node(statement, SERD_SUBJECT);
Alternatively, an accessor function is provided for each field:
const SerdNode* p = serd_statement_predicate(statement);
const SerdNode* o = serd_statement_object(statement);
const SerdNode* g = serd_statement_graph(statement);
Every statement has a subject, predicate, and object, but the graph may be null. The cursor may also be null (as it would be in this case), but if available it can be accessed with serd_statement_cursor():
const SerdNode* c = serd_statement_cursor(statement);
Comparison
Two statements can be compared with serd_statement_equals():
if (serd_statement_equals(statement1, statement2)) {
  printf("Match\n");
}
Statements are equal if all four corresponding pairs of nodes are equal. The cursor is considered metadata, and is ignored for comparison.
It is also possible to match statements against a pattern using NULL as a wildcard, with serd_statement_matches():
SerdNode* rdf_type = serd_new_uri(
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#type");
if (serd_statement_matches(statement, NULL, rdf_type, NULL, NULL)) {
  printf("%s has type %s\n",
         serd_node_string(serd_statement_subject(statement)),
         serd_node_string(serd_statement_object(statement)));
}
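The wildcard rule can be sketched in isolation with plain strings standing in for nodes. This is only an illustration of the idea: DemoStatement and demo_statement_matches are hypothetical names, and the treatment of a missing graph is an assumption of this sketch rather than documented serd behaviour.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a statement: four strings, graph may be NULL */
typedef struct {
  const char* subject;
  const char* predicate;
  const char* object;
  const char* graph;
} DemoStatement;

/* A NULL pattern matches anything, otherwise the strings must be equal */
static bool
demo_field_matches(const char* const field, const char* const pattern)
{
  return !pattern || (field && !strcmp(field, pattern));
}

/* Returns true if every non-NULL pattern field equals the statement field */
bool
demo_statement_matches(const DemoStatement* const statement,
                       const char* const          subject,
                       const char* const          predicate,
                       const char* const          object,
                       const char* const          graph)
{
  assert(statement);
  return demo_field_matches(statement->subject, subject) &&
         demo_field_matches(statement->predicate, predicate) &&
         demo_field_matches(statement->object, object) &&
         demo_field_matches(statement->graph, graph);
}
```

Passing NULL for every field matches any statement, and each additional non-NULL field narrows the match.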
Lifetime
A statement only contains const references to nodes; it does not own nodes or manage their lifetimes internally. The cursor, however, is owned by the statement. A statement can be copied with serd_statement_copy():
SerdStatement* copy = serd_statement_copy(statement);
The copied statement will refer to exactly the same nodes, though the cursor will be deep copied.
In most cases, statements actually come from a reader or model, and are managed by them, but a statement owned by the application must be freed with serd_statement_free():
serd_statement_free(copy);
World
So far, we have only used nodes and statements, which are simple independent objects. Higher-level facilities in Serd require a SerdWorld, which represents the global library state.
A program typically uses just one world, which can be constructed using serd_world_new():
SerdWorld* world = serd_world_new();
All “global” library state is handled explicitly via the world. Serd does not contain any static mutable data, allowing it to be used concurrently in several parts of a program, for example in plugins.
If multiple worlds are used in a single program, they must never be mixed: objects “inside” one world cannot be used with objects inside another.
Note that the world is not a database; it only manages a small amount of library state for things like configuration and logging.
Generating Blanks
Blank nodes, or simply “blanks”, are used for resources that do not have URIs. Unlike URIs, they are not global identifiers, and only have meaning within their local context (for example, a document). The world provides a method for automatically generating unique blank identifiers:
const SerdNode* blank = serd_node_copy(serd_world_get_blank(world));
Note that the returned pointer is to a node that will be updated on the next call to serd_world_get_blank(), so it is usually best to copy the node, like in the example above.
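Why the copy matters can be shown with a small stand-alone sketch of a generator that reuses one internal buffer. DemoBlankGenerator, demo_get_blank, and demo_copy_string are hypothetical names for this illustration, and the "b1", "b2" label format is an assumption; serd's actual labels and node storage may differ.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical generator that reuses one internal buffer for every result */
typedef struct {
  unsigned next_id;
  char     buf[16];
} DemoBlankGenerator;

/* Returns a pointer to the internal buffer, overwritten by the next call */
const char*
demo_get_blank(DemoBlankGenerator* const gen)
{
  assert(gen);
  snprintf(gen->buf, sizeof(gen->buf), "b%u", ++gen->next_id);
  return gen->buf;
}

/* Returns a heap-allocated copy that the caller owns and must free() */
char*
demo_copy_string(const char* const str)
{
  assert(str);
  const size_t len  = strlen(str);
  char* const  copy = (char*)malloc(len + 1U);
  if (copy) {
    memcpy(copy, str, len + 1U);
  }
  return copy;
}
```

A pointer kept from the first call sees the second identifier after the next call, while a copy made in between keeps the first one.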
Model
A SerdModel is an indexed set of statements. A model can be used to store any set of data, from a few statements (for example, a protocol message), to an entire document, to a database with millions of statements.
A new model can be created with serd_model_new():
SerdModel* model = serd_model_new(world, SERD_INDEX_SPO);
Combinations of flags can be used to enable different indices, or the storage of graphs and cursors. For example, to be able to quickly search by predicate, and store a cursor for each statement, the flags SERD_INDEX_PSO and SERD_STORE_CURSORS could be added like so:
SerdModel* model = serd_model_new(
  world, SERD_INDEX_SPO | SERD_INDEX_PSO | SERD_STORE_CURSORS);
Model Operations
Models are value-like and can be copied with serd_model_copy() and compared with serd_model_equals():
SerdModel* copy = serd_model_copy(model);
assert(serd_model_equals(copy, model));
When a model is no longer needed, it can be destroyed with serd_model_free():
serd_model_free(model);
Destroying a model invalidates all nodes and statements within that model, so care should be taken to ensure that no dangling pointers are created.
The size of a model in statements can be accessed with serd_model_size() and serd_model_empty():
if (serd_model_empty(model)) {
  printf("Model is empty\n");
} else if (serd_model_size(model) > 1000) {
  printf("Model has over 1000 statements\n");
}
Adding Statements
Statements can be added to the model with serd_model_add():
SerdNode* s = serd_new_uri("http://example.org/thing");
SerdNode* p = serd_new_uri("http://example.org/name");
SerdNode* o = serd_new_string("Thing");
serd_model_add(model, s, p, o, NULL);
Alternatively, if you already have a statement (for example from another model), serd_model_insert() can be used instead. For example, the first statement in one model could be added to another like so:
serd_model_insert(model, serd_model_begin(other_model));
An entire range of statements can be inserted at once with serd_model_add_range(). For example, all statements in one model could be copied into another like so:
SerdRange* all = serd_model_all(other_model, SERD_ORDER_SPO);
serd_model_add_range(model, all);
serd_range_free(all);
Iteration
An iterator is a reference to a particular statement in a model. serd_model_begin() returns an iterator to the first statement in the model, and serd_model_end() returns a sentinel that is one past the last statement in the model:
SerdIter* i = serd_model_begin(model);
if (serd_iter_equals(i, serd_model_end(model))) {
  printf("Model is empty\n");
} else {
  const SerdStatement* s = serd_iter_get(i);
  printf("First statement subject: %s\n",
         serd_node_string(serd_statement_subject(s)));
}
An iterator can be advanced to the next statement with serd_iter_next(), which returns true if the iterator has reached the end:
if (!serd_iter_next(i)) {
  const SerdStatement* s = serd_iter_get(i);
  printf("Second statement subject: %s\n",
         serd_node_string(serd_statement_subject(s)));
}
Iterators are dynamically allocated, and must eventually be destroyed with serd_iter_free():
serd_iter_free(i);
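Because the advancing function reports true at the end rather than on success, a full loop has a slightly unusual shape. The following stand-alone sketch shows one way to write it over a plain array; DemoIter, demo_iter_next, and demo_sum are hypothetical names that only imitate the convention described above and do not use serd itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical iterator over an array of integers */
typedef struct {
  const int* items;
  size_t     n_items;
  size_t     index;
} DemoIter;

/* Advances the iterator and returns true if it has reached the end */
bool
demo_iter_next(DemoIter* const iter)
{
  assert(iter && iter->index < iter->n_items);
  return ++iter->index == iter->n_items;
}

/* Sums every item, checking for the end before the first access */
int
demo_sum(const int* const items, const size_t n_items)
{
  DemoIter iter = {items, n_items, 0U};
  int      sum  = 0;

  for (bool done = (n_items == 0U); !done; done = demo_iter_next(&iter)) {
    sum += iter.items[iter.index];
  }

  return sum;
}
```

The initial emptiness check plays the role of comparing the begin iterator with the end sentinel, so the loop body never dereferences an iterator that is already at the end.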
Ranges
It is often more convenient to work with ranges of statements, rather than iterators to individual statements.
The simplest range, the range of all statements in the model, is returned by serd_model_all():
SerdRange* all = serd_model_all(model, SERD_ORDER_SPO);
The order argument can be used to specify a particular order for statements, which can be useful for optimizing certain algorithms. In most cases, this function is simply used to scan the entire model, so the default SPO (subject, predicate, object) order is appropriate, and is always available.
It is possible to iterate over a range by advancing the begin iterator, in much the same way as advancing an iterator:
if (serd_range_empty(all)) {
  printf("Model is empty\n");
} else {
  const SerdStatement* s = serd_range_front(all);
  printf("First statement subject: %s\n",
         serd_node_string(serd_statement_subject(s)));
}
if (!serd_range_next(all)) {
  const SerdStatement* s = serd_range_front(all);
  printf("Second statement subject: %s\n",
         serd_node_string(serd_statement_subject(s)));
}
Pattern Matching
There are several functions that can be used to quickly find statements in the model that match a pattern. The simplest is serd_model_ask(), which checks if there is any matching statement:
if (serd_model_ask(model, NULL, rdf_type, NULL, NULL)) {
  printf("Model contains a type statement\n");
}
To access the unknown fields, an iterator to the matching statement can be found with serd_model_find() instead:
SerdIter* i = serd_model_find(model, NULL, rdf_type, NULL, NULL);
const SerdNode* instance = serd_statement_subject(serd_iter_get(i));
Similar to serd_model_ask(), serd_model_count() can be used to count the number of matching statements:
size_t n = serd_model_count(model, instance, rdf_type, NULL, NULL);
printf("Instance has %zu types\n", n);
To iterate over the matching statements, serd_model_range() can be used, which returns a range that includes only statements that match the pattern:
SerdRange* range = serd_model_range(model, instance, rdf_type, NULL, NULL);
for (; !serd_range_empty(range); serd_range_next(range)) {
  const SerdStatement* s = serd_range_front(range);
  printf("Instance has type %s\n",
         serd_node_string(serd_statement_object(s)));
}
serd_range_free(range);
Indexing
A model can contain several indices that use different orderings to support different kinds of queries. For good performance, there should be an index where the least significant fields in the ordering correspond to wildcards in the pattern (or, in other words, one where the most significant fields in the ordering correspond to nodes given in the pattern). The table below lists the indices that best support a kind of pattern, where a “?” represents a wildcard in the pattern.
| Pattern | Good Indices |
|---|---|
| s p o | Any |
| s p ? | SPO, PSO |
| s ? o | SOP, OSP |
| s ? ? | SPO, SOP |
| ? p o | POS, OPS |
| ? p ? | POS, PSO |
| ? ? o | OSP, OPS |
| ? ? ? | Any |
If graphs are enabled, then statements are indexed both with and without the graph fields, so queries with and without a graph wildcard will have similar performance.
Since indices take up space and slow down insertion, it is best to enable the fewest indices possible that cover the queries that will be performed. For example, an application might enable just the SPO and OPS orders, because it always searches for specific subjects or objects, but never for just a predicate without specifying any other field.
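The rule behind the table (the fields given in the pattern should be the most significant fields of the ordering) is mechanical enough to sketch as code. The helper below is a hypothetical illustration, not part of serd: it takes the letters of the non-wildcard fields, such as "SP" for the pattern s p ?, and lists the orderings that suit that pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* The six possible triple orderings, most significant field first */
static const char* const demo_orders[6] =
  {"SPO", "SOP", "PSO", "POS", "OSP", "OPS"};

/* Returns true if the bound fields are the leading fields of the order.
   `bound` holds the distinct letters of the non-wildcard fields, like "SP". */
bool
demo_order_suits_pattern(const char* const order, const char* const bound)
{
  assert(order && bound && strlen(order) == 3U && strlen(bound) <= 3U);

  const size_t n_bound = strlen(bound);
  for (size_t i = 0U; i < n_bound; ++i) {
    if (!strchr(bound, order[i])) {
      return false; /* A wildcard field comes before a bound field */
    }
  }

  return true;
}

/* Writes the suitable orders to `out` and returns how many there are */
size_t
demo_good_orders(const char* const bound, const char* out[6])
{
  size_t n = 0U;
  for (size_t i = 0U; i < 6U; ++i) {
    if (demo_order_suits_pattern(demo_orders[i], bound)) {
      out[n++] = demo_orders[i];
    }
  }

  return n;
}
```

For "SP" this yields SPO and PSO, and for an empty or fully bound pattern it yields all six orderings, which agrees with the table.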
Getting Values
Sometimes you are only interested in a single node, and it is cumbersome to first search for a statement and then get the node from it. A more convenient way is to use serd_model_get().
To get a value, specify a triple pattern where exactly one of the subject, predicate, and object is a wildcard.
If a statement matches, then the node that “fills” the wildcard will be returned:
const SerdNode* t = serd_model_get(model, instance, rdf_type, NULL, NULL);
if (t) {
  printf("Instance has type %s\n", serd_node_string(t));
}
If multiple statements match the pattern, then the matching node from an arbitrary statement is returned. It is an error to specify more than one wildcard, excluding the graph.
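The "exactly one wildcard" rule can be sketched over a plain array of string triples. DemoTriple and demo_get are hypothetical names for this illustration only. Unlike the model, which may return the node from an arbitrary match, this sketch simply returns the first match it finds, and it reports a malformed pattern by returning NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a statement: subject, predicate, object */
typedef struct {
  const char* fields[3];
} DemoTriple;

/* Returns the field that fills the single wildcard (NULL) in the pattern,
   or NULL if nothing matches or there is not exactly one wildcard */
const char*
demo_get(const DemoTriple* const triples,
         const size_t            n_triples,
         const char* const       s,
         const char* const       p,
         const char* const       o)
{
  const char* const pattern[3] = {s, p, o};
  int               wildcard   = -1;

  assert(triples || !n_triples);
  for (int i = 0; i < 3; ++i) {
    if (!pattern[i]) {
      if (wildcard >= 0) {
        return NULL; /* More than one wildcard */
      }
      wildcard = i;
    }
  }
  if (wildcard < 0) {
    return NULL; /* No wildcard to fill */
  }

  for (size_t t = 0U; t < n_triples; ++t) {
    bool matches = true;
    for (int i = 0; i < 3; ++i) {
      if (pattern[i] && strcmp(pattern[i], triples[t].fields[i]) != 0) {
        matches = false;
      }
    }
    if (matches) {
      return triples[t].fields[wildcard];
    }
  }

  return NULL;
}
```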
The similar serd_model_get_statement() instead returns the matching statement:
const SerdStatement* ts = serd_model_get_statement(
  model, instance, rdf_type, NULL, NULL);
if (ts) {
  printf("Instance %s has type %s\n",
         serd_node_string(serd_statement_subject(ts)),
         serd_node_string(serd_statement_object(ts)));
}
Erasing Statements
Individual statements can be erased with serd_model_erase(), which takes an iterator:
SerdIter* some_type = serd_model_find(model, NULL, rdf_type, NULL, NULL);
serd_model_erase(model, some_type);
serd_iter_free(some_type);
The similar serd_model_erase_range() takes a range and erases all statements in the range:
SerdRange* all_types = serd_model_range(model, NULL, rdf_type, NULL, NULL);
serd_model_erase_range(model, all_types);
serd_range_free(all_types);
Reading and Writing
Reading and writing documents in a textual syntax is handled by the SerdReader and SerdWriter, respectively. Serd is designed around a concept of event streams, so the reader or writer can be at the beginning or end of a “pipeline” of stream processors. This allows large documents to be processed quickly in an “online” fashion, while requiring only a small constant amount of memory. If you are familiar with XML, this is roughly analogous to SAX.
A common setup is to connect a reader directly to a writer. This can be used for things like pretty-printing, or converting a document from one syntax to another. This can be done by passing the sink returned by serd_writer_sink() to the reader constructor, serd_reader_new().
First, in order to write a document, an environment needs to be created. This defines the base URI and any namespace prefixes, which are used to resolve any relative URIs or prefixed names, and may be used to abbreviate the output. In most cases, the base URI should simply be the URI of the file being written. For example:
SerdStringView host = SERD_EMPTY_STRING();
SerdStringView out_path = SERD_STATIC_STRING("/some/file.ttl");
SerdNode* base = serd_new_file_uri(out_path, host);
SerdEnv* env = serd_env_new(serd_node_string_view(base));
Namespace prefixes can also be defined for any vocabularies used:
serd_env_set_prefix(
  env,
  SERD_STATIC_STRING("rdf"),
  SERD_STATIC_STRING("http://www.w3.org/1999/02/22-rdf-syntax-ns#"));
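To make the role of prefixes concrete, here is a minimal stand-alone sketch of how a prefixed name can be expanded against a prefix table. DemoPrefix and demo_expand are hypothetical names for this illustration; this is not how SerdEnv is implemented, and it ignores base URI resolution entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical prefix table entry: a short name and the URI it stands for */
typedef struct {
  const char* name;
  const char* uri;
} DemoPrefix;

/* Expands "name:local" into `out` using the table. Returns false if there
   is no colon, the prefix is not defined, or `out` is too small. */
bool
demo_expand(const DemoPrefix* const prefixes,
            const size_t            n_prefixes,
            const char* const       curie,
            char* const             out,
            const size_t            out_size)
{
  assert(curie && out && (prefixes || !n_prefixes));

  const char* const colon = strchr(curie, ':');
  if (!colon) {
    return false;
  }

  const size_t name_len = (size_t)(colon - curie);
  for (size_t i = 0U; i < n_prefixes; ++i) {
    if (strlen(prefixes[i].name) == name_len &&
        !strncmp(prefixes[i].name, curie, name_len)) {
      /* Concatenate the namespace URI and the local part after the colon */
      const int n =
        snprintf(out, out_size, "%s%s", prefixes[i].uri, colon + 1);
      return n >= 0 && (size_t)n < out_size;
    }
  }

  return false;
}
```

With the rdf prefix defined as above, the name rdf:type expands to the full URI used for rdf_type elsewhere in this document. A writer can apply the same table in reverse to abbreviate output.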
We now have an environment set up for our document, but still need to specify where to write it. This is done by creating a SerdByteSink, which is a generic interface that can be set up to write to a file, a buffer in memory, or a custom function that can be used to write output anywhere. In this case, we will write to the file we set up as the base URI:
SerdByteSink* out = serd_byte_sink_new_filename(out_path, 4096);
The second argument is the page size in bytes, so I/O will be performed in chunks for better performance. The value used here, 4096, is a typical filesystem block size that should perform well on most machines.
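The effect of a page size can be sketched with a tiny stand-alone buffer that only passes data on when a page fills up or the sink is closed. DemoPagedSink and its functions are hypothetical, the page is deliberately tiny so the behaviour is easy to follow, and the flush merely counts bytes where a real sink would write them to a file. It illustrates the general technique, not the internals of SerdByteSink.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGE_SIZE 8U /* Tiny page, only to keep the example readable */

/* Hypothetical paged sink that buffers output one page at a time */
typedef struct {
  char   page[DEMO_PAGE_SIZE];
  size_t fill;      /* Bytes currently buffered */
  size_t n_flushes; /* Number of times buffered data was passed on */
  size_t n_written; /* Total bytes passed on so far */
} DemoPagedSink;

/* Stand-in for the underlying write, which only does the accounting */
static void
demo_sink_flush(DemoPagedSink* const sink)
{
  if (sink->fill) {
    sink->n_written += sink->fill;
    ++sink->n_flushes;
    sink->fill = 0U;
  }
}

/* Buffers `len` bytes, flushing whenever the page becomes full */
void
demo_sink_write(DemoPagedSink* const sink, const char* buf, size_t len)
{
  assert(sink && buf);
  while (len) {
    const size_t space = DEMO_PAGE_SIZE - sink->fill;
    const size_t n     = len < space ? len : space;

    memcpy(sink->page + sink->fill, buf, n);
    sink->fill += n;
    buf += n;
    len -= n;
    if (sink->fill == DEMO_PAGE_SIZE) {
      demo_sink_flush(sink);
    }
  }
}

/* Flushes whatever is left, as closing a sink would */
void
demo_sink_close(DemoPagedSink* const sink)
{
  assert(sink);
  demo_sink_flush(sink);
}
```

Two five-byte writes cause a single full-page flush, and the remaining two bytes only reach the output on close, which is why closing the sink matters for a complete file.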
With an environment and byte sink ready, the writer can now be created:
SerdWriter* writer = serd_writer_new(world, SERD_TURTLE, 0, env, out);
Output is written by feeding statements and other events to the sink returned by serd_writer_sink(). SerdSink is the generic interface for anything that can consume data streams. Many objects provide the same interface to do various things with the data, but in this case we will send data directly to the writer:
SerdReader* const reader = serd_reader_new(world,
                                           SERD_TURTLE,
                                           0,
                                           env,
                                           serd_writer_sink(writer),
                                           4096);
The third argument of serd_reader_new() takes a bitwise OR of SerdReaderFlag flags that can be used to configure the reader. In this case no flags are given (the argument is 0), but others can be included. For example, SERD_READ_LAX tolerates some invalid input without halting on an error, and passing SERD_READ_LAX | SERD_READ_RELATIVE would enable lax mode and preserve relative URIs in the input.
Now that we have a reader that is set up to directly push its output to a writer, we can finally process the document:
SerdStatus st = serd_reader_read_document(reader);
if (st) {
  printf("Error reading document: %s\n", serd_strerror(st));
}
Alternatively, one “chunk” of input can be read at a time with serd_reader_read_chunk(). A “chunk” is generally one top-level description of a resource, including any anonymous blank nodes in its description, but this depends on the syntax and the structure of the document being read.
The reader pushes events to its sink as input is read, so in this scenario the data should now have been re-written by the writer (assuming no error occurred). To finish and ensure that a complete document has been read and written, serd_reader_finish() can be called followed by serd_writer_finish(). However, these will be automatically called on destruction if necessary, so if the reader and writer are no longer required they can simply be destroyed:
serd_reader_free(reader);
serd_writer_free(writer);
Note that it is important to free the reader first in this case, since finishing the read may push events to the writer. Finally, closing the byte sink will flush and close the output file, so it is ready to be read again later. Similar to the reader and writer, this can be done explicitly with serd_byte_sink_close(), or implicitly with serd_byte_sink_free() if the byte sink is no longer needed:
serd_byte_sink_free(out);
Reading into a Model
A document can be loaded into a model by setting up a reader that pushes data to a model “inserter” rather than a writer:
SerdModel* model = serd_model_new(world, SERD_INDEX_SPO);
SerdSink* inserter = serd_inserter_new(model, NULL);
The process of reading the document is the same as above, only the sink is different:
SerdReader* const reader = serd_reader_new(world,
                                           SERD_TURTLE,
                                           0,
                                           env,
                                           inserter,
                                           4096);
SerdStatus st = serd_reader_read_document(reader);
if (st) {
  printf("Error loading model: %s\n", serd_strerror(st));
}
Writing a Model
A model, or parts of a model, can be written by writing the desired range with serd_write_range():
serd_write_range(serd_model_all(model, SERD_ORDER_SPO),
                 serd_writer_sink(writer),
                 0);
By default, this writes the range in chunks suited to pretty-printing with anonymous blank nodes (like “[ … ]” in Turtle or TriG). The flag SERD_NO_INLINE_OBJECTS can be given to instead write the range in a simple SPO order, which can be useful in other situations because it is faster and emits statements in strictly increasing order.
Stream Processing
The above examples show how a document can be either written to a file or loaded into a model, simply by changing the sink that the data is written to. There are also sinks that filter or transform the data before passing it on to another sink, which can be used to build more advanced pipelines with several processing stages.
Canonical Literals
The “canon” is a stream processor that converts literals with supported XSD datatypes into canonical form. For example, this will rewrite an xsd:decimal literal like “.10” as “0.1”. A canon is created with serd_canon_new(), which needs to be passed the “target” sink that the transformed statements should be written to, for example:
SerdSink* canon = serd_canon_new(world, inserter, 0);
The last argument is a bitwise OR of SerdCanonFlag flags. For example, SERD_CANON_LAX will tolerate and pass through invalid literals, which can be useful for cleaning up questionable data as much as possible without losing any information.
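As an illustration of what canonical form means for one datatype, the following stand-alone sketch rewrites an xsd:decimal string so that it has no plus sign, no redundant zeros, and at least one digit on each side of the point. The function demo_canonical_decimal is hypothetical and is not the code serd uses. It handles only decimals, and writing zero without a sign is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Writes the canonical form of the decimal `in` to `out`.
   Returns false if `in` is not a decimal or `out` is too small. */
bool
demo_canonical_decimal(const char* in, char* const out, const size_t out_size)
{
  assert(in && out);

  /* Note the sign, then skip it */
  const bool negative = (*in == '-');
  if (*in == '-' || *in == '+') {
    ++in;
  }

  /* Scan the digits before the point */
  const char* int_start = in;
  while (isdigit((unsigned char)*in)) {
    ++in;
  }
  const char* const int_end = in;

  /* Scan the digits after the point, if there is one */
  const char* frac_start = in;
  const char* frac_end   = in;
  if (*in == '.') {
    frac_start = ++in;
    while (isdigit((unsigned char)*in)) {
      ++in;
    }
    frac_end = in;
  }

  /* Reject trailing garbage and strings with no digits at all */
  if (*in != '\0' || (int_start == int_end && frac_start == frac_end)) {
    return false;
  }

  /* Drop leading zeros before the point and trailing zeros after it */
  while (int_start < int_end && *int_start == '0') {
    ++int_start;
  }
  while (frac_end > frac_start && frac_end[-1] == '0') {
    --frac_end;
  }

  const size_t int_len  = (size_t)(int_end - int_start);
  const size_t frac_len = (size_t)(frac_end - frac_start);
  const bool   minus    = negative && (int_len || frac_len);
  const size_t needed   = (minus ? 1U : 0U) + (int_len ? int_len : 1U) + 1U +
                        (frac_len ? frac_len : 1U) + 1U;
  if (needed > out_size) {
    return false;
  }

  /* Write sign, integer digits, point, and fraction digits */
  char* o = out;
  if (minus) {
    *o++ = '-';
  }
  if (int_len) {
    memcpy(o, int_start, int_len);
    o += int_len;
  } else {
    *o++ = '0';
  }
  *o++ = '.';
  if (frac_len) {
    memcpy(o, frac_start, frac_len);
    o += frac_len;
  } else {
    *o++ = '0';
  }
  *o = '\0';
  return true;
}
```

A strict processor would stop when this returns false, whereas a lax one in the spirit of SERD_CANON_LAX would pass the original string through unchanged.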
Filtering Statements
The “filter” is a stream processor that filters statements based on a pattern. It can be configured in either inclusive or exclusive mode, which passes through only statements that match or don’t match the pattern, respectively. A filter is created with serd_filter_new(), which takes a target, pattern, and inclusive flag.
For example, a model could be loaded with only the statements that have predicate rdf:type:
SerdSink* filter = serd_filter_new(
  inserter, // Target
  NULL,     // Subject
  rdf_type, // Predicate
  NULL,     // Object
  NULL,     // Graph
  true);    // Inclusive
If false is passed for the last parameter instead, then the filter operates in exclusive mode and will instead insert only statements that do not have predicate rdf:type, so all type statements are filtered out.
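The way a filter sits between a source and a target sink can be sketched without serd at all. In the stand-alone example below, a sink is a callback plus a handle, and the filter forwards a statement to its target only when the match result agrees with the inclusive flag. All names (DemoEvent, DemoSink, DemoFilter, DemoCounter) are hypothetical, and for brevity the pattern covers only the predicate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical statement event with plain strings for nodes */
typedef struct {
  const char* subject;
  const char* predicate;
  const char* object;
} DemoEvent;

/* A sink is a callback together with the handle it operates on */
typedef struct {
  void* handle;
  void (*on_statement)(void* handle, const DemoEvent* statement);
} DemoSink;

/* Terminal sink state that just counts what it receives */
typedef struct {
  size_t n_statements;
} DemoCounter;

/* Filter state: where to forward, what to match, and in which mode */
typedef struct {
  const DemoSink* target;
  const char*     predicate; /* NULL is a wildcard */
  bool            inclusive;
} DemoFilter;

static void
demo_count(void* const handle, const DemoEvent* const statement)
{
  (void)statement;
  ++((DemoCounter*)handle)->n_statements;
}

static void
demo_filter(void* const handle, const DemoEvent* const statement)
{
  const DemoFilter* const filter = (const DemoFilter*)handle;
  const bool              matches =
    !filter->predicate || !strcmp(filter->predicate, statement->predicate);

  /* Inclusive forwards matches, exclusive forwards everything else */
  if (matches == filter->inclusive) {
    filter->target->on_statement(filter->target->handle, statement);
  }
}

DemoSink
demo_counter_sink(DemoCounter* const counter)
{
  assert(counter);
  return (DemoSink){counter, demo_count};
}

DemoSink
demo_filter_sink(DemoFilter* const filter)
{
  assert(filter && filter->target);
  return (DemoSink){filter, demo_filter};
}
```

Because the filter exposes the same sink interface as its target, stages like this can be chained in any order, which is the property that makes longer pipelines possible.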