Improving LV2 Discovery Performance
LV2 plugins (and other data, like presets) are discovered by parsing Turtle files installed in various bundle directories.
If the host only needs to know which things are installed, then only manifest.ttl files need to be loaded.
That takes only a fraction of a second (60 ms on this machine, which has 897 plugins installed), but if the host needs more information (like the plugin's label or what ports it has), then the data files need to be loaded as well, which can take several seconds.
The parser (from serd) is very fast, so the overwhelming majority of this time is spent inserting data into a model.
I was recently doing some work on lilv related to discovery, and got sidetracked into investigating how much of this overhead could be eliminated. Quite a bit, as it turns out, but before explaining how, I'll give a brief summary of the relevant fundamentals so people who don't spend their days in the triple mines can understand what I'm talking about.
LV2 data is written in the Turtle syntax, which can have a lot of structure, but is ultimately just shorthand for a “flat” series of triples. For example, the abbreviated Turtle
<http://example.org/amp>
    a lv2:Plugin ;
    doap:name "Amplifier" .
represents the two triples
<http://example.org/amp> rdf:type lv2:Plugin .
<http://example.org/amp> doap:name "Amplifier" .
which describe a plugin with the name “Amplifier”.
Two implications of this format are important to understand here:
- All data is a set of triples which can be stored in some lexicographical order to be quickly searchable (for example, a set ordered subject-first can be used to quickly find triples that start with a given subject).
- All data can be streamed via a simple “flat” interface (for example a function with three parameters), and is trivial to inspect and/or filter on the fly (much like line-based text in POSIX pipelines).
The data may not be stored as a “literal” ordered set of triples (they're actually quads in memory for one thing), but this simplified way of thinking about it is good enough for this explanation.
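To make that streaming idea concrete, here is a minimal sketch of such a “flat” interface in C. It only illustrates the shape of the thing (a callback that receives one complete triple at a time); it is not serd's actual API, and the names (TripleSink, print_triple) are made up for the example.

    #include <stdio.h>

    /* A sink receives one complete triple per call, no matter how the
       source Turtle was abbreviated. */
    typedef void (*TripleSink)(void*       handle,
                               const char* subject,
                               const char* predicate,
                               const char* object);

    /* An example sink that just prints each triple as it streams past. */
    static void
    print_triple(void* handle, const char* s, const char* p, const char* o)
    {
      (void)handle;
      printf("%s %s %s .\n", s, p, o);
    }

    int
    main(void)
    {
      /* A parser would drive the sink once per statement, for example: */
      TripleSink sink = print_triple;
      sink(NULL, "<http://example.org/amp>", "rdf:type", "lv2:Plugin");
      sink(NULL, "<http://example.org/amp>", "doap:name", "\"Amplifier\"");
      return 0;
    }

Anything downstream of such a sink can inspect or filter triples on the fly, or pass them along to be inserted into an ordered index, which brings us to the question of ordering.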
Which order is best depends on the query.
For example (borrowing a bit of syntax from SPARQL, where ?name represents a query variable), if the host asks something like “what's the name of this plugin”, or
<http://example.org/amp> doap:name ?name .
then it needs to find triples that start with the given subject and predicate (<http://example.org/amp> and doap:name) to see what object (?name) they have.
So, a subject-predicate-object or “SPO” index will work well.
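In lilv terms, a host can express exactly this kind of lookup with lilv_world_find_nodes(), passing the known subject and predicate and NULL as the object wildcard. A rough sketch (error handling omitted, and the plugin URI is just the example from above, not a real installed plugin):

    #include <lilv/lilv.h>
    #include <stdio.h>

    int
    main(void)
    {
      LilvWorld* world = lilv_world_new();
      lilv_world_load_all(world);

      /* Known subject and predicate, wildcard (NULL) object: an SPO-shaped query. */
      LilvNode* plugin = lilv_new_uri(world, "http://example.org/amp");
      LilvNode* name   = lilv_new_uri(world, "http://usefulinc.com/ns/doap#name");

      LilvNodes* names = lilv_world_find_nodes(world, plugin, name, NULL);
      LILV_FOREACH (nodes, i, names) {
        printf("%s\n", lilv_node_as_string(lilv_nodes_get(names, i)));
      }

      lilv_nodes_free(names);
      lilv_node_free(name);
      lilv_node_free(plugin);
      lilv_world_free(world);
      return 0;
    }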
If, however, the host asks something like “which things are plugins” or
?plugin rdf:type lv2:Plugin .
then an SPO index doesn't help because there's no known subject to search for. Most queries are like this, with either a subject or object wildcard (querying relatively “fixed” predicates is rare), so lilv has always had both an SPO and OPS index to support them.
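The same lilv call can express this shape too, but now the wildcard is the subject and the fixed terms are the predicate and object, which is exactly the kind of query an OPS index serves (in practice a host would use lilv_world_get_all_plugins() for this particular case). A sketch, assuming the world from the previous example is already loaded:

    #include <lilv/lilv.h>
    #include <stdio.h>

    /* Find everything with rdf:type lv2:Plugin: a wildcard-subject query. */
    static void
    print_plugin_uris(LilvWorld* world)
    {
      LilvNode* rdf_type   = lilv_new_uri(world, LILV_NS_RDF "type");
      LilvNode* lv2_Plugin =
        lilv_new_uri(world, "http://lv2plug.in/ns/lv2core#Plugin");

      LilvNodes* plugins =
        lilv_world_find_nodes(world, NULL, rdf_type, lv2_Plugin);

      LILV_FOREACH (nodes, i, plugins) {
        printf("%s\n", lilv_node_as_string(lilv_nodes_get(plugins, i)));
      }

      lilv_nodes_free(plugins);
      lilv_node_free(lv2_Plugin);
      lilv_node_free(rdf_type);
    }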
Having both indices is convenient, but twice the indexing means roughly twice the overhead. The SPO index is the natural order used in the syntax, and supports most code (which mainly looks up properties of known things), but the OPS index isn't used as much. Can it be removed entirely to speed things up? It is used for many important things, but conveniently, there are only a few fixed cases of those (like finding plugins as above). So, we can take advantage of the streaming nature of the data to record this information while it's being read, instead of inserting it into an index only to query it back out later.
To do this in the implementation, I introduced the concept of a “skimmer”.
A skimmer inserts triples into a model as usual, but first “skims” them and records items of interest for later use. For example, a skimmer checks whether a triple has rdf:type lv2:Plugin, and if so, records the subject in a result set that stores only plugin URIs.
Some cases are a bit trickier, and actually pulling this off cleanly took quite an overhaul, but details aside, it turns out that this approach can be used to eliminate the OPS index entirely without losing any functionality.
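As an illustration of the idea only (these names are invented for the example, and lilv's internal implementation looks different), a skimmer can be pictured as a wrapper around a statement sink that peeks at each triple before inserting it:

    #include <string.h>

    /* Hypothetical skimmer: records interesting subjects on the way past,
       then inserts into the model (with only an SPO index) as usual. */
    typedef struct {
      void (*record_plugin)(void* results, const char* uri);
      void (*insert)(void* model, const char* s, const char* p, const char* o);
      void* results;
      void* model;
    } Skimmer;

    static void
    skim_triple(Skimmer* skimmer, const char* s, const char* p, const char* o)
    {
      /* Skim: if this triple says "s is a plugin", remember s. */
      if (!strcmp(p, "http://www.w3.org/1999/02/22-rdf-syntax-ns#type") &&
          !strcmp(o, "http://lv2plug.in/ns/lv2core#Plugin")) {
        skimmer->record_plugin(skimmer->results, s);
      }

      /* Insert as usual; no OPS index is needed to find plugins later. */
      skimmer->insert(skimmer->model, s, p, o);
    }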
How much of an improvement does this make?
On this machine, using lv2ls -n as a crude benchmark, the time goes from 3.11 to 1.72 seconds, or about a 45% improvement.
The memory consumption goes down a bit as well, from (even more crudely) a max-RSS of about 138 to 112 MiB, or about a 19% improvement (everything here is an average of three runs).
The improvement will vary dramatically based on what's installed. For example, in some earlier tests with a different configuration, I saw drops from around 7 to 4 seconds. In general, though, not bad!
Are there any downsides to this? Yes, primarily two:
- As with any significant change, there is the possibility of regression. Discovery is a little more restricted than before, so although I've done my best to test things, there is a possibility of things not being discovered anymore. Usually the fix for this is adding information to the manifest that should have been there anyway.
- Though the implementation itself doesn't make queries that require the OPS index, hosts themselves can, since lilv provides generic query functions. To support this, there's a new boolean option, LILV_OPTION_OBJECT_INDEX, which can be used to enable or disable the OPS index. Since most hosts don't need it, I decided to disable it by default. That will be a regression for hosts that do need it, however; they must explicitly opt in to the old behaviour by enabling this option, as sketched below. A warning is printed if a query will be slow because of the missing index, so it should at least be obvious if this happens.
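For hosts that do need such queries, opting back in should look roughly like the following, assuming the new option is set with lilv_world_set_option() like the existing boolean options (for example LILV_OPTION_FILTER_LANG):

    #include <lilv/lilv.h>
    #include <stdbool.h>

    int
    main(void)
    {
      LilvWorld* world = lilv_world_new();

      /* Assumption: enable the OPS index before loading any data, the same
         way other boolean options are set. */
      LilvNode* yes = lilv_new_bool(world, true);
      lilv_world_set_option(world, LILV_OPTION_OBJECT_INDEX, yes);
      lilv_node_free(yes);

      lilv_world_load_all(world);

      /* ... generic wildcard-subject queries are now indexed as before ... */

      lilv_world_free(world);
      return 0;
    }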
These changes are now in the main branch in git, which will be released soon, probably as version 0.26.0.