Beautiful C and C++ Documentation with Sphinx

Posted on 2020-11-26 17:12 in Hacking, C, Cpp, LAD, LV2

Like many, I've long suffered under the antiquated and inflexible HTML documentation generated by Doxygen. Having recently worked on some Python documentation using Sphinx, though, I found it powerful and pleasant enough to use. It also has a way of encouraging actually writing documentation, rather than just generating a dump of glorified comments, which is a good thing. Though I'm not at all a fan of ReStructuredText syntax (which at times seems like it's trying to be cryptic on purpose), Sphinx is undeniably powerful, and I like the "assemble a bunch of plainish text files" approach in general. The support for multiple languages is also very appealing, though not without its problems, as we'll get to.

So, is it possible to use Sphinx to generate documentation for C and C++ libraries? Yes! As explained somewhat recently in a post by Sy Brand, there is a project called Breathe that integrates Doxygen (for extracting documentation) with Sphinx (for generating output). That sounded promising, so I attempted to migrate a library to using Breathe instead of Doxygen's HTML support. Unfortunately, though, I encountered quite a few roadblocks where I couldn't quite get output that I was happy with. Worse, the project itself is very complicated, and as I poked around in swaths of originally generated but manually modified code, I decided that Breathe was not for me. That would feel like just exchanging one inflexible and unhackable system for another.

What, then, to do? Though I realize that deep integration via modules like Breathe is usually the way things are done with Sphinx, I am a KISS sort of person, so I like to think of it as something more like a Static Site Generator: it reads a bunch of plainish text input files, and outputs HTML (or whatever other presentation format). How do we describe C and C++ things in Sphinx? It turns out that recent versions have built-in support for these "domains" now, which define markup for describing everything in these languages. This means that everything to do with nicely formatting and cross-referencing C and C++ is already dealt with out of the box. Excellent.

So, taking a step back and assessing the situation: we have some XML files that describe the documentation, and we have a tool that reads text files and produces nice documentation. This strikes me as a relatively straightforward task for a nice and simple "files in, files out" script, not somewhere a Goldbergian contraption that mashes Doxygen into Sphinx is required. So, after investigating any other promising options (no such luck), I resigned myself to trying to write such a thing, at the very least to see if it's feasible. I certainly have no time or interest in writing and maintaining a Documentation System, but a self-contained script to convert one thing to another seems reasonable enough.

As it turns out, I wouldn't call it trivial, but it's certainly feasible. I ended up with a ~700 line Python script that does everything I need (though this is of course not the same as everything possible). It's a bit "gluey" and makes some assumptions about the structure and so on, but it does the job and is something I feel I can maintain as necessary. I won't be publishing or supporting this as an independent project any time soon, and make no claims about it being general purpose, but feel free to steal it if any of this sounds appealing.

With this, I was able to get around some long-standing gripes I have with Doxygen, and easily make whatever I wanted to happen a reality, so I'm pretty happy with this approach. Everything is nicely decoupled, so I don't feel over-invested in any of the tools involved. If, for example, someone finally writes a good clang-based extractor that gains traction (JSON please, I did not enjoy this revisitation of the horrors of XML at all), I should be able to switch to using that easily enough. I've actually found this somewhat crude and UNIXey approach quite convenient: you can simply look at the ReST files to understand what is happening, or tweak them a bit and run Sphinx to test what you're aiming for, and so on. Text files are good.

So, after however many years, I think I've found an approach to documentation I'm actually quite happy with, that can support all of the languages that I use, and in general doesn't seem to get in my way. Hooray. For starters, I did my window system portability layer, Pugl. The generated documentation for the C API can be seen at https://lv2.gitlab.io/pugl/c/singlehtml/, and the C++ at https://lv2.gitlab.io/pugl/cpp/singlehtml/. This is more or less the standard Alabaster theme with a few tweaks, which I'm not sure feels appropriate for API documentation (and is much more bloated with a bunch of Javascript than I'd like), but it's pretty enough, at least. I'll tinker with themes later when I feel like jumping down that rabbit hole.

The slightly cumbersome links are an artifact of the one problem I encountered using Sphinx domains: you can't really document C and C++ APIs nicely in the same documentation set. If you use the cpp domain everywhere, you get name mangling in links even for C symbols, which is really unfortunate, and you can't really mix them. To take a contrived example, if you have a struct MylibThing in C, then a type alias in C++ like using Thing = MylibThing, Sphinx isn't clever enough to figure out that MylibThing is from C, and will generate warnings and not link correctly. Perhaps someday it will, which would be nice, but for now I opted to simply generate completely separate documentation sets. This means the C documentation is duplicated in the C++ documentation so that things can be hyperlinked, which isn't ideal, but I can live with it. A certain amount of redundancy is inherent in multi-language documentation anyway.

As I add Python bindings to most libraries, having a unified documentation system for all of these languages will be very nice. There is one additional thing I'll need at some point for the LV2 documentation in particular: a domain for RDF properties and classes. The LV2 documentation really suffers from an unnatural code (via Doxygen) and data (via lv2specgen) documentation split, and my hope is that Sphinx can provide a nice environment for writing documentation that refers to both worlds freely. That, unfortunately, will be much more work, but hopefully writing a custom Sphinx domain isn't too hard...