Case study: using Zig build system to build a complex R and JavaScript project
14 October 2024
Overview
This is a case study in the use of the Zig build system to build a project consisting of 15 R packages and substantial JavaScript into an isolated runtime environment suitable for development and testing under Linux.
The project is RCloud, which currently uses a collection of Bash, R and JavaScript scripts to build itself from source. It relies on public binary package repositories to include its R and JS dependencies.
The Zig solution I developed has several parts, described below.
Implementation
The Zig build of RCloud uses two stages:
- `zig build update` generates a `build.zig` file which describes the steps necessary to build all 15 packages and their current 64 transitive dependencies.
- `zig build` (install) downloads dependencies and builds all packages and the necessary JavaScript bundles, resulting in all necessary artifacts being placed in `zig-out`. This stage imports the `build.zig` file generated in the previous step.
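In day-to-day use, the two stages look like this (a sketch; flags and output are elided):

```sh
# Regenerate generated/build.zig from config.json and the local packages.
zig build update

# Fetch assets, build all R packages and JavaScript bundles,
# and install everything under zig-out.
zig build
```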
A JSON configuration file is used by both steps. The file declares the public R repositories which should be searched for dependencies. For RCloud, those are CRAN and RForge. The file also lists every dependent source tarball by package name, along with its hash. The first time `zig build update` is run, this asset list can be empty, and it is filled by the code generator.

Subsequent runs of `zig build update` will query the latest versions from the configured repositories, and update the dependent packages to point to their latest versions.
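As a rough illustration, such a `config.json` might look like the following; the field names here are invented for this sketch and do not necessarily match the real schema:

```json
{
  "repositories": [
    "https://cran.r-project.org",
    "https://rforge.net"
  ],
  "assets": {
    "stringr": {
      "tarball": "stringr_1.5.1.tar.gz",
      "hash": "sha256-..."
    }
  }
}
```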
Initially, `zig build update` sets the hashes to empty strings. The main build step is responsible for fetching tarballs and verifying hashes. When it encounters an empty string, it issues a warning and updates the hash. If the hash field is non-empty, it must match the computed hash of the downloaded asset, or else the build will fail.
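That policy fits in a few lines. This is only a sketch of the behaviour just described, not the actual `fetch-assets` source:

```zig
const std = @import("std");

/// Returns true if the declared hash is usable as-is; false means the
/// caller should record the computed hash back into config.json.
fn checkAssetHash(declared: []const u8, computed: []const u8) !bool {
    if (declared.len == 0) {
        // First run after `zig build update`: warn and adopt the
        // computed hash.
        std.log.warn("no hash declared; recording {s}", .{computed});
        return false;
    }
    // A declared hash must match exactly, or the build fails.
    if (!std.mem.eql(u8, declared, computed)) return error.HashMismatch;
    return true;
}
```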
Care is taken to enable correct incremental builds by leveraging Zig build's cache. For example, downloaded assets are not re-downloaded. Local R packages are only rebuilt if any of their source files change. Same for the JavaScript bundles.
Outer build script
(Some of the code samples have been edited for this article; links to the full code are at the end.)
The top-level build file imports build tools from an external package, r-build-zig, and imports a generated `build.zig` file. (For the initial bootstrap, a minimal initial `generated/build.zig` file must be provided.)
```zig
const std = @import("std");
const r_build_zig = @import("r-build-zig");
const generated_build = @import("generated/build.zig");
```
The main build entry point declares the overall organisation of the build. There are four major pieces: fetching the assets and building, building the htdocs (JavaScript), generating the R build script given a set of local directories and the `config.json` file, and making a distribution tarball.
```zig
pub fn build(b: *std.Build) !void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});
    const config_path = "config.json";

    const update_step = b.step("update", "Generate R package build files");
    const tarball_step = b.step("dist", "Make a source archive");

    // declare build install rules
    try fetch_assets_and_build(b, config_path, target, optimize);

    // declare rules for htdocs
    build_htdocs(b);

    // declare step: update
    try generate_build_script(
        b,
        config_path,
        &.{
            "packages",
            "rcloud.client",
            "rcloud.packages",
            "rcloud.support",
        },
        update_step,
        target,
        optimize,
    );

    // declare step: dist
    try make_tarball(b, tarball_step);
}
```
The main build, which relies on a generated `build.zig` file, is driven by the following function. It uses the build tool `fetch-assets` provided by r-build-zig to fetch and cache all third-party sources, in parallel and with hash checksum verification. Then it calls into the generated `build.zig` file, supplying the assets directory as an argument, to generate the build rules to build everything.
```zig
fn fetch_assets_and_build(
    b: *Build,
    config_path: []const u8,
    target: ResolvedTarget,
    optimize: OptimizeMode,
) !void {
    // get the fetch-assets tool
    const exe = b.dependency("r-build-zig", .{
        .target = target,
        .optimize = optimize,
    }).artifact("fetch-assets");

    // run fetch-assets tool
    const step = b.addRunArtifact(exe);
    _ = step.addFileArg(b.path(config_path));
    const out_dir = step.addOutputDirectoryArg("assets");

    // supply assets dir to generated build script
    try generated_build.build(b, out_dir);
}
```
Generating build.zig
This next section declares the steps required to run the external tool which generates another `build.zig` file, and copies it into the source directory. It is triggered by `zig build update`. The tool accepts a variable number of R package source code directories, which it searches recursively to discover package definitions. It uses those definitions to discover internal and external dependencies. The outer script copies the generated `build.zig` artifact to a known location in the source tree.
```zig
fn generate_build_script(
    b: *Build,
    config_path: []const u8,
    relative_source_package_paths: []const []const u8,
    update_step: *Step,
    target: ResolvedTarget,
    optimize: OptimizeMode,
) !void {
    const exe = b.dependency("r-build-zig", .{
        .target = target,
        .optimize = optimize,
    }).artifact("generate-build");

    const step = b.addRunArtifact(exe);
    _ = step.addArg(config_path);
    const out_dir = step.addOutputDirectoryArg("deps");
    for (relative_source_package_paths) |path| {
        _ = step.addArg(path);
    }

    // copy the generated build.zig file to generated directory
    const uf = b.addUpdateSourceFiles();
    uf.addCopyFileToSource(out_dir.path(b, "build.zig"), "generated/build.zig");
    update_step.dependOn(&uf.step);
}
```
Generated build rules
The following segment shows the Zig build code generated by the `generate-build` tool for a particular package. It will invoke `R CMD INSTALL`, supplying a library directory where R can find this package's dependencies and will place the built artifacts. It captures `stdout` and `stderr`, saving the former and discarding the latter.

It then declares explicit dependencies on prior steps which will have built its required packages. These requirements are determined by parsing the repository metadata about the package.

The final steps copy the artifacts to the output directories. The `addFileInput` line is discussed in the Caching section.
```zig
const stringr = b.addSystemCommand(&.{"R"});
stringr.addArgs(&.{ "CMD", "INSTALL", "-l" });
_ = stringr.addDirectoryArg(libdir.getDirectory());
_ = stringr.addFileArg(asset_dir.path(b, "stringr_1.5.1.tar.gz"));
stringr.step.name = "stringr";
const stringr_out = stringr.captureStdOut();
_ = stringr.captureStdErr();

stringr.step.dependOn(&cli.step);
stringr.step.dependOn(&glue.step);
stringr.step.dependOn(&lifecycle.step);
stringr.step.dependOn(&magrittr.step);
stringr.step.dependOn(&rlang.step);
stringr.step.dependOn(&stringi.step);
stringr.step.dependOn(&vctrs.step);

// see "Caching" section
stringr.addFileInput(b.path("config.json"));

const stringr_install = b.addInstallDirectory(.{
    .source_dir = libdir.getDirectory().path(b, "stringr"),
    .install_dir = .{ .custom = "lib" },
    .install_subdir = "stringr",
});
stringr_install.step.dependOn(&stringr.step);
b.getInstallStep().dependOn(&stringr_install.step);
b.getInstallStep().dependOn(&b.addInstallFileWithDir(
    stringr_out,
    .{ .custom = "logs" },
    "stringr.log",
).step);
```
Building JavaScript
The build script drives external tools to install `npm` requirements and then invoke a binary tool installed in `node_modules`. For caching purposes, note the `addFileInput` calls as well as `expectExitCode`. Since neither of these tools depends on its command-line arguments (they have none, in fact), we use `addFileInput` to tell Zig which files determine whether the steps need to be re-run.
Both tools generate artifacts inside the source directory, so we first copy our source to the cache and run the tools from the cache directory, to keep our source directories clean.
The `grunt` tool generates binary artifacts inside our source directories, so we add an installation step which copies those directories to the install location, but only after the `grunt` step.
```zig
fn build_htdocs(b: *Build) void {
    const wf = b.addWriteFiles();

    // copy htdocs source
    _ = wf.addCopyDirectory(b.path("htdocs"), "htdocs", .{});

    // install js requirements
    const npm_ci = b.addSystemCommand(&.{ "npm", "ci" });
    npm_ci.setCwd(wf.getDirectory());
    npm_ci.addFileInput(wf.addCopyFile(b.path("package.json"), "package.json"));
    npm_ci.addFileInput(wf.addCopyFile(b.path("package-lock.json"), "package-lock.json"));
    npm_ci.expectExitCode(0);

    // run grunt
    const grunt = b.addSystemCommand(&.{"node_modules/grunt-cli/bin/grunt"});
    grunt.setCwd(wf.getDirectory());
    grunt.addFileInput(wf.addCopyFile(b.path("Gruntfile.js"), "Gruntfile.js"));
    grunt.addFileInput(wf.addCopyFile(b.path("VERSION"), "VERSION"));
    grunt.addFileInput(wf.getDirectory().path(b, "node_modules/grunt-cli/bin/grunt"));
    grunt.expectExitCode(0);

    // which depends on npm_ci
    grunt.step.dependOn(&npm_ci.step);

    // add an install step for post-grunt htdocs
    const htdocs_install = b.addInstallDirectory(.{
        .source_dir = wf.getDirectory().path(b, "htdocs"),
        .install_dir = .prefix,
        .install_subdir = "htdocs",
    });
    htdocs_install.step.dependOn(&grunt.step);

    // install built htdocs files
    b.getInstallStep().dependOn(&htdocs_install.step);
}
```
Caching
Caching in the Zig build system is an important part of its operation. It saves development time by only building artifacts when their dependencies have changed. This introduces special considerations when building atypical artifacts.
Zig uses sensible heuristics to determine when an artifact needs to be rebuilt, and it provides an API for the developer to add additional dependencies. The main heuristic when using external tools is that given identical inputs, a tool will produce identical outputs. Therefore, if the inputs haven't changed and the output is still in the cache, there is no need to run the tool again. Zig provides options to override this behaviour, for example to specify that a tool should always be run, regardless of its cache status.
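For example, a run step can be marked as having side effects, which prevents Zig from ever skipping it. A minimal sketch (the command here is a placeholder):

```zig
// Run steps are normally skipped when their inputs are unchanged;
// setting has_side_effects forces the command to run on every build.
const tool = b.addSystemCommand(&.{"./refresh-something"});
tool.has_side_effects = true;
```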
In the `fetch-assets` run step shown earlier, the `step` is defined to run an `exe`, and it accepts a single `addFileArg` argument, which is our config file. Zig will hash the contents of the file and skip running this step on subsequent builds if the hash is unchanged.
```zig
const step = b.addRunArtifact(exe);
_ = step.addFileArg(b.path(config_path));
const out_dir = step.addOutputDirectoryArg("assets");
```
This heuristic is good for simple cases. However, if a tool should also be re-run when some external files (not provided as arguments) have changed, Zig provides the `addFileInput` function to identify those files. From the generated build code:
```zig
const stringr = b.addSystemCommand(&.{"R"});
stringr.addArgs(&.{ "CMD", "INSTALL", "-l" });
// ...
stringr.addFileInput(b.path("config.json"));
```
This will force the `R CMD INSTALL` tool to be run again if the contents of `config.json` change.
Limitations and future work
There are many limitations in the current work. For the build process, some of the build caching could be more precise. For example, editing the configuration file will cause all assets to be re-downloaded and all packages to be rebuilt. In practice, this is not severe, because updates to the configuration file should only happen when dependencies are being added, deleted or updated; and in any of those cases, a complete system rebuild is advisable. However, it would be nice if a full rebuild were optional during development.
Version constraints with two-sided inequalities are not supported. While an analysis of the CRAN package repository reveals that there are no packages using two-sided constraints, in a complex environment such as RCloud it is conceivable that some local packages may use `>=` constraints and others may use `<` constraints to avoid bugs introduced by new releases. Reconciling these to find a suitable package version is not currently supported by our system. While this limitation does not pose a practical problem at the time of this writing, it is conceivable that CRAN's policies and practices may not be adopted by other R package repositories.
Initially, I was unable to efficiently use the Zig package manager (`zig fetch`) to manage R package tarballs, because Zig extracts the contents and discards the archive. Upon reflection, it may be possible to accommodate R's expectations and install dependencies from a source directory. If this is possible, it would remove the need for a separate `fetch-assets` tool.
Wish lists
Zig wish list
Explicitly support binary assets in build.zig.zon files
Zig currently expects dependencies to be in specific formats. For example, it expects that a `.tar.gz` dependency will be a source tree tarball. After fetching, it untars the source and hashes its contents. The original tarball is not available via the build system.
This conflicts with `R CMD INSTALL`'s expectation of a tarball as an input argument. Passing the extracted source tree to R is also not viable, as R performs an in-source build, leaving artifacts inside the directory. Some package builds are not idempotent, so this causes a failure on subsequent builds. We can mitigate this by copying the extracted directory in a separate step, or even re-archiving it to recover a `.tar.gz` file, but this adds complexity.
Of course, this is a limitation of R, not Zig, but if there were a way to specify to Zig that a particular dependency should be considered an opaque binary, it would make these atypical uses more convenient.
Zig's build cache and `build.zig.zon` file could be a very useful binary asset manager, and it would obviate the need to supply tools like `fetch-assets`.
Make the Zig build system more extensible
Currently, a Zig build script can be imported by another package's build script only if it is fully self-contained. A build script may itself import dependencies from other packages, but doing so makes that build script impossible to import from a different package.
The two-step build process with code generation used in this case study was required by this limitation. Ideally, the top-level build script would be able to import the external library providing the R-specific dependency resolution (and its dependencies), and apply the Zig build rules to the current Build object in a one-step build process.
R wish list
Expose private functions for dependency information
Several of the internals of R's build planning are in private functions. It would be nice if they could be made part of the public API. The fact that there are several third-party attempts to provide reproducible build systems for R suggests there may be a need for a more complete API.
Pre-flight build plans before starting
Currently, when installing a package at the user's request via `install.packages`, R develops a build plan which determines the correct installation order of the package and its available transitive dependencies. However, it fails to account for three important sources of failure, two of which are readily correctable.

- The package may depend on a package not available in the current set of known repositories.
- The package may express a constrained requirement, such as `lib >= 1.2.1`, which cannot be satisfied given the available packages.
- The package may depend on a system requirement that is not present.
The third failure is difficult to detect ahead of actually trying to build the package, so I leave that aside for the moment.
But the first two are knowable, and the library I developed demonstrates one way to correctly identify these failures as part of its build planning process. It would be nice if this functionality were a part of base R.
Links
- RCloud. An earlier autotools-based system I developed is also still present.
- r-build-zig. The Zig build generator.
- r-repo-parse. The R package repository parser.
Additional background
These sections go a bit deeper for readers less familiar with Zig or R who wish to learn more.
Background on the Zig implementation
Zig build system
Zig is currently under active development, and this work was performed using pre-release development versions of 0.14.0.
Like most build systems, Zig's is declarative. However, the full power of the imperative language is available, with some limitations. The user provides a single entry point which receives a pointer to a `Build` object, and the user adds build targets, rules and dependencies via the extensive Build API.
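A minimal, self-contained example of that shape (unrelated to RCloud, shown only to illustrate the entry point):

```zig
const std = @import("std");

// Every Zig build script exports a single `build` entry point that
// receives a *std.Build and declares steps against it.
pub fn build(b: *std.Build) void {
    const hello = b.addSystemCommand(&.{ "echo", "hello" });
    // `zig build` runs the default install step, which here depends
    // on the echo command.
    b.getInstallStep().dependOn(&hello.step);
}
```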
Zig's build system makes extensive use of caching to speed up incremental builds and the developer's edit-compile-run experience. Understanding this caching in an atypical case such as this one was a challenge but proved crucial to making things work correctly.
Zig package manager
Zig also provides a package manager, whereby a developer can specify dependencies on third-party packages, and have those dependencies be automatically fetched, verified against a checksum, and built.
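Dependencies are declared in a `build.zig.zon` manifest. A sketch of its general shape follows; the exact fields have varied across 0.14 development versions, and the values here are illustrative:

```zig
.{
    .name = "example",
    .version = "0.0.1",
    .dependencies = .{
        .@"r-build-zig" = .{
            .url = "https://example.org/r-build-zig.tar.gz",
            .hash = "1220aaaa...",
        },
    },
    .paths = .{""},
}
```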
At first, this seemed like it would be suitable to fetch external R package dependencies, but ultimately this did not work due to a mismatch between Zig's expectation of a package and R's expectation. So I implemented a separate asset declaration and fetching tool. Future work could take another look at this.
Background on R packages
At the outset, I should emphasise that RCloud's requirements are not typical of R package development. Within open source R projects, it is practically unique in the simultaneous development and delivery of 15 R packages in a single repository. R provides simple and practical tools for end-users to install a given package and all of its dependencies from a set of configured package repositories, with a single command. The tools R provides are sufficient for the vast majority of use cases.
Beyond the need to build a large collection of packages and their dependencies from source, RCloud has additional requirements, including supply chain security and repeatable builds, that go beyond the typical package developer's requirements. This motivates the development of a more general set of build tools.
Base R tools
R provides a command-line tool, invoked as `R CMD INSTALL`, which can build and install a single R package from a tarball or local directory to a specified output location.
This tool requires that any dependencies needed by the package being built be already installed in the output location. Therefore, given a collection of packages to install, the developer must build them and all of their dependencies in the correct order.
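For example, using the stringr tarball from the generated rules shown earlier:

```sh
# Install stringr into ./lib; its dependencies (rlang, stringi, ...)
# must already be installed in ./lib for this to succeed.
R CMD INSTALL -l ./lib stringr_1.5.1.tar.gz
```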
Transitive dependencies
Base R does not provide public functions to assist in discovering transitive dependencies or the installation order, but there are private functions defined for this purpose.
This means that given a single package's source code, base R provides no public way for the user to install that package successfully, unless it has no third-party dependencies. And further, it provides no public way for the user to discover the full set of packages needed to successfully build the package in question.
Due to R's dynamic nature, we can access the private functions we need to fulfil some of the requirements above, but not all.
Ultimately, I chose to implement a parser for R package and repository metadata in Zig, and to calculate transitive dependencies that merge and satisfy version constraints (e.g. `lib >= 1.2.1`), which I describe more in the next section.
I won't cover the details of that work in this case study, but it is crucial to generating the correct build rules to build R packages in the correct order.
Version constraints
Like most package manager ecosystems, R packages can declare their dependencies on external packages with specific constraints on the version. This is used to ensure the depended-upon package has the necessary functionality or bug fix. Version constraints are loosely specified in the official R documentation, which states that any operator may be used.
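These constraints appear in a package's DESCRIPTION file, for example:

```
Imports:
    rlang (>= 1.1.0),
    stringi
```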
However, the largest package repository, CRAN, enforces certain policies which result in nearly all packages published in its repository using greater-than-or-equal constraints, such as `lib >= 1.2.1`. Other repositories have different policies, but they have fewer packages than CRAN.
Version constraints are not enforced as part of a build plan. Instead, they are checked by attempting to install a package and its dependencies, either as part of the installation process or as part of `R CMD check` (a required step before a package is accepted by repositories like CRAN). Both steps take a significant amount of time, because a package with many dependencies may successfully build or check all of its dependencies before failing on the last one.
My desk research hasn't yet found any discussion of the challenges of satisfying version constraints while building a complex collection of packages. In RCloud's case, several of its dependencies declare different version constraints on transitive dependencies, and these need to be merged. For example, one may declare `lib > 1.0.1` and another may declare `lib >= 1.2.3`; the first must be merged into the second, stricter constraint.
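A sketch of merging two lower-bound constraints, assuming versions that happen to parse as semantic versions (real R version strings are more permissive); this is not the actual r-repo-parse code:

```zig
const std = @import("std");

const Op = enum { gt, gte };
const Constraint = struct { op: Op, version: []const u8 };

/// The higher lower bound wins; on equal versions, the strict `>`
/// is the tighter constraint.
fn mergeLowerBounds(a: Constraint, b: Constraint) !Constraint {
    const va = try std.SemanticVersion.parse(a.version);
    const vb = try std.SemanticVersion.parse(b.version);
    return switch (va.order(vb)) {
        .lt => b,
        .gt => a,
        .eq => if (a.op == .gt) a else b,
    };
}
```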
Luckily, I did not encounter any contradictory constraints, nor did I encounter any two-sided constraints (e.g. `lib > 1.0` and `lib < 2.0`). But I know of no mechanism that would correctly handle this case within the R ecosystem.
In practice, given the policies enforced by large R package repositories such as CRAN, the lack of support for two-sided constraints is not a problem. But there do exist smaller package repositories, and they may have or develop different policies than CRAN.
System requirements
Many R packages depend on certain system libraries, such as `openssl` or `icu`, and it is up to the developer to identify and install those libraries prior to building. Typically, one role of a `configure` script is to verify the system requirements and complain to the developer if any are missing.
The alternative is to start the build process, wait for it to fail, and attempt to identify from the error what library is missing, install it, and repeat. This is the process I went through to initially build this project.
In practice, once I discovered the initial set of system requirements, I created a `shell.nix` and `flake.nix` to easily reproduce those requirements on other hosts.
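A minimal sketch of such a `shell.nix`, assuming nixpkgs attribute names; the project's actual file pins more than this:

```nix
{ pkgs ? import <nixpkgs> { } }:
pkgs.mkShell {
  # System libraries the R packages link against, plus R itself.
  packages = with pkgs; [ R openssl icu ];
}
```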
Contact: @mocom@mastodon.social. Published using C-c C-e P p. If you use this content commercially, kindly make an appropriate donation to Zig Software Foundation in my name.