This repository contains the relevant code for evaluating the Timepiece system, as described in PLDI 2023 submission #11, "Modular Control Plane Verification via Temporal Invariants".
All the relevant code for testing Timepiece's features as described in our paper can be run using the Makefile, which builds a Docker container to run Timepiece's benchmarks. We also include the code of Angler, an additional tool that we used to extract the details of the Internet2 network configurations from Batfish and output them as a JSON file. Our Docker image can also be used to run Timepiece on this JSON input.
The minimal required dependencies are Docker and Make. To generate plots, we require a LaTeX distribution with pgfplots.
We use bind mounts to read and write the results/ and logs/ directories.
We've tested this artifact on Linux and macOS, and our Makefile uses common
Unix commands such as sh, date, mkdir, mv and rm to manipulate files.
You may need to change these commands to the appropriate Windows equivalents
if running this artifact on Windows.
Be sure you have enough free disk space to build the Docker image.
The image used for the benchmarks is around 1.35 GB, and the build may
fail if your machine runs out of space.
Running docker system prune can help delete old containers and
images you no longer use.
Ensure that you have make installed and available from the command line.
Ensure your system has pdflatex installed, along with the following packages
(available in most standard LaTeX distributions like TeX Live and MikTeX):
- tikz
- etoolbox
- mathptmx
- siunitx
- pgfplots (version >= 1.17)
- pgfplotstable
- xspace
- standalone
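As a quick sanity check before generating plots, you can confirm that pdflatex and the required packages are present. This is a sketch assuming a TeX Live-style installation, where kpsewhich looks up files on the TeX search path:

```shell
# Warn (without aborting) about any missing LaTeX dependency.
command -v pdflatex >/dev/null || echo "warning: pdflatex not found"
for sty in tikz etoolbox mathptmx siunitx pgfplots pgfplotstable xspace standalone; do
  kpsewhich "$sty.sty" >/dev/null 2>&1 || echo "warning: package $sty not found"
done
```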
To start the evaluation, build the Docker image for Timepiece using make image.
This will build a Docker image that can run Timepiece.
You should see a series of messages to your screen as Docker describes
the stages of the build.
We include the INTERNET2.angler.json file that we generated in the repository
to save build time. This file is used to test Internet2 and is quite large (~1GB).
We provide the infrastructure to regenerate this file from the Internet2
network configurations (if so desired) in the angler directory.
Now run make bench to run the benchmarks.
This will use the defaults defined in the Makefile for running the benchmarks,
which give each benchmark only a minute to run before timing out (controlled by
the TIMEOUT variable).
As such, many of the larger benchmarks will time out before completing,
unlike the results reported in the paper.
Nonetheless, this should test that everything can run as intended.
make bench will run each fattree policy benchmark modularly and monolithically,
from a fattree of k=4 pods to k=40 pods (controlled by the MINSIZE and MAXSIZE variables).
The policies run are controlled by the POLICIES variable.
Output is captured by the Python script run_all.py, which buffers it
before writing it to the screen: you may therefore have to wait some time before any output appears.
By default, each benchmark runs once (controlled by the NTRIALS variable):
you can instruct run_all.py to take the minimum of multiple trials
(to reduce the impact of noise) by increasing NTRIALS.
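For instance, all of these variables can be overridden on the make command line. This is a sketch using standard make variable overrides; the specific values shown are illustrative:

```shell
# Hypothetical quick pass: only the 'reach' policy, fattrees from k=4 to
# k=8 pods, a 2-minute timeout, and 2 trials per benchmark.
make bench POLICIES=reach MINSIZE=4 MAXSIZE=8 TIMEOUT=120 NTRIALS=2
```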
The output of each benchmark should be a report of the time taken to verify it. For the fattree benchmarks, we expect every benchmark either to pass all checks (i.e., no counterexamples are reported) or to time out. Output should roughly resemble:
$ make bench
Running benchmark k=4 with options: reach
Trial 0 of 1 started ...
k=4
Inferred destination node: edge-19
Environment.ProcessorCount: ...
All the modular checks passed!
Modular verification took ...
Check statistics:
Maximum check time: node ... in ...
Minimum check time: node ... in ...
Average check time: ...
Median check time: node ... in ...
99th percentile check time: node ... in ...
Total check time: ...
Running benchmark k=8 with options: reach
Trial 0 of 1 started ...
k=8
Inferred destination node: edge-79
Environment.ProcessorCount: ...
All the modular checks passed!
...
After a benchmark finishes running, the log of all output is saved to the logs/
directory as a $(POLICY).txt file ($(POLICY)-m.txt for monolithic benchmarks).
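Once a log exists, you can skim it for the headline numbers. This is a sketch: the log file name is illustrative, and the grep patterns match the output lines shown above:

```shell
# Extract the core count and verification time from a saved log.
grep -E 'Environment\.ProcessorCount|verification took' logs/reach.txt
```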
The --dat option, supplied to run_all.py, instructs the script to generate a .dat file.
This file is used to generate the plot for the corresponding benchmark. These files are
saved to the results/ directory as $(POLICY).dat or $(POLICY)-m.dat.
Each line of the .dat file is the average across trials of each statistic.
For modular benchmarks, these include:
- the number of nodes (n),
- the maximum time of a single node (max),
- minimum time of a single node (min),
- average time of a single node (avg),
- median time of a single node (med),
- 99th percentile time of a single node (99p),
- total time of all nodes' checks (total),
- wall-clock time of all nodes' checks (wall).
For example:
n max min avg med 99p total wall
293 2965 16 224.11945392491467 158 1876 65667 25350
As the benchmarks are parallelized, the wall-clock time illustrates the actual time
taken by the benchmarks once each check is distributed across your machine's cores.
Timepiece will attempt to use as many cores as are available to it: the number
of cores is reported by the line starting with Environment.ProcessorCount: ...
(note that this number may be half as many as are available, for reasons unclear
to us).
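Since total sums every node's check time while wall measures elapsed time, their ratio gives the effective parallel speedup. For the sample modular row above, this can be computed with a short awk sketch over the .dat column layout shown:

```shell
# Compute the effective parallel speedup (total / wall) from a .dat row.
cat > sample.dat <<'EOF'
n max min avg med 99p total wall
293 2965 16 224.11945392491467 158 1876 65667 25350
EOF
awk 'NR > 1 { printf "speedup: %.2fx\n", $7 / $8 }' sample.dat
# prints "speedup: 2.59x"
```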
We conducted our experiments on machines with 96 cores: we do not expect reviewers
to have access to such machines, so the wall clock times may be higher during
evaluation; nonetheless, the overall trends should be similar as the size of the
benchmark increases.
For monolithic benchmarks, statistics are simply the number of nodes and the total time of all checks. For example:
n total
2000 7200000
Run make plots to generate all the LaTeX plots corresponding to the benchmarks.
These plots should resemble those in Figure 6 in the paper, although they will
necessarily have fewer data points as our default timeout is set to only 60 seconds.
To better indicate timeouts in our paper, we manually added data points of 7,200,000 ms
in the .dat file to emphasize when timeouts occurred.
This is not done automatically by our logging tool run_all.py: if you wish to
include these data points, you must add rows to the appropriate .dat files, e.g.:
n total
20 4936
# the following rows are manually added
80 7200000
180 7200000
320 7200000
500 7200000
720 7200000
980 7200000
1280 7200000
1620 7200000
2000 7200000
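This manual edit can also be scripted. The sketch below assumes a hypothetical .dat file name and uses the fact that a fattree with k pods has 5k²/4 nodes (matching the n values above):

```shell
# Append 7,200,000 ms timeout rows for fattree sizes that timed out.
mkdir -p results
dat=results/example.dat   # hypothetical file name
printf 'n total\n20 4936\n' > "$dat"
echo '# the following rows are manually added' >> "$dat"
for k in 8 12 16 20 24 28 32 36 40; do
  echo "$(( 5 * k * k / 4 )) 7200000" >> "$dat"
done
```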
As mentioned above, you will need LaTeX and the pdflatex binary to be able to generate
these plots, along with the packages mentioned above.
We generate plots by using a template .tex file called plot.tex.
The generated plots have slightly different names from the plots in the paper, as the policies are listed in shorthand rather than their full names. The mapping between them is:
- results/reach.pdf: the SpReach benchmark
- results/lengthWeak.pdf: the SpLen benchmark
- results/valley.pdf: the SpVf benchmark
- results/hijack.pdf: the SpHijack benchmark
- results/allReach.pdf: the ApReach benchmark
- results/allLengthWeak.pdf: the ApLen benchmark
- results/allValley.pdf: the ApVf benchmark
- results/allHijack.pdf: the ApHijack benchmark
For the Internet2 benchmark, we run the Docker container again, asking it to use
Timepiece.Angler to evaluate the network.
Timepiece.Angler works differently from Timepiece.Benchmarks: it reads in a given
.angler.json file and deserializes it to determine the network's behavior.
This process takes a few seconds, at the end of which standard output should report:
Internet2 benchmark identified...
Successfully deserialized INTERNET2.angler.json
This should then be followed by a series of counterexamples at particular
nodes.
You may need to rerun or increase the timeout if the counterexamples are not reported:
setting TIMEOUT=300 (5 minutes) should be more than enough for the modular benchmark.
As Internet2 has over 200 external peers, each counterexample produces
close to 300 lines of output, describing which check fails and what the state
of the counterexample is.
This behavior is expected, as Internet2's BlockToExternal property may not be enforced
for some of its connections (whether intentionally or not).
Example output may look something like:
Internet2 benchmark identified...
Successfully deserialized JSON file INTERNET2.angler.json
Environment.ProcessorCount: 4
Counterexample for node 10.11.1.17:
Inductive check failed!
symbolic external-route-10.11.1.17 := RouteEnvironment(Result=RouteResult(Exit=False,Fallthrough=False,Returned=False,Value=False), LocalDefaultAction=False, Prefix=0.0.0.0/0, Weight=0, Lp=0, AsPathLength=0, Metric=0, OriginType=0, Tag=0, Communities={})
symbolic external-route-108.59.25.20 := RouteEnvironment(Result=RouteResult(Exit=False,Fallthrough=False,Returned=False,Value=False), LocalDefaultAction=False, Prefix=0.0.0.0/0, Weight=0, Lp=0, AsPathLength=0, Metric=0, OriginType=0, Tag=0, Communities={})
symbolic external-route-108.59.26.20 := RouteEnvironment(Result=RouteResult(Exit=False,Fallthrough=False,Returned=False,Value=False), LocalDefaultAction=False, Prefix=0.0.0.0/0, Weight=0, Lp=0, AsPathLength=0, Metric=0, OriginType=0, Tag=0, Communities={})
symbolic external-route-109.105.98.9 := RouteEnvironment(Result=RouteResult(Exit=False,Fallthrough=False,Returned=False,Value=False), LocalDefaultAction=False, Prefix=0.0.0.0/0, Weight=0, Lp=0, AsPathLength=0, Metric=0, OriginType=0, Tag=0, Communities={})
symbolic external-route-117.103.111.154 := RouteEnvironment(Result=RouteResult(Exit=False,Fallthrough=False,Returned=False,Value=False), LocalDefaultAction=False, Prefix=0.0.0.0/0, Weight=0, Lp=0, AsPathLength=0, Metric=0, OriginType=0, Tag=0, Communities={})
# and so on
...
We can also attempt to monolithically verify Internet2 by running make logs/internet2-m.txt.
In our experiments, we did not see monolithic verification complete after over
two hours.
As we do not use run_all.py to execute Timepiece.Angler, the benchmark must
be manually terminated by sending a SIGTERM signal.
We do so using timeout inside the container.
Because of this, the benchmark will simply exit once the timeout is reached.
To reproduce the evaluation for each fattree policy, simply run
make TIMEOUT=7200
This runs each policy with a 2-hour timeout and generates its plot.
You may compare the .txt log and .dat result files with those included
in the paper-results directory: note that the
paper-results dat files omit some statistics (max, min, avg)
and label others differently (tk=modular wall, ms=monolithic total).
The plots should roughly correspond to those in Figure 6 (page 15) of
the paper.
(Note that Figure 1 on page 2 is a simplified depiction of Figure 6(d),
without median or 99th percentile times reported.)
As we claim in the paper, results should show that timeouts occur quickly for the monolithic benchmarks (around k=8 or k=12), with the exception of the SpReach (reach) benchmark, which is simple enough to run to completion. Times should be larger for Ap benchmarks than for Sp benchmarks.
We cannot guarantee that the exact times reported in the paper will hold in the artifact environment (e.g. the 99th percentile ApReach time for k=40 may be above 9 seconds). Nonetheless, the trends of our results should be supported by the artifact times.
To reproduce the evaluation of Internet2, run
make wan TIMEOUT=7200
This runs both Internet2 modularly and monolithically with a 2-hour timeout. The details reported in the .txt logs should align with those given in text on page 18 of the paper (under Wide-area networks).
We include INTERNET2.angler.json for users' convenience,
but it can also be built from scratch.
Simply clone the angler GitHub
repository, then follow the instructions therein to convert the
configuration files (under examples/INTERNET2).
Angler uses Batfish to convert the configuration
files to its own intermediate JSON representation.
You will need to use Docker again to start a container for Batfish,
and then Python to run Angler.
You can explore different properties and benchmarks by modifying the C# files, or by changing the fattree policies used by make (via the POLICIES variable). There are alternative policies 'l' (line 107) and 'vl' (line 82) that check slightly different properties, as shown in the code.
It is also straightforward to add a new policy to test:
- Define a Network.cs instance (Sp.cs and Vf.cs illustrate this concept for specific subclasses of Network). You can make the network parametric in terms of the fattree size by following those examples.
- Add a case for this network instance to Benchmark.cs (see Run() and BenchmarkTypeExtensions.Parse()).
- Run Timepiece.Benchmarks with your new policy.
Modifications to the benchmarks' properties, annotations and other behavior
are also possible.
Properties and annotations are arbitrary C# functions which take Zen
objects as arguments.
You may refer to the Zen documentation
for more information on what operations Zen supports.
We defined temporal operators in our work as shorthands for common functions,
which return a Func<Zen<T>, Zen<BigInteger>, Zen<bool>> as output.
- Zen<T> is an input representing a route of type T, lifted to Zen.
- Zen<BigInteger> represents the time input.
- Zen<bool> is the output Zen-lifted boolean indicating whether the function holds on the inputs.
The Timepiece.Lang file provides these operators
along with other useful static functions.
We use Time as a shorthand for Zen<BigInteger> in this file.
These include:
- Globally: the "script-G" operator from the paper.
- Until: the "script-U" operator from the paper.
- Finally: the "script-F" operator from the paper.
- Intersect: lifted set intersection (the "square-cap" operator from the paper).
- Union: lifted set union (the "square-cup" operator from the paper).
For instance, if one wanted to add a new Complement
operator to Timepiece.Lang to represent the lifted set complement
operator specified in the paper, one could write:
public static Func<Zen<T>, Zen<BigInteger>, Zen<bool>> Complement<T>(Func<Zen<T>, Zen<BigInteger>, Zen<bool>> f)
{
return (r, t) => Zen.Not(f(r, t));
}

Timepiece.Networks.Network contains the generic definition of a network instance.
The Network constructor on line 110 is used in our benchmarks:
note that it takes two types of properties as input, both expressed
as Func<Zen<T>, Zen<bool>>: stableProperties, which represent Finally properties
that hold at a certain convergence time (4 in the case of our fattree benchmarks);
and safetyProperties, which represent Globally properties that always hold.
Users may add symbolic variables to their network instances
to represent nondeterministic behavior.
The Timepiece.SymbolicValue module
provides the API for defining these values.
Timepiece.Benchmarks.SymbolicDestination
provides a subclass of SymbolicValue for our Ap benchmarks.
You may need to run docker rm tptest if you receive an error
that the benchmarking container cannot start,
as the tptest container is already in use.
If you run a benchmark again after a log file has previously been generated,
make may skip that benchmark.
You must delete or rename the old log files before running make again.
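For example (hypothetical log names, following the $(POLICY).txt convention above):

```shell
# Either delete the stale log so make regenerates it...
rm logs/reach.txt
# ...or set it aside under another name:
mv logs/reach-m.txt logs/reach-m.old.txt
```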
The make clean recipe will delete the logs and results files,
while the make archive recipe will move them to a new directory
named with the current time.