Lightweight benchmarking in the uplc excutable#7824
Conversation
| UPLC.termMapNames (\(PLC.NamedDeBruijn _ i) -> PLC.NamedDeBruijn "" i) dbTerm | ||
| let !evalCtx = mkDefaultEvalCtx semvar | ||
| performGC | ||
| -- Store the term in an IORef so GHC cannot CSE/share the result of |
There was a problem hiding this comment.
This was Claude's idea, although it needed several attempts after I had to point out that it was failing to take account of call-by-need. I probably wouldn't have thought of this particular trick myself.
| ) | ||
| <$> allASTs trace | ||
|
|
||
| {- TODO: This is an exact copy of some code in `PlutusBenchmark.Common`. Check |
There was a problem hiding this comment.
To have a single version of this we'd have to move it into plutus-ledger-api to avoid dependency problems and I didn't want to do that right now because it wasn't clear whether it'd affect the standard benchmarking process: see #7796.
|
Our benchmarks run the same script hundreds of times in a row, so the CPU gets to "learn" it and clocks it as much as 20% faster than it often runs on-chain, where a given script is usually jumbled in with many different ones rather than repeated back-to-back. By "learn" I mean the branch predictor and instruction cache tune themselves to that one script's control flow, which a tight repeat loop maximizes and a mixed block mostly doesn't. So for a one-off script we're measuring close to a best case, not the real cost; a heavily reused contract sits closer to the benchmark. |
| t1 <- getCPUTime | ||
| let !ok = either (const False) (const True) r | ||
| loop (k - 1) ok (total + (t1 - t0)) | ||
| (lastOk, totalPs) <- loop count True 0 |
There was a problem hiding this comment.
Nit: can check allOk as opposed to lastOk.
We occasionally get suspicious results from Criterion benchmarks for UPLC scripts. Criterion does a lot of work and carries a lot of data about between iterations of the thing that it's benchmarking and it can sometimes be difficult to tell whether there might be some conflict between the benchmarking process and the thing being benchmarked. It also takes a bit of effort to set up a benchmark in the first place. This PR adds a
uplc timeoption that runs a UPLC script some number of times and reports the average time taken to run it. This doesn't produce exactly the same results as Criterion, but it does seem to a be a useful way of getting a second opinion when Criterion results look questionable.There has also been a
uplc benchmarkoption for some time which uses Criterion to benchmark a single script. This has never given very convincing results, but I updated it to useevaluateCekLikeInProdand that seems to have helped.The initial code for this was produced by Claude, but it needed quite a lot of prompting to convince it to produce something that was ready to merge, along with some manual editing.
uplc timeBelow are three graphs of the times reported by the new
uplc timecommand for varying numbers of iterations for a few randomly chosen validation scripts. The blue points represent the time taken to execute a flat file, the black ones are the time taken to execute the corresponding textual UPLC file, and the horizontal red line is the time reported by the Criterionvalidationbenchmark. All numbers in this comment were obtained on my desktop machine.The corresponding graphs for all of the other validation scripts are very similar. The time reported by
uplcforniterations is initially quite large but then drops rapidly to a more or less constant value, which is always slightly below the time reported by Criterion. The drop-off at the start is presumably due to cache effects: for larger numbers of iterations the evaluator and/or script will be in the cache and thus will execute more quickly. Criterion works by recording the time taken to run a script multiple times in batches of increasing size, takes the average for all of these, then takes the average of those averages. This process is questionable because it will give undue weight to smaller batches at the start, which will take longer because of the warm-up effect; Criterion is also doing more work overall (it has a list of all of the benchmarks it's going to run which will be retained throughout the entire benchmarking run, and it also accumulates a lot of information while running each benchmark), which may explain the discrepancies with the results returned byuplc time. The table below shows the times reported byuplc time(100 iterations) with the times reported by Criterion: Criterion always reports a larger time, but it is always less than 10% more, which suggests thatuplcgives sensible results. It also takesuplcabout 4 seconds to measure the times of all of the scripts, as opposed to about 45 minutes for Criterion (which is set to benchmark for each script for 30 seconds, which in retrospect may be rather long; perhaps we could reduce the time limit?)The question of what the "true" execution time of a script is also arises. On the chain a collection of usually different scripts will be executed once each, with the node doing other work in between, so maybe the cache warm-up won't happen and we should be looking at points with small n in these graphs. On the other hand, perhaps the evaluator does remain in the cache between script executions and we do get increased speeds because of warm-up. For the validation examples, the first reported validation time (n=1) is mostly between 1.2 and 1.7 times the figure for n=100: maybe the default n=100 figure is OK? It's certainly more reproducible than the n=1 time.
uplc benchmarkThe
uplc benchmarkoption was added some time ago, but has never given very good results. Here's a comparison of the times given byuplc benchmarkon the first few validation scripts with the results of the Criterionvalidationbenchmark.All of the rest of the results are similar; however,
uplc benchmarkwas usingCek.runCekDeBruijninstead ofevaluateCekLikeInProd, and changing it makes the results much closer:uplc benchmarkperhaps reports slight faster times in general (the first number is a bit smaller than the second), but the results are pretty close. This is much better and shows the importance of usingevaluateCekLikeInProdwhen execution time is important.