
Lightweight benchmarking in the uplc executable #7824

Open: kwxm wants to merge 15 commits into master from kwxm/uplc/timing-option

kwxm (Contributor) commented Jun 23, 2026

We occasionally get suspicious results from Criterion benchmarks for UPLC scripts. Criterion does a lot of work and carries a lot of data around between iterations of the thing that it's benchmarking, and it can be difficult to tell whether the benchmarking process is interfering with the thing being benchmarked. It also takes a bit of effort to set up a benchmark in the first place. This PR adds a uplc time option that runs a UPLC script some number of times and reports the average time taken to run it. This doesn't produce exactly the same results as Criterion, but it does seem to be a useful way of getting a second opinion when Criterion results look questionable.
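The basic idea of running something n times and reporting the mean can be sketched as follows. This is a minimal illustration and not the code in this PR: the name `meanCpuPicos` is hypothetical, and it only assumes `getCPUTime` from `base`, which reports CPU time in picoseconds.

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)

-- | Run an action @n@ times and return the mean CPU time per run in
-- picoseconds. Each result is forced to weak head normal form inside the
-- timed region so that laziness cannot defer the work past the second
-- clock reading. (Hypothetical helper, not part of the uplc executable.)
meanCpuPicos :: Int -> IO a -> IO Double
meanCpuPicos n action
  | n <= 0 = pure 0
  | otherwise = go n 0
  where
    go :: Int -> Integer -> IO Double
    go 0 !total = pure (fromIntegral total / fromIntegral n)
    go k !total = do
      t0 <- getCPUTime
      _ <- action >>= evaluate
      t1 <- getCPUTime
      go (k - 1) (total + (t1 - t0))

-- | Convert picoseconds to microseconds for display.
picosToMicros :: Double -> Double
picosToMicros = (/ 1e6)
```

Note that `evaluate` only forces the outermost constructor; a result that is a lazy structure would need deeper forcing for the measurement to cover all of the work.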

There has also been a uplc benchmark option for some time which uses Criterion to benchmark a single script. This has never given very convincing results, but I updated it to use evaluateCekLikeInProd and that seems to have helped.

The initial code for this was produced by Claude, but it needed quite a lot of prompting to convince it to produce something that was ready to merge, along with some manual editing.

uplc time

Below are three graphs of the times reported by the new uplc time command for varying numbers of iterations for a few randomly chosen validation scripts. The blue points represent the time taken to execute a flat file, the black ones are the time taken to execute the corresponding textual UPLC file, and the horizontal red line is the time reported by the Criterion validation benchmark. All numbers in this comment were obtained on my desktop machine.

[Figures: uplc time results against number of iterations for auction11, future2 and pubkey1]

The corresponding graphs for all of the other validation scripts are very similar. The time reported by uplc time for n iterations is initially quite large but then drops rapidly to a more or less constant value, which is almost always slightly below the time reported by Criterion. The drop-off at the start is presumably due to cache effects: for larger numbers of iterations the evaluator and/or script will be in the cache and thus will execute more quickly.

Criterion works by recording the time taken to run a script multiple times in batches of increasing size, taking the average for each batch, and then taking the average of those averages. This process is questionable because it gives undue weight to the smaller batches at the start, which take longer per iteration because of the warm-up effect. Criterion is also doing more work overall (it has a list of all of the benchmarks it's going to run, which is retained throughout the entire benchmarking run, and it also accumulates a lot of information while running each benchmark), which may explain the discrepancies with the results returned by uplc time.

The table below compares the times reported by uplc time (100 iterations) with the times reported by Criterion. Criterion reports a larger time for all but three scripts (coop-4, game-sm-success_1-3 and ping-pong-2), and the difference is always less than 10%, which suggests that uplc time gives sensible results. It also takes uplc time about 4 seconds to measure all of the scripts, as opposed to about 45 minutes for Criterion (which is set to benchmark each script for 30 seconds, which in retrospect may be rather long; perhaps we could reduce the time limit?).
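To make the weighting argument concrete, here is a small arithmetic sketch. It follows the description of batch averaging given above rather than Criterion's source, and the batch figures are invented purely for illustration.

```haskell
-- | A batch: number of iterations and total time for the batch (any unit).
type Batch = (Int, Double)

-- | Average the per-iteration time of each batch, then average those
-- averages. Every batch counts equally, however few iterations it has.
meanOfBatchMeans :: [Batch] -> Double
meanOfBatchMeans bs =
  sum [t / fromIntegral n | (n, t) <- bs] / fromIntegral (length bs)

-- | Total time divided by total iterations. Every iteration counts equally.
pooledMean :: [Batch] -> Double
pooledMean bs = sum (map snd bs) / fromIntegral (sum (map fst bs))

-- | Hypothetical figures: the per-iteration cost falls from 170 to 102.5
-- as the batches grow, mimicking a warm-up effect.
exampleBatches :: [Batch]
exampleBatches = [(1, 170), (2, 260), (4, 440), (8, 820)]
```

With these invented figures `meanOfBatchMeans exampleBatches` is 128.125, while `pooledMean exampleBatches` is 1690/15, roughly 112.7, so the slow early batches pull the first estimate upwards. Both functions return NaN for an empty list.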

Script                     uplc time     Criterion     Change
----------------------------------------------------------------------------------
auction_1-1                 70.477 µs	 70.48 μs      +0.0%
auction_1-2                235.754 µs	 254.9 μs      +8.1%
auction_1-3                232.593 µs	 250.3 μs      +7.6%
auction_1-4                 90.136 µs	 91.57 μs      +1.6%
auction_2-1                 67.865 µs	 70.96 μs      +4.6%
auction_2-2                246.877 µs	 254.9 μs      +3.2%
auction_2-3                308.704 µs	 328.5 μs      +6.4%
auction_2-4                235.785 µs	 248.6 μs      +5.4%
auction_2-5                 85.436 µs	 91.68 μs      +7.3%
coop-1                      89.660 µs	 92.64 μs      +3.3%
coop-2                     282.500 µs	 297.5 μs      +5.3%
coop-3                     843.250 µs	 891.3 μs      +5.7%
coop-4                     399.404 µs	 394.5 μs      -1.2%
coop-5                     157.251 µs	 167.5 μs      +6.5%
coop-6                     260.691 µs	 276.1 μs      +5.9%
coop-7                     122.791 µs	 130.1 μs      +6.0%
crowdfunding-success-1      81.824 µs	 86.25 μs      +5.4%
crowdfunding-success-2      81.501 µs	 85.94 μs      +5.4%
crowdfunding-success-3      81.262 µs	 86.58 μs      +6.5%
currency-1                  94.984 µs	 99.75 μs      +5.0%
escrow-redeem_1-1          132.843 µs	 141.8 μs      +6.7%
escrow-redeem_1-2          132.428 µs	 142.5 μs      +7.6%
escrow-redeem_2-1          151.612 µs	 163.5 μs      +7.8%
escrow-redeem_2-2          160.318 µs	 163.9 μs      +2.2%
escrow-redeem_2-3          153.093 µs	 165.5 μs      +8.1%
escrow-refund-1             60.342 µs	 63.48 μs      +5.2%
future-increase-margin-1    94.968 µs	 99.45 μs      +4.7%
future-increase-margin-2   197.088 µs	 213.4 μs      +8.3%
future-increase-margin-3   206.918 µs	 212.9 μs      +2.9%
future-increase-margin-4   195.185 µs	 197.5 μs      +1.2%
future-increase-margin-5   310.977 µs	 332.2 μs      +6.8%
future-pay-out-1            98.780 µs	 99.70 μs      +0.9%
future-pay-out-2           202.322 µs	 213.7 μs      +5.6%
future-pay-out-3           198.968 µs	 213.3 μs      +7.2%
future-pay-out-4           311.374 µs	 336.2 μs      +8.0%
future-settle-early-1       96.335 µs	 99.35 μs      +3.1%
future-settle-early-2      199.612 µs	 214.0 μs      +7.2%
future-settle-early-3      196.233 µs	 213.9 μs      +9.0%
future-settle-early-4      238.002 µs	 255.7 μs      +7.4%
game-sm-success_1-1        148.237 µs	 158.9 μs      +7.2%
game-sm-success_1-2         76.755 µs	 80.19 μs      +4.5%
game-sm-success_1-3        256.007 µs	 253.8 μs      -0.9%
game-sm-success_1-4         86.977 µs	 91.58 μs      +5.3%
game-sm-success_2-1        147.953 µs	 158.6 μs      +7.2%
game-sm-success_2-2         79.621 µs	 80.34 μs      +0.9%
game-sm-success_2-3        233.373 µs	 252.8 μs      +8.3%
game-sm-success_2-4         86.558 µs	 91.15 μs      +5.3%
game-sm-success_2-5        249.434 µs	 251.0 μs      +0.6%
game-sm-success_2-6         85.997 µs	 91.07 μs      +5.9%
guardrail-sorted-large     170.746 µs	 180.6 μs      +5.8%
guardrail-sorted-small      28.983 µs	 29.60 μs      +2.1%
guardrail-unsorted-large   229.530 µs	 247.5 μs      +7.8%
guardrail-unsorted-small    27.230 µs	 28.51 μs      +4.7%
multisig-sm-01             153.996 µs	 163.0 μs      +5.8%
multisig-sm-02             151.258 µs	 159.1 μs      +5.2%
multisig-sm-03             150.726 µs	 159.6 μs      +5.9%
multisig-sm-04             154.597 µs	 161.9 μs      +4.7%
multisig-sm-05             207.542 µs	 219.2 μs      +5.6%
multisig-sm-06             153.500 µs	 162.6 μs      +5.9%
multisig-sm-07             150.032 µs	 159.1 μs      +6.0%
multisig-sm-08             149.017 µs	 160.2 μs      +7.5%
multisig-sm-09             149.974 µs	 160.6 μs      +7.1%
multisig-sm-10             205.431 µs	 220.7 μs      +7.4%
ping-pong-1                125.944 µs	 133.3 μs      +5.8%
ping-pong-2                143.107 µs	 132.8 μs      -7.2%
ping-pong_2-1               79.549 µs	 83.38 μs      +4.8%
prism-1                     63.489 µs	 66.42 μs      +4.6%
prism-2                    155.182 µs	 167.2 μs      +7.7%
prism-3                    141.508 µs	 148.4 μs      +4.9%
pubkey-1                    55.251 µs	 57.75 μs      +4.5%
stablecoin_1-1             389.700 µs	 391.3 μs      +0.4%
stablecoin_1-2              72.798 µs	 78.01 μs      +7.2%
stablecoin_1-3             419.232 µs	 446.7 μs      +6.6%
stablecoin_1-4              81.872 µs	 82.66 μs      +1.0%
stablecoin_1-5             523.262 µs	 565.4 μs      +8.1%
stablecoin_1-6             100.480 µs	 102.1 μs      +1.6%
stablecoin_2-1             367.265 µs	 394.1 μs      +7.3%
stablecoin_2-2              77.289 µs	 78.55 μs      +1.6%
stablecoin_2-3             415.673 µs	 447.9 μs      +7.8%
stablecoin_2-4              78.179 µs	 82.93 μs      +6.1%
token-account-1             75.545 µs	 75.89 μs      +0.5%
token-account-2            123.911 µs	 134.4 μs      +8.5%
uniswap-1                  145.886 µs	 156.6 μs      +7.3%
uniswap-2                   84.884 µs	 88.23 μs      +3.9%
uniswap-3                  651.396 µs	 712.5 μs      +9.4%
uniswap-4                  125.376 µs	 132.4 μs      +5.6%
uniswap-5                  431.375 µs	 465.4 μs      +7.9%
uniswap-6                  117.843 µs	 124.7 μs      +5.8%
vesting-1                  138.194 µs	 142.0 μs      +2.8%
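The Change column appears to be the Criterion time relative to the uplc time figure. The sketch below is an assumption about how the column was computed, with hypothetical function names; it reproduces the displayed values for the rows I checked.

```haskell
-- | Percentage change of @new@ relative to @base@.
percentChange :: Double -> Double -> Double
percentChange base new = (new - base) / base * 100

-- | Round to one decimal place, matching the precision used in the table.
round1 :: Double -> Double
round1 x = fromIntegral (round (x * 10) :: Integer) / 10
```

For example, `round1 (percentChange 235.754 254.9)` gives 8.1 (the auction_1-2 row) and `round1 (percentChange 143.107 132.8)` gives -7.2 (the ping-pong-2 row).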

This also raises the question of what the "true" execution time of a script is. On the chain a collection of usually different scripts will be executed once each, with the node doing other work in between, so maybe the cache warm-up won't happen and we should be looking at points with small n in these graphs. On the other hand, perhaps the evaluator does remain in the cache between script executions and we do get increased speeds because of warm-up. For the validation examples, the first reported validation time (n=1) is mostly between 1.2 and 1.7 times the figure for n=100: maybe the default n=100 figure is OK? It's certainly more reproducible than the n=1 time.

uplc benchmark

The uplc benchmark option was added some time ago, but has never given very good results. Here's a comparison of the times given by uplc benchmark on the first few validation scripts with the results of the Criterion validation benchmark.

Script                              uplc benchmark Criterion   Change
----------------------------------------------------------------------------------
auction_1-1.flat                  97.42 μs	 70.48 μs     -27.7%
auction_1-2.flat                  286.3 μs	 254.9 μs     -11.0%
auction_1-3.flat                  286.2 μs	 250.3 μs     -12.5%
auction_1-4.flat                  122.8 μs	 91.57 μs     -25.4%
auction_2-1.flat                  97.41 μs	 70.96 μs     -27.2%
auction_2-2.flat                  287.2 μs	 254.9 μs     -11.2%
auction_2-3.flat                  370.8 μs	 328.5 μs     -11.4%
auction_2-4.flat                  286.3 μs	 248.6 μs     -13.2%
auction_2-5.flat                  123.3 μs	 91.68 μs     -25.6%
coop-1.flat                       130.0 μs	 92.64 μs     -28.7%
coop-2.flat                       424.0 μs	 297.5 μs     -29.8%
coop-3.flat                       1.021 ms	 891.3 μs     -12.7%
coop-4.flat                       528.3 μs	 394.5 μs     -25.3%
coop-5.flat                       226.7 μs	 167.5 μs     -26.1%
coop-6.flat                       394.6 μs	 276.1 μs     -30.0%
coop-7.flat                       185.1 μs	 130.1 μs     -29.7%

All of the rest of the results are similar. However, uplc benchmark was using Cek.runCekDeBruijn instead of evaluateCekLikeInProd, and switching to the latter makes the results much closer, as the table below shows: uplc benchmark perhaps reports slightly faster times in general (the first number is usually a bit smaller than the second), but the results are pretty close. This is much better and shows the importance of using evaluateCekLikeInProd when execution time is important.

Script                              uplc benchmark Criterion   Change
----------------------------------------------------------------------------------
auction_1-1.flat                  70.76 μs	 70.48 μs      -0.4%
auction_1-2.flat                  252.8 μs	 254.9 μs      +0.8%
auction_1-3.flat                  245.2 μs	 250.3 μs      +2.1%
auction_1-4.flat                  90.54 μs	 91.57 μs      +1.1%
auction_2-1.flat                  70.79 μs	 70.96 μs      +0.2%
auction_2-2.flat                  249.8 μs	 254.9 μs      +2.0%
auction_2-3.flat                  323.3 μs	 328.5 μs      +1.6%
auction_2-4.flat                  243.5 μs	 248.6 μs      +2.1%
auction_2-5.flat                  90.49 μs	 91.68 μs      +1.3%
coop-1.flat                       94.33 μs	 92.64 μs      -1.8%
coop-2.flat                       302.6 μs	 297.5 μs      -1.7%
coop-3.flat                       887.1 μs	 891.3 μs      +0.5%
coop-4.flat                       401.9 μs	 394.5 μs      -1.8%
coop-5.flat                       169.1 μs	 167.5 μs      -0.9%
coop-6.flat                       277.0 μs	 276.1 μs      -0.3%
coop-7.flat                       129.5 μs	 130.1 μs      +0.5%
crowdfunding-success-1.flat       84.18 μs	 86.25 μs      +2.5%
crowdfunding-success-2.flat       84.49 μs	 85.94 μs      +1.7%
crowdfunding-success-3.flat       84.29 μs	 86.58 μs      +2.7%
currency-1.flat                   97.28 μs	 99.75 μs      +2.5%
escrow-redeem_1-1.flat            140.3 μs	 141.8 μs      +1.1%
escrow-redeem_1-2.flat            140.2 μs	 142.5 μs      +1.6%
escrow-redeem_2-1.flat            162.3 μs	 163.5 μs      +0.7%
escrow-redeem_2-2.flat            163.2 μs	 163.9 μs      +0.4%
escrow-redeem_2-3.flat            162.5 μs	 165.5 μs      +1.8%
escrow-refund-1.flat              63.15 μs	 63.48 μs      +0.5%
future-increase-margin-1.flat     98.05 μs	 99.45 μs      +1.4%
future-increase-margin-2.flat     212.7 μs	 213.4 μs      +0.3%
future-increase-margin-3.flat     213.0 μs	 212.9 μs      -0.0%
future-increase-margin-4.flat     195.2 μs	 197.5 μs      +1.2%
future-increase-margin-5.flat     325.7 μs	 332.2 μs      +2.0%
future-pay-out-1.flat             97.59 μs	 99.70 μs      +2.2%
future-pay-out-2.flat             211.5 μs	 213.7 μs      +1.0%
future-pay-out-3.flat             212.7 μs	 213.3 μs      +0.3%
future-pay-out-4.flat             327.2 μs	 336.2 μs      +2.8%
future-settle-early-1.flat        98.50 μs	 99.35 μs      +0.9%
future-settle-early-2.flat        214.5 μs	 214.0 μs      -0.2%
future-settle-early-3.flat        213.0 μs	 213.9 μs      +0.4%
future-settle-early-4.flat        251.2 μs	 255.7 μs      +1.8%
game-sm-success_1-1.flat          155.6 μs	 158.9 μs      +2.1%
game-sm-success_1-2.flat          79.47 μs	 80.19 μs      +0.9%
game-sm-success_1-3.flat          249.1 μs	 253.8 μs      +1.9%
game-sm-success_1-4.flat          91.23 μs	 91.58 μs      +0.4%
game-sm-success_2-1.flat          155.2 μs	 158.6 μs      +2.2%
game-sm-success_2-2.flat          79.58 μs	 80.34 μs      +1.0%
game-sm-success_2-3.flat          246.6 μs	 252.8 μs      +2.5%
game-sm-success_2-4.flat          90.84 μs	 91.15 μs      +0.3%
game-sm-success_2-5.flat          245.4 μs	 251.0 μs      +2.3%
game-sm-success_2-6.flat          91.27 μs	 91.07 μs      -0.2%
guardrail-sorted-large.flat       178.9 μs	 180.6 μs      +1.0%
guardrail-sorted-small.flat       29.44 μs	 29.60 μs      +0.5%
guardrail-unsorted-large.flat     247.0 μs	 247.5 μs      +0.2%
guardrail-unsorted-small.flat     28.76 μs	 28.51 μs      -0.9%
multisig-sm-01.flat               162.4 μs	 163.0 μs      +0.4%
multisig-sm-02.flat               157.3 μs	 159.1 μs      +1.1%
multisig-sm-03.flat               157.6 μs	 159.6 μs      +1.3%
multisig-sm-04.flat               160.5 μs	 161.9 μs      +0.9%
multisig-sm-05.flat               218.4 μs	 219.2 μs      +0.4%
multisig-sm-06.flat               160.3 μs	 162.6 μs      +1.4%
multisig-sm-07.flat               158.4 μs	 159.1 μs      +0.4%
multisig-sm-08.flat               161.9 μs	 160.2 μs      -1.1%
multisig-sm-09.flat               160.1 μs	 160.6 μs      +0.3%
multisig-sm-10.flat               219.6 μs	 220.7 μs      +0.5%
ping-pong-1.flat                  132.3 μs	 133.3 μs      +0.8%
ping-pong-2.flat                  132.4 μs	 132.8 μs      +0.3%
ping-pong_2-1.flat                82.02 μs	 83.38 μs      +1.7%
prism-1.flat                      65.30 μs	 66.42 μs      +1.7%
prism-2.flat                      168.6 μs	 167.2 μs      -0.8%
prism-3.flat                      147.4 μs	 148.4 μs      +0.7%
pubkey-1.flat                     57.09 μs	 57.75 μs      +1.2%
stablecoin_1-1.flat               392.9 μs	 391.3 μs      -0.4%
stablecoin_1-2.flat               77.43 μs	 78.01 μs      +0.7%
stablecoin_1-3.flat               441.1 μs	 446.7 μs      +1.3%
stablecoin_1-4.flat               82.48 μs	 82.66 μs      +0.2%
stablecoin_1-5.flat               562.9 μs	 565.4 μs      +0.4%
stablecoin_1-6.flat               100.4 μs	 102.1 μs      +1.7%
stablecoin_2-1.flat               394.6 μs	 394.1 μs      -0.1%
stablecoin_2-2.flat               77.34 μs	 78.55 μs      +1.6%
stablecoin_2-3.flat               440.3 μs	 447.9 μs      +1.7%
stablecoin_2-4.flat               81.83 μs	 82.93 μs      +1.3%
token-account-1.flat              74.62 μs	 75.89 μs      +1.7%
token-account-2.flat              132.7 μs	 134.4 μs      +1.3%
uniswap-1.flat                    154.4 μs	 156.6 μs      +1.4%
uniswap-2.flat                    88.85 μs	 88.23 μs      -0.7%
uniswap-3.flat                    699.9 μs	 712.5 μs      +1.8%
uniswap-4.flat                    131.6 μs	 132.4 μs      +0.6%
uniswap-5.flat                    465.2 μs	 465.4 μs      +0.0%
uniswap-6.flat                    123.8 μs	 124.7 μs      +0.7%
vesting-1.flat                    140.0 μs	 142.0 μs      +1.4%

kwxm requested review from Unisay and zeme-wana June 23, 2026 14:56
kwxm added the Benchmarks, Plutus Exe, and No Changelog Required labels Jun 23, 2026
UPLC.termMapNames (\(PLC.NamedDeBruijn _ i) -> PLC.NamedDeBruijn "" i) dbTerm
let !evalCtx = mkDefaultEvalCtx semvar
performGC
-- Store the term in an IORef so GHC cannot CSE/share the result of

kwxm (Contributor, Author) commented:

This was Claude's idea, although it needed several attempts after I had to point out that it was failing to take account of call-by-need. I probably wouldn't have thought of this particular trick myself.
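For readers unfamiliar with the problem: under call-by-need, an expression that looks the same on every iteration can be evaluated once and shared, so later iterations measure almost nothing. Below is a minimal sketch of the general IORef idea, not the code in this PR; `runFresh` is a hypothetical name.

```haskell
import Control.Exception (evaluate)
import Control.Monad (replicateM)
import Data.IORef (IORef, newIORef, readIORef)

-- | Apply @f@ to the stored input @n@ times, re-reading the input from the
-- IORef on every iteration. Because the compiler cannot assume that two
-- reads return the same value, it cannot float @f x@ out of the loop and
-- share a single evaluation between iterations.
runFresh :: Int -> (a -> b) -> IORef a -> IO [b]
runFresh n f ref = replicateM n $ do
  x <- readIORef ref
  evaluate (f x) -- force to weak head normal form on each iteration
```

A call would look like `newIORef term >>= runFresh 100 evalTerm`, where `term` and `evalTerm` stand in for whatever is being measured.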

)
<$> allASTs trace

{- TODO: This is an exact copy of some code in `PlutusBenchmark.Common`. Check

kwxm (Contributor, Author) commented:

To have a single version of this we'd have to move it into plutus-ledger-api to avoid dependency problems and I didn't want to do that right now because it wasn't clear whether it'd affect the standard benchmarking process: see #7796.

Unisay (Contributor) commented Jun 24, 2026

Our benchmarks run the same script hundreds of times in a row, so the CPU gets to "learn" it and clocks it as much as 20% faster than it often runs on-chain, where a given script is usually jumbled in with many different ones rather than repeated back-to-back. By "learn" I mean the branch predictor and instruction cache tune themselves to that one script's control flow, which a tight repeat loop maximizes and a mixed block mostly doesn't. So for a one-off script we're measuring close to a best case, not the real cost; a heavily reused contract sits closer to the benchmark.
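One way to probe this effect would be to interleave several scripts instead of repeating one back-to-back. The sketch below is only an illustration of that idea with hypothetical names; it is not something the PR implements, and running everything in one process still shares the evaluator's own code between scripts, so it only approximates on-chain conditions.

```haskell
import Control.Exception (evaluate)
import System.CPUTime (getCPUTime)

-- | Time a single run of an action in CPU picoseconds, forcing the result
-- to weak head normal form inside the timed region.
timeOnce :: IO a -> IO Integer
timeOnce act = do
  t0 <- getCPUTime
  _ <- act >>= evaluate
  t1 <- getCPUTime
  pure (t1 - t0)

-- | Run every action once per round for the given number of rounds and
-- return the total time per action. With two or more actions, no action
-- is ever run twice in a row.
interleaved :: Int -> [IO a] -> IO [Integer]
interleaved rounds acts = go rounds (map (const 0) acts)
  where
    go k totals
      | k <= 0 = pure totals
      | otherwise = do
          ts <- mapM timeOnce acts
          totals' <- mapM evaluate (zipWith (+) totals ts)
          go (k - 1) totals'
```

Dividing each total by the number of rounds gives a per-script mean that can be compared with the back-to-back figure for the same script.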

t1 <- getCPUTime
let !ok = either (const False) (const True) r
loop (k - 1) ok (total + (t1 - t0))
(lastOk, totalPs) <- loop count True 0


Nit: can check allOk as opposed to lastOk.
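A sketch of what that suggestion might look like, modelled loosely on the loop in the snippet above. The function name and the generic `IO (Either e a)` argument are assumptions made for illustration, not the PR's actual definitions.

```haskell
{-# LANGUAGE BangPatterns #-}
import System.CPUTime (getCPUTime)

-- | Run an evaluation @count@ times, accumulating the total CPU time in
-- picoseconds and whether every run succeeded, rather than only the
-- last one.
timedRuns :: Int -> IO (Either e a) -> IO (Bool, Integer)
timedRuns count run = loop count True 0
  where
    loop k !allOk !total
      | k <= 0 = pure (allOk, total)
      | otherwise = do
          t0 <- getCPUTime
          r <- run
          t1 <- getCPUTime
          let !ok = either (const False) (const True) r
          -- A single failure makes the overall flag False for good.
          loop (k - 1) (allOk && ok) (total + (t1 - t0))
```

With `lastOk`, a run that fails on an early iteration but succeeds on the final one would be reported as a success; accumulating with `&&` catches that case.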
