The overhead of coroutine processes is fairly high. A clock driver
implemented through a coroutine process is mostly overhead. This was
partially addressed in commit 2398b792 by micro-optimizing yielding.
This commit eliminates the coroutine process overhead completely by
introducing dedicated clock processes. It also reduces the clock
driver logic to a simple toggle.
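The difference between the two kinds of driver can be sketched as
follows. This is only an illustration, not the simulator's actual code:
`Signal`, `coroutine_clock`, `ToggleClock`, `run_coroutine` and
`run_toggle` are hypothetical names, and the scheduler is reduced to a
plain loop.

```python
class Signal:
    """Hypothetical stand-in for a simulator signal holding 0 or 1."""
    def __init__(self):
        self.value = 0


def coroutine_clock(signal, half_period):
    """Generator-based driver: every edge costs a generator resume."""
    while True:
        yield half_period      # ask the (imaginary) scheduler to wait
        signal.value ^= 1      # then flip the clock


class ToggleClock:
    """Dedicated driver: every edge is one method call that flips the signal."""
    def __init__(self, signal, half_period):
        self.signal = signal
        self.half_period = half_period

    def step(self):
        self.signal.value ^= 1
        return self.half_period  # delay until the next edge


def run_coroutine(edges, half_period=5):
    sig = Signal()
    gen = coroutine_clock(sig, half_period)
    next(gen)                  # prime the generator up to its first yield
    trace = []
    for _ in range(edges):
        next(gen)              # resume: toggle, loop, yield again
        trace.append(sig.value)
    return trace


def run_toggle(edges, half_period=5):
    sig = Signal()
    clk = ToggleClock(sig, half_period)
    trace = []
    for _ in range(edges):
        clk.step()
        trace.append(sig.value)
    return trace
```

Both drivers produce the same waveform; the dedicated one simply avoids
suspending and resuming a generator for every clock edge.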
This change improves runtime by about 12% on Minerva SRAM SoC.

When a literal is used on the left-hand side of a numeric operator,
Python is able to constant-fold some expressions:

    >>> dis.dis(lambda x: 0 + 0 + x)
      1           0 LOAD_CONST               1 (0)
                  2 LOAD_FAST                0 (x)
                  4 BINARY_ADD
                  6 RETURN_VALUE
If the literals are on the right-hand side, so that the left-hand
operand is a variable, no folding happens:

    >>> dis.dis(lambda x: x + 0 + 0)
      1           0 LOAD_FAST                0 (x)
                  2 LOAD_CONST               1 (0)
                  4 BINARY_ADD
                  6 LOAD_CONST               1 (0)
                  8 BINARY_ADD
                 10 RETURN_VALUE
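The two listings above can be compared programmatically. The sketch
below counts binary-operator instructions instead of matching exact
opcode names, since newer CPython releases emit BINARY_OP where the
listings show BINARY_ADD. The helper name `count_binary_ops` is made up
for this example, and the result is specific to CPython's bytecode
compiler.

```python
import dis


def count_binary_ops(func):
    """Count the binary-operator instructions in a function's bytecode."""
    return sum(
        1
        for instr in dis.get_instructions(func)
        # BINARY_ADD on older CPython, BINARY_OP on newer releases
        if instr.opname.startswith("BINARY_")
    )


# Literals first: `0 + 0` is folded at compile time, one addition remains.
folded = count_binary_ops(lambda x: 0 + 0 + x)

# Variable first: `(x + 0) + 0` cannot be folded, both additions remain.
unfolded = count_binary_ops(lambda x: x + 0 + 0)
```

On CPython, `folded` should come out as 1 and `unfolded` as 2, matching
the instruction counts in the listings.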
PyRTL generates fairly redundant code due to the pervasive masking.
Because of that, rewriting expressions so that literals come first
(the foldable form above), where possible, improves runtime by about
10% on Minerva SRAM SoC.
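As a rough illustration of the idea, and not PyRTL's actual emitter, the
sketch below builds a chained expression with the integer literals
placed before the variable names. `literals_first` and
`count_binary_ops` are hypothetical helpers. The reordering assumes an
operator that is commutative and associative on integers, such as `&`,
`|`, `^` or `+`.

```python
import dis


def literals_first(operands, op="&"):
    """Join operands with `op`, placing integer literals before names."""
    literals = [str(o) for o in operands if isinstance(o, int)]
    names = [o for o in operands if not isinstance(o, int)]
    return f" {op} ".join(literals + names)


def count_binary_ops(expr):
    """Compile `lambda x: <expr>` and count its binary-operator opcodes."""
    func = eval(f"lambda x: {expr}")
    return sum(
        1
        for instr in dis.get_instructions(func)
        if instr.opname.startswith("BINARY_")
    )


naive = "x & 255 & 15"                          # masks applied one by one
reordered = literals_first(["x", 255, 15])      # "255 & 15 & x"

# Both forms compute the same value for integer inputs.
same_result = eval(f"lambda x: {naive}")(0x1234) == \
    eval(f"lambda x: {reordered}")(0x1234)
```

With the literals first, CPython can fold `255 & 15` at compile time, so
the reordered expression needs one fewer bytecode operation per
evaluation while computing the same result.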