In my PhD, I studied the phenomenon of metastability in multi-clock systems.
Metastability is a pseudo-stable state in which a physical system can take an unbounded amount of time before transitioning to a stable state. This is commonly illustrated with the following mechanical analogy:
This is a “ball and hill” dynamical system with two stable states S1 and S2. Assume the ball is initially placed at a random position somewhere on the hill then allowed to roll, measuring the time t until it reaches either S1 or S2. You may be able to see that t will be higher the closer the initial placement of the ball was to the hill top (call this difference Δd).

In a Newtonian universe, Δd can be infinitely reduced and consequently t can be arbitrarily increased. If the ball is initially placed at the hypothetical tipping point itself (Δd = 0), it will remain stuck there forever (t = ∞).
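To make this concrete, here is a minimal simulation sketch (my illustration, not part of the original analogy): near a smooth hill top the dynamics linearize to ẋ = λx, so a ball starting at distance Δd escapes in a time proportional to ln(1/Δd). The rate λ, the escape threshold and the step size are all illustrative assumptions.

```python
import math

# Near a smooth hill top the dynamics linearize to dx/dt = LAM * x:
# a ball starting at x = delta_d drifts away exponentially slowly.
# LAM and X_ESCAPE are illustrative assumptions, not measured values.
LAM = 1.0        # local divergence rate at the tipping point (1/s)
X_ESCAPE = 1.0   # displacement at which we declare the ball "resolved"

def escape_time(delta_d, dt=1e-4):
    """Integrate dx/dt = LAM * x with Euler steps until |x| >= X_ESCAPE."""
    x, t = delta_d, 0.0
    while abs(x) < X_ESCAPE:
        x += LAM * x * dt
        t += dt
    return t

for delta_d in (1e-1, 1e-3, 1e-6, 1e-9):
    # Closed form for comparison: t = (1/LAM) * ln(X_ESCAPE / delta_d)
    closed = math.log(X_ESCAPE / delta_d) / LAM
    print(f"Δd = {delta_d:.0e}:  simulated t ≈ {escape_time(delta_d):6.2f} s"
          f"  (closed form {closed:6.2f} s)")
```

Note that dividing Δd by a constant factor only *adds* a constant to t, so t grows without bound as the starting point approaches the tipping point.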
This raises the following interesting question ...

Is it possible to engineer the hill such that an upper bound can be placed on t?
For example, what about the following "clever solution"?
Here, we're attempting to make the ball roll to S2 faster by changing the hill's shape. Alas, the hill still has a tipping point and Δd can still be arbitrarily small.
It's not that the new shape isn't good enough. It is in fact fundamentally impossible to place an upper bound on the resolution time of a continuous multistable system whose initial state is unconstrained. In the example above, no matter the shape of the hill, so long as:
- there are two stable states, and
- the slope is continuous,

then no upper bound can be placed on t.
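A sketch of why, assuming the slope is smooth enough to linearize near the tipping point (λ below is the local divergence rate and x* an arbitrary fixed displacement at which the ball counts as resolved; both symbols are my illustrative additions): near the top the motion obeys ẋ ≈ λx, so

$$ x(t) \approx \Delta d \, e^{\lambda t} \quad\Rightarrow\quad t(\Delta d) \approx \frac{1}{\lambda} \ln\frac{x^{*}}{\Delta d} \;\to\; \infty \quad \text{as } \Delta d \to 0. $$

Reshaping the hill only changes λ and x*; it cannot remove the logarithmic divergence.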
It took decades of elusive failures and foiled attempts before computer engineers finally accepted this hard truth (and even today, the counter-intuitive nature of the problem still results in numerous design goofs).
So how does this problem affect modern computers?
Modern computers are built from multiple digital components that run with different time references (i.e. different clocks), such as processor cores, memory units and GPUs. Many of these components are synchronous systems but treat signals from their neighbors as asynchronous because they're derived from different time references. On every clock tick, each component must decide whether each of its asynchronous inputs is in a logic high or low state. However, asynchronous inputs can transition at any time, including times that are very close to the decision's tipping point. This situation is essentially the same as the ball and hill analogy: a binary decision is required but the distance to the tipping point (Δd) cannot be constrained. In consequence, there will always be a non-zero probability that the component will fail.
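A minimal Monte Carlo sketch of this failure mode (again my illustration; the device numbers are made-up assumptions, not from any real part): an asynchronous edge lands uniformly at random within a clock period, and samples that land within a small window around the clock edge resolve slowly, following the same logarithmic law as the ball.

```python
import math
import random

# Illustrative, made-up device parameters:
CLOCK_PERIOD = 1e-9      # 1 GHz sampling clock (s)
TAU = 20e-12             # metastability resolution time constant (s)
T0 = 50e-12              # width of the vulnerable window around the edge (s)
DECISION_TIME = 100e-12  # how long we wait before trusting the sample (s)

random.seed(0)
failures, TRIALS = 0, 1_000_000
for _ in range(TRIALS):
    dt = random.uniform(0.0, CLOCK_PERIOD)   # asynchronous edge position
    dd = min(dt, CLOCK_PERIOD - dt)          # distance to the clock edge
    # Resolution time grows like TAU * ln(T0 / dd) as dd -> 0,
    # mirroring t = (1/λ) * ln(x*/Δd) from the hill analogy.
    if dd == 0.0 or (dd < T0 and TAU * math.log(T0 / dd) > DECISION_TIME):
        failures += 1

analytic = 2 * T0 * math.exp(-DECISION_TIME / TAU) / CLOCK_PERIOD
print(f"simulated failure probability ≈ {failures / TRIALS:.1e} per sample")
print(f"analytic  failure probability ≈ {analytic:.1e} per sample")
```

No choice of window or waiting time drives the probability to zero; it only shrinks exponentially with the time we are willing to wait.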
In this context, a "failure" can refer to arbitrary violations of how the component (and, consequently, the whole computer) is intended to function. In practice, this can mean things like applications terminating unexpectedly, data corruption, or one of the infamous OS crash screens.
These sorts of failures cannot be avoided, but their risk can be arbitrarily decreased by allowing components more time to decide whether their asynchronous inputs are logic low or high (this is analogous to being extra patient before judging whether the ball has reached a stable state). The catch, however, is that the system suffers a performance penalty because of these decision delays.
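The standard back-of-the-envelope MTBF model for synchronizers quantifies this trade-off: reliability improves exponentially with the allotted decision time. A sketch, with illustrative device constants (assumptions, not a real datasheet):

```python
import math

# Classic synchronizer reliability model:
#   MTBF = exp(t_resolve / TAU) / (T_W * F_CLK * F_DATA)
# All constants below are illustrative assumptions.
TAU = 20e-12     # resolution time constant (s)
T_W = 50e-12     # metastability window (s)
F_CLK = 1e9      # sampling clock frequency (Hz)
F_DATA = 100e6   # asynchronous input transition rate (Hz)

def mtbf_seconds(t_resolve):
    """Mean time between metastability failures for a given decision delay."""
    return math.exp(t_resolve / TAU) / (T_W * F_CLK * F_DATA)

SECONDS_PER_YEAR = 3.156e7
for cycles in (1, 2, 3):             # clock periods spent waiting
    t = cycles / F_CLK
    print(f"wait {cycles} cycle(s): "
          f"MTBF ≈ {mtbf_seconds(t) / SECONDS_PER_YEAR:.2e} years")
```

Each extra cycle of waiting multiplies the MTBF by e^(T/τ), an astronomical factor, but every one of those cycles is added latency on the crossing: exactly the performance penalty described above.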
I published a set of speculation-based workarounds that enable computer systems to perform decision-dependent computations before a metastable decision has resolved. A thorough discussion of these solutions, along with other thoughts on the problem, is published in my PhD thesis.