Metastability

In my PhD, I studied the phenomenon of metastability in multi-clock systems.

Metastability is a pseudo-stable state in which a physical system can linger for an unbounded amount of time before transitioning to a stable state. This is commonly illustrated with the following mechanical analogy:

This is a “ball and hill” dynamical system with two stable states, S1 and S2. Assume the ball is initially placed at a random position somewhere on the hill and then allowed to roll, and measure the time t until it reaches either S1 or S2. Intuitively, t grows the closer the ball's initial placement is to the hilltop (call this distance Δd). In a Newtonian universe, Δd can be made arbitrarily small, and consequently t can be made arbitrarily large. If the ball is initially placed at the hypothetical tipping point itself (Δd = 0), it will remain stuck there forever (t = ∞).
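
This intuition is easy to check numerically. Below is a minimal Python sketch, assuming an overdamped ball on an illustrative double-well hill (potential V(x) = -x²/2 + x⁴/4, with the tipping point at x = 0 and stable states near x = ±1); the function name and parameter values are made up for illustration:

```python
def resolution_time(delta_d, dt=1e-4, settle=0.99):
    """Time for an overdamped ball starting delta_d away from the
    tipping point (x = 0) of the hill V(x) = -x**2/2 + x**4/4 to
    settle near one of the stable states at x = +/-1."""
    x, t = delta_d, 0.0
    while abs(x) < settle:       # not yet settled in S1 or S2
        x += (x - x**3) * dt     # overdamped dynamics: dx/dt = -V'(x)
        t += dt
    return t

# t grows without bound as delta_d shrinks (roughly as ln(1/delta_d)):
for delta_d in (1e-1, 1e-2, 1e-4, 1e-8):
    print(f"delta_d = {delta_d:.0e}  ->  t = {resolution_time(delta_d):.2f}")
```

Note that halving Δd does not halve anything; it adds a roughly fixed amount to t, so no finite bound on t can cover all initial placements.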

This raises the following interesting question ...

Is it possible to engineer the hill such that an upper bound can be placed on t?

For example, what about the following "clever solution"?

Here, we're attempting to make the ball roll to S2 faster by changing the hill's shape. Alas, the hill still has a tipping point and Δd can still be arbitrarily small.

It's not that the new shape isn't good enough. It is in fact fundamentally impossible to place an upper bound on the resolution time of a continuous multistable system whose initial state is unconstrained. In the example above, no matter the shape of the hill, so long as:

  1. there are two stable states, and
  2. the slope is continuous

then no upper bound can be placed on t.
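
Here is a sketch of why (assuming smooth, overdamped dynamics; λ and c are illustrative constants, not from this post): a continuous slope means the dynamics can be linearized near the tipping point x₀, so the escape is exponential in time:

```latex
% Linearizing near the tipping point x_0, for some \lambda > 0:
\frac{dx}{dt} \approx \lambda \, (x - x_0)
% starting a distance \Delta d away, the displacement grows as
x(t) - x_0 \approx \Delta d \, e^{\lambda t}
% so the time to reach any fixed escape distance c is
t \approx \frac{1}{\lambda} \ln\frac{c}{\Delta d},
% which diverges as \Delta d \to 0.
```

Reshaping the hill only changes λ and c; it cannot remove the logarithmic divergence in Δd.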

It took decades of elusive failures and many failed attempts before computer engineers finally accepted this hard truth (and even today, the counter-intuitive nature of the problem still results in numerous design goofs).

So how does this problem affect modern computers?

Modern computers are built from multiple digital components that run on different time references (i.e. different clocks), such as processor cores, memory units and GPUs. Many of these components are internally synchronous but must treat signals from their neighbors as asynchronous, because those signals are derived from different time references. On every clock tick, each component must decide whether each of its asynchronous inputs is in a logic-high or logic-low state. However, asynchronous inputs can transition at any time, including moments arbitrarily close to the decision's tipping point. This situation is essentially the ball and hill analogy: a binary decision is required, but the distance to the tipping point (Δd) cannot be constrained. As a consequence, there will always be a non-zero probability that the component fails.
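
How likely is a sample to land dangerously close to the tipping point? Here is a quick Monte Carlo sketch with made-up numbers (a 1 GHz clock and a 20 ps "aperture" around the tipping point; neither figure comes from this post):

```python
import random

T_CLK = 1.0e-9     # clock period: 1 GHz (illustrative assumption)
APERTURE = 20e-12  # window around the tipping point (illustrative)

# An asynchronous edge lands uniformly at random within the clock
# period; the decision risks metastability when the edge falls
# inside the aperture around the sampling instant.
trials = 1_000_000
events = sum(1 for _ in range(trials)
             if random.uniform(0.0, T_CLK) < APERTURE)

print(f"risky samples: {events}/{trials} "
      f"(expected fraction ~ {APERTURE / T_CLK:.1%})")
```

Careful engineering can shrink the aperture, but, per the argument above, no continuous system has an aperture of zero width.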

In this context, a "failure" can refer to an arbitrary violation of how the component (and consequently the whole computer) is intended to function. In practice, this can mean things like applications terminating unexpectedly, data corruption or one of the infamous OS crash screens.

These sorts of failures cannot be avoided, but their risk can be made arbitrarily small by allowing components longer times to decide whether their asynchronous inputs are logic low or high (this is analogous to being extra patient before judging whether the ball has reached a stable state). The catch, however, is that the system suffers a performance penalty because of these decision delays.
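
This trade-off is commonly quantified with the standard synchronizer reliability model MTBF = e^(t_r/τ) / (T_w · f_clk · f_data), where t_r is the time allotted for the decision to resolve. Below is a small Python sketch with illustrative parameter values (none are from this post):

```python
import math

TAU = 50e-12    # resolution time constant tau (illustrative)
T_W = 20e-12    # metastability window T_w (illustrative)
F_CLK = 1e9     # sampling clock frequency (illustrative)
F_DATA = 100e6  # asynchronous transition rate (illustrative)

def mtbf(t_r):
    """Mean time between synchronization failures (seconds) for a
    decision allowed t_r seconds to resolve: exponential in t_r."""
    return math.exp(t_r / TAU) / (T_W * F_CLK * F_DATA)

# Waiting longer pays off exponentially:
for t_r in (0.5e-9, 1.0e-9, 2.0e-9):
    print(f"t_r = {t_r * 1e9:.1f} ns  ->  MTBF = {mtbf(t_r):.2e} s")
```

With these made-up numbers, doubling the decision time from 1 ns to 2 ns takes the MTBF from a few minutes to thousands of years; this exponential payoff is why synchronizer designs trade latency for reliability.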

I published a set of speculation-based workaround solutions that enable computer systems to perform decision-dependent computations before a metastable decision is resolved. A thorough discussion of these solutions, along with other thoughts on the problem, can be found in my PhD thesis.