Solving Combinatorial Problems on Adaptive Many-Core Architectures (2018)

.footer.venue[Many Core Workshop 2018 - York, UK - June 15<sup>th</sup> 2018]

---

# Solving Combinatorial Problems on Adaptive Many-Core Architectures

Ghaith Tarawneh.red[*] and Andrey Mokhov

<ul affiliations>
  <li>School of Engineering, Newcastle University, UK</li>
</ul>

---

.footer.venue[Many Core Workshop 2018 - York, UK - June 15<sup>th</sup> 2018]

---

## Outline

---

# The POETS Project

---

## POETS Project Team

---

## What is POETS?

Massively-parallel architecture consisting of an extremely large number
(>10.sup[6]) of very simple cores that communicate by exchanging short
messages .block[(few bytes)]

* Cores are embedded in a fast parallel communications fabric (the .emph[core mesh]).

* Emphasis is on achieving high scalability by taking computation to the data, instead of the other way around.

---

## Scientific Computing .subslide[(1/2)]

POETS is not a general purpose architecture but it provides .emph[significant speedup] for certain classes of problems, particularly in scientific computing.

* Finite element analysis
  * Neural modeling
  * Particle simulations
  * Image processing
  * Drug screening
  * Weather modeling

---

## Scientific Computing .subslide[(2/2)]

Problems must be decomposed and mapped to the core mesh

* Problem decomposition is known apriori

* Pre-compute mapping between problem domain and core mesh
---

## Combinatorial Problems .subslide[(1/2)]

POETS is also well suited for discrete problems.

* State space exploration
  * Linear programming
  * Boolean satisfiability
  * Constraint-based optimization

---

## Combinatorial Problems .subslide[(2/2)]

* The problem domain is discovered during execution

* Decomposed sub-problems are mapped to the core mesh at runtime

---

# Programming Model

---

## Motivating Example .subslide[(1/2)]

Partial code snippet from an MPI program to approximate π

```C
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main (int argc, char *argv[])
{
    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD,&taskid);
    printf ("MPI task %d has started...\n", taskid);
    avepi = 0;
    for (i = 0; i < ROUNDS; i++) {
        homepi = dboard(DARTS);
        if (taskid == MASTER) {
            pi = pisum/numtasks;
            avepi = ((avepi * i) + pi)/(i + 1);
            printf("After %8d throws, average value of pi = %10.8f\n",
                    (DARTS * (i + 1)),avepi);
        }
    }
    MPI_Finalize();
    return 0;
}
```

]
---

## Motivating Example .subslide[(2/2)]

- Message passing is a very popular programming/execution model for massively-parallel machines

- Message passing programs can be .emph[reliable and performant] but they tend to mix .emph[problem logic] with .emph[non-problem-related concerns] .block[(e.g. node discovery], load balancing and synchronization)

- This is .emph[non-ideal from a software engineering perspective] because it makes programs and algorithms:

* difficult to read, refactor and maintain
  * problem-specific (difficult to re-use)
  * architecture-specific (difficult to port)

- Can we do better?

---

## Proposed Programming Model

We propose a multi-layer software stack to develop combinatorial solvers for massively-parallel POETS-like architectures.

---

## Layers 1 & 2

* Expose `init` and `receive` functions to initialize core states then update them on message arrival

* Example algorithm (mesh traversal):

```Python
  def init(node):
      state = {visited: False}
      return state

def receive(node, state, sender, msg, send, neighbours):
      if state[visited] == False then
          state[visited] = True
          for n in neighbours do
              send(n, EMPTY_MSG)
  ```

---

## Layer 3

Handles sub-problem distribution and load balancing. Example (summing 0 to n):

```Python
  def receive(state, ticket, msg, send):
      if msg == Call(n) then
          if n<1 then
              send(Result(0), ticket)
          else
              ticket = send(Call(n-1))
              state = Continue(ticket, n)
      else if msg == Result(total) then
          if state == Continue(ticket, n) then
              send(Result(total + n), ticket)
          else
              state = Done(total)
      else if msg == Trigger then
          send(Call(10))
  ```

---

## Layer 4

Implement a generalized form of the "ticket pattern" in Layer 3 so that users won't have to do call bookkeeping themselves.

This layers abstracts away message passing and exposes a different programming model to the user (e.g. imperative, functional).

---

## Layer 5

An alternative implementation to sum the numbers 0 to n:

```Python
def calculate_sum(n):
    if n<1 then
        yield Result(0)
    else
        yield Call(n-1)
        total = yield Sync()
        yield Result(total + n)
```

---

# Use Case: SAT Solver

---

## POETS Simulator

- We created a model POETS machine in Python and used it to develop and evaluate a concrete implementation of the model.

---

## Boolean Satisfiability (SAT)

SAT is a classic NP-complete problem:

- Given a list of Boolean expression in Conjunctive Normal Form (CNF), determine if an assignment exists such that all statements evaluate to true.

- There are many clever heuristics to prune the search space but all solutions are essentially a form of state space exploration.

We wanted to find out how much coding effort is needed to:

1. Port an existing SAT solver algorithm (DPLL) to our architecture

2. Experiment with architectural features and their impact on performance

3. Produce code that we can re-use to develop other combinatorial solvers

---

## Ported Solver Algorithm (DPLL)

```Python
def solve_sat(problem):
    if consistent(problem) then
        yield Result(SAT)
    if exist_empty_clause(problem) then
        yield Result(UNSAT)
    for clause in problem[clauses] do
        if unit_clause(clause) then
            unit_propagate(problem, clause)
    for literal in problem[literals] do
        if literal_pure(literal) then
            assign_pure(problem, literal)
    L = select_literal(problem)
    subp1 = assign(problem, L, True)
    subp2 = assign(problem, L, False)
    yield [is_SAT, Call(subp1), Call(subp2)]
    result = yield Sync()
    yield result
```

---

## Simulation Results

- We ran experiments to test several mesh topologies and mapping algorithms.

- Our main insight is that we did not have to modify the application code for
running on different mesh dimensions, topologies or mapping algorithms.

---

## Conclusion

1. We presented a programming model to develop combinatorial solvers for .block[a novel] supercomputer architecture (POETS)

2. Using the model gave us the best of two worlds:

- The convenience and portability of a familiar programming style at the application layer (layer 5)

- An underlying hierarchy of solutions to common problems (layers 1-4), which we can maintain, optimize and re-use for developing other solvers

3. Abstraction is an age-old idea but supercomputer programming models and abstractions tend to gravitate towards being either too thin or too deep.

4. We need more hierarchical models with building blocks (at different levels of organization) that we can maintain, port and re-use.