

# GPU-based image processing with the UFO framework

### **Matthias Vogelgesang**

matthias.vogelgesang@kit.edu





# **INTRODUCTION**

# **Institute for Data Processing and Electronics**



#### Hardware

- Development (FPGA, ASIC)
- Manufacturing (circuit production, bonding)
- Characterization and long-term tests

#### Software

- Experiment control and data acquisition
- Analysis of acquired data
- Large scale data storage



# **Institute for Data Processing and Electronics**



#### Hardware

- Development (FPGA, ASIC)
- Manufacturing (circuit production, bonding)
- Characterization and long-term tests

#### Software

- Experiment control and data acquisition
- Analysis of acquired data
- Large scale data storage



## Data analysis for synchrotron $\mu$ CT



## Higher requirements

- Compute-intensive reconstruction
- More pre- and post-processing
- Faster and direct feedback

#### More data

- Better sensors
- Higher throughput
- Time-resolved scans



## Data analysis for synchrotron $\mu$ CT



### Higher requirements

- Compute-intensive reconstruction
- More pre- and post-processing
- Faster and direct feedback

#### More data

- Better sensors
- Higher throughput
- Time-resolved scans

Existing tools can hardly satisfy the demands!





















### Using larger integration

- Complex instruction sets
- Larger caches
- More cores



Figure : Intel Haswell cpu-Die



### Using larger integration

- Complex instruction sets
- Larger caches
- More cores



Figure : Intel Haswell cpu-Die



### Using larger integration

- Complex instruction sets
- Larger caches
- More cores



Figure : Intel Haswell cpu-Die



### Using larger integration

- Complex instruction sets
- Larger caches
- More cores

### Parallelization required

- Instruction level (sse, avx)
- Multi-core CPUs und many-core GPUs
- Multi-node cluster



Figure: Intel Haswell cpu-Die



### Using larger integration

- Complex instruction sets
- Larger caches
- More cores

### Parallelization required

- Instruction level (sse, avx)
- Multi-core CPUs und many-core GPUs
- Multi-node cluster



Figure: Intel Haswell cpu-Die



# HETEROGENEOUS STREAM PROCESSING



### Requirements

- Process image streams on the fly
- Use heterogeneous compute systems

- Define tasks of work
- Connect processing workflows
- Let a run-time schedule tasks



### Requirements

- Process image streams on the fly
- Use heterogeneous compute systems



- Define tasks of work
- Connect processing workflows
- Let a run-time schedule tasks





### Requirements

- Process image streams on the fly
- Use heterogeneous compute systems

- Define tasks of work
- Connect processing workflows
- Let a run-time schedule tasks









### Requirements

- Process image streams on the fly
- Use heterogeneous compute systems

- Define tasks of work
- Connect processing workflows
- Let a run-time schedule tasks















### Requirements

- Process image streams on the fly
- Use heterogeneous compute systems

- Define tasks of work
- Connect processing workflows
- Let a run-time schedule tasks











### Requirements

- Process image streams on the fly
- Use heterogeneous compute systems

- Define tasks of work
- Connect processing workflows
- Let a run-time schedule tasks





#### Execution model

- Sequence of multi-dimensional data  $x_k$
- Tasks  $t_i$  generate, process and consume  $x_k$ 's
- Computation ist inherently sequential

## Parallelisation opportunities

Tasks have data dependencies but don't share state

- Pipeline parallelism
- Data parallelism on multiple cores
- Data parallelism on GPUs





#### Execution model

- Sequence of multi-dimensional data  $x_k$
- Tasks  $t_i$  generate, process and consume  $x_k$ 's
- Computation ist inherently sequential

## Parallelisation opportunities

Tasks have data dependencies but don't share state

- Pipeline parallelism
- Data parallelism on multiple cores
- Data parallelism on GPUs





#### Execution model

- Sequence of multi-dimensional data x<sub>k</sub>
- Tasks  $t_i$  generate, process and consume  $x_k$ 's
- Computation ist inherently sequential

## Parallelisation opportunities

Tasks have data dependencies but don't share state

- Pipeline parallelism
- Data parallelism on multiple cores
- Data parallelism on GPUs





#### Execution model

- Sequence of multi-dimensional data  $x_k$
- Tasks  $t_i$  generate, process and consume  $x_k$ 's
- Computation ist inherently sequential

## Parallelisation opportunities

Tasks have data dependencies but don't share state

- Pipeline parallelism
- Data parallelism on multiple cores
- Data parallelism on GPUs



# **Reconstruction throughput**





## **Reconstruction throughput**





# **Reconstruction throughput**







### Static assignment

- Partitioning of tasks
- Identification of processing units
- Depth-first search for path identification
- Insertion of duplicates
- Mapping and execution

- Avoids unnecessary data transfers
- Automatic scaling





### Static assignment

- Partitioning of tasks
- Identification of processing units
- Depth-first search for path identification
- Insertion of duplicates
- Mapping and execution

#### **Benefits**

- Avoids unnecessary data transfers
- Automatic scaling



10



### Static assignment

- Partitioning of tasks
- Identification of processing units
- Depth-first search for path identification
- Insertion of duplicates
- Mapping and execution

- Avoids unnecessary data transfers
- Automatic scaling









### Static assignment

- Partitioning of tasks
- Identification of processing units
- Depth-first search for path identification
- Insertion of duplicates
- Mapping and execution

- Avoids unnecessary data transfers
- Automatic scaling









### Static assignment

- Partitioning of tasks
- Identification of processing units
- Depth-first search for path identification
- Insertion of duplicates
- Mapping and execution

- Avoids unnecessary data transfers
- Automatic scaling









### Static assignment

- Partitioning of tasks
- Identification of processing units
- Depth-first search for path identification
- Insertion of duplicates
- Mapping and execution

#### **Benefits**

- Avoids unnecessary data transfers
- Automatic scaling



10

# **Reconstruction scalability**





# **Reconstruction scalability**





11



## Strategy

- Again, we assume existing task graph
- Proxy task represents subpath
- Instantiate subpath remotely
- Send and receive input and output

#### **Benefits**

- Local multi gpu optimization
- Abstracts from network communication (InfiniBand via MPI, Zeromq)



Local



## Strategy

- Again, we assume existing task graph
- Proxy task represents subpath
- Instantiate subpath remotely
- Send and receive input and output

#### **Benefits**

- Local multi gpu optimization
- Abstracts from network communication (InfiniBand via MPI, Zeromq)



Local



## Strategy

- Again, we assume existing task graph
- Proxy task represents subpath
- Instantiate subpath remotely
- Send and receive input and output

#### **Benefits**

- Local multi gpu optimization
- Abstracts from network communication (InfiniBand via MPI, Zeromq)



Remote



## Strategy

- Again, we assume existing task graph
- Proxy task represents subpath
- Instantiate subpath remotely
- Send and receive input and output

#### **Benefits**

- Local multi gpu optimization
- Abstracts from network communication (InfiniBand via MPI, Zeromg)



# Scalability







Scalability limited by compute and data transfer ratio

13

# **Implementation**



#### Framework

- Concepts are implemented as a C library
- Accelerator devices are accessed through OpenCL
- Language bindings are provided through GObject



## **Applications**

- Flat-correction, denoising, data conversion
- Filtered backprojection for tomographic and laminographic reconstruction
- Direct Fourier methods and algebraic techniques
- Feature detection, particle tracking

14

# Usage



#### As a user

- Command line
  - \$ ufo-launch read ! blur ! write filename=foo.tif
- Pre-defined JSON
  - \$ ufo-runjson pipeline.json
- tango interface

# Usage



#### As a user

- Command line
  - \$ ufo-launch read ! blur ! write filename=foo.tif
- Pre-defined JSON
  - \$ ufo-runjson pipeline.json
- tango interface new

# Usage



#### As a user

- Command line
  - \$ ufo-launch read ! blur ! write filename=foo.tif
- Pre-defined JSON
  - \$ ufo-runjson pipeline.json
- tango interface new

### As a developer

- Directly via C API
- Through language bindings, e.g. standard Python
- High-level Python interface

## **Standard Python interface**



```
from gi.repository import Ufo
pm = Ufo.PluginManager()
read = pm.get task('read')
opencl = pm.get_task('opencl')
write = pm.get task('write')
read.set properties(path='/home/data/*.tiff')
opencl.set_properties(source='...', kernel='...')
write.set_properties(filename='/home/out.tiff')
g = Ufo.TaskGraph()
g.connect_nodes(read, opencl)
g.connect_nodes(opencl, write)
sched = Ufo.Scheduler()
sched.run(g)
```

# High-level Python



```
from ufo import Read, Write, Opencl
read = Read(path='/home/data/*.tiff')
opencl = Opencl(source='...', kernel='...')
write = Write(filename='/home/out.tiff')
# write to disk
write(opencl(read())).run().wait()
# or use result
for image in opencl(read()):
    print(np.mean(image))
```

## JSON representation



```
"nodes": [
 {"plugin": "read", "name": "read",
   "properties": {"path": "/home/data/*.tiff"}},
 {"plugin": "opencl", "name": "opencl",
   "properties": {"source": "...", "kernel": "..."}},
  {"plugin": "write", "name": "write",
   "properties": {"filename": "/home/out.tiff"}}
"edges": [
 {"from": "read", "to": "opencl", "input": 0},
 {"from": "opencl", "to": "write", "input": 0}
```



# **OPENCL**

# **Implementations**



#### OpenCL is widely supported but ...

| Vendor | Rev. | GPU      | CPU | FPGA | OS           |
|--------|------|----------|-----|------|--------------|
| NVIDIA | 1.1  | <b>~</b> | _   | _    | <b>∆ ≈ ¢</b> |
| AMD    | 2.0  | ~        | ~   | _    | <b>∆ ≈ €</b> |
| Intel  | 2.0  | ~        | ~   | _    | A 👫          |
| Apple  | 1.2  | ~        | ~   | _    | <b>É</b>     |
| Altera | 1.0  | _        | _   | ~    | A 👫          |
|        |      |          |     |      |              |

# **Programming model**



#### Platform

- Host controls  $\geq 1$  platforms (i.e. vendor SDKs)
- A platform consists of  $\geq 1$  devices (CPU, GPU, FPGA)
- Host allocates resources and schedules execution
- Devices execute code assigned to them by the host

# **Programming model**



#### Platform

- Host controls  $\geq 1$  platforms (i.e. vendor SDKs)
- A platform consists of  $\geq 1$  devices (CPU, GPU, FPGA)
- Host allocates resources and schedules execution
- Devices execute code assigned to them by the host

#### **Devices**

- A single device has  $\geq 1$  compute units
- Each CU has  $\geq$  1 processing elements
- Mapping of CUs and PEs to hardware is not specified

# Programming model II



#### Context

- A context encompasses devices of a single platform that want to share data
- Memory buffers are created within a context and *not* per device

#### Command queues

- Communication with a device is only possible through *command queues*
- Created within a context for a specific device
- Commands are data transfers and kernel executions
- Implicit and explicit synchronization of commands



Work is arranged as work items on a 1D, 2D or 3D grid





- Work is arranged as work items on a 1D, 2D or 3D grid
- Grid is split into work groups





- Work is arranged as work items on a 1D, 2D or 3D grid
- Grid is split into work groups
- Work groups are scheduled on one or more CUs





- Work is arranged as work items on a 1D, 2D or 3D grid
- Grid is split into work groups
- Work groups are scheduled on one or more CUs
- Each work item executes a kernel on a PEs





## Memory, buffers and images

- Host cannot access device memory and vice versa
- Buffers transfer data between host and device memory
- Images are specially typed buffers



## Memory, buffers and images

- Host cannot access device memory and vice versa
- Buffers transfer data between host and device memory
- Images are specially typed buffers

### Device memory

Global, host-accessible, modifiable by all work items





## Memory, buffers and images

- Host cannot access device memory and vice versa
- Buffers transfer data between host and device memory
- Images are specially typed buffers

### Device memory

- Global, host-accessible, modifiable by all work items
- Constant, host-accessible, readable by all work items





## Memory, buffers and images

- Host cannot access device memory and vice versa
- Buffers transfer data between host and device memory
- Images are specially typed buffers

### **Device memory**

- Global, host-accessible, modifiable by all work items
- Constant, host-accessible, readable by all work items
- Local, modifiable by work group





## Memory, buffers and images

- Host cannot access device memory and vice versa
- Buffers transfer data between host and device memory
- Images are specially typed buffers

### Device memory

- Global, host-accessible, modifiable by all work items
- Constant, host-accessible, readable by all work items
- Local, modifiable by work group
- Private, modifiable by single work item



### Kernel



A kernel is a piece of C code executed by a work item

#### Kernel



A kernel is a piece of C code executed by a work item

To address data the work item identifies its position on the grid

}

#### Kernel



A kernel is a piece of C code executed by a work item

To address data the work item identifies its position on the grid

It is crucial to map work items to data according to the task and constraints

# Porting code to GPUs



 Look for massive data parallel sections of code, i.e. for loop over large array is a prime example

```
for (int i = 1; i < N-1; i++)
   x[i] = sin(y[i]) + 0.5 * (x[i-1] + x[i+1]);</pre>
```

2. Create kernel and compile that replaces the inner loop body

```
i = get_global_id(0);
x[i] = sin(y[i]) + 0.5 * (x[i-1] + x[i+1]);
```

- 3. Move data to device
- 4. Create, compile and run kernel
- 5. Move result to CPU

#### Pitfalls and solutions



- Not enough work causes underutilized PEs and poor latency hiding
  - → Use finer grid to increase number of work items
- Poor data locality reduces attainable bandwidth
  - → Adjacent work items should access adjacent memory locations
- PCIe bus can become a bottleneck (16 GB/s PCIe vs. 340 GB/s global)
  - → Keep data on GPU for successive kernel executions
- Conditional execution serializes work item execution
  - → Put condition into computation



# **TANGO INTEGRATION**



#### Protocol

tango server accepts compute requests





#### Protocol

- tango server accepts compute requests
- Client sets the json attribute and calls the Run or RunContinuous command





#### Protocol

- tango server accepts compute requests
- Client sets the json attribute and calls the Run or RunContinuous command
- The server spawns a new compute process identified by a process id





#### Protocol

- tango server accepts compute requests
- Client sets the json attribute and calls the Run or RunContinuous command
- The server spawns a new compute process identified by a process id

#### **Execution models**

- 1. Single-run processes ("fire and forget")
- Continuous processes (update description and re-run)



## Single-run processes



#### Interface

```
process = PyTango.DeviceProxy('hzgctkit/process/1')
process.json = "{ ... }"
pid = process.Run()
print(process.Running(pid))  # status of, e.g. True
print(process.jobs)  # active jobs, e.g. [7041]
process.Wait(pid)
print(process.ExitCode(pid))  # return code of job
```

#### Remarks

- Simple to use and understand
- No prolonged hogging of resources

## Continuous processes



#### Interface

```
pid = process.RunContinuous()
process.Continue(pid)
                     # trigger execution
process.json = "{ ... }" # update description
process.Continue(pid)
process.Stop(pid)
                         # terminate process
```

#### Remarks

- Allows for quicker results
- Resources are allocated as long as process is running
- Forgetting to call Stop leaks resources
- Real concurrency *not* solved yet

#### **Future efforts**



#### Framework

- Additional pure InfiniBand messenger besides Zeromq and MPI
- Enhance scheduling with run-time information

### Tools on top

- Update TomoPy integration (interfaces are breaking constantly ...)
- Finish web-based reconstruction and visualization prototype
- Stabilize tango interface and improve error handling



Thanks for your attention.