Real-time systems are ubiquitous in our everyday life, e.g., in safety-critical domains such as automotive, avionics or robotics. The correctness of a real-time system does not only depend on the correctness of its calculations, but also on the non-functional requirement of adhering to deadlines. Failing to meet a deadline may lead to severe malfunctions, therefore worst-case execution times (WCET) need to be guaranteed. Despite significant scientific advances, however, timing analysis of WCET guarantees lags years behind current high-performance microarchitectures with out-of-order scheduling pipelines, several hardware threads and multiple (shared) cache layers. To satisfy the increasing performance demands of real-time systems, analyzable performance features are required. In order to escape the scarcity of timing-analyzable performance features, the main contribution of this thesis is the introduction of runtime reconfiguration of hardware accelerators onto a field-programmable gate array (FPGA) as a novel means to achieve performance that is amenable to WCET guarantees. Instead of designing an architecture for a specific application domain, this approach preserves the flexibility of the system.
First, this thesis contributes novel co-scheduling approaches to distribute work among CPU and GPU in an extensive analysis of how (average-case) performance is achieved on fused CPU-GPU architectures, a main trend in current high-performance microarchitectures that combines a CPU and a GPU on a single chip. Being able to employ such architectures in real-time systems would be highly desirable, because they provide high performance within a limited area and power budget. As a result of this analysis, however, a cache coherency bottleneck is uncovered in recent fused CPU-GPU architectures that share the last level cache between CPU and GPU. This insight (i) complicates performance predictions and (ii) adds a shared last level cache between CPU and GPU to the growing list of microarchitectural features that benefit average-case performance, but render the analysis of WCET guarantees on high-performance architectures virtually infeasible. Thus, further motivating the need for novel microarchitectural features that provide predictable performance and are amenable to timing analysis.
Towards this end, a runtime reconfiguration controller called ``Command-based Reconfiguration Queue'' (CoRQ) is presented that provides guaranteed latencies for its operations, especially for the reconfiguration delay, i.e., the time it takes to reconfigure a hardware accelerator onto a reconfigurable fabric (e.g., FPGA). CoRQ enables the design of timing-analyzable runtime-reconfigurable architectures that support WCET guarantees. Based on the --now feasible-- guaranteed reconfiguration delay of accelerators, a WCET analysis is introduced that enables tasks to reconfigure application-specific custom instructions (CIs) at runtime. CIs are executed by a processor pipeline and invoke execution of one or more accelerators. Different measures to deal with reconfiguration delays are compared for their impact on accelerated WCET guarantees and overestimation. The timing anomaly of runtime reconfiguration is identified and safely bounded: a case where executing iterations of a computational kernel faster than in WCET during reconfiguration of CIs can prolong the total execution time of a task. Once tasks that perform runtime reconfiguration of CIs can be analyzed for WCET guarantees, the question of which CIs to configure on a constrained reconfigurable area to optimize the WCET is raised. The question is addressed for systems where multiple CIs with different implementations each (allowing to trade-off latency and area requirements) can be selected.
This is generally the case, e.g., when employing high-level synthesis. This so-called WCET-optimizing instruction set selection problem is modeled based on the Implicit Path Enumeration Technique (IPET), which is the path analysis technique state-of-the-art timing analyzers rely on. To our knowledge, this is the first approach that enables WCET optimization with support for making use of global program flow information (and information about reconfiguration delay). An optimal algorithm (similar to Branch and Bound) and a fast greedy heuristic algorithm (that achieves the optimal solution in most cases) are presented. Finally, an approach is presented that, for the first time, combines optimized static WCET guarantees and runtime optimization of the average-case execution (maintaining WCET guarantees) using runtime reconfiguration of hardware accelerators by leveraging runtime slack (the amount of time that program parts are executed faster than in WCET). It comprises an analysis of runtime slack bounds that enable safe reconfiguration for average-case performance under WCET guarantees and presents a mechanism to monitor runtime slack using a simple performance counter that is commonly available in many microprocessors.
Ultimately, this thesis shows that runtime reconfiguration of accelerators is a key feature to achieve predictable performance.