HeAT -- a Distributed and GPU-accelerated Tensor Framework for Data Analytics

To cope with the rapid growth in available data, the efficiency of data analysis and machine learning libraries has recently received increased attention. Although great advancements have been made in traditional array-based computations, most are limited by the resources available on a single computation node. Consequently, novel approaches must be made to exploit distributed resources, e.g. distributed memory architectures. To this end, we introduce HeAT, an array-based numerical programming framework for large-scale parallel processing with an easy-to-use NumPy-like API. HeAT utilizes PyTorch as a node-local eager execution engine and distributes the workload on arbitrarily large high-performance computing systems via MPI. It provides both low-level array computations, as well as assorted higher-level algorithms. With HeAT, it is possible for a NumPy user to take full advantage of their available resources, significantly lowering the barrier to distributed data analysis. When compared to similar frameworks, HeAT achieves speedups of up to two orders of magnitude.


I. INTRODUCTION
In the age of Big Data, modern-day computational data science heavily relies on generating highly complex datadriven models. However, the consistent increase in data volume is challenging the processing power for data analytics and machine learning (ML) frameworks. The Python programming language has evolved into the defacto standard for the data science community. Therein, the default choice for many applications is the SciPy stack [1], which is built upon the computational library NumPy [2]. More recently, deep-learning libraries such as TensorFlow [3] and PyTorch [4], have begun to bridge the gap between small scale workstation-based computing and high-performance multinode computing by offering GPU-accelerated kernels. Although these frameworks include options for manually distributing computations, they are generally confined to the processing capabilities of a singular computation node. As the size of datasets increase, single-node computations are at best impractical and at worst impossible.
In response to these challenges, we propose HeAT 1 -the Helmholtz Analytics Toolkit: an open-source library with a NumPy-like API for distributed and GPU-accelerated computing on general-purpose clusters and high performance computing (HPC) systems. HeAT implements parallel algorithms for both low-level array computations as well as higher-level data analytics and machine learning methods. It does so by interweaving process-local PyTorch tensors with communication via the Message Passing Interface (MPI) [5]. The familiar API facilitates the transition of existing Python code to distributed applications, thus opening the doorway to HPC computing for domain-specialized scientists. Due to its design, HeAT consistently performs better in terms of execution time and scalability when compared to similar frameworks.
The remainder of this paper is organized as follows. Section II will present related work in the field of distributed and GPU-accelerated array computations in Python. Section III will explain HeAT's programming model, as well as its array and communication design concepts. In Section IV, an empirical performance evaluation is presented. Section V discusses the advantages and limitations of HeAT's programming model with respect to other frameworks. Finally, Section VI concludes the presented work.

II. RELATED WORK
NumPy is arguably the most widely used Python-based data science library. It offers powerful data structures and algorithms for vectorized matrix and tensor operations, allowing for the implementation of efficient numerical and scientific programs. The high popularity of NumPy, likely due to the similarity of its functions to the mathematical formulation, has led to its API representing a widely recognized and accepted standard for data science programming. Scikit-learn [6] is a NumPy-based machine learning framework that offers a wide range of ready-to-use high-level algorithms for clustering, classification and regression, as well as selected pre-processing steps like feature selection and dimensionality reduction. This makes it a very attractive solution for application scientists. However, neither NumPy nor scikit-learn support out-of-thebox GPU usage.
Several packages have addressed GPU acceleration for array computations. The just-in-time compiler Numba [7] can compile Python functions to CUDA code. While Numba relies on custom annotation of the Python code with decorators for acceleration, the CUDA-accelerated array computation library CuPy [8] allows for GPU computation in analogy to the NumPy interface. Unfortunately, it does not provide the full range of NumPy functionality. RAPIDS's [9] CUDAaccelerated libraries CuDF and CuML aim towards higherlevel machine learning functionality beyond low-level array computation. CuDF offers GPU dataframe computations similar to the Pandas library [10]. CuML offers high-level machine learning algorithms to some extent. However, the currently available function space of these libraries is limited.
With the increasing interest in deep learning methods, novel frameworks have emerged that place special focus onto tensor linear algebra and neural networks. Many libraries like PyTorch [4], TensorFlow [3], MXNet [11] or JAX [12] focus mainly on deep learning applications, and as such they do not actively target NumPy-API compatibility. On the other hand, they enable simple transitions between CPU and GPU computation. Unfortunately, they primarily provide high-level algorithms for neural networks and often lack many traditional algorithms, for example clustering or ensemble methods.
All of the aforementioned libraries work around Python's parallel computation limitations by implementing computationally intensive kernels in low-level programming languages, like C++ or CUDA, and invoking the respective calls at runtime. This allows Python users to exploit powerful features such as vectorization, threading, or the utilization of accelerator hardware. However, these frameworks are designed for single computation node usage, with specific configurations enabling multi-core computation. This limits their potential application to small and medium size problems. For larger datasets, excessive computation times and increasing memory consumption pose arduous challenges.
Some of these ML libraries provide a basic infrastructure for distributed computation. PyTorch, for example, includes two such packages. Firstly, the distributed remote procedure call (RPC) framework handles communication and by that enables distributed automatic differentiation (AD). Secondly, the distributed package implements low-level bindings to MPI, Gloo, and NCCL. However the set of MPI functionality is not complete as it targets the communication functions specifically required for data-parallel neural networks. Furthermore, the communication aspect of algorithms must be implemented by the user. This requires, at minimum, a basic understanding of distributed computing. Similar restrictions apply for the distributed extensions of TensorFlow, JAX, and MXNet. Frameworks like DeepSpeed [13] and Horovod [14] offer multi-node computations on both CPU and GPU. These approaches are limited to deep learning applications, as they mainly focus on data-parallelism for neural networks and do not target general distributed array-based computations.
Phylanx [15] implements Python bindings for C and C++ array computation algorithms, which closely mimic NumPy's API, but it does not support higher-level machine learning functionality or GPU usage. Intel's DAAL [16] provides only high-level algorithms employing MapReduce [17], with the focus on accelerated multi-CPU usage. However, it does not offer the means for low-level array computations or GPU usage. Furthermore, functionality requiring communication beyond the scope of MapReduce must be implemented by the user. Legate [18] focuses on distributed multi-node multi-GPU computations for low-to high-level array operations by employing a global address space programming interface. While its design enables shared-distributed memory computation, a wide range of functionality is currently not implemented.
When it comes to easy distributed array computations with a NumPy-like API and support of high-level algorithms similar to scikit-learn, Dask [19] has become the most popular framework amongst application scientists. It employs dynamic task scheduling for parallel execution of NumPy operations on CPU-based multi-node systems. The Dask execution model envisages one scheduler process and multiple worker instances. The scheduler sends workload via serialized RPC calls to the workers. Networking between the processes builds on the available system network architecture. GPU usage can be enabled by coupling Dask to RAPIDS's cuML and cuDF. Due to its popularity, Dask can be considered the current benchmark in Python-first distributed data analysis and ML computations. There are non-Python based distributed data analysis tools such as the Java-based Spark [20]. A recent comparison [21] between Spark and Dask has shown that they preform similarly.

III. DESIGN AND IMPLEMENTATION
HeAT is an open-source library that implements data structures, functions, and methods for array-based numerical data analysis and machine learning. Due to its NumPy-like API, users are generally familiar with the programming approach. The implementation of (distributed) higher-level algorithms adheres to the scikit-learn API. This interface design makes the conversion of existing data analytics applications to distributed HeAT applications straightforward. An example can be seen in Listing 1. Furthermore, small-scale program prototypes can be developed, which can be transitioned transparently to HPC systems without major code or algorithmic changes. Distributed HeAT applications are typically faster, and their memory limitations are those of the entire system, rather than those of a single node. As a result, HeAT facilitates the algorithmic development and efficient deployment of largescale data analytics and machine learning applications.
Listing 1: Implementation of a function calculating the standard deviation of an array, demonstrating the API compatibility between NumPy and HeAT.
1 import numpy as np 2 import heat as ht 3 4 def np_stddev(a, axis=0): 5 return np.sqrt((a -a.mean(axis)) ** 2) 6 7 def ht_stddev(a, axis=0): 8 return ht.sqrt((a -a.mean(axis)) ** 2) The central component of HeAT is the DNDarray data structure, an N-dimensional array transparently composed of computational objects on one or more processes. The processlocal objects are PyTorch tensors, allowing HeAT functions to use both CPUs and GPUs. A detailed description of the DNDarray design will be given in III-B.
For distributed memory computing, communication between processes is crucial. MPI controls communication among parallel processes on distributed memory systems via a set of message sending and receiving functions. Communication can take place between two processes, i.e. point-to-point communication, or within groups of processes in the MPI communicator, i.e. global communication. HeAT's MPI-based custom communications backend is described in Section III-C.

A. Programming Model
HeAT realizes a single-program-multiple-data (SPMD) programming model [22] using PyTorch and MPI. Additionally, the framework's processing model is inspired by the bulksynchronous parallel (BSP) [23] model. In practice, computations proceed in a series of hierarchical supersteps, each consisting of a number of process-local computations and subsequent inter-process communications. In contrast to the classical BSP model, communicated data is available immediately, rather than after the next global synchronization. In HeAT, global synchronizations only occur for collective MPI calls as well as at the program start and termination. A schematic overview is depicted in Fig. 1. The process-local computations are implemented using PyTorch as the computation engine. Each computation is processed eagerly, i.e. when issued to the interpreter. The scheduling onto the hardware is controlled by the respective runtime environment of PyTorch. For the CPU backend, these are the synchronous schedulers of OpenMP [24] and Intel TBB [25]. For the GPU backend, it is the asynchronous scheduler of the NVIDIA CUDA [26] runtime system. HeAT provides the MPI "glue", utilizing the mpi4py [27] module, for the communication in each superstep. Users can freely access these implementation details, although it is neither necessary nor recommended to modify the communication routines.

B. DNDarrays
At the core of HeAT is the Distributed N-Dimensional Array, DNDarray (cf. Listing 2). The DNDarray object is a virtual overlay of the disjoint PyTorch tensors, which store the numerical data on each MPI process. A DNDarray's data may be redundantly allocated on each node, or onedimensionally decomposed into evenly-sized chunks with a maximum size difference of one element along the decomposition axis. This data distribution strategy aims to balance the workload between all processes. During computation, API calls may redistribute data items. However, completed operations automatically restore the uniform data distribution.
To steer the one-dimensional data decomposition and other parallel processing behavior, HeAT users can utilize a number of additional attributes and parameters: • split: the singular axis, or dimension, along which a DNDarray is to be decomposed (see Fig. 2 and Listing 2). If split is None, a redundant copy is created on each process • device: the computation device, i.e. CPU or GPU, on which the DNDarray is allocated • comm: the MPI communicator, i.e. the set of participating processes, for distributed computation (Section III-C) • shape: the dimensionality of the global data • lshape: the dimensionality of the process-local data As stated, process-level operations on DNDarrays are performed via PyTorch functions, thus employing their C++  DNDarray is distributed along axis 0, 1 or 2 (split=0, split=1, or split=2, respectively). An example for case (b) is available in Listing 2. In each case, the data chunk labeled p n resides on process n, with n = 0, 1, or 2.
core library libtorch to achieve high efficiency. Interoperability with external libraries such as NumPy and PyTorch is self-evident. Data contained in a NumPy ndarray or a PyTorch Tensor can be imported into a DNDarray via the heat.array() function with the optional split attribute.
In the opposite direction, data exchange with NumPy is enabled by the DNDarray.numpy() method.
Listing 2: A DNDarray distributed across three processes as illustrated in Fig. 2 A DNDarray can reside in a node's main memory for the CPU backend or, if available, in the VRAM of GPUs. Individual DNDarrays can be assigned to hardware devices via the device attribute or the default device can be defined as shown in Listing 3.

C. Distributed Computation
Many algorithms using a distributed DNDarray will require communication. HeAT has a custom MPI-based communication layer composed of wrappers of point-to-point and global MPI functions. PyTorch's distributed package does not support many of the required MPI functions, such as alltoall or custom reduction operations. Furthermore, tensors are sent via the network as contiguous buffers in their original storage layout, so that the information on their Ndimensional structure is lost in communication. Therefore, HeAT's communication layer is based on the Python library mpi4py [27], which offers an interface to the most common MPI functions and enables the communication of contiguous Python buffer objects.
The DNDarray memory representation is encoded in the one-dimensional buffer via strides (steps between elements) along the respective dimension. A main challenge in communicating an arbitrarily split DNDarray is the preservation of this data structure. The HeAT communication module internally handles buffer preparation as the interface between the DNDarray and the mpi4py functionality.
For point-to-point communications (e.g. send, recv), buffer preparation is trivial as the data can be sent contiguously from one process and unpacked by the receiving process. More considerable efforts must be made for communication involving collective operations. For gathering operations (e.g. gather, allgather), the node-local Tensor to be sent by each process must have the correct memory layout, which is dependent on the split axis of the DNDarray. For scattering operations (e.g. scatter, alltoall), the data chunks must be packed correctly along the split axis before distribution.
HeAT addresses the packing issues by creating custom MPI data types, which wrap the local Tensor buffer. First, the DNDarray's dimensions are permuted such that the dimension along which data collection or distribution will take place is the first dimension. Then, custom data types are created, via the MPI function Create_vector, to iteratively pack the dimensions from the last to the first. The individual data types at each dimension are defined via the DNDarray's strides. The creation of such a buffer is schematically shown in Fig. 3. Here, a split DNDarray is assembled to split=None via HeAT's allgather function.
With this internal buffer handling, HeAT offers a unified interface that provides communication without exposing the internal data representation. Based on the MPI layer, a resplit function is provided to change the split axis of a DNDarray if required. Re-splitting a DNDarray adheres to load balancing, i.e., the data is uniformly distributed across processes as previously stated. However, caution must be taken when using resplit as it is based on global MPI communication functions, thus requiring both significant communication and local memory.
In cases where CUDA-aware MPI is available, communications can be performed directly between GPUs. Otherwise, data must be copied from the GPU to the CPU, sent to the target CPU, then copied to the target GPU. This increases the communication overhead, and therefore the run time, of many functions.

IV. PERFORMANCE RESULTS
Performance of HeAT was evaluated by benchmarking algorithms that are commonly used in data science, and comparing the results to implementations in other frameworks. NumPy and PyTorch were chosen for single-node baseline evaluation. NumPy because it is the most frequently used library for Python-based array computation, and PyTorch because it is the underlying node-local eager execution engine for HeAT. For multi-node experiments, Dask was selected.
Algorithms were chosen based on their algorithmic complexity and availability as ready-to-use implementations in the investigated frameworks. Two types of low-level algorithms were benchmarked: the computation of statistical moments (Section IV-C1) and the computation of pairwise Euclidean distances, i.e. the function cdist (Section IV-C2). These operations are available in Dask, PyTorch, and NumPy as designated functions under the same name and mathematical definition. As HeAT aims to provide both low-level and highlevel functionality, two commonly used ML algorithms were also selected: k-means clustering (Section IV-C3) and least absolute shrinkage and selection operator (LASSO) regression (Section IV-C4). These two algorithms are provided by Dask and scikit-learn as end-user implementations. For comparison with PyTorch, algorithms were specifically implemented with as much provided PyTorch functionality as possible. All benchmark scripts are available on HeAT's GitHub repository. The software environment for benchmarking is summarized in Table I. The experiments were run on a machine learning HPC system comprised of 15 compute nodes with commodity components at the Jülich Supercomputing Centre (JSC). Each node is equipped with two 12-core Intel Xeon Gold 6126 CPUs, 206 GB of DDR3 main memory, and four NVIDIA Tesla V100 SXM2 GPUs with 32 GB VRAM per card. The GPUs communicate node-internally via an NVLink interconnect. The system is optimized for GPUDirect communication across node boundaries, supported by 2x Mellanox 100 Gbit EDR InfiniBand links. Though HeAT supports CUDA-aware MPI, it was not used, as to make experiments comparable to Dask, which cannot make use of this. MPI capable commodity clusters should show similar differences between HeAT and Dask unless the cluster is specifically tuned for non-standard use cases.

B. Datasets
Three datasets were chosen to demonstrate the effectiveness of HeAT for different data characteristics and to mimic common use-cases: 1) The Cityscapes dataset [28] contains 5 000 highresolution images with fine-grained annotations. Each image is 2 048×1 024 pixels with three 256-bit RGB color channels per pixel, which have been flattened and combined into a short-fat matrix with 5 000×6 291 456 entries, i.e. 117.19 GB.
2) The SUSY dataset [29] contains 5 000 000 samples from Monte Carlo simulations of high-energy particle collisions. Each sample has 18 features, consisting of kinematic properties measured by particle detectors and high-level functional derivations of those measurements [30]. The total data size is 343.33 MB

3) The EURopean Air pollution Dispersion-Inverse Model data (EURAD-IM) [31] contains parameters from an
Eulerian meso-scale chemistry transport model as part of the Copernicus Atmosphere Monitoring Service (CAMS) 2 . For our experiments, 10 7 data points and 100 parameters, i.e. 7.45 GB, of the model have been chosen and stored in a tall-skinny matrix. The Cityscapes and SUSY datasets are publicly available, the EURADS dataset is available upon request. All datasets were converted from their original sources into solitary data matrices and stored as floating point values in HDF5 files [32].
While both Dask and HeAT utilize parallel I/O for h5py [33], they handle data decomposition differently. HeAT automatically decomposes data upon DNDarray creation when given a split axis by the user (cf. Section III-B). Dask offers an automatic data decomposition scheme, the recommended setup as per the Dask documentation [34], and manually specifying the size of the memory-distributed data chunks. In the following, Dask's performance with automatic chunking is indicated with Dask-auto, whereas Dask-tuned indicates the manual chunking mirroring HeAT's data decomposition scheme. All measurements with HeAT are performed with split=0, unless otherwise stated.

C. Experiments
All of the following experiments are composed of weak scaling and strong scaling experiments. Measurements are the average of 9 runs, preceded by a warm up run. The error bars indicate the empirical standard deviation. In several cases the errors are too small to be visible compared to the data point itself. A fractional number of nodes for weak scaling GPU runs refers to the usage of the equivalent fraction of a node's resources, i.e. one (0.25) or two (0.5) out of the nodes four GPUs.
For Dask, the actual program code is provided in a separate script that connects the scheduler to the workers via a dask.distributed.Client instance. The discovery of the scheduler is done manually by passing an IP address, or via information on a shared filesystem. Networking between the processes builds on network sockets and utilizes Infiniband using TCP over IB. Each worker maintains its execution state by writing into journaling directories.
Weak scaling refers to the process of increasing the amount of computation resources while maintaining the workload on each MPI process. Ideal weak scaling behaviour is a constant runtime for each measurement as the number of processing units is increased. This indicates solid scalability for larger datasets. Results for weak scaling experiments are presented as the average maximum runtime across the processing units.
Strong scaling refers to the process of increasing the amount of computation resources while the total workload on the system remains constant. Ideally, the runtime in strong scaling measurements is inversely proportional to the number of processing units. We present our strong scaling results in units of speedup as compared to a single-node NumPy-based implementation.
1) Statistical Moments: Calculation of mean and standard deviation are arguably the most frequently used calculations in all of computing. In a distributed context, it is inefficient to compute statistical moments with multiple passes over the dataset. Therefore, HeAT calculates statistical moments using the numerically stable single-pass algorithms presented in [35]. These experiments utilise the Cityscapes dataset.
The experiments shown in this section show some of the most common applications of statistical moments: the mean of the entire dataset, the mean along the largest dimension, and the standard deviation of the entire dataset. In its native form, Dask is designed for distributed computation on CPUs. Multi-node GPU usage can be enabled by coupling of Dask to CuPy. In order to provide comparison of HeAT's multi-GPU performance, benchmarks for Dask were originally intended to be performed with and without CuPy. However, during benchmarks we found that the native binding of CuPy to Dask only enables the usage of a single GPU per node, while HeAT enables multi-GPU per node usage. While possible for Dask to utilize multiple GPUs per node, substantial modifications to each algorithm requiring communication must be done manually by the user. Therefore multi-GPU benchmarks on Dask with CuPy were only performed for mean calculations of the entire dataset. Exemplary code for the algorithmic definition can be seen in the Listing 4 (Appendix). Fig. 4 shows strong and weak scaling of the mean operation, the weak scaling of standard deviation, and the weak scaling of the mean along the largest axis. For weak scaling runs, each node had 300 rows of the matrix. For strong scaling runs, 1 200 rows were used globally. Dask-tuned failed with memory errors above 4 nodes for the standard deviation calculations. Single-GPU measurements were much faster than multi-GPU measurements as no communication is required. Furthermore, PyTorch is faster than HeAT on a single GPU as the HeAT functions are wrappers of PyTorch functions. Favorable strong scaling is clearly shown in the HeAT CPU measurements (Fig. 4 (C)). HeAT outperforms NumPy and is nearly equal to Dask-auto, for a single node. However, beyond one node, HeAT clearly improves upon these results, while all Dask measurements do not show significant improvement with the addition of more computing resources. CuPy shows positive scaling behavior, however it is the slowest of the three parallel frameworks. As the number of nodes increases, the time required for communication also increases. Once this time eclipses the time required for computation the performance degrades. This effect can be seen for the HeAT GPU measure-ments beyond four nodes. The effect is common in distributed computing as one must balance the number of MPI processes and the size of the dataset to avoid excessive communication calls.
HeAT calculations on CPU for these experiments show nearly ideal weak scaling. However, the measurements of the mean along the largest axis show Dask outperforming HeAT until the number of nodes is increased beyond 8. This is due to the large difference between the complexity of the calculation and the amount of communication required by HeAT. As the complexity of the calculation increases, the efficiency of HeAT remains constant and at 15 nodes shows slight improvement. Whereas the efficiency of Dask degrades as the node count increases.
All of these measurements were also conducted with the split=1 data distributed scheme for HeAT. The difference between the runtimes for split=0 and split=1 was on average (−0.0226 ± 0.0809) s.
2) Pairwise Euclidean Distances: Pairwise distance calculations are a vital part of data analysis used in many ML algorithms, such as clustering and neighborhood methods [36]. However, due to the quadratic growth in computational complexity, these computations scale notoriously poorly. Moreover, their excessive memory consumption poses a major challenge, which often limits the number of samples which can be processed. As a consequence, many applications employ a form of dimensionality reduction, yielding only approximate solutions. HeAT implements a custom distance computation function via ring communication which works regardless of the employed distance metric.
For scaling experiments, the L 2 norm (Euclidean distance, commonly referred to as cdist) was utilized. Benchmarks were conducted on the SUSY dataset. For weak scaling, the number of samples was increased by the square root of the number of nodes, because the computational load grows quadratically. The first 12 910, 18 258, 25 820, 36 515, 51 640, 73 030, and 100 000 samples were used for N = 0.25, 0.5, 1, 2, 4, 8 and 15 nodes, respectively. For strong scaling runs, the first 40 000 samples were used. Results are displayed in Fig. 5, where HeAT's implementation of cdist shows significantly lower computation times compared to the other frameworks. It also provides small speedup for CPU and large speedup for GPU over the NumPy implementation. On one GPU (0.25 nodes), HeAT outperforms PyTorch because it employs quadratic expansion via matrix multiplication for the calculation of the squared differences rather than relying on PyTorch's cdist function: On CPU, PyTorch's intrinsic cdist function is faster; however, experiments for k-means, cf. Section IV-C3, will show that this speedup depends on the data matrix shape (tall-skinny vs short-fat).
Overall scaling behaviour of the function is not optimal, as the communication overhead grows proportionally to the number of processes. Nevertheless, the HeAT implementation is able to solve the problem of memory usage at large sample sizes, whereas Dask's computation at 14 Nodes with 100 000 samples aborted due to memory overflow.
3) k-means: k-means [37] is a vector quantization method originating from the field of signal processing. It is commonly used as an unsupervised clustering algorithm that is able to assign all observations, x, within a dataset into k disjoint partitions, each forming a cluster, (C i ). Formally, the method minimizes the inter-cluster variance: for each cluster centroid c i . The k-means clustering problem is generally NP-hard, but can be efficiently approximated using an iterative optimization method, such as detecting a local minimum, i.e. Lloyd's algorithm [38]. HeAT's k-means implementation is dominated by element-wise vector operations in the distance matrix computation between the data points and the centroids, and reduction operations for finding the best matching centroids. Benchmarks in this experiment were conducted on the Cityscapes dataset. The dataset sizes used were analogous to those used in the moments experiments. For each benchmark, we have performed 30 iterations of Lloyd's algorithm at eight assumed centroids. Weak scaling measurements in Fig. 6 (upper panel) show that HeAT outperforms Dask by at least an order of magnitude. Furthermore, HeAT demonstrates solid scalability on both CPU and GPU. For Dask we were unable to complete the measurement procedure for all node configurations. While it was sporadically possible to complete the benchmark with a four-node configuration, it would terminate with an out-ofmemory exception before the completion of the measurement sequence. For 8 and 15 nodes, we were unable to obtain any measurements due to excessive memory consumption.
For HeAT, a single GPU shows better overall performance compared to multiple GPUs. The difference in runtime between PyTorch and HeAT on a single node can be explained by the differences in distance matrix computation (c.f. Section IV-C2). Strong scaling measurements are shown in Fig. 6 (lower panel). For these measurements, 600 rows of the dataset were used for all runs. Here, we obtain similar conclusions as in the weak scaling measurements. HeAT outperforms Dask by a significant margin and shows more favorable scaling behaviour. Again, Dask experienced out-of-memory issues. While HeAT's CPU computations scale approximately linearly, the GPU backend shows strong linearity.

4) LASSO:
LASSO is a regression method of simultaneously applying regularization and parameter selection. Its basic form is an extension of the ordinary linear regression (OLS) method by introducing an L1-norm penalty of the parameters scaled by the regularization parameter. The corresponding objective function reads where y denotes the n samples of the output variables, X ∈ R n×m denotes the system matrix in which m−1 columns represent the different features in which each column represents the constant bias term and each of the n rows represent one data sample, w ∈ R m denotes the regression coefficients, w − ∈ R m−1 the regression coefficients of the features, and λ the regularization parameter. In addition to the L2-norm regularization approach (i.e., Ridge-regression), LASSO favors not only smaller model parameters but, depending on the regularization parameters, can force selected model parameters to be zero. It is a popular method to determine the importance of input variables with respect to one or more dependent output variables.
In this experiment, a LASSO algorithm is used to determine the most important model parameters of the EURAD-IM model on the errors of ozone forecasts of the model at measurement sites 3 . In order to minimize the objective function, a coordinate descent algorithm with a proximal gradient soft threshold applied to each coordinate was implemented in HeAT, Dask, NumPy, and PyTorch. For the weak scaling measurements, the LASSO algorithm is run for 20 iterations on a data sample size of 714 280 samples per node.
The HeAT CPU measurements show good weak scaling behaviour (Fig. 7, upper panel) with the lowest runtime compared to the Dask and HeAT GPU versions. Dask shows poor weak scaling due to the incompleteness of Dask with respect to NumPy operations. For example, assignments to Dask arrays are not supported by the library itself but are heavily utilized in the implemented LASSO algorithm. Consequently, Dask cannot make efficient use of its lazy evaluation concept [34] for this algorithm. The HeAT GPU version also does not scale well, albeit with a significantly lower runtime than Dask. This is due to the high number of communication operations required.
Strong scaling measurements (Fig. 7, lower panel) were conducted for the entire sample set. The trends observed in the weak scaling measurements are also visible here. Dask shows almost no scaling, whereas the HeAT CPU measurements indicate a good scaling behaviour. For the HeAT GPU implementation the speedup decreases with the increase in computing resources. Overall, it is apparent that HeAT outperforms Dask by more than two orders of magnitude.

V. DISCUSSION
We have presented HeAT, a Python-based framework for distributed data analytics and machine learning on multiple CPUs and GPUs. It offers transparent parallel data handling and operations to exploit the available hardware, be it personal workstations or supercomputers. The NumPy-like API enables users to easily translate existing NumPy code into distributed applications. As the user-base of HPC resources is predominately composed of domain experts with limited knowledge of parallel programming concepts, the number of exposed configuration parameters for parallel constructs is minimized. These constructs are both powerful and versatile enough for the implementation of a large variety of distributed algorithms. PyTorch has been selected as an eager, node-local compute engine for HeAT. As a direct result, HeAT benefits from PyTorch's highly optimized functions. However, PyTorch does not offer parallel data decomposition or distributed algorithms. HeAT provides a custom communication layer for Ndimensional array objects and specifically designed hierarchical algorithms to leverage the full potential of PyTorch for distributed environments. This strategy enables the efficient utilization of the underlying hardware by exploiting locality in the memory hierarchy.
Memory distribution for scalable algorithms requires efficient communication between compute nodes at runtime via high-speed network links. HeAT is designed to leverage the potential of such systems. Significant efforts have been made to efficiently utilize the available hardware, including multi-GPU architectures, while avoiding central bottlenecks, such as workload schedulers, and excessive I/O, e.g. serial file access or journaling. Weak and strong scaling experiments on a number of applications (Section IV) demonstrate that HeAT consistently outperforms Dask by up to two orders of magnitude. Moreover, larger datasets caused Dask to raise memory errors in several experiments. HeAT enables users to access a substantial portion of maximum potential performance via a high-level interface, relatively independent of data characteristics.
Coupling of Dask with CuPy, CuDF, or CuML can in principle be used for distributed GPU computation. Within the performance evaluations Section IV-C1, a multi-GPU benchmark for Dask with CuPy was conducted for the mean calculation, which did not show any improvement. During these experiments it was found that this is only valid for a multi-node single-GPU setup. We hypothesize that Dask does not take care of copying the data from GPU to CPU before communicating and thus, the communication stack cannot properly access the VRAM. In order to enable multi-node multi-GPU support, data must be moved manually from CPU to GPU and back for every operation utilizing communication as demonstrated in Listing 4 (Appendix). For high-level algorithms, this is a very cumbersome task requiring a substantial understanding of distributed programming.
As discussed previously, most frameworks for parallel deep learning applications primarily focus on data-parallelism. However, generalized model parallelism is not available to the authors' knowledge. HeAT's programming model facilitates straight-forward data-parallelism as well as model parallelism and pipelining. The use of a custom communication layer allows for the implementation of distributed automatic differentiation, which is a vital part for a distributed model architecture. First steps in this direction are already underway with mpi4torch [39], a prototype for distributed AD.
This subsequently offers the opportunity for the development of other high-level differentiable algorithms. In light of the ever increasing need for machine learning models to yield reliable predictions, considerable efforts have been put towards the development of probabilistic approaches. HeAT's programming model and internal design give access to all levels of algorithmic development and by such offers an intuitive way to implement such approaches.

VI. CONCLUSION
With HeAT, we address the needs of an ever-growing community of scientists, both in academia and industry, who seek to accelerate the process of extracting information from Big Data. To this end, we have set upon the task of combining data analytics and machine learning algorithms with state-ofthe-art high-performance computing concepts into an easy-touse Python library.
We have demonstrated that even in its current early stage, HeAT offers great potential. The convergence of speed and usability sets it up to redefine high-performance data analytics by putting high levels of parallelism within reach of scientists in academia and industry alike. HeAT offers a way to easily develop application-specific algorithms while leveraging the available computational resources, setting it apart from other approaches and libraries.