Parallel Weighted Random Sampling

Data structures for efficient sampling from a set of weighted items are an important building block of many applications. However, few parallel solutions are known. We close many of these gaps. We give efficient, fast, and practicable parallel and distributed algorithms for building data structures that support sampling single items (alias tables, compressed data structures). This also yields a simplified and more space-efficient sequential algorithm for alias table construction. Our approaches to sampling k out of n items with/without replacement and to subset (Poisson) sampling are output-sensitive, i.e., the sampling algorithms use work linear in the number of distinct samples. This is also interesting in the sequential case. Weighted random permutation can be done by sorting appropriate random deviates. We show that this is possible with linear work. Finally, we give a communication-efficient, highly scalable approach to (weighted and unweighted) reservoir sampling. This algorithm is based on a fully distributed model of streaming algorithms that might be of independent interest. Experiments for alias tables and sampling with replacement show near-linear speedups using up to 158 threads on shared-memory machines. An experimental evaluation of distributed weighted reservoir sampling on up to 5,120 cores also shows good speedups.


Introduction
Weighted random sampling asks for sampling items (elements) from a set such that the probability of sampling item i is proportional to a given weight w_i. Several variants of this fundamental computational task appear in a wide range of applications in statistics and computer science, e.g., for computer simulations, data analysis, database systems, and online ad auctions (see, e.g., Motwani et al. [27], Olken et al. [29]). Continually growing data volumes ("Big Data") imply that the input sets and even the sample itself can become large. Since actually processing the sample is often fast, sampling algorithms can easily become a performance bottleneck. Due to the hardware developments of the last years, this means that we need parallel algorithms for weighted sampling. This includes shared-memory algorithms that exploit current multi-core processors, and distributed algorithms that split the work across multiple machines without incurring too much overhead for communication.
However, there has been surprisingly little work on parallel weighted sampling. This paper closes many of these gaps. Table 1 summarizes our results on the following widely used variants of the weighted sampling problem. We process the input set A = 1..n on p processing elements (PEs), where i..j is shorthand for {i, ..., j}. Item i has weight w_i and W := Σ_{i=1..n} w_i. Define u := log U where U := w_max/w_min = max_i w_i / min_i w_i. WRS-1: Weighted sampling of one item from a categorical (or multinoulli) distribution (equivalent to WRS-R and WRS-N for k = 1). WRS-R: Sample k items from A with replacement, i.e., the samples are independent and for each sample X, P[X = i] = w_i/W. Let s = |S| ≤ k denote the number of distinct items in the sample S. Note that we may have s ≪ k for skewed input distributions.

WRS-N: Sample k pairwise distinct items, i.e., sampling without replacement. WRP: Permute the elements with the same process as for WRS-N using k = n. WRS-S: Sample a subset S ⊆ A where P[i ∈ S] = w_i ≤ 1. WRS-B: Batched reservoir sampling: repeatedly solve WRS-N when batches of new items arrive; only the current sample and the current batch may be stored. Let b denote the batch size. When applicable, our algorithms build a data structure once which is later used to support fast sampling queries. Most of the algorithms have linear work and variants with logarithmic (or even constant) latency. Neither competitive parallel algorithms nor more efficient sequential algorithms are known. The distributed algorithms are refinements of the shared-memory algorithms with the goal to reduce communication costs compared to a direct distributed implementation of the shared-memory algorithms. As a consequence, each PE mostly works on its local data (the owner-computes approach). Communication, if performed at all, serves only to coordinate the PEs and is sublinear in the local work except for extreme corner cases. The owner-computes approach introduces the complication that differences in local work introduce additional parameters into the analysis that characterize the local work in different situations (e.g., the last line of Table 1). The summary in Table 1 therefore covers the case when items are randomly assigned to PEs. This simplifies the exposition and is actually an approach that one can sometimes take in practice.

Outline.
First, in Section 2, we review the models of computation used in this paper as well as known techniques we are building on. We discuss additional related work in Section 3. In Section 4, we consider Problem WRS-1. We first give an improved sequential algorithm for constructing alias tables, the most widely used data structure for Problem WRS-1, which allows sampling in constant time. Then we parallelize that algorithm for shared and distributed memory. We also present parallel construction for a more space-efficient variant.
Sampling k items with replacement (Problem WRS-R) seems to be trivially parallelizable with an alias table. However, this does not lead to a communication-efficient distributed algorithm, and we can generally do better for skewed input distributions where the number of distinct output elements s can be much smaller than k. Section 5 develops such an algorithm, which is interesting both as a parallel and as a sequential algorithm. Section 6 employs the algorithm for Problem WRS-R to solve Problem WRS-N. The main difficulty here is to estimate the right number of samples with replacement to obtain a sufficient number of distinct samples. Then an algorithm for WRS-N without preprocessing is used to reduce the "weighted oversample" to the desired exact output size. It is well known that the weighted permutation Problem WRP can be reduced to sorting (see Section 2.3). We show in Section 7 that this is actually possible with linear work by appropriately defining the (random) sorting keys so that we can use integer sorting with a small number of different keys. Since previous linear-time algorithms are fairly complicated [21], this may also be interesting for a sequential algorithm. Indeed, a similar approach might also work for other problems where sorting can be a bottleneck, e.g., smoothed analysis of approximate weighted matching [25].

Table 1 Result overview (expected and asymptotic). Distributed results assume random distribution of inputs. Input size n, output size s, sample size k, startup latency of point-to-point communication α, time for communicating one machine word β, log-weight ratio u = log U = log(w_max/w_min), mini-batches of b items per PE. The complexity of sorting n integers with keys from 0..x is isort_x(n) (isort* = parallel).
For subset sampling (Problem WRS-S), we parallelize the approach of Bringmann et al. [7] in Section 8. Once more, the preprocessing requires integer sorting. However, only O(log n) different keys are needed so that linear-work sorting works with logarithmic latency even deterministically on a CREW PRAM.
In Section 9, we adapt the sequential streaming algorithm of Efraimidis et al. [13] to a distributed setting where items are processed in small batches. This can be done in a communication-efficient way using our previous work on distributed priority queues [17].
Section 10 gives a detailed experimental evaluation of our algorithms for WRS-1 and WRS-R. Section 11 summarizes the results and discusses possible future directions.

Models of Computation
We strive to describe our parallel algorithms in a model-agnostic way, i.e., we largely describe them in terms of standard operations such as prefix sums for which efficient parallel algorithms are known on various models of computation. We analyze the algorithms for two simple models of computation. In each case, p denotes the number of processing elements (PEs). Most of our algorithms achieve polylogarithmic running time for a sufficiently large number of PEs. This is a classical goal in parallel algorithm theory, and we believe that it is now becoming practically important with the advent of massively parallel ("exascale") computing and fine-grained parallelism on GPGPUs.
For shared-memory algorithms we use the CREW PRAM model (concurrent read, exclusive write parallel random access machine) [19]. We will use the concepts of total work and span of a computation to analyze these algorithms. The span of a computation is the time needed by a parallel algorithm with an unbounded number of PEs.
For distributed-memory computations we use point-to-point communication between PEs, where exchanging a message of length ℓ takes time α + ℓβ. We assume 1 ≤ β ≤ α. We will use that prefix sums and (all-)reductions can be computed in time O(ℓβ + α log p) for vectors of size ℓ. The all-gather operation collects a value from each PE and delivers all values to all PEs. It can be implemented to run in time O(βp + α log p) [20]. We will particularly strive to obtain communication-efficient algorithms [36] where the total communication cost is sublinear in the local computation. Some of our algorithms are even communication-free.
We need one basic toolbox operation where the concrete machine model has some impact on the complexity. Sorting n items with integer keys from 1..K can be done with linear work in many relevant cases. Sequentially, this is possible if K is polynomial in n (radix sort). Radix sort can be parallelized even on a distributed-memory machine with linear work and span n^ε for any constant ε > 0. Logarithmic span is possible for K = O(log^c n) for any constant c, even on an EREW PRAM [31, Lemma 3.1]. For a CRCW PRAM, expected linear work and logarithmic span can be achieved when K = O(n log^c n) [31] (the paper states the constraint K = O(n), but the generalization is obvious and important for us in Section 7). Resorting to comparison-based algorithms, we get work O(n log n) and span O(log n) on an EREW PRAM [9].
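As a concrete illustration of this toolbox operation, here is a minimal sequential sketch of LSD radix sort (our own code, not taken from the paper): it sorts non-negative integer keys by repeated stable counting sorts on fixed-width digits, giving linear work when K is polynomial in n.

```python
def counting_sort_by_digit(a, shift, mask):
    """Stable counting sort of a by digit (key >> shift) & mask."""
    buckets = [[] for _ in range(mask + 1)]
    for x in a:
        buckets[(x >> shift) & mask].append(x)
    return [x for b in buckets for x in b]

def radix_sort(a, key_bits=32, digit_bits=8):
    """LSD radix sort for non-negative integer keys below 2**key_bits."""
    mask = (1 << digit_bits) - 1
    for shift in range(0, key_bits, digit_bits):
        a = counting_sort_by_digit(a, shift, mask)
    return a
```

The parallel and distributed variants cited above additionally replace the counting step by prefix sums over per-PE digit counts.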
Our pseudocodes are optimised for clarity.As a result, they do not closely reflect our actual implementations, details of which are given in Section 10.

Bucket-Based Sampling
The basic idea behind several solutions of Problem WRS-1 is to build a table of m = Θ(n) buckets where each bucket represents a total weight of W/m. Sampling then selects a bucket uniformly at random and uses the information stored in the bucket to determine the actual item. If item weights differ only by a constant factor, we can simply store one item per bucket and use rejection sampling to obtain constant expected query time (see, e.g., Devroye [10], Olken et al. [29]).
Deterministic sampling with only a single memory probe is possible using Walker's alias table method [39] and its improved construction due to Vose [38]. An alias table consists of m := n buckets, where bucket b[i] represents some part w′_i of the weight of item i. The remaining weight of the heavier items is distributed over the remaining capacity of the buckets such that each bucket represents only one other item (the alias a_i). Algorithm 1 gives high-level pseudocode for the approach proposed by Vose. The items are first classified into light and heavy items. Then the heavy items are distributed over light items until their residual weight drops below W/n. They are then treated in the same way as light items.
To sample an item, pick a bucket index r uniformly at random, toss a biased coin that comes up heads with probability w′_r·n/W, and return r for heads, or the alias b[r].a for tails.
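In code, the query is one table lookup plus one biased coin flip. The table below is a hypothetical precomputed alias table for the weights (1, 2, 3, 4), so W/n = 2.5: frac[r] is the heads probability w′_r·n/W of bucket r and alias[r] is its second item; all names here are ours.

```python
import random

frac = [0.4, 0.8, 0.6, 1.0]   # w'_r * n / W for weights (1, 2, 3, 4)
alias = [2, 3, 3, 3]          # the "other" item stored in each bucket

def sample(rng):
    r = rng.randrange(len(frac))                      # uniform bucket index
    return r if rng.random() < frac[r] else alias[r]  # biased coin
```

A consistency check: the probability mass item i receives, (frac[i] + sum of (1 − frac[b]) over buckets b with alias[b] = i) / n, equals w_i/W.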
Algorithm 1 Classical construction of alias tables similar to Vose's approach [38].

Weighted Sampling using Exponential Variates
It is well known that an unweighted sample without replacement of size k out of n items 1..n can be obtained by associating with each item a uniform variate v_i := rand(), and selecting the k items with the smallest associated variates. This method can be generalized to generate a weighted sample without replacement by raising uniform variates to the power 1/w_i and selecting the k items with the largest associated values [12, 13, 11]. Equivalently, one can generate exponential random variates v_i := −ln(rand())/w_i and select the k items with the smallest associated v_i [2] (the "exponential clocks" method), which is numerically more stable.
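A direct sketch of the exponential clocks method (our own code): each item draws the key −ln(U)/w_i and the k smallest keys win.

```python
import heapq
import math
import random

def weighted_sample_without_replacement(weights, k, rng):
    # Exponential clock for item i: -ln(U)/w_i with U uniform in (0, 1];
    # we use 1 - rng.random() to avoid ln(0).
    keys = [(-math.log(1.0 - rng.random()) / w, i)
            for i, w in enumerate(weights)]
    return [i for _, i in heapq.nsmallest(k, keys)]
```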

Divide-and-Conquer Sampling
Uniform sampling with and without replacement can be done using a divide-and-conquer algorithm [35]. To sample k out of n items uniformly and with replacement, split the set into two subsets with n′ (left) and n − n′ (right) items, respectively. Then the number of items k′ to be sampled from the left has a binomial distribution (k trials with success probability n′/n). We can generate k′ accordingly and then recursively sample k′ items from the left and k − k′ items from the right. This can be used for a communication-free parallel sampling algorithm. We have a tree with p leaves, one for each PE, where each leaf represents a subproblem of size about n/p. Each PE descends this tree to the leaf assigned to it (time O(log p)) and then generates the resulting number of samples (time O(k/p + log p) with high probability). Different PEs have to draw the same random variates for the same interior node of the tree. This can be achieved by seeding a pseudorandom number generator with an ID of this node.
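The communication-free descent can be sketched as follows (our own code; the helper names are hypothetical). The crucial trick is seeding the binomial draw with the tree-node ID, so all PEs descending through a node agree on the split without communicating.

```python
import random

def binomial(k, p, seed):
    """Naive Binomial(k, p) draw, derandomized by the node-ID seed so
    that every PE evaluating this node obtains the same value."""
    rng = random.Random(seed)
    return sum(rng.random() < p for _ in range(k))

def local_sample_count(pe, p, n, k):
    """Number of samples PE `pe` must draw, computed without communication."""
    lo, hi = 0, p      # PE range covered by the current subtree
    items = n          # items covered by the current subtree
    node = 1           # heap-style node ID, identical on all PEs
    while hi - lo > 1:
        mid = (lo + hi) // 2
        left_items = items * (mid - lo) // (hi - lo)
        k_left = binomial(k, left_items / items, seed=node)
        if pe < mid:
            hi, items, k, node = mid, left_items, k_left, 2 * node
        else:
            lo, items, k, node = mid, items - left_items, k - k_left, 2 * node + 1
    return k
```

Summed over all PEs, the counts telescope back to k, because sibling subtrees use the same seeded draw at their parent.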
Related Work

Sampling one Item (Problem WRS-1)
Extensive work has been done on generating discrete random variates from a fixed distribution [39, 38, 22, 10, 7]. All these approaches use preprocessing to construct a data structure that subsequently supports very fast (constant-time) sampling of a single item. Bringmann et al. [6] explain how to achieve expected query time r using only O(n/r) bits of space beyond the input distribution itself. There are also dynamic versions that allow efficient weight updates. Some (rather complicated ones) allow that even in constant expected time [16, 23].

Sampling Without Replacement (Problems WRS-N and WRP)
The exponential clocks method of Section 2.3 is an O(n) algorithm for sampling without replacement. This approach also lends itself to use in streaming settings (reservoir sampling) and can be combined with a skip-value distribution to reduce the number of required random variates from O(n) to O(k log(n/k)) [13]. A related algorithm for WRS-N with given inclusion probabilities instead of relative weights is described by Chao [8].
More efficient algorithms for WRS-N repeatedly sample an item and remove it from the distribution using a dynamic data structure [40, 29, 16, 23]. With the most efficient such algorithms [16, 23] we achieve time O(k), albeit at the price of an inherently sequential and rather complicated algorithm that might have considerable hidden constant factors.
It is also possible to combine techniques for sampling with replacement with a rejection method. However, the performance of these methods depends heavily on U, the ratio between the largest and smallest weight in the input, as the rejection probability rises steeply once the heaviest items are removed. Lang [21] gives an analysis and experimental evaluation of such methods for the case k = n (cf. "Permutation" below). A recent practical evaluation of approaches that lend themselves to efficient implementation is due to Müller [28].
The permutation Problem WRP can be seen as a special case of sampling without replacement with k = n. In particular, sorting the exponential variates from Section 2.3 computes such a permutation. Lang [21] compares different approaches to Problem WRP, including one based on card-shuffling techniques.

Parallel Sampling
There is surprisingly little work on parallel sampling. Even uniform unweighted sampling had many loose ends until recently [35]. Parallel uniformly random permutations are covered in [15, 34]. Efraimidis and Spirakis note that WRS-N can be solved in parallel with span O(log n) and work O(n log n) [12]. They also note that solving the selection problem suffices if the output need not be sorted. The optimal dynamic data structure for WRS-1 [23] admits a parallel bulk update in the (somewhat exotic) combining-CRCW-PRAM model. However, this does not help with Problem WRS-N since batch sizes are one.

Improved Sequential Alias Tables
Before discussing parallel alias table construction, we describe a simpler, faster, and more space-efficient sequential algorithm that is a better basis for parallelization. Previous algorithms need auxiliary arrays or queues of size Θ(n) in order to decide in which order the buckets are filled. Vose [38] mentions that this can be avoided but does not give a concrete algorithm. We now describe an algorithm with this property. The idea of the algorithm is that two indices i and j sweep the input array with respect to light and heavy items, respectively. The loop invariant is that the weight of the light (heavy) items preceding i (j) has already been distributed over some buckets and that their corresponding buckets have already been constructed. Variable w stores the residual weight of item j that has not yet been assigned to buckets. Each iteration of the main loop advances one of the indices and initializes one bucket. When the residual weight w exceeds W/n, item j is used to fill bucket i, the residual weight w is reduced by W/n − w_i, and index i is advanced to the next light item. Otherwise, the remaining weight of heavy item j fits into bucket j, and the remaining capacity of bucket j is filled with the next heavy item. Algorithm 2 (a sweeping algorithm for building alias tables) gives pseudocode that emphasizes the high degree of symmetry between these two cases.
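The sweep can be sketched in a few lines (our own sequential code in the spirit of the algorithm above; exact arithmetic is assumed, and a robust implementation needs extra care with floating-point rounding):

```python
def build_alias_table(w):
    n, W = len(w), sum(w)
    C = W / n                                  # bucket capacity W/n
    light = [i for i in range(n) if w[i] <= C]
    heavy = [i for i in range(n) if w[i] > C]
    frac = [1.0] * n                           # heads probability per bucket
    alias = list(range(n))                     # tails item per bucket
    i, j = 0, 0
    wres = w[heavy[0]] if heavy else 0.0       # unassigned weight of heavy[j]
    while j < len(heavy):
        if wres > C and i < len(light):
            li = light[i]                      # bucket li: its own weight
            frac[li] = w[li] / C               # plus a piece of heavy item j
            alias[li] = heavy[j]
            wres -= C - w[li]
            i += 1
        else:
            hj = heavy[j]                      # residual fits into the
            frac[hj] = min(1.0, wres / C)      # heavy item's own bucket
            j += 1
            if j < len(heavy):                 # top it up with the
                alias[hj] = heavy[j]           # next heavy item
                wres = w[heavy[j]] - (C - wres)
    return frac, alias
```

Correctness can be checked by accumulating, per item, the probability mass its buckets represent.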

Parallel Alias Tables
The basic idea behind our splitting-based algorithm is to identify subsets L and H of light (w_i ≤ W/n) and heavy (w_i > W/n) items such that they can be allocated precisely within their respective buckets, i.e., w(H ∪ L) := Σ_{i∈H∪L} w_i = (|H| + |L|)·W/n. By splitting the items into such pairs of subsets, we can perform alias table construction for these subsets in parallel. Since the above balance condition cannot always be achieved, we allow "stealing" a piece of a further heavy item, i.e., this item can be used to fill buckets in several subproblems. Such a split item will only be used as an alias, except in the last subproblem where it is used. Thus, the computed data structure is still an alias table.
We first explain how to split n items into two subsets of size n′ and n − n′. Similar to Vose's algorithm, we first compute arrays ℓ and h containing the indices of the light and heavy items, respectively. We then determine indices i and j with i + j = n′ such that σ := Σ_{k≤i} w_ℓ[k] + Σ_{k≤j} w_h[k] ≤ n′·W/n and σ + w_h[j+1] > n′·W/n; see Figure 1 for an example. These values can be determined by binary search over the value of j. By precomputing prefix sums of the weights of the items in ℓ and h, each iteration of the binary search takes constant time. The resulting left subproblem then consists of the light items ℓ[1..i], the heavy items h[1..j], and a piece of item h[j+1]. To split the input into p independent subproblems of near-equal size, we perform the above two-way split for the values n′_k = ⌊nk/p⌋ for k ∈ 1..p − 1. PE k is then responsible for filling a set of buckets corresponding to sets of light and heavy items, each represented by a range of indices into ℓ and h. A piece of a further heavy item may be used to make the calculation work out. Note that a subproblem might contain an empty set of light or heavy items and that a single heavy item j may be assigned partially to multiple subproblems, but only the last PE using a heavy item will fill its bucket.
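Under the assumption that the left subproblem fills i + j = n′ buckets, the two-way split can be sketched as a binary search over precomputed prefix sums (our own reconstruction, not the paper's exact routine):

```python
def split(w, light, heavy, nprime):
    """Return (i, j, spill): the left subproblem fills the buckets of
    light items light[:i] and heavy items heavy[:j] (with i + j = nprime)
    and steals `spill` units of weight from heavy[j]."""
    C = sum(w) / len(w)
    prefL, prefH = [0.0], [0.0]
    for x in light:
        prefL.append(prefL[-1] + w[x])
    for x in heavy:
        prefH.append(prefH[-1] + w[x])
    target = nprime * C
    weight = lambda j: prefL[nprime - j] + prefH[j]   # monotone in j
    lo = max(0, nprime - len(light))
    hi = min(nprime, len(heavy))
    while lo < hi:                # largest j with weight(j) <= target
        mid = (lo + hi + 1) // 2
        if weight(mid) <= target:
            lo = mid
        else:
            hi = mid - 1
    j = lo
    return nprime - j, j, target - weight(j)
```

The search is valid because swapping a light item for a heavy one never decreases the total weight, so weight(j) is nondecreasing in j.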
Algorithm 3 gives detailed pseudocode. It uses function split to compute p − 1 different splits in parallel. The result triple (i, j, s) of split specifies that the buckets of light items ℓ[1..i] and heavy items h[1..j] shall be filled using the left subproblem. Moreover, total weight s of item h[j + 1] is not used on the left side, i.e., it is spilled over to the right side. This splitting information is then used to make p parallel calls to procedure pack, giving each PE the task to fill up to ⌈n/p⌉ buckets. Pack has input parameters specifying the ranges of heavy and light items it should use. The parameter spill determines how much weight of the first heavy item of its range can be used for that. Pack works similarly to the sweeping algorithm from Algorithm 2. If the residual weight of that heavy item drops below W/n, the item is actually also packed in this call. The body of the main loop is dominated by one if-then-else case distinction: when the residual weight of the current heavy item falls below W/n, its bucket is filled using the next heavy item; otherwise, its weight is used to fill the current light item.

Algorithm 3 Pseudocode for parallel splitting-based alias table construction (PSA).
Procedure psaAliasTable(w_1, ..., w_n; b : Array of (w : R, a : N))
  W := Σ_i w_i                            -- total weight
  h := {i ∈ 1..n : w_i > W/n} : Array     -- parallel traversal finds heavy items
  ℓ := {i ∈ 1..n : w_i ≤ W/n} : Array     -- and light items

Theorem 1. We can construct an alias table with work Θ(n) and span Θ(log n) on a CREW PRAM.
Proof. The algorithm requires linear work and logarithmic span for identifying light and heavy items and for computing prefix sums [4] over them. Splitting works in logarithmic time. Then each PE needs time O(n/p) to fill the buckets assigned to its subproblem.

Distributed Alias Table Construction
The parallel algorithm described in Section 4.2 can also be adapted to a distributed-memory machine. However, this requires information about all items to be communicated. Hence, more communication-efficient algorithms are important for large n. To remedy this problem, we will now view sampling as a 2-level process implementing the owner-computes approach underlying many distributed algorithms.
Let E_i denote the set of items allocated to PE i. For each PE i, we create a meta-item of weight W_i := Σ_{j∈E_i} w_j. Sampling now amounts to sampling a meta-item and then delegating the task of sampling an actual item from E_i to PE i. The local data structures can be built independently on each PE. In addition, we need to build a data structure for sampling a meta-item. There are several variants in this respect with different trade-offs. For variant (1), we can perform an all-gather operation on the meta-items and compute the data structure for the meta-items in a replicated way. For variants (2) and (3), we can compute an alias table for the meta-items using a parallel algorithm. Sampling then needs an additional indirection: first, a meta-bucket j is computed; then a request is sent to PE j, which identifies the subset E_i from which the item should be selected and delegates the task of sampling from E_i to PE i. Variant (2) then follows by using the shared-memory algorithm from Section 4.2. It can be implemented to run in expected time O(α log² p) on a distributed-memory machine using PRAM emulation [32].
At the price of getting only expected query time, we can also achieve logarithmic latency (variant (3)) by using the rejection sampling algorithm from Section 4.4. The preprocessing there requires only prefix sums that can be implemented directly on distributed memory: we have to assign p meta-items to 2p meta-buckets (two on each PE). Suppose PE i computes the prefix sum k = ⌊2p·Σ_{j<i} W_j/W⌋. It then sends item i to PE j = ⌊k/2⌋. PE j then initiates a broadcast of item i to PEs j..⌊(k + 2p·W_i/W − 1)/2⌋. All of this is possible in time O(α log p).

Redistributing Items
As discussed so far, constructing distributed-memory 2-level alias tables is communication efficient. However, when large items are predominantly allocated on few PEs, sampling many items can lead to an overload on PEs with large W_i. We can remedy this problem by moving large items to different PEs or even by splitting them between multiple PEs. This redistribution can be done in the same way we construct alias tables. This implies a trade-off between redistribution cost (part of preprocessing) and load balance during sampling.
We now look at the case where an adversary can choose an arbitrarily skewed distribution of item sizes but where the items are randomly allocated to PEs (or where we actively randomize the allocation, implying O(n/p) additional communication volume). Proof. Let us distinguish between heavy items that are larger than cW/(p log p) for an appropriate constant c and the remaining light items. The expected maximum weight allocated to a PE based on light items is O(W/p) [33].
There can be at most p log(p)/c heavy items. By standard balls-into-bins arguments, only O(log p) heavy items can initially be allocated to any PE with high probability. We use the algorithm from Theorem 2-(2) to distribute the heavy items to p meta-buckets of remaining capacity max(0, W/p − S_i), where S_i is the total weight of the light items allocated to PE i. Using the bound from variant (2) would result in a time bound of O(log³ p) since we have a factor O(log p) more items. However, the only place where we need a full-fledged PRAM emulation is for doing the binary search, which takes only O(log p) steps on the PRAM and time O(α log² p) when emulated on distributed memory.
For the faster variant with rejection sampling, we use prefix sums to distribute the largest N := O(p log p) items such that each PE gets an even share of them. For this, we build groups of N/p = O(log p) items that we distribute in an analogous fashion to the proof of Theorem 2-(3): a prefix sum, followed by forwarding a group, followed by a segmented broadcast. The asymptotic complexity does not change since even messages of size O(log p) can be broadcast in time O(α log p), e.g., using pipelining. Finally, each PE unpacks the group it received and extracts the parts that it has to represent in the meta-table.

Compressed Data Structures for WRS-1
Bringmann and Larsen [6] give a construction similar to alias tables that allows expected query time O(r) using 2n/r + o(n) bits of additional space. We describe the variant for r = 1 in some more detail. We assign ⌈w_i·n/W⌉ buckets to each item, i.e., at most 2n buckets in total. Item i is assigned to buckets Σ_{j<i} ⌈w_j·n/W⌉ .. Σ_{j≤i} ⌈w_j·n/W⌉ − 1. A query samples a bucket j uniformly at random. Suppose bucket j is assigned to item i. Then the query returns i with probability w_i·n/(W·⌈w_i·n/W⌉), so that each item is returned with probability proportional to w_i. Otherwise, bucket j is rejected and the query starts over. Since the overall success probability is ≥ 1/2, the expected query time is constant.
The central observation for compression is that it suffices to store one bit per bucket that indicates whether a new item starts at that bucket. When querying bucket j, the item stored in it can be determined by counting the number of 1-bits up to position j. This rank operation can be supported in constant time using an additional data structure with o(n) bits. Further reduction in space is possible by representing r items together as one entry in b.
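A toy version of the rank structure (our own code; real implementations pack bits into machine words and use popcount, so that the in-block scan below becomes O(1) and the overhead o(n) bits):

```python
class Rank:
    def __init__(self, bits, block=8):
        self.bits, self.block = bits, block
        self.pre = [0]                     # number of 1-bits before each block
        for s in range(0, len(bits), block):
            self.pre.append(self.pre[-1] + sum(bits[s:s + block]))

    def rank1(self, j):
        """Number of 1-bits in bits[0..j-1]."""
        b = j // self.block
        return self.pre[b] + sum(self.bits[b * self.block:j])
```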
Both constructing the bit vector and constructing the rank data structure are easy to parallelize using prefix sums (for adding scaled weights and counting bits, respectively) and embarrassingly parallel computations. Shun [37] even gives a bit-parallel algorithm needing only O(n/log n) work for computing the rank data structure. We get the following result:

Output Sensitive Sampling With Replacement (Problem WRS-R)

The algorithm of Section 4.2 easily generalizes to sampling k items with replacement by simply executing k queries. Since the precomputed data structures are immutable, these queries can be run in parallel. We obtain optimal span O(1) and work O(k).

Corollary 5. After a suitable alias table data structure has been computed, we can sample k items with replacement with work O(k) and span O(1).
Yet if the weights are skewed, this may not be optimal since large items will be sampled multiple times. Here, we describe an output-sensitive algorithm that outputs only the distinct items in the sample together with how often they were sampled, i.e., a set S of pairs (i, k_i) indicating that item i was sampled k_i times. The work will be proportional to the output size s up to a small additive term. For highly skewed distributions, even k ≫ n may make sense.
Note that outputting multiplicities may be important for appropriately processing the samples. For example, let X denote a random variable where item i is sampled with probability w_i/W and suppose we want an unbiased estimator for the expectation of f(X) for some function f. Then Σ_{(i,k_i)∈S} k_i·f(i)/k is such an estimator.
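A small sketch of this representation and estimator (our own code; random.choices serves as a stand-in for the alias-table sampler, not as the paper's output-sensitive algorithm):

```python
import random
from collections import Counter

def output_sensitive_sample(weights, k, rng):
    """Draw k items with replacement and return S as {item i: count k_i}."""
    idx = rng.choices(range(len(weights)), weights=weights, k=k)
    return Counter(idx)

def estimate(S, f, k):
    """Estimator for E[f(X)]: sum of k_i * f(i), divided by k."""
    return sum(ki * f(i) for i, ki in S.items()) / k
```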
We will combine and adapt three previously used techniques for related problems: the bucket tables from Section 2.2, the divide-and-conquer technique from Section 2.4 [35], and the subset sampling algorithm of Bringmann et al. [7].
We approximately sort the items into u = log U groups of items whose weights differ by at most a factor of two: weight w_i is mapped to group ⌊log(w_i/w_min)⌋. In contrast to subset sampling, this has to extend to items with even the smallest weights.
To help determine the number of samples to be drawn from each group, we build a complete binary tree with one nonempty group at each leaf. Interior nodes store the total weight of the items in the groups assigned to their subtrees. This divide-and-conquer tree (DC-tree) allows us to generalize the divide-and-conquer technique from Section 2.4 to weighted items. Suppose we want to sample k elements from a subtree rooted at an interior node whose left subtree has total weight L and whose right subtree has total weight R. Then the number of items k′ to be sampled from the left has a binomial distribution (k trials with success probability L/(L + R)). We can generate k′ accordingly and then recursively sample k′ items from the left subtree and k − k′ items from the right subtree. A recursive algorithm can thus split the number of items to be sampled at the root into the numbers of items sampled at each group. When a subtree receives no samples, the recursion can be stopped. Since the distribution of weights over groups can be highly skewed, this stopping rule will be important in the analysis.
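The grouping and the weighted recursion can be sketched as follows (our own sequential code; a real DC-tree stores subtree weights once instead of recomputing the sums, and would use a proper binomial generator):

```python
import math
import random

def group_index(wi, wmin):
    """Group of item weight wi: weights within a group differ by < 2x."""
    return int(math.log2(wi / wmin))

def split_samples(group_weights, k, rng):
    """Recursively split k samples over groups in proportion to total
    group weight, pruning subtrees that receive no samples."""
    def rec(lo, hi, k):
        if k == 0:
            return {}                      # stopping rule: prune the subtree
        if hi - lo == 1:
            return {lo: k}
        mid = (lo + hi) // 2
        L = sum(group_weights[lo:mid])
        R = sum(group_weights[mid:hi])
        k_left = sum(rng.random() < L / (L + R) for _ in range(k))
        out = rec(lo, mid, k_left)
        out.update(rec(mid, hi, k - k_left))
        return out
    return rec(0, len(group_weights), k)
```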
For each group G, we integrate bucket tables and DC-trees as follows. For the bucket table, we can use a very simple variant that stores the n_G items of the group, whose weights come from an interval [a, 2a), in n_G buckets of capacity 2a. Sampling one element then uses a rejection method that repeats the sampling attempt when the random variate leads to an empty part of a bucket. We also build a DC-tree for each group. A simple linear mapping of items into the bucket table allows us to associate a range of relevant buckets b_T with each subtree T of the DC-tree.
For sampling m items from a group G, we use the DC-tree to decide which subtree has to contribute how many samples. When this number decreases to 1 for a subtree T, we sample this element directly and in constant expected time from the buckets in the range b_T. Figure 2 gives an example. We obtain the following complexities:

Theorem 6. Preprocessing for Problem WRS-R can be done in the time and span needed for integer sorting n elements with u = log U = log(w_max/w_min) different keys, plus linear work and logarithmic span (on a CREW PRAM) for building the actual data structure. Using this data structure, sampling a multiset S with k items and s distinct items can be done with span O(log n) and expected work O(s + log n) on a CREW PRAM.
Proof. Besides sorting the items into groups, we have to build binary trees of total size O(n). This can be done with logarithmic span and linear work using a simple bottom-up reduction. The bucket tables, which have total size n, can be constructed as in Section 4.2.
The span of a query is essentially the depth of the trees, log u + log n ≤ 2 log n.
Bounding the work for a query is more complicated since, in the worst case, the algorithm can traverse paths of logarithmic length in the DC-trees for just a constant number of samples taken at its leaves. However, this is unlikely to happen, and we show that in expectation the overhead is a constant factor plus an additive logarithmic term. We are thus allowed to charge a constant amount of work to each distinct item in the output and can afford a leftover logarithmic term.
We first consider the top-most DC-tree T that divides samples between groups.Tree T is naturally split into a heavy range of groups that contain some items which are sampled with probability at least 1/2 and a remaining range of light groups in which all items are sampled with probability at most 1/2.Assuming the heavy groups are to the left, consider the path P in T leading to the first light group.Subtrees branching from P to the left are complete subtrees that lead to heavy groups only.Since all leaves represent non-empty groups, we can charge the cost for traversing the left trees to the elements in the groups at the leaves -in expectation, at least half of these groups contain elements that are actually sampled.
Then follow at most 2 log n light groups that have a probability ≥ 1/n to yield at least one sample.These middle groups fit into subtrees of T of logarithmic total size and hence cause logarithmic work for traversing them.
The expected work for the remaining very light groups can be bounded by their number (≤ u ≤ n) times the length of the path in T leading to them (≤ log u ≤ log n) times the probability that they yield at least one sample (≤ 1/n).The product (≤ n log(n)/n = log n) is another logarithmic term.
Finally, Lemma 7 shows that the work for traversing DC-trees within a group is linear in the output size from each group.Summing this over all groups yields the desired bound.
Lemma 7. Consider a DC-tree plus bucket array for sampling with replacement of k out of n items where weights are in the range [a, 2a). Then the expected work for sampling is O(s), where s is the number of different sampled items.
Proof. If k ≥ n, Ω(n) items are sampled in expectation at a total cost of O(n). So assume k < n from now on. The first log k + O(1) levels of the DC-tree T may be traversed completely, contributing a total cost of O(k).
For the lower levels, we count the number Y of visited nodes from which at least 2 items are sampled. This is proportional to the total number of visited nodes, since nodes from which only one item is sampled contribute only constant expected cost (for directly sampling from the array) and since there are at most 2Y such nodes.
Let X denote the number of items sampled at a node at level ℓ of tree T. An interior node at level ℓ represents 2^(L−ℓ) leaves with total weight W_ℓ ≤ 2a·2^(L−ℓ), where L = ⌈log n⌉. X has a binomial distribution with k trials and success probability ρ = W_ℓ/W ≤ 2a·2^(L−ℓ)/(an) = 2^(L−ℓ+1)/n, since W ≥ an. Hence, P[X ≥ 2] = 1 − (1 − ρ)^k − kρ(1 − ρ)^(k−1) ≈ (kρ)²/2, where the "≈" holds for kρ ≪ 1 and was obtained by series development in the variable kρ. The expected cost at level ℓ > log k + O(1) is thus estimated as 2^ℓ · (kρ)²/2 ≤ 2k² · 2^(−ℓ), using 2^L ≈ n. At level ℓ = log k + 3 + i we thus get expected cost ≤ k·2^(−i). Summing this over i yields total cost Y = O(k).

Distributed Case
The batched character of sampling with replacement makes this setting even more adequate for a distributed implementation using the owner-computes approach. Each PE builds the data structure described above for its local items. Furthermore, we build a top-level DC-tree that distributes the samples between the PEs, i.e., with one leaf for each PE. We will see below that this can be done using a bottom-up reduction over the total item weights on each PE, i.e., no PRAM emulation or replication is needed. Each PE only needs to store the partial sums appearing on the path in the reduction tree leading to its leaf. Sampling itself can then proceed without communication: each PE simply descends its path in the top-level DC-tree analogous to the uniform case [35]. Afterwards, each PE knows how many samples to take from its local items. Note that we assume k to be known on all PEs and that communication for computing results from the sample is not considered here.

Theorem 8. Sampling k out of n items with replacement (Problem WRS-R) can be done in a communication-free way with processing overhead O(log p) in addition to the time needed for taking the local sample. Building and distributing the DC-tree for distributing the samples is possible in time O(α log p).
Proof. It remains to explain how the reduction can be done in such a way that it can be used as a DC-tree during a query, and such that each PE knows the path in the reduction tree leading to its leaf. First assume that p = 2^d is a power of two. Then we can use the well-known hypercube algorithm for all-reduce (e.g., [20]). In iteration i ∈ 1..d of this algorithm, a PE knows the sum for its local (i − 1)-dimensional subcube and receives the sum for the neighboring subcube along dimension i to compute the sum for its local i-dimensional subcube. For building the DC-tree, each PE simply records all these values.
For general values of p, we first build the DC-tree for d = ⌊log p⌋. Then, each PE i with i < 2^d and j = i + 2^d < p receives the aggregate local item weight from PE j and then sends its path to PE j.
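The power-of-two case can be simulated sequentially as follows. This is an illustrative sketch (the function name is ours, and real PEs would exchange the subcube sums via messages rather than reading a shared list): each PE records, per dimension, its own subcube sum and the neighboring subcube sum, and these pairs are exactly the child weights along its root-to-leaf path in the top-level DC-tree.

```python
def hypercube_allreduce_paths(local_weights):
    # Simulates the hypercube all-reduce on p = 2^d PEs. After iteration i,
    # sums[pe] holds the total weight of pe's i-dimensional subcube;
    # paths[pe] collects (own subcube sum, neighbor subcube sum) pairs.
    p = len(local_weights)
    d = p.bit_length() - 1
    assert p == 1 << d, "sketch assumes p is a power of two"
    sums = list(local_weights)
    paths = [[] for _ in range(p)]
    for i in range(d):
        nxt = sums[:]
        for pe in range(p):
            partner = pe ^ (1 << i)      # neighbor along dimension i
            paths[pe].append((sums[pe], sums[partner]))
            nxt[pe] = sums[pe] + sums[partner]
        sums = nxt
    return sums, paths  # sums[pe] == total weight, for every pe
```

During a query, a PE descends its recorded path from the total weight down to its leaf, splitting the sample count at each level according to the stored pair of subtree weights.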
Similar to Section 4.3, it depends on the assignment of the items to the PEs whether this approach is load balanced for the local computations. Before, we needed a balanced distribution of both the number of items and the item weights. Now the situation is better because items may be sampled multiple times but require work only once. On the other hand, we do not want to split heavy items between multiple PEs since this would increase the amount of work needed to process the sample. It would also undermine the idea of communication-free sampling if we had to collect samples of the same item assigned to different PEs. Below, we once more analyze the situation for items with arbitrary weight that are allocated to the PEs randomly.

Theorem 9. Consider an arbitrary set of item sizes and let u = log(max_i w_i / min_i w_i).

If items are randomly assigned to the PEs initially, then preprocessing takes expected time
O(isort¹ᵤ(n/p) + α log p), where isort¹ᵤ(x) denotes the time for sequential integer sorting of x elements using keys from the range 0..u. Using this data structure, sampling a multiset S with k items and s different items can be done in expected time O(s/p + log p).
Proof. For preprocessing, standard Chernoff bound arguments tell us that O(n/p + log p) items will be assigned to a PE with high probability. Since sorting is now a local operation, we only need an efficient sequential algorithm for approximately sorting integers. The term α log p is for the global DC-tree as in Theorem 8.
A sampling operation will sample s items. Since their allocation is independent of the choice of the sampled items, we can once more use Chernoff bounds to conclude that only O(s/p + log p) of them are allocated to any PE with high probability.

Sampling k Items Without Replacement (Problem WRS-N)
We can construct an algorithm for sampling without replacement based on the output-sensitive algorithm for sampling with replacement of Section 5. Presume we know an ℓ > k such that a sample of size ℓ with replacement contains at least k and no more than O(k) unique items. Then we can obtain a sample with k' ≥ k different items using the algorithm of Section 5, discard the multiplicities, and downsample to size k using the exponential clocks method (see Section 2.3). The crucial step, of course, is to find ℓ. This depends heavily on the distribution of the input: if all items have similar weight, ℓ will be little larger than k, but if the weights are heavily skewed, it could have to be exponentially large. Our main observation is that by having distributed the items into groups of similar weight, we already have enough information to compute a good value for ℓ in logarithmic time. We precompute the total weight of each group, their number of items, and the prefix sums of these values. At query time, we perform a binary search over the groups. Suppose the currently considered group stores elements with weights in the range [a, 2a). Then we try the value ℓ = ⌊1/(2a)⌋. We (over)estimate the resulting number of unique samples as t_ℓ := |{i : w_i/W ≥ 1/ℓ}| + ℓ·Σ_{w_i/W < 1/ℓ} w_i/W, i.e., the precomputed number of items with relative weight at least 1/ℓ plus ℓ times the total weight of the lighter items divided by W. Below we show that aiming for an overestimate of 2k different samples will actually yield ≥ k with constant probability. Overall, we get:
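The search for a suitable oversampling size can be sketched as follows. This is a simplified stand-in (the function names are ours): it evaluates the overestimate by a direct sum over all items, whereas the paper evaluates it in O(1) per probe from precomputed group prefix sums, and it searches over all integers rather than only over group boundaries.

```python
def overestimate_unique(weights, W, ell):
    # t_ell: items with relative weight >= 1/ell count fully; lighter items
    # contribute ell * w_i / W. This upper-bounds the expected number of
    # unique items in a size-ell sample with replacement.
    t = 0.0
    for w in weights:
        t += 1.0 if w / W >= 1.0 / ell else ell * w / W
    return t

def choose_ell(weights, k):
    # Find the smallest ell whose overestimate reaches 2k: double first,
    # then bisect. t_ell is non-decreasing in ell, so this is well-defined.
    assert 2 * k <= len(weights), "overestimate can never exceed n"
    W = sum(weights)
    lo, hi = 1, 1
    while overestimate_unique(weights, W, hi) < 2 * k:
        lo, hi = hi, hi * 2
    while lo < hi:
        mid = (lo + hi) // 2
        if overestimate_unique(weights, W, mid) >= 2 * k:
            hi = mid
        else:
            lo = mid + 1
    return lo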

Theorem 10. Preprocessing for Problem WRS-N is possible with the same work and time bounds as for Problem WRS-R (Theorem 6). Sampling k items without replacement is then possible with span O(log n) and expected work O(k) on a CREW PRAM.
Proof. First, compute W and apply the preprocessing of Section 5. Then, compute each group's relative total weight, g_i := Σ_{j∈G_i} w_j / W, where G_i denotes the group with index i, and their prefix sums, h_i := Σ_{j≤i} g_j. Further, let c_i := Σ_{j≤i} |G_j| be the number of items in all groups up to group i. All of these steps are possible with linear work and logarithmic span on top of the preprocessing for WRS-R of Theorem 6.
At query time, we can find a suitable value of ℓ in time O(log n) using binary search over the non-empty groups. These are exactly the leaves of the top-level DC-tree constructed by the preprocessing for WRS-R. Let i be the index of the group currently under consideration and [a_i, 2a_i) be the interval of probabilities assigned to the group. Then we can evaluate t_ℓ for ℓ := ⌊1/(2a_i)⌋ in constant time as t_ℓ = ℓ·h_i + (n − c_i). Lemma 12 states that we can find a group i whose maximum weight 2a_i gives us a value of ℓ such that sampling ℓ elements with replacement yields 2k unique elements in expectation, but at the same time not too many more (O(k)). This ensures that the output contains at least k unique elements with sufficient probability. However, this group may be empty and thus not considered in the binary search. Let G_i and G_j be the non-empty groups with items of next-smaller and next-larger weights, respectively. Then we are free to choose ℓ in the range 2a_i..a_j. We do this by solving the linear equation of Lemma 11 for ℓ. As there are no items with weights in the range considered, the weight sum and set cardinality in the equation of the lemma remain constant, and we can solve for ℓ in constant time to find the most suitable value of t_ℓ.
We then draw a sample with replacement of size ℓ as in Theorem 6, discarding the multiplicities. If the resulting sample has fewer than k unique elements, it is rejected and the sampling is repeated. By Markov's inequality, the probability of this happening is at most 1/2, as ℓ was chosen to yield ≥ 2k samples in expectation. In a postprocessing step, we downsample the resulting sample to the required size k using the exponential clocks technique of Section 3.2. This needs O(k) work and O(log n) span for a selection of k out of Θ(k) items [17, Section 4].
How the expected number of samples E[X] is influenced by the choice of ℓ depends on the distribution of the inputs (Lemma 11). Items with small probability of being sampled contribute linearly in their weight, whereas large items are expected to be sampled.
Proof. We can bound the number of unique items in a sample with replacement of size ℓ by considering the probability of any item to not be sampled. Item i is not part of the sample with probability (1 − w_i/W)^ℓ, and we obtain E[X] = Σ_i (1 − (1 − w_i/W)^ℓ), where X is the random variable describing the number of unique items in the sample.
We split the formula for the expectation of X into two parts: the items for which w_i/W < 1/ℓ, and those with w_i/W ≥ 1/ℓ. For the first group, by Bernoulli's inequality, we obtain 1 − (1 − w_i/W)^ℓ ≤ ℓ·w_i/W, where we can write ℓ·w_i/W = b_i·c with b_i := n·w_i/W and c := ℓ/n, and thus the first term of t_ℓ is explained. Consider now the items with w_i/W ≥ 1/ℓ, i.e., the items which yield at least one sample in expectation. Clearly, 1 − (1 − w_i/W)^ℓ ≤ 1. In the other direction, because w_i/W ≥ 1/ℓ, we obtain 1 − (1 − w_i/W)^ℓ ≥ 1 − 1/e by the inequality (1 + x/n)^n ≤ e^x, for ℓ > 1. Otherwise, if ℓ ≤ 1, either no item with w_i/W ≥ 1/ℓ = 1 exists, or we are in the trivial case with n = 1 and w_1 = W. Summing the results for the first and second group, we obtain the claimed inequalities.
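The two bounds of the proof can be checked numerically. This is a small sanity check in our own notation: it computes the exact expectation E[X] = Σ_i (1 − (1 − w_i/W)^ℓ) and the estimate t_ℓ, and verifies that t_ℓ overestimates E[X] by at most the constant factor 1/(1 − 1/e) ≈ 1.58 implied by the proof.

```python
def expected_unique(weights, ell):
    # Exact E[X] for a sample of size ell with replacement.
    W = sum(weights)
    return sum(1 - (1 - w / W) ** ell for w in weights)

def t_estimate(weights, ell):
    # The overestimate t_ell from Lemma 11: heavy items count 1,
    # light items contribute ell * w_i / W.
    W = sum(weights)
    return sum(1.0 if w / W >= 1.0 / ell else ell * w / W for w in weights)
```

For geometrically growing weights, both quantities stay within the predicted constant factor of each other.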
An example of this is illustrated in Figure 3. By applying the above estimation to the groups used by the algorithm of Section 5, we can quickly obtain an estimate for the output size that is at most a factor of two worse.

Lemma 12. Applying the estimation of Lemma 11 to groups of items of similar weight as in Section 5 loosens the bound on E[X] by at most a factor of two.
Proof. Item i is in group ⌈log(w_i/w_min)⌉. Label the groups G_1..G_u where u := log U = log(w_max/w_min) and let the value range of group i be [a_i, 2a_i). The difference to Lemma 11 now is that we can only choose ℓ as ⌊1/(2a_i)⌋ for some group i. This necessitates moving entire groups between the left and right halves of the estimation. Pick a group i, i.e., ℓ := ⌊1/(2a_i)⌋, and consider the effects of choosing group i + 1 instead, i.e., ℓ' := ⌊1/(2a_{i+1})⌋ = ⌊1/(4a_i)⌋ = ℓ/2. Clearly, the contribution of any group j > i + 1 is the same for t_ℓ and t_ℓ'. Furthermore, the contribution of groups 1..i is halved, as ℓ' = ℓ/2. It remains to bound the contribution of group i + 1, which is moved from the "right" part of the estimate to the "left" part: it contributes |G_{i+1}| to t_ℓ, and ℓ'·ξ with ξ := Σ_{j∈G_{i+1}} w_j/W to t_ℓ'. However, as all items in G_{i+1} are in the range [2a_i, 4a_i) and ℓ' = ⌊1/(4a_i)⌋, we have ℓ'·ξ ≥ |G_{i+1}|/2, and thus the contribution of group i + 1 at most doubles, too. Thus, t_ℓ ≤ 2t_ℓ', and for any ℓ of Lemma 11 we can find an ℓ̂ that yields at most twice as many unique items in expectation while only relying on information about the groups, not the individual w_i.

Distributed Sampling without Replacement
We again use the owner-computes approach and adapt the distributed data structure for Problem WRS-R from Section 5.1. However, to find the right estimate ℓ for the number of samples with replacement, we need to perform a global estimation taking all items into account. This can be achieved by finding the global sum of all the local prefix sums in each step of the binary search. This increases the latency of the algorithm to O(α log(p) log(n)). Also, in each iteration we need a nested binary search in order to find corresponding buckets in the local bucket arrays.
If the global number of buckets u is not too large, we can consider a different trade-off between preprocessing cost and query cost. We precompute replicated arrays of global bucket sizes and weights. This allows finding the right value of ℓ with local work O(log u) ≤ O(log n).
Local sorting is then performed using bucket sort with u buckets in time O(n/p + u). The global bucket sizes can then be computed as an all-reduction on the local arrays of bucket sizes in time O(βu + α log p).
During a query, after sampling with replacement, we need a global parallel selection algorithm to reduce the sample size to k. This can be done in expected time O(βk/p + α log p) using the unsorted selection algorithm from [17, Section 4]. The term βk/p stems from the need to redistribute samples taken within the selection algorithm. For randomly distributed data this is not necessary and the term becomes a local work term k/p. We get the following result:

Theorem 13. Consider an arbitrary set of item sizes and let u = log(max_i w_i / min_i w_i).

If items are randomly assigned to the PEs initially, then preprocessing takes expected time
O(isort¹ᵤ(n/p) + α log p), where isort¹ᵤ(n/p) denotes the time for sequential integer sorting of n/p elements using keys from the range 0..u. Using this data structure, sampling k items without replacement can be done in expected time O(k/p + α log(n) log(p)).
A variant that uses a replicated array of global bucket sizes works with preprocessing time O(n/p + βu + α log p) and query time O(k/p + log u + α log p).

Permutation (Problem WRP)
As already explained in Section 2.3, weighted permutation can be reduced to sorting random variates of the form −ln(r)/w_i, where r is a uniform random variate. The nice thing is that a lot is known about parallel sorting. The downside is that sorting may need superlinear work in the worst case. However, since we are sorting random numbers, we may still get linear expected work. This is well known when sorting uniform random variates; e.g., [26, Theorem 5.9]. The idea is to map the random variates in linear time to a small number of buckets such that the occupancy of a bucket is bounded by a binomial distribution with constant expectation. Then the buckets can be sorted using a comparison-based algorithm without getting more than linear work in total. Here we explain how to achieve the same for the highly skewed distribution needed for WRP. We use the monotonous transformation f(r, w_i) := n·ln(−ln(r)·n·w_max/w_i) of the above mapping function, where w_max = max_j w_j. Except for an expected constant number of elements, r will be in the range [1/n, 1 − 1/n]. This corresponds to a range of [0, n·ln(nU ln n)] for f, where w_min = min_j w_j and U = w_max/w_min. Values outside this range will be mapped to keys −1 and ⌈n·ln(nU ln n)⌉. The remaining items will be mapped to the integer key ⌊f(r, w_i)⌋. We then perform integer sorting on the truncated keys and apply comparison-based sorting to the resulting buckets of items with the same integer keys.
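The key generation and the two-stage sort can be sketched as follows. This is an illustrative sequential sketch (the function name is ours): it uses an ordinary sort on (integer key, exact key) pairs as a stand-in for integer sorting on the truncated keys followed by small per-bucket comparison sorts, which is equivalent because ⌊f⌋ is a monotone function of the exact key.

```python
import math
import random

def weighted_permutation_keys(weights, rng=random):
    # Each item i gets the key v = -ln(r)/w_i; sorting by v yields a
    # weighted random permutation. f maps v monotonically to a small
    # integer range so that bucket/integer sort suffices in expectation.
    n = len(weights)
    wmax = max(weights)
    keys = []
    for i, w in enumerate(weights):
        r = rng.random()
        v = -math.log(r) / w           # exponential variate with rate w
        f = n * math.log(v * n * wmax) # integer bucket key after flooring
        keys.append((math.floor(f), v, i))
    keys.sort()  # stand-in for integer sort plus per-bucket cleanup
    return [i for _, _, i in keys]
```

With one heavily weighted item, that item ends up first in the permutation with probability close to its relative weight, as expected.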
Theorem 14. Problem WRP can be solved in the work and span needed for integer sorting of n items with keys from the range 0..O(n ln(nU )).
Proof. The main proof obligation is to show that buckets have constant expected occupancy regardless of the weight distribution. First note that f can be written as n(ln(−ln r) + ln(n·w_max/w_i)). The factor n just scales the bucket size. The two terms multiplied with n separate the influence of the uniform variate r and of the weights. This means that the weight term just shifts the distribution produced by the term ln(−ln(r)), i.e., the overall distribution is a mix of shifted versions of the same distribution, one for each weight value. This means that the maximum bucket occupancy is maximized if all weights are the same. Let us ignore the scaling factor n and the shift for now and concentrate on the mapping g(r) = ln(−ln r). The value of r needed to produce a value x is e^(−e^x). The value of r needed to produce a value x + δ is e^(−e^(x+δ)). Hence, the probability to produce a value in [x, x + δ] is e^(−e^x) − e^(−e^(x+δ)). In other words, the probability density is maximized at the x where the derivative of e^(−e^x) is minimized. Using calculus, it can be shown that this is the case at x = 0. Hence, the probability that an item is mapped to a bucket is bounded by the width of the bucket times e^(−e^0) = 1/e. Taking the scaling factor n into account now, we see that the bucket width is 1/n, yielding a probability of ≤ 1/(ne) for any item to be mapped to any particular bucket. Hence, the occupancy of any bucket is bounded by a binomial distribution with success probability ≤ 1/(ne) and n trials, i.e., with an expected occupancy of 1/e.
The proof from [26] for the uniform case transfers, giving overall work O(n) for sorting the buckets. The span for that part is O(log n) with high probability. This can easily be shown using Chernoff bounds.
Depending on U and the machine model, this result can yield very efficient algorithms. If span n^ε for some constant ε > 0 is acceptable, we get linear work if log U is polynomial in n using radix sort. If U itself is polynomial in n, we get logarithmic span on a CRCW PRAM [31].

Subset Sampling (Problem WRS-S)
Subset sampling is a generalization of Bernoulli sampling to the weighted case. The unweighted case can be solved in expected time linear in the output size by computing the geometrically distributed distances between elements in the sample [1]. The naïve algorithm for the weighted problem, which consists of throwing a biased coin for each item, requires O(n) time. Bringmann et al. [7] show that this is optimal if only a single subset is desired, and present a sequential algorithm that is also optimal for multiple queries. The difference between WRS-S on the one hand and WRS-1/WRS-R on the other hand is that we do not have a fixed sample size but rather treat the item weights as independent inclusion probabilities (this requires w_i ≤ 1 for all i). Hence, different algorithms are required. Observe that the expected sample size is W ≤ n. Our goal is then to devise a parallel preprocessing algorithm with work O(n) which subsequently permits sampling with work O(1 + W).
We now parallelize the approach of Bringmann et al. [7]. Similar to our algorithm for sampling with replacement, this algorithm is based on grouping items into sets of similar weight. In each group, one can use ordinary Bernoulli sampling in connection with rejection sampling. Load-balanced division between PEs can be done with a prefix sum calculation over the weights in each group.

Theorem 15. Preprocessing for Problem WRS-S can be done with work O(n) and span O(log n). A query can then be implemented with expected O(W + 1) work and O(log n) span.
Proof. Approximately sort the items into L + 1 buckets with L := ⌈log n⌉, where bucket i is B_i := {j | 2^(−i) ≥ w_j ≥ 2^(−(i+1))} for i ∈ 0..L − 1, and B_L := {j | 2^(−L) ≥ w_j} contains all sufficiently improbable elements. This can be done with linear work and logarithmic span on a CREW PRAM [31, Lemma 3.1]. Let σ denote the permutation of the items implied by this re-ordering.
Next, we precompute an assignment of consecutive ranges of permuted items to PEs. Observe that we need not care about bucket boundaries regarding the assignment; there are no dependencies between elements, and the result remains a valid subset sample. Therefore, we compute a prefix sum ŵ_i := Σ_{j≤i} w_σ(j) over the inclusion probabilities in their new order. PE i then handles the items whose ŵ_j fall into the range [(i − 1)W/p, i·W/p). These can be found with linear work and constant span by checking for every neighboring pair of items whether a boundary falls between them, and if so, which PE's boundary it is.
Sampling then proceeds on all buckets in parallel using the assignment calculated during preprocessing. In bucket i, all items have weight at most w̄_i := max_{j∈B_i} w_j ≤ 2^(−i), and, with the exception of bucket L, at least w̄_i/2. PEs generate geometrically distributed skip values v := ⌈ln(rand())/ln(1 − w̄_i)⌉ and then consider the item v positions ahead of the current one, say item j. The algorithm then uses rejection to output the item with probability w_j/w̄_i. This process is repeated until a PE exceeds its allotted item range. In all buckets i < L, the acceptance probability is at least 1/2 for every element, leading to an efficient algorithm in expectation. In bucket L, the smaller acceptance probability is not a problem, as with high probability only a constant number of items is ever considered to begin with. This is because items in bucket L have probability at most 1/n. In total, this requires expected work O(1 + W) in expectation.
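The per-bucket skip-and-reject routine can be sketched as follows. This is a sequential illustration of one bucket only (the function name is ours); the parallel algorithm runs it per PE over the assigned item ranges. It assumes the bucket's weight cap wbar is strictly less than 1.

```python
import math
import random

def subset_sample_bucket(weights, wbar, rng=random):
    # weights[j] <= wbar for all j. Conceptually, run Bernoulli(wbar) over
    # the items, jumping directly to the next success via a geometric skip,
    # then accept the hit item j with probability weights[j] / wbar.
    assert 0.0 < wbar < 1.0 and all(w <= wbar for w in weights)
    out = []
    j = -1
    while True:
        r = rng.random()
        skip = int(math.log(r) / math.log(1.0 - wbar)) + 1  # geometric >= 1
        j += skip
        if j >= len(weights):
            return out
        if rng.random() < weights[j] / wbar:
            out.append(j)
```

Each item j is included with probability wbar · (w_j/wbar) = w_j, independently of all others, which is exactly the subset-sampling semantics.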
Although all PEs have to do about the same expected amount of work, there are some random fluctuations between the actual amounts of work depending on the actual exponential variates that are computed. Using Chernoff bounds once more, these can be bounded by O(W/p + log n) with high probability, leading to a (conservative) log n term for the span.
Each PE returns an array containing its part of the sample. An additional prefix sum computation can be used to rearrange the output into one contiguous array. This requires work O(n) and span O(log n).

Distributed Subset Sampling
For the distributed setting, observe that Problem WRS-S can be solved entirely independently over disjoint subsets of the input if we are not interested in load balancing. No communication is needed. We thus can directly adapt the result of Bringmann et al. [7]. Note that for a PE with n_i items, it suffices to sort them into log n_i categories now, which is possible in linear time. A query on a PE with total weight W_i will take expected time O(1 + W_i). As expected parallel query time we get O(max_i W_i + log n) using an argument analogous to the proof of Theorem 15.
Once more, we analyze load balancing for the case that the items are distributed randomly.
Proof. The bound for the preprocessing follows by applying Chernoff bounds to the distribution of the number of items. The query bound follows in a similar way by exploiting that the expected maximum load is maximized when all the weight is concentrated in W items of weight 1 [33].

Sampling with a Reservoir (Problem WRS-B)

We adapt the streaming algorithm of Efraimidis et al. [13] to a distributed mini-batch streaming model (also known as discretized streams), where PEs process variable-size batches of items one at a time. The PEs' memory is too small to store previous batches; only the current mini-batch is available in memory. This is a generalization of the traditional data stream model and widely used in practice, e.g., in Apache Spark Streaming [41], where it is called discretized streams. The basic idea is to keep the reservoir in a distributed priority queue [17].
First, we show how to adapt sampling with skip values (exponential jumps) of Efraimidis et al. [13, Section 4] to the exponential variates described in Section 2.3. This allows for faster and numerically more robust generation in practice. The difference between the variates associated with items in Efraimidis et al. [13] and here is a simple x → −ln(x) mapping. Because of the sign inversion, the reservoir R contains the elements with the smallest associated variates. Let v_i denote the key of item i, that is, the exponentially distributed variate associated with it, and define T := max_{i∈R} v_i as the threshold value, i.e., the largest key of any item in the reservoir. Then the skip value X describing the total amount of weight to be skipped can be computed as X := −ln(rand())/T. The key associated with the item j that is to be inserted into the reservoir is then v_j := −ln(rand(e^(−T·w_j), 1))/w_j, where rand(a, b) := a + rand()·(b − a) generates a uniform random variate from the interval [a, b). The range of this variate has been chosen so that v_j is less than T (as at this step in the algorithm, it has already been determined that item j must be part of the reservoir, we need to compute a suitable variate from the distribution associated with its weight). It then replaces the item with the largest key in the reservoir, and the threshold T is updated accordingly.
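The sequential skip-value algorithm can be sketched as follows. This is a minimal single-threaded illustration (the function name is ours), using a heap instead of the search trees of the distributed variant; keys are negated so that Python's min-heap acts as a max-heap whose root is the threshold T.

```python
import heapq
import math
import random

def weighted_reservoir(stream, k, rng=random):
    # stream yields (item, weight). Keep the k items with the smallest
    # exponential keys v = -ln(rand())/w. Between reservoir insertions,
    # skip a total weight of X = -ln(rand())/T, where T is the largest
    # key in the reservoir.
    heap = []      # entries (-v, item); root holds the largest key
    skip = 0.0     # amount of weight still to be skipped
    for item, w in stream:
        if len(heap) < k:
            v = -math.log(rng.random()) / w
            heapq.heappush(heap, (-v, item))
            if len(heap) == k:
                skip = -math.log(rng.random()) / (-heap[0][0])
            continue
        skip -= w
        if skip > 0:
            continue                      # item skipped entirely
        T = -heap[0][0]
        # item enters: draw its key conditioned on being below T
        v = -math.log(rng.uniform(math.exp(-T * w), 1.0)) / w
        heapq.heapreplace(heap, (-v, item))
        skip = -math.log(rng.random()) / (-heap[0][0])
    return [item for _, item in heap]
```

A very heavy item almost surely exhausts the current skip and then receives a key far below the threshold, so it ends up in the reservoir with overwhelming probability.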
We now show how to use this sequential algorithm to construct a distributed weighted reservoir sampler.
Proof. We maintain the local reservoirs using augmented search trees that support splitting in logarithmic time [26]. When processing a mini-batch of size b, we proceed as in the sequential algorithm above and insert new elements into the reservoir. However, the insertion threshold T remains unchanged (initially, and as long as fewer than k elements are in the global reservoir, T = −∞). By definition of b*, these insertions require time O(b* log(b* + k)) in total. Since we have to process each element's weight even when using the above skip value technique, O(b) time is required to identify the items to be inserted into the reservoir. Once all PEs have finished processing their batch, we compute the new threshold. Using a distributed selection operation on the search trees, this can be implemented in expected time O(α log²(kp)) [17]. We then discard the items with keys exceeding the new threshold in O(log(k + b*)) time using a split operation on the local search tree.
The question now is how many items we unnecessarily insert into the local reservoirs. Efraimidis et al. [13, Proposition 7] show that if the w_i are independent random variates from a common continuous distribution, their sequential reservoir sampling algorithm inserts O(k log(n/k)) items into the reservoir in expectation. We adapt this to mini-batches of b elements per PE. Let X_i denote the number of insertions on a PE for batch i. We obtain a binomially distributed random variable with expectation E[X_i] ≤ b·k/n_pre, where n_pre is the number of elements seen globally before the batch began. For the initial i_0 = ⌈k/(bp)⌉ iterations, this probability bound exceeds one and we account for this with b insertions per PE, overall b·⌈k/(bp)⌉ ≈ k/p insertions. For mini-batches i_0 ≤ i < n/(pb) we have n_pre ≥ i·b·p and obtain Σ_{i≥i_0} E[X_i] ≤ (k/p)·(H_{n/(pb)} − H_{i_0−1}) = O((k/p) log(n/k)), where H_n is the n-th harmonic number.
To obtain the maximum load over all PEs, we apply a Normal approximation to the bound on the X_i. Summing these over the mini-batches as above, we again obtain a Normal distribution whose mean and variance are the sum of its summands' means and variances. We then apply a bound on the maximum of i.i.d. Normal random variables obtained using the Cramér-Chernoff method.

Experiments
We concentrate on alias tables (Problem WRS-1, Section 4) and sampling with replacement (Problem WRS-R, Section 5). The reason is that our algorithms for Problems WRS-N, WRP, and WRS-B are basically reductions to sorting and selection problems that have been studied elsewhere (WRS-N also needs WRS-R, which we do study here). We are also not aware of competing parallel approaches that we could compare to. The algorithm for Problem WRP is quite simple and not that different from the unweighted case studied in Sanders et al. [35].
Experimental Setup. We use machines with Intel and AMD processors in our experiments.
The Intel machine has four Xeon Gold 6138 CPUs (4 × 20 cores, 160 hyper-threads, of which we use up to 158 to minimize the influence of system tasks on our measurements) and 768 GiB of DDR4-2666 main memory. The AMD machine is a single-socket system. All pseudorandom numbers are generated with 64-bit Mersenne Twisters [24], using the Intel Math Kernel Library (MKL) [18] on the Intel machine and dSFMT on the AMD machine. All of our implementations are publicly available under the GNU General Public License (version 3) at https://github.com/lorenzhs/wrs. Compared to the descriptions in Sections 2.2 and 4.2, we performed a minor modification to the construction of the tables to improve query performance. In the alias table, we store tuples (w_i, A = [i, a_i]) of a weight w_i, item i, and alias a_i. This allows for an optimization at query time, where we return A[rand()·W/n ≥ w_i], saving a conditional branch. When indices are 32-bit integers and weights are 64-bit doubles, this does not use additional space since the record size is padded to 128 bits anyway.
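The branchless bucket-internal selection can be illustrated as follows. This is a Python sketch of the record layout and query only (the function name is ours, and the comparison result is used as an index, mirroring the A[rand()·W/n ≥ w_i] trick rather than the actual C++ code); the tiny hand-built table corresponds to weights [1, 3] with total weight W = 4.

```python
import random

def alias_query(table, W, rng=random):
    # table[i] = (w_i, (i, alias_i)): the bucket's fill w_i plus a
    # two-entry record, so the in-bucket choice is an index computation
    # (comparison result 0 or 1) instead of a conditional branch.
    n = len(table)
    u = rng.random() * W        # uniform in [0, W)
    i = int(u * n / W)          # bucket index
    frac = u - i * (W / n)      # position within the bucket, in [0, W/n)
    w, pair = table[i]
    return pair[frac >= w]      # False -> the item itself, True -> alias
```

Each item is returned with probability proportional to its weight: bucket i is hit with probability 1/n, and the in-bucket position decides between item and alias.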
Sequential Performance. Surprisingly, many common existing implementations of alias tables (e.g., gsl_ran_discrete_preproc in the GNU Scientific Library (GSL) [14] or sample in the R project for statistical computing [30]) use a struct-of-arrays memory layout for the alias table data structure. By using the array-of-structs paradigm instead, we can greatly improve memory locality, incurring one instead of up to two cache misses per query. Combined with branchless selection inside the buckets and a faster random number generator, our query is more than three times as fast as that of GSL version 2.5 (measured for n = 10^8). At the same time, alias table construction using our implementation of Vose's method is 30 % faster than GSL's. Other popular statistics packages, such as NumPy (version 1.5.1, function np.choice) or Octave (Statistics package version 1.4.0, function randsample), employ algorithms with superlinear query time. We therefore use our own implementation of Vose's algorithm as the baseline in our evaluation.
Among our sequential implementations, construction with Vose's method is slightly faster than our sweeping algorithm. On the Intel machine, it is 8 % faster, while on the AMD machine, the gap is 3 %. However, since all of our measurements exclude the time for memory allocations, this is not the full picture. If we include memory allocation, our method is around 5 % faster than Vose's on both machines. This is because it requires no additional space, compared to O(n) auxiliary space for Vose's method.
The optimization described above to make queries branchless lowers query time substantially, namely by 22 % on the Intel machine and 27 % on the AMD machine, again for n = 10^8. Storing the item indices at construction time comes at no measurable extra cost.

Construction
Speedups compared to an optimized sequential implementation of Vose's alias table construction algorithm are shown in Figure 4 (strong scaling with n = 10 8 and n = 10 9 , weak scaling with n/p = 10 7 uniform random variates).Speedups do not increase further once the machine's memory bandwidth is saturated, limiting the speedup can be achieved with techniques that require multiple passes over the data (PSA, 2lvl-classic).In contrast, 2lvl-sweep can be constructed almost independently by the PEs and requires much fewer accesses to memory.Sequentially, there is little difference between our sweeping algorithm and Vose's method.However, our algorithm scales much better to high thread counts because it reduces the memory traffic and since hyper-threading (HT) helps to hide the overhead of branch mispredictions.For large per-thread inputs, 2lvl-sweep achieves more than twice the speedup of 2lvl-classic.Preprocessing for OS introduces some overhead but requires fewer passes over memory than PSA and achieves approximately twice the speedups as a result.Interestingly, on the AMD machine it benefits from hyper-threading much more than on the Intel machine.We can also see that by processing each group of each thread independently, it makes good use of the cache: the large caches of the Intel machine give it a boost for n = 10 8 (Figure 4a).Once groups no longer fit into cache (n = 10 9 , Figure 4c), speedups are somewhat less.
In the weak scaling experiments (Figures 4e and 4f), we again see clearly how 2lvl-classic and PSA are limited by memory bandwidth. Using more than two threads per available memory channel (4 × 6 for the Intel machine, 8 for the AMD machine) yields nearly no additional benefit for these algorithms. Meanwhile, 2lvl-sweep and, to a lesser extent, OS are limited less by the available memory bandwidth than by the latency of memory accesses. As a result, they continue scaling well, even for the highest thread counts.

Queries
We performed strong and weak scaling experiments for queries (Figure 5) as well as throughput measurements for different sample sizes (Figure 6). Besides uniform random variates, we use random permutations of the weights {1^-s, 2^-s, ..., n^-s} for a parameter s to obtain more skewed "power-law" distributions.
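Generating such a skewed input can be sketched as follows; the function name and interface are ours. The random permutation prevents the heavy items from being clustered at the front of the array.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Sketch of the skewed input generation described above: weights i^-s
// for i = 1..n, randomly permuted (names are our assumptions).
std::vector<double> powerLawWeights(size_t n, double s, uint64_t seed) {
    std::vector<double> w(n);
    for (size_t i = 0; i < n; ++i)
        w[i] = std::pow(static_cast<double>(i + 1), -s);
    std::shuffle(w.begin(), w.end(), std::mt19937_64(seed));
    return w;
}
```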
Scaling. First, consider the scaling experiments of Figure 5. These experiments were conducted on the Intel machine, as its highly non-uniform memory access characteristics highlight the differences between the algorithms. All speedups are given relative to sampling sequentially from an alias table. The strong scaling experiments (Figures 5a and 5b) deliberately use a small sample size to show scaling to low per-thread sample counts (≈ 64 000 for 158 threads). We can see that all algorithms have good scaling behavior. Hyper-threading (marked "HT" in the plots) yields additional speedups, as it helps to hide memory latency. This already shows that sampling is bound by the latency of random memory accesses. Sampling from PSA and 2lvl is done completely independently by all threads, with no interaction apart from reading from the same shared data structures. Because the Intel machine has four NUMA zones, most queries have to access another NUMA node's memory. This limits the speedups achievable with PSA and 2lvl.
On the other hand, OS and OS-ND have a shared top-level sample assignment stage, after which threads only access local memory. This benefits scaling, especially on NUMA machines. As a result, OS-ND achieves the best speedups, despite this benchmark producing very few samples with multiplicity greater than one (Figure 5a, sample size is 1 % of input size). In contrast, deduplication in the base case of OS has significant overhead, making it roughly 25 % slower than sampling from an alias table for such inputs, even sequentially.
The weak scaling experiments of Figures 5c and 5d show even better speedups because many more samples are drawn here than in our strong scaling experiment, reducing overheads. Sampling from a classical alias table (PSA) achieves a speedup of 40 here, again limited by memory accesses rather than computation. Meanwhile, the output-sensitive methods (OS, OS-ND) reap the benefits of accessing only local memory.

Using √n second-level tables as elementary objects in the big data tool ensures load balancing and fault tolerance; a replicated meta-table of size √n can be used to assign samples to groups. We can probably improve the work bound of WRS-B by exploiting integer keys for the distributed priority queue. We could also study additional variants of reservoir sampling, e.g., in a sliding window setting or sampling with replacement.
Procedure sweepingAliasTable(w_1, ..., w_n, b : Array of w : R × a : N)
  W := Σ_i w_i                        -- total weight
  i := min {k > 0 : w_k ≤ W/n}        -- first light item
  j := min {k > 0 : w_k > W/n}        -- first heavy item
  w := w_j                            -- residual weight of current heavy item
  while j ≤ n do
    if w > W/n then                   -- Pack a light bucket.
      b[i].w := w_i                   -- Item i completely fits here.
      b[i].a := j                     -- Item j fills the remainder of bucket i.
      w := (w + w_i) − W/n            -- Update residual weight of item j.
      i := min {k > i : w_k ≤ W/n}    -- next light item, assume w_{n+2} = 0
    else                              -- Pack a heavy bucket.
      b[j].w := w                     -- Now item j completely fits here.
      b[j].a := j := min {k > j : w_k > W/n}  -- next heavy item, assume w_{n+1} = ∞
      w := (w + w_j) − W/n
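For concreteness, the sweeping construction above can be translated to C++ roughly as follows (0-based indices; a sketch, not the paper's implementation). It assumes exact arithmetic; the extra `i < n` guard and the trailing loop absorb the boundary cases that the pseudocode handles via sentinels, and a production version would need more care with floating-point drift.

```cpp
#include <cstddef>
#include <vector>

struct Bucket {
    double w;   // weight share of the bucket's own item
    size_t a;   // alias: item filling the remainder of the bucket
};

// Sketch of the sweeping alias table construction (0-based indices).
std::vector<Bucket> sweepingAliasTable(const std::vector<double>& wt) {
    const size_t n = wt.size();
    double W = 0;
    for (double x : wt) W += x;
    const double cap = W / n;  // bucket capacity W/n

    auto nextLight = [&](size_t k) { while (k < n && wt[k] > cap) ++k; return k; };
    auto nextHeavy = [&](size_t k) { while (k < n && wt[k] <= cap) ++k; return k; };

    std::vector<Bucket> b(n, Bucket{0.0, 0});
    size_t i = nextLight(0);         // first light item
    size_t j = nextHeavy(0);         // first heavy item
    double w = (j < n) ? wt[j] : 0;  // residual weight of current heavy item

    while (j < n) {
        if (w > cap && i < n) {      // pack a light bucket
            b[i].w = wt[i];          // item i completely fits here
            b[i].a = j;              // item j fills the remainder
            w = (w + wt[i]) - cap;   // update residual of item j
            i = nextLight(i + 1);
        } else {                     // pack a heavy bucket
            b[j].w = w;              // item j's residual fits here
            size_t jn = nextHeavy(j + 1);
            b[j].a = (jn < n) ? jn : j;  // next heavy item fills the rest
            j = jn;
            if (j < n) w = (w + wt[j]) - cap;
        }
    }
    // Remaining light items (if any) fill their own buckets exactly.
    for (; i < n; i = nextLight(i + 1)) { b[i].w = wt[i]; b[i].a = i; }
    return b;
}
```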

Figure 1
Figure 1 Parallel alias tables: splitting n = 13 items into two parts of sizes n' = 7 and n − n' = 6.

Theorem 3.
If items are randomly distributed over the PEs initially, it suffices to redistribute O(log p) items from each PE such that afterwards each PE has total weight O(W/p) in expectation and O(n/p + log p) (pieces of) items. This redistribution takes expected time O(α log² p) when supporting deterministic queries (Theorem 2-(2)) and expected time O(α log p) using rejection sampling (Theorem 2-(3)).

Theorem 4.
Bringmann and Larsen's n/r + o(n)-bit data structure can be built using O(n) work and O(log n) span, allowing queries in expected time O(r).

Theorem 16.
When items are distributed randomly, subset sampling (Problem WRS-S) can be done in a communication-free way with expected preprocessing overhead O(n/p + log p) and expected sampling time O(W/p + log p).
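A standard building block for such output-sensitive subset sampling is geometric skipping within a group of items that share one inclusion probability: instead of flipping one coin per item, we jump directly from hit to hit, so the work is proportional to the number of sampled items. The following sketch (names and interface are ours, not taken from the paper) illustrates this for a single group.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Output-sensitive subset sampling for a group of n items that all
// have inclusion probability p: geometric skips jump directly to the
// next sampled index (a sketch; names are our assumptions).
std::vector<size_t> sampleGroup(size_t n, double p, std::mt19937_64& rng) {
    std::vector<size_t> hits;
    // Number of failures before the next success.
    std::geometric_distribution<size_t> skip(p);
    for (size_t i = skip(rng); i < n; i += skip(rng) + 1)
        hits.push_back(i);
    return hits;
}
```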

Theorem 17.
For WRS-B with sample size k, consider mini-batches consisting of up to b elements per PE. Processing such a mini-batch is possible in O(b + b* log(b* + k) + α log²(kp)) time, where b* ≤ b is the maximum number of items from the mini-batch on any PE that is below the insertion threshold. If the item weights are positive and independently drawn from a common continuous distribution, every PE processes n/p elements, and all batches have b elements per PE, then this algorithm inserts O((k/p)(1 + log(n/k))) items into each local reservoir in expectation.
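For intuition about the insertion threshold, the following is a minimal single-threaded sketch of key-based weighted reservoir sampling in the style of Efraimidis and Spirakis (each item gets key u^(1/w); the k largest keys form the sample, and the smallest key in the reservoir acts as the threshold). The interface and names are ours; the paper's distributed algorithm maintains this reservoir across PEs with a distributed priority queue.

```cpp
#include <cmath>
#include <cstdint>
#include <functional>
#include <queue>
#include <random>
#include <utility>
#include <vector>

// Single-threaded sketch of key-based weighted reservoir sampling:
// keep the k items with the largest keys u^(1/w) in a min-heap; the
// heap's top is the insertion threshold.
std::vector<size_t> weightedReservoir(const std::vector<double>& w,
                                      size_t k, uint64_t seed) {
    std::mt19937_64 rng(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    using Entry = std::pair<double, size_t>;  // (key, item index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<>> heap;
    for (size_t i = 0; i < w.size(); ++i) {
        double key = std::pow(u(rng), 1.0 / w[i]);
        if (heap.size() < k) {
            heap.push({key, i});
        } else if (key > heap.top().first) {  // above insertion threshold
            heap.pop();
            heap.push({key, i});
        }
    }
    std::vector<size_t> sample;
    while (!heap.empty()) { sample.push_back(heap.top().second); heap.pop(); }
    return sample;
}
```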

Figure 4
Figure 4 Strong (top and middle) and weak (bottom) scaling evaluation of parallel alias table construction techniques. Strong scaling with input sizes n = 10^8 (top) and n = 10^9 (middle), weak scaling with n/p = 10^7. Speedups are measured relative to our optimized implementation of Vose's method (Algorithm 1, Section 2.2).

Figure 5
Figure 5 Query strong and weak scaling for n = 10^9 input elements. Sample size for strong scaling s = 10^7, per-thread sample size for weak scaling s/p = 10^6. All speedups are relative to sequential sampling from an alias table. Intel machine.
Assuming that O(n/p) elements are allocated to each PE, we can sample a single item in time O(α) after preprocessing a 2-level alias table, which can be done in time O(n/p) plus the following communication overhead:
(1) βp + α log p with replicated preprocessing,
(2) α log² p expected time using the algorithm from Section 4.2,
(3) α log p with only expected time bounds for the query.
Proof. Building the local alias tables takes time O(max_i |E_i|) = O(n/p) sequentially. For Equation (
[…] 32-core AMD EPYC 7551P CPU (64 hyper-threads, of which we use up to 62) and 256 GiB of DDR4-2666 RAM. While single-socket, this machine also has non-uniform memory access (NUMA), as the CPU consists of four dies internally. Both machines run Ubuntu 18.04. All implementations are in C++ and compiled with GNU g++ 8.2.0 (flags -O3 -flto). Our measurements do not include time spent on memory allocation and mapping.
[…] (Algorithm 2) as base case. For OS, we use an additional optimization that aborts the tree descent and uses the base case bucket table when fewer than 128 samples are to be drawn from at least half as many items. The resulting elements are then deduplicated using a hash table to ensure that each element occurs only once in the output. A variant without this deduplication is called OS-ND and may be interesting if items may be returned multiple times. We also implemented sequential versions of both alias table construction methods. Both machines require some degree of NUMA awareness in memory-bound applications like ours. Thus, in our parallel implementations, all large arrays are distributed over the available NUMA nodes, and threads are pinned to NUMA nodes to maintain data locality.