===========================================================================
                           ICFP 2010 Review #78A
                Updated Tuesday 18 May 2010 4:42:08pm EDT
---------------------------------------------------------------------------
  Paper #78: Regular, shape-polymorphic, parallel arrays in Haskell
---------------------------------------------------------------------------

Overall merit: B. OK paper, but I will not champion it.
Expertise:     Y. I am knowledgeable in the area, though not an expert.

                         ===== Paper summary =====

This paper presents Repa, a Haskell library for parallel arrays. The approach is purely functional, supports shape polymorphism (the same code can operate on arrays of different dimensions), and supports implicit parallel execution. Parallelism comes from a delayed array representation in which forcing/requesting one element triggers the computation of all elements of the array in parallel. A performance evaluation of the library is also provided.

                ===== Paper strengths and weaknesses =====

Strengths:
-- Nice use of shape polymorphism and laziness in providing parallel operations on arrays.
-- An implementation and some experimental results.

Weaknesses:
-- No major conceptual improvement over the state of the art.
-- Not clear whether the approach is flexible enough to support more complex matrix calculations; e.g., can Strassen's matrix multiply be implemented? How general is the approach?
-- Relatively limited empirical results, involving just a few benchmarks and small inputs, and no comparison to other parallel (e.g., data-parallel) approaches.

It is nice to have a library for parallel operations on arrays. It is even better when the library supports shape polymorphism, essentially allowing the same code to operate on arrays of different shapes (ranks, dimensions, etc.). This is what the paper achieves. Furthermore, it also provides optimizations that eliminate temporary arrays through the use of index functions and loop fusion, so that results may be computed using the input arrays directly.

Parallelism comes from a delayed representation of arrays that forces all elements of an array to be computed when any one element is requested/forced. This enables support for implicit parallelism: the programmer does not need to state explicitly which operations run in parallel and which do not; through the use of purely functional programming and laziness, this can be figured out.

My concern is exactly that, i.e., that parallelism is implicit. The problem is that the programmer has no control over the amount of parallelism or over which tasks are executed in parallel. I worry that performance will suffer. For example, we often want to express matrix operations in a divide-and-conquer style because, when paired with the right scheduler, they generate just the right amount of parallelism and have good data locality. In the proposed approach we are at the mercy of the library, which can generate an excessive amount of parallelism with poor locality, e.g., one task for each element of the output, each involving a large row and a column (in matrix multiply).
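To make the granularity concern concrete, here is my own rough reconstruction of the delayed-array idea, simplified to one dimension (the names, types, and the one-spark-per-element behaviour are my assumptions, not the paper's code): forcing materialises every element as its own parallel task, which is exactly the one-task-per-output-element behaviour I worry about for matrix multiply.

    -- My rough reconstruction (not the paper's code) of the delayed
    -- representation, simplified to one dimension.
    import Control.Parallel.Strategies (parMap, rseq)

    -- A delayed array is just an extent plus an index function.
    data Delayed e = Delayed { extentD :: Int, indexD :: Int -> e }

    -- Forcing materialises every element, one parallel spark per element.
    forceD :: Delayed e -> [e]
    forceD (Delayed n f) = parMap rseq f [0 .. n - 1]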
I would have liked to be convinced otherwise, but the experiments are limited. They consider only a few simple benchmarks with small input sizes, comparing only to sequential C programs. Since the input sizes are small, locality and too-much-parallelism effects are not visible. This is hardly surprising. Also, a C version of FFT is easy to find, so why not include it? For example, it would have been interesting to see how the approach compares to FFTW (parallel). Since the paper does not offer comparisons with other parallel approaches, e.g., data-parallel ones, it is not clear whether the approach can be effective.

Another concern is expressiveness. What kinds of algorithms can I express? Can I express recursive or divide-and-conquer algorithms? For example, can I express Strassen's matrix multiplication, which is naturally expressed with recursion? I would have liked to see a discussion of the expressiveness of the approach.

                      ===== Comments for author =====

Page 2, Figure 1. Maybe provide the signature for replicate, as you use this function for defining mmMult.

Page 3, DArray seems undefined.

Page 7, last sentence. "is stil" -> "is still"


===========================================================================
                           ICFP 2010 Review #78B
               Updated Saturday 22 May 2010 2:16:06am EDT
---------------------------------------------------------------------------
  Paper #78: Regular, shape-polymorphic, parallel arrays in Haskell
---------------------------------------------------------------------------

Overall merit: A. Good paper. I will champion it at the PC meeting.
Expertise:     X. I am an expert in the subject area of this paper.

                         ===== Paper summary =====

This is a paper about expressing regular array computations in a manner that allows GHC to compile them to very efficient code that can take advantage of parallelism. The code is somewhat idiosyncratic, but the performance speedup results are impressive.

                ===== Paper strengths and weaknesses =====

Strengths:
* The API given is purely functional.
* The performance results given are embarrassingly impressive.

Weaknesses:
* I found the notation for the dimensions quite strange.

                      ===== Comments for author =====

This is a strong paper, and classic functional programming! Bringing together a number of recent innovations in FP (associated data types, stream fusion, GHC rewrite rules, etc.) seems to have finally pinned down the perpetual promise of parallelism.

I liked the initial examples, which bring the reader quickly up to speed with the ideas. You make no apologies for the strange syntax you use. One small suggestion is to stop the code running over a page/column. Do you have a proposal for better syntax for your dimension descriptions (assuming changes could be made)?

The double construction in the representation of Array is a pattern we see again and again. Perhaps there is something more general going on here?

Why is the map in section 4.2 not just fmap? (The constraints from Shape sh?) See also the small sketch at the end of my comments.

In section 4.4, you seem to want some form of sized type, and sized-type polymorphism. How does this system compare to the many attempts at using sized types as indices?

I loved the way the rewrite rules play into the library in section 6.

You may want to include some discussion about the tools and techniques that could be used to performance-debug array-based computations.

Bring out the tool for achieving parallelism, force, in a more explicit way. From reading the text, force is the only way to get parallelism. How does force decide how to split the work, for example?

I found the justification of the "too perfect" graph in section 8.2 amusing.
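To spell out my fmap question above, here is a toy sketch (my own definitions, not the paper's): a map that needs constraints on the element types, for example to build an unboxed manifest result, cannot be packaged as fmap, whose type is unconstrained in both element types; a Shape constraint on the shape alone would not be a blocker, since it could sit on an instance head.

    -- My own toy sketch, not the paper's definitions.
    import qualified Data.Vector.Unboxed as U

    -- Hypothetical manifest array over unboxed vectors.
    data UArray sh e = UArray sh (U.Vector e)

    -- The Unbox constraints on the element types mean this cannot be the
    -- fmap of a Functor instance, whose type is fully polymorphic in a, b.
    umap :: (U.Unbox a, U.Unbox b) => (a -> b) -> UArray sh a -> UArray sh b
    umap f (UArray sh v) = UArray sh (U.map f v)

    -- By contrast, a Shape constraint on the shape alone could live on an
    -- instance head (instance Shape sh => Functor (Array sh)), so it would
    -- be worth stating in the paper which constraint is the actual blocker.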
===========================================================================
                           ICFP 2010 Review #78C
                Updated Sunday 23 May 2010 12:24:21am EDT
---------------------------------------------------------------------------
  Paper #78: Regular, shape-polymorphic, parallel arrays in Haskell
---------------------------------------------------------------------------

Overall merit: B. OK paper, but I will not champion it.
Expertise:     X. I am an expert in the subject area of this paper.

                         ===== Paper summary =====

The paper presents concepts in Haskell for operations on regular arrays of variable dimension (shape polymorphism). The approach is based on Haskell with some recent type system extensions and on the data-parallel Haskell extensions in the latest (pre-release) GHC. Haskell types and data representations for zero-based, Int-indexed arrays of various ranks, together with the respective operations, are presented, laying out a theory of shape-generic array computations with hidden library parallelisation. Measurements suggest good performance when arrays are accessed in a regular row-wise fashion, profiting from a library for parallelisation (of which no details are given).

                ===== Paper strengths and weaknesses =====

Overall, this paper presents very nice, useful, and mostly self-contained work. Regular parallel arrays provide an intuitive programming interface, recent type system extensions are put to good use, and the performance figures are promising.

The paper is hard to understand for non-Haskellers. Some of the discussion requires considerable background, specifically on laziness issues (which are not discussed in the appropriate places; see the cursory comments).

One aspect of the title on which the paper does not give any detail is "parallel". We are told that arrays will be computed in parallel, and speedups are given, yet the technique behind the scenes remains unexplained (and it only performs well with a pre-release GHC). The claim "does not require any compiler support specific to its implementation" appears to be untrue (for performance).

Maybe due to last-minute changes, the presented code contains inconsistencies (also with the available library version). Maybe for the same reason, there are no conclusions in the paper. Some of the (self-)references are contrived and should be removed.

The paper is written in an elaborate, fluent style and is therefore very pleasing to read.

                      ===== Comments for author =====

Cursory comments

p.1 The goal of the cited work on byte-strings [7] (previous work by one of the authors) is clearly performance, not abstraction. The reference appears contrived; if anything, [6] should be referenced here.

(p.1 and throughout) The paper fails to clarify the relationship between the presented concepts and "Data Parallel Haskell", which is deeply woven into GHC (>= 6.13). On page 1, the claim "does not require any compiler support specific to its implementation" is untrue for performance considerations (or is it not?).

The last sentence at the bottom should be moved into sec.2 (where it belongs), to avoid the ugly white space on page 1.

p.2 The discussion after the example (before sec.3) is premature, and I am unsure whether non-Haskellers can understand it, since it discusses laziness without explaining it (this is done later in 3.1). "Very Bad Indeed" (capitalised) seems to be a common idiom and joke for the authors - but it seems misplaced to me as a reviewer.

p.3 ..requires us _to_ represent..

The first array type should be called _D_Array, as in the text that follows.
This is better than having two definitions of "Array" in the paper. Consider swapping the code and the (rephrased) explanation for "wrap", to better explain the forward references to "Shape".

In 3.2, again, a concise explanation of laziness (here: delay) is missing; it only follows (along with the example) in 3.3.

p.4 The type given to the crucial function "traverse" does not match its usage (throughout the paper! first in backpermute right below). Clearly the de-facto usage with the array to traverse as the last argument is preferable, but package repa-1.1.0.0 uses the other version.

4.1 could be reorganised as follows: start with the part "The extent of an array..", then introduce the representation as snoc lists, then discuss/mention Ix. Also, the paper should point at "instance Shape sh => Shape (sh:.Int)" as the main motivation for an extra data type (see the small sketch at the end of these comments).

p.5 ...an array of any rank n _>=_ 1, .. (or, in fact, it should maybe say "n > 0", Z being "zero")

...generalising the code..to higher rank_s_ ("rank generalisation").

In sum2, maybe use "swap_last" instead of the previously used "swap".

Consider merging 4.2 and the first part of 4.3. The part following the sum2 example discusses limitations, one of them being that repa does not perform bounds checks (a pathological shape-polymorphic "filter" version would type-check and would even work in checked contexts).

..calculated from __ the extent of.. (delete "on")

p.6 I got lost in the explanations of the Slice class, unable to match them with the code in Fig.4. Giving the instance involving "All" is the minimum needed to allow readers to figure out the functionality from its usage in mmMult. And the "real code" in repa-1.1.0.0 looks completely different!

The "asymptotic step complexity of log n" for specialised associative reductions is a purely theoretical claim. What you get in practice on a multicore is a linear factor. That is good enough; do not oversell!

p.7 .. general shuffle_ operations.. (no "s")

.. following function that extract_s_ the diagonal

In the code for "diagonal", a pattern match with an ordinary tuple instead of a shape is used on the result of "extent".

The traverse type in Table 1 matches the one at the beginning (and in repa-1.1.0.0), but not its use throughout the text. (The paper MUST match the library!)

p.8/9 Parallelism should be explained more: how exactly the implementation parallelises over the array, not only _that_ it does.

p.9 References for FFT in three dimensions and for Cooley-Tukey are missing.

The bad performance of the Laplace (stencil) benchmark is attributed very unspecifically to "bandwidth problems". As the code on p.8 suggests, the real problem is cache misses due to the non-row-wise access for the stencil. A blockwise parallelisation would surely do better.

p.10 If you do not have a C version of FFT3D, consider using FFTW.

Please give a more specific reference for the LLVM backend, or leave out the paragraph (it does not make any contribution).

..where we achieve _a_speedup_of_ 7.7 with 8 threads_,_ ..

p.11 Maybe [7] is the best reference for ByteString, yet it had been written for GHC by Brian O'Sullivan before.

The with-loop in SAC is part of a core language; it is not intended for programmer use but is generated by the compiler. Programmers will rather use standard operations (such as matrix arithmetic) overloaded in a shape-generic way in SAC (this would be useful in repa as well).

p.12 The paper is missing conclusions; one or two paragraphs would do.
In Ref 4, "sy_m_posium".

Ref [21] (from the same group as the authors) is unspecific and is referenced from the text only in a digression. It can be removed.
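Addendum to the comment on 4.1: the snoc-list shape representation as I understand it from the paper, in a minimal self-contained sketch (exact names, fixities, and class methods may differ from the library):

    {-# LANGUAGE TypeOperators, FlexibleInstances #-}

    -- Dimensions are built as snoc lists of Int indices.
    data Z = Z                          -- rank-0 shape
    data tail :. head = tail :. head    -- adds one dimension
    infixl 3 :.

    type DIM0 = Z
    type DIM1 = DIM0 :. Int
    type DIM2 = DIM1 :. Int

    class Shape sh where
      size :: sh -> Int                 -- number of elements in an extent

    instance Shape Z where
      size Z = 1

    -- This is the instance I single out above as the main motivation for
    -- introducing a dedicated data type rather than using plain tuples.
    instance Shape sh => Shape (sh :. Int) where
      size (sh :. n) = size sh * n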