===========================================================================
ICFP 2012 Review #60A
Updated Monday 23 Apr 2012 2:03:23pm CDT
---------------------------------------------------------------------------
Paper #60: Work Efficient Higher Order Vectorisation
---------------------------------------------------------------------------

Reviewer expertise: X. Expert

===== Paper summary =====

The flattening transformation is a classic technique for transforming nested data-parallel applications into flat data-parallel ones. But this transformation has a well-known problem: when a less-nested piece of data is used in conjunction with each element of a more-nested piece of data, the less-nested piece is distributed (alternatively, replicated), potentially causing a growth in both space and work. Traditionally this has been a footnote, because in first-order NESL it is obvious when this happens and straightforward to avoid. In Data Parallel Haskell, however, vectorisation (flattening applied to a higher-order language with recursive datatypes) makes it far more difficult to avoid this problem when writing idiomatic code. This paper provides a solution that avoids the growth in work through a change in the representation of the parallel array structure and in the implementation of the distribute/replicate operations. Instead of flattening all of the data into a single block and adding a segment descriptor, as is traditionally done in the flattening/vectorisation transformation, the authors allow the data to remain in separate chunks and instead add a nesting structure to the parallel arrays. (A minimal sketch of the kind of program that triggers the replication problem appears below, after my general comments.)

===== Comments for author =====

This approach looks promising and worthy of investigation. But based on the support presented in the paper, I am not convinced that it provides a solution that will work in practice, and I recommend that this paper be rejected. I believe that this approach successfully removes the asymptotic blow-up associated with data replication, but I have two major concerns with the new representation as presented:

1) Given the very small number of benchmarks presented and the absence of additional information about the input data and measurements, I cannot tell whether this approach actually performs in the way that the plots in Figure 6 suggest. Shifting from a single flat vector representation to a blocked representation (with potential layout conversions) introduces serious potential memory-performance issues that are not discussed in Section 7.

2) The purpose of the flattening/vectorisation transformation is to enable parallelism. This representation change and the associated operations will have some impact on parallel execution and speedups, but there is no evaluation in that context.

More detailed comments and questions follow.

I was unable to work out whether and when the concat operation will actually be applied, turning the potentially chunky blocks back into a flat structure. Section 5.2 seems to imply that every source-level mapP will result in a concat/unconcat pair in the vectorised code, but much of the rest of the discussion in the paper is about operations on the blocked representation. Since the performance of DPH relies so heavily on locality (and, in the future, on parallelism), this topic seems important and worth clarifying. In particular, in Section 7 you state that your "underlying primitives ... operate on bulk arrays and are amenable to parallelisation," which your new representation seemingly will not support if the blocks are not concatenated.
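To make the replication problem described in my summary concrete, here is a minimal sketch of the kind of program involved. Ordinary Haskell lists stand in for DPH parallel arrays, and the function and its name are illustrative rather than taken from the paper:

    -- Illustrative sketch only: plain lists stand in for DPH parallel arrays,
    -- and 'retrieve' is a made-up example, not code from the paper.

    -- 'table' is the less-nested value; because it is mentioned inside the
    -- inner map, a naive flattening transformation replicates it once per
    -- outer segment before running the lifted inner computation.
    retrieve :: [Int] -> [[Int]] -> [[Int]]
    retrieve table ixss = map (\ixs -> map (\i -> table !! i) ixs) ixss

    -- With m outer segments and a table of length n, replicating the table
    -- costs O(m * n) work and space, even though the source program only
    -- does work proportional to the total number of indices looked up.
    main :: IO ()
    main = print (retrieve [10, 20, 30, 40] [[0, 1], [3], [2, 2, 0]])
    -- prints [[10,20],[40],[30,30,10]]

As I understand it, the paper's fix is to make the replicate that vectorisation introduces cheap, by sharing the table through the array representation rather than copying it once per segment.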
The following two paragraphs assume that parallel programs such as those presented in the paper _may_ execute over the chunked flat array representation.

How badly chunked can the partially-flattened array blocks get? Without a benchmark like quicksort, which has both small and large segments at different points in its execution, it is hard to tell whether this transformation will perform well. What are the characteristics of the sparse matrix? If it is excessively sparse (very few elements per row, which would presumably lead to many blocks in your parallel array representation), dense, or mixed, do you still get results as close to the hand-written Data.Vector version? Depending on how your allocator / garbage collector orders the blocks, I would expect that you could end up with wildly different memory access times for small blocks scattered across pages versus a sequential run down a single flat block. I have a similar question about the distribution of points for Barnes-Hut.

If the programs may _not_ execute over chunked flat arrays (and all executions take place only on flattened arrays), then is parallel stream fusion sufficient to remove all of these conversions within core loops? Or, if some of these conversions remain, is there support for performing them in parallel to avoid linear sequential bottlenecks?

Finally, what operations do the benchmark timings include? Do they also include any related operations that convert to or from this flattened structure, or that normalize it? Or are the timings just of the core inner-loop operations, after all data has been loaded into the new parallel array representation (and concatenated, if that is done)?

===========================================================================
ICFP 2012 Review #60B
Updated Wednesday 2 May 2012 7:05:56am CDT
---------------------------------------------------------------------------
Paper #60: Work Efficient Higher Order Vectorisation
---------------------------------------------------------------------------

Reviewer expertise: Y. Knowledgeable outsider

===== Paper summary =====

This paper is a step---and quite an important one---in the development of Data Parallel Haskell, an implementation of Nested Data Parallelism. The problem it solves is that the vectorising transformation does not preserve the work complexity of the programs it transforms: data in nested arrays must at times be replicated, for example turning linear algorithms into quadratic ones. This problem has been known for over a decade, and it leads to an explosion in the running time of some DPH benchmarks.

The solution is---as usual---an extra level of indirection. The array representation is changed to introduce "virtual segments" which are mapped to physical segments, enabling the data in a nested array to be distributed across many blocks rather than collected into one contiguous vector (I give a toy sketch of my reading of this representation below). This sounds simple, but it is not: the implementation of the DPH operations is intricate and, to be honest, not that easy to follow---the paper demands very careful reading. In fact, only part of the implementation is described in detail; there are many references to an accompanying technical report for further details. This makes the paper not entirely self-contained; one wonders whether the authors are trying to describe too much in one paper.

The paper claims that the resulting implementation has the same work complexity as a "direct" implementation of the same operations using an unflattened representation of arrays as vectors of pointers.
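Before going further, here is my own rough reconstruction of the representation just described, with invented type and field names; it is meant only to convey the sharing idea, not to reproduce the paper's actual definitions:

    -- Toy model of the idea as I understood it (names are mine, not the
    -- paper's): virtual segments indirect through physical segments, which
    -- point into one of several data blocks, so replicating a segment never
    -- copies element data.
    data Nested a = Nested
      { vsegids    :: [Int]    -- virtual segment -> physical segment
      , pseglens   :: [Int]    -- length of each physical segment
      , psegstarts :: [Int]    -- starting offset of each physical segment
      , psegblocks :: [Int]    -- which data block each physical segment uses
      , blocks     :: [[a]]    -- the element data, possibly in many chunks
      }

    -- Replicating one segment n times touches only the virtual-segment ids;
    -- the element data is shared, so the cost is O(n) rather than O(n * len).
    replicateSeg :: Int -> Int -> Nested a -> Nested a
    replicateSeg n seg arr = arr { vsegids = replicate n (vsegids arr !! seg) }

    -- Expanding back to a plain nested list (the "old" view of the array).
    expand :: Nested a -> [[a]]
    expand arr =
      [ take len (drop start (blocks arr !! blk))
      | v <- vsegids arr
      , let len   = pseglens   arr !! v
            start = psegstarts arr !! v
            blk   = psegblocks arr !! v
      ]

    main :: IO ()
    main = print (expand (replicateSeg 3 0 example))
      where
        example = Nested [0, 1] [2, 3] [0, 2] [0, 0] [[1, 2, 3, 4, 5]]
        -- prints [[1,2],[1,2],[1,2]]: three virtual copies of one shared segment

If that reconstruction is roughly right, then the intricacy the paper wrestles with lies in keeping such indirected arrays efficient for all the other operations, and in flattening the blocks back out when locality matters.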
The paper argues informally for this complexity claim as each operation is described, but there is no proof. Instead, three benchmarks are presented in which the old DPH implementation blew up, exhibiting worse complexity than a sequential implementation using Vectors, while the new one runs with the same complexity but a larger constant factor. The benchmarks measure only single-threaded performance, which is appropriate for comparing work complexity, but I would also have liked to know how the new library scales on multiple cores---it is a shame there are no parallel benchmark results.

I believe nested data parallelism will be important, if it can be made to work, and this paper addresses a fundamental problem, probably a show-stopper. And yet... the benchmarks also show that the new DPH implementation performs (in two out of three cases) an order of magnitude *worse* than the reference sequential implementation. That is a big constant factor to try to recover through parallelism. Why is the absolute performance so poor? Perhaps because, while the old DPH implementation used two auxiliary vectors of book-keeping data per array, the new one uses five. One is left fearing that the new idea is simply too expensive to be practical. Thus this paper is no more than a step on the way to practical nested data parallelism.

===== Comments for author =====

The new array representation (Figure 3) is quite complex, and it took me a long while to understand its intent. I like your examples, but I would also like a *formal* specification of the meaning of the new representation via an abstraction function mapping the new representation to the old. Studying the definition of that abstraction function would, I think, have enabled me to understand the new representation more easily. I realise these are probably your "demotion" functions, discussed later---but you do not include the definitions of these functions anywhere, and I would like to see them at the same time as you introduce the new representation. It did not help to omit the indices field from the diagram, by the way---that just made me wonder what it should be.

If you defined the abstraction function, you could use it in QuickCheck properties to relate the new operations on arrays to the old. This would be a useful check on the correctness of your---quite complex---implementations. (A toy sketch of the kind of property I have in mind appears at the end of this review.)

At the beginning of Section 2, "shared across" makes no sense to me---what are you saying here?

In Section 4.3 you have been sloppy with your types: replicates, which was introduced earlier with a first parameter of type PA Int, now appears with a first parameter of type Vector Int. Then you say you lied, and give it a type in which the first two parameters are scalars rather than vectors of any kind (bottom of page 4, in the definition of replicatesI). By the time you get to replicatesPA a few lines further on, it takes a Vector again---and now a PData as the second argument, rather than a PA. This is EXTREMELY CONFUSING! What on earth is going on?

At the top of page 6, you use ++, map and length from the Vector library without mentioning that they are NOT the ones from the Prelude. You will help the reader if you point that out explicitly.

I note that even empty segments refer to a block and an offset. Does this mean, when you come to cull blocks, that an empty segment referring to a block can cause it to be retained? Does this matter for your complexity claims?

p10: "unrunable" should be "unrunnable".

p11, first line: "were" should be "where".
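To illustrate the kind of QuickCheck property I mean, here is a toy sketch; the miniature representation, the abstraction function and the operation are all invented for the purpose, not taken from the paper:

    -- Toy sketch only: 'New' stands in for the paper's representation,
    -- 'abstract' for the abstraction (or "demotion") function, and
    -- 'appendNew' for one of the library operations being checked.
    import Test.QuickCheck

    data New a = New { vsegs :: [Int], psegs :: [[a]] } deriving Show

    -- Abstraction function: the sharing representation viewed as a plain
    -- nested list, i.e. the "old" meaning of the array.
    abstract :: New a -> [[a]]
    abstract (New vs ss) = map (ss !!) vs

    -- A new-representation operation ...
    appendNew :: New a -> New a -> New a
    appendNew (New vs1 ss1) (New vs2 ss2) =
      New (vs1 ++ map (+ length ss1) vs2) (ss1 ++ ss2)

    -- ... related to the old-representation operation via 'abstract'.
    prop_append :: New Int -> New Int -> Property
    prop_append a b = abstract (appendNew a b) === abstract a ++ abstract b

    -- Generate only well-formed values: every virtual-segment id must
    -- point at an existing physical segment.
    instance Arbitrary a => Arbitrary (New a) where
      arbitrary = do
        ss <- listOf1 arbitrary
        vs <- listOf (choose (0, length ss - 1))
        pure (New vs ss)

    main :: IO ()
    main = quickCheck prop_append

Even a handful of such properties, one per operation, would go a long way towards convincing the reader that the rather intricate definitions in Sections 4 and 5 mean what they are claimed to mean.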
===========================================================================
ICFP 2012 Review #60C
Updated Saturday 5 May 2012 11:55:29am CDT
---------------------------------------------------------------------------
Paper #60: Work Efficient Higher Order Vectorisation
---------------------------------------------------------------------------

Reviewer expertise: Y. Knowledgeable outsider

===== Paper summary =====

Existing transformations used to "flatten" nested data parallelism may not preserve computational complexity, but may instead asymptotically increase the amount of computational work required. This paper identifies the reason for this flaw and its close connection with the target representation of arrays. The paper then describes in detail a new approach based on a different representation, including (single-threaded) results from a preliminary implementation.

===== Comments for author =====

I enjoyed reading your paper. I found the explanations of the opening sections (1-3) clear and pleasant to read, with well-chosen illustrative examples. Just one typo:

p2 "the type PA is [an -> a] generic representation"

Though I found Sections 4-5 harder going, as they go into quite a lot of technical detail about your new approach, again I was impressed by the clear exposition. A few minor points:

p3 "we refer to all of VSegd, SSegd and Segd ``segment descriptors''" Apart from the missing "as", there is an ambiguity: do you mean all severally or all collectively?

p4 It would be helpful to keep the data invariant as concise as possible. I suggest moving the motivating observations "This ensures that ..." out of clauses 2 and 3 of the invariant itself, and into the following discussion (where other similar observations are already made).

p4 The naming convention for identifiers is not always clear. PR is first used before it is defined. The signatures for "replicates" and "replicatesPR" seem to be those of other replicate functions and are inconsistent with Figure 5.

p5 "replicating array from Figure 3 again" Missing "the". But I also found "again" confusing, as it does not refer to a repetition of what has gone before. How about "an application of |replicates| to the array represented in Figure 3"?

p6 I felt a "reading discontinuity" on turning the page. Perhaps the definitions of appendPR et al. could be made into a captioned figure?

p7 I was a bit confused by the paragraph starting "Unfortunately, ...". The key claim is "does not worsen complexity". In Section 1 you clearly state as a contribution that under your approach "vectorised programs maintain the asymptotic work complexity of the originals". Surely "the originals" here can only mean source programs before vectorisation, or else the statement is a tautology? But now this "Unfortunately" paragraph seems to shift the goal-posts by continuing "... worsen the complexity of *vectorised* programs" (my emphasis). Please make things clear in both places. (A similar issue recurs in Section 5.2 with "does not worsen the complexity ... compared with the baseline representation".)

p9 "Whereas the sequential version ... the flattened version ...". I found "sequential" distracting. Perhaps just "unflattened" would be clearer?

p9 "and returns [on] only those elements"

p9 "Luckily, the containment problem is rarely met in practice." Can programmers tell easily whether their programs are not "contained", or is it a property that may only emerge in intermediate forms of programs during compilation?

Section 6 offers some useful insights.
Just one puzzling sentence:

p10 "The problem is that this rule improves the asymptotic complexity of the program which turns out to be a bad thing engineering wise." Surely an improvement in asymptotic complexity is a "good thing"? The "bad thing", as the sequel explains, is that in the context of other transformations this rule may rarely fire.

It is nice to have the experimental results in Section 7.

p11 "programs [were -> for which] the baseline"

p11 "single-threaded mode only" ... and you say good results for parallel computations depend on getting fusion to work with the new array representation. How certain is it that this can be done? Is it routine but laborious, or are there as yet unsolved problems? The results you give show quite large constant-factor overheads for the new array representation, so its ultimate advantage depends on significant performance gains from parallel execution on multiple threads.

Section 8 summarises related work, but what about overall conclusions?

p12 Omit "to name a few"? (No-one is fooled; two is two!)

p12 "equivalent to disallowing [functions to have free variables -> free variables in function bodies]"?

p12 "contents of free variables [is -> are] copied"

p12 "in [14] [he -> the authors] present[s]"

p12 Substitute for "TBA" in ref [14]

===========================================================================
ICFP 2012 Review #60D
Updated Sunday 6 May 2012 11:18:41pm CDT
---------------------------------------------------------------------------
Paper #60: Work Efficient Higher Order Vectorisation
---------------------------------------------------------------------------

Reviewer expertise: Z. Passing familiarity with the area

===== Paper summary =====

This paper describes a new approach to the flattening of nested parallelism that avoids the potential increase in computational time complexity of earlier approaches. The approach involves a change in the representation of nested arrays that allows result subarrays to be shared, with various techniques to maintain a flat layout when possible so as to avoid loss of data locality.

===== Comments for author =====

This seems like an important step forward, but I found the presentation difficult to follow, and hence to verify, starting with Section 4, possibly due to the inherent complexity of the mechanism. Indeed, the mechanisms described in this paper to shift between flattened and partially flattened representations are complex enough that proofs of correctness and of the preservation of work complexity should really be provided.

It would be nice to have one or more benchmarks that do not involve a worsening of complexity, to show whether there is any significant overhead for the new array representation versus Baseline DPH. It is unfortunate that benchmarks are not available for a parallel implementation.

A nod to Riely and Prins [18] should be given when you introduce your modified array structure, given the similarity of their approach. It would also be nice, and should not be difficult, to back up your claim of superiority to [18] in the related work section by benchmarking a version of your implementation that always uses vectors of references.

In Section 1, paragraph 1, remove the word "vastly" in "vastly more expressive" unless you have some way to justify the claim. The same goes for "far" in "far easier to implement".
In Section 1, paragraph 2, "only guarantees to preserve" => "guarantees to preserve only".

Please clarify what you mean by "parallel depth complexity" and "parallel work complexity" so the reader doesn't have to puzzle it out. (For reference, I note the standard reading at the end of this review.)

In Section 7, first paragraph, "programs were the" should read "programs where the".
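For whatever it is worth, the standard reading in the nested-data-parallelism literature (this is general background, not a quotation from the paper) is that work W is the total number of primitive operations a program performs, and depth (or span) D is the length of its longest chain of sequentially dependent operations, so that on P processors a greedy schedule runs in time T_P bounded by the Brent-style inequalities:

    % Standard work/depth scheduling bound (background, not from the paper).
    \[
      \max\!\left(\frac{W}{P},\, D\right) \;\le\; T_P \;\le\; \frac{W}{P} + D .
    \]

A work-efficient vectoriser is then one that preserves W asymptotically, which is exactly the property this paper is concerned with; spelling the definitions out in Section 1 would save readers the puzzling.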