===========================================================================
                           ICFP 2010 Review #78A
                Updated Tuesday 18 May 2010 4:42:08pm EDT
---------------------------------------------------------------------------
  Paper #78: Regular, shape-polymorphic, parallel arrays in Haskell
---------------------------------------------------------------------------

Overall merit: B. OK paper, but I will not champion it.
Expertise:     Y. I am knowledgeable in the area, though not an expert.

                         ===== Paper summary =====

This paper presents Repa, a Haskell library for parallel arrays. The approach is purely functional, supports shape polymorphism (the same code can operate on arrays of different dimensions), and supports implicit parallel execution. Parallelism comes from a delayed array representation in which forcing/requesting one element triggers the computation of all elements of the array in parallel. A performance evaluation of the library is also provided.

                ===== Paper strengths and weaknesses =====

Strengths:
-- Nice use of shape polymorphism and laziness in providing parallel operations on arrays.
-- An implementation and some experimental results.

Weaknesses:
-- No major conceptual improvement over the state of the art.
-- Not clear whether the approach is flexible enough to support more complex matrix calculations; e.g., can Strassen's matrix multiply be implemented? How general is the approach?
-- Relatively limited empirical results, involving just a few benchmarks and small inputs, and no comparison to other parallel (e.g., data-parallel) approaches.

It is nice to have a library for parallel operations on arrays. It is even better when the library supports shape polymorphism, essentially allowing the same code to operate on arrays of different shapes (ranks, dimensions, etc.). This is what the paper achieves. Furthermore, it also provides optimizations that eliminate temporary arrays through the use of index functions and loop fusion, so that results may be computed using the input arrays directly.

Parallelism comes from a delayed representation of arrays that forces all elements of an array to be computed when any one element is requested/forced. This enables support for implicit parallelism: the programmer does not need to state explicitly which operations run in parallel and which do not; through the use of purely functional programming and laziness, this can be figured out.

My concern is exactly that, i.e., that parallelism is implicit. The problem is that the programmer has no control over the amount of parallelism or over which tasks are executed in parallel. I worry that performance will suffer. For example, we often want to express matrix operations in a divide-and-conquer style because, when paired with the right scheduler, they generate just the right amount of parallelism and have good data locality. In the proposed approach we are at the mercy of the library, which can generate an excessive amount of parallelism with poor locality, e.g., one task for each element of the output, each involving a large row and a column (in matrix multiply).
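To make the granularity concern concrete, here is my own rough reconstruction of the delayed-array idea, simplified to one dimension (the names, types, and the one-spark-per-element behaviour are my assumptions, not the paper's code): forcing materialises every element as its own parallel task, which is exactly the one-task-per-output-element behaviour I worry about for matrix multiply.

    -- My rough reconstruction (not the paper's code) of the delayed
    -- representation, simplified to one dimension.
    import Control.Parallel.Strategies (parMap, rseq)

    -- A delayed array is just an extent plus an index function.
    data Delayed e = Delayed { extentD :: Int, indexD :: Int -> e }

    -- Forcing materialises every element, one parallel spark per element.
    forceD :: Delayed e -> [e]
    forceD (Delayed n f) = parMap rseq f [0 .. n - 1]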
I would have liked to be convinced otherwise, but the experiments are limited. They consider only a few simple benchmarks with small input sizes, comparing only to sequential C programs. Since the input sizes are small, locality and too-much-parallelism effects are not visible. This is hardly surprising. Also, a C version of FFT is easy to find, so why not include it? For example, it would have been interesting to see how the approach compares to FFTW (parallel). Since the paper does not offer comparisons with other parallel approaches, e.g., data-parallel ones, it is not clear whether the approach can be effective.

Another concern is expressiveness. What kinds of algorithms can I express? Can I express recursive or divide-and-conquer algorithms? For example, can I express Strassen's matrix multiplication, which is naturally expressed with recursion? I would have liked to see a discussion of the expressiveness of the approach.

                      ===== Comments for author =====

Page 2, Figure 1. Maybe provide the signature for replicate, as you use this function for defining mmMult.

Page 3, DArray seems undefined.

Page 7, last sentence. "is stil" -> "is still"


===========================================================================
                           ICFP 2010 Review #78B
               Updated Saturday 22 May 2010 2:16:06am EDT
---------------------------------------------------------------------------
  Paper #78: Regular, shape-polymorphic, parallel arrays in Haskell
---------------------------------------------------------------------------

Overall merit: A. Good paper. I will champion it at the PC meeting.
Expertise:     X. I am an expert in the subject area of this paper.

                         ===== Paper summary =====

This is a paper about expressing regular array computations in a manner that allows GHC to compile them to very efficient code that can take advantage of parallelism. The code is somewhat idiosyncratic, but the performance speedup results are impressive.

                ===== Paper strengths and weaknesses =====

Strengths:
* The API given is purely functional.
* The performance results given are embarrassingly impressive.

Weaknesses:
* I found the notation for the dimensions quite strange.

                      ===== Comments for author =====

This is a strong paper, and classic functional programming! Bringing together a number of recent innovations in FP (associated data types, stream fusion, GHC rewrite rules, etc.) seems to have finally pinned down the perpetual promise of parallelism.

I liked the initial examples, which bring the reader quickly up to speed with the ideas. You make no apologies for the strange syntax you use. One small suggestion is to stop the code running over a page/column. Do you have a proposal for better syntax for your dimension descriptions (assuming changes could be made)?

The double construction in the representation of Array is a pattern we see again and again. Perhaps there is something more general going on here?

Why is the map in section 4.2 not just fmap? (The constraints from Shape sh?) See also the small sketch at the end of my comments.

In section 4.4, you seem to want some form of sized type, and sized-type polymorphism. How does this system compare to the many attempts at using sized types as indices?

I loved the way the rewrite rules play into the library in section 6.

You may want to include some discussion about the tools and techniques that could be used to performance-debug array-based computations.

Bring out the tool for achieving parallelism, force, in a more explicit way. From reading the text, force is the only way to get parallelism. How does force decide how to split the work, for example?

I found the justification of the "too perfect" graph in section 8.2 amusing.
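To spell out my fmap question above, here is a toy sketch (my own definitions, not the paper's): a map that needs constraints on the element types, for example to build an unboxed manifest result, cannot be packaged as fmap, whose type is unconstrained in both element types; a Shape constraint on the shape alone would not be a blocker, since it could sit on an instance head.

    -- My own toy sketch, not the paper's definitions.
    import qualified Data.Vector.Unboxed as U

    -- Hypothetical manifest array over unboxed vectors.
    data UArray sh e = UArray sh (U.Vector e)

    -- The Unbox constraints on the element types mean this cannot be the
    -- fmap of a Functor instance, whose type is fully polymorphic in a, b.
    umap :: (U.Unbox a, U.Unbox b) => (a -> b) -> UArray sh a -> UArray sh b
    umap f (UArray sh v) = UArray sh (U.map f v)

    -- By contrast, a Shape constraint on the shape alone could live on an
    -- instance head (instance Shape sh => Functor (Array sh)), so it would
    -- be worth stating in the paper which constraint is the actual blocker.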
===========================================================================
                           ICFP 2010 Review #78C
                Updated Sunday 23 May 2010 12:24:21am EDT
---------------------------------------------------------------------------
  Paper #78: Regular, shape-polymorphic, parallel arrays in Haskell
---------------------------------------------------------------------------

Overall merit: B. OK paper, but I will not champion it.
Expertise:     X. I am an expert in the subject area of this paper.

                         ===== Paper summary =====

The paper presents concepts in Haskell for operations on regular arrays of variable dimension (shape polymorphism). The approach is based on Haskell with some recent type system extensions and on the data-parallel Haskell extensions in the latest (pre-release) GHC. Haskell types and data representations for zero-based, Int-indexed arrays of various ranks, together with the respective operations, are presented, laying out a theory of shape-generic array computations with hidden library parallelisation. Measurements suggest good performance when arrays are accessed in a regular row-wise fashion, profiting from a library for parallelisation (of which no details are given).

                ===== Paper strengths and weaknesses =====

Overall, this paper presents very nice, useful, and mostly self-contained work. Regular parallel arrays provide an intuitive programming interface, recent type system extensions are put to good use, and the performance figures are promising.

The paper is hard to understand for non-Haskellers. Some of the discussion requires considerable background, specifically on laziness issues (which are not discussed in the appropriate places; see the cursory comments).

One aspect of the title on which the paper does not give any detail is "parallel". We are told that arrays will be computed in parallel, and speedups are given, yet the technique behind the scenes remains unexplained (and it only performs well with a pre-release GHC). The claim "does not require any compiler support specific to its implementation" appears to be untrue (for performance).

Maybe due to last-minute changes, the presented code contains inconsistencies (also with the available library version). Maybe for the same reason, there are no conclusions in the paper. Some of the (self-)references are contrived and should be removed.

The paper is written in an elaborate, fluent style and is therefore very pleasing to read.

                      ===== Comments for author =====

Cursory comments

p.1 The goal of the cited work on byte-strings [7] (previous work by one of the authors) is clearly performance, not abstraction. The reference appears contrived; if anything, [6] should be referenced here.

(p.1 and throughout) The paper fails to clarify the relationship between the presented concepts and "Data Parallel Haskell", which is deeply woven into GHC (>= 6.13). On page 1, the claim "does not require any compiler support specific to its implementation" is untrue for performance considerations (or is it not?).

The last sentence at the bottom should be moved into sec.2 (where it belongs), to avoid the ugly white space on page 1.

p.2 The discussion after the example (before sec.3) is premature, and I am unsure whether non-Haskellers can understand it, since it discusses laziness without explaining it (this is done later in 3.1). "Very Bad Indeed" (capitalised) seems to be a common idiom and joke for the authors - but it seems misplaced to me as a reviewer.

p.3 ..requires us _to_ represent..

The first array type should be called _D_Array, as in the text that follows.
This is better than having two definitions of "Array" in the paper. Consider swapping the code and the (rephrased) explanation for "wrap", to better explain the forward references to "Shape".

In 3.2, again, a concise explanation of laziness (here: delay) is missing; it only follows (along with the example) in 3.3.

p.4 The type given to the crucial function "traverse" does not match its usage (throughout the paper! first in backpermute right below). Clearly the de-facto usage with the array to traverse as the last argument is preferable, but package repa-1.1.0.0 uses the other version.

4.1 could be reorganised as follows: start with the part "The extent of an array..", then introduce the representation as snoc lists, then discuss/mention Ix. Also, the paper should point at "instance Shape sh => Shape (sh:.Int)" as the main motivation for an extra data type (see the small sketch at the end of these comments).

p.5 ...an array of any rank n _>=_ 1, .. (or, in fact, it should maybe say "n > 0", Z being "zero")

...generalising the code..to higher rank_s_ ("rank generalisation").

In sum2, maybe use "swap_last" instead of the previously used "swap".

Consider merging 4.2 and the first part of 4.3. The part following the sum2 example discusses limitations, one of them being that repa does not perform bounds checks (a pathological shape-polymorphic "filter" version would type-check and would even work in checked contexts).

..calculated from __ the extent of.. (delete "on")

p.6 I got lost in the explanations of the Slice class, unable to match them with the code in Fig.4. Giving the instance involving "All" is the minimum needed to allow readers to figure out the functionality from its usage in mmMult. And the "real code" in repa-1.1.0.0 looks completely different!

The "asymptotic step complexity of log n" for specialised associative reductions is a purely theoretical claim. What you get in practice on a multicore is a linear factor. That is good enough; do not oversell!

p.7 .. general shuffle_ operations.. (no "s")

.. following function that extract_s_ the diagonal

In the code for "diagonal", a pattern match with an ordinary tuple instead of a shape is used on the result of "extent".

The traverse type in Table 1 matches the one at the beginning (and in repa-1.1.0.0), but not its use throughout the text. (The paper MUST match the library!)

p.8/9 Parallelism should be explained more: how exactly the implementation parallelises over the array, not only _that_ it does.

p.9 References for FFT in three dimensions and for Cooley-Tukey are missing.

The bad performance of the Laplace (stencil) benchmark is attributed very unspecifically to "bandwidth problems". As the code on p.8 suggests, the real problem is cache misses due to the non-row-wise access for the stencil. A blockwise parallelisation would surely do better.

p.10 If you do not have a C version of FFT3D, consider using FFTW.

Please give a more specific reference for the LLVM backend, or leave out the paragraph (it does not make any contribution).

..where we achieve _a_speedup_of_ 7.7 with 8 threads_,_ ..

p.11 Maybe [7] is the best reference for ByteString, yet it had been written for GHC by Brian O'Sullivan before.

The with-loop in SAC is part of a core language; it is not intended for programmer use but is generated by the compiler. Programmers will rather use standard operations (such as matrix arithmetic) overloaded in a shape-generic way in SAC (this would be useful in repa as well).

p.12 The paper is missing conclusions; one or two paragraphs would do.
In Ref 4, "sy_m_posium".

Ref [21] (from the same group as the authors) is unspecific and is referenced from the text only in a digression. It can be removed.
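Addendum to the comment on 4.1: the snoc-list shape representation as I understand it from the paper, in a minimal self-contained sketch (exact names, fixities, and class methods may differ from the library):

    {-# LANGUAGE TypeOperators, FlexibleInstances #-}

    -- Dimensions are built as snoc lists of Int indices.
    data Z = Z                          -- rank-0 shape
    data tail :. head = tail :. head    -- adds one dimension
    infixl 3 :.

    type DIM0 = Z
    type DIM1 = DIM0 :. Int
    type DIM2 = DIM1 :. Int

    class Shape sh where
      size :: sh -> Int                 -- number of elements in an extent

    instance Shape Z where
      size Z = 1

    -- This is the instance I single out above as the main motivation for
    -- introducing a dedicated data type rather than using plain tuples.
    instance Shape sh => Shape (sh :. Int) where
      size (sh :. n) = size sh * n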