===========================================================================
                           ICFP 2011 Review #29A
                 Updated Friday 22 Apr 2011 20:15:43 CEST
---------------------------------------------------------------------------
       Paper #29: Efficient Parallel Stencil Convolution in Haskell
---------------------------------------------------------------------------

                      Overall merit: 1. Reject
                 Reviewer expertise: 3. Knowledgeable

===== Paper summary =====

The paper describes implementing parallel stencil convolutions in Haskell
using the authors' previous Repa array library. The library is extended with
two key new features: partitioned and cursored arrays. The paper describes
how these extensions address some of the performance issues. Performance
numbers are presented for some uses of stencils.

===== Comments for author =====

Pros:

* Good description of how the library reduces performance issues and of the
  extensions to the Repa array library.

Cons:

* Unimpressive performance results. After all that work, the code is still
  slower than C by a significant factor in many cases. In some cases there
  are no parallel speedups.

* Array bounds checking is turned off. If you want a safe language then the
  right thing to do is implement array-bounds-checking elimination
  optimizations. These are well known and well described, and work
  reasonably well for Java. They need to be implemented in Haskell. I also
  noticed a use of unsafePerformIO in Fig 18.

Other comments:

Fig 1: 3 argument types, 4 actual arguments. This occurs in other example
code fragments too.

Fig 13 & 15: Why a log scale for runtimes? This is highly unusual and makes
it hard to gauge the performance slowdown between the various versions.

6.3, penultimate para: So is this a limitation of your library or a
limitation of the declarative Haskell approach?

7.1: While one can in theory get a 4x speedup from 4-wide SIMD, in practice
one will only see 2-3x.

7.5, DPJ: DPJ is not a first-order language. It is based on Java; Java
object types include methods, and methods can take other objects as
arguments - thus it is higher-order. Java even allows anonymous classes to
be nested inside expressions, providing a (syntactically very heavyweight)
equivalent to nested functions. Whether DPJ allows this nesting, or whether
its effect annotations (which are key to what it does) work well with
higher-order methods, I don't know.

===========================================================================
                           ICFP 2011 Review #29B
                  Updated Tuesday 3 May 2011 19:57:58 CEST
---------------------------------------------------------------------------
       Paper #29: Efficient Parallel Stencil Convolution in Haskell
---------------------------------------------------------------------------

                      Overall merit: 3. Weak accept
                 Reviewer expertise: 3. Knowledgeable

===== Paper summary =====

This paper furthers the work on Repa, a framework/library for regular
parallel array programming in Haskell, focussing here on the efficient
implementation of convolution-based algorithms. The focus on
stencils/convolutions is warranted since convolution is a ubiquitous
operation in image processing, scientific computing, and graphics-oriented
programs. The well-known Canny edge detection algorithm is used as a
motivating example along with the discrete Laplace operator. "Partitioned"
and "cursored" array representations are introduced, as well as a
representation for stencils, permitting efficient implementations of
convolutions.

Partitioned arrays separate into regions the elements near the array
boundaries, where application of a convolution indexes elements outside of
the array, from those in the inner area of the array. Costly bounds checking
on array indexing can be eliminated in the inner regions, which, as the
authors argue well, can be a significant overhead. Cursored arrays improve
performance by exposing common sub-expressions involved in index
calculation. Cursored arrays are indexed by an abstract representation of
their indices, for which a shifting operation calculates indices relative to
the current index. A stencil representation is introduced which prevents
random array-indexing, instead restricting the programmer to relative
array-indexing.
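To make this concrete, here is my own minimal sketch of what restricting
clients to relative indexing looks like; the names and representation are
illustrative, not the paper's actual API:

    import qualified Data.Vector.Unboxed as V

    -- Client code never sees absolute indices: it is handed only a lookup
    -- function taking an offset from the current element, so it cannot
    -- perform random access into the array.
    laplaceAt :: ((Int, Int) -> Double) -> Double
    laplaceAt get =
      get (-1, 0) + get (1, 0) + get (0, -1) + get (0, 1) - 4 * get (0, 0)

    -- How a trusted library might supply that lookup for the element at
    -- (x, y) of a row-major image of the given width, using unchecked
    -- indexing because (x, y) is known to lie away from the border.
    atInner :: Int -> V.Vector Double -> (Int, Int) -> (Int, Int) -> Double
    atInner width img (x, y) (dx, dy) =
      V.unsafeIndex img ((y + dy) * width + (x + dx))

    -- e.g. laplaceAt (atInner width img (x, y))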
I enjoyed the paper. The work is well motivated and it is nice to see
practical, real-world examples being used. The authors have made a good
effort at thinking about how to improve array performance for convolutions
and have positive, encouraging results. Although I think the research is
good, I feel the discussion is lacking in places. In particular, we never
see how the stencil is applied at the boundary regions. Figure 9 on page 7
shows what I believe to be the relevant code, although the parts pertaining
to application of stencils at the boundaries are omitted. I would have
thought this would be particularly interesting since the boundary varies
depending on the extent of the stencil. More is needed on the interaction
between boundary/regions and the stencil representation. Also, the paper
lacks discussion of how to generalise the code seen throughout, e.g. for
rank-n arrays, for arbitrary stencil sizes in mapStencil2, and for
arbitrarily sized unroll-jamming blocks. I think this paper should be
accepted, although I would have more confidence if the discussion of
boundary regions was fleshed out further and if there was more discussion
of generalisation.

===== Comments for author =====

Here are a number of thoughts, minor corrections, and errors:

On the first page, talking about Chapel, "which stresses the need to define
the value of each array declaratively". Do you mean the value of each array
element?

It might be worth mentioning, for the sake of those who are less accustomed
to convolution, that it is a general technique found in many applications,
not just image processing.

The discrete Laplace operator is used as an example, although I'm not sure
if it is correctly specified. As I recall (and confirmed by a peek at
Wikipedia), the discrete Laplacian in 2D has the kernel:

    0  1  0
    1 -4  1
    0  1  0

In the paper a slightly different kernel is used. I am not an expert in
image processing, so perhaps the definition you use is standard in image
processing as an approximation or as a special case of the mathematically
derived discrete Laplace operator. If the approach is standard, perhaps a
citation could be given? Otherwise, an explanation could be included. This
is not a particularly big problem though; the actual coefficients are not
relevant to the approach, after all.
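For what it's worth, my guess at the source of the discrepancy: applying
the kernel above at grid point (i,j) computes the standard 5-point
approximation of the Laplacian,

    u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) - 4 u(i,j)

whereas a Jacobi relaxation step for solving Laplace's equation sets this
expression to zero and solves for u(i,j), giving

    u'(i,j) = ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) ) / 4

i.e. a kernel with a zero centre coefficient followed by a scaling by 1/4.
If that is what the paper computes, a sentence saying so would clear up the
confusion.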
In Figure 1, is the implication that arrBoundValue is 0 for all elements
that are not boundaries, and some other value for the elements that are
outside of the bound?

"Replacing the use of (!) in the definition of traverse with an ..." - We
don't know that (!) is used in traverse, so perhaps you should say this
first, then say that unsafe indexing operations could be used instead.

"In Figure 2, note that the code that checks for the border is "further
up", so the pres...." You mean it is not shown because it has been lifted
during code transformation; "further up" is too vague: I started looking at
the top of the code block to find it.

(pg 4) "In contrast, RangeRects is used..." Do you mean "rangeRects" (lower
case)?

You mention that regions are permitted to overlap. Can you give an example
of when this happens?

We are referred to 5.3 for a discussion of per-region specialisation, but
in 5.3 I don't see such discussion. Did you mean 5.2? However, in 5.2 there
is barely any discussion of the specialisation to the different regions
(see my above major point).

"When tried adding it..." -> "We tried adding it ..."?

Shouldn't makeStencil in Figure 6 have a "Num a =>" constraint arising from
the use of 0? (Or was that omitted on purpose?)

I liked the Template Haskell specification of a kernel. This seems like a
nice idea but little is said about it.

"Returning to the issue of bounds checking, with ..." You might want to
start this sentence as a new paragraph.

(pg 6) "\texttt{load*}" Perhaps "\texttt{load}*" might be better; I was
momentarily confused about some function called load* (this appears again
later, I think).

"Computation of the inner array elements is performed by by the" - double
"by".

In the mapStencil2 function (Figure 9), shouldn't the pattern match
"(Stencil sExtent _ _)" be "stencil@(Stencil sExtent _ _)"?

Do you plan on having multiple cases in unsafeAppStencil2 for m*n stencils?
Or will they all be square? You talk about not having a general way to
create map functions for rank-n arrays; perhaps a macro system could be
used to generate rank-n arrays and to generate the (unrolled) stencils?

At the top of column 2, pg 8, "Although the conversion to and from the
linear index...". It took me a while to understand what you meant, as I
thought you were referring to the calculation of the cursor; now I see that
you are referring to the ixLinear value, which comes in from somewhere else
that we can't see and is converted back into (x, y) indices. Be careful
about referencing code we can't see, especially if it sounds like some
other code that we can see. Perhaps a bit of a rewrite of this sentence
will help.

"..need to evaluated several in one go.." too vague/informal.

Why is a block size of 4 chosen, as opposed to say 2, 6, 8, or n? More
discussion!

"low level" -> "low-level"

There isn't much discussion of how the parallelism is achieved by "force".
I presume it performs 1D decomposition of the array into n blocks for the n
processors? A little on this would be good.

Why is Laplace applied at a size of 300x300 and not at a more standard size
that is a power of two, like you did for Sobel and Canny (512x512, etc.)?

"note the lack of overlap in four Laplace stencils placed side-by-side" I
do not see this. For four Laplace stencils side-by-side there are 16
necessary accesses (i.e. non-zero coefficients), two of which overlap.
There is some overlap, though not much, as you said in the previous
sentence. Note, there is no overlap for *two* stencils side-by-side.

(pg 11, bottom of col 1) "Figure. 18" -> "Figure 18"

Some discussion of how to generalise the code seen throughout would be a
good idea, e.g. for rank-n arrays, for arbitrary stencil sizes in
mapStencil2, and for arbitrarily sized unroll-jamming blocks. On this last
one, some discussion of how to choose the optimal number of iterations to
unroll by would be good (for instance, I presume the number of registers is
an important factor, but also the sizes of proximal caches).
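Regarding this last point, and the earlier question about the block size of
4: here is a minimal sketch, entirely my own, of the kind of four-way
unrolled fill loop I assume the library produces, with a mutable unboxed
vector standing in for the real target and 'element' for the per-index
element function:

    {-# LANGUAGE BangPatterns #-}
    import qualified Data.Vector.Unboxed.Mutable as M

    -- Compute four consecutive elements per loop iteration ("unroll and
    -- jam"): the back end then sees four independent expressions to
    -- schedule, and index arithmetic can be shared between them. Any
    -- residue of fewer than four elements is handled one at a time.
    -- 'end' is exclusive.
    fillBlock4 :: M.IOVector Double -> (Int -> Double) -> Int -> Int -> IO ()
    fillBlock4 out element start end = go start
      where
        go !i
          | i + 4 <= end = do
              M.unsafeWrite out  i      (element  i)
              M.unsafeWrite out (i + 1) (element (i + 1))
              M.unsafeWrite out (i + 2) (element (i + 2))
              M.unsafeWrite out (i + 3) (element (i + 3))
              go (i + 4)
          | i < end      = M.unsafeWrite out i (element i) >> go (i + 1)
          | otherwise    = return ()

I would guess the factor of 4 is where register pressure starts to bite on
x86-64, but that is exactly the kind of justification the paper should
supply.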
(pg 12, top of col 1) "Recent representatives based on ..." This sentence
doesn't follow from the previous one.

"PASTHA [18], whose implementation is based around IOUArray. It
supports..." Sentence fragment.

Can your techniques be extended to other types of stencil computations that
are not convolutions (e.g. the Game of Life)?

===========================================================================
                           ICFP 2011 Review #29C
                  Updated Sunday 8 May 2011 11:57:57 CEST
---------------------------------------------------------------------------
       Paper #29: Efficient Parallel Stencil Convolution in Haskell
---------------------------------------------------------------------------

                      Overall merit: 3. Weak accept
                 Reviewer expertise: 3. Knowledgeable

===== Paper summary =====

The paper describes the changes made to the Repa library for parallel
arrays in Haskell in order to support efficient stencil-convolution
operations. It explains how the representation of arrays was adapted so
that both library and client code can distinguish between various regions
of interest in an array. Furthermore, it explains a zipper-like
representation of an array's contents that allows, to some extent, for
memoisation of array lookups. The paper discusses the results of a set of
benchmarks that were performed on a dual-core machine.

===== Comments for author =====

There is no doubt that this is a good paper: it is well-written and it
conveys a sufficient amount of detail, while still being to the point.
However, I have to admit that I am not very enthusiastic about it: even
though the authors make clear that they have done a very good job improving
the performance of their library on convolutions, it is not at all obvious
to me why this work is important for the functional-programming community.
Two-dimensional convolution is an operation important enough in image
processing to come with special support in a parallel-array library, and
extending Repa with such support---or, actually, overhauling it---surely
seems the right thing to do; but that by itself does not, in my opinion,
merit a publication.

These are my main concerns:

1. What lessons are there to learn for other library implementors?

   o The partitioned representation of arrays solves the problem of
     repeated border checks, but the paper does not make clear whether it
     poses a trade-off with respect to other aspects of working with
     parallel arrays. Note: I am not claiming that such a trade-off exists;
     it's just that the paper does not convincingly argue that there isn't.
     As Repa is not intended for convolution computations exclusively, I
     would have expected, in addition to the stencil benchmarks, some
     previous benchmarks to be repeated: what is the impact of the changes
     to the library on applications other than convolution?

   o The cursored traversal over arrays that is supported by the new
     representation of arrays deserves a more in-depth treatment. How does
     it relate to Huet's zipper? It seems that for traversals to take
     advantage of memoisation, they should follow a very regular pattern.
     Can the implementation of cursors be adapted, perhaps even
     dynamically, to less-regular traversals?
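   (To make the question in the second bullet concrete, this is how I
   currently understand the representation, as a sketch of my own; the
   names are illustrative and need not match the paper's API:

       -- A cursor is an abstract, precomputed position; neighbouring
       -- cursors are derived by shifting, which exposes the shared index
       -- arithmetic to the compiler as common subexpressions.
       data Cursor = Cursor !Int                 -- precomputed linear index

       data CursoredArray a = CursoredArray
         { makeCursor  :: (Int, Int) -> Cursor            -- cursor for (x, y)
         , shiftCursor :: (Int, Int) -> Cursor -> Cursor  -- relative movement
         , loadCursor  :: Cursor -> a                     -- read the element
         }

       -- For a row-major array of the given width, a shift by (dx, dy) is
       -- a single addition on the linear index, so the (y * width + x)
       -- multiplication happens once per element rather than once per
       -- stencil coefficient.
       rowMajor :: Int -> (Int -> a) -> CursoredArray a
       rowMajor width index = CursoredArray
         { makeCursor  = \(x, y) -> Cursor (y * width + x)
         , shiftCursor = \(dx, dy) (Cursor i) -> Cursor (i + dy * width + dx)
         , loadCursor  = \(Cursor i) -> index i
         }

   On this reading, a cursor plays the role of the focus of a zipper, moved
   with shiftCursor; but the memoisation only pays off for traversals whose
   movement pattern is fixed, which is what prompts my question above.)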
2. The paper, by its title, claims to be about efficient convolution *in
   Haskell*, but I would argue that it's really about efficient convolution
   with GHC in conjunction with its LLVM back end. Some of the nitty-gritty
   design choices seem to be tailored to the particular choice of back end.
   This is defensible, as GHC is de facto the only weapon of choice for
   those who wish to write efficient Haskell code. Nevertheless, I wonder
   how stable your choices are with respect to changes in GHC's back-end
   implementation. Once LLVM supports autovectorisation, does the library
   need another overhaul to take advantage of it? At least, if one
   addresses the functional-programming community as a whole, I would
   expect these issues to be discussed more extensively.

3. The parallelisation aspect is a bit underdeveloped in the paper. How do
   the design decisions work out with respect to distributing
   array-processing code over multiple cores? Is it sufficient to run the
   benchmarks on a dual-core machine only?

Here are some comments at a lower level:

o Section 1: Introduction

  page 1, column 1: "in our research group". Which group is this? The
  authors are from different institutions. Moreover, the paper referred to
  is co-written by Keller and Chakravarty (who was a coauthor of the Repa
  paper...).

o Section 2: The Laplace Equation, Reloaded

  What is the purpose of this section? Make it explicit.

  p.2, c.1: "We will reuse our example from ..." Make this part more
  self-contained by providing some more context.

  p.2, c.1: "u'(i,j) = ..." This is set in Computer Modern Roman, while
  SIGPLAN style prescribes Times. Use the mathptmx package.

  p.2, c.1: "The boundary conditions are specified by ..." Explain the role
  of these conditions better.

  p.2, c.1: "Parallelising the potential random access of ..." It was not
  clear to me what is meant here.

  p.2, c.1: "... we present part of Core IR ..." Explain what Core IR is.

  p.2, c.1: "... performing a large number of code transformations." At
  this point, it is not at all clear whether these transformations are
  performed automatically or manually.

  p.2, c.2: Figure 1. To make the paper more self-contained, a better
  explanation of the code is required.

  p.3, c.2: Figure 3. Font.

  p.3, c.2: "... Sobel_X ...", "... Roberts_X ...", etc. Font.

o Section 4: Stencils, Borders and Partitioned Arrays

  p.4, c.1: "All stencils reuse values for the coefficients." I did not
  understand this point.

  p.4, c.1: "(ex ...)" (twice) Write out.

  p.4, c.1: "... Binominal_7X ..." etc. Font.

  p.4, c.1: "Points 1 & 2 ..." etc. Don't use "&", just write "and".

  p.4, c.1: "... we usually want the second option, but note that ... only
  0.8% of the pixels are in the border region." What opposition is signaled
  by "but"? Restructure; there are two aspects here: 1. how to handle the
  border case, 2. how to distinguish the border case from the internal
  case. Make these two aspects more explicit.

  p.4, c.1: "In the rank 2 case ..." Explain that the rank of an array is
  just its dimension.

  p.4, c.1: Figure 4. This picture is not very clear to me. Perhaps you
  could leave out the border, so that it is clear that parts of the stencil
  fall outside of the array? It's confusing that the index of the top-left
  element is (1,1) rather than (0,0).

  p.4, c.2: Figure 5. With respect to the Vector argument of the
  GenManifest constructor: how are the elements of the vector related to
  the positions in the array region?

  The explanation of co-stencils and how they relate to co-recursive
  functions over streams and co-monadic grid computation is far from clear
  to me. Please make this subsection more self-contained. (A sketch of my
  current reading follows at the end of these Section 4 comments.)

  p.5, c.2: "Although some coefficients in a stencil are often zero, ..."
  What opposition is signaled by "although"?

  p.5, c.2: "Never the less ..." "Nevertheless".
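  Here is the sketch promised above: my current reading of the co-stencil
  idea (all names are mine, not the paper's). The point, I take it, is that
  the client never indexes the array; the trusted library reads each
  element itself and feeds it to the client's accumulating function, so no
  bounds checks leak into client code:

      data CoStencil a b = CoStencil
        { coSeed  :: b                          -- initial accumulator
        , coAccum :: (Int, Int) -> a -> b -> b  -- offset, element, acc
        }

      -- A library-side driver for a 3x3 neighbourhood: it enumerates the
      -- offsets and feeds each element to the client, fold-style.
      applyCoStencil3x3 :: ((Int, Int) -> a) -> CoStencil a b -> b
      applyCoStencil3x3 lookupRel (CoStencil z f) =
        foldl step z [ (dx, dy) | dy <- [-1, 0, 1], dx <- [-1, 0, 1] ]
        where
          step acc off = f off (lookupRel off) acc

  If something like this is what is meant, the co-recursion analogy (the
  producer drives the consumer, as with unfolds over streams) deserves to
  be spelled out in the text.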
o Section 5: Sharing and Cursored Arrays

  p.6, c.1: "4 x 9 = 36" Font.

  Invest in explaining the intuition behind cursored arrays before you
  delve into the details of the code that implements them.

  p.8, c.2: "Sobel_X" Font.

o Section 6: Benchmarks

  p.9, c.1: Figure 12: "... Sobel_X ..." Font.

  p.10, c.1: "... Sobel_XY ..." (twice) Font.

===========================================================================
                           ICFP 2011 Review #29D
                 Updated Thursday 12 May 2011 02:40:54 CEST
---------------------------------------------------------------------------
       Paper #29: Efficient Parallel Stencil Convolution in Haskell
---------------------------------------------------------------------------

                      Overall merit: 2. Weak reject
                 Reviewer expertise: 2. Some familiarity

===== Paper summary =====

The paper describes how a particular class of matrix manipulation programs
(so-called stencil convolutions) can be implemented in GHC so the resulting
code runs with performance comparable to C. The authors analyze the source
of inefficiencies and extend a previously constructed GHC library (Repa)
with new features, so that the compiler can eliminate these inefficiencies.

===== Comments for author =====

At a high level, this reviewer can appreciate the challenges posed and the
solutions: partitioned arrays and cursored arrays. Particularly useful is
the description on page 5: rather than giving untrusted code direct access
to the array (which would necessitate bounds checking, in general), the
trusted library itself feeds the elements to the client. This reviewer can
also appreciate the end-to-end results in terms of the overall performance
of the library, which exploits a number of techniques in GHC combined with
LLVM's low-level optimizations.

Beyond that, however, the whole construction seems incredibly brittle;
baling wire and duct tape come to mind. Small changes in LLVM or the
internals of GHC could have dramatic effects, and in fact they already do.
What is the general scientific advance? Can I adapt this to other
functional languages or environments? And, while stencil convolution is
important, I feel like there should be some generalizable lesson somewhere
in here. If so, I fear it is lost in the current presentation.

===========================================================================
                           ICFP 2011 Review #29E
                  Updated Friday 13 May 2011 17:06:17 CEST
---------------------------------------------------------------------------
       Paper #29: Efficient Parallel Stencil Convolution in Haskell
---------------------------------------------------------------------------

                      Overall merit: 3. Weak accept
                 Reviewer expertise: 2. Some familiarity

===== Paper summary =====

The paper describes a number of design decisions/tactics which enable the
GHC/LLVM conglomerate to achieve good efficiency. In the absence of
supercompilation, several issues have to be dealt with explicitly. The
paper is well written and achieves good results (although comparison with
other approaches is not possible since their numbers are absent). Aside
from using Haskell and GHC, the paper is low on very specific FP content;
the absence of side effects definitely helps, though, in implementing the
EDSL. On the other hand, given the specific properties of GHC which are
involved, and the way Haskell is used to convey information through the
stages, I think ICFP is probably the most likely venue for this paper. The
paper is pleasant to read, explains things carefully, and I learned from
it.

===== Comments for author =====

Comments were made by my subreviewer.