----------------------- REVIEW 1 ---------------------
PAPER: 8
TITLE: Efficient Parallel Stencil Convolution in Haskell
AUTHORS: Ben Lippmeier and Gabriele Keller

OVERALL RATING: 3 (strong accept)
REVIEWER'S CONFIDENCE: 2 (medium)

Summary: This paper presents an implementation of a library for fast parallel stencil convolution in Haskell. Stencil convolution is a kind of image processing technique that is used for operations like smoothing or edge detection; it is an embarrassingly parallel problem because values of the target image can be computed independently of one another. This paper shows how to implement stencil convolution in Haskell in such a way as to achieve performance comparable to an industry-standard library (OpenCV). One big idea of this paper is to use a representation of arrays that exposes different block "regions" such that operations over these regions can be specialized by the compiler to eliminate redundant computation (such as unnecessary bounds checks). Another big idea of this paper is to use a "cursor" representation for accessing array elements -- this allows the program to efficiently share subcomputations for related elements. The paper describes the implementation of the library, shows examples of its use, and gives performance measurements that demonstrate significant performance increases: on one test, the GHC implementation with 8 threads approaches the performance of OpenCV as the size of the image gets larger. The paper also discusses how the interaction between GHC and its LLVM backend can be (carefully!) exploited to get good performance.

Review: This is a great paper. It is well written and the ideas are explained in a very understandable way. As someone not working in the realm of high-performance computing, I found this paper to be a very good read. I particularly liked the explanation of cursored arrays and the idea of "inverting" the relationship between the code that generates array indices and the code that consumes array values; that idea was new to me, and it was nicely explained. Overall I give this paper a "strong accept".

Comments:

- I wonder to what extent cursored arrays and your block representation are useful in other contexts beyond this kind of stencil convolution.

----------------------- REVIEW 2 ---------------------
PAPER: 8
TITLE: Efficient Parallel Stencil Convolution in Haskell
AUTHORS: Ben Lippmeier and Gabriele Keller

OVERALL RATING: 2 (accept)
REVIEWER'S CONFIDENCE: 3 (high)

This paper updates last year's Repa paper with two new techniques for improving the performance of convolutions. "Partitioning" allows representing an array by a collection of subarrays ("regions"), with each one having a specialized representation. For instance, borders require special treatment and cover a relatively small part of an array. With partitioning, the larger region can be computed more efficiently, leaving the more expensive computations to the (relatively small) borders. The second new technique is "cursored arrays", which allow some sharing of indexing computation, as well as some incremental calculation when accumulating information from nearby array elements. The techniques are shown to yield significant speed-ups. The paper also draws attention to some subtleties in achieving and measuring parallel performance. Overall, I like the style and design choices presented in the paper.

Some specific comments:

+ Page 3, definition of map on arrays. There is more commonality to be shown in the cases for Manifest and Delayed. Both results have the form 'Delayed (f . h)' and differ only in the choice of h. I'd rewrite the code accordingly. Also fix the sentence "Both cases of map produce a Delayed array, and the second corresponds to the following familiar identity: ...", since *both* cases correspond to that functor law.
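To make the suggestion concrete, here is a minimal standalone sketch of that rewrite, specialised to one-dimensional Int-indexed arrays so it compiles on its own; the constructors mirror the paper's, but the helper names here are illustrative, not the paper's actual code:

    import qualified Data.Vector.Unboxed as V
    import Prelude hiding (map)

    data Array a
      = Manifest Int (V.Vector a)   -- extent and flat manifest data
      | Delayed  Int (Int -> a)     -- extent and element function

    extent :: Array a -> Int
    extent (Manifest n _) = n
    extent (Delayed  n _) = n

    -- Both cases now visibly produce 'Delayed (f . h)'; only h differs.
    map :: V.Unbox a => (a -> b) -> Array a -> Array b
    map f arr = Delayed (extent arr) (f . h)
      where
        h = case arr of
              Manifest _ vec -> (vec V.!)
              Delayed  _ g   -> g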
+ Page 3, last paragraph of section 3, about explicit calls to 'force': "the programmer is responsible for inserting calls to force in the appropriate place in their code. Forcing the array at different points has implications for sharing and data layout, though in practice we have found there are usually only a small number of places where forcing would “make sense”, so the choice presents no difficulty." This explicitness is a shame, especially given the last sentence. Have you explored automating the points of reifying delayed arrays into manifest ones? Have you tried to articulate the issues involved in either manual or automatic placement?

+ Page 5, figure 6, definition of makeStencil. The code uses 0 and (+). Did you omit a Num constraint on the type parameter 'a'?

+ Section 4.3, "Nevertheless, in the literature stencils are ...". Add a comma after "literature". Otherwise it reads like a remark on "literature stencils".

+ Section 5.2, first paragraph, "... have been elided to save space, so have the inInternal and inBorder predicates, though they are straightforward." Replace the comma after the word "space" with a semicolon or period (probably the former) for a cleaner break, or keep the comma and replace "so have" with "as have".

+ Page 7, figure 9, definition of mapStencil2. The definitions of make and shift are very similar. Perhaps instead, define cursor addition and use it to define shift as something like "shift du = (^+^ mkCursor aWidth du)". Or maybe not.
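For what it's worth, a standalone sketch of that cursor-addition idea; the Cursor type, mkCursor, and aWidth are stand-ins for whatever figure 9 actually has in scope, and (^+^) is the hypothetical addition operator:

    -- A linear offset into the flat array data.
    newtype Cursor = Cursor Int

    (^+^) :: Cursor -> Cursor -> Cursor
    Cursor a ^+^ Cursor b = Cursor (a + b)

    -- Linearise a 2-d (row, column) offset against the array width.
    mkCursor :: Int -> (Int, Int) -> Cursor
    mkCursor aWidth (y, x) = Cursor (y * aWidth + x)

    -- 'make' and 'shift' then collapse to:
    make :: Int -> (Int, Int) -> Cursor
    make = mkCursor

    shift :: Int -> (Int, Int) -> Cursor -> Cursor
    shift aWidth du = (^+^ mkCursor aWidth du)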
+ Section 5.3, first paragraph, "who's inner fragment ...". Replace "who's" with "whose".

+ Throughout: Many uses of "this" and a few of "these" with missing or unclear referents. Adding a clear & succinct noun or noun phrase for every one of these instances will improve clarity and specificity in your writing and maybe in your thinking. (It does so for me.) This exercise takes some time and effort, and I find that the more difficult the exercise, the more muddled my thinking was (without my realizing) and hence the more worthwhile. I'd rather you made that effort than leave the work to your readers. As E.B. White (I think) said, "Hard writing is easy reading; easy writing is hard reading." So if you expect (or hope) for your paper to have more readers than writers, my math says write hard/clearly.

----------------------- REVIEW 3 ---------------------
PAPER: 8
TITLE: Efficient Parallel Stencil Convolution in Haskell
AUTHORS: Ben Lippmeier and Gabriele Keller

OVERALL RATING: 3 (strong accept)
REVIEWER'S CONFIDENCE: 3 (high)

I liked this paper. This is a challenge-response paper, where a specific library was not performing competitively in a specific space, and these were the steps taken to make it efficient on the candidate example. This is a good fit for the Haskell Symposium. Specifically, this paper discusses the optimization of convolution operators. The outcome is that they claw back to parity with other techniques, and in doing so enhance the range of computation methods that can be expressed using Repa. The paper is extremely well written, and contains a detailed technical account of the execution of specific ideas, including reasonable performance measurements and analysis of the measurements.

Above the specifics, this paper is part of an over-arching effort to raise the use of Haskell abstractions while still getting highly efficient code, a goal we all support. The paper closes with insightful discussions about the rationale behind their internal use of unsafePerformIO, and the pervasive use of the INLINE pragma, which frustrates the authors. It is this level of discussion that convinces me that this is a great candidate for the symposium.

I have no specific issues, but two general comments:

* Figure 5 could do with an extra diagram, to explain the partitioning into regions.

* Technically, I am uneasy about recovery of sharing, as used. It feels like an especially brittle part of the transformation chain. See for example the comment on p. 8, middle of the right column.

----------------------- REVIEW 4 ---------------------
PAPER: 8
TITLE: Efficient Parallel Stencil Convolution in Haskell
AUTHORS: Ben Lippmeier and Gabriele Keller

OVERALL RATING: 2 (accept)
REVIEWER'S CONFIDENCE: 2 (medium)

The authors describe the work of optimising their parallel array library "Repa" for performing stencil convolution. Their intent is to achieve good performance in comparison to programs using the fast C library OpenCV. After the introduction, the authors first identify algorithmic optimizations that could enhance the performance of the existing algorithm. To be able to follow the changes to the library deduced from these observations, the reader gets a quick introduction to the former version of Repa in the following section. The first optimization is the avoidance of unnecessary operations such as bounds checks. This issue is addressed by inversion of control and a new array representation that partitions the array into parts that need to be treated differently. Having addressed these higher-level optimizations, the focus shifts to the compiler and its optimisers. To benefit from a maximal amount of sharing, the arrays are accessed via newly defined cursors that make a special sharing optimization possible for the LLVM backend of GHC. The authors verify the effectiveness of their approach by briefly examining the generated Core IR and the disassembled executable. A benchmark section compares versions with different optimizations against each other, as well as against C/OpenCV versions, for applications of three concrete stencils. A summary section identifies key challenges in the context of the stencil convolution application and raises questions for further research. The paper ends with a related work section.

The paper is very well structured. Optimisation considerations begin at a more abstract level (where a normal Haskell programmer feels very comfortable) and later descend into the lower levels of compiler backends and assembly code. It gives a good hint at what performance might be achievable for Haskell programs, as well as how much pain it is for the programmer to get there. Apart from its obvious contributions (showing optimising techniques, introducing a new concept for an optimization-friendly representation of arrays, ...), it may also serve as a base for interesting discussions about how far a Haskell programmer might want, or should have, to go to get his code running at top speed... which, it seems, can only be done by taking off the abstraction glasses he normally wears with pleasure. Throughout the paper, the authors always give reasons for their design decisions, even down to which structural constructs to use or avoid so that compiler optimisations can (and will) be triggered.
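To make that last point concrete, here is an invented, minimal illustration of the INLINE pattern that both this review and Review 3 allude to; none of these names come from the paper itself. Marking the generic loop INLINE lets GHC unfold it at each call site and specialise the loop to the concrete element function, instead of calling through a closure:

    import qualified Data.Vector.Unboxed as V

    -- Generic fill loop: without the pragma, 'getElem' stays an
    -- out-of-line closure call and no specialisation happens.
    fillBlock :: V.Unbox a => Int -> (Int -> a) -> V.Vector a
    fillBlock n getElem = V.generate n getElem
    {-# INLINE fillBlock #-}

    -- At a use site like this, GHC inlines fillBlock and compiles a
    -- tight, specialised loop over unboxed Doubles.
    ramp :: Int -> V.Vector Double
    ramp n = fillBlock n (\i -> fromIntegral i / fromIntegral n)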