Data Parallel Haskell fun

GHC 6.10 promised to deliver on 2 things that I'm looking forward to a lot:

  • Data Parallel Haskell (in some form)
  • Associated types (fully working)

This is about the first of those.

The GHC 6.10 beta was recently released, and with the extralibs package came dph! I've been investigating it a little, and here's what I've come up with, for those who are interested.

Quickly: What is DPH?

Data Parallel Haskell is an extension to Haskell to support highly optimized parallel arrays for doing nested data parallelism. In particular, it allows you to write pretty expressions like this:

test xs ys zs = sumP $
 zipWithP (-) [: z+(x*y/2) | x <- xs, y <- ys, z <- zs :]
              [: (y^2)-1   | y <- [:1..:] :]

And GHC will convert that nice, list-like syntax into highly optimized array operations that can run in parallel on your SMP machine. It's an excellent way to harness many cores by performing uniform operations over very large data sets. This is the vectorised syntax, and the GHC pass that performs this complex transformation is the vectorisation pass.
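
To give the flavour of the simplest possible case, here is a tiny sketch of my own (sumSq and the module name are made up, and the imports mirror the vectorised module shown further down) that just sums the squares of a parallel array:

{-# LANGUAGE PArr #-}
{-# OPTIONS -fvectorise #-}
module SumSq where
import Data.Array.Parallel.Prelude
import Data.Array.Parallel.Prelude.Int as I

-- sum of x*x over a parallel array, using the comprehension syntax
sumSq :: [:Int:] -> Int
sumSq xs = I.sumP [: (I.*) x x | x <- xs :]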

The packages that come with dph

As it stands, 6 packages actually comprise the DPH system:

  • dph-base - this is the basic interface and provides things like the internal fusion primitives
  • dph-prim-interface - as far as I can see, this is something of a pseudo-interface that in reality just maps onto Prelude functions.
  • dph-par - This library exposes a set of parallel combinators for doing computations using the vectorised syntax for arrays, and doing it in parallel as well.
  • dph-prim-par - The primitive, unlifted interface to the parallel combinators.
  • dph-seq - This library exposes simple sequential combinators for doing computations with the vectorised syntax.
  • dph-prim-seq - This is the primitive, unlifted interface to sequential combinators.

There are a few things worth noting. First, dph-par and dph-seq are based on the dph-prim-par and dph-prim-seq packages, respectively; from what I can tell, it's GHC's vectorisation pass that handles the conversion between the two layers.

Second, dph-par and dph-seq export the exact same set of modules and an identical interface. The reason is that you can simply pass a flag to GHC saying whether you want the parallel version of DPH or the sequential one. You don't have to make any API changes, but you still get control over how things run.

Likewise, dph-prim-par and dph-prim-seq export a practically identical interface for unlifted operations, so here too you can choose between the sequential and parallel versions of the package. (Each does export a few modules the other doesn't; the most general approach is to stick to Data.Array.Parallel.Unlifted, which is identical in both.)

When compiling your code, you pass either the -fdph-par flag to get the parallel combinators or the -fdph-seq flag to get the sequential ones. The interface you use (in both the vectorised and the plain, unlifted case) is identical either way.
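
For instance, assuming a module Foo.hs written against the shared interface (Foo.hs is just an illustrative name), switching backends is purely a matter of which flag you pass:

ghc -c -Odph -fdph-par Foo.hs
ghc -c -Odph -fdph-seq Foo.hs

No source changes are needed: the first build picks up the parallel combinators, the second the sequential ones.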

Example: Dot product

As an example of both interfaces, we'll write a simple parallel dot product and use it in a few computations, once with the vectorised interface and once with the unlifted one.

The first file, test.hs, will simply contain the driver for our experiment.

{-# LANGUAGE PArr, CPP #-}
#ifdef PRIM
import DotPPrim
import Data.Array.Parallel.Unlifted as U
#else
import DotPVect
#endif

main = print $ dotp (mk x1 x2)   (mk y1 y2)
    where x1 = dotp (mk 1  20)   (mk 21 40)
          x2 = dotp (mk 41 60)   (mk 61 80)
          y1 = dotp (mk 81 100)  (mk 101 120)
          y2 = dotp (mk 121 140) (mk 141 160)


#ifdef PRIM
mk n m = U.enumFromTo n m   -- unlifted array [n..m]
#else
mk n m = [:n..m:]           -- vectorised parallel array [n..m]
#endif

So we can compile with -DPRIM to use the primitive interface and otherwise we'll use the vectorised one.

The vectorised version of the dot product module is nice and simple:

{-# LANGUAGE PArr #-}
{-# OPTIONS -fvectorise #-}
module DotPVect where
import Data.Array.Parallel.Prelude
import Data.Array.Parallel.Prelude.Int as I

dotp :: [:Int:] -> [:Int:] -> Int
dotp v w = I.sumP [: (I.*) x y | x <- v, y <- w :]

This example is really touchy - for instance, you must use the (*) operator exported from Data.Array.Parallel.Prelude.Int; using the regular Prelude version will cause the vectoriser to choke. The vectorisation pass is still a work in progress, though.

Regardless, we already have a very nice, declarative syntax.

The options we need to compile with in particular are:

ghc -fcpr-off -threaded -Odph -funbox-strict-fields -fdph-par ...

So if we compile these two modules, we get our nice program.

However, we don't get anywhere due to a massive space leak that causes the program to die with a stack overflow:

$ $HOME/ghc-6.10/bin/ghc -o vect --make test.hs -fcpr-off -threaded -Odph -funbox-strict-fields -fdph-par
[1 of 2] Compiling DotPVect         ( DotPVect.hs, DotPVect.o )
[2 of 2] Compiling Main             ( test.hs, test.o )
Linking vect ...
$  time ./vect +RTS -N2 -sstderr
./vect +RTS -N2 -sstderr 
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize' to increase it.
     694,018,176 bytes allocated in the heap
   1,888,943,216 bytes copied during GC
     442,790,304 bytes maximum residency (8 sample(s))
      65,816,880 bytes maximum slop
             793 MB total memory in use (6 MB lost due to fragmentation)

  Generation 0:  1293 collections,    28 parallel, 15.34s, 15.23s elapsed
  Generation 1:     8 collections,     7 parallel,  3.54s,  2.60s elapsed

  Parallel GC work balance: 1.64 (180399755 / 110082965, ideal 2)

  Task  0 (worker) :  MUT time:   0.00s  (  0.00s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  1 (worker) :  MUT time:  19.32s  (  0.47s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  2 (worker) :  MUT time:  19.32s  (  0.47s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  3 (bound)  :  MUT time:   0.00s  (  0.00s elapsed)
                      GC  time:  18.88s  ( 17.83s elapsed)

  Task  4 (worker) :  MUT time:  19.32s  (  0.47s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.44s  (  0.47s elapsed)
  GC    time   18.88s  ( 17.83s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   19.32s  ( 18.31s elapsed)

  %GC time      97.7%  (97.4% elapsed)

  Alloc rate    1,572,564,177 bytes per MUT second

  Productivity   2.3% of total user, 2.4% of total elapsed

recordMutableGen_sync: 0
gc_alloc_block_sync: 4503221
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 80
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 18399
gen[1].steps[0].sync_large_objects: 0
./vect +RTS -N2 -sstderr  19.32s user 1.32s system 111% cpu 18.435 total

As far as I can tell, this is because sumP here is in reality just a redefinition of GHC.PArr.sumP, which isn't optimized at all. What's more, we can't even examine the core of this program using ghc-core - it's so huge it would cause less to suck up hundreds of megs of memory!
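
For reference, the GHC.PArr combinators are plain, list-style definitions with no fusion behind them. The sketch below is my paraphrase of the general shape, not the actual library source, and it assumes GHC.PArr exports foldlP:

{-# LANGUAGE PArr #-}
module NaiveSum where
import GHC.PArr   -- assumed to export foldlP alongside the [: :] machinery

-- roughly the shape of an unoptimized, list-style sum over [:a:]:
-- a lazy fold with no fusion and no strict accumulator
naiveSumP :: Num a => [:a:] -> a
naiveSumP = foldlP (+) 0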

So the vectorised approach works, but it isn't at all optimized right now, and the vectoriser is very touchy: just getting things to compile can be quite annoying, and there are some glaring holes in the API that the unlifted API does fill. For a preliminary release that might be asking a bit much, but hey, at least the vectorisation pass works, and we do use both cores on my Core 2 Duo! :]

Luckily, if we go to the primitive interface we get much better results.

The DotPPrim module is also quite simple:

module DotPPrim where

import Data.Array.Parallel.Unlifted as U

dotp :: U.Array Int -> U.Array Int -> Int
dotp v w = U.sum (U.zipWith (*) v w)

This version works a lot better and faster, and we do utilize both those nice cores:

$ $HOME/ghc-6.10/bin/ghc -o prim -DPRIM --make test.hs -fcpr-off -threaded -Odph -funbox-strict-fields -fdph-par
[1 of 2] Compiling DotPPrim         ( DotPPrim.hs, DotPPrim.o )
[2 of 2] Compiling Main             ( test.hs, test.o )
Linking prim ...
$ time ./prim +RTS -N2 -sstderr
./prim +RTS -N2 -sstderr 
-1246389388
          44,660 bytes allocated in the heap
           3,508 bytes copied during GC
           5,896 bytes maximum residency (1 sample(s))
          10,488 bytes maximum slop
               4 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  Parallel GC work balance: nan (0 / 0, ideal 2)

  Task  0 (worker) :  MUT time:   0.00s  (  0.00s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  1 (worker) :  MUT time:   0.00s  (  0.00s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  2 (worker) :  MUT time:   0.00s  (  0.00s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  Task  3 (worker) :  MUT time:   0.01s  (  0.01s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.00s  (  0.00s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.01s  (  0.01s elapsed)

  %GC time       1.6%  (3.4% elapsed)

  Alloc rate    7,105,807 bytes per MUT second

  Productivity  78.8% of total user, 78.9% of total elapsed

recordMutableGen_sync: 0
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].steps[0].sync_todo: 0
gen[0].steps[0].sync_large_objects: 0
gen[0].steps[1].sync_todo: 0
gen[0].steps[1].sync_large_objects: 0
gen[1].steps[0].sync_todo: 0
gen[1].steps[0].sync_large_objects: 0
./prim +RTS -N2 -sstderr  0.01s user 0.01s system 120% cpu 0.012 total

If we look at the core output of this program using ghc-core we can see thousands of optimization rules firing and the core isn't that bad, either.

So the primitive interface gives us much better results, uses both cores, and isn't very difficult to use, either. As the vectoriser gets better and better in the future, however, it will naturally be preferable to just use that nice syntactic sugar.
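
Just to show how little ceremony the unlifted interface needs, here is a quick sketch of my own that builds on the DotPPrim module above (sqNorm and the module name are made up, not part of any dph package): the squared norm of a vector is simply its dot product with itself.

module Norm where

import Data.Array.Parallel.Unlifted as U
import DotPPrim (dotp)

-- squared Euclidean norm, reusing dotp from above
sqNorm :: U.Array Int -> Int
sqNorm v = dotp v v

For example, sqNorm (U.enumFromTo 1 100) is the sum of the squares from 1 to 100.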

Conclusion

DPH is going to be a lot of fun - once the API is more complete and the vectoriser is touched up a bit, it should prove very nice in practice. Right now the syntactic sugar for parallel arrays works, but the vectoriser is quite fickle and will go 'boom' on very simple things, which in practice can be quite annoying. The vectorisation pass also increases compile times a lot.

OTOH, the primitive interface works nicely right now, and there are a lot more functions exposed from the get-go - although they are by no means complete. Regardless, you can probably fix yourself up some nice things to play with right now.

DPH is going to change a lot in the future, but with the new Recycle Your Arrays work, a more robust vectoriser, and the API gaps getting filled in, things should only get better and faster.

Perhaps some of the shootout benchmarks would be nice to test DPH on?
