Dask
This weekend, I visited PyCon Italy in the picturesque town of Firenze. It was a great conference with excellent talks and encounters (many thanks to all the volunteers who made it happen) and amazing coffee.
I held a talk titled "Python Data Science Going Functional" in the Science Track, where I mostly presented Dask, one of the libraries to watch in the Python data science ecosystem. Slides are available on Speaker Deck.
Dask
The talk introduces Dask as a functional abstraction in the Python data science stack. While creating the slides, I stumbled upon an exciting tweet about Dask by Travis Oliphant:
> Cool to see dask.array achieving similar performance to Cython + OpenMP: https://t.co/3tsWCAgWWQ Much simpler code with #dask. @PyData
>
> — Travis Oliphant, 2016
And after verifying the results on my machine (with some modifications, as I do not trust `timeit`), I included a very similar benchmark in my slides. While reproducing and adapting the benchmarks, I stumbled over some weirdly long execution times for the `dask.array.from_array` function. So I included this finding in my talk's slides without really being able to attribute the delay to a specific cause.
After delivering my talk, I felt a bit unsatisfied about this. Why did `from_array` perform so badly? So I decided to ask. The answer: `from_array` hashes the whole array to generate a key for it, which is why it is so slow. The solution is surprisingly simple: by passing `name='identifier'` to `from_array`, one can provide a custom key, and `from_array` suddenly becomes a cheap operation.
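A minimal sketch of the difference (the array size and chunking here are made up for illustration; the `name` keyword is the only point):

```python
import time

import numpy as np
import dask.array as da

x_np = np.random.random(50_000_000)  # illustrative size

# Without a name, from_array hashes the full array contents
# to derive a deterministic key -- an O(n) pass over the data.
t0 = time.perf_counter()
da.from_array(x_np, chunks=len(x_np) // 4)
print(f"hashed key:  {time.perf_counter() - t0:.3f}s")

# With an explicit name, the hashing step is skipped. Caveat:
# the caller is now responsible for the name being unique,
# since Dask uses it to identify the underlying data.
t0 = time.perf_counter()
da.from_array(x_np, chunks=len(x_np) // 4, name='x')
print(f"custom name: {time.perf_counter() - t0:.3f}s")
```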
So the current state of my benchmark shows that Dask improves upon pure numpy and numexpr performance, but does not quite reach that of the Cython implementation:
(Figure: benchmark timings for numpy, numexpr, Cython, and Dask.)
The expression evaluated in that benchmark was:

```python
x = da.from_array(x_np, chunks=x_np.shape[0] // CPU_COUNT, name='x')
mx = x.max()
x = (x / mx).sum() * mx
x.compute()
```
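For reference, here is my sketch of the equivalent pure numpy and numexpr expressions such a benchmark compares against (assuming the same `x_np`; the exact code used in the original benchmark may differ):

```python
import numexpr as ne

# pure numpy: evaluated eagerly and single-threaded,
# materializing a temporary array for x_np / mx
mx = x_np.max()
result_np = (x_np / mx).sum() * mx

# numexpr: fuses the division and the sum into a single
# multi-threaded pass over the data
result_ne = ne.evaluate("sum(x_np / mx)") * mx
```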
I plan to upload a revised edition of the slides to Speaker Deck (once I have a decent internet connection again) that includes the improved benchmark, so that they are not misleading for people who stumble upon them without context.
Learnings
What can we conclude from this?
- The overhead of converting a dask array to a numpy array is not as bad as I feared.
- There are two aspects to a benchmark: performance and usability.
- Dask should be watched not only for out-of-core computations, but also for parallelizing simple, blocking numpy expressions (see the sketch below).
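To illustrate that last point, a small sketch of wrapping a blocking numpy expression in Dask (the `scheduler='threads'` keyword assumes a reasonably recent Dask version):

```python
import numpy as np
import dask.array as da

a = np.random.random(10_000_000)

# blocking, single-threaded numpy expression
serial = np.sqrt(a ** 2 + 1.0).sum()

# the same expression on a chunked dask array: each chunk is
# evaluated in parallel by the threaded scheduler
d = da.from_array(a, chunks=len(a) // 4, name='a')
parallel = da.sqrt(d ** 2 + 1.0).sum().compute(scheduler='threads')

assert np.isclose(serial, parallel)
```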