Example: Parallel Speed Increase in an Image Transform

This topic shows how leaving the Allow parallel execution option checked (it is on by default) makes a simple use of the Transform dialog to modify an image run almost four times faster.


Manifold automatically uses multiple CPU cores for parallel processing.   By default it uses every CPU core installed in the system.    The Transform dialog lets us turn parallel execution off if desired.   As a practical matter that is not something we are ever likely to want, but having the option in the Transform dialog makes it easy to construct test cases to see if parallelism really does improve performance.  It does, as this example shows, often even in tasks where the computations are so simple there hardly seems to be a point in parallelizing them.


Before running this example we open the Windows Task Manager and also open the Windows Resource Monitor so we can see cool graphs of how each of our CPU cores is being utilized.


The data set we will use is derived from US government terrain elevation data and bathymetry data (depths of oceans) for the entire world.


We see the entire data set above, colored with the Style dialog using a palette that shows depths of oceans as well as terrain elevations over land.


To see how large the image is we can right-click on the image in the Project pane and choose Properties.  



The Properties dialog tells us the image is 43200 x 21600 pixels, which at 16 bits per pixel is almost two gigabytes.   That is not a particularly large image by Manifold standards but it is big enough to give the processor cores enough work that the time required can be easily measured without splitting hairs over fractions of a second.
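
The size quoted above can be checked with a little arithmetic. The following is a minimal Python sketch, assuming the image holds a single channel stored uncompressed at 2 bytes (16 bits) per pixel; the variable names are illustrative only.

```python
# Dimensions and pixel depth as reported by the Properties dialog.
width_px = 43200
height_px = 21600
bytes_per_pixel = 16 // 8  # 16 bits per pixel

total_pixels = width_px * height_px
total_bytes = total_pixels * bytes_per_pixel

print(f"{total_pixels:,} pixels")
print(f"{total_bytes / 1e9:.2f} GB (decimal)")
print(f"{total_bytes / 2**30:.2f} GiB (binary)")
```

That comes to roughly 1.87 GB of raw pixel data, which agrees with the "almost two gigabytes" figure.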


We choose Edit - Transform to launch the Transform dialog.




We will use the TileMax Manifold SQL function in the same expression used within the Example: Transform Elevation Image to Flatten Bathymetry to Zero topic to "flatten" the ocean depths into a single value of 0 for all pixels with a negative value, that is, indicating a depth below zero sea level.   That is a very simple expression but it will serve to illustrate the point that even very simple calculations can benefit from parallel execution.
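
In plain terms, the expression keeps the larger of each pixel value and zero. The Python sketch below shows that per-pixel rule on a tiny made-up tile. It is not Manifold SQL and does not reproduce the actual expression; the function name `flatten_bathymetry` and the sample values are hypothetical.

```python
def flatten_bathymetry(tile):
    """Return a copy of the tile with every negative value raised to 0."""
    return [[max(value, 0) for value in row] for row in tile]

# A made-up 2 x 4 tile: negative values are ocean depths, positive are land.
sample_tile = [
    [-4200, -35, 0, 12],
    [-1, 870, -600, 2950],
]

flattened = flatten_bathymetry(sample_tile)
print(flattened)
```

Each pixel is computed independently of its neighbors, which is what makes the job easy to divide among many cores.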


We do not change the default setting of the Allow parallel execution option box:  We want to use parallel processing.   We press the Update Field button to execute the expression and to apply the changes to the data.


The result is to replace ocean depths with a flat ocean at sea level while leaving terrain values above zero, that is, terrain elevation values over land, unchanged.   It takes a few seconds for Manifold to accomplish the computation, enough time to glance at the CPU utilization displays in Resource Monitor.



What we see is that indeed, all eight of our CPU cores have been loaded and used while Manifold computed the result.    The system we have used for this example is running Windows 10 on an Intel Core i7 processor with four physical cores that, with hyper-threading turned on, appear as eight cores.   Both Windows and Manifold can use all eight hyper-threaded cores.   Although Resource Monitor provides at best an overall view of CPU loading, it works well enough for us to see that all eight cores are remarkably evenly loaded.  Manifold really has done a good job of chopping up the task into eight reasonably even parts to dispatch in parallel to eight different processors.   No GPU is installed so the performance seen is a result of CPU parallelization only.
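
The general pattern described here (split a job into even parts, dispatch the parts to workers, assemble the results) can be sketched in Python. This illustrates the pattern only and is an assumption about structure, not Manifold's actual dispatcher; `split_evenly` and `flatten_rows` are hypothetical names. Python threads share a single interpreter lock, so the sketch shows how work is divided rather than demonstrating a speed gain.

```python
from concurrent.futures import ThreadPoolExecutor

def split_evenly(rows, parts):
    """Split a list of rows into `parts` contiguous chunks of near-equal size."""
    base, extra = divmod(len(rows), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(rows[start:start + size])
        start += size
    return chunks

def flatten_rows(rows):
    """Per-chunk work: raise every negative value to 0."""
    return [[max(v, 0) for v in row] for row in rows]

# A made-up raster of 20 rows x 6 columns with values from -10 to 9.
raster = [[r - 10 for _ in range(6)] for r in range(20)]

workers = 8  # matches the eight hyper-threaded cores in the example
chunks = split_evenly(raster, workers)

with ThreadPoolExecutor(max_workers=workers) as pool:
    # map() returns results in the same order the chunks were submitted.
    processed = list(pool.map(flatten_rows, chunks))

# Assemble the per-chunk results back into one raster.
result = [row for chunk in processed for row in chunk]
print([len(c) for c in chunks])
```

Keeping the chunk sizes within one row of each other is what produces the kind of even loading visible in Resource Monitor.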




We can open the Log Window to see what Manifold reports as the time required for the computation.  In this case, Manifold reports 10.62 seconds - not bad for an old and slow Core i7.


Next, we repeat the calculation but this time with the Allow parallel execution option box unchecked.   This tells Manifold to not use parallel execution and instead run the calculation as essentially all other Windows applications do: using one processor core at a time.



The Resource Monitor CPU utilization displays tell the sad tale.  Without using parallel execution Manifold looks like any other GIS or DBMS product, with most of the cores not being used most of the time.




The Log Window also reports the slower performance, with exactly the same calculation requiring almost four times longer: about 40.6 seconds.
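
The ratio can be worked out directly from the two Log Window times reported above. A small Python sketch, with illustrative variable names:

```python
parallel_seconds = 10.62  # Allow parallel execution checked
serial_seconds = 40.6     # Allow parallel execution unchecked

speedup = serial_seconds / parallel_seconds
time_saved = 1 - parallel_seconds / serial_seconds

print(f"speedup: {speedup:.2f}x")
print(f"time saved: {time_saved:.0%}")
```

The speedup works out to about 3.82, which is the "almost four times" figure.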


Using GPU Cores

What everyone wants to know, of course, is how much faster we can do computation using massively parallel computations on GPUs, with potentially thousands of cores working at the same time.     The examples above were computed on a machine that had no GPGPU-capable hardware installed, so all computations were done on the Intel Core i7 processor running eight hyper-threaded processor cores.  


We can repeat the computation on a machine with GPGPU-capable hardware.  The result is interesting, but not much different given that the computation is so trivial it is barely worth parallelizing onto CPU cores, let alone being worth the effort to dispatch into hundreds or thousands of GPU cores.



The most interesting result, as seen in the Windows 10 Resource Monitor, is that all eight hyper-threaded processor cores are used to 100% maximum processing capacity.


The time required for the job drops to about 8.5 seconds, about 20% less than without GPGPU.    At this point we are really measuring the internal infrastructure performance of Windows moving around 1.8 GB of data as it is dispatched to multiple processors and multiple GPU cores.  The GPU computations take effectively zero time given the simplicity of the calculation.




The more interesting result is the 100% loading of the CPU cores.   That happens because Manifold's internal parallelization dispatcher is very effective at utilizing multiple CPU cores to dispatch tasks to multiple GPUs with many GPU cores.   The machine used for this example had two GPU cards installed with approximately two hundred GPU cores each, so the eight CPU cores could be used effectively and kept busy dispatching parallelized tasks to the GPUs.   The 100% loading of the CPUs is most likely a result of the intensity of processor activity involved in dispatching many parallel tasks to the GPUs and then receiving the results.


Given that GPGPU capability with modern systems is effectively free of additional cost it is good to see that even with very simple computations Manifold will make the extra effort to use every bit of our machines to speed up as much as possible whatever job we want to do.   When computations are more complex the power of parallel processing can do even more for us than seen in this very simple example.



Multiple trials for better measurement - To better understand the difference between parallel and non-parallel use of CPU cores we should take care to minimize the influence of other factors such as disk access speeds.   We can do that by repeating this computation several times in a row both for the parallel case and the non-parallel case.   For the above measurements we repeated the following cycle:



In each case the starting conditions were as close to being the same as is reasonable given the constant Brownian motion of many background processes in Windows.  By repeating the process we ensured a fair chance at a "warm" launch of both Manifold and the data on which Manifold is working, that is, a reasonably level playing field of having the data within Windows internal cache.  That minimizes the delay caused by fetching data from disk, which is really a measure of disk speed and not how well we are using more than one processor core.


If we do the above cycle five or six times with the Allow parallel execution option box checked and then again five or six times with the option box unchecked we will get a reasonable idea of the difference between parallel and non-parallel execution.   For this particular expression and this data set, if we allow parallel execution, that is, using all CPU cores, the job runs approximately four times faster.    We have paid for all those CPU cores whether we use them or not, so we should be pleased that at zero extra cost, simply by using parallel software like Manifold, we can accomplish even a very simple calculation four times faster.
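
The same measurement discipline applies to any benchmark: run the job several times, treat the first run as a warm-up, and summarize the rest. The following is a minimal Python sketch of that idea, assuming a hypothetical `job` callable standing in for whatever is being timed. It is not how the figures in this topic were produced.

```python
import statistics
import time

def time_trials(job, trials=6, warmup=1):
    """Run `job` repeatedly and return the timings of the non-warm-up runs."""
    timings = []
    for i in range(warmup + trials):
        start = time.perf_counter()
        job()
        elapsed = time.perf_counter() - start
        if i >= warmup:  # discard "cold" runs that mostly measure cache fills
            timings.append(elapsed)
    return timings

def job():
    # Placeholder workload; substitute the operation being measured.
    return sum(max(v, 0) for v in range(-50_000, 50_000))

timings = time_trials(job)
print(f"median of {len(timings)} runs: {statistics.median(timings):.4f} s")
```

Reporting the median rather than a single run reduces the influence of background activity on any one trial.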


Why not eight times faster? - With effectively eight processor cores it is reasonable to ask why the job does not go eight times faster instead of merely four times faster.  The main reason is that some overhead is involved in parallelizing a job: dispatching multiple parts to multiple processors, receiving results and then assembling the result.   The smaller the job, the greater the proportion of the total time that overhead takes compared to the time required for the actual computation.  For small jobs it is quicker to execute the job in a single processor than to spend the overhead required to dispatch it to multiple processors.  For large jobs involving complex computations the overhead is a lower proportion, so the time saved will be closer to proportionate to the increased number of processors.   Disk and memory access are part of the overhead, so simple computations that involve much data will gain less from parallelization than complex jobs.   A secondary reason is that eight hyper-threaded "cores" are not quite the equivalent of eight real processor cores.   Even with no overhead and perfect parallelization efficiency the speed gain that can be achieved with eight hyper-threaded cores will not be eight times the speed of a single physical core.
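
One common way to reason about this is Amdahl's law, which this topic does not itself invoke and which is used here only as a simplifying model. It assumes a fraction p of the job parallelizes perfectly across n workers while the remainder runs serially, giving a speedup of 1 / ((1 - p) + p / n). The Python sketch below inverts that formula for the roughly 3.82x speedup measured above; the function names are illustrative.

```python
def amdahl_speedup(p, n):
    """Speedup predicted by Amdahl's law for parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

def implied_parallel_fraction(speedup, n):
    """Invert Amdahl's law: the p that would yield `speedup` on n workers."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)

observed = 40.6 / 10.62  # serial time / parallel time from the Log Window

for n in (8, 4):  # eight hyper-threaded cores vs. four physical cores
    p = implied_parallel_fraction(observed, n)
    print(f"n={n}: implied parallel fraction {p:.3f}")
```

Under this model the measured result corresponds to about 84% of the work parallelizing if all eight hyper-threaded cores are counted as full workers, or about 98% if only the four physical cores are counted. The model lumps overhead and hyper-threading effects together, so these numbers are a rough guide rather than a measurement.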


See Also





Style: Presenting Images


Style: Palettes


Transform Dialog


Transform Templates - Images




Example: Transform Elevation Image to Flatten Bathymetry to Zero - Using the Transform dialog with an image that contains a single data channel holding terrain elevation data for land together with bathymetry data for oceans, we use the Expression tab of the Transform dialog to reset all pixel values less than zero to zero.   This takes all below-zero elevations and sets them to zero, in effect removing bathymetry effects so that ocean areas are represented with zero elevation.


Example: Zoom In to See Transform Previews for Big Images - A short example showing how previews for the Transform Dialog will appear in large images only when zoomed in far enough so computation of the preview does not cause objectionable delays.


Example: Use a Transform Dialog Expression to Create Buffers in a Drawing


Intermediate Levels