Example: Parallel Speed Increase in an Image Transform

This topic shows how using all CPU cores (on by default) increases speed by a factor of about sixteen in a simple use of the Transform pane to modify an image.

 

Manifold automatically uses multiple CPU cores for parallel processing.   By default the system uses as many CPU cores as are installed in the system.    The Transform pane allows us to disallow parallel execution if desired.   Although as a practical matter that is not something we are likely to ever want, having the option in the Transform pane allows us to easily construct test cases to see if parallelism really does improve performance.  It does, as this example shows, often even in very simple tasks where the computations are so simple there hardly seems to be a point in parallelizing them.

 

Before running this example we open the Windows Task Manager and also open the Windows Resource Monitor so we can see useful graphs of how each of our CPU cores is being utilized.

 

The data set we will use is derived from US government terrain elevation data and bathymetry data (depths of oceans) for the entire world.

 

 

We see the entire data set above, colored with the Style pane using a palette that shows depths of oceans as well as terrain elevations above land.

 

To see how large the image is we can take a look at the Info pane.

 

 

The Info pane tells us the image is 43200 x 21600 pixels, at 16 bits per pixel almost two gigabytes.   That is not a particularly large image by Manifold standards but it is big enough for processor cores to do enough work, if we do a significant computation, so the time required can be measured without splitting hairs over fractions of a second.
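The size figure can be checked with a little arithmetic. The following Python sketch assumes a single 16-bit channel and ignores any storage or tiling overhead; the variable names are ours for illustration and are not Manifold settings.

```python
# Raw size of a single-channel raster, ignoring any storage overhead.
width_px = 43200
height_px = 21600
bytes_per_pixel = 16 // 8  # 16 bits per pixel

total_pixels = width_px * height_px
total_bytes = total_pixels * bytes_per_pixel

print(f"{total_pixels:,} pixels")
print(f"{total_bytes / 1e9:.2f} GB (decimal)")
print(f"{total_bytes / 2**30:.2f} GiB (binary)")
```

Either way of counting gives a figure a little under two gigabytes, which matches what the Info pane reports.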

 

With the focus on the image window, in the Transform pane we choose the Tile field and then double-click the Slope template to launch it.

 

 

In the Slope template, we choose channel 0 for the Channel and slope as the Operation.  We enter 3 as the Radius.  We will not use GPU in this run, so we want to keep the computation small enough that anyone who wants to repeat this example will not be intimidated.   We use Degree as the Unit.

 

For the Result destination, we choose New Table and then enter Slope as the name of the new image with an analogous name for the new table.  

 

For Resources, we change the default setting of all CPU cores, all GPU cores to all CPU cores.    This will force exclusively parallel use of CPU, with no use of GPU.

 

Press Transform.  

 

 

The machine we are using has an AMD Ryzen 9 3900X 12-core CPU, which Windows operates as 24 threads, as if there were 24 processors.   The display above shows the report approximately 40 seconds into the calculation.   The utilization graph for each core jumped from nearly zero to 100% and then stayed at 100% for the duration of the calculation, with all 24 logical cores running at full utilization.

 

What we see is that all 24 of our logical CPU cores are loaded and in use while Manifold computes the result.  The CPU is 100% loaded.     Computing slope is not a particularly difficult calculation, with relatively simple math, but it is a significant enough computation to be worth parallelizing.

 

 

We can open the Log Window to see what Manifold reports as the time required for the computation.  In this case, Manifold reports 177 seconds, about three minutes.   Not bad for a slope computation for the entire world from nearly 2 GB of data.

 

Next, we repeat the calculation but this time with the Resources option set to one CPU core.  

 

 

This tells Manifold to not use parallel execution and instead run the calculation as essentially all other Windows applications do: using only one processor core.   

 

Press Transform.

 

 

The Resource Monitor CPU utilization displays tell the sad tale.  Without using parallel execution Manifold looks like any other GIS or DBMS product, with almost all of the cores not being used at all.   The overall utilization of the CPU is only 5%, with 95% of the CPU being wasted.  
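The 5% figure is about what we would expect from one busy logical core out of 24. The short Python sketch below works that out; it assumes the single core is fully loaded and that any remainder in the Resource Monitor reading comes from background activity, which is our interpretation and not something the display states.

```python
logical_cores = 24  # 12 physical cores, 2 hardware threads each
busy_cores = 1      # the one-core setting used in this run

utilization_pct = 100 * busy_cores / logical_cores
print(f"{utilization_pct:.1f}% of total CPU capacity")
```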

 

 

The resulting performance is much slower.   Instead of taking under 3 minutes the time required jumps to almost 46 minutes, nearly sixteen times longer.
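The ratio can be checked directly from the two reported times. Since the single-core run is only given as "almost 46 minutes," the Python sketch below uses a full 46 minutes as an upper bound, so the true ratio is slightly lower than the number printed.

```python
parallel_seconds = 177          # reported in the Log Window
single_core_seconds = 46 * 60   # "almost 46 minutes", so an upper bound

speedup = single_core_seconds / parallel_seconds
print(f"speedup is at most about {speedup:.1f}x")
```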

 

Notes

Multiple trials for better measurement - To better understand the difference between parallel and non-parallel use of CPU cores we should take care to minimize the influence of other factors such as disk access speeds.   We can do that by repeating this computation several times in a row both for the parallel case and the non-parallel case.   For the above measurements we repeated the following cycle:

 

 

In each case the starting conditions were as close to being the same as is reasonable given the constant Brownian motion of many background processes in Windows.  By repeating the process we ensured a fair chance at a "warm" launch of both Manifold and the data on which Manifold is working, that is, a reasonably level playing field of having the data within Windows internal cache.  That minimizes the delay caused by fetching data from disk, which is really a measure of disk speed and not how well we are using more than one processor core.

 

If we do the above cycle five or six times with the Resources option set to all CPU cores and then again five or six times with the option set to one CPU core we will get a reasonable idea of the difference between parallel and non-parallel execution.   For this data set computing slope, on this particular machine, when using all CPU cores, the job runs approximately sixteen times faster.    Considering that we have paid for all those CPU cores whether we use them or not, we should be pleased that at zero extra cost, simply by using parallel software like Manifold, we can accomplish even a very simple calculation sixteen times faster.
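The general idea of repeated, warm measurements can be sketched in Python as follows. This is only an illustration of the measurement approach, not a way to drive Manifold: the function names time_trials and summarize are hypothetical, and the stand-in workload would need to be replaced by whatever launches the real job.

```python
import statistics
import time


def time_trials(job, trials=5, warmup=1):
    """Run job() several times and return the timed durations in seconds.

    The first `warmup` runs are executed but not recorded, so that caches
    are populated before measurement begins.
    """
    for _ in range(warmup):
        job()
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (median, min, max) for a list of durations."""
    return statistics.median(durations), min(durations), max(durations)


if __name__ == "__main__":
    # Stand-in workload; replace with whatever launches the real job.
    def job():
        sum(i * i for i in range(200_000))

    median, fastest, slowest = summarize(time_trials(job))
    print(f"median {median:.4f}s, range {fastest:.4f}s to {slowest:.4f}s")
```

Reporting the median along with the range makes it easier to spot a trial that was disturbed by a background process.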

 

Why not 24 times faster? - With 24 logical processor cores it is reasonable to ask why the job does not go 24 times faster instead of merely 16 times faster.  The main reason is that some overhead is involved in fetching and storing data, parallelizing a job, dispatching multiple parts to multiple processors, receiving results and then assembling the result.   The smaller the job, the greater the share of the total time that overhead takes compared to the time required for the actual computation.  For very small jobs it is quicker to execute the job in a single processor than to spend the overhead required to dispatch it to multiple processors.  For large jobs involving complex computations the overhead is a lower proportion, so the time saved will be closer to proportional to the number of processors.   Disk and memory access are part of the overhead, so simple computations that involve much data will gain less from parallelization than complex jobs.   A secondary reason is that 24 hyper-threaded "cores" are not quite the equivalent of 24 real processor cores.   Even with no overhead and perfect parallelization efficiency the speed gain that can be achieved with 24 hyper-threaded cores will not be 24 times the speed of a single physical core.
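One common way to reason about such overhead is Amdahl's law, a simplified model that this topic does not itself use. The Python sketch below applies it to the rounded figures above. It assumes all overhead can be lumped into a single serial fraction and that the 24 logical cores behave as 24 equal processors, which, as noted, is not quite true for hyper-threaded cores. The function names are ours.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Predicted speedup when `parallel_fraction` of the work scales perfectly."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def implied_parallel_fraction(speedup, processors):
    """Invert Amdahl's law: the parallel fraction that yields `speedup`."""
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / processors)


processors = 24          # logical cores, treated here as equal processors
observed_speedup = 16    # rounded figure from the measurements above

p = implied_parallel_fraction(observed_speedup, processors)
print(f"implied parallel fraction: {p:.3f}")
print(f"check: {amdahl_speedup(p, processors):.1f}x on {processors} processors")
```

Under these assumptions, a sixteen-fold gain on 24 processors corresponds to roughly 98% of the work running in parallel, so even a small serial share is enough to account for the gap between 16 and 24.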

 

See Also

Selection

 

Images

 

Style: Images

 

Style: Palettes

 

Transform Pane

 

Transform - Tiles

 

Examples

 

Example: Transform Elevation Image to Flatten Bathymetry to Zero - Using the Transform pane with an image, which contains a single data channel for terrain elevation data for land together with bathymetry data for oceans, we use the Expression tab of the Transform pane to reset all pixel values less than zero to zero.   This takes all below-zero elevations and sets them to zero, in effect removing bathymetry effects so that ocean areas are represented with zero elevation.  

 

Intermediate Levels