Personal tools
You are here: Home Documentation SincLib Optimizations
Document Actions

Optimizations

by admin last modified 2007-04-30 16:33

This page discusses various optimizations we use in SincLib in order to achieve faster performance.

Matrix Handling:

Although sinc computations usually do not require many nodes, the matrices involved are dense.  For example, the Poisson problem in two dimensions is of the form (with x representing the Kronecker product):

[(Ix x Ay) + (Ax x Iy)] u = f

Define A = (IxxAy)+(AxxIy).

The matrices Ix and Ax have dimension (Mx+Nx+1)x(Mx+Nx+1); Iy and Ay have dimension (My+Ny+1)x(My+Ny+1).  While the matrices Ax and Ay are usually small (around 200x200 is more than sufficient for most problems), the resulting Kronecker product has a much larger dimension:

(Mx+Nx+1)(My+Ny+1) x (Mx+Nx+1)(My+Ny+1)

If the dimensions of Ax and Ay are both 200x200, they each require 40,000 doubles to store.  This requires about 616 KB of memory to store.  If stored in dense form, A requires 2004=1,600,000,000 doubles, or about 12,207 MB of memory.  This is more memory than most current computers have RAM; certainly, more than all available laptops!

We can do better, as A is a sparse matrix.  We can store all the nonzero entries in about 2*2003=16,000,000 doubles.  This amounts to about 15 MB of memory, much better.  15 MB can easily fit into system RAM, but the program will still be slow as the matrix is too large to fit into L2 caches.

In SincLib, we try a different tack - taking advantage of the matrix structure.  We have the following identity for matrix-vector multiplication:

A u = [(Ix x Ay) + (Ax x Iy)] u = mat(u)*AyT + Ax*mat(u)

Here, mat(u) is the (Mx+Nx+1)x(My+Ny+1) matrix created from the size (Mx+Nx+1)(My+Ny+1) vector u.


Optimizations of Systems of ODEs

This section has not yet been written.


Using the DelayedMath Library

This section has not yet been written.


High Performance Implementations

Consider the sinc quadrature formula:

quadrature-formula.png

Look up the implementation in the module sinc.base.Quadrature.  First, we evaluate f at the sinc nodes, then multiply it by the values of f2 at the respective sinc nodes; this produces a vector of size (M+N+1):
integral = self.integrand( self.nodes ) * self.f2( self.nodes )
Then, we find the sum with the function scipy.sum, and multiply by h:
self.soln = scipy.sum( integral ) * self.h

Discussion

We have broken a "rule" of numerical analysis: we keep the intermediate values in the array integral, instead of "summing as we go".  This is done for multiple reasons:
  • (M+N+1) is usually small (<400), resulting in an array of size 3.1 KB.  This value can stay in the cache of even a decade-old 486 CPU!  Hence, we don't have to worry about hitting main memory in this computation.
  • Python looping constructs (implemented in python byte code) are extremely slow compared to array function implementation (implemented in C); there is a O(100) difference in speed.
  • Array manipulation will allow us to parallelize the computation later with minimal effort due to the structure of the PETSc library.
  • Even for a function g implemented in C, the cost of calling g once for each element of an array is very high compared to functions which must be called once and can manipulate the whole array.  For example, calling sin(X) for the first 40 elements of an array might be a single processor instruction, while something like this:
    for( int i = 0; i < 40; i++ ) { Y[i] = sin(X[i]); }
    might translate into many function calls.
The rule of thumb for SincLib is:
  • Maximize the ratio of array operations compared to python operations.  Never, ever, loop through the array in python!  If a large speedup might be gained by manipulating the array directly, investigate in alternates like SciPy's weave module or implementations in C.
  • Wherever possible, use numerical library functions as opposed python functions.  If using sum, make sure you use scipy's sum, not Python's (there is a 10x difference).  Remember the writers of LAPACK - the underlying math library - have spent much more time working on optimizations than you have available, and they're experts in the subject.

High Performance Python

For a thorough discussion of high performance code in Python, refer to http://www.scipy.org/PerformancePython.  The writer of that tutorial considers several methods for solving Laplace's equation in Python.  The pure python implementation has a runtime of 1500 seconds; the implementation using Numeric arrays takes 29.3 seconds.  Embedding C into the python code has a runtime of 2.3 seconds.  Finally, writing a pure C++ version results in a runtime of 2.16 seconds. 

Although the difference between Numeric and embedded C is a factor of about 10, the Numeric implementation takes significantly less time to code.  If you are writing a large library (such as SincLib), it's often advantageous to spend your time coding more algorithms, profile your code, and re-implement the core portions in C.  We believe this provides the best balance of maximizing speed and minimizing coding time.

Powered by Plone, the Open Source Content Management System