Lately, I’ve become very fond of the Python language for data analysis tasks. For small, off the cuff problems, I usually cobble something together on the command line, using a combination of various shell commands and command-line Perl. But when the problem becomes difficult enough t o warrant writing a script for it, I usually switch to Python. One reason is that, when compared to Perl, Python code is much more aesthetically pleasing – to my eye, at least. Python is fun to write, without bouncing off the letters $@%{} all the time. Notice how those letters by themselves alreadly look like a swearword?

The other reason is the number of modules making data analysis a breeze. A good overview is on Konrad Hinsen’s page on scientific computing with Python. The modules I use most frequently are

- Numeric, a module for vectors, matrices and linear algebra. It supports everything from basic matrix operations like multiplication or inverse to eigenvector decomposition and singular value decomposition. Numeric is also the base for a couple of other modules. A very convenient feature of Numeric are “ufuncs", or universal functions: Most binary operations (like addition, multiplication, etc.) and functions (like sine, cosine, etc) are defined for any combination of scalars, vectors and matrices. If one operand is a scalar, it the operation will be executed, using the scalar value, on every member of the matrix or vector. Likewise, functions are executed for every member of a vector or a matrix. Ufuncs also have special methods “reduce” and “accumulate” that accumulate the results of the function for every member of a matrix or vector.
- Numarray is similar to Numeric (and maintained on the same sourceforge project), but faster for large arrays. It is the designated successor and supposed to be mostly compatible with Numeric, but lacks support for universal functions. A number of older packages still depend on Numeric, which is why both are still being maintained.
- Scientific Python is a collection of modules for a wide range of applications, including vectors and tensors, quaterions, automatic derivation, linear and non-linear regression, and statistics. It also contains interfaces to the MPI library (message-passing interface, for parallel computation) and other libraries, as well as visualization functions.
- VPython is a simple 3D visualization module; I came across it when I wanted to prepare some animations for a talk. Scientific Python contains support for VPython for visualizations.

In every case, there is probably another application more powerful or more suitable to the task; for example, one could probably use either Mathematica, Matlab or R for any of these tasks. There are also numerous C++ and Java libraries.

But there are some important differences:

- Python is free; this also applies to R and most libraries.
- Python has a reasonably shallow learning curve
- Python allows me to pull modules for different tasks into a single application, something not possible with specialized tools. For example, I could do analysis on data stored in an SQL database as easily as I can do it for a text file.
- Python has an interpreter (unlike C++ or Java), allowing me to play with my code and my data, and try out different approaches.

In summary, it is usually really fast and easy to whip up a simple Python application to solve my problems.

The following is an example of how to implement linear regression using Numeric, written before I discovered Scientific Python. Assuming that variables x and y are matrixes with 1 column and n rows:

mx = add.reduce(x) / x.shape[0]
my = add.reduce(y) / y.shape[0]
ssxy = add.reduce((x -mx) * (y -my))
ssxx = add.reduce((x -mx) * (x -mx))
b = ssxy / ssxx
a = my - b * mx

Another thing I needed recently was nonlinear regression – fitting a non-linear function to data points. The Python module Scientific.Functions.LeastSquares will do that, and it will even compute the first derivation of the function by itself (which is necessary for determining the Jacobi matrix used by the Levenberg Marquardt algorithm. Or so I am told.) The resulting program was as small as this:

First, one defines the model that is to be fitted on the data, as a regular Python function. This function takes the parameters to be fitted as a tuple in the first argument, and the x-value of a data point as the second argument. In my case, the data points were supposed to follow a power-law distribution:

def zipf(params, point):
k = params[0]
a = params[1]
return a * pow(point, -k)

The data is assumed to be a list of 2-tuples, ie. tuples (<x -value>, <y -value>). This list is fed to the regression function, along with the model and starting values for the parameters:

(res, chisq) = leastSquaresFit(model, (1, 100), data)

The variable “res” contains the parameter combination providing the best fit, while “chisq” is the sum of the squared errors, a measure for the quality of the fit. More complicated models (with more than two parameters) are also possible. In my case, I decided to stick with a simple power-law distribution – a more complicated model provided a better fit, but with parameters that were way off the mark. The problem of overfitting …

I also implemented simple interval arithmetic in Python, in a fashion roughly similar to this module by Kragen Sitaker, but with support for most operators and mixed arithmetic (where one argument is an interval and one is a scalar.) Because Python has an interpreter, I can simply use it as an calculator for intervals.