Scalable PCA via Hashing
This software provides a highly scalable truncated SVD/PCA implementation using a combination of hashing and randomized linear algebra techniques.
The memory usage is independent of the number of examples so you will never run out of memory.
- veedubparse is required to build. If you git clone that project into ../veedubparse than it will find it, otherwise tweak the Makefile.
- libeigen3 is required to build. We use version 3.1.2.
- We use C++11 features extensively. gcc 4.6.3 works with
-std=c++0x
. - octave is required to run the tests.
-
The basic workflow looks like:
-
Build a model. Using PCA terminology, the (hashed image of) the loadings end up in the model file. Using SVD terminology, the (hashed image of) the right singular components end up in the model file.
pca -c -k 100 -m testmodel inputdata
-
Use a model. Using PCA terminology, the whitened component scores are output. Using SVD terminology, the left singular components are output.
pca -t -m testmodel inputdata
-
-
The software expects input in vowpal wabbit format.
- The label is not used but is required, so basically start all your lines with whitespace.
- The importance weight is used but if omitted defaults to 1.
- The tag if present will be passed through to the output in projection mode. (It is ignored when constructing a model).
-
The software will do 2 streaming passes over the data to build a model. Here are two techniques you can use to avoid intermediate materialization of the input data:
-
You can supply two arguments when building a model and the software will open the first one for the first pass and the second one for the second pass. This is useful when composed with zsh process substitution. For example, with compressed data:
pca -k 100 -m testmodel <(zcat compressed.gz) <(zcat compressed.gz)
-
If you supply one argument when building a model the software will open that argument twice and access the data from each opened version sequentially without seeking. This is useful when composed with named pipes.
-
-
By default the software does the equivalent of the --hash strings option of vowpal wabbit, i.e., the hash of something that parses as an integer is that integer (you can get the equivalent of --hash all using the command line option
-a
). By placing data in namespace 0 and using integer feature values the hashing essentially becomes the identity function (mod the number of hash buckets). For example, a line like> 6.9 mytag|0 1 2:4 28:0.5
will have features 1, 2, and 28 with values 1, 4, and 0.5 respectively. It also has an importance weight of 6.9, no label, and a tag of mytag
.
4. The demos/
directory contains working examples.