Subject: clustering hardware requirements


Here are some numbers from running against http://aws.amazon.com/datasets/7791434387204566 locally:

Raw content size:
9.2 GB, 48K "items" (note: most of the files are gzipped)

It took 15 minutes to convert all of these to sequence files on a single-CPU i7 machine (4 cores with hyper-threading, 3.4 GHz) with 16 GB of RAM.
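For a rough sense of what that means as a rate, here is a minimal Python sketch. It assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes), which the numbers above do not specify, and the function name mb_per_second is made up for illustration.

```python
def mb_per_second(size_gb, minutes):
    """Approximate throughput in MB/s, assuming decimal GB and MB."""
    size_mb = size_gb * 1000.0   # assumption: 1 GB = 1000 MB
    seconds = minutes * 60.0
    return size_mb / seconds

# Raw content (9.2 GB) converted to sequence files in 15 minutes.
rate = mb_per_second(9.2, 15)
print(f"sequence file conversion: about {rate:.1f} MB/s")
```

This is only a back-of-the-envelope figure for one machine; it says nothing about how the job would scale on a cluster.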

After converting to sequence files:
40 GB, 659 items.

Encoded vectors (see build-asf-email.sh), cardinality = 5000: 11 GB for 1,300 items. The conversion took 83 minutes.

Splitting into test and train sets for SGD took 9 minutes. I had to kill the SGD job because of CPU temperature issues on my machine that I still need to track down (SGD really cranks on the CPU, and something is messed up on my machine).

For clustering, converting to sequence files took about the same time.

The job to convert to vectors took a while (the timing scrolled out of my window). The resulting tfidf-vecs were 7.8 GB.
Dictionary:
 82865442 2011-11-21 17:46 dictionary.file-0*
 83269191 2011-11-21 17:46 dictionary.file-1*
 10963133 2011-11-21 17:46 dictionary.file-2*

Freq files:

 37160153 2011-11-21 22:35 frequency.file-0*
 37160173 2011-11-21 22:35 frequency.file-1*
 37160173 2011-11-21 22:35 frequency.file-2*
 31407713 2011-11-21 22:35 frequency.file-3*
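
To put the dictionary and frequency listings above in perspective, here is a small Python sketch that totals the byte counts. The sizes are copied from the listings; the helper name total_mib and the choice of MiB (1024 * 1024 bytes) as the unit are my own assumptions for illustration.

```python
# Byte counts copied from the dictionary and frequency file listings.
dictionary_bytes = [82865442, 83269191, 10963133]
frequency_bytes = [37160153, 37160173, 37160173, 31407713]

def total_mib(sizes):
    """Return the combined size in MiB (1 MiB = 1024 * 1024 bytes)."""
    return sum(sizes) / (1024 * 1024)

for label, sizes in (("dictionary", dictionary_bytes),
                     ("frequency", frequency_bytes)):
    print(f"{label}: {sum(sizes)} bytes, {total_mib(sizes):.1f} MiB")
```

Both sets are small next to the 7.8 GB of tfidf-vecs.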

Total directory size for seq2sparse (from du -s seq2sparse/):

 30923564 seq2sparse/

More as they become available.

HTH,
Grant

On Nov 21, 2011, at 3:57 AM, Ioan Eugen Stan wrote:

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com