=======================================
Blog posts and additional documentation
=======================================

Hashtable and filtering
=======================

The basic inexact-matching approach used by the hashtable code is
described in this blog post:

   http://ivory.idyll.org/blog/jul-10/kmer-filtering

A test data set (soil metagenomics, 88m reads, 10gb) is here:

   http://angus.ged.msu.edu.s3.amazonaws.com/88m-reads.fa.gz

Illumina read abundance profiles
================================

khmer can be used to look at systematic variations in k-mer statistics
across Illumina reads; see, for example, this blog post:

   http://ivory.idyll.org/blog/jul-10/illumina-read-phenomenology

The ``fasta-to-abundance-hist`` and ``abundance-hist-by-position``
scripts can be used to generate the k-mer abundance profile data, after
loading all the k-mer counts into a .kh file::

   # first, load all the k-mer counts:
   load-into-counting.py -k 20 -x 1e7 25k.kh data/25k.fq.gz

   # then, build the '.freq' file that contains all of the counts by position
   python sandbox/fasta-to-abundance-hist.py 25k.kh data/25k.fq.gz

   # sum across positions.
   python sandbox/abundance-hist-by-position.py data/25k.fq.gz.freq > out.dist

The hashtable method 'dump_kmers_by_abundance' can be used to dump
high abundance k-mers, but we don't have a script handy to do that yet.

You can assess high/low abundance k-mer distributions with the
``hi-lo-abundance-by-position`` script::

   load-into-counting.py -k 20 25k.kh data/25k.fq.gz
   python sandbox/hi-lo-abundance-by-position.py 25k.kh data/25k.fq.gz

This will produce two output files, .pos.abund=1 and .pos.abund=255.
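
To make the idea of a per-position high/low abundance profile concrete,
here is a minimal pure-Python sketch. It is not khmer's implementation
and does not use the khmer API: it counts k-mers exactly with
``collections.Counter`` instead of a counting hashtable, ignores reverse
complements, and assumes (as the ``abund=255`` suffix suggests) that
counts saturate at a maximum value. The function names and the
``max_count`` parameter are hypothetical and chosen for illustration
only.

.. code-block:: python

   from collections import Counter


   def count_kmers(reads, k):
       """Count every k-mer across all reads (exact counts, forward strand only)."""
       counts = Counter()
       for read in reads:
           for pos in range(len(read) - k + 1):
               counts[read[pos:pos + k]] += 1
       return counts


   def hi_lo_abundance_by_position(reads, k, max_count=255):
       """Tally, per read position, k-mers seen once and k-mers at the cap.

       max_count is an assumed saturation value for the counters.
       Returns two Counters mapping position -> number of k-mers.
       """
       counts = count_kmers(reads, k)
       lo, hi = Counter(), Counter()
       for read in reads:
           for pos in range(len(read) - k + 1):
               # Clamp the count to mimic a saturating counter.
               n = min(counts[read[pos:pos + k]], max_count)
               if n == 1:
                   lo[pos] += 1
               elif n == max_count:
                   hi[pos] += 1
       return lo, hi


   if __name__ == "__main__":
       # Toy reads; a small cap is used so the "high" bin is non-empty.
       reads = ["ACGTAC", "ACGTAC", "ACGTTT"]
       lo, hi = hi_lo_abundance_by_position(reads, k=4, max_count=3)
       print("abund=1 by position:", dict(lo))
       print("abund=max by position:", dict(hi))

In this toy input the unique k-mers sit at the ends of the third read,
while the shared k-mer at position 0 reaches the cap. A profile like this
shows whether rare k-mers cluster at particular read positions.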