http://sourceforge.net/projects/het-smooth/
equencing technologies, such as Illumina sequencing, provide the sequences of
short "reads" of DNA that come from random positions on the genome. These reads
then must be assembled de-novo into the original genome, or, if there is a
reference genome available mapped onto the reference genome. These tasks, especially de-novo assembly, become more difficult if the genome is heterozygous.
het-smooth is a experimental program to smooth out heterozygosity in DNA sequence reads by identifying isolated SNPs and changing each one to only one of the heterozygous variants. It is intended for use in sequence data from diploid genomes. It accepts input data in FASTA or FASTQ format, and for each input file, it writes an output FASTA or FASTQ file containing the reads with a reduced rate of heterozygosity. See the README for more details.
http://sourceforge.net/p/het-smooth/code/ci/master/tree/
安装:
git clone git://git.code.sf.net/p/het-smooth/code
编辑Makefile
找到自己jellyfish的安装包,如果找不到就再下载一个。
改过以后,sudo make.
Read Me
INTRODUCTION het-smooth is a program to smooth out heterozygosity in DNA sequence reads. Sequencing technologies, such as Illumina sequencing, provide the sequences of short "reads" of DNA that come from random positions on the genome. These reads then must be assembled de-novo into the original genome, or, if there is a reference genome available mapped onto the reference genome. These tasks, especially de-novo assembly, become more difficult if the genome is heterozygous. All diploid organisms have some rate of heterozygosity (differences between each chromosome set), but some organisms in particular have a very high rate of heterozygosity. `het-smooth' is an experimental program to reduce the heterozygosity rate of reads. It accepts input data in FASTA or FASTQ format. INSTALLATION `het-smooth' must be compiled and run on a modern UNIX system, such as GNU/Linux. It currently must be compiled with g++ because it uses built-in atomic functions. `het-smooth' also requires Jellyfish (http://www.cbcb.umd.edu/software/jellyfish/) to be installed because it relies on it for k-mer counting. `het-smooth' should be able to spawn the `jellyfish' binary as well as link to `libjellyfish.so'. Edit the Makefile to make sure that JELLYFISH_CXXFLAGS and JELLYFISH_LDFLAGS correctly specify the location of the Jellyfish library and headers. After installing Jellyfish and editing the Makefile, run `make' to build the binary `het-smooth'. There is no `make install' yet; just copy the binary somewhere if you want to. USAGE Usage: het-smooth (FASTA_FILE | FASTQ_FILE)... Reduce the level of heterozygosity in the read set contained in the specified FASTA or FASTQ files, which may be gzipped. -k, --kmer-len=LEN use a kmer size of LEN; must be odd (default: 19) -B, --bottom-threshold=NUM use NUM as the lower bound on the kmer coverage for heterozygosity -T, --top-threshold=NUM use NUM as the upper bound on the kmer coverage for heterozygosity -C, --correct-erroneous-kmers do not place a lower bound on the coverage of k-mers for them to be corrected. This may result in the correction of some of the erroneous k-mers. However, this may not work as well as dedicated error-correction programs such as Quake. -j, --jellyfish-hash-file=FILE use existing kmer counts hash table for the reads produced by the Jellyfish program in FILE -s, --jellyfish-hash-size=SIZE use SIZE as the Jellyfish hash size parameter (default: 100000000) -t, --threads=NUM use NUM threads for Jellyfish kmer counting and for processing reads files in parallel (default: number of processors) -Q, --quality-weight weight kmer counts by quality -q, --quality-start=NUM assume that the lowest quality value starts at NUM (default: 64) (only used if -Q given) -l, --kmer-replacements-log=FILE log all kmer replacement pairs to FILE (default: don't log) --no-multibase-replacements do not do multibase replacements --no-reads-log do not log read changes -h, --help display this help and exit The essential options to set are --kmer-len, --bottom-threshold, and --top-threshold. Furthermore, it is recommended to give --no-multibase-replacements until the usefulness of the multibase replacements algorithm is verified. We have been using --kmer-len=23 for the pineapple genome, which is a ~400 megabase, very repetitive genome. Smaller, less repetitive genomes could do with --kmer-len=19, or perhaps 17 or 21. If you run `jellyfish count' on your data to count k-mers of this k-mer length and then run `jellyfish histo', you will see how many k-mers have a given coverage level. If you plot this histogram, for a heterozygous genome it will be bimodal; the lower peak is for k-mers that appear in the diploid genome one time, while the upper peak is for k-mers that appear in the diploid genome two times (probably once in each chromosome set). --bottom-threshold should be set to approximately the first minimum in the k-mer coverage plot near the beginning of the lower peak, while --top-threshold should be set to a little past the second minimum in the k-mer coverage plot, a little past the end of the lower peak. The goal is for these thesholds to include the coverage levels of k-mers that appear only once in the diploid genome, and are therefore in heterozygous regions. An example of using the program would be: het-smooth --kmer-len=23 --bottom-threshold=38 --top-threshold=220 \ --no-multibase-replacements --jellyfish-hash-file=23-mers.jf \ reads_1.fq reads_2.fq In this example, the program will produce new files (reads_1.het-smoothed.fq and reads_2.het-smoothed.fq) that contain the heterozygosity-smoothed reads. There also will be files reads_1.het-smoothed.log and reads_2.het-smoothed.log that list the original and new reads for each modified read, although this can be disabled by the --no-reads-log option; furthermore, it may be more useful to log the actual k-mer replacements that were computed by providing the {\tt --kmer-replacements-log option (this probably should be made the default). The name of the Jellyfish hash table containing 23-mer counts is given as 23-mers.jf, but if you do not specify this option, het-smooth will spawn a jellyfish process that will produce the hash table from the FASTQ files specified for heteterozygosity smoothing. het-smooth will strip the directory name from the input files when it determines the output files, so the output files will be written in the current directory. The --quality-weight option currently does not work. ALGORITHM We designed and implemented an algorithm to preprocess the libraries to remove some of the heterozygosity. Although we have used this algorithm to improve the pineapple genome assembly, it is primarily a proof-of-concept algorithm; it eventually would be better to integrate the algorithm (or a modified version of it) into Allpaths-LG, so that it could assemble highly heterozygous genomes by default. Our algorithm analyzes the reads at the level of k-mers. The basic idea is that if we use an odd value of k, we can go through each k-mer that appears in the genome and see if there exists a lexicographically smaller k-mer in the genome that differs only by the center base. If so, we can replace the former k-mer with the latter k-mer every time it appears in the reads. Reverse complements can be handled by only considering canonical k-mers when looking for replacements, but then replacing both the foward and reverse complement versions of the k-mer with the correctly oriented replacement when actually doing the replacements. The effect of this algorithm is to remove SNPs from the data. Ideally, each SNP would be changed to only one of the two variants, thereby removing the SNP from the data given to the assembler. There are many problems with this heterozygosity smoothing algorithm, some of which we have addressed in our implementation, and some of which we have not yet been able to address. The algorithm requires knowing which k-mers appear in the genome and which do not. We estimate this from the reads by examining k-mer coverage. We require the "from" and "to" k-mers to have coverage high enough to make it unlikely for either k-mer to contain a sequencing error. Furthermore, to reduce the chances of incorrect replacements, which would produce misassembled sequence, from being made, we only add a replacement if there is exactly one valid replacement given the possible substitutions of the middle base. We also establish an upper bound on the coverage for the "from" k-mer, since repeated sequences can cause slightly differing sequences to show up on the same chromosome set, and these could easily be mistaken for SNPs. If we were to restrict k-mer replacements only to the middle base in each k-mer, this would prevent any changes from being made to the first k/2 - 1 or the last k/2 - 1 k-mers of each read. This is a big problem because a SNP, even in the best case, would not get smoothed out in reads for which the SNP site happens to be near one of the ends of the read. To solve this problem, we add an additional step to the algorithm; after adding the replacements as already described, we go through the reads and look for the locations where the middle-base replacements would be made. For each such location, we examine the surrounding k/2-1 k-mers and add replacements containing the SNP site as one of the non-center base pairs, provided that the coverage falls within the bounds mentioned earlier. Later, when actually doing the replacements, these extra replacements are only actually done if the k-mer appears as the first or last k-mer in the read. LIMITATIONS The algorithm currently only smooths out SNPs that are separated from other SNPs by at least k/2 - 1 base pairs. If there is a cluster of SNPs, they will not be smoothed out. The algorithm does not handle insertions, deletions, or inversions. These are believed to be much rarer than SNPs, but this may still be a huge limitation. Heterozygous sites that occur very close to sequencing errors are unlikely to be smoothed out. There is no built-in capability to this program to recover the heterozygosity after the assembly; other tools would have to be used for this (you would want to align the original reads to the draft genome and call the SNPs that were removed). CORRECTNESS OF HETEROZYGOSITY SMOOTHING By chance, it is possible for this program to make changes to the reads that would not be strongly supported when considering the larger context. It is important to using a large enough k-mer size to reduce the chance of incorrect replacements from being made. Providing the --no-multibase-replacements argument is recommended because on large datasets it is much too likely for two k-mers to differ by only two base pairs purely by chance. Providing the --correct-erroneous-kmers option is NOT recommended. PERFORMANCE The program is multi-threaded, and the running time of the program is roughly linear with the number of reads that must be processed. Computing the middle-base replacements is fast compared to computing the edge-base replacements and then doing the k-mer replacements. We have used the program to reduce the heterozygosity in 235 GB of reads from the pineapple genome, and the running time was several hours on a server-class computer. TODO The algorithm needs to be improved. Make the program automatically deteremine the lower and upper thresholds for heterozygosity. It probably would make more sense for stand-alone assemblers to be able to assemble highly heterozygous data, rather than have a separate program to smooth out heterozygosity. LICENCE This code is licensed under the GNU General Public License version 3 or later. There is no warranty whatsoever. See COPYING for details. SPONSORS The development of this algorithm and program was done as part of Cold Spring Harbor Laboratory's Undergraduate Research Program. Funding for this program is provided in part by the National Science Foundation. QUESTIONS / BUGS / SUGGESTIONS / PATCHES Eric Biggers ebiggers3 at gmail.com Michael Schatz mschatz at cshl.edu
freemao
FAFU