java -jar trimmomatic-0.30.jar PE s_1_1_sequence.txt.gz s_1_2_sequence.txt.gz \
  lane1_forward_paired.fq.gz lane1_forward_unpaired.fq.gz lane1_reverse_paired.fq.gz \
  lane1_reverse_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 \
  TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
This will perform the following:
- Remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10)
- Remove leading low quality or N bases (below quality 3) (LEADING:3)
- Remove trailing low quality or N bases (below quality 3) (TRAILING:3)
- Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)
- Drop reads shorter than 36 bases (MINLEN:36)
Purpose: trim, crop, remove adapters
Supports gz- and bz2-compressed fastq files
ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
- ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>
- fastaWithAdaptersEtc: specifies the path to a fasta file containing all the adapters, PCR sequences etc. The naming of the various sequences within this file determines how they are used. See below.
- seedMismatches: specifies the maximum mismatch count which will still allow a full match to be performed
- palindromeClipThreshold: specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment.
- simpleClipThreshold: specifies how accurate the match between any adapter etc. sequence must be against a read.
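To make the mapping between the colon-delimited fields and the parameter names above concrete, here is a minimal Python sketch. The function name `parse_illuminaclip` and the dictionary layout are illustrative assumptions, not part of Trimmomatic; it also assumes the fasta path itself contains no colon.

```python
def parse_illuminaclip(step):
    """Split an ILLUMINACLIP step string into its four named settings."""
    # Assumption: exactly five colon-delimited fields, as in the syntax above.
    name, fasta, seed, palindrome, simple = step.split(":")
    if name != "ILLUMINACLIP":
        raise ValueError("not an ILLUMINACLIP step: " + step)
    return {
        "fastaWithAdaptersEtc": fasta,
        "seedMismatches": int(seed),
        "palindromeClipThreshold": int(palindrome),
        "simpleClipThreshold": int(simple),
    }


settings = parse_illuminaclip("ILLUMINACLIP:TruSeq3-PE.fa:2:30:10")
```

For the example command, this gives 2 seed mismatches, a palindrome clip threshold of 30 and a simple clip threshold of 10.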
SLIDINGWINDOW: Performs a sliding window trimming approach. It starts scanning at the 5' end and clips the read once the average quality within the window falls below a threshold.
MAXINFO: An adaptive quality trimmer which balances read length and error rate to maximise the value of each read
LEADING: Cut bases off the start of a read, if below a threshold quality
TRAILING: Cut bases off the end of a read, if below a threshold quality
CROP: Cut the read to a specified length by removing bases from the end
HEADCROP: Cut the specified number of bases from the start of the read
MINLEN: Drop the read if it is below a specified length
AVGQUAL: Drop the read if the average quality is below the specified level
TOPHRED33: Convert quality scores to Phred-33
TOPHRED64: Convert quality scores to Phred-64
Trimming occurs in the order in which the steps are specified on the command line. It is recommended in most cases that adapter clipping, if required, is done as early as possible, since correctly identifying adapters using partial matches is more difficult.
The Adapter Fasta
Illumina adapter and other technical sequences are copyrighted by Illumina, but we have been granted permission to distribute them with Trimmomatic. Suggested adapter sequences are provided for TruSeq2 (as used in GAII machines) and TruSeq3 (as used by HiSeq and MiSeq machines), for both single-end and paired-end mode. These sequences have not been extensively tested, and depending on specific issues which may occur in library preparation, other sequences may work better for a given dataset.
To make a custom version of the adapter fasta, you must first understand how it will be used. Trimmomatic uses two strategies for adapter trimming: Palindrome and Simple.
With 'simple' trimming, each adapter sequence is tested against the reads, and if a sufficiently accurate match is detected, the read is clipped appropriately.
'Palindrome' trimming is specifically designed for the case of 'reading through' a short fragment into the adapter sequence on the other end. In this approach, the appropriate adapter sequences are 'in silico ligated' onto the start of the reads, and the combined adapter+read sequences, forward and reverse are aligned. If they align in a manner which indicates 'read-through', the forward read is clipped and the reverse read dropped (since it contains no new data).
Naming of the sequences indicates how they should be used. For 'Palindrome' clipping, the sequence names should both start with 'Prefix', and end in '/1' for the forward adapter and '/2' for the reverse adapter. All other sequences are checked using 'simple' mode. Sequences with names ending in '/1' or '/2' will be checked only against the forward or reverse read. Sequences not ending in '/1' or '/2' will be checked against both the forward and reverse read. If you want to check for the reverse-complement of a specific sequence, you need to specifically include the reverse-complemented form of the sequence as well, with another name.
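The naming rules above can be restated as a small Python sketch. This only mirrors the rules as described in these notes; `classify_adapter` and the example names are hypothetical, and this is not Trimmomatic's own code.

```python
def classify_adapter(name):
    """Return (mode, reads) for a sequence name from the adapter fasta."""
    # A '/1' or '/2' suffix restricts the sequence to one read of the pair.
    if name.endswith("/1"):
        reads = "forward"
    elif name.endswith("/2"):
        reads = "reverse"
    else:
        reads = "both"
    # 'Prefix...' names ending in '/1' or '/2' form the palindrome pair.
    if name.startswith("Prefix") and reads != "both":
        return ("palindrome", reads)
    # Everything else is checked in 'simple' mode.
    return ("simple", reads)
```

For example, a name such as `PrefixPE/1` would be treated as the forward palindrome adapter, `MyAdapter/2` as a simple adapter checked only against the reverse read, and `MyAdapter` as a simple adapter checked against both reads.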
The thresholds used are a simplified log-likelihood approach. Each matching base adds just over 0.6, while each mismatch reduces the alignment score by Q/10. Therefore, a perfect match of a 12 base sequence will score just over 7, while 25 bases are needed to score 15. As such we recommend values between 7 - 15 for this parameter. For palindromic matches, a longer alignment is possible - therefore this threshold can be higher, in the range of 30. The 'seed mismatch' parameter is used to make alignments more efficient, specifying the maximum base mismatch count in the 'seed' (16 bases). Typical values here are 1 or 2.
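The arithmetic behind these recommended values can be checked with a short Python sketch. It assumes the per-match bonus is log10(4), about 0.602, which is consistent with "just over 0.6" but is an assumption here; `alignment_score` is an illustrative helper, not a Trimmomatic function.

```python
import math

# Assumption: each matching base contributes log10(4), about 0.602.
MATCH_SCORE = math.log10(4)


def alignment_score(n_matches, mismatch_quals=()):
    """Matches add about 0.6 each; each mismatch subtracts Q/10."""
    penalty = sum(q / 10 for q in mismatch_quals)
    return n_matches * MATCH_SCORE - penalty


print(round(alignment_score(12), 2))  # 7.22
print(round(alignment_score(25), 2))  # 15.05
```

Under this assumption a single mismatch at quality 30 costs 3 points, roughly as much as five matching bases contribute.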
For input files, either of the following can be used:
- Explicitly naming the 2 input files
- Naming the forward file using the -basein flag, where the reverse file can be determined automatically.
The second file is determined by looking for common patterns of file naming, and changing the appropriate character to reference the reverse file.
Examples which should be correctly handled include:
Sample_Name_R1_001.fq.gz -> Sample_Name_R2_001.fq.gz
Sample_Name.f.fastq -> Sample_Name.r.fastq
Sample_Name.1.sequence.txt -> Sample_Name.2.sequence.txt
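A rough Python sketch of this kind of name substitution is given below. The pattern list is a guess that covers only the three examples above; the patterns Trimmomatic actually recognises may differ, and `guess_reverse_name` is a hypothetical helper.

```python
import re

# Hypothetical patterns, covering only the three examples in these notes.
_PATTERNS = [
    (re.compile(r"_R1_"), "_R2_"),
    (re.compile(r"\.f\."), ".r."),
    (re.compile(r"\.1\."), ".2."),
]


def guess_reverse_name(forward):
    """Return the guessed reverse file name, or None if no pattern matches."""
    for pattern, replacement in _PATTERNS:
        if pattern.search(forward):
            # Replace only the first occurrence of the forward marker.
            return pattern.sub(replacement, forward, count=1)
    return None
```

When no pattern matches, the sketch returns None, which corresponds to the case where the two input files have to be named explicitly.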
For output files, either of the following can be used:
- Explicitly naming the 4 output files
- Providing a base file name using the -baseout flag, from which the 4 output files can be derived.
If the name “mySampleFiltered.fq.gz” is provided, the following 4 file names will be used:
mySampleFiltered_1P.fq.gz - for paired forward reads
mySampleFiltered_1U.fq.gz - for unpaired forward reads
mySampleFiltered_2P.fq.gz - for paired reverse reads
mySampleFiltered_2U.fq.gz - for unpaired reverse reads
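The derivation of the four names can be sketched in Python as follows. It assumes that everything from the first '.' onward counts as the extension and that the base name carries no directory part; both are assumptions made for illustration, and `derive_output_names` is not a Trimmomatic function.

```python
def derive_output_names(baseout):
    """Insert _1P, _1U, _2P and _2U before the extension of the base name."""
    # Assumption: the extension starts at the first '.' in the name.
    dot = baseout.find(".")
    if dot == -1:
        stem, ext = baseout, ""
    else:
        stem, ext = baseout[:dot], baseout[dot:]
    return [stem + suffix + ext for suffix in ("_1P", "_1U", "_2P", "_2U")]
```

Calling `derive_output_names("mySampleFiltered.fq.gz")` reproduces the four file names listed above, in the same order.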
Most processing steps take one or more settings, delimited by ':' (a colon).
SLIDINGWINDOW:<windowSize>:<requiredQuality>
Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold. By considering multiple bases, a single poor quality base will not cause the removal of high quality data later in the read.
windowSize: specifies the number of bases to average across
requiredQuality: specifies the average quality required.
The SLIDINGWINDOW trimmer cuts at the leftmost position of the window in which the average quality drops below the threshold, and removes the rest of the read. However, if there is low quality at the very beginning of the read, the read will fail the minimum length tests and be removed completely; the remaining 3' end, even if it is of good quality, will not be printed.
Consider the following test file t.fq
and processing with:
$ java -jar trimmomatic.jar SE -phred64 t.fq tt.fq SLIDINGWINDOW:4:15
TrimmomaticSE: Started with arguments: -phred64 t.fq tt.fq SLIDINGWINDOW:4:15
Automatically using 16 threads
Input Reads: 2 Surviving: 1 (50.00%) Dropped: 1 (50.00%)
TrimmomaticSE: Completed successfully
The output file looks like the following:
As you can see, the read with the poor quality at the beginning has been removed completely.
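The SLIDINGWINDOW behaviour described above can be illustrated with a simplified Python sketch. It works on a list of numeric quality scores and cuts at the start of the first failing window; the real trimmer may handle details such as reads shorter than the window differently, and `sliding_window_keep` is an illustrative name.

```python
def sliding_window_keep(quals, window_size, required_quality):
    """Return how many leading bases survive a simplified sliding-window trim."""
    for start in range(len(quals) - window_size + 1):
        window = quals[start:start + window_size]
        if sum(window) / window_size < required_quality:
            # Cut at the leftmost position of the first failing window.
            return start
    # No window failed (or the read is shorter than one window): keep all.
    return len(quals)


good_then_bad = [30] * 8 + [2] * 4
bad_then_good = [2] * 4 + [30] * 8
```

With a window of 4 and a required quality of 15, `good_then_bad` keeps its first 7 bases, while `bad_then_good` keeps 0 bases: the very first window already fails, so nothing is left of the read even though its 3' end is good.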
LEADING:<quality>
Remove low quality bases from the beginning. As long as a base has a value below this threshold the base is removed and the next base will be investigated.
quality: Specifies the minimum quality required to keep a base.
TRAILING:<quality>
Remove low quality bases from the end. As long as a base has a value below this threshold the base is removed and the next base (which, as Trimmomatic is starting from the 3' end, would be the base preceding the just removed base) will be investigated. This approach can be used for removing the special Illumina 'low quality segment' regions (which are marked with a quality score of 2), but we recommend Sliding Window or MaxInfo instead.
quality: Specifies the minimum quality required to keep a base.
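Both LEADING and TRAILING stop at the first base that meets the threshold, which the following minimal Python sketch tries to show. The helper names are illustrative, and the input is assumed to be a list of already decoded numeric quality scores.

```python
def trim_leading(quals, quality):
    """Return the number of bases removed from the start of the read."""
    removed = 0
    while removed < len(quals) and quals[removed] < quality:
        removed += 1
    return removed


def trim_trailing(quals, quality):
    """Return the number of bases kept after trimming the end of the read."""
    kept = len(quals)
    while kept > 0 and quals[kept - 1] < quality:
        kept -= 1
    return kept


quals = [2, 2, 20, 30, 2, 30, 2, 2]
```

With a threshold of 3, `trim_leading(quals, 3)` removes 2 bases and `trim_trailing(quals, 3)` keeps the first 6. The low quality base in the middle is untouched by either step, which is one reason a sliding window is often preferred.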
CROP:<length>
Removes bases regardless of quality from the end of the read, so that the read is at most the specified length after this step has been performed. Steps performed after CROP might of course further shorten the read.
length: The number of bases to keep, from the start of the read.
HEADCROP:<length>
Removes the specified number of bases, regardless of quality, from the beginning of the read.
length: The number of bases to remove from the start of the read.
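In terms of plain string slicing, the two cropping steps amount to the following Python sketch. The function names are illustrative; in a real FASTQ record the quality string would need the same cut as the sequence.

```python
def crop(seq, length):
    """Keep at most `length` bases from the start of the read."""
    return seq[:length]


def headcrop(seq, length):
    """Remove `length` bases from the start of the read."""
    return seq[length:]
```

For the read `ACGTACGT`, `crop(seq, 5)` gives `ACGTA` and `headcrop(seq, 3)` gives `TACGT`; a read already shorter than the CROP length is left unchanged.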
Paired End:
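First remove the leading N bases, then keep the first 40 bases. The drawback is that some reads have an N at the sixth base, which cannot be removed this way! I will probably have to write a script, but a script would be too slow... I will ask my teacher some other day whether there is a good tool for this!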
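As a starting point for such a script, here is a minimal Python sketch of the per-read logic only: strip leading N bases, keep the first 40 bases, and reject the read if an N remains. `trim_and_filter` is a hypothetical helper; reading and writing FASTQ files, and keeping paired-end mates in sync when one of them is rejected, are not handled here.

```python
def trim_and_filter(seq, qual, keep=40):
    """Strip leading Ns, keep the first `keep` bases, reject reads with an N left."""
    # Count the leading N bases so the quality string gets the same cut.
    n_lead = len(seq) - len(seq.lstrip("N"))
    seq = seq[n_lead:n_lead + keep]
    qual = qual[n_lead:n_lead + keep]
    if "N" in seq:
        return None  # e.g. an N at the sixth base
    return seq, qual
```

For example, `trim_and_filter("NNACGT", "IIIIII", keep=3)` returns `("ACG", "III")`, while a read such as `ACGTANCC` is rejected because of the internal N.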