Total Number of Reads in Bam File
bamCoverage¶
- Required arguments
- Output
- Optional arguments
- Read coverage normalization options
- Read processing options
- Usage hints
- Usage example for ChIP-seq
- Usage examples for RNA-seq
- Regular bigWig track
- Split tracks for each strand
- Versions before 2.2
If you are not familiar with BAM, bedGraph and bigWig formats, you can read up on them in our Glossary of NGS terms.
This tool takes an alignment of reads or fragments as input (BAM file) and generates a coverage track (bigWig or bedGraph) as output. The coverage is calculated as the number of reads per bin, where bins are short consecutive counting windows of a defined size. It is possible to extend the length of the reads to better reflect the actual fragment length. bamCoverage offers normalization by scaling factor, Reads Per Kilobase per Million mapped reads (RPKM), counts per million (CPM), bins per million mapped reads (BPM) and 1x depth (reads per genome coverage, RPGC).
An example usage is:
$ bamCoverage -b reads.bam -o coverage.bw
Required arguments¶
--bam, -b | BAM file to process |
Output¶
--outFileName, -o | |
Output file name. | |
--outFileFormat, -of | |
Possible choices: bigwig, bedgraph Output file type. Either "bigwig" or "bedgraph". |
Optional arguments¶
--scaleFactor | The computed scaling factor (or 1, if not applicable) will be multiplied by this. (Default: 1.0) |
--MNase | Determine nucleosome positions from MNase-seq data. Only 3 nucleotides at the center of each fragment are counted. The fragment ends are defined by the two mate reads. Only fragment lengths between 130 - 200 bp are considered to avoid dinucleosomes or other artifacts. By default, any fragments smaller or larger than this are ignored. To override this, use the --minFragmentLength and --maxFragmentLength options, which will default to 130 and 200 if not otherwise specified in the presence of --MNase. NOTE: Requires paired-end data. A bin size of 1 is recommended. |
--Offset | Uses this offset inside of each read as the signal. This is useful in cases like RiboSeq or GROseq, where the signal is 12, 15 or 0 bases past the start of the read. This can be paired with the --filterRNAstrand option. Note that negative values indicate offsets from the end of each read. A value of 1 indicates the first base of the alignment (taking alignment orientation into account). Likewise, a value of -1 is the last base of the alignment. An offset of 0 is not permitted. If two values are specified, then they will be used to specify a range of positions. Note that specifying something like --Offset 5 -1 will result in the 5th through last position being used, which is equivalent to trimming 4 bases from the 5-prime end of alignments. Note that if you specify --centerReads, the centering will be performed before the offset. |
--filterRNAstrand | |
Possible choices: forward, reverse Selects RNA-seq reads (single-end or paired-end) originating from genes on the given strand. This option assumes a standard dUTP-based library preparation (that is, --filterRNAstrand=forward keeps minus-strand reads, which originally came from genes on the forward strand using a dUTP-based method). Consider using --samFlagExclude instead for filtering by strand in other contexts. | |
--version | show program's version number and exit |
--binSize, -bs | Size of the bins, in bases, for the output of the bigwig/bedgraph file. (Default: 50) |
--region, -r | Region of the genome to limit the operation to. This is useful when testing parameters to reduce the computing time. The format is chr:start:end, for example --region chr10 or --region chr10:456700:891000. |
--blackListFileName, -bl | |
A BED or GTF file containing regions that should be excluded from all analyses. Currently this works by rejecting genomic chunks that happen to overlap an entry. Consequently, for BAM files, if a read partially overlaps a blacklisted region or a fragment spans over it, then the read/fragment might still be considered. Please note that you should adjust the effective genome size, if relevant. | |
--numberOfProcessors, -p | |
Number of processors to use. Type "max/2" to use half the maximum number of processors or "max" to use all available processors. (Default: 1) | |
--verbose, -v | Set to see processing messages. |
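The --Offset rules described above (1-based positions, negative values counted from the end of the alignment, 0 not permitted) can be sketched as plain arithmetic. The helper `offset_range` below is hypothetical and not part of deepTools; it only restates the rules from the option description and ignores strand orientation.

```shell
# Hypothetical helper, not part of deepTools: translate --Offset values into
# 1-based positions along an alignment of a given length.
# Usage: offset_range <alignment_length> <offset> [<second_offset>]
offset_range() {
    len=$1
    first=$2
    last=${3:-$2}   # a single value selects a single position
    if [ "$first" -eq 0 ] || [ "$last" -eq 0 ]; then
        echo "an offset of 0 is not permitted" >&2
        return 1
    fi
    # Negative values count from the end: -1 is the last base.
    if [ "$first" -lt 0 ]; then first=$(( len + first + 1 )); fi
    if [ "$last" -lt 0 ]; then last=$(( len + last + 1 )); fi
    echo "$first-$last"
}

offset_range 36 5 -1   # --Offset 5 -1 on a 36-base alignment
offset_range 36 12     # --Offset 12
offset_range 36 -1     # --Offset -1
```

For the assumed 36-base alignment, the first call reports positions 5 through 36, which matches the statement that 4 bases are trimmed from the 5-prime end.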
Read coverage normalization options¶
--effectiveGenomeSize | |
The effective genome size is the portion of the genome that is mappable. Large fractions of the genome are stretches of NNNN that should be discarded. Also, if repetitive regions were not included in the mapping of reads, the effective genome size needs to be adjusted accordingly. A table of values is available here: http://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html . | |
--normalizeUsing | |
Possible choices: RPKM, CPM, BPM, RPGC, None Use one of the entered methods to normalize the number of reads per bin. By default, no normalization is performed. RPKM = Reads Per Kilobase per Million mapped reads; CPM = Counts Per Million mapped reads, same as CPM in RNA-seq; BPM = Bins Per Million mapped reads, same as TPM in RNA-seq; RPGC = reads per genomic content (1x normalization); None = the default and equivalent to not setting this option at all. Mapped reads are considered after blacklist filtering (if applied). RPKM (per bin) = number of reads per bin / (number of mapped reads (in millions) * bin length (kb)). CPM (per bin) = number of reads per bin / number of mapped reads (in millions). BPM (per bin) = number of reads per bin / sum of all reads per bin (in millions). RPGC (per bin) = number of reads per bin / scaling factor for 1x average coverage. This scaling factor, in turn, is determined from the sequencing depth: (total number of mapped reads * fragment length) / effective genome size. The scaling factor used is the inverse of the sequencing depth computed for the sample to match the 1x coverage. RPGC requires --effectiveGenomeSize. Each read is considered independently; if you want to only count one mate from a pair in paired-end data, then use the --samFlagInclude/--samFlagExclude options. (Default: None) | |
--exactScaling | Instead of computing scaling factors based on a sampling of the reads, process all of the reads to determine the exact number that will be used in the output. This requires significantly more time to compute, but will produce more accurate scaling factors in cases where alignments that are being filtered are rare and lumped together. In other words, this is only needed when region-based sampling is expected to produce incorrect results. |
--ignoreForNormalization, -ignore | |
A list of space-delimited chromosome names containing those chromosomes that should be excluded for computing the normalization. This is useful when considering samples with unequal coverage across chromosomes, like male samples. A usage example is --ignoreForNormalization chrX chrM. | |
--skipNonCoveredRegions, --skipNAs | |
This parameter determines if non-covered regions (regions without overlapping reads) in a BAM file should be skipped. The default is to treat those regions as having a value of zero. The decision to skip non-covered regions depends on the interpretation of the data. Non-covered regions may represent, for example, repetitive regions that should be skipped. | |
--smoothLength | The smooth length defines a window, larger than the binSize, to average the number of reads. For example, if the --binSize is set to 20 and the --smoothLength is set to 60, then, for each bin, the average of the bin and its left and right neighbors is considered. Any value smaller than --binSize will be ignored and no smoothing will be applied. |
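The per-bin formulas given under --normalizeUsing can be checked by hand. The sketch below restates three of them as small awk-based helpers; the function names and all input numbers (20 reads in a bin, 10 million mapped reads, a 50 bp bin, 200 bp fragments, an effective genome size of 2,000,000,000) are made up for illustration and are not deepTools internals.

```shell
# Hypothetical helpers: worked arithmetic for the per-bin formulas quoted above.
# None of these functions exist in deepTools; all inputs are illustrative.

# RPKM = reads in bin / (mapped reads in millions * bin length in kb)
rpkm() {
    awk -v r="$1" -v m="$2" -v b="$3" \
        'BEGIN { printf "%.2f\n", r / ((m / 1e6) * (b / 1000)) }'
}

# CPM = reads in bin / mapped reads in millions
cpm() {
    awk -v r="$1" -v m="$2" 'BEGIN { printf "%.2f\n", r / (m / 1e6) }'
}

# RPGC = reads in bin / sequencing depth, where
# depth = (mapped reads * fragment length) / effective genome size
rpgc() {
    awk -v r="$1" -v m="$2" -v f="$3" -v g="$4" \
        'BEGIN { printf "%.2f\n", r / (m * f / g) }'
}

rpkm 20 10000000 50               # reads, mapped reads, bin size (bp)
cpm  20 10000000                  # reads, mapped reads
rpgc 20 10000000 200 2000000000   # reads, mapped reads, fragment length, genome size
```

With these inputs the sequencing depth is exactly 1x, so the RPGC value equals the raw count of 20, while RPKM and CPM come out as 40 and 2.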
Read processing options¶
--extendReads, -e | |
This parameter allows the extension of reads to fragment size. If set, each read is extended, without exception. NOTE: This feature is generally NOT recommended for spliced-read data, such as RNA-seq, as it would extend reads over skipped regions. Single-end: Requires a user specified value for the final fragment length. Reads that already exceed this fragment length will not be extended. Paired-end: Reads with mates are always extended to match the fragment size defined by the two read mates. Unmated reads, mate reads that map too far apart (>4x fragment length) or even map to different chromosomes are treated like single-end reads. The input of a fragment length value is optional. If no value is specified, it is estimated from the data (mean of the fragment size of all mate reads). | |
--ignoreDuplicates | |
If set, reads that have the same orientation and start position will be considered only once. If reads are paired, the mate's position also has to coincide to ignore a read. | |
--minMappingQuality | |
If set, only reads that have a mapping quality score of at least this are considered. | |
--centerReads | By adding this option, reads are centered with respect to the fragment length. For paired-end data, the read is centered at the fragment length defined by the two ends of the fragment. For single-end data, the given fragment length is used. This option is useful to get a sharper signal around enriched regions. |
--samFlagInclude | |
Include reads based on the SAM flag. For example, to get only reads that are the first mate, use a flag of 64. This is useful to count properly paired reads only once, as otherwise the second mate will also be considered for the coverage. (Default: None) | |
--samFlagExclude | |
Exclude reads based on the SAM flag. For example, to get only reads that map to the forward strand, use --samFlagExclude 16, where 16 is the SAM flag for reads that map to the reverse strand. (Default: None) | |
--minFragmentLength | |
The minimum fragment length needed for read/pair inclusion. This option is primarily useful in ATACseq experiments, for filtering mono- or di-nucleosome fragments. (Default: 0) | |
--maxFragmentLength | |
The maximum fragment length needed for read/pair inclusion. (Default: 0) |
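The flag values passed to --samFlagInclude and --samFlagExclude are bit masks, so a combined flag can be taken apart with bitwise AND. The helper `describe_flag` below is a hypothetical sketch that only knows the three bits mentioned on this page (16, 64 and 128); the label strings are made up.

```shell
# Hypothetical helper: report which of the SAM flag bits used on this page
# (16 = reverse strand, 64 = first in pair, 128 = second in pair) are set.
describe_flag() {
    flag=$1
    out=""
    if [ $(( flag & 16 )) -ne 0 ]; then out="$out reverse_strand"; fi
    if [ $(( flag & 64 )) -ne 0 ]; then out="$out first_in_pair"; fi
    if [ $(( flag & 128 )) -ne 0 ]; then out="$out second_in_pair"; fi
    # Strip the leading space, or print "none" if no known bit is set.
    if [ -n "$out" ]; then echo "${out# }"; else echo "none"; fi
}

describe_flag 16    # a read mapped to the reverse strand
describe_flag 80    # 64 + 16
describe_flag 144   # 128 + 16
```

Real flags usually carry further bits (paired, properly paired, mate strand and so on) that this sketch simply ignores.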
Usage hints¶
- A smaller bin size value will result in a higher resolution of the coverage track but also in a larger file size.
- The 1x normalization (RPGC) requires the input of a value for the effective genome size, which is the mappable part of the reference genome. Of course, this value is species-specific. The command line help of this tool offers suggestions for a number of model species.
- It might be useful for some studies to exclude certain chromosomes in order to avoid biases, e.g. chromosome X, as male mice contain a pair of each autosome, but usually only a single X chromosome.
- By default, the read length is NOT extended! This is the preferred setting for spliced-read data like RNA-seq, where one usually wants to rely on the detected read locations only. A read extension would neglect potential splice sites in the unmapped part of the fragment. Other data, e.g. ChIP-seq, where fragments are known to map contiguously, should be processed with read extension (--extendReads [INTEGER]).
- For paired-end data, the fragment length is generally defined by the two read mates. The user provided fragment length is only used as a fallback for singletons or mate reads that map too far apart (with a distance greater than four times the fragment length or are located on different chromosomes).
Warning
If you already normalized for GC bias using correctGCbias, you should absolutely NOT set the parameter --ignoreDuplicates!
Note
Like BAM files, bigWig files are compressed, binary files. If you would like to see the coverage values, choose the bedGraph output via --outFileFormat.
Usage example for ChIP-seq¶
This is an example for ChIP-seq data using additional options (smaller bin size for higher resolution, normalizing coverage to 1x mouse genome size, excluding chromosome X during the normalization step, and extending reads):
bamCoverage --bam a.bam -o a.SeqDepthNorm.bw \
    --binSize 10 \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2150570000 \
    --ignoreForNormalization chrX \
    --extendReads
If you had run the command with --outFileFormat bedgraph, you could easily peek into the resulting file.
$ head SeqDepthNorm_chr19.bedgraph
19 60150 60250 9.32
19 60250 60450 18.65
19 60450 60650 27.97
19 60650 60950 37.29
19 60950 61000 27.97
19 61000 61050 18.65
19 61050 61150 27.97
19 61150 61200 18.65
19 61200 61300 9.32
19 61300 61350 18.65
As you can see, each row corresponds to one region. If consecutive bins have the same number of reads overlapping, they will be merged.
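Because consecutive bins with equal values are merged, the width of a bedGraph row tells you how many bins it represents. A minimal sketch, assuming whitespace-separated columns (chromosome, start, end, value) and a bin size supplied by the caller; the helper name `merged_bins` and the bin size of 50 are assumptions made for this illustration.

```shell
# Hypothetical helper: for each bedGraph row on stdin, print the region,
# its width in bp, and how many bins of the given size were merged into it.
merged_bins() {
    awk -v bin="$1" '{
        width = $3 - $2
        printf "%s:%d-%d\t%d bp\t%d bins\n", $1, $2, $3, width, width / bin
    }'
}

# First three rows of the example output; the bin size of 50 is assumed.
printf '19 60150 60250 9.32\n19 60250 60450 18.65\n19 60450 60650 27.97\n' |
    merged_bins 50
```

Under that assumption the first row covers two bins and the next two rows cover four bins each.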
Usage examples for RNA-seq¶
Note that some BAM files are filtered based on SAM flags (Explain SAM flags).
Regular bigWig track¶
bamCoverage -b a.bam -o a.bw
Split tracks for each strand¶
Sometimes it makes sense to generate two independent bigWig files for all reads on the forward and reverse strand, respectively. As of deepTools version 2.2, one can simply use the --filterRNAstrand option, such as --filterRNAstrand forward or --filterRNAstrand reverse. This handles paired-end and single-end datasets. For older versions of deepTools, please see the instructions below.
Note
The --filterRNAstrand option assumes the sequencing library was generated with Illumina dUTP/NSR/NNSR methods, which are the most commonly used methods for library preparation, where Read 2 (R2) is in the direction of the RNA strand (reverse-stranded library). However, other methods exist which generate read R1 in the direction of the RNA strand (see this review). For these libraries, --filterRNAstrand will have the opposite behavior, i.e. --filterRNAstrand forward will give you the reverse strand signal and vice versa.
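Producing both strand-specific tracks amounts to running bamCoverage once per --filterRNAstrand value. The sketch below only prints the two command lines (a dry run) so they can be inspected before use; the helper `strand_commands` and the output naming scheme are assumptions, while -b, -o and --filterRNAstrand are the options documented on this page.

```shell
# Hypothetical dry-run helper: print one bamCoverage command per strand.
# Usage: strand_commands <input.bam> <output_prefix>
strand_commands() {
    bam=$1
    prefix=$2
    for strand in forward reverse; do
        echo "bamCoverage -b $bam -o $prefix.$strand.bw --filterRNAstrand $strand"
    done
}

strand_commands a.bam a
```

Piping the printed lines to `sh` would execute them, provided bamCoverage is installed and the BAM file is indexed.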
Versions before 2.2¶
To follow the examples, you need to know that -f will tell samtools view to include reads with the indicated flag, while -F will lead to the exclusion of reads with the respective flag.
For a stranded `single-end` library
# Forward strand
bamCoverage -b a.bam -o a.fwd.bw --samFlagExclude 16
# Reverse strand
bamCoverage -b a.bam -o a.rev.bw --samFlagInclude 16
For a stranded `paired-end` library
Now, this gets a bit cumbersome, but future releases of deepTools will make this more straightforward. For now, bear with us and perhaps read up on SAM flags, e.g. here.
For paired-end samples, we assume that a proper pair should have the mates on opposing strands where the Illumina strand-specific protocol produces reads in an R2-R1 orientation. We basically follow the recipe given in this biostars tutorial.
To get the file for transcripts that originated from the forward strand:
# include reads that are 2nd in a pair (128);
# exclude reads that are mapped to the reverse strand (16)
$ samtools view -b -f 128 -F 16 a.bam > a.fwd1.bam
# include reads that are mapped to the reverse strand (16) and
# first in a pair (64): 64 + 16 = 80
$ samtools view -b -f 80 a.bam > a.fwd2.bam
# combine the temporary files
$ samtools merge -f fwd.bam a.fwd1.bam a.fwd2.bam
# index the filtered BAM file
$ samtools index fwd.bam
# run bamCoverage
$ bamCoverage -b fwd.bam -o a.fwd.bigWig
# remove the temporary files
$ rm a.fwd*.bam
To get the file for transcripts that originated from the reverse strand:
# include reads that map to the reverse strand (16)
# and are second in a pair (128): 128 + 16 = 144
$ samtools view -b -f 144 a.bam > a.rev1.bam
# include reads that are first in a pair (64), but
# exclude those ones that map to the reverse strand (16)
$ samtools view -b -f 64 -F 16 a.bam > a.rev2.bam
# merge the temporary files
$ samtools merge -f rev.bam a.rev1.bam a.rev2.bam
# index the merged, filtered BAM file
$ samtools index rev.bam
# run bamCoverage
$ bamCoverage -b rev.bam -o a.rev.bw
# remove temporary files
$ rm a.rev*.bam
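The four samtools view selections in the two recipes above can be replayed on individual flag values to see where a read ends up. In the sketch below, `passes_filter` mimics the -f/-F semantics (all include bits must be set, no exclude bit may be set) and `strand_of` applies the four selections in turn; both helper names are hypothetical, and the flags 163, 83, 99 and 147 are typical values for properly paired reads chosen for illustration.

```shell
# Hypothetical helpers that mimic samtools view -f INCLUDE -F EXCLUDE on a
# single flag value, then apply the four selections from the recipes above.
passes_filter() {
    flag=$1
    include=$2
    exclude=$3
    # -f: every include bit must be set; -F: no exclude bit may be set.
    [ $(( flag & include )) -eq "$include" ] && [ $(( flag & exclude )) -eq 0 ]
}

strand_of() {
    if passes_filter "$1" 128 16 || passes_filter "$1" 80 0; then
        echo forward    # -f 128 -F 16, or -f 80
    elif passes_filter "$1" 144 0 || passes_filter "$1" 64 16; then
        echo reverse    # -f 144, or -f 64 -F 16
    else
        echo unassigned # e.g. unpaired reads match none of the selections
    fi
}

for f in 163 83 99 147; do
    echo "$f $(strand_of "$f")"
done
```

Flags 163 and 83 land in the forward file and 99 and 147 in the reverse file, so both mates of a pair are assigned to the same strand.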
Source: https://deeptools.readthedocs.io/en/develop/content/tools/bamCoverage.html