Quick start
If you have bioconda setup as recommended, then you can install preseq as follows:
$ conda install preseq
If you have problems, try the following which sets the channel priority:
$ conda install -c conda-forge -c bioconda --strict-channel-priority preseq
The instructions for installing conda are here. You can obtain preseq source code here.
This documentation is still under construction... But the PDF documentation still applies and can be found here.
Preseq
The preseq package is aimed to help researchers design and optimize sequencing experiments by using population sampling models to infer properties of the population or the behavior under deeper sampling based upon a small initial sequencing experiment. The estimates can then be used to examine the utility of further sequencing, optimize the sequencing depth, or to screen multiple libraries to avoid low complexity samples.
The four main commands are c_curve
, lc_extrap
, gc_extrap
, and
bound_pop
. The c_curve
command interpolates the expected
complexity curve based upon a hypergeometric formula and is primarily
used to check predictions from lc_extrap
and gc_extrap
. It will
show you what your data looks like (as a curve) within the range of
data you already have. The lc_extrap
command uses rational function
approximations to Good & Toulmin's (1956) non-parametric empirical
Bayes estimator to predict the library complexity of future
experiments, in essence looking into the future for hypothetical
experiments. This is the paper:
Good IJ & Toulmin GH (1956).
The number of new species, and the increase in population coverage,
when a sample is increased.
Biometrika, 43(1-2),45-63.
The gc_extrap
command uses a similar approach as lc_extrap
to
predict the genome coverage, i.e. the number of bases covered at least
once, from deeper sequencing in a single cell or low input sequencing
experiment based on the observed coverage counts. An option is
available to predict the coverage based on binned coverage counts to
speed up the estimates. gc_extrap
requires mapped read or bed format
input, so the tool to-mr
is provided to convert bam format read to
mapped read format.
bound_pop
uses a non-parametric moment-based approach to
conservatively estimate the total number of classes in the sample,
also called the species richness of the population that is sampled.
Installation
Using conda
You can install preseq as follows using conda:
$ conda install -c bioconda preseq
The instructions for installing conda are here.
From source
You can obtain preseq here.
Only clone the repo if you are planning to modify or reuse the code,
or have some other interesting plans and know what you are doing. Most
users should download a release. The README.md
file in the GitHub
repo explains how to download a release, and not simply the source
code files.
Preseq runs on Unix-based systems, notably macOS and Linux. If the input file is in BAM format, the HTSLib library is required. If the input is a text file of counts in a single column, a "counts histogram" or is in BED format, then HTSLib is not required. Preseq has been tested on Linux and Mac OS-X.
Instructions on how to install preseq are included in the README.md
file in root of the source tree for preseq. If you have problems
installing preseq, please post an issue on the GitHub page or contact
Andrew Smith andrewds@usc.edu.
Using preseq
Basic usage
To generate the complexity curve of a genomic library from a read file
in BED or BAM format or a duplicate count file, use the c_curve
command. Use -o
to specify the output file name.
$ preseq c_curve -o complexity_output.txt input.bed
In the command above (and everywhere in these docs) we assume that the
preseq
program is in your path
To predict the complexity curve of a sequencing library using an
initial experiment in BED format, use the lc_extrap
command. The
required options are -o
to specify the output of the yield estimates
and the input file, which is either a BED file sorted by chromosome,
start position, end position, and strand or a BAM file sorted with the
samtools sort function. Additional options are available and are
detailed below.
$ preseq lc_extrap -o future_yield.txt input.bed
For a low input sequencing experiment the genomic coverage is highly
variable and uncertain function of sequencing depth. Some regions may
be missing due to locus dropout or preferentially amplified during
whole genome amplification. The gc_extrap
command allows the level
of genomic coverage from deep sequencing to be predicted based on an
initial sample. The input file format need to be a mapped read
(.mr
) or BED, sorted by chromosome, start position, and end then
position. Additional options are available and are detailed below.
$ preseq gc_extrap -o future_coverage.txt input.mr
File Format
Sorted read files in BED or BAM format
Input files are sorted mapped read files in BED or BAM format, or a text file consisting of one column giving the observed read counts. The programs require that BED files are sorted by chromosome, start position, end position, and strand. This can be achieved by using the command line function sort as follows:
$ sort -k 1,1 -k 2,2n -k 3,3n -k 6,6 input.bed > input.sort.bed
BAM format read files should be sorted by chromosome and start
position. This can be done with the SAMTools sort function. If the
input is in BAM format, then the flag -B
must be included.
If the input is for paired end mapped reads, the option -P
can be
set. In this case concordantly mapped reads and disconcordantly
mapped fragments are counted. This means that both ends of a
disconcordantly mapped read will each be counted separately. If a
large number of reads are disconcordant, then the default single end
should be used or the disconcordantly mapped reads removed prior to
running preseq
. In this case only the mapping location of the first
mate is used as the unique molecular
identifier. See this paper for details:
Kivioja T, Vähärautio A, Karlsson K, Bonke M, Enge M, Linnarsson S & Taipale J (2012).
Counting absolute numbers of molecules using unique molecular identifiers.
Nature Methods, 9(1),72-74.
Text files of observed read counts
For more general applications preseq
allows the input
to be a text file of observed read counts, one count per
line. To specify this input, the option -V
must be set.
Such a text file can typically be constructed using shell commands or very simple scripts. Take for example an unmapped sequencing experiment in FASTQ format. To predict the complexity, the unique molecular identifier needs to use only the observed sequence. For instance, a unique molecular identifier used may be the first 20 bases in the observed sequence. A command line script to construct the counts would then be
$ awk '{if (NR%4==2) print substr($0,1,20);}' input.fastq | sort | \
uniq -c | awk '{print $1}' > counts.txt
More complicated unique molecular identifiers can be used, such as mapping position plus a random barcode, but are too complicated to explain here. For questions with such usage, please contact us at andrewds@usc.edu
Mapped read format for gc_extrap
gc_extrap
does not allow for input files to be in BAM format. We
have found that some mappers give inconsistent SAM flags for paired
end reads, preventing efficient merging of reads in the proper order.
We provide the tool to-mr
to convert SAM or BAM format files to
MR format. The MR or BED format file needs to be sorted by
chromosome, start, and end position before input into gc_extrap
.
Detailed usage
c_curve
The c_curve
command is used to compute the expected complexity curve
using a hypergeometric formula (Heck et al., 1975, below). This can be
done with input as mapped reads, a counts file, or a counts histogram
file. The output is a text file with two columns. The first gives the
total number of reads and the second the corresponding number of
distinct reads.
Heck Jr, K. L., van Belle, G., & Simberloff, D. (1975).
Explicit calculation of the rarefaction diversity measurement
and the determination of sufficient sample size.
Ecology, 56(6), 1459-1461.
Command line arguments are as follows:
-o, -output yield output file (default: stdout)
-s, -step step size in extrapolations [1000000.000000]
-v, -verbose print more information
-P, -pe input is paired end read file
-H, -hist input is a text file containing the observed histogram
-V, -vals input is a text file containing only the observed
counts
-B, -bam input is in BAM format
-l, -seg_len maximum segment length when merging paired end bam
reads [5000]
-r, -seed seed for random number generator [408]
lc_extrap
The lc_extrap
command generates the expected yield for theoretical
larger experiments. It can also gives bounds on the number of distinct
reads in the library and the associated confidence intervals. The
confidence interval is computed by bootstrapping the observed
duplicate counts histogram. The output is a text file with four
columns:
- The first is the total number of reads
- The second is the corresponding average expected number of distinct reads
- The third is the lower limit of the confidence interval
- The fourth is the upper limit of the confidence interval
Specifying verbose will print the counts histogram of the input file.
Usage:
$ preseq lc_extrap [OPTIONS] <input-file>
Command line arguments are as follows:
-o, -output yield output file (default: stdout)
-e, -extrap maximum extrapolation [10000000000.000000]
-s, -step extrapolation step size [1000000.000000]
-n, -boots number of bootstraps [100]
-c, -cval level for confidence intervals [0.950000]
-x, -terms maximum terms in estimator [100]
-v, -verbose print more info
-B, -bam input is in BAM format
-l, -seg_len maximum segment length when merging paired end bam
reads [5000]
-P, -pe input is paired end read file
-V, -vals input is a text file containing only the observed
counts
-H, -hist input is a text file containing the observed histogram
-Q, -quick quick mode (no bootstraps) for confidence intervals
-D, -defects no testing for defects
-r, -seed seed for random number generator [408]
gc_extrap
The gc_extrap
command is used to extrapolate the expected number of
bases covered at least once for theoretical larger experiments. This
is useful for single cell or low input sequencing experiments. Input
format is required to be in mapped read format and we have provided
the tool to-mr
to convert BAM format files to the mr format. The
output is a text file with four columns.
- The first is the total number of sequenced and mapped bases
- The second gives the corresponding expected number of distinct bases covered
- The third gives the lower limit of the confidence interval
- The fourth gives the upper limit of the confidence interval
Specifying verbose will print the coverage counts histogram of the input file.
Usage:
$ preseq gc_extrap [OPTIONS] <input-file>
Command line arguments are as follows:
-o, -output coverage yield output file (default: stdout)
-w, -max_width max fragment length, set equal to read length for
single end reads [10000]
-b, -bin_size bin size [10]
-e, -extrap maximum extrapolation in base pairs
[1000000000000.000000]
-s, -step step size in bases between extrapolations
[100000000.000000]
-n, -bootstraps number of bootstraps [100]
-c, -cval level for confidence intervals [0.950000]
-x, -terms maximum number of terms [100]
-v, -verbose print more information
-B, -bed input is in bed format without sequence information
-Q, -quick quick mode: run gc_extrap without bootstrapping for
confidence intervals
-D, -defects defects mode to extrapolate without testing for
defects
-r, -seed seed for random number generator [408]
bound_pop
The bound_pop
command is a method for estimating species richness,
the total number of species or classes in the sampled population.
Input format is the same as lc_extrap
. Default output is a three
column text file, with the first column containing the estimated
species richness and the second and third containing the estimated
lower and upper confidence intervals. If bound_pop
is run in quick
mode, then the output is two columns. The first column will contain
the estimated species richness and the second column will contain the
dimension or order of the approximation.
Usage:
$ preseq bound_pop [OPTIONS] <input-file>
Command line arguments are as follows:
-o, -output species richness output file (default: stdout)
-p, -max_num_points maximum number of points in quadrature estimates
[10]
-t, -tolerance numerical tolerance [0.000000]
-n, -bootstraps number of bootstraps [500]
-c, -clevel level for confidence intervals [0.950000]
-v, -verbose print more information
-P, -pe input is paired end read file
-H, -hist input is a text file containing the observed
histogram
-V, -vals input is a text file containing only the
observed duplicate counts
-B, -bam input is in BAM format
-l, -seg_len maximum segment length when merging paired end
bam reads [5000]
-Q, -quick quick mode, estimate without bootstrapping
-r, -seed seed for random number generator [408]
Examples
lc_extrap
Using a file of mapped reads
The mapped reads given as input must be sorted, and can be in BED
format or BAM format if the -B
flag is specified. We will use the
data file SRR1041830_5M.bam
provided through a link in the data
subdirectory of the preseq source tree. This is a random sample of 5M
reads from a particular sequencing data set.
$ preseq lc_extrap -B -o future_yield.txt SRR1041830_5M.bam
Notice that the -B
flag was specified on the command line. This is
required if the input data is in BAM format. If you run preseq
lc_extrap
and it does not list the -B
argument, then your version
of preseq was not built with BAM support.
The output file future_yield.txt
will contain estimaes of the number
of distinct reads that will be sequenced upon deeper sequencing from
the same library. In this case, we can see some of these predictions
as follows. The first several predictions can be seen as follows:
$ head future_yield.txt
TOTAL_READS EXPECTED_DISTINCT LOWER_0.95CI UPPER_0.95CI
0 0 0 0
1000000.0 990651.5 990601.3 990709.5
2000000.0 1963963.0 1963805.8 1964152.2
3000000.0 2921362.5 2921067.3 2921723.9
4000000.0 3863837.0 3863361.6 3864408.8
5000000.0 4792170.5 4791451.2 4792972.3
6000000.0 5706990.8 5705989.7 5708096.4
7000000.0 6608846.9 6607533.2 6610298.1
8000000.0 7498214.9 7496569.6 7500056.3
Note that the first line above gives column headings. The second line corresponds to sequencing zero reads (and is as expected). The remaining lines give estimates for what we would expect if we sequenced the corresponding number of reads from the library. If we sequenced 8M reads, then we expect 7498214.9 of them to be unique. Here the confidence intervals are quite narrow, but that's not always the case, especially in very low-complexity libraries.
We can examine the estimates for sequencing 50M, 500M and 5G reads as follows:
$ grep ^50000000 future_yield.txt
50000000.0 36786623.6 35117618.0 37112536.8
500000000.0 116108773.9 97165891.1 137121043.6
5000000000.0 148401854.8 118080946.1 198715938.8
Above the second column shows the expected distinct reads for sequencing 10x, 100x and 1000x deeper from the same library. In this case, we still sequence new unique reads after 500M, but but the fraction unique at 5G is under 3%. Looking more closely around the 5G estimate:
$ grep -2 ^5000000000 future_yield.txt
4998000000.0 148400017.5 118079815.0 198711828.1
4999000000.0 148400936.4 118080380.7 198713883.8
5000000000.0 148401854.8 118080946.1 198715938.8
5001000000.0 148402772.9 118081511.4 198717993.0
5002000000.0 148403690.7 118082076.4 198720046.3
we see that the library likely only contains 150M unique molecules.
This example uses a sorted read mapped reads file as the data for the initial experiment. The default step size between yield estimates is 1M, the default confidence interval level is 95%, and the default total extrapolation is to 10 billion (10G) sequenced reads.
Using a mapped reads file and the verbose option
$ preseq lc_extrap -v -B -o future_yield.txt SRR1041830_5M.bam
As preseq is running, information will print to the terminal screen (standard error) that includes a read counts histogram of the input file. This truncates after the first value that has zero observations. Included here is the first 20 lines of what would be observed:
BAM_INPUT
TOTAL READS = 5083564
DISTINCT READS = 4.86904e+06
DISTINCT COUNTS = 19
MAX COUNT = 19
COUNTS OF 1 = 4.67849e+06
MAX TERMS = 18
OBSERVED COUNTS (20)
1 4678490
2 173897
3 12770
4 2345
5 772
6 355
7 156
8 82
9 55
10 37
11 24
12 21
Using a sorted mapped reads with options
$ preseq lc_extrap -B -e 15000000 -s 500000 -b 90 -c .90 -o future_yield.txt SRR1041830_5M.bam
This will use a step size of 500k when making estimates, and the estimates ending at 15M. The confidence intervals are now at a level of 90%. This can be seen as follows:
$ head future_yield.txt
TOTAL_READS EXPECTED_DISTINCT LOWER_0.9CI UPPER_0.9CI
0 0 0 0
500000.0 497614.0 497602.0 497630.0
1000000.0 990651.5 990609.0 990703.1
1500000.0 1479369.5 1479288.0 1479477.3
2000000.0 1963963.0 1963834.6 1964139.1
2500000.0 2444581.0 2444401.0 2444831.2
3000000.0 2921362.5 2921116.3 2921682.2
3500000.0 3394409.5 3394093.8 3394798.0
4000000.0 3863837.0 3863430.4 3864324.8
$ tail -5 future_yield.txt
12500000.0 11356784.0 11353132.2 11360765.0
13000000.0 11771818.5 11767804.8 11776196.8
13500000.0 12184179.7 12179952.7 12188987.7
14000000.0 12594028.0 12589384.9 12599334.0
14500000.0 13001405.8 12996034.3 13007337.6
Using a histogram or read counts as input
lc_extrap
allows the input file to be an observed histogram. An
example of the format of this histogram is (truncated at 20 lines):
1 47879524
2 16751354
3 7193291
4 3552776
5 1935072
6 1120980
7 682786
8 430623
9 280089
10 186473
11 127529
12 89218
13 63197
14 45779
15 33644
16 25155
17 18760
18 14371
19 11048
20 8667
The following command will give the same output as the first example above (using default parameters):
$ preseq lc_extrap -o future_yield.txt -H histogram.txt
Assuming that histogram.txt
has the proper format and includes the
frequencies of read counts from SRR1041830_5M.bam
.
Similarly, both lc_extrap
and c_curve
allow the option to input
read counts (text file should contain ONLY the observed counts in a
single column). For example, if a dataset had the following counts
histogram:
1 4
2 3
3 1
then the corresponding input file of just read counts might be
1
2
1
1
3
2
2
1
This particular example involves counts that are far to small for preseq to make any estimates, but they should illustrate the relationship between a file of counts, and the counts histogram format.
When using these count "values" the command should be run with the -V
flag (not to be confused with the lowercase -v
for verbose mode):
$ preseq lc_extrap -o future_yield.txt -V counts.txt
gc_extrap
The gc_extrap
command is designed for estimating genome coverage
from more deeply sequencing a library, particularly in single cell
whole genome sequencing experiments. For illustrative purposes we will
examine an MDA (multiple displacement amplification) whole genome
sequencing experiment, SRA accession SRR1777281. This experiment has
5.76 million paired end 101 base reads. We mapped the experiment with
bowtie2 v0.0-beta7 under default parameters. This resulted in 3.63
million concordantly mapped fragment pairs and 2.1 million
disconcordantly mapped fragments.
The first step is to convert the sorted bam file to mr format and sort it.
$ to-mr -o SRR1777281_bwt2.mr -L 10000 SRR1777281_bwt2.sort.bam
$ sort -k 1,1 -k 2,2n -k 3,3n SRR1777281_bwt2.mr > SRR1777281_bwt2.sort.mr
The resulting mapped reads file has 813 million bases total (note that bases covered by two fragments of the same read are only counted once) and 410 million covered bases in the genome.
By default, gc_extrap
divides the genome into 10 base pair non-overlapping bins.
In default mode, the running time of gc_extrap
was under 12 minutes.
LOADING READS
MAPPED READ FORMAT
TOTAL READS = 5726883
BASE STEP SIZE = 1e+08
BIN STEP SIZE = 1e+07
TOTAL BINS = 8.1325e+07
BINS PER READ = 14.2006
DISTINCT BINS = 4.09582e+07
TOTAL BASES = 8.1325e+08
TOTAL COVERED BASES = 4.09582e+08
MAX COVERAGE COUNT = 79775
COUNTS OF 1 = 3.13723e+07
OBSERVED BIN COUNTS (79776)
1 3.13723e+07
2 6.97799e+06
3 1.72101e+06
The first several lines of the output file were as follows:
TOTAL_BASES EXPECTED_COVERED_BASES LOWER_95%CI UPPER_95%CI
0 0 0 0
100000000.0 64522380.0 63075879.1 66002053.1
200000000.0 123422455.0 120747705.7 126156454.1
300000000.0 178054120.0 174319619.1 181868626.2
400000000.0 229008295.0 224352188.5 233761032.3
500000000.0 276727080.0 271265011.6 282299130.1
The final line of the output was:
999900000000.0 1826891418.6 1621208958.1 2058668772.4
To run gc_extrap
at single base resolution, the option
-b 1
is required. This results in a significant increase
in the running time of the algorithm. For this case the
running time was 113 minutes. The following is the command
along with the first several lines of information reported
to the terminal:
$ preseq gc_extrap SRR1777281_bwt2.sort.mr -o SRR1777281_bwt2_1bp_gc_extrap.txt -b 1 -v
LOADING READS
MAPPED READ FORMAT
TOTAL READS = 5726883
BASE STEP SIZE = 1e+08
BIN STEP SIZE = 1e+08
TOTAL BINS = 8.13236e+08
BINS PER READ = 142.003
DISTINCT BINS = 4.09454e+08
TOTAL BASES = 8.13236e+08
TOTAL COVERED BASES = 4.09454e+08
MAX COVERAGE COUNT = 80028
COUNTS OF 1 = 3.13527e+08
OBSERVED BIN COUNTS (80029)
1 3.13527e+08
2 6.98288e+07
3 1.722e+07
The output looks like this:
TOTAL_BASES EXPECTED_COVERED_BASES LOWER_95%CI UPPER_95%CI
0 0 0 0
100000000.0 64680427.0 64284593.4 65078698.0
200000000.0 123709880.0 122978429.6 124445680.9
300000000.0 178447901.0 177427065.8 179474609.6
400000000.0 229488768.0 228216803.2 230767822.1
500000000.0 277279369.5 275788078.2 278778724.8
with the final line:
999900000000.0 1838021604.0 1682515315.7 2007900543.3