FASIM: Fragments Assembly Simulation using Biased-Sampling Model and Assembly Simulation for Microbial Genome Shotgun Sequencing
-- Manual
Hur, Cheol-Goo1,3 , Sunny Kim1 , Chang Hoon Kim1 , Sung Ho Yoon1 , Yong-Ho In2 , Cheolmin Kim3 , AND Hwan Gue Cho3*
1 Division of Genomics and Proteomics, Korea Research Institute of Bioscience and Biotechnology (KRIBB), 52 Oun-dong, Yusong-gu, Daejon 305-333, Korea
2 Bioinfomatix, Inc., Nam Chang bldg, 748-16, Yeoksam-dong, Gangnam-gu, Seoul 135-925, KOREA
3 Bioinformatics Cooperative Course , Pusan National University, 30 Jangjeon-dong, Geumjeong-gu, Pusan 609-735, Korea
* Corresponding author
Phone: 82-51-510-2283; Fax: 82-51-515-2208; E-mail: hgcho@pusan.ac.kr
A current release of the FASIM can be available upon request at hurlee@kribb.re.kr .
Abstract
We have developed a program for generating shotgun data sets from
known genome sequences. Generation of synthetic data set by computer program is
a useful alternative to real data to which students and researchers have
limited access. Uniformly-distributed-sampling clones which were adopted by
previous programs cannot account for the real situation
where sampled reads tend to come from particular regions of the target genome. To
reflect such situation, a probabilistic model for biased sampling distribution
was developed by using experimental data set derivied from a microbial genome
project. Among experimental parameters tested - varied fragment or read lengths,
chimerism, and sequencing error, sequencing error was the most critical factor
that hampers sequence assembly. We propose that optimum sequencing strategy
employing different insert length and redundancy can be established by
performing a variety of simulations.
Introduction
With the advent of shotgun sequencing approach, genome sequnece data
are being rapidly accumulated. As for microorganisms, 142 complete genomes have
been published as of Oct. 2003. Whole
genome shotgun sequencing (WGSS) is to sequence randomly generated
fragments of an entire genome, and then to assemble the sequences by computer
program (Myers 1999b). Availability of a sequence assembly program and
well-designed shotgun approach can considerably reduce the cost and time for
genome projects. In silico simulation
of entire sequencing steps such as base calling from chromatogram, vector
masking and sequence assembly, can be extremely helpful in understanding each
step and in designing the experiments
with its predictive power. However, this ¡°virtual experience of genome project¡±
is hampered by the limited availability of real experimental data sets.
Generation of random short fragment sequences by computer program
can be a useful alternative to the real data set. To the best of our knowledge,
only two programs have been reported – GenFrag 2.1 (Engle and Burks 1994) and
celsim (Myers 1999a).
Previously presented GenFrag (version
1.0) was designed to generate artificial DNA sequences fragment set for testing
DNA fragment assembly algorithm. Dataset generated as mentioned above is independent
and system-like to evaluate algorithm. Version 2.1 adds the function to
generate dataset that reflects potentially troublesome aspects of fragment
assembly more than before. GenFrag generates fragment set for given DNA
sequences to be input according to criteria defined by users. For main
features, it can adjust fragment length of parent sequence, depth of coverage
and error rate in fragments. For Version 1.0, a single error rate is defined
and all nucleotides are randomly mutated on the basis of this error rate.
However, error rates are increased with increase of fragment length so that
errors are not generated in the same degree according to fragment length. Thus,
GenFrag 2.0 provides three error distributions as applying position-specific
error profile: Substitutions, insertions and deletions. Moreover, it adds new
module for repetitive elements that act as barrier in assembly projects so as
to allow repeat complexity before Parent sequences are divided into fragments.
You fill input files including one or more DNA sequences(used as repeat
templates) according to required insertion frequency values. That is, GenFrag
2.0 is a simulator enabling to examine performance of assembly tools and the
most distinguished feature is to adjust error rates of sample sequences.
Celsim are tools that can generate practical and repetitive target
sequences and shotgun data sets. It largely consists of three parts. Firstly, sequence generation stage makes DNA
sequences generated in artificially controlled types (repetitive array,
transposon-like repetitive elements, large-scale duplications, etc.) Secondly,
optional polymorphism stage applies
polymorphic variations to generated target sequences. Finally, shotgun stage simulates sampling and
end-sequencing as much as you want relative to insert libraries to fit to artificially
selected conditions (read length, numbers, single base errors, chimeric rate,
etc.)
That is, celsim is the tool to
generate shotgun data set using known DAN sequences or generate DNA sequences
that are artificially adjusted.
However, programs described above
still have some limits in developing assembly tools, examining performance of
assembly tools or simulating WGSS. While celsim is designed to enable users to
adjust conditions(chimerism, sequencing error, distribution of fragment or read
length) in consideration of a wide range of experimental conditions that can
occur in actual WGSS, these programs are designed on the assumption that
fragments are equally distributed on genome. That is, clone sampling is
uniformly carried out during fragment generation. Thus, these programs don¡¯t
sufficiently reflect actual conditions of WGSS.
This is caused by the observation that biased clone distribution
always happens due to various reasons – incomplete fracturing of genome, biased
insertion of fragments into vectors and improper cloning of some insert/vector
combinations (Myers 1999b).
In this study, we developed a program suite for generating random
fragments in microbial genome sequencing. To create a more realistic program, a
probabilistic model of biased sampling distribution was applied to generate
synthetic shotgun data sets from known genome sequences. We also report the
effects of various experimental factors on sequence assembly.
Materials and
methods
Generation of a probabilistic
model for biased random sampling
Probabilistic model for biased sampling distribution was developed
by using experimentally derived data set constructed from the genome project of
Mannheimia succiniciproducents
MBEL55E (accession no. AE016827, unpublished). This strain is a novel succinic
acid-producing bacterium, which was isolated from bovine rumen
(Lee et al. 2002). The chromosome consists of 2,314,078 bp, which were cloned
into 18,958 of 2-kb clones and 294 of 40-kb clones. As
shown in Fig. 1, the bacterial chromosome was scanned for finding out the start
position of each read by BLASTN program (Altschul et al. 1997). The cumulative
distribution function for biasedsampling, F(x), was defined as
(0≦x≦genome size) (1)
where f(i) is the number
of reads positioned at the ith position and N is the total number of reads. Once a random number between zero
and one is selected, the read position corresponding to that number is traced
from F(x). Starting from a generated
position, inserts having different length such as 2 kb, 40 kb or 150 kb were
taken out from the Pasteurella multocida
genome sequence (accession ro. NC_002663).

Fig. 1 Process to Develop Random
Distribution Model
Development of a data
set generation program
The program was written in C and runs as a Unix or Linux command
line tool. It was designed to generate sequence files of a shotgun data set
from known genome sequences (Fig. 2). Several parameters(Table. 1) allow users
to control the read sequences. Normal distribution of fragment and read lengths
can be parameterized. The sequencing error rate can be optionally specified.
Two file names, genome sequence file and vector sequence file, are arguments
with several parameter settings (Fig. 2). The positions of every fragment are
generated from a derived probabilistic model (see Materials and methods). The information
file containing the inserts distribution of length and position of inserts is also
provided. After reads were generated from ends of inserts, vector sequence was
added to the 5¡¯ of 3¡¯ end of each read sequence. At this step, chimeric reads
are generated by combining two randomly selected reads. pGEM3Zf was used as a
vector for 2 kb clone and pEpiFOS-5 for 40- and 150 kb clones. The resulting
read sequences are converted into ABI files by Phred package (Ewing et al. 1998).
Output file generally consists of three parts. Read sequence files
directory and chromatograms directory in Fasta format are generated, clone
position/length distribution is calculated and the results are printed by
section. Upon the completion of this process, fragment assembly is carried out
using CAP3(or Phrap). In this case, assembly results are analyzed as the counts
of contig check the counts of scaffold by coverage stage.

Fig. 2
Module Configuration of FASIM and
Functions of each Module. modul-1 enters command lines, genome sequence and vector
sequence. module –2 calculates start points, end points and lengths of clones
that are entered. On the basis of this data, module-3 calculates required
statistics. Then, defined number of normal clones are generated and converted
to ABI files. Next, chimeric clones are generated to match total number.
Module -1 : Command line interpretation and Vector and genome
sequence input processing
1)
A variety of options given by
command lines are processed. Each option is entered and checked by purpose of
each parameter.
2)
DNA sequences in Fasta Format are
entered and then blank and numbers are deleted. Then, standard nucleotide and
ambiguous sequences are received.
Module-2 : Strain
sequences to simulate and vector data to test are entered.
Module-3 : Random position sampling on genome
1)
This module carries out sampling for
clone picking start points with each genome position from 0.0 to 1.0 generated
by random number generation using c.d.f function that is generated through
learning.
2)
This module temporally calculates
end points using average lengths from start points and define final end points
as adjusting temporal end points to acquire standard distribution within given
scope of deviation.
Module-4 : Routine to calculate statistics of templates
1)
This module prints distribution by
given section to check whether clone boundaries in Position distribution
Sampling are evenly distributed on genome.
2)
Clone length distribution : Clone
length distribution is set according to standard distribution. This module
prints clone length distribution by section to check the results.
Module-5 : Routine to
generate normal clone
1)
This module singles out sequences
from genome sequences on the basis of clone data from Module-3.
2)
If start point is too close to the
end of genome, this module processes in accordance with conditions(when
bacterial genome is linear, it is not selected as clone. For circular genome,
deficient portions are selected from the first part of genome). (using ran1(),
gasdev() in Numeric Receipe routine)
3)
This module attaches vector sequence
to the sides 5¡¯ and 3¡¯.
Module-6 : Routine to generate Chimeric clones
1)
This module introduces chimerism
only when lengths of two clones are over 0 as selecting data on boundaries and
lengths of two independent clones.
2)
This module decides positions for
recombination as generating random numbers on the basis of short clone lengths,
selects two sequences from genome sequence and generates chimeric clones on the
basis of determined positions.
3)
This module attaches vector sequence
to the sides 5¡¯ and 3¡¯.
Module-7 : Read maker
1) This
module decides read lengths conforming to standard distribution with input read
length deviation as standard deviation and input read length as average. Then,
it generates reads in required length including vector sequences from generated
clones above.
2) This
module allows adjusting ratio of substitution, insertion and deficiency in
consideration of sequencing errors.
3 ) This module generates chromatogram
files as bring mktrace.
Files for
FASIM Program
1) MHPOS.dat : ) Frequency function data according to actual
positions of reads used until completion of Mannheimia succiniciproducens
MBEL55E finishing on chromosome
2) simul.c : generation of sim, executable program for main
program including mai().
3) seq_cutil.c : Groups of routines required for sequence
processing
4) seq_cutil.h : Header for seq_cutil.c
5) nrutil.h : Header
for using numerical receipe routine
6) nr.h : Header for using numerical receipe
routine
7) mktrace : Generating ABI files from sequences as the program
included in phred/phrap package. It shall be positioned to allow fasim program
can retrieve without special settings, as being separately complied.
Major parameter
of FASIM program

Table. 1 : Major
parameters of FASIM program can carry out a wide range of simulation works for programs.
Sequence assembly
Base calling from ABI file was done by using Phred package
(http://www.phrap.org). After the vector sequence was masked, read sequences
were assembled into contigs (overlapped fragments in a contiguous region) and scaffolds
(maximally linked and ordered set of contigs) by CAP3 (Huang and Madan 1999). (Fig.
3)

Fig. 3 Assembly Process on the basis of Generated Data
Results
Results from
Running Program
(a) Position
distribution
For Uniform distribution( Fig. 4 ), sampling
was evenly carried out on genome in general. It indicates sampling of start
points on genome by section of which the size was set to 10Kbp. Since the size
of overall genome in Pasteurella
multocida was about 2.26 Million bp, X-axis reaches to about 2.5 Million
and about 90(black-2kb clone) clones were allocated on each section in average.
It shows uniform distribution in general in spite of substantial deviation. Red
points were fragments generated on the basis of about 40 Kbp. In this step,
about 20 fragments with different types per section were generated and were
also evenly distributed. On the contrary, when cumulative distribution model generated
through learning data of Mannheimia succiniciproducents ( Fig. 5 ), we got the clone picking
results that were biased similarly to shotgun sequencing approach.

Fig. 5 Uniform distribution

Fig.6 Cumulative distribution
(b) Clone length
distribution
It was confirmed that clone length distribution was also
controlled in a pattern according to standard distribution. From this
distribution, we can understand the length distributions when 40kb mate and 2kb
mate were generated (Fig. 8).

Fig. 7 Length Distribution of 2Kb clones

Fig. 8 Length Distribution
of 40Kb clones
(c) Read length
distribution
The
chart shows that read length was well controlled to conform to standard
distribution. The length in this chart indicates the length in the state
including vector sequence.

Fig. 9 Calculation with 600dp in average size distribution length
of read length and 200bp in deviation
Distribution of
fragments generated by the program
To test the performance of the program, we generated sequencing
reads uniformly from published genome sequence, Pasteurella multocida, which is circular genome having 2,257,487 bp
(G). The simulation condition was given as insert length of 2 kb, read length
of 600 bp (L), and no experimental errors. The contigs were generated by
assembling the reads with setting of minimal overlapped length between reads
(T) as 30 bp. Theoretically, if all the reads having same size are uniformly
sampled from genome, the expected number of contigs can be calculated from
(2) (Lander and Waterman 1988)
where N is the number of
sequenced reads.
As shown in Fig. 3A, the summed number of contigs and singlets,
that is reads left from contig assembly was perfectly matched to equation (2)
as number of reads went. This shows this program performs well in generating
shotgun data sets. In theory, uniformly sampled reads with no experimental
errors guarantee 99 percent of the genome can be covered by eightfold
redundancy (Pop et al. 2002), which hardly occurred in real situation. When the
number of contigs generated from Mannheimia
WGSS were plotted according to redundancy, the patterns were quite different
from that of equation (2), and the extent of deviation was largely propagated
at the higher redundancy (Fig. 3B).


Fig. 3 Redundancy profiles of contig number
generated by the program (A) and by coming from Manheimia genome sequencing
project (B). (--¡à-- contigs; --¡â--singlets;
--¡Ü-- contigs + singlets; solid line fitted to equation (2))
Effect of sequencing strategy on sequence assembly
Redundancy of sequence and the choice of insert length are key
factors in reducing scaffolds. To find out the effect of sequencing strategy on
assembly, combinations of inserts having different length were simulated and
the results were compared in Table 2. The data sets were generated based on
biased-sampled model for the purpose of simulating more realistic situation. We
chose 40 kb as a representative insert length for cosmid, and 150 kb for BAC
(bacterial artificial chromosome). At the fixed redundancy, fewer scaffolds and
larger target coverage resulted when the proportion of longer inserts was
increased. For example, 250 scaffolds were made when only 2 kb inserts were
used to give 3 fold redundancy. When 2¢¥ redundancy was further added,
85, 124, 130, and 141 scaffolds were reduced at the 2 kb-, 40 kb- and 150kb-
inserts ratio of 5:0:0, 4:1:1, 3:1:1 and 3:0:2, respectively. As for the target
coverage, 85.4 %, 90.72%, 91.34%, 93.62% were covered by scaffolds as addition
of longer inserts increased. A mixture of 60% of 2 kb inserts and 40% of 150 kb
inserts data gave best assembling performance and 109 scaffolds and 93.62%
target coverage was achieved. Same discussion was applied to the assemblies at
redundancy of 7 fold.
Table 2. Effects of using multi-library on sequence assembly at
various redundancies. Data sets were derived from biased-sampled model.
|
Redundancy |
Redundancy of inserts (2 kb/40 kb/150 kb) |
No. contigs |
No. scaffolds |
Target coverage(%) |
|
3X |
3X/0X/0X |
994 |
250 |
74.82 |
|
5X |
5X/0X/0X |
735 |
165 |
85.40 |
|
4X/1X/0X |
653 |
126 |
90.72 |
|
|
3X/1X/1X |
642 |
120 |
91.34 |
|
|
3X/0X/2X |
536 |
109 |
93.62 |
|
|
7X |
7X/0X/0X |
584 |
121 |
89.54 |
|
6X/1X/0X |
454 |
96 |
93.35 |
|
|
5X/1X/1X |
447 |
83 |
94.57 |
|
|
5X/0X/2X |
360 |
71 |
95.93 |
Effect of experimental conditions on sequence assembly
Influences of a wide range of
experimental conditions on fragment assembly were examined. Assembly was
carried out using PHRAP on the basis of total 15 dataset (Table. 3).
Simulations were carried out to find out the effects of
experimental conditions on sequence assembly (Fig. 10). We simulated data sets
generated by uniformly distributed samples from Pasteurella genome sequence and the results were compared to line
fitted to equation (2), which assumed uniformly sampled reads with no
experimental errors. Distribution of read lengths with standard deviation of
100 bp gave no difference (Fig. 10-1). However, deviation of 200 bp made effect
on contig assembly, and 115 contigs were remained even at the redundancy of
9.3. This implies that it is critical to prepare libraries with narrow deviated
insert length. Contrast to varied read length, the variation in fragment length
resulted in no difference (data not shown). Chimerism made little difference
and 10 % of chimerism resulted in slight deviation. Because 10 % of chimerism
is hard to occur in real experiments, chimerism seems to be minor factors in
WGSS(Fig 10-2). It was observed that clone length deviation didn¡¯t make
significant influences on contig counts. The results were pre-estimated because
each read is important in fragment assembly and clone length is required for
picking scaffold in finishing process later (Fig. 10-3 ). One major concern to
WGSS were the sequencing errors (Fig. 10-4). Large deviation from equation (2)
was found when sequencing error percentile was assumed to be one and two.
Especially, beyond four-fold redundancy, number of contigs increased. This
result showed that reducing sequencing error is critical to completing WGSS. Insertion/deletion
sequencing error has possibility to play more important roles in fragment
assembly process. And this study examined what are the effects on overall
results when sequencing error rate was fixed (Fig. 10- 5). While it was
observed that higher insertion/deletion rate generally caused poor assembly
results, the most important factor may be sequencing errors. However,
sequencing errors are not identified as the same probabilities from beginning
to the end in sequencing data actually acquired, but tended to concentrate on
front and end portions of sequences. Thus, it is almost impossible that these
test results explained every tendency. Nevertheless, it is obvious that
accuracy of sequences is the important factor for effective performance of
assembly. Next, this study put a variety of factors that had been separately
studied before together and examined effects of those assembled factors (Fig.
10-6). Accordingly, it was confirmed that other factors relative to effects by
sequencing errors gave impact on assembly results.
- -½Ä(3)

Table. 3 A wide range of conditions set
to examine why test results didn¡¯t match theoretic estimates.

Fig. 10-1
Influences of Read Length on contig counts.
In this figure, 600bp in average length, SD=0.0 for case 1, SD=100 for case 2,
SD=200 for case 3. The black solid line is the results when Formula (2) was
applied to case 1 and the green solid line when Formula (3) was applied to case
3.

Fig. 10-2 Influences of Chimeric clone on
contig counts. case 1 (X=0.0), case 9(X=0.01), case 10(X=0.05), case 11( X=0.10)

Fig. 10-3
Influences of Clone length deviation on
contig. The solid line is the result when Formula (2) was applied, with
D=0.0 for case 1, D=0.10 for case 7, D=0.20 for case 3.

Fig. 10-4 Influences of sequencing errors on contig counts. R=0.000 for case 1, R=0.0010
for case 4, R=0.0020 for case 5 and R=0.0025 for case 6. Sequencing error rates
are higher in the order of case 1(black), case 2(red), case3(green) and case 4(blue).
Solid line and case 1 showed the results when each data point and Formula (2)
were applied. Other results were obtained as applying Formula (3). B is the
coefficient of Formula (3).

Fig. 10-5. Influences of Insertion/Deficiency/Substitution
Sequencing Error on Contig Count. case1- (R=0.000,IR=DR=0.000) case12-(R=0.002,IR=DR=0.010),
case 13- (R=0.002, IR=DR=0.05), case 14- (R=0.002, IR=DR= 0.010) and case5 -(R=0.002,
IR=DR=0.20) are related, where as Case1 for Formula (2) and others for Formula (3),
and coefficient b is in round bracket.

Fig. 10-6
Simultaneous Influences of a Variety of Factors on contig
counts
Discussion
We have developed data set generation program for microbial WGSS.
In contrast to uniformly-sampled clones assumption which was adopted by
previous programs, we applied biased distributed sampling model to our program.
Sampled reads biased to come from particular region of the target sequences
frequently occurrs in WGSS (Myers 1999b). This can happen by incomplete
fracturing of genome, biased insertion of fragments into vectors and improper
cloning of some insert/vector combinations. Up to date, the data set generators
(Engle and Burks 1994; Myers 1999a) were designed based on uniformly sampled
model. However, as shown in Fig. 3B, this assumption didn¡¯t hold in real
experiment. Therefore, the assembly simulation should be based on the real
biased distributed clones.
Although the advent of sequencing technology makes easy to do
WGSS, genome project still cost huge amount of money and time. Therefore, it is
essential to establish optimal sequencing strategy before getting into the
sequencing. In this study, we simulated sequence assembles with various data
sets consisting of different redundancy of different sized inserts (Table 1).
When the redundancy was held at 7 and 8, smallest number of scaffolds and
largest target coverage was achieved when 2 fold redundancy of 150 kb inserts
were added to 2 kb inserts data. This result is consistent with previous
finding that among one type of inserts, fewer scaffolds resulted when longer
insert lengths were used (Roach et al. 1995). The possibility of spanning
larger region between contigs can be enhanced by using longer inserts. However,
we should consider practical limitation to use large insert size. This
limitation is caused by the experimental difficulty of sequencing the ends of
long inserts and preparing large sized library. Therefore, determination of
redundancy and insert length should be based on economic balance. For example,
in Table 1, 5X/1X/1X of 2 kb-, 40 kb-, 150 kb inserts can cost less money than
3X/0X/2X due to less use of BAC libraries for 150 kb inserts. Likewise, if 3X
redundancy of 2 kb inserts were done and we should decide which insert length
and the extent of redundancy further sequenced, simulation of a variety of
strategies can give us clue to that decision.
Among experimental parameters tested - varied insert and read
length, chimerism, and sequencing error -, sequencing error was the most
critical factor in hampering sequence assembly. When the read sequences
contained above 10 %, the increased contig number beyond four-fold redundancy
was found (Fig. 3C). It is possible that the kind of error, substitution,
insertion, or deletion, affect the assembling process. Although, there was a
tendency for the increased proportion of insertion and deletion led to large
contig number at fixed total error rate (data not shown), the extent was
negligible. It has been widely accepted and an open secret that the quality of
sequencing ultimately determines the success of WGSS. However, there was no
systematic study to support this idea.
In summary, we have developed data set generator for microbial
WGSS and performed a variety of simulation in systematic way. Allowing users to
vary parameters such as distribution of fragment length and read length,
sequence error rate, and chimerism rate facilitate WGSS by establishing optimal
strategy. Development of sequence assembly related tools can be benefited from
systematically controlled data set by testing algorithm and capability. We
believe that students and researchers in microbiology who have limited access
to WGSS can take advantage of this program for enhancing their understanding in
genomics. A current release of the software and documentation can be available
upon request at hurlee@kribb.re.kr.
Acknowledgements
We thank Manheimia succiniciproducents MBEL55E genome sequencing team ? professor Sang Yup Lee at KAIST, Bioinfomatix Inc., and GenoTech Corporation for providing the genome sequences. We appreciate Dr. Huang, X at Michigan Uiv. and Dr. P. Green for providing CAP3 and Phred/Phrap package, respectively. This work was supported by the 21C Frontier Microbial Genomics and Applications Center Program (grant MG02-0402-001-1-0-0) from the Ministry of Science and Technology (MOST) of the Republic of Korea .
References
Altschul SF,
Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped
BLAST and PSI-BLAST: A new generation of protein database search programs.
Nucleic Acids Res 25: 3389–3402
Engle ML, Burks C (1994) Artificially Generated Data sets for testing DNA sequence Assembly
Algorithms. Genomics 16:286-288
Ewing B,
Hillier L, Wendl M, Green P (1998) Base-calling of automated sequencer traces
using Phred. I. Accuracy assessment.
Genome Res 8:175–185
Huang X,
Madan A (1999) CAP3: A DNA sequence assembly program. Genome Res 9:868-877
Lander ES,
Waterman MS (1988) Genomic Mapping by Fingerprinting Random Clones: A
Mathematical Analysis. Genomics 2: 231-239
Lee PC, Lee SY, Hong SH, Chang
HN (2002) Isolation and characterization of a new succinic acid-producing
bacterium, Mannheimia succiniciproducens
MBEL55E, from bovine rumen. Appl Microbiol Biotechnol 58:663-668
Myers G (1999a) A Dataset
Generator for Whole Genome Shotgun Sequencing. Proc Int Conf Intell Syst Mol Biol 202-210
Myers G.
(1999b) Whole-genome DNA sequencing. Comput Sci Eng 1:33-43
Pop M,
Salzberg S, Shumway M (2002) Genome sequence assembly: Algorithms and Issues.
IEEE Computer 35: 47-54
Roach JC, Boysen C, Wang K, Hood L
(1995) Pairwise end sequencing: a unified approach to genomic mapping and
sequencing. Genomics 26:345-353