FASIM: Fragments Assembly Simulation using Biased-Sampling Model and Assembly Simulation for Microbial Genome Shotgun Sequencing

-- Manual

Hur, Cheol-Goo1,3 , Sunny Kim1 , Chang Hoon Kim1 , Sung Ho Yoon1 , Yong-Ho In2 , Cheolmin Kim3 , AND Hwan Gue Cho3*

 

 

1 Division of Genomics and Proteomics, Korea Research Institute of Bioscience and Biotechnology (KRIBB), 52 Oun-dong, Yusong-gu, Daejon 305-333, Korea

2 Bioinfomatix, Inc., Nam Chang bldg, 748-16, Yeoksam-dong, Gangnam-gu, Seoul 135-925, KOREA

3 Bioinformatics Cooperative Course , Pusan National University, 30 Jangjeon-dong, Geumjeong-gu, Pusan 609-735, Korea

* Corresponding author

Phone: 82-51-510-2283; Fax: 82-51-515-2208; E-mail: hgcho@pusan.ac.kr

A current release of the FASIM can be available upon request at hurlee@kribb.re.kr .


Abstract

We have developed a program for generating shotgun data sets from known genome sequences. Generation of synthetic data set by computer program is a useful alternative to real data to which students and researchers have limited access. Uniformly-distributed-sampling clones which were adopted by previous programs cannot account for the real situation where sampled reads tend to come from particular regions of the target genome. To reflect such situation, a probabilistic model for biased sampling distribution was developed by using experimental data set derivied from a microbial genome project. Among experimental parameters tested - varied fragment or read lengths, chimerism, and sequencing error, sequencing error was the most critical factor that hampers sequence assembly. We propose that optimum sequencing strategy employing different insert length and redundancy can be established by performing a variety of simulations.

 

Introduction

With the advent of shotgun sequencing approach, genome sequnece data are being rapidly accumulated. As for microorganisms, 142 complete genomes have been published as of Oct. 2003. Whole genome shotgun sequencing (WGSS) is to sequence randomly generated fragments of an entire genome, and then to assemble the sequences by computer program (Myers 1999b). Availability of a sequence assembly program and well-designed shotgun approach can considerably reduce the cost and time for genome projects. In silico simulation of entire sequencing steps such as base calling from chromatogram, vector masking and sequence assembly, can be extremely helpful in understanding each step and in designing  the experiments with its predictive power. However, this ¡°virtual experience of genome project¡± is hampered by the limited availability of real experimental data sets.

Generation of random short fragment sequences by computer program can be a useful alternative to the real data set. To the best of our knowledge, only two programs have been reported – GenFrag 2.1 (Engle and Burks 1994) and celsim (Myers 1999a).

Previously presented GenFrag (version 1.0) was designed to generate artificial DNA sequences fragment set for testing DNA fragment assembly algorithm. Dataset generated as mentioned above is independent and system-like to evaluate algorithm. Version 2.1 adds the function to generate dataset that reflects potentially troublesome aspects of fragment assembly more than before. GenFrag generates fragment set for given DNA sequences to be input according to criteria defined by users. For main features, it can adjust fragment length of parent sequence, depth of coverage and error rate in fragments. For Version 1.0, a single error rate is defined and all nucleotides are randomly mutated on the basis of this error rate. However, error rates are increased with increase of fragment length so that errors are not generated in the same degree according to fragment length. Thus, GenFrag 2.0 provides three error distributions as applying position-specific error profile: Substitutions, insertions and deletions. Moreover, it adds new module for repetitive elements that act as barrier in assembly projects so as to allow repeat complexity before Parent sequences are divided into fragments. You fill input files including one or more DNA sequences(used as repeat templates) according to required insertion frequency values. That is, GenFrag 2.0 is a simulator enabling to examine performance of assembly tools and the most distinguished feature is to adjust error rates of sample sequences.

Celsim are tools that can generate practical and repetitive target sequences and shotgun data sets. It largely consists of three parts. Firstly, sequence generation stage makes DNA sequences generated in artificially controlled types (repetitive array, transposon-like repetitive elements, large-scale duplications, etc.) Secondly, optional polymorphism stage applies polymorphic variations to generated target sequences. Finally, shotgun stage simulates sampling and end-sequencing as much as you want relative to insert libraries to fit to artificially selected conditions (read length, numbers, single base errors, chimeric rate, etc.)

That is, celsim is the tool to generate shotgun data set using known DAN sequences or generate DNA sequences that are artificially adjusted.

However, programs described above still have some limits in developing assembly tools, examining performance of assembly tools or simulating WGSS. While celsim is designed to enable users to adjust conditions(chimerism, sequencing error, distribution of fragment or read length) in consideration of a wide range of experimental conditions that can occur in actual WGSS, these programs are designed on the assumption that fragments are equally distributed on genome. That is, clone sampling is uniformly carried out during fragment generation. Thus, these programs don¡¯t sufficiently reflect actual conditions of WGSS.

This is caused by the observation that biased clone distribution always happens due to various reasons – incomplete fracturing of genome, biased insertion of fragments into vectors and improper cloning of some insert/vector combinations (Myers 1999b).

In this study, we developed a program suite for generating random fragments in microbial genome sequencing. To create a more realistic program, a probabilistic model of biased sampling distribution was applied to generate synthetic shotgun data sets from known genome sequences. We also report the effects of various experimental factors on sequence assembly.


Materials and methods

 

Generation of a probabilistic model for biased random sampling

 

Probabilistic model for biased sampling distribution was developed by using experimentally derived data set constructed from the genome project of Mannheimia succiniciproducents MBEL55E (accession no. AE016827, unpublished). This strain is a novel succinic acid-producing bacterium, which was isolated from bovine rumen (Lee et al. 2002). The chromosome consists of 2,314,078 bp, which were cloned into 18,958 of 2-kb clones and 294 of 40-kb clones. As shown in Fig. 1, the bacterial chromosome was scanned for finding out the start position of each read by BLASTN program (Altschul et al. 1997). The cumulative distribution function for biasedsampling, F(x), was defined as

             (0xgenome size) (1)

where f(i) is the number of reads positioned at the  ith position and N is the total number of reads. Once a random number between zero and one is selected, the read position corresponding to that number is traced from F(x). Starting from a generated position, inserts having different length such as 2 kb, 40 kb or 150 kb were taken out from the Pasteurella multocida genome sequence (accession ro. NC_002663).

  Fig. 1 Process to Develop Random Distribution Model

 

 

Development of a data set generation program

 

The program was written in C and runs as a Unix or Linux command line tool. It was designed to generate sequence files of a shotgun data set from known genome sequences (Fig. 2). Several parameters(Table. 1) allow users to control the read sequences. Normal distribution of fragment and read lengths can be parameterized. The sequencing error rate can be optionally specified. Two file names, genome sequence file and vector sequence file, are arguments with several parameter settings (Fig. 2). The positions of every fragment are generated from a derived probabilistic model (see Materials and methods). The information file containing the inserts distribution of length and position of inserts is also provided. After reads were generated from ends of inserts, vector sequence was added to the 5¡¯ of 3¡¯ end of each read sequence. At this step, chimeric reads are generated by combining two randomly selected reads. pGEM3Zf was used as a vector for 2 kb clone and pEpiFOS-5 for 40- and 150 kb clones. The resulting read sequences are converted into ABI files by Phred package (Ewing et al. 1998).

Output file generally consists of three parts. Read sequence files directory and chromatograms directory in Fasta format are generated, clone position/length distribution is calculated and the results are printed by section. Upon the completion of this process, fragment assembly is carried out using CAP3(or Phrap). In this case, assembly results are analyzed as the counts of contig check the counts of scaffold by coverage stage.

 

 

Fig. 2 Module Configuration of FASIM and Functions of each Module. modul-1 enters command lines, genome sequence and vector sequence. module –2 calculates start points, end points and lengths of clones that are entered. On the basis of this data, module-3 calculates required statistics. Then, defined number of  normal clones are generated and converted to ABI files. Next, chimeric clones are generated to match total number.

 

Module -1 : Command line interpretation and Vector and genome sequence input processing

       1)             A variety of options given by command lines are processed. Each option is entered and checked by purpose of each parameter.

       2)             DNA sequences in Fasta Format are entered and then blank and numbers are deleted. Then, standard nucleotide and ambiguous sequences are received.

Module-2 :  Strain sequences to simulate and vector data to test are entered.

Module-3 : Random position sampling on genome

       1)             This module carries out sampling for clone picking start points with each genome position from 0.0 to 1.0 generated by random number generation using c.d.f function that is generated through learning.

       2)             This module temporally calculates end points using average lengths from start points and define final end points as adjusting temporal end points to acquire standard distribution within given scope of deviation.

 

Module-4 : Routine to calculate statistics of templates

1)      This module prints distribution by given section to check whether clone boundaries in Position distribution Sampling are evenly distributed on genome.

2)      Clone length distribution : Clone length distribution is set according to standard distribution. This module prints clone length distribution by section to check the results.

 

Module-5 :  Routine to generate normal clone

1)      This module singles out sequences from genome sequences on the basis of clone data from Module-3.

2)      If start point is too close to the end of genome, this module processes in accordance with conditions(when bacterial genome is linear, it is not selected as clone. For circular genome, deficient portions are selected from the first part of genome). (using ran1(), gasdev() in Numeric Receipe routine)

3)      This module attaches vector sequence to the sides 5¡¯ and 3¡¯.

                                                           

Module-6 : Routine to generate Chimeric clones

       1)             This module introduces chimerism only when lengths of two clones are over 0 as selecting data on boundaries and lengths of two independent clones.

       2)             This module decides positions for recombination as generating random numbers on the basis of short clone lengths, selects two sequences from genome sequence and generates chimeric clones on the basis of determined positions.

       3)             This module attaches vector sequence to the sides 5¡¯ and 3¡¯.

 

Module-7 : Read maker

1)      This module decides read lengths conforming to standard distribution with input read length deviation as standard deviation and input read length as average. Then, it generates reads in required length including vector sequences from generated clones above.

2)      This module allows adjusting ratio of substitution, insertion and deficiency in consideration of sequencing errors.

 3 ) This module generates chromatogram files as bring mktrace.

 

Files for FASIM Program

   1) MHPOS.dat : ) Frequency function data according to actual positions of reads used until completion of Mannheimia succiniciproducens MBEL55E finishing on chromosome

2) simul.c : generation of sim, executable program for main program including mai().

3) seq_cutil.c : Groups of routines required for sequence processing

4) seq_cutil.h : Header for seq_cutil.c

5) nrutil.h : Header for using numerical receipe routine

6) nr.h  : Header for using numerical receipe routine

7) mktrace : Generating ABI files from sequences as the program included in phred/phrap package. It shall be positioned to allow fasim program can retrieve without special settings, as being separately complied.

 

 

Major parameter of FASIM program

Table. 1 : Major parameters of FASIM program can carry out a wide range of simulation works for programs.

 

Sequence assembly

Base calling from ABI file was done by using Phred package (http://www.phrap.org). After the vector sequence was masked, read sequences were assembled into contigs (overlapped fragments in a contiguous region) and scaffolds (maximally linked and ordered set of contigs) by CAP3 (Huang and Madan 1999). (Fig. 3)

Fig. 3 Assembly Process on the basis of Generated Data

 


Results

Results from Running Program

(a) Position distribution

For Uniform distribution( Fig. 4 ), sampling was evenly carried out on genome in general. It indicates sampling of start points on genome by section of which the size was set to 10Kbp. Since the size of overall genome in Pasteurella multocida was about 2.26 Million bp, X-axis reaches to about 2.5 Million and about 90(black-2kb clone) clones were allocated on each section in average. It shows uniform distribution in general in spite of substantial deviation. Red points were fragments generated on the basis of about 40 Kbp. In this step, about 20 fragments with different types per section were generated and were also evenly distributed. On the contrary, when cumulative distribution model generated through learning data of Mannheimia succiniciproducents ( Fig. 5 ), we got the clone picking results that were biased similarly to shotgun sequencing approach.

Fig. 5 Uniform distribution

Fig.6 Cumulative distribution

(b) Clone length distribution

It was confirmed that clone length distribution was also controlled in a pattern according to standard distribution. From this distribution, we can understand the length distributions when 40kb mate and 2kb mate were generated (Fig. 8).

          Fig. 7  Length Distribution of 2Kb clones

                 

            Fig. 8  Length Distribution of 40Kb clones

(c) Read length distribution

The chart shows that read length was well controlled to conform to standard distribution. The length in this chart indicates the length in the state including vector sequence.

Fig. 9 Calculation with 600dp in average size distribution length of read length and 200bp in deviation

 

Distribution of fragments generated by the program

 

To test the performance of the program, we generated sequencing reads uniformly from published genome sequence, Pasteurella multocida, which is circular genome having 2,257,487 bp (G). The simulation condition was given as insert length of 2 kb, read length of 600 bp (L), and no experimental errors. The contigs were generated by assembling the reads with setting of minimal overlapped length between reads (T) as 30 bp. Theoretically, if all the reads having same size are uniformly sampled from genome, the expected number of contigs can be calculated from

                    (2) (Lander and Waterman 1988)

where N is the number of sequenced reads.

As shown in Fig. 3A, the summed number of contigs and singlets, that is reads left from contig assembly was perfectly matched to equation (2) as number of reads went. This shows this program performs well in generating shotgun data sets. In theory, uniformly sampled reads with no experimental errors guarantee 99 percent of the genome can be covered by eightfold redundancy (Pop et al. 2002), which hardly occurred in real situation. When the number of contigs generated from Mannheimia WGSS were plotted according to redundancy, the patterns were quite different from that of equation (2), and the extent of deviation was largely propagated at the higher redundancy (Fig. 3B).

Fig. 3 Redundancy profiles of contig number generated by the program (A) and by coming from Manheimia genome sequencing project (B). (--¡à-- contigs; --¡â--singlets; --¡Ü-- contigs + singlets; solid line fitted to equation (2))

 

Effect of sequencing strategy on sequence assembly

 

Redundancy of sequence and the choice of insert length are key factors in reducing scaffolds. To find out the effect of sequencing strategy on assembly, combinations of inserts having different length were simulated and the results were compared in Table 2. The data sets were generated based on biased-sampled model for the purpose of simulating more realistic situation. We chose 40 kb as a representative insert length for cosmid, and 150 kb for BAC (bacterial artificial chromosome). At the fixed redundancy, fewer scaffolds and larger target coverage resulted when the proportion of longer inserts was increased. For example, 250 scaffolds were made when only 2 kb inserts were used to give 3 fold redundancy. When 2¢¥ redundancy was further added, 85, 124, 130, and 141 scaffolds were reduced at the 2 kb-, 40 kb- and 150kb- inserts ratio of 5:0:0, 4:1:1, 3:1:1 and 3:0:2, respectively. As for the target coverage, 85.4 %, 90.72%, 91.34%, 93.62% were covered by scaffolds as addition of longer inserts increased. A mixture of 60% of 2 kb inserts and 40% of 150 kb inserts data gave best assembling performance and 109 scaffolds and 93.62% target coverage was achieved. Same discussion was applied to the assemblies at redundancy of 7 fold.

 

Table 2. Effects of using multi-library on sequence assembly at various redundancies. Data sets were derived from biased-sampled model.

 

Redundancy

Redundancy of inserts

(2 kb/40 kb/150 kb)

No. contigs

No. scaffolds

Target coverage(%)

3X

3X/0X/0X

994

250

74.82

5X

5X/0X/0X

735

165

85.40

4X/1X/0X

653

126

90.72

3X/1X/1X

642

120

91.34

3X/0X/2X

536

109

93.62

7X

7X/0X/0X

584

121

89.54

6X/1X/0X

454

96

93.35

5X/1X/1X

447

83

94.57

5X/0X/2X

360

71

95.93

 

 

Effect of experimental conditions on sequence assembly

 

Influences of a wide range of experimental conditions on fragment assembly were examined. Assembly was carried out using PHRAP on the basis of total 15 dataset (Table. 3).

Simulations were carried out to find out the effects of experimental conditions on sequence assembly (Fig. 10). We simulated data sets generated by uniformly distributed samples from Pasteurella genome sequence and the results were compared to line fitted to equation (2), which assumed uniformly sampled reads with no experimental errors. Distribution of read lengths with standard deviation of 100 bp gave no difference (Fig. 10-1). However, deviation of 200 bp made effect on contig assembly, and 115 contigs were remained even at the redundancy of 9.3. This implies that it is critical to prepare libraries with narrow deviated insert length. Contrast to varied read length, the variation in fragment length resulted in no difference (data not shown). Chimerism made little difference and 10 % of chimerism resulted in slight deviation. Because 10 % of chimerism is hard to occur in real experiments, chimerism seems to be minor factors in WGSS(Fig 10-2). It was observed that clone length deviation didn¡¯t make significant influences on contig counts. The results were pre-estimated because each read is important in fragment assembly and clone length is required for picking scaffold in finishing process later (Fig. 10-3 ). One major concern to WGSS were the sequencing errors (Fig. 10-4). Large deviation from equation (2) was found when sequencing error percentile was assumed to be one and two. Especially, beyond four-fold redundancy, number of contigs increased. This result showed that reducing sequencing error is critical to completing WGSS. Insertion/deletion sequencing error has possibility to play more important roles in fragment assembly process. And this study examined what are the effects on overall results when sequencing error rate was fixed (Fig. 10- 5). While it was observed that higher insertion/deletion rate generally caused poor assembly results, the most important factor may be sequencing errors. However, sequencing errors are not identified as the same probabilities from beginning to the end in sequencing data actually acquired, but tended to concentrate on front and end portions of sequences. Thus, it is almost impossible that these test results explained every tendency. Nevertheless, it is obvious that accuracy of sequences is the important factor for effective performance of assembly. Next, this study put a variety of factors that had been separately studied before together and examined effects of those assembled factors (Fig. 10-6). Accordingly, it was confirmed that other factors relative to effects by sequencing errors gave impact on assembly results.

         - -½Ä(3)

 Table. 3 A wide range of conditions set to examine why test results didn¡¯t match theoretic estimates.

 

Fig. 10-1 Influences of Read Length on contig counts. In this figure, 600bp in average length, SD=0.0 for case 1, SD=100 for case 2, SD=200 for case 3. The black solid line is the results when Formula (2) was applied to case 1 and the green solid line when Formula (3) was applied to case 3.

 

Fig. 10-2 Influences of Chimeric clone on contig counts. case 1 (X=0.0), case 9(X=0.01), case 10(X=0.05), case 11( X=0.10)

Fig. 10-3 Influences of Clone length deviation on contig. The solid line is the result when Formula (2) was applied, with D=0.0 for case 1, D=0.10 for case 7, D=0.20 for case 3.

Fig. 10-4 Influences of sequencing errors on contig counts. R=0.000 for case 1, R=0.0010 for case 4, R=0.0020 for case 5 and R=0.0025 for case 6. Sequencing error rates are higher in the order of case 1(black), case 2(red), case3(green) and case 4(blue). Solid line and case 1 showed the results when each data point and Formula (2) were applied. Other results were obtained as applying Formula (3). B is the coefficient of Formula (3).

 

 

Fig. 10-5. Influences of Insertion/Deficiency/Substitution Sequencing Error on Contig Count. case1- (R=0.000,IR=DR=0.000) case12-(R=0.002,IR=DR=0.010), case 13- (R=0.002, IR=DR=0.05), case 14- (R=0.002, IR=DR= 0.010) and case5 -(R=0.002, IR=DR=0.20) are related, where as Case1 for Formula (2) and others for Formula (3), and coefficient b is in round bracket.

 

Fig. 10-6 Simultaneous Influences of a Variety of Factors on contig counts


Discussion

 

We have developed data set generation program for microbial WGSS. In contrast to uniformly-sampled clones assumption which was adopted by previous programs, we applied biased distributed sampling model to our program. Sampled reads biased to come from particular region of the target sequences frequently occurrs in WGSS (Myers 1999b). This can happen by incomplete fracturing of genome, biased insertion of fragments into vectors and improper cloning of some insert/vector combinations. Up to date, the data set generators (Engle and Burks 1994; Myers 1999a) were designed based on uniformly sampled model. However, as shown in Fig. 3B, this assumption didn¡¯t hold in real experiment. Therefore, the assembly simulation should be based on the real biased distributed clones.

Although the advent of sequencing technology makes easy to do WGSS, genome project still cost huge amount of money and time. Therefore, it is essential to establish optimal sequencing strategy before getting into the sequencing. In this study, we simulated sequence assembles with various data sets consisting of different redundancy of different sized inserts (Table 1). When the redundancy was held at 7 and 8, smallest number of scaffolds and largest target coverage was achieved when 2 fold redundancy of 150 kb inserts were added to 2 kb inserts data. This result is consistent with previous finding that among one type of inserts, fewer scaffolds resulted when longer insert lengths were used (Roach et al. 1995). The possibility of spanning larger region between contigs can be enhanced by using longer inserts. However, we should consider practical limitation to use large insert size. This limitation is caused by the experimental difficulty of sequencing the ends of long inserts and preparing large sized library. Therefore, determination of redundancy and insert length should be based on economic balance. For example, in Table 1, 5X/1X/1X of 2 kb-, 40 kb-, 150 kb inserts can cost less money than 3X/0X/2X due to less use of BAC libraries for 150 kb inserts. Likewise, if 3X redundancy of 2 kb inserts were done and we should decide which insert length and the extent of redundancy further sequenced, simulation of a variety of strategies can give us clue to that decision.

Among experimental parameters tested - varied insert and read length, chimerism, and sequencing error -, sequencing error was the most critical factor in hampering sequence assembly. When the read sequences contained above 10 %, the increased contig number beyond four-fold redundancy was found (Fig. 3C). It is possible that the kind of error, substitution, insertion, or deletion, affect the assembling process. Although, there was a tendency for the increased proportion of insertion and deletion led to large contig number at fixed total error rate (data not shown), the extent was negligible. It has been widely accepted and an open secret that the quality of sequencing ultimately determines the success of WGSS. However, there was no systematic study to support this idea.

In summary, we have developed data set generator for microbial WGSS and performed a variety of simulation in systematic way. Allowing users to vary parameters such as distribution of fragment length and read length, sequence error rate, and chimerism rate facilitate WGSS by establishing optimal strategy. Development of sequence assembly related tools can be benefited from systematically controlled data set by testing algorithm and capability. We believe that students and researchers in microbiology who have limited access to WGSS can take advantage of this program for enhancing their understanding in genomics. A current release of the software and documentation can be available upon request at hurlee@kribb.re.kr.

 


Acknowledgements

We thank Manheimia succiniciproducents MBEL55E genome sequencing team ? professor Sang Yup Lee at KAIST, Bioinfomatix Inc., and GenoTech Corporation for providing the genome sequences. We appreciate Dr. Huang, X at Michigan Uiv. and Dr. P. Green for providing CAP3 and Phred/Phrap package, respectively. This work was supported by the 21C Frontier Microbial Genomics and Applications Center Program (grant MG02-0402-001-1-0-0) from the Ministry of Science and Technology (MOST) of the Republic of Korea .

 


References

 

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402

Engle ML, Burks C (1994) Artificially Generated Data sets for testing DNA sequence Assembly Algorithms. Genomics 16:286-288

Ewing B, Hillier L, Wendl M, Green P (1998) Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 8:175–185

Huang X, Madan A (1999) CAP3: A DNA sequence assembly program. Genome Res 9:868-877

Lander ES, Waterman MS (1988) Genomic Mapping by Fingerprinting Random Clones: A Mathematical Analysis. Genomics 2: 231-239

Lee PC, Lee SY, Hong SH, Chang HN (2002) Isolation and characterization of a new succinic acid-producing bacterium, Mannheimia succiniciproducens MBEL55E, from bovine rumen. Appl Microbiol Biotechnol 58:663-668

Myers G (1999a) A Dataset Generator for Whole Genome Shotgun Sequencing. Proc Int Conf Intell Syst Mol Biol 202-210

Myers G. (1999b) Whole-genome DNA sequencing. Comput Sci Eng 1:33-43

Pop M, Salzberg S, Shumway M (2002) Genome sequence assembly: Algorithms and Issues. IEEE Computer 35: 47-54

Roach JC, Boysen C, Wang K, Hood L (1995) Pairwise end sequencing: a unified approach to genomic mapping and sequencing. Genomics 26:345-353