Manual Paper

 

-Implementation

-Installation

-How to Use c shell scripts adopted in this study

-Files in Program

-Input Data for Program

-Output

 

- c shell script adopted in this study

==============================================================

Manual

 

-Implementation

FASIM was developed using C language in SGI ORIGIN 2000 IRIX 6.5 and executable even in Compaq True 64.

 

============================================================

-Installation

 

      - Supported platforms

Ÿ          SGI ORIGIN 2000 IRIX 6.5 

Ÿ          COMPAQ True 64

Ÿ          Red Hat Linux release 8.0

        

      - Download

       To download the FASIM and Scripts, just click here coming soon!

     

After downloading:

> tar –xvzf  simulation.tar.Z

> cd simulation

> make

 

* mktrace(phred package), Phred and Phrap/CAP3 are separately compiled so that they have to be retrieved without other setting by sim program.

 

============================================================

-Files for Program

 

1) MHPOS.dat : Frequency function data according to actual position on reads used before completion of Mannheimia succiniciproducens MBEL55E finshing on chromosome

2)      simul.c : Creating sim, a main program executable program including mai()

3)      seq_cutil.c : groups of routines required to processing sequence

4)      seq_cutil.h : header for seq_cutil.c

5)      nrutil.h : header to use numerical receipe routine

6)      nr.h  : header to use numerical receipe routine

mktrace : Generating ABI file from sequence as the program including phred/phrap package.

 

 

============================================================

-Input Data for Program

 

1.      Genome sequence (FASTA format): This shall conform to FASTA format as Text file and one line in sequence shall not be too much long. While the upper limit for Buffer size is set 1024 at present, general FASTA format sets a relative upper limit to approximately 60 so that it is preferable to conform to the limit. Of course, numbers or blank in sequence are ignored

2.      Vector sequence (Fasta Format): Using the same format as genome sequence.

 

3.      Command line options

a. General Method to Use

       > sim [-x value] [-stat x y [z] ] [-circle]

      

     ex)

       > sim L 2000 N 500 sd seq01_dir cd chromat01_dir stat

       100 10000 seed 1 h PMP vn vector.seq vf 30 circle

       X 0.1 > stat_01.out &

 

b. Description for each Option (default setting in round blackets)

a) x=g : genome sequence file name (=genome.seq)

enter genome sequence file names including paths.

 

b) x=m: genome size in bp.(=2257487)

 

c) x=L : average fragment length  (=2000)

It means average length of shotgun fragment. Once it is executed, it is fixed and cannot be changed.

 

d) x=D : Deviation of fragment length  (=0.10)

Since the length of shotgun fragment is irregular and follows regular distribution on the basis of average length, standard deviation of length is set to adjust it. Enter a percentage to L. That is, [ -L 2000 -D 0.10] means that standard devision is 200bp.

 

e) x=X : Rate of chimerism  (=0.00)

It indicates percentage of chimera among the entire template. It is known that this rate has a variety of values in accordance with test conditions from about 1% to 10% related to actual test data. However, it will be difficult to get useful data with chimerism over 20% so that its upper limit was set to 0.2.

 

f) x=N : number of fragment pairs (=20000)

It indicates total number of shotgun fragment. Since one fragment(e.i. clone) is read in both directions, if N=20000, 4,000 reads are created. Even though the number of Fragment is set to 20000, 40,000 reads are not always created. In particular, when genome structure is linear, the fragment starting from the end of genome is skipped. Thus, we get slightly fewer fragments. If fragment in a certain direction during the process to creat chimerism, it is skipped so that we have the same result. 

 

g) x=h : header (=PML)

Shotgun fragment are generally named according to project names and types of libraries. Thus, this enables to define names. The length of this string cannot exceed 9 characters. In general, thee characters are applied.

 

h) x=vn: vector sequence file name (=vector.seq)

It is file name with vector sequence. It can include Directory name. String length is limited to 199.

 

i) x=vf : vector fragment length in a clone (=30)

The length of vector sequence required by each fragment(clone) is set. It is set to 30¿¡ by default at present. It is carefully adjusted as properly adjusting minmatch and minscore of cross_match during vector screening as keeping in mind that vector screening can fail. There will be no problems unless this value is excessively high.

 

j) x=vl : vector size (=8000)

It means actual size of vector(bp). You can enter accurate value, but low values are restricted. For sufficient allowance, it is preferable to enter bigger value than actual vector size.

 

k) x=sd: sequence files directory (=./seq_dir)

You can designate sequence file directory here. You can also enter Full path name. String length is limited to 199.

 

l) x=cd: chromatograms directory (=./chromat_dir)

The final chromatogram will be positioned in this directory. You can enter full path name. String length is also limited to 199.

 

m) x=rl : average read length  (=600)

Two reads(because this string is read in both directions) are created from one shotgun fragment, and each read length is not regular. Then, average length is set to 600 by default here, which means the length including vector. What to pay attention here is that if you do trimming for base calling using phred, the maximum length of remaining sequence after trimming is only 550 bp. Accordingly, you lose many portions of long sequences. For accurate calculation, it is required to set sizes below 550 or slightly modify phred source.

 

n) x=rd: deviation of read length  (=50)

It is mentioned above that read length is irregular. The range of this deviation is set in read length deviation, which will be standard deviation. In this case, we must pay attention that the range of this deviation is set in bp value unlike clone length deviation. Even though absolute value of read length is not changed in a broad scope, clone length has significantly broader scope of deviation. Thus, relative ratio is regarded as more proper input.

 

o) x=seed: a seed number for a random number generation (=-1)

It includes a number of random processes including basic process to create fragments. If completely random process is executed whenever you run program, it is difficult to compare the results of program execution or debugging. Thus, it is designed to always give the same results when seed values defined in this string are same values. Of course, pseudo random sequence is generated for one seed value during execution. With this process, you can achieve two goals. When you want to see real random effects, it is better to compare results by changing seed values. In this case, seed values shall be negative.

 

p) x=rate :sequencing error rate (=1.00e-04)

This option enables to introduce certain sequencing error. This option is adopted as alternative because it is not easy to accurately copy actual problems caused by irregular sequence quality. Thus, this option sets overall sequencing error rate. In addition, when this option is set, it effects options of ir and dr. In general, values corresponding to rate * (1.0 – irdr )become sequencing error rates for substitution.

 

q) x=ir :The portion of insertion sequencing error (=0.20 )

This option defines percentage of inserted sequencing error among entire sequencing errors. Thus, actual inserted sequencing error is rate * ir.

 

r) x=dr :The portion of deletion sequencing error (=0.20 )

This option defines percentage of deleted sequencing error(deletion mutation) among entire sequencing errors. Thus, actual deleted sequencing error rate is rate * dr.

 

s)-circle: This option is for circular genome(default=unset).

Sequence entered by Fasta format is assumed to be linear. However, since general bacterial genome is circular, this option shall be considered. Thus, it is required to clearly state command line option, that is, -circle, for circular sequences. Sequences are assumed to be linear when this option is omitted.

 

t)-stat x y z:

if this option is given then statistics are  printed

to stdout.(default=unset)

     x(=100  ) : bin size of fragments

     y(=10000) : bin size of positions

     z(=0    ) : statistics only options

 

This option is to check whether you get data controlled according to basic input. You will sometimes analyze after getting all results when repetitively running program. In other ways, you may be interested only in statistics related to sequence generation. In this case, executing all processes will be time-consuming process. Therefore, in some case, it is necessary to check even sampling by position of genome or clone length distribution. For setting –stat option, two options(x, y) immediately following it shall be entered and the final z is optional. If z>0, only statistics are printed, but ABI file is not generated. Bin size means the dimension of sector. That is, if x=100, it means that clone length is counted by 100bp unit. Y=10000 means that starting point counts the number of fragments received in each section when genome is divided by 10000bp unit.

 

 

============================================================

-Output

 

1.    Position distribution

Position distributions of generated colons are printed.

 

Example)

 

2.    Clone length distribution

Length distributions of generated clones are printed by section.

 

Example)

 

3.      Read length distribution

4.      Generated DNA sequences: These files are set in –sd. The default setting is ./seq_dir.

5.      ABI files from Sequences: These files are set in –cd. The default setting is./chromat_dir.

 

============================================================

-How to Use C shell scripts in this study

 

Using scripts in simulation/scripts, Operations of FASIM and other relevant programs (phred/cross_match/phrap)and corresponding results can be easily analyzed.

 

1. Script (mkall.csh) to make directories where dataset to be generated by FASIM is saved

Scripts make relevant directories to collect dataset in a wide range of conditions to be generated by FASIM.

 

¡°Usage :  mkall.csh  chromat_dirs  seq_dirs  stat_dirs ¡°

%  ./mkall.csh  ch  se  st

 

2.  Script (sall01.csh) to generate 11 different conditions for dataset using FASIM

It is the simulation process by FASIM and designates directory names for data to be generated.

 

¡°Usage:  sall01.csh chromat_dir seq_dir stat_dir¡±

%  ./sall01.csh  ch  se  st

 

3. Script (phred.csh)to execute phred

This process is for basecalling ABI files among dataset to be generated by FASIM using phred.

 

¡°Usage : phred.csh chromat_dir¡°  

% ./phred.csh ch

 

4. Script (call.csh)to execute cross_match

This process is for masking vector sequences including reads after basecalling by bphred.

 

¡°Usage : call.csh  vector_file¡±

% ./call.csh  pGEM3.seq

 

5. Script (pall.csh) to execute phrap

This process is for assembling sequences after vector masking and runs phrap.

 

¡°Usage : pall.csh ¡°

% ./pall.csh

 

6. Scripts (gall.csh) to count the numbers of generated contigs

This process is for counting contigs that are assembled by phrap and designates directory names to include assembly results.

 

¡°Usage: gall.csh directory¡±

% ./gall.csh edit_dir

 

7. Script (rmall.csh) to delete dataset generated by FASIM

 

8. Scripts ( mvall.csh) to change .qual file names generated by Phred to .screen.qual.

 

 

============================================================

< c shell scripts adopted in this study >

 

1. Scripts to generate 11 different conditions for dataset

 

#!/bin/csh

if($#argv < 3 ) then

   echo "Usage : $0  chromat_dirs  seq_dirs  stat_dir"

   exit(1)

endif

set ids=(01 03 05 08 10 15 20 25 30 35 40)

set prog="sim"

set option1="-L 2000 -stat 100 10000 -seed -1 -h PMP -vn pUC_GEM.seq -vf 30 -circle"

set option2="-D 0.0 -X 0.0 -rd 0 -rate 0.0"

foreach tmp ($ids )

   @ idd = $tmp

   @ idd = $idd * 500

   set option3="-N $idd -sd $2${tmp}_dir -cd $1${tmp}_dir"

   set command="$prog $option1 $option2 $option3"

   $command > $3_dir/stat_${tmp}.out &

end

 

The script above is to generate data of case 1. Option2 was differentiated each other for different cases.

case 1 : option2="-D 0.0 -X 0.0 -rd 0 -rate 0.0"

case 2 : option2="-D 0.0 -X 0.0 -rd 100 -rate 0.0"

case 3 : option2="-D 0.0 -X 0.0 -rd 200 -rate 0.0"

case 4 : option2="-D 0.0 -X 0.0 -rd 0 -rate 0.0010"

case 5 : option2="-D 0.0 -X 0.0 -rd 0 -rate 0.0010"

case 6 : option2="-D 0.0 -X 0.0 -rd 0 -rate 0.0025"

case 7 : option2="-D 0.1 -X 0.0 -rd 0 -rate 0.0"

case 8 : option2="-D 0.2 -X 0.0 -rd 0 -rate 0.0"

case 9 : option2="-D 0.0 -X 0.01 -rd 0 -rate 0.0"

case10: option2="-D 0.0 -X 0.05 -rd 0 -rate 0.0"

case11: option2="-D 0.0 -X 0.10 -rd 0 -rate 0.0"


case12: option2="-D 0.0 -X 0.0 -rd 0 -rate 0.0020 -ir 0.01 -dr 0.01"

case13: option2="-D 0.0 -X 0.0 -rd 0 -rate 0.0020 -ir 0.05 -dr 0.05"

case14: option2="-D 0.0 -X 0.0 -rd 0 -rate 0.0020 -ir 0.10 -dr 0.10"

case15: option2="-D 0.2 -X 0.01 -rd 200 -rate 0.002"

 

2. Scripts to run phred

#!/bin/csh

 

if($#argv<1) then

   echo "Usage: $0 chromat_directories"

   exit(1);

endif

 

set ids=(01 03 05 08 10 15 20 25 30 35 40)

set prog="phred"

 

foreach tmp ($ids )

   set options="-id $1${tmp}_dir -trim_fasta -sa pm_$tmp -qa pm_$tmp.qual"

   $prog $options -trim "" &

end

 

3.Script to run cross_match

!/bin/csh

if($#argv < 1 ) then

   echo "Usage : $0  vector_file"

   exit(0);

endif

 

set ids=(01 03 05 08 10 15 20 25 30 35 40)

set prog="cross_match"

 

foreach tmp ($ids )

$prog pm_$tmp $1 -minscore 15 -screen > pm_$tmp.out &

end

 

 

 

4. Script to run phrap

 

#!/bin/csh

 

#if($#argv < 1 ) then

#  echo "Usage : $0  vector_file"

#  exit(0);

#endif

 

set ids=(01 03 05 08 10 15 20 25 30 35 40)

set prog="phrap"

 

foreach tmp ($ids )

   $prog -minmatch 30 pm_$tmp.screen > pm_$tmp.screen.phrap.out &

end

 

 

5.Script to count the number of contig

 

#!/bin/csh

 

if($#argv < 1 ) then

   "Usage : $0  Directory"

   exit 0

endif

 

if( -e $1) then

   set dir=$1

else

   "The directory does not exist..."

   exit 0

endif

 

set ids=(01 03 05 08 10 15 20 25 30 35 40)

set prog="grep"

echo "-------------------------------"

echo "Reads Contigs  Singlets Total"

echo "-------------------------------"

 

@ contigs  = 0

@ singlets = 0

@ total    = 0

@ reads    = 0

 

foreach tmp ($ids )

   @ reads  ="${tmp}000"

   @ contigs=`grep ">" $dir/pm_$tmp.screen.contigs | wc -l`

   @ singlets=`grep ">" $dir/pm_$tmp.screen.singlets | wc -l`

   @ total = $contigs + $singlets

   echo "$reads   $contigs $singlets   $total";

end

echo "-------------------------------"

 

 

============================================================