Manual
-Implementation
-Installation
-How to Use c shell scripts adopted in this
study
-Files in Program
-Input Data for Program
-Output
- c shell script adopted in this study
==============================================================
Manual
-Implementation
FASIM
was developed using C language in SGI ORIGIN 2000 IRIX 6.5 and executable even
in Compaq True 64.
============================================================
-Installation
- Supported platforms
SGI ORIGIN 2000 IRIX 6.5
COMPAQ True 64
Red Hat Linux release 8.0
-
Download
To
download the FASIM and Scripts, just click here coming soon!
After downloading:
> tar –xvzf simulation.tar.Z
> cd
simulation
> make
* mktrace(phred package), Phred and Phrap/CAP3 are separately compiled
so that they have to be retrieved without other setting by sim
program.
============================================================
-Files for Program
1)
MHPOS.dat
: Frequency function data according to actual position on reads used
before completion of Mannheimia succiniciproducens
MBEL55E finshing on chromosome
2) simul.c : Creating sim, a main program executable program including mai()
3) seq_cutil.c : groups of routines
required to processing sequence
4) seq_cutil.h : header for seq_cutil.c
5) nrutil.h : header to use numerical receipe routine
6) nr.h : header to use numerical receipe routine
mktrace : Generating ABI file from
sequence as the program including phred/phrap package.
============================================================
-Input Data for Program
1. Genome sequence (FASTA
format): This shall conform to FASTA format as Text file and one line in
sequence shall not be too much long. While the upper limit for Buffer size is
set 1024 at present, general FASTA format sets a relative upper limit to
approximately 60 so that it is preferable to conform to the limit. Of course,
numbers or blank in sequence are ignored
2. Vector sequence (Fasta Format): Using the same format as genome sequence.
3. Command line options
a. General Method to Use
>
sim [-x value] [-stat x y [z] ]
[-circle]
ex)
>
sim –L 2000 –N 500 –sd seq01_dir –cd chromat01_dir –stat
100
10000 –seed –1 –h PMP –vn vector.seq –vf 30 –circle
–X 0.1 >
stat_01.out &
b.
Description for each Option (default setting in round blackets)
a)
x=g : genome sequence file name (=genome.seq)
enter
genome sequence file names including paths.
b)
x=m: genome size in bp.(=2257487)
c)
x=L : average fragment length
(=2000)
It
means average length of shotgun fragment. Once it is executed, it is fixed and
cannot be changed.
d)
x=D : Deviation of fragment length
(=0.10)
Since the
length of shotgun fragment is irregular and follows regular distribution on the
basis of average length, standard deviation of length is set to adjust it.
Enter a percentage to L. That is, [ -L 2000 -D 0.10]
means that standard devision is 200bp.
e)
x=X : Rate of chimerism (=0.00)
It
indicates percentage of chimera among the entire template. It is known that
this rate has a variety of values in accordance with test conditions from about
1% to 10% related to actual test data. However, it will be difficult to get
useful data with chimerism over 20% so that its upper limit was set to 0.2.
f)
x=N : number of fragment pairs (=20000)
It indicates
total number of shotgun fragment. Since one fragment(e.i. clone) is read in both directions, if N=20000, 4,000
reads are created. Even though the number of Fragment is set to 20000, 40,000
reads are not always created. In particular, when genome structure is linear,
the fragment starting from the end of genome is skipped. Thus, we get slightly
fewer fragments. If fragment in a certain direction during the process to creat chimerism, it is skipped so that we have the same
result.
g)
x=h : header (=PML)
Shotgun
fragment are generally named according to project names and types of libraries.
Thus, this enables to define names. The length of this string cannot exceed 9
characters. In general, thee characters are applied.
h)
x=vn: vector sequence file
name (=vector.seq)
It
is file name with vector sequence. It can include Directory name. String length
is limited to 199.
i) x=vf :
vector fragment length in a clone (=30)
The
length of vector sequence required by each fragment(clone)
is set. It is set to 30¿¡ by default at present. It is carefully adjusted as
properly adjusting minmatch and minscore
of cross_match during vector screening as keeping in
mind that vector screening can fail. There will be no problems unless this
value is excessively high.
j)
x=vl : vector size (=8000)
It means actual
size of vector(bp). You can
enter accurate value, but low values are restricted. For sufficient allowance,
it is preferable to enter bigger value than actual vector size.
k)
x=sd: sequence files
directory (=./seq_dir)
You
can designate sequence file directory here. You can also enter Full path name.
String length is limited to 199.
l)
x=cd: chromatograms
directory (=./chromat_dir)
The
final chromatogram will be positioned in this directory. You can enter full
path name. String length is also limited to 199.
m)
x=rl : average read
length (=600)
Two
reads(because this string is read in both directions)
are created from one shotgun fragment, and each read length is not regular.
Then, average length is set to 600 by default here, which means the length
including vector. What to pay attention here is that if you do trimming for
base calling using phred, the maximum length of remaining sequence after
trimming is only 550 bp. Accordingly,
you lose many portions of long sequences. For accurate calculation, it is
required to set sizes below 550 or slightly modify phred source.
n)
x=rd: deviation of read length (=50)
It
is mentioned above that read length is irregular. The range of this deviation
is set in read length deviation, which will be standard deviation. In this
case, we must pay attention that the range of this deviation is set in bp value unlike clone length deviation. Even though
absolute value of read length is not changed in a broad scope, clone length has
significantly broader scope of deviation. Thus, relative ratio is regarded as
more proper input.
o)
x=seed: a seed number for a random number generation
(=-1)
It
includes a number of random processes including basic process to create
fragments. If completely random process is executed whenever you run program,
it is difficult to compare the results of program execution or debugging. Thus,
it is designed to always give the same results when seed values defined in this
string are same values. Of course, pseudo random sequence is generated for one
seed value during execution. With this process, you can achieve two goals. When
you want to see real random effects, it is better to compare results by
changing seed values. In this case, seed values shall be negative.
p)
x=rate :sequencing error rate (=1.00e-04)
This
option enables to introduce certain sequencing error. This option is adopted as
alternative because it is not easy to accurately copy actual problems caused by
irregular sequence quality. Thus, this option sets overall sequencing error
rate. In addition, when this option is set, it effects options of ir and dr. In general, values corresponding to rate * (1.0 –
ir – dr )become sequencing error rates for substitution.
q)
x=ir :The portion of
insertion sequencing error (=0.20 )
This
option defines percentage of inserted sequencing error among entire sequencing
errors. Thus, actual inserted sequencing error is rate * ir.
r)
x=dr :The portion of
deletion sequencing error (=0.20 )
This
option defines percentage of deleted sequencing error(deletion
mutation) among entire sequencing errors. Thus, actual deleted sequencing error
rate is rate * dr.
s)-circle:
This option is for circular genome(default=unset).
Sequence
entered by Fasta format is assumed to be linear.
However, since general bacterial genome is circular, this option shall be
considered. Thus, it is required to clearly state command line option, that is,
-circle, for circular sequences. Sequences are assumed to be linear when this
option is omitted.
t)-stat
x y z:
if this option is given then statistics are printed
to stdout.(default=unset)
x(=100 ) : bin size of fragments
y(=10000)
: bin size of positions
z(=0 ) : statistics only options
This
option is to check whether you get data controlled according to basic input.
You will sometimes analyze after getting all results when repetitively running
program. In other ways, you may be interested only in statistics related to
sequence generation. In this case, executing all processes will be
time-consuming process. Therefore, in some case, it is necessary to check even
sampling by position of genome or clone length distribution. For setting –stat
option, two options(x, y) immediately following it shall be entered and the
final z is optional. If z>0, only statistics are printed, but ABI file is
not generated. Bin size means the dimension of sector. That is, if x=100, it
means that clone length is counted by 100bp unit. Y=10000 means that starting point
counts the number of fragments received in each section when genome is divided
by 10000bp unit.
============================================================
-Output
1. Position distribution
Position
distributions of generated colons are printed.
Example)

2. Clone length distribution
Length
distributions of generated clones are printed by section.
Example)

3. Read length distribution

4. Generated DNA sequences:
These files are set in –sd. The default setting is ./seq_dir.
5. ABI files from Sequences:
These files are set in –cd. The default setting is./chromat_dir.
============================================================
-How to Use C shell scripts in this
study
Using
scripts in simulation/scripts, Operations of FASIM and other relevant programs
(phred/cross_match/phrap)and
corresponding results can be easily analyzed.
1.
Script (mkall.csh) to make directories where dataset
to be generated by FASIM is saved
Scripts
make relevant directories to collect dataset in a wide range of conditions to
be generated by FASIM.
¡°Usage : mkall.csh chromat_dirs seq_dirs stat_dirs ¡°
% ./mkall.csh ch se
st
2. Script (sall01.csh) to generate 11
different conditions for dataset using FASIM
It
is the simulation process by FASIM and designates directory names for data to
be generated.
¡°Usage: sall01.csh chromat_dir
seq_dir stat_dir¡±
% ./sall01.csh ch se
st
3.
Script (phred.csh)to execute
phred
This
process is for basecalling ABI files among dataset to
be generated by FASIM using phred.
¡°Usage
: phred.csh chromat_dir¡°
%
./phred.csh ch
4.
Script (call.csh)to execute cross_match
This
process is for masking vector sequences including reads after basecalling by
bphred.
¡°Usage
: call.csh vector_file¡±
%
./call.csh pGEM3.seq
5.
Script (pall.csh) to execute phrap
This process is
for assembling sequences after vector masking and runs phrap.
¡°Usage : pall.csh ¡°
%
./pall.csh
6.
Scripts (gall.csh) to count the numbers of generated
contigs
This
process is for counting contigs that are assembled by phrap and designates
directory names to include assembly results.
¡°Usage: gall.csh directory¡±
%
./gall.csh edit_dir
7.
Script (rmall.csh) to delete dataset generated by
FASIM
8.
Scripts ( mvall.csh) to
change .qual file names generated by Phred to .screen.qual.
============================================================
< c shell
scripts adopted in this study >
1. Scripts to generate 11 different conditions for dataset
#!/bin/csh
if($#argv < 3 ) then
echo
"Usage : $0 chromat_dirs seq_dirs stat_dir"
exit(1)
endif
set ids=(01 03 05 08 10 15 20 25 30 35 40)
set prog="sim"
set option1="-L 2000 -stat 100 10000 -seed -1 -h PMP -vn pUC_GEM.seq -vf 30 -circle"
set option2="-D 0.0 -X 0.0 -rd 0 -rate 0.0"
foreach tmp
($ids )
@ idd = $tmp
@ idd = $idd * 500
set
option3="-N $idd -sd
$2$
set
command="$prog $option1 $option2 $option3"
$command > $3_dir/stat_$
end
The
script above is to generate data of case 1. Option2 was differentiated each
other for different cases.
case 2 : option2="-D 0.0
-X 0.0 -rd 100 -rate 0.0"
case 3 : option2="-D 0.0
-X 0.0 -rd 200 -rate 0.0"
case 4 : option2="-D 0.0
-X 0.0 -rd 0 -rate 0.0010"
case 5 : option2="-D 0.0
-X 0.0 -rd 0 -rate 0.0010"
case 6 : option2="-D 0.0
-X 0.0 -rd 0 -rate 0.0025"
case 7 : option2="-D 0.1
-X 0.0 -rd 0 -rate 0.0"
case 8 : option2="-D 0.2
-X 0.0 -rd 0 -rate 0.0"
case 9 : option2="-D 0.0
-X 0.01 -rd 0 -rate 0.0"
case10: option2="-D
0.0 -X 0.05 -rd 0 -rate 0.0"
case11: option2="-D
0.0 -X 0.10 -rd 0 -rate 0.0"
case12: option2="-D
0.0 -X 0.0 -rd 0 -rate 0.0020 -ir 0.01 -dr 0.01"
case13: option2="-D
0.0 -X 0.0 -rd 0 -rate 0.0020 -ir 0.05 -dr 0.05"
case14: option2="-D
0.0 -X 0.0 -rd 0 -rate 0.0020 -ir 0.10 -dr 0.10"
case15: option2="-D
0.2 -X 0.01 -rd 200 -rate 0.002"
2. Scripts to run phred
#!/bin/csh
if($#argv<1) then
echo
"Usage: $0 chromat_directories"
exit(1);
endif
set ids=(01 03 05 08 10 15 20 25 30 35 40)
set prog="phred"
foreach tmp
($ids )
set
options="-id $1$
$prog
$options -trim "" &
end
3.Script to run cross_match
!/bin/csh
if($#argv < 1 ) then
echo
"Usage : $0 vector_file"
exit(0);
endif
set ids=(01 03 05 08 10 15 20 25 30 35 40)
set prog="cross_match"
foreach tmp
($ids )
$prog pm_$tmp $1 -minscore 15 -screen
> pm_$tmp.out &
end
4. Script to run phrap
#!/bin/csh
#if($#argv < 1 ) then
# echo "Usage : $0 vector_file"
# exit(0);
#endif
set ids=(01 03 05 08 10 15 20 25 30 35 40)
set prog="phrap"
foreach tmp
($ids )
$prog -minmatch 30 pm_$tmp.screen > pm_$tmp.screen.phrap.out &
end
5.Script to count the number of contig
#!/bin/csh
if($#argv < 1 ) then
"Usage :
$0 Directory"
exit 0
endif
if( -e $1) then
set
dir=$1
else
"The directory does not
exist..."
exit 0
endif
set ids=(01 03 05 08 10 15 20 25 30 35 40)
set prog="grep"
echo "-------------------------------"
echo "Reads Contigs Singlets Total"
echo "-------------------------------"
@
contigs =
0
@
singlets = 0
@
total
= 0
@
reads
= 0
foreach tmp
($ids )
@ reads ="$
@ contigs=`grep ">" $dir/pm_$tmp.screen.contigs
| wc -l`
@ singlets=`grep ">" $dir/pm_$tmp.screen.singlets
| wc -l`
@ total =
$contigs + $singlets
echo
"$reads $contigs
$singlets $total";
end
echo "-------------------------------"
============================================================