This page was last updated on Wednesday, 12-Sep-2012 09:58:41 CDT
454 Instructions - Index
Roche software runs in a Linux environment.
RedHat is the supported distribution, and Fedora is reported to work in some cases.
The VCRU has two computers with plenty of memory and processors:
the Cucumber server (32GB of memory and 8 processors) and
the Cranberry server (64GB of memory and 16 processors).
Talk to Doug to get an account set up on one of these computers.
If you set up a VNC client, you can run the Roche programs (and others) from your own computer.
Instructions to do that are on this page
Skip this if you will use the software on the Cranberry server.
The Roche-recommended platform is RedHat Linux.
To install on Fedora 13, there are a few things to do first;
see this page for more information.
Basic install steps are:
- Change to a temporary working directory:
  cd /tmp
- Download the installer:
  wget --user=xxx --password=xxx http://www.vcru.wisc.edu/simonlab/bioinformatics/up/download/DataAnalysis_2.3.tgz
  where --user and --password provide access
  to the password protected sections of this web site (see Doug for password)
- Uncompress:
  tar -zxvf DataAnalysis_2.3.tgz
- Install:
  cd DataAnalysis_2.3
  sudo ./INSTALL
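Before unpacking, it can be worth confirming that the download completed, since a truncated archive fails partway through extraction. The following is a minimal sketch, assuming tar with gzip support as found on most Linux systems; the function name check_tgz is hypothetical and is not part of the Roche installer.

```shell
# check_tgz (hypothetical helper): succeed only if the file is a readable
# gzip-compressed tar archive. Nothing is extracted.
check_tgz() {
    # -t lists the archive contents, -z reads through gzip
    tar -tzf "$1" > /dev/null 2>&1
}

# Example use (commented out so that sourcing this file has no side effects):
# check_tgz DataAnalysis_2.3.tgz && tar -zxvf DataAnalysis_2.3.tgz
```

If the check fails, downloading the file again is usually the simplest fix.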
Roche Technical Support Contact Information
Our local Roche representative:
- You will get 454 sequences the same way as you have been getting Sanger sequences, on the
UW Biotech Center web server at
https://facilities.biotech.wisc.edu/download
Just log in using your own UW netID and password.
Note that instead of being in the DNA Sequencing folder, they will be
in a folder named Advanced Genome Analysis Resource
- You need to get this file to a Linux computer with the Roche software. Here are some ways to do that:
- Download using Firefox while working on the Linux computer (easiest)
- Copy using the "share" network accessible folder on the Cranberry server
- Copy to AFS and then access AFS from the Linux server
- Where to put them?
Don't put them in your home directory; there is not enough space there and it is not backed up
automatically. I have lots of space available on the data drives,
and these are backed up nightly to two different computers.
The Cranberry server has 1.5 TB on /vmdata1 and 2 TB on /vmdata2;
make a folder on one of these.
I have already created /vmdata1/454,
and within that directory I am dividing sequence data by species, into
cranberry, carrot, onion, and cucumber.
(If you are wondering, the vm in vmdata stands for Vaccinium macrocarpon, i.e. cranberry)
- Uncompress the file
- Uncompress a .gz archive with gunzip < yourfilename.gz > yourfilename
(this keeps the original .gz file)
- Or uncompress a .tgz or .tar.gz archive with tar -zxvf yourfilename.tar.gz
- Or, maybe easiest, uncompress in File Browser by right-clicking on the file and selecting Extract Here
- Safety: I recommend you make the original sequence files read-only so that
it is a little bit harder to accidentally delete them. Do that with this command:
chmod -w *.sff
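The storage steps above (make a folder on a data drive, uncompress, make the files read-only) can be sketched as a single shell function. This is a minimal sketch, not an established script: the function name stash_sff and its arguments are hypothetical, and it assumes the download is a .tgz or .tar.gz archive containing .sff files. It uses chmod a-w, which removes write permission for user, group, and others regardless of the umask.

```shell
# stash_sff (hypothetical helper): unpack a sequence archive into a
# per-species folder and make the .sff files read-only.
#   $1 = archive (.tgz or .tar.gz)
#   $2 = destination folder, e.g. /vmdata1/454/carrot
stash_sff() {
    archive=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -C extracts into the destination; the original archive is left in place
    tar -zxf "$archive" -C "$dest" || return 1
    # a-w removes write permission for user, group, and others
    find "$dest" -type f -name '*.sff' -exec chmod a-w {} +
}
```

For example, stash_sff yourfilename.tar.gz /vmdata1/454/carrot would unpack into the carrot folder and protect the extracted .sff files.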
I have added the Roche software to the "Bioinformatics" menu for all users.
For example, to run gsAssembler:
The Roche documentation is in this file (password protected, see Doug for password):
Manual is on this page
Here are a few things I have learned that are not immediately obvious:
- Parameters - Input
- Heterozygotic mode - we probably want this checked
- Parameters - Computation
- Make sure you change the default number of CPUs to use in the "Number of CPUs to use (0=all):"
box to a number greater than 1 when assembling!
For now I am recommending 12
- Parameters - Output
- Set the minimum contig length to 1 in the "All contig threshold:*" box.
Short contigs are still useful because they can be used to link other
contigs together, or they may actually be SNPs or indels.
- Change the "Ace/Consed" option to "ACE file per contig"; a single file may be too big to be usable
if you want to view the ace file in another program.
- Post-assembly
- Singletons can be useful - You may want the unassembled (but still good) reads, especially
when starting out.
By default they are not provided, so you will need to extract them yourself.
Use bb.454singletons to extract these.
- I have been making a directory called scripts inside the assembly directory,
for example /454/carrot/Carrot20101030/ contains an sff and an
assembly directory, which is created by the Roche assembler program, in addition to
my scripts directory.
In this scripts directory, I place post-assembly analysis shell scripts.
For organization, I number the scripts in order of their use.
My standard first step is:
01.postprocess.sh - this script performs these functions:
- Creates a summary text file = 454Summary.txt
- Creates FASTQ files of all raw reads
- Creates FASTA and FASTQ files of unassembled singletons
- Creates an annotated FASTA file of contigs = 454AllContigsAnnotated.fna
- Creates a BLAST database of contigs + singlets
- Creates several graphs of read and assembly statistics
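Because the scripts are numbered in order of use, they can be run in sequence with a short loop. This is a minimal sketch, assuming every script in the scripts directory is named with a two-digit prefix such as 01.postprocess.sh and can be run with sh; the function name run_numbered is hypothetical.

```shell
# run_numbered (hypothetical helper): run every NN.*.sh script in a
# directory in numeric order, stopping at the first failure.
#   $1 = scripts directory
run_numbered() {
    # Shell globs expand in sorted order, so 01.* runs before 02.*
    for script in "$1"/[0-9][0-9].*.sh; do
        [ -e "$script" ] || continue   # no matching scripts
        echo "Running $script"
        # Assumption: scripts are plain sh; use bash here if they need it
        sh "$script" || return 1
    done
}
```

Stopping at the first failure keeps later steps from running against incomplete output of an earlier one.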
Run gsMapper from the "Bioinformatics" menu
The Roche documentation is in this file (password protected, see Doug for password):
Manual is on this page
Look at this page for some other things you can do after assembly