VCRU Bioinformatics - 454 Instructions

This page was last updated on Wednesday, 12-Sep-2012 09:58:41 CDT

454 Instructions - Index

Computer Requirements
VNC (Remote desktop connection to server)
Roche 454 Software
Madison WI Users' Contact Information for Roche
Getting your 454 sequences from Biotech
Assembly
Secrets!
Mapping
I have saved a few Roche documents locally in documents (password protected, see Doug for password)
or sign up for, or log on to, your My454 page here to get the Roche documentation files yourself
After Assembly

Computer Requirements

Roche software runs in a Linux environment.
RedHat is the supported distribution, and Fedora is reported to work in some cases.

The VCRU has two computers with lots of memory and processors:
the Cucumber server (32 GB of memory and 8 processors) and the Cranberry server (64 GB of memory and 16 processors).
Talk to Doug to get an account set up on one of these computers.

VNC (Remote desktop connection to server)

If you set up a VNC client, you can run the Roche programs (and others) from your own computer.
Instructions to do that are on this page

Roche 454 Software

Skip this if you will use the software on the Cranberry server

The Roche-recommended platform is RedHat Linux.
To install on Fedora 13, there are a few things to do first; see this page for more information.

Basic install steps are:
1. Change to a temporary working directory:
   cd /tmp
2. Download the installer:
   wget --user=xxx --password=xxx http://www.vcru.wisc.edu/simonlab/bioinformatics/up/download/DataAnalysis_2.3.tgz
   where --user and --password provide access to the password protected sections of this web site (see Doug for password)
3. Uncompress the installer:
   tar -zxvf DataAnalysis_2.3.tgz
4. Install:
   cd DataAnalysis_2.3
   sudo ./INSTALL

Madison WI Users' Contact Information for Roche

Roche Technical Support Contact Information

Our local Roche representative:

Dan Brekken, M.S.
Key Account Manager-Sequencing
Cell 1-303-941-3155
dan.brekken@roche.com

Illustration of gsAssembler 2.6 bug

Getting your 454 sequences from Biotech

You will get 454 sequences the same way as you have been getting Sanger sequences, on the UW Biotech Center web server at
https://facilities.biotech.wisc.edu/download
Log in using your own UW NetID and password.
Note that instead of being in the DNA Sequencing folder, they will be in a folder named Advanced Genome Analysis Resource.
You need to get this file to a Linux computer with the Roche software. Here are some ways to do that:

Download using Firefox while working on the Linux computer (easiest)
Copy using the "share" network accessible folder on the Cranberry server
Copy to AFS and then access AFS from the Linux server

Where to put them?
Don't put them in your home directory; there is not enough space there and it is not backed up automatically. I have lots of space available on the data drives, and these are backed up nightly to two different computers.
The Cranberry server has 1.5 TB on /vmdata1 and 2 TB on /vmdata2; make a folder on one of these.
I have already created /vmdata1/454, and within that directory I am dividing sequence data by species, into cranberry, carrot, onion, and cucumber.
(If you are wondering, the vm in vmdata stands for Vaccinium macrocarpon, i.e. cranberry)
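As a rough sketch of setting up a folder for a new run under the species layout described above: the function name new_project_dir is hypothetical, and the folder name pattern (capitalized species plus a YYYYMMDD date, as in Carrot20101030) is an assumption based on the single example on this page.

```shell
#!/bin/sh
# Sketch only: create a dated run folder under <base>/<species>.
# new_project_dir is a hypothetical helper, not part of the Roche software.
# The naming pattern (Species + YYYYMMDD) is assumed from one example path.
new_project_dir() {
    base="$1"                          # e.g. /vmdata1/454
    species="$2"                       # e.g. carrot
    stamp="${3:-$(date +%Y%m%d)}"      # defaults to today's date
    # Capitalize the first letter of the species name.
    first=$(printf '%s' "$species" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s' "$species" | cut -c2-)
    dir="$base/$species/$first$rest$stamp"
    mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Example use (not run here):
#   new_project_dir /vmdata1/454 carrot
```

The function prints the path it created, so the result can be captured in a variable and used as the download destination.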
Uncompress the file

Uncompress a .gz archive with gunzip < yourfilename.gz > yourfilename
(this keeps the original .gz file)
or uncompress a .tgz or .tar.gz archive with tar -zxvf yourfilename.tar.gz
or, maybe easiest, uncompress in File Browser by right clicking on the file and selecting Extract Here
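If you handle several downloads, the two command-line cases above can be wrapped in one small helper that picks the command from the file extension. This is only a sketch; the function name extract_archive is hypothetical.

```shell
#!/bin/sh
# Sketch only: choose the uncompress command from the file extension.
# extract_archive is a hypothetical helper name.
extract_archive() {
    file="$1"
    case "$file" in
        *.tar.gz|*.tgz)
            # tar archives are unpacked into the current directory
            tar -zxvf "$file"
            ;;
        *.gz)
            # writes the uncompressed copy next to the original .gz, which is kept
            gunzip < "$file" > "${file%.gz}"
            ;;
        *)
            echo "Unrecognized archive type: $file" >&2
            return 1
            ;;
    esac
}

# Example use (not run here):
#   extract_archive yourfilename.gz
```

The tar pattern is listed first so that a .tar.gz file is not mistaken for a plain .gz file.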

Safety: I recommend you make the original sequence files read-only so that it is a little bit harder to accidentally delete them. Do that with this command:
chmod -w *.sff
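A slightly more defensive variant is sketched below. It uses chmod a-w, which removes write permission for user, group, and others regardless of the current umask (plain -w leaves alone any bits that are set in the umask). The function name protect_sff is hypothetical.

```shell
#!/bin/sh
# Sketch only: make every .sff file in a directory read-only.
# protect_sff is a hypothetical helper name.
protect_sff() {
    dir="${1:-.}"                  # defaults to the current directory
    for f in "$dir"/*.sff; do
        [ -e "$f" ] || continue    # skip if the pattern matched nothing
        chmod a-w "$f"
    done
}

# Example use (not run here), then check the result with ls -l:
#   protect_sff /path/to/your/sff/folder
```

Read-only files can still be removed by someone with write access to the folder, so this guards against accidental overwrites more than against deliberate deletion.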

Assembly

I have added the Roche software to the "Bioinformatics" menu for all users.
For example, to run gsAssembler:
bioinformatics menu, gsAssembler sub menu

The Roche documentation is in this file (password protected, see Doug for password):
Manual is on this page

Secrets!

Here are a few things I have learned that are not immediately obvious:

Parameters - Input

Heterozygotic mode - we probably want this checked

Parameters - Computation

When assembling, make sure you change the value in the "Number of CPUs to use (0=all):" box from its default to a number greater than 1!
For now I am recommending 12.
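Since the two servers have different processor counts, it can help to check what the machine actually has before typing a number into the box. The sketch below assumes the GNU coreutils nproc command is available (it normally is on current Linux systems); the function name suggest_cpus is hypothetical.

```shell
#!/bin/sh
# Sketch only: cap the requested CPU count at what this machine reports.
# suggest_cpus is a hypothetical helper; nproc comes from GNU coreutils.
suggest_cpus() {
    wanted="${1:-12}"              # 12 is the recommendation on this page
    available=$(nproc)
    if [ "$wanted" -gt "$available" ]; then
        echo "$available"
    else
        echo "$wanted"
    fi
}

# Example use (not run here):
#   suggest_cpus 12
```

On a machine with 8 processors this prints 8, and on a machine with 16 it prints 12.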

Parameters - Output

Set the minimum contig length to 1 in the "All contig threshold:*" box.
Short contigs are still useful because they can be used to link other contigs together, or they may actually be SNPs or indels.
Change the "Ace/Consed" option to "ACE file per contig"; a single file may be too big to be usable if you want to view the ACE file in another program.
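For anyone who prefers the command line to the gsAssembler window, the settings above appear to correspond to options of Roche's runAssembly command. The sketch below only prints a command line and does not run anything. The option names (-cpu, -het, -a, -acedir) are given from memory of the Roche documentation and should be treated as assumptions; check them against the manual for your installed version before use. The function name assembly_command is hypothetical.

```shell
#!/bin/sh
# Sketch only: print (not run) a runAssembly command matching the GUI settings above.
# ASSUMPTION: the option names below are recalled from the Roche manual; verify them.
#   -cpu 12   number of CPUs to use
#   -het      heterozygotic mode
#   -a 1      all contig threshold (minimum contig length)
#   -acedir   one ACE file per contig
# assembly_command is a hypothetical helper name.
assembly_command() {
    outdir="$1"
    shift                          # remaining arguments are the .sff files
    printf '%s\n' "runAssembly -o $outdir -cpu 12 -het -a 1 -acedir $*"
}

# Example use (not run here):
#   assembly_command assembly yourreads.sff
```

Printing the command first gives you a chance to compare it with the manual before launching a long assembly.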

Post-assembly

Singletons can be useful - You may want the unassembled (but still good) reads, especially when starting out.
By default they are not provided, so you will need to extract them yourself.
Use bb.454singletons to extract these.
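If bb.454singletons is not at hand, the names of the singleton reads can usually be pulled from the assembler's 454ReadStatus.txt file. The sketch below is not the bb.454singletons script. It assumes that file is tab-delimited with one header line, the read name in column 1, and the status (for example Singleton) in column 2; check your own file before relying on it. The function name list_singletons is hypothetical.

```shell
#!/bin/sh
# Sketch only: list the names of reads whose status is "Singleton".
# ASSUMPTION: tab-delimited input, one header line, name in column 1,
# status in column 2. list_singletons is a hypothetical helper name.
list_singletons() {
    awk -F'\t' 'NR > 1 && $2 == "Singleton" { print $1 }' "$1"
}

# Example use (not run here):
#   list_singletons 454ReadStatus.txt > singleton_names.txt
```

The resulting list of names can then be used to pull the matching reads out of the original sequence files.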
I have been making a directory called scripts inside the assembly directory. For example, /454/carrot/Carrot20101030/ contains an sff and an assembly directory, which is created by the Roche assembler program, in addition to my scripts directory.
In this scripts directory, I place post-assembly analysis shell scripts.
For organization, I number the scripts in order of their use.
My standard first step is 01.postprocess.sh, which performs these functions:
1. Creates a summary text file = 454Summary.txt
2. Creates FASTQ files of all raw reads
3. Creates FASTA and FASTQ files of unassembled singletons
4. Creates an annotated FASTA file of contigs = 454AllContigsAnnotated.fna
5. Creates a BLAST database of contigs + singletons
6. Creates several graphs of read and assembly statistics
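To give a feel for the kind of step such a script contains, here is a minimal sketch that counts the sequences and total bases in a FASTA file of contigs. It is not taken from 01.postprocess.sh, and the function name fasta_summary is hypothetical.

```shell
#!/bin/sh
# Sketch only: count sequences and total bases in a FASTA file.
# fasta_summary is a hypothetical helper, not part of 01.postprocess.sh.
fasta_summary() {
    awk '
        /^>/ { n++; next }                                   # header line: one more sequence
        { gsub(/[[:space:]]/, ""); total += length($0) }     # sequence line: add its bases
        END { printf "sequences\t%d\nbases\t%d\n", n, total }
    ' "$1"
}

# Example use (not run here):
#   fasta_summary 454AllContigs.fna
```

The output is two tab-separated lines, which makes it easy to append to a summary text file.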

Mapping

Run gsMapper from the "Bioinformatics" menu

The Roche documentation is in this file (password protected, see Doug for password):
Manual is on this page

After Assembly

Look at this page for some other things you can do after assembly