Bioops

Bioinformatics=(ACGAAG->AK)+(#!/bin/sh)+(P(A|B)=P(B|A)*P(A)/P(B))

Learning Java #1


Hello World!

HelloWorldApp.java
/**
 * The HelloWorldApp class implements an application that
 * simply prints "Hello World!" to standard output.
 */
class HelloWorldApp {
    public static void main(String[] args) {
        System.out.println("Hello World!"); // Display the string.
    }
}

compile and run

javac HelloWorldApp.java
java HelloWorldApp

object, class, package

inheritance

class MountainBike extends Bicycle {
    // new fields and methods defining a mountain bike would go here
}

interface

class ACMEBicycle implements Bicycle {
    // remainder of this class implemented as before
}
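
For the implements example above to compile, a Bicycle interface has to be declared somewhere. A minimal sketch, loosely following the official Java tutorial (the method names here are illustrative, not from the original post):

interface Bicycle {
    // operations any bicycle implementation must provide
    void changeGear(int newValue);
    void speedUp(int increment);
    void applyBrakes(int decrement);
}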

package
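
The post stops at this heading, so here is a minimal sketch of the idea (the package name com.example.bikes is hypothetical): the package statement must be the first statement in a source file, and import makes classes from other packages usable without qualification.

// Greeter.java -- belongs to the com.example.bikes package
package com.example.bikes;

import java.util.Date; // import a class from another package

public class Greeter {
    public static void main(String[] args) {
        System.out.println("Hello from a package at " + new Date());
    }
}

Compile with javac -d . Greeter.java and run with java com.example.bikes.Greeter.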

Introduction to Using PBS


Introduction

PBS is the batch scheduler running on the Sun Opteron cluster, midnight. Unlike interactive jobs, batch jobs are controlled via scripts. These scripts tell the system which resources a job will require and how long they will be needed.

Command Interface to PBS

Frequently used PBS commands:

Command     Purpose
qmap        Displays a grid of current jobs along with a list of queued and running jobs (ARSC only; see qmap -h for additional job and queue information)
qsub        Submits a job to the PBS batch scheduler (see also man qsub)
qdel        Removes a job from the queue, including running, waiting, and held jobs (see also man qdel)
qstat -f    Displays more information about a particular job (see also man qstat)

Queues

A user’s job is submitted to the queue with the “qsub” command. The user must be certain that the specific resources requested (such as number of nodes and walltime hours) are within the ranges offered by the particular queue.

The command “qmap -r” shows the names of the queues available and their maximum walltimes. For a more verbose narrative on queue usage, type the following at the command prompt:

news queues

Submitting Jobs to PBS

The command:

qsub <PBS script>

will submit the given script for processing. The script contains the information PBS needs to allocate resources for your job, directions for handling standard I/O streams, and instructions to run the job. Example scripts are included below.

Running Interactive Jobs

You are encouraged to use the PBS batch system, but may run interactive jobs as well. An interactive command is simply typed at the prompt in a terminal window. Standard error and standard output are printed to the terminal, redirected to a file, or piped to another command using appropriate Unix shell syntax.

You can spawn an interactive job using the following command:

qsub -q debug -l select=1:ncpus=4:node_type=4way -I

Once your job is started, you may then run interactive commands on the compute node(s) PBS assigned to your session.

Monitoring Queues and Requests

The command:

qmap

will show all jobs currently running or queued on the system. For details about your particular jobs, issue the command:

qmap -u <user name>

Canceling Queued and Running Jobs

The command:

qdel <job id>

where <job id> is obtained from the “Job Id” field of the qmap output, will remove the job from the queue and terminate it if it is running.
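
For example, if qmap lists your job under a Job Id of 12345 (a hypothetical id), it can be cancelled with:

qdel 12345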

Example Scripts

Example #1 MPI using Sun Fire x2200 nodes (4way nodes)

#!/bin/bash
#PBS -q standard
#PBS -l select=8:ncpus=4:node_type=4way
#PBS -l walltime=08:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
mpirun -np 32 ./myprog

Here is a line-by-line breakdown of the keywords and their assigned values listed in this MPI script:

#!/bin/bash Specifies the shell to be used when executing the command portion of the script.

#PBS -q standard Specifies which queue the job will be submitted to.

#PBS -l select=8:ncpus=4:node_type=4way Requests 8 “blocks” of 4 processors on x2200 nodes. You can also think of this as requesting 8 nodes, running 4 tasks on each of those nodes, with the nodes being 4way (i.e. x2200) only.

#PBS -l walltime=08:00:00 Requests that the running job be allowed to run for a maximum of 8 hours.

#PBS -j oe Joins the output and error files.

cd $PBS_O_WORKDIR Changes to the initial working directory, i.e. the directory from which the job was submitted.

mpirun -np 32 ./myprog Runs the mpi program with a total of 32 tasks.
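
To use this script, save it to a file (the name mpi_job.pbs below is just an example), submit it with qsub, and watch its status with qmap:

qsub mpi_job.pbs
qmap -u <user name>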

Example #2 OpenMP using Sun Fire x4600 nodes (16way nodes)

#!/bin/bash
#PBS -q standard
#PBS -l select=1:ncpus=16:node_type=16way
#PBS -l walltime=08:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=16
export PSC_OMP_AFFINITY=TRUE
./myprog

Here is a line-by-line breakdown of the example OpenMP script:

#!/bin/bash Specifies the shell to be used when executing the command portion of the script.

#PBS -q standard Specifies which queue the job will be submitted to.

#PBS -l select=1:ncpus=16:node_type=16way Requests 1 “block” of 16 processors on an x4600 node. You can also think of this as requesting 1 node, running 16 threads on that node, with the node being 16way (i.e. x4600) only.

#PBS -l walltime=08:00:00 Requests that the running job be allowed to run for a maximum of 8 hours.

#PBS -j oe Joins the output and error files.

cd $PBS_O_WORKDIR Changes to the initial working directory, i.e. the directory from which the job was submitted.

export OMP_NUM_THREADS=16 Sets the number of OpenMP threads to 16.

export PSC_OMP_AFFINITY=TRUE Sets the threads to have CPU affinity.

./myprog Runs the OpenMP program.
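
As an optional sanity check (not part of the original script), an echo placed just before ./myprog records the thread count and host in the joined output file:

echo "running with $OMP_NUM_THREADS OpenMP threads on $(hostname)"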

Example #3 Data Staging Script

#!/bin/bash
#PBS -q transfer
#PBS -l select=1:ncpus=1
#PBS -l walltime=04:00:00
#PBS -j oe
cd $PBS_O_WORKDIR
cp -r $ARCHIVE_HOME/mydataset/* . || exit 1
qsub mpi_job.pbs

Here is a line-by-line breakdown of the keywords and their assigned values listed in this data staging script:

#!/bin/bash Specifies the shell to be used when executing the command portion of the script.

#PBS -q transfer Specifies to run a job in the transfer queue.

#PBS -l select=1:ncpus=1 Requests 1 node running 1 process. Data transfer jobs must be run serially.

#PBS -l walltime=04:00:00 Requests that the running job be allowed to run for a maximum of 4 hours.

#PBS -j oe Joins the output and error files.

cd $PBS_O_WORKDIR Changes to the initial working directory, i.e. the directory from which the job was submitted.

cp -r $ARCHIVE_HOME/mydataset/* . || exit 1 Copies files from long term storage to the current working directory.

qsub mpi_job.pbs Submits a new job to the batch scheduler once the data transfer is complete.
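
An alternative to calling qsub from inside the staging script is a PBS job dependency, which holds the compute job until the transfer job finishes successfully. A sketch, assuming the staging script is saved as stage_data.pbs (hypothetical name) and your PBS version supports -W depend:

JOBID=$(qsub stage_data.pbs)
qsub -W depend=afterok:$JOBID mpi_job.pbs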

Data Lost


My previous posts were all lost during a server transfer. Sorry!
I am trying to recover as many of them as I can. Thanks!

ABySS: A Parallel Assembler for Short Read Sequence Data


Abstract

Widespread adoption of massively parallel deoxyribonucleic acid (DNA) sequencing instruments has prompted the recent development of de novo short read assembly algorithms. A common shortcoming of the available tools is their inability to efficiently assemble vast amounts of data generated from large-scale sequencing projects, such as the sequencing of individual human genomes to catalog natural genetic variation. To address this limitation, we developed ABySS (Assembly By Short Sequences), a parallelized sequence assembler. As a demonstration of the capability of our software, we assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc. Approximately 2.76 million contigs ≥100 base pairs (bp) in length were created with an N50 size of 1499 bp, representing 68% of the reference human genome. Analysis of these contigs identified polymorphic and novel sequences not present in the human reference assembly, which were validated by alignment to alternate human assemblies and to other primate genomes.

PMID: 19251739

ABySS README

ABySS - assemble short reads into contigs

Compiling ABySS

Compiling ABySS should be as easy as
./configure && make
To install ABySS in a specified directory
./configure --prefix=/opt/ABySS && make && sudo make install
If you wish to build the parallel assembler with MPI support, MPI should be found in /usr/include and /usr/lib or its location specified to configure:
./configure --with-mpi=/usr/lib/openmpi && make
ABySS should be built using Google sparsehash to reduce memory usage, although it will build without. Google sparsehash should be found in /usr/include or its location specified to configure:
./configure CPPFLAGS=-I/usr/local/include
The default maximum k-mer size is 64 and may be decreased to reduce memory usage or increased at compile time:
./configure --enable-maxk=96 && make
To run ABySS, its binaries should be found in your PATH.

Single-end assembly

Assemble short reads in a file named reads.fa into contigs in a file named contigs.fa with the following command:
ABYSS -k25 reads.fa -o contigs.fa
where -k is an appropriate k-mer length. The only method to find the optimal value of k is to run multiple trials and inspect the results. The following shell snippet will run an assembly for every value of k from 20 to 40.
for k in {20..40}; do
    ABYSS -k$k reads.fa -o contigs-k$k.fa
done
The maximum value for k is 64. This limit may be changed at compile time using the --enable-maxk option of configure. It may be decreased to 32 to reduce memory usage, which is particularly useful for large parallel jobs, or increased to 96.
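
To compare the trials, one option is the contiguity-statistics utility distributed with ABySS (assuming abyss-fac is installed alongside the other binaries):

abyss-fac contigs-k*.fa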

Paired-end assembly

To assemble paired short reads in two files named reads1.fa and reads2.fa into contigs in a file named ecoli-contigs.fa, run the command:
abyss-pe k=25 n=10 in='reads1.fa reads2.fa' name=ecoli
where k is the k-mer length as before. n is the minimum number of pairs needed to consider joining two contigs. The optimal value for n must be found by trial. in specifies the input files to read, which may be in FASTA, FASTQ, qseq or export format and compressed with gz, bz2 or xz. The assembled contigs will be stored in ${name}-contigs.fa.

The suffix of the read identifier for a pair of reads must be one of ‘1’ and ‘2’, or ‘A’ and ‘B’, or ‘F’ and ‘R’, or ‘F3’ and ‘R3’, or ‘forward’ and ‘reverse’. The reads may be interleaved in the same file or found in different files; however, interleaved mates will use less memory.
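
For example, a mate pair using the ‘1’ and ‘2’ convention might look like this in FASTA (hypothetical read names; the /1 and /2 suffixes follow the common Illumina naming):

>frag42/1
ACGTACGTTGCA
>frag42/2
TTGCAGCAACGT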

abyss-pe is a driver script implemented as a Makefile and runs a single-end assembly, as described above, and the following commands, which must be found in your PATH:

  • ABYSS - the single-end assembler
  • AdjList - finds overlaps of length k-1 between contigs
  • KAligner** - aligns reads to contigs
  • ParseAligns** - finds pairs of reads in alignments
  • DistanceEst** - estimates distances between contigs
  • Overlap - finds overlaps between blunt contigs
  • SimpleGraph - finds paths between pairs of contigs
  • MergePaths - merges consistent paths
  • Consensus - for a colour-space assembly, converts the colour-space contigs to nucleotide contigs
** These steps can be run in parallel (see below)

Paired-end assembly of multiple fragment libraries

The distribution of fragment sizes of each library is calculated empirically by aligning paired reads to the contigs produced by the single-end assembler, and the distribution is stored in a file with the extension .hist, such as ecoli-4.hist. The N50 of the single-end assembly must be well over the fragment size to obtain an accurate empirical distribution.

Here’s an example scenario of assembling a data set with two different fragment libraries and single-end reads:

Library lib1 has reads in two files, lib1_1.fa and lib1_2.fa. Library lib2 has reads in two files, lib2_1.fa and lib2_2.fa. Single-end reads are stored in two files se1.fa and se2.fa.

The command line to assemble this example data set is…

abyss-pe -j2 k=25 n=10 name=ecoli lib='lib1 lib2' \
    lib1='lib1_1.fa lib1_2.fa' lib2='lib2_1.fa lib2_2.fa' \
    se='se1.fa se2.fa'
The paired-end assembly of lib1 and lib2 may be run in parallel by specifying the -j option of make to abyss-pe, which is implemented as a Makefile script. The -j option should be set to the number of libraries, but setting it higher will not cause any trouble.

The empirical distribution of fragment sizes will be stored in two files named lib1-3.hist and lib2-3.hist. These files may be plotted to check that the empirical distribution agrees with the expected distribution. The assembled contigs will be stored in ${name}-contigs.fa.
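
For example, assuming a .hist file holds two whitespace-separated columns (fragment size and count), a quick look is possible with gnuplot:

gnuplot -p -e "plot 'lib1-3.hist' with boxes"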

Reads without mates should be placed in a file specified by the `se’ (single-end) parameter. Reads without mates in the paired-end files will slow down the paired-end assembler considerably during the ParseAligns stage.

Parallel assembly

The `np’ option of abyss-pe specifies the number of processes to use for the ABYSS-P parallel MPI job. Without any MPI configuration, this will allow you to make use of multiple cores on a single machine. To use multiple machines for assembly, you must create a hostfile for mpirun, which is described in the mpirun man page.
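
For example, reusing the earlier paired-end command with eight local processes:

abyss-pe np=8 k=25 n=10 in='reads1.fa reads2.fa' name=ecoli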

The paired-end assembly runs on a single processor. For very large jobs, a good portion of the paired-end assembly (KAligner, ParseAligns, DistanceEst) may be run in parallel as separate processes, but this is not automated by the driver script abyss-pe.

Open MPI integrates well with SGE (Sun Grid Engine). For example, to submit an array of jobs to assemble every odd value of k between 51 and 63 using 64 processes for each job:

qsub -pe openmpi 64 -t 51-63:2 -N testing abyss-pe in=reads.fa n=10
For more information on using SGE and qsub, please refer to the qsub manual page. Open MPI must have been compiled with support for SGE using the ./configure --with-sge option.

See also

Try `abyss --help' for more information on command line options, or see the manual page in the file `ABYSS.1'. Please refer to the mpirun manual page for information on configuring parallel jobs.

Written by Jared Simpson and Shaun Jackman. Subscribe to the users’ mailing list at http://www.bcgsc.ca/mailman/listinfo/abyss-users Contact the users’ mailing list at <abyss-users@bcgsc.ca> or the authors directly at <abyss@bcgsc.ca>.