Bioops

Bioinformatics=(ACGAAG->AK)+(#!/bin/sh)+(P(A|B)=P(B|A)*P(A)/P(B))

八卦

| Comments

Daniela Witten,26岁当上assistant professor,被福布斯评为科学界30个30岁以下的牛人之一都是普林斯顿的professor,姐姐是普林斯顿的assistant professor,她爸得过菲尔兹奖,老公是facebook的元老。

总结:科研是聪明且有钱人的游戏。

Lasso Related Links

| Comments

The Lasso Page is maintained by the inventor of lasso and provides most important references for lasso.

The book Elements of Statistical Learning (pdf) describes the lasso in detail.

Lasso in R: lars: Least Angle Regression, Lasso and Forward Stagewise, and glmnet: Lasso and elastic-net regularized generalized linear models (Note: lars() function from the lars package is probably much slower than glmnet() from glmnet.)

The adaptive lasso paper

Adaptive lasso in R
adaptive.lasso function in lqa package (Penalized Likelihood Inference for GLMs)
adalasso function in parcor package (Regularized estimation of partial correlation matrices)

The graphical lasso paper

Graphical lasso in R (glasso: Graphical lasso- estimation of Gaussian graphical models)

The joint graphical lasso paper

Joint graphical lasso in R (JGL: Performs the Joint Graphical Lasso for sparse inverse covariance estimation on multiple classes)

The Bayesian adaptive lasso paper

R Function of Monte Carlo Simulation to Get the P-value From the Joint Cumulative Distribution of an N-dimensional Order Statistic

| Comments

I want to compute the P-value from the joint cumulative distribution of an n-dimensional order statistic.

One efficient way is using the following recursive formula.

However, the facts are (or would be):

  1. I am too stupid to write a recursive function.
  2. I didn’t find the efficient formula at first.
  3. In other cases, the efficient formula have not been derived yet, or too complicated to derive.

In Statistics Monte Carlo simulation is a “quick” way to compute some complicated formulas. By saying “quick”, I mean I can see the results without knowing or deriving “ugly” Math formulas. It’s actually a very “slow” method in computing aspect.

Anyway, the R function is here.

P_order_stat.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
# sub function of monte carlo simulation to get the p-value
P_order_stat <- function(ranks) {
  NumRep <- 10000 # number of replicates
  newvec <- sort(ranks) # sort the rank ratios
  pvalue <- 0 # inital pvalue
  for (i in 1:NumRep){

      # generate random uniform distributed data,
      # and then sort the simulated rank ratios
      newx <- sort(runif(length(ranks), min=0, max=1))

      # if all the simulated data is lower than the input,
      # then sucess+1
      judge <- sum(newvec >= newx)
      if (judge == length(ranks)) pvalue<-pvalue+1
  }
  pvalue <- pvalue / NumRep  # get the p-value
  return(pvalue)
}

Simulate Multivariate Normal Distribution Using R

| Comments

I have been doing some research about co-expression network. “co-expression” means that genes have similar expression profiles across different conditions or tissues. In the network, genes are nodes, and “co-expression” relationship between two genes can be reprensented as edges. The co-expressed genes may involve in similar pathways or biological process.

In a small part of my research, I am testing some algorithms to detect co-expression relationship. One way to test algorithm is simulation. In an ideal (simple) case, the expression values of two co-expressed genes can be considered as bivariate normal distributed. To generate expression values of such gene pair or a group of genes given a correlation coefficient, is just to simulate multivariate normal distribution. MASS library in R has an function, mvrnorm, to do that, but it requires a covariance matrix.

The function below is to firstly generate the covariance matrix in order to use the mvnorm function. Because we only know the correlation coefficient, i.e. co-expression relationship (degree), the mean and variance of each gene’s expression profile are random generated in the function. Then the matrix can be calulated as follows.

multi_norm.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
# function to simulate multivariate normal distribution
# given gene number, sample size and correlation coefficient
multi_norm <- function(gene_num,sample_num,R) {
  # initial covariance matrix
  V <- matrix(data=NA, nrow=gene_num, ncol=gene_num)

  # mean for each gene
  meansmodule <- runif(gene_num, min=-3, max=3)
  # variance for each gene
  varsmodule <- runif(gene_num, min=0, max=5)

  for (i in 1:gene_num) {
  # a two-level nested loop to generate covariance matrix
    for (j in 1:gene_num) {
      if (i == j) {
        # covariances on the diagonal
        V[i,j] <- varsmodule[i]
      } else {
        # covariances
        V[i,j] <- R * sqrt(varsmodule[i]) * sqrt(varsmodule[j])
      }
    }
  }

  # simulate multivariate normal distribution
  # given means and covariance matrix
  X <- t(mvrnorm(n = sample_num, meansmodule, V))

  return(X)
}

Warning

| Comments

A couple of months ago, I transfered the website engine from wordpress into Octopress. A lot of errors were found in previous posts due to format incompatibility, especially some perl scripts.

Please read carefully before using any code or script, and leave a comment if you find some “terrible” error.

Thank you!

[update-2013-08-08] now only perl howto related posts

[update-2015-01-01] All format issues are (putatively) resolved.

Final Transfer to Github

| Comments

I had been using 000webhost.com to host my website till several days ago when I noticed my website was suspendend for “violating 20%+ CPU usage limit for more than 1000 times.”

The 000wbehost server is good and stable. Most importantly it’s free. Now, I have to transfter to another free and good web hosting service. Github is a good choice. But github does not support wordpress. I tried to transferr the website to github before, but I am not comfortable to write blogs using Markdown.

It’s difficult to find another free service supporting wordpress, and lots of people said the static blogging engine is much better than wordpress. Looks like I will stay here for a while.

NGS Startups

| Comments

23andme

Wiki:

23andMe is a privately held personal genomics and biotechnology company based in Mountain View, California that provides rapid genetic testing. The company is named for the 23 pairs of chromosomes in a normal human cell. Their personal genome test kit was named “Invention of the Year” by Time magazine in 2008.

Jobs:

Engineering: HPC Systems Administrator
Senior Software Engineer
Software Engineer
Storage Systems Architect/Engineer Science: Backend Software Engineer
Health Content Scientist
Research Assistant
Scientist
Statistical Geneticist
Statistical Geneticist focusing on Parkinson’s Disease
Survey Methodologist
User Interface Designer

Bina

About:

Bina is the big data science platform accelerating personalized medicine for researchers and clinicians requiring fast, accurate and scalable genomic analysis. The word “Bina” means “knowledge” or “insight”, translated from both Persian and Hebrew. We use cutting-edge big data technologies to dramatically reduce the amount of time and money required to process raw genetic data in order to generate insights for personalized medicine. Bina was started by a team of Stanford and Berkeley researchers and entrepreneurs, with the vision that whole genome sequencing (WGS) is just the beginning of a brighter future. Bina is accelerating personalized medicine, one genome at a time.

Jobs:

Big Data Software Architect
Senior Software Engineer
Senior Computational Biologist
Senior Data Scientist
Senior Applications Support Scientist
Senior Bioinformatics Scientist

TechCrunch: With $6.25M In Tow, Bina Technologies Wants To Bring Big Data Insight To Genomic Sequencing

(to be continued)

Sequencing Cost (2013 Feb)

| Comments

Reasonably priced genomes

Although no reports of big innovations in DNA sequencing are expected at a major conference this week, the current cost and capabilities of the technology now make medical applications worthwhile.
Name Machine cost Read length (bases) Cost per megabase
Illumina MiSeq US$125,000 500 14–70 cents
Illumina HiSeq US$690,000 300 4–5 cents
PacBio RS US$695,000 4,575 $2–17
Ion Torrent PGM US$49,000 400 60 cents–$5
Ion Torrent Proton US$224,000 200 1–9 cents
Source: The companies; Travis Glenn

世界末日

| Comments

我特别希望世界末日是真的!
。。。。。。
看评论。