Bioops

Demo of Classification

2015-02-24T12:35:23-05:00

R code demo of

Linear discriminant analysis (LDA)
Quadratic discriminant analysis (QDA)
k-nearest neighbor (KNN)

# read data
library(RCurl)
data <- getURL("https://raw.githubusercontent.com/bioops/mis_scripts/master/statistics/data/admission.txt")
admission<-read.table(text=data,header=T)
admission$CLASS<-as.factor(admission$CLASS)

# scale the data
admission$GMAT<-scale(admission$GMAT)
admission$GPA<-scale(admission$GPA)

# (1) LDA
library(MASS)
lda.fit<-lda(CLASS~GMAT+GPA,data=admission)
lda.pred<-predict(lda.fit)
# plot
# different symbols represent true classifications
# different colors are predicted classifications
plot(admission[,1:2],col=lda.pred$class, pch=as.numeric(admission$CLASS),
     xlab="GPA",ylab="GMAT",main="LDA")

# (2) QDA
qda.fit<-qda(CLASS~GMAT+GPA,data=admission)
qda.pred<-predict(qda.fit)
# plot
# different symbols represent true classifications
# different colors are predicted classifications
plot(admission[,1:2],col=qda.pred$class, pch=as.numeric(admission$CLASS),
     xlab="GPA",ylab="GMAT",main="QDA")

# (3) KNN
library(class)

# tune parameter using cross-validation (CV)
k<-seq(1,15,1) # different k
cv.err<-NULL # cv error
for (ki in k){
  knn.pred.cv<-knn.cv(admission[,1:2],admission[,3],k=ki)
  cv.err<-c(cv.err, mean(knn.pred.cv!=admission[,3]))
}
plot(k,cv.err, main="CV error vs k")

# using the optimal parameter
knn.pred<-knn(admission[,1:2],admission[,1:2], admission[,3],k=k[which.min(cv.err)])

# plot
# different symbols represent true classifications
# different colors are predicted classifications
plot(admission[,1:2],col=knn.pred, pch=as.numeric(admission$CLASS),
     xlab="GPA",ylab="GMAT",main="KNN")

Solving Bridge Regression Using Local Quadratic Approximation (LQA)

2015-02-24T11:27:50-05:00

Bridge regression is a broad class of penalized regression, and can be used in high-dimensional regression problems.

It includes ridge regression (q=2) and the lasso (q=1) as special cases.

More technical details can be found here. The R code below demonstrates:

solving bridge regression using local quadratic approximation (LQA) and the Newton–Raphson algorithm.
simulation of tuning parameters using 50/50/200 observations (training/validation/testing).
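
For reference, the update that the code iterates can be written out as follows. This is a sketch that assumes the objective is the residual sum of squares plus $\lambda \sum_j |\beta_j|^q$; that form is read off the code and is not stated explicitly in the post. The symbols $\beta^{(k)}$ (current iterate) and $D^{(k)}$ (diagonal penalty matrix) are notation introduced here.

```latex
\begin{aligned}
&\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{p} |\beta_j|^{q} \\
&|\beta_j|^{q} \approx |\beta_j^{(k)}|^{q}
   + \frac{q}{2}\,|\beta_j^{(k)}|^{q-2}\left(\beta_j^{2} - (\beta_j^{(k)})^{2}\right) \\
&\beta^{(k+1)} = \left(X^{\top}X + D^{(k)}\right)^{-1} X^{\top} y,
   \qquad D^{(k)} = \mathrm{diag}\!\left(\frac{\lambda q}{2}\,|\beta_j^{(k)}|^{q-2}\right)
\end{aligned}
```

Substituting the quadratic approximation into the objective leaves a ridge-type problem in $\beta$, which is why each step is a single linear solve. Coefficients with $|\beta_j^{(k)}|$ below the threshold are set to zero and dropped, since $|\beta_j^{(k)}|^{q-2}$ blows up near zero when $q<2$.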

# The function to solve bridge regression
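bridge<-function(x, y, lambda, q=1, eta=0.001){
  library(glmnet)
  # the loop below indexes the lambda values as `grid`; alias the argument
  # locally so the function does not silently depend on a global variable
  # (lambda is expected in decreasing order, as in the simulation below)
  grid<-lambda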
  # use ridge coefficients as a starting value
  beta.start<-coef(glmnet(x, y, alpha=0, lambda=lambda, intercept=F))
  beta.mat<-matrix(NA,ncol=ncol(beta.start),nrow=nrow(beta.start)-1)

  # take each lambda value in the grid
  for(i in 1:length(grid)){
    # initial beta without intercept
    beta_prev<-as.vector(beta.start[-1,i])
    # initial converge
    converge<-10^10

    # iterate until convergence or too many iterations
    iteration<-0
    while(converge>eta && iteration<=100){
      iteration<-iteration+1
      # judge whether some beta is small enough
      del<-which(abs(beta_prev)<eta)

      # if all coefficients are small enough then stop the iteration
      if(length(del)==length(beta_prev)){
        beta_prev[del]<-0
        converge<-0

        # else if we need to remove some but not all of the coefficients  
      }else if(length(del)>0){
        # set these beta to 0
        beta_prev[del]<-0

        # update design matrix x
        x.new<-x;x.new<-x.new[,-del]

        # calculate the diagonal matrix involving penalty terms
        # and the next beta
        if(length(beta_prev)-length(del)==1){
          diag<-grid[i]*q*(abs(beta_prev[-del])^(q-2))/2
        }else{
          diag<-diag(grid[i]*q*(abs(beta_prev[-del])^(q-2))/2)
        }

        # next beta
        beta_curr<-beta_prev
        beta_curr[-del]<-solve(t(x.new)%*%x.new+diag)%*%t(x.new)%*%y

        # new converge value
        converge<-sum((beta_curr-beta_prev)^2)
        # next iteration
        beta_prev<-beta_curr

        # if we don't need to remove the coefficients
      }else{
        x.new<-x
        diag<-diag(grid[i]*q*(abs(beta_prev)^(q-2))/2)
        # next beta
        beta_curr<-solve(t(x.new)%*%x.new+diag)%*%t(x.new)%*%y

        # new converge value
        converge<-sum((beta_curr-beta_prev)^2)
        # next iteration
        beta_prev<-beta_curr

      }
    }

    beta.mat[,i]<-beta_prev
  }
  colnames(beta.mat)<-grid
  return(beta.mat)
}

# The function to find the optimal coefficients set
# that minimize mse of validation set
bridge.opt<-function(coef.matrix, newx, newy){
  # calculate the mse
  mse.validate<-NULL
  for(i in 1:ncol(coef.matrix)){
    mse.validate<-c(mse.validate, mean((newx%*%coef.matrix[,i]-newy)^2))
  }
  # find the optimal coefficients set
  coef.opt<-coef.matrix[,which.min(mse.validate)]
  return(coef.opt)
}


##################
# start simulation
# For this simulation, create three data sets 
# consisting of 50/50/200 observations (training/validation/testing).
# Use validation data to select the tuning parameter (lambda).


# q settings
q_seq<-c(0.1,0.5,1,2)
# lambda grid
grid=10^seq(10,-2,length=100)
# training and testing set
ntrain<-50
ntest<-200
# validation set size
nvalidate<-50

# repeat number
repeatnum<-100
# initialize MSE
MSE<-matrix(NA, nrow=repeatnum,ncol=length(q_seq))
colnames(MSE)<-q_seq

library(MASS)
# the beta
beta<-matrix(c(3,1.5,0,0,2,0,0,0),ncol=1)
# covariance matrix of X
cov<-matrix(NA,8,8)
for (i in 1:8){
  for (j in 1:8){
    cov[i,j]<-0.5^(abs(i-j))
  }
}

for (i in 1:repeatnum){
  # generate X
  x.train<-mvrnorm(n=ntrain,rep(0,8),cov)
  x.test <-mvrnorm(n=ntest,rep(0,8),cov)
  x.validate <-mvrnorm(n=nvalidate,rep(0,8),cov)
  # generate error term
  err.train<-matrix(rnorm(n=ntrain,mean=0,sd=3),ncol=1)
  err.test<-matrix(rnorm(n=ntest,mean=0,sd=3),ncol=1)
  err.validate<-matrix(rnorm(n=nvalidate,mean=0,sd=3),ncol=1)
  # calculate Y
  y.train<-x.train%*%beta+err.train
  y.test<-x.test%*%beta+err.test
  y.validate<-x.validate%*%beta+err.validate
  # centralize Y and standardize X
  y.train<-scale(y.train,scale=F);y.test<-scale(y.test,scale=F)
  y.validate<-scale(y.validate,scale=F)
  x.train<-scale(x.train);x.test<-scale(x.test)
  x.validate<-scale(x.validate)

  # run for each q
  for(j in 1:length(q_seq)){

    # coefficients for training set
    bridge.train<-bridge(x.train, y.train, grid, q=q_seq[j], eta=0.001)
    # find the optimal coefficients set that minimize mse of validation set
    coef.opt<-bridge.opt(bridge.train, x.validate, y.validate)
    # MSE on the test set using the optimal model
    MSE[i,j]<-mean((x.test%*%coef.opt-y.test)^2)
  }
}

# mean of MSEs under different models
apply(MSE,2,mean)
# sd of MSEs under different models
apply(MSE,2,sd)
# boxplot of MSEs under different models
boxplot(MSE, ylab="MSE", xlab="q")

The Consistent Estimator of Bernoulli Distribution

2015-01-03T20:53:14-05:00

This is a short post illustrating a basic concept in statistics: consistency.

For a binomial random variable $ Y \sim B(n,p) $ (the number of successes in $ n $ Bernoulli trials), $ \hat{p}=Y/n $ is a consistent estimator of $ p $, because:

$\lim_{n \to \infty} P\left(p-\epsilon<\frac{Y}{n}<p+\epsilon\right)=1,$

for any positive number $ \epsilon $.

Here is a simulation to show that the estimator is consistent.

# set parameters
n<-1000;p<-0.5;
# n Bernoulli trials
obs<-rbinom(n,1,p)
# estimate p on different number of trials.
phat<-cumsum(obs)/cumsum(rep(1,n))
# the convergence plot
plot(phat, type="l", xlab="Trials")
abline(h=p)

# then, 100 repetitions

# set parameters
n<-1000;p<-0.5;B<-100;
# n*B Bernoulli trials
obs<-rbinom(n*B,1,p)
# convert n*B observations to an n x B matrix (one column per repetition)
obs_mat<-matrix(obs, nrow=n, ncol=B)
# a function to estimate p on different number of trials
est_p<-function(x,n) cumsum(x)/cumsum(rep(1,n))
# estimate p on different number of trials for each repetition
phat_mat<-apply(obs_mat,2, est_p, n=n)
# the convergence plot with 100 repetitions
matplot(phat_mat,type="l",lty=1,xlab="Trials",ylab="phat")
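
The plots show individual paths settling around $ p $. The probability in the limit statement can also be estimated directly. Below is a minimal sketch, assuming $ \epsilon = 0.05 $, 2000 repetitions per sample size and a fixed seed; all three are arbitrary choices for illustration, not from the post.

```r
set.seed(1)                         # assumed seed, only for reproducibility
p <- 0.5; eps <- 0.05; B <- 2000    # eps and B are illustrative choices
n_seq <- c(10, 100, 1000, 10000)

# for each n, draw B binomial counts Y ~ B(n, p) and record how often
# the estimate Y/n falls inside (p - eps, p + eps)
coverage <- sapply(n_seq, function(n) {
  phat <- rbinom(B, size = n, prob = p) / n
  mean(abs(phat - p) < eps)
})
names(coverage) <- n_seq
coverage
```

The estimated probability should increase toward 1 as n grows, which is what consistency asserts.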

Permutation Test for Principal Component Analysis

2015-01-02T00:35:51-05:00

The procedure of the permutation test for PCA is as follows:

For each replicate,

1. Individually permute each column of the data matrix.
2. Conduct the PCA and find the proportion of variance explained by each of the components 1 to s. Store this information.
3. Repeat 1 and 2 R times.

At the end of this we will have a matrix with R rows and s columns that contains the proportion of variance explained by each component for each replicate.

Finally, compare the observed values from the original data to the set of values from the permutations in order to determine the approximate p-value.

The R code:

pca_perm.R

# the function to assess the significance of the principal components.
sign.pc<-function(x,R=1000,s=10, cor=T,...){
  # run PCA
  pc.out<-princomp(x,cor=cor,...)
  # the proportion of variance of each PC
  pve=(pc.out$sdev^2/sum(pc.out$sdev^2))[1:s]

  # a matrix with R rows and s columns that contains
  # the proportion of variance explained by each pc
  # for each randomization replicate.
  pve.perm<-matrix(NA,ncol=s,nrow=R)
  for(i in 1:R){
    # permute each column independently
    x.perm<-apply(x,2,sample)
    # run PCA
    pc.perm.out<-princomp(x.perm,cor=cor,...)
    # the proportion of variance of each PC.perm
    pve.perm[i,]=(pc.perm.out$sdev^2/sum(pc.perm.out$sdev^2))[1:s]
  }
  # calculate the p-values
  pval<-apply(t(pve.perm)>pve,1,sum)/R
  return(list(pve=pve,pval=pval))
}


# apply the function
library(RCurl)
data <- getURL("https://raw.githubusercontent.com/bioops/mis_scripts/master/statistics/data/pca.txt")
OCRdata <- read.table(text = data, header=T,sep="\t")
OCRdat<-OCRdata[,-1] #leave out location id column
sign.pc(OCRdat,cor=T)

The result:

$pve
    Comp.1     Comp.2     Comp.3     Comp.4     Comp.5     Comp.6     Comp.7     Comp.8 
0.23129378 0.14864525 0.11552865 0.06741744 0.06274641 0.05858431 0.05033795 0.04484122 
    Comp.9    Comp.10 
0.03873311 0.03431297 

$pval
 [1] 0.000 0.000 0.000 1.000 1.000 0.996 1.000 1.000 1.000 1.000
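
To see how the test behaves when the structure is known, the same procedure can be run on simulated data. Below is a minimal self-contained sketch, assuming three columns driven by one shared factor plus three pure-noise columns. The data, seed, number of permutations and the helper name `pve_of` are all made up for illustration.

```r
set.seed(1)                               # assumed seed for reproducibility
n <- 100
z <- rnorm(n)                             # one shared latent factor
x <- cbind(z + rnorm(n, sd = 0.5),        # three columns driven by the factor
           z + rnorm(n, sd = 0.5),
           z + rnorm(n, sd = 0.5),
           matrix(rnorm(n * 3), n, 3))    # three pure-noise columns

# proportion of variance explained by each component (correlation-based PCA)
pve_of <- function(m) {
  v <- princomp(m, cor = TRUE)$sdev^2
  v / sum(v)
}

pve_obs <- pve_of(x)
R <- 200                                  # number of permutation replicates
# each replicate permutes every column independently, then reruns the PCA
pve_perm <- t(replicate(R, pve_of(apply(x, 2, sample))))
# p-value: share of replicates whose PVE exceeds the observed PVE
pval <- colMeans(sweep(pve_perm, 2, pve_obs, ">"))
round(rbind(pve = pve_obs, pval = pval), 3)
```

With this setup only the first component carries real structure, so it should be the one with a small p-value.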

Demo of Support Vector Machine

2015-01-01T15:43:07-05:00

Demo of SVM

svm.R

# load library
library(e1071)

# simulate x and y
x1<-rnorm(100);x2<-rnorm(100)
y<-as.factor(ifelse(x1^2+x2^2<=1.6,1,-1))
dat3<-data.frame(x1,x2,y)

# (a) tuning parameters
cost<-c(0.001, 0.01, 0.1, 1, 5, 10, 100)
gamma<-seq(0.1,1,0.1)
tune.out<-tune(svm, y~., data=dat3, kernel="radial",
              ranges=list(cost=cost,gamma=gamma))
bestmod<-tune.out$best.model
# the best model
summary(bestmod)

# (b) test set
# simulate test set
x1<-rnorm(100);x2<-rnorm(100)
y<-as.factor(ifelse(x1^2+x2^2<=1.6,1,-1))
dat3.test<-data.frame(x1,x2,y)
ypred<-predict(bestmod,dat3.test)
# the confusion matrix
table(predict=ypred, truth=dat3.test$y)

Linear Regression With Cross Validation

2015-01-01T15:33:24-05:00

Cross validation for linear model and the bootstrap confidence interval for coefficients

linear_CV.R

# (a) Linear model

# read data
library(RCurl)
data <- getURL("https://raw.githubusercontent.com/bioops/mis_scripts/master/statistics/data/prostateData.txt")
prostate <- read.table(text = data, header=T,sep="\t")
# remove the first column
prostate<-prostate[,-1]
# run the linear model
prostate_lm<-lm(lpsa~.,data=prostate)

# summary of the fit
summary(prostate_lm)


# (b) Cross-validation

# need boot package
library(boot)
# fit the linear model
glm.fit<-glm(lpsa~lcavol+lweight+age+lbph+svi+lcp+gleason+pgg45,data=prostate)

# LOOCV
cv.glm(prostate, glm.fit)$delta[1]

# 10-fold CV
cv.err.10<-rep(0,10)
for(i in 1:10){
  cv.err.10[i]<-cv.glm(prostate, glm.fit, K=10)$delta[1]
}

# mean error of 10-fold CV
mean(cv.err.10)


# (c) Bootstrap

# get the residuals and fitted values from the linear model
prostate_new<-data.frame(prostate, res=resid(prostate_lm), fitted=fitted(prostate_lm))
# a function to get the coefficients from each bootstrap
prostate.fun<-function(data, i){
  d<-data
  d$lpsa<-d$fitted+d$res[i]
  coef(update(prostate_lm, data=d))
}
# bootstrap
prostate.lm.boot<-boot(prostate_new, prostate.fun, R=1000)

# 95% confidence intervals for lcavol
boot.ci(prostate.lm.boot, index=2)
# 95% confidence intervals for lweight
boot.ci(prostate.lm.boot, index=3)

Estimate Gamma Distribution Parameters Using MME and MLE

2015-01-01T07:13:59-05:00

This post shows how to estimate gamma distribution parameters using (a) the method of moments estimator (MME) and (b) the maximum likelihood estimator (MLE).

The probability density function of the gamma distribution with shape $ \alpha $ and scale $ \beta $ is

$f(x;\alpha,\beta)=\frac{1}{\Gamma (\alpha) \beta ^{\alpha}} x^{\alpha - 1} e^{- \frac{x}{\beta}}, \quad x>0$

The MME:

$\hat{\alpha}=\frac{n\bar{X} ^2}{\sum_{i=1}^{n} (X_i-\bar{X})^2}$

$\hat{\beta}=\frac{\sum_{i=1}^{n} (X_i-\bar{X})^2}{n \bar{X}}$

We can calculate the MLE of $ \alpha $ using the Newton-Raphson method.

For $ k = 1, 2, \ldots $,

$\hat{\alpha} ^{(k)}=\hat{\alpha} ^{(k-1)} - \frac{\ell'(\hat{\alpha} ^{(k-1)})}{\ell'' (\hat{\alpha} ^{(k-1)})}$

where

$\ell' (\alpha) = n \log \left(\frac{\alpha}{\bar{X}}\right)-n \frac{\Gamma '(\alpha)}{\Gamma(\alpha)}+\sum_{i=1}^{n} \log X_i$

$\ell'' (\alpha) = \frac{n}{\alpha} - n \left(\frac{\Gamma '(\alpha)}{\Gamma (\alpha)}\right)'$

Use the MME for the initial value of $ \alpha^{(0)} $, and stop the approximation when $ \vert \hat{\alpha}^{(k)}-\hat{\alpha}^{(k-1)} \vert < 0.0000001 $. The MLE of $ \beta $ can be found by $ \hat{\beta} = \bar{X} / \hat{\alpha} $.

Below is the R code.

gamma.R

# (a) MME
gamma_MME<-function(x){
  n<-length(x)
  mean_x<-mean(x)
  alpha<-n*(mean_x^2)/sum((x-mean_x)^2)
  beta<-sum((x-mean_x)^2)/n/mean_x
  estimate_MME<-data.frame(alpha,beta)
  return(estimate_MME)
}


# (b) MLE
gamma_MLE<-function(x){
  n<-length(x)
  mean_x<-mean(x)

  # initiate the convergence and alpha value
  converg<-1000
  alpha_prev<-gamma_MME(x)$alpha

  # initiate two vectors to store alpha and beta in each step
  alpha_est<-alpha_prev
  beta_est<-mean_x/alpha_prev

  # Newton-Raphson
  while(converg>0.0000001){
    #first derivative of alpha_k-1
    der1<-n*log(alpha_prev/mean_x)-n*digamma(alpha_prev)+sum(log(x))
    #second derivative of alpha_k-1
    der2<-n/alpha_prev-n*trigamma(alpha_prev)
    #calculate next alpha
    alpha_next<-alpha_prev-der1/der2
    # get the convergence value
    converg<-abs(alpha_next-alpha_prev)
    # store estimators in each step
    alpha_est<-c(alpha_est, alpha_next)
    beta_est<-c(beta_est, mean_x/alpha_next)
    # go to next alpha
    alpha_prev<-alpha_next
  }

  alpha<-alpha_next
  beta<-mean_x/alpha_next
  estimate_MLE<-data.frame(alpha,beta)

  return(estimate_MLE)
}

# apply
x<-rgamma(100,2,scale=5)
gamma_MME(x)
gamma_MLE(x)
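
As a cross-check on the Newton-Raphson iteration, the MLE of $ \alpha $ can also be found by passing the score function $ \ell'(\alpha) $ to a generic root finder. Below is a minimal sketch using base R's `uniroot`. The seed, sample size and search interval are arbitrary choices for illustration, not from the post.

```r
set.seed(1)                                # assumed seed for reproducibility
x <- rgamma(200, shape = 2, scale = 5)     # simulated sample (illustrative size)
n <- length(x)
mean_x <- mean(x)

# score function l'(alpha), with beta profiled out as mean_x / alpha
score <- function(a) n * log(a / mean_x) - n * digamma(a) + sum(log(x))

# l'(alpha) is positive near 0 and negative for large alpha,
# so a wide interval brackets the root
alpha_hat <- uniroot(score, interval = c(1e-3, 1e3), tol = 1e-10)$root
beta_hat <- mean_x / alpha_hat
c(alpha = alpha_hat, beta = beta_hat)
```

Running `gamma_MLE(x)` on the same sample should give estimates that agree with these to within the stopping tolerance.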

2015

2014-12-31T21:31:05-05:00

Happy new year!

Hopefully, I will write and code more often in 2015.

Stay tuned!

CFDA Approved BGI's Next Generation Sequencing Diagnostic Products

2014-07-04T10:32:11-04:00

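Just collecting some related material here, without comment. http://www.sda.gov.cn/WS01/CL0051/102239.html On June 30, 2014, the China Food and Drug Administration (CFDA), after review, approved the medical device registration of the BGISEQ-1000 gene sequencer, the BGISEQ-100 gene sequencer, the fetal chromosomal aneuploidy (T21, T18, T13) detection kit (combinatorial probe-anchor ligation sequencing) and the fetal chromosomal aneuploidy (T21, T18, T13) detection kit (semiconductor sequencing). These are the first next-generation gene sequencing diagnostic products that the CFDA has approved for registration.

These products sequence cell-free DNA fragments in the peripheral blood plasma of high-risk pregnant women at 12 or more weeks of gestation, and are used for non-invasive prenatal testing and auxiliary diagnosis of the fetal chromosomal aneuploidy disorders trisomy 21, trisomy 18 and trisomy 13.

http://www.knowgene.com/question/677 The BGISEQ-1000 gene sequencer is based on the Complete Genomics platform, and its companion kit is the fetal chromosomal aneuploidy (T21, T18, T13) detection kit (combinatorial probe-anchor ligation sequencing). The CG platform offers high throughput but a longer turnaround, so the BGISEQ-1000 will probably be used mainly for samples collected nationwide, with sequencing and analysis done centrally.

The BGISEQ-100 gene sequencer is based on the Ion Torrent platform, and its companion kit is the fetal chromosomal aneuploidy (T21, T18, T13) detection kit (semiconductor sequencing). The Ion Torrent platform has a short sequencing turnaround and can be deployed flexibly, so the BGISEQ-100 will very likely be deployed in large and mid-sized hospitals with sufficient volume, for on-site sampling, sequencing, analysis and reporting.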

Online 'Statistical Learning' Course Starts Jan 21 2014

2014-01-20T10:30:00-05:00

The free online course Statistical Learning will start tomorrow. It will be taught by Rob Tibshirani and Trevor Hastie. It is an excellent opportunity to learn directly from the two famous professors in the field and the authors of two great textbooks on statistical learning, An Introduction to Statistical Learning, with Applications in R and The Elements of Statistical Learning.

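Gossip

2013-09-17T12:48:00-04:00

Daniela Witten became an assistant professor at 26 and was named by Forbes as one of the 30 under 30 in science. Both of her parents are professors at Princeton, her sister is an assistant professor at Princeton, her father won the Fields Medal, and her husband is an early Facebook employee.

Conclusion: research is a game for people who are both smart and wealthy.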

Lasso Related Links

2013-08-17T20:18:00-04:00

The Lasso Page is maintained by the inventor of the lasso and provides the most important references for it.

The book Elements of Statistical Learning (pdf) describes the lasso in detail.

Lasso in R: lars: Least Angle Regression, Lasso and Forward Stagewise, and glmnet: Lasso and elastic-net regularized generalized linear models (Note: lars() function from the lars package is probably much slower than glmnet() from glmnet.)

The adaptive lasso paper

Adaptive lasso in R
adaptive.lasso function in lqa package (Penalized Likelihood Inference for GLMs)
adalasso function in parcor package (Regularized estimation of partial correlation matrices)

The graphical lasso paper

Graphical lasso in R (glasso: Graphical lasso- estimation of Gaussian graphical models)

The joint graphical lasso paper

Joint graphical lasso in R (JGL: Performs the Joint Graphical Lasso for sparse inverse covariance estimation on multiple classes)

The Bayesian adaptive lasso paper

R Function of Monte Carlo Simulation to Get the P-value From the Joint Cumulative Distribution of an N-dimensional Order Statistic

2013-08-08T21:21:00-04:00

I want to compute the P-value from the joint cumulative distribution of an n-dimensional order statistic.

$P(r_1, r_2, \ldots, r_n)=n! \int\limits_0^{r_1} \int\limits_{s_1}^{r_2} \cdots \int\limits_{s_{n-1}}^{r_n} ds_n \cdots ds_2 \, ds_1$

One efficient way is to use the following recursive formula, with $ r_0 = 0 $.

$P(r_1, r_2, \ldots, r_n)=\sum_{i=1}^{n} (r_{n-i+1}-r_{n-i}) P(r_1, r_2, \ldots, r_{n-i}, r_{n-i+2}, \ldots, r_n)$

However, the facts are (or would be):

I am too stupid to write a recursive function.
I didn’t find the efficient formula at first.
In other cases, the efficient formula has not been derived yet, or is too complicated to derive.

In statistics, Monte Carlo simulation is a “quick” way to evaluate complicated formulas. By “quick”, I mean I can see the results without knowing or deriving “ugly” math formulas. Computationally, it is actually a very “slow” method.

Anyway, the R function is here.

P_order_stat.R

# sub function of monte carlo simulation to get the p-value
P_order_stat <- function(ranks) {
  NumRep <- 10000 # number of replicates
  newvec <- sort(ranks) # sort the rank ratios
  pvalue <- 0 # initial pvalue
  for (i in 1:NumRep){

      # generate random uniform distributed data,
      # and then sort the simulated rank ratios
      newx <- sort(runif(length(ranks), min=0, max=1))

      # if every sorted simulated value is no greater than
      # the corresponding sorted input, then success+1
      judge <- sum(newvec >= newx)
      if (judge == length(ranks)) pvalue<-pvalue+1
  }
  pvalue <- pvalue / NumRep  # get the p-value
  return(pvalue)
}
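
The recursive formula can also be coded directly, which gives an exact value to compare against the Monte Carlo estimate. Below is a minimal sketch, assuming $ r_0 = 0 $ and that the value for an empty rank vector is 1. The function name `P_order_stat_rec` is hypothetical. The recursion makes on the order of n! calls, so it is only practical for small n.

```r
# Exact P(r_1, ..., r_n) via the recursion
# P(r_1..r_n) = sum_i (r_{n-i+1} - r_{n-i}) * P(all r except r_{n-i+1}), r_0 = 0
P_order_stat_rec <- function(ranks) {
  r <- sort(ranks)
  n <- length(r)
  if (n == 0) return(1)          # base case: empty vector contributes 1
  r0 <- c(0, r)                  # r0[k + 1] holds r_k, so r0[1] is r_0 = 0
  total <- 0
  for (i in 1:n) {
    drop <- n - i + 1            # index of the rank removed in this term
    total <- total +
      (r0[drop + 1] - r0[drop]) * P_order_stat_rec(r[-drop])
  }
  total
}

P_order_stat_rec(c(0.2, 0.5))
```

For two ranks the recursion reduces to $ 2 r_1 r_2 - r_1^2 $, which is 0.16 for the call above. `P_order_stat(c(0.2, 0.5))` should land close to that value, up to Monte Carlo error.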

Simulate Multivariate Normal Distribution Using R

2013-08-08T20:38:00-04:00

I have been doing some research on co-expression networks. “Co-expression” means that genes have similar expression profiles across different conditions or tissues. In the network, genes are nodes, and a “co-expression” relationship between two genes is represented as an edge. Co-expressed genes may be involved in similar pathways or biological processes.

In a small part of my research, I am testing some algorithms for detecting co-expression relationships. One way to test an algorithm is simulation. In an ideal (simple) case, the expression values of two co-expressed genes can be considered bivariate normally distributed. Generating expression values for such a gene pair, or for a group of genes, given a correlation coefficient amounts to simulating a multivariate normal distribution. The MASS library in R has a function, mvrnorm, to do that, but it requires a covariance matrix.

The function below first generates the covariance matrix so that the mvrnorm function can be used. Because we only know the correlation coefficient, i.e. the co-expression relationship (degree), the mean and variance of each gene’s expression profile are randomly generated in the function. The matrix can then be calculated as follows.

$\mu=\left( \begin{matrix} \mu_x \\ \mu_y \end{matrix} \right), \Sigma=\left( \begin{matrix} \sigma_x^2 & \rho \sigma_x \sigma_y \\ \rho \sigma_x \sigma_y & \sigma_y^2 \end{matrix} \right)$

multi_norm.R

library(MASS) # provides mvrnorm
# function to simulate multivariate normal distribution
# given gene number, sample size and correlation coefficient
multi_norm <- function(gene_num,sample_num,R) {
  # initial covariance matrix
  V <- matrix(data=NA, nrow=gene_num, ncol=gene_num)

  # mean for each gene
  meansmodule <- runif(gene_num, min=-3, max=3)
  # variance for each gene
  varsmodule <- runif(gene_num, min=0, max=5)

  for (i in 1:gene_num) {
  # a two-level nested loop to generate covariance matrix
    for (j in 1:gene_num) {
      if (i == j) {
        # variances on the diagonal
        V[i,j] <- varsmodule[i]
      } else {
        # covariances
        V[i,j] <- R * sqrt(varsmodule[i]) * sqrt(varsmodule[j])
      }
    }
  }

  # simulate multivariate normal distribution
  # given means and covariance matrix
  X <- t(mvrnorm(n = sample_num, meansmodule, V))

  return(X)
}
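
The nested loop builds a covariance matrix with one common correlation R between every pair of genes. Below is a vectorized sketch of the same construction using `outer()`, followed by a check that the simulated genes have roughly the requested correlation. The seed, the fixed parameter values and the zero means are assumptions for illustration.

```r
library(MASS)                    # provides mvrnorm
set.seed(1)                      # assumed seed for reproducibility
gene_num <- 4; sample_num <- 5000; R <- 0.8   # illustrative values

# random variance for each gene, as in multi_norm
sds <- sqrt(runif(gene_num, min = 0, max = 5))

# off-diagonal entries R * sd_i * sd_j, diagonal entries sd_i^2
V <- R * outer(sds, sds)
diag(V) <- sds^2

# genes in rows, samples in columns, matching the output of multi_norm
X <- t(mvrnorm(n = sample_num, mu = rep(0, gene_num), Sigma = V))

# empirical gene-gene correlations; off-diagonal values should be near R
round(cor(t(X)), 2)
```

With a single common correlation, V is only a valid covariance matrix when R is not too negative (greater than -1/(gene_num - 1)), so strongly negative R values will make `mvrnorm` fail.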

Warning

2013-08-07T14:19:00-04:00

A couple of months ago, I moved the website engine from WordPress to Octopress. A lot of errors were found in previous posts due to format incompatibility, especially in some Perl scripts.

Please read carefully before using any code or script, and leave a comment if you find some “terrible” error.

Thank you!

[update-2013-08-08] Now only the Perl how-to posts are affected.

[update-2015-01-01] All format issues are (putatively) resolved.

Final Transfer to Github

2013-05-24T22:14:00-04:00

I had been using 000webhost.com to host my website until several days ago, when I noticed my website had been suspended for “violating 20%+ CPU usage limit for more than 1000 times.”

The 000webhost server is good and stable. Most importantly, it’s free. Now I have to move to another good, free web hosting service. GitHub is a good choice, but GitHub does not support WordPress. I tried to move the website to GitHub before, but I was not comfortable writing blog posts in Markdown.

It’s difficult to find another free service that supports WordPress, and lots of people say static blogging engines are much better than WordPress. Looks like I will stay here for a while.

Common Job Requirements for Senior/mid-senior Bioinformatician and Computational Biologist

2013-04-18T00:00:00-04:00

Degree: PhD (=MS+“n” years experience)
NGS data processing experience
Biology + Statistics knowledge
Programming: Statistics (R/Matlab/SAS), Script language (Python/Perl), OOP (C++/Java), Database (SQL)
Linux/Unix
Written and oral communication skills

(Note: Based on an incomplete and unprofessional survey on US job market in April 2013)

NGS Startups

2013-03-26T00:00:00-04:00

23andme

Wiki:

23andMe is a privately held personal genomics and biotechnology company based in Mountain View, California that provides rapid genetic testing. The company is named for the 23 pairs of chromosomes in a normal human cell. Their personal genome test kit was named “Invention of the Year” by Time magazine in 2008.

Jobs:

Engineering:
HPC Systems Administrator
Senior Software Engineer
Software Engineer
Storage Systems Architect/Engineer

Science:
Backend Software Engineer
Health Content Scientist
Research Assistant
Scientist
Statistical Geneticist
Statistical Geneticist focusing on Parkinson’s Disease
Survey Methodologist
User Interface Designer

Bina

About:

Bina is the big data science platform accelerating personalized medicine for researchers and clinicians requiring fast, accurate and scalable genomic analysis. The word “Bina” means “knowledge” or “insight”, translated from both Persian and Hebrew. We use cutting-edge big data technologies to dramatically reduce the amount of time and money required to process raw genetic data in order to generate insights for personalized medicine. Bina was started by a team of Stanford and Berkeley researchers and entrepreneurs, with the vision that whole genome sequencing (WGS) is just the beginning of a brighter future. Bina is accelerating personalized medicine, one genome at a time.

Jobs:

Big Data Software Architect
Senior Software Engineer
Senior Computational Biologist
Senior Data Scientist
Senior Applications Support Scientist
Senior Bioinformatics Scientist

TechCrunch: With $6.25M In Tow, Bina Technologies Wants To Bring Big Data Insight To Genomic Sequencing

(to be continued)

Sequencing Cost (2013 Feb)

2013-02-20T00:00:00-05:00

Reasonably priced genomes

Although no reports of big innovations in DNA sequencing are expected at a major conference this week, the current cost and capabilities of the technology now make medical applications worthwhile.

Name	Machine cost	Read length (bases)	Cost per megabase
Illumina MiSeq	US$125,000	500	14–70 cents
Illumina HiSeq	US$690,000	300	4–5 cents
PacBio RS	US$695,000	4,575	$2–17
Ion Torrent PGM	US$49,000	400	60 cents–$5
Ion Torrent Proton	US$224,000	200	1–9 cents

Source: The companies; Travis Glenn

End of the World

2012-11-30T00:00:00-05:00

I really hope the end of the world is real!
......
See the comments.