八卦

The Lasso Page is maintained by the inventor of lasso and provides most important references for lasso.

The book Elements of Statistical Learning (pdf) describes the lasso in detail.

Lasso in R: lars: Least Angle Regression, Lasso and Forward Stagewise, and glmnet: Lasso and elastic-net regularized generalized linear models (Note: lars() function from the lars package is probably much slower than glmnet() from glmnet.)

adaptive.lasso function in lqa package (Penalized Likelihood Inference for GLMs)
adalasso function in parcor package (Regularized estimation of partial correlation matrices)

The graphical lasso paper

Graphical lasso in R (glasso: Graphical lasso- estimation of Gaussian graphical models)

The joint graphical lasso paper

Joint graphical lasso in R (JGL: Performs the Joint Graphical Lasso for sparse inverse covariance estimation on multiple classes)

R Function of Monte Carlo Simulation to Get the P-value From the Joint Cumulative Distribution of an N-dimensional Order Statistic

I want to compute the P-value from the joint cumulative distribution of an n-dimensional order statistic.

One efficient way is using the following recursive formula.

However, the facts are (or would be):

1. I am too stupid to write a recursive function.
2. I didn’t find the efficient formula at first.
3. In other cases, the efficient formula have not been derived yet, or too complicated to derive.

In Statistics Monte Carlo simulation is a “quick” way to compute some complicated formulas. By saying “quick”, I mean I can see the results without knowing or deriving “ugly” Math formulas. It’s actually a very “slow” method in computing aspect.

Anyway, the R function is here.

Simulate Multivariate Normal Distribution Using R

I have been doing some research about co-expression network. “co-expression” means that genes have similar expression profiles across different conditions or tissues. In the network, genes are nodes, and “co-expression” relationship between two genes can be reprensented as edges. The co-expressed genes may involve in similar pathways or biological process.

In a small part of my research, I am testing some algorithms to detect co-expression relationship. One way to test algorithm is simulation. In an ideal (simple) case, the expression values of two co-expressed genes can be considered as bivariate normal distributed. To generate expression values of such gene pair or a group of genes given a correlation coefficient, is just to simulate multivariate normal distribution. MASS library in R has an function, mvrnorm, to do that, but it requires a covariance matrix.

The function below is to firstly generate the covariance matrix in order to use the mvnorm function. Because we only know the correlation coefficient, i.e. co-expression relationship (degree), the mean and variance of each gene’s expression profile are random generated in the function. Then the matrix can be calulated as follows.

Warning

A couple of months ago, I transfered the website engine from wordpress into Octopress. A lot of errors were found in previous posts due to format incompatibility, especially some perl scripts.

Thank you!

[update-2013-08-08] now only perl howto related posts

[update-2015-01-01] All format issues are (putatively) resolved.

Final Transfer to Github

I had been using 000webhost.com to host my website till several days ago when I noticed my website was suspendend for “violating 20%+ CPU usage limit for more than 1000 times.”

The 000wbehost server is good and stable. Most importantly it’s free. Now, I have to transfter to another free and good web hosting service. Github is a good choice. But github does not support wordpress. I tried to transferr the website to github before, but I am not comfortable to write blogs using Markdown.

It’s difficult to find another free service supporting wordpress, and lots of people said the static blogging engine is much better than wordpress. Looks like I will stay here for a while.

Common Job Requirements for Senior/mid-senior Bioinformatician and Computational Biologist

• Degree: Phd (=MS+“n” years experience)
• NGS data processing experience
• Biology + Statistics knowledge
• Programming: Statistics (R/Matlab/SAS), Script language (Python/Perl), OOP (C++/Java), Database (SQL)
• Linux/Unix
• Written and oral communication skills
(Note: Based on an incomplete and unprofessional survey on US job market in April 2013)

23andme

Wiki:

23andMe is a privately held personal genomics and biotechnology company based in Mountain View, California that provides rapid genetic testing. The company is named for the 23 pairs of chromosomes in a normal human cell. Their personal genome test kit was named “Invention of the Year” by Time magazine in 2008.

Jobs:

Senior Software Engineer
Software Engineer
Storage Systems Architect/Engineer Science: Backend Software Engineer
Health Content Scientist
Research Assistant
Scientist
Statistical Geneticist
Statistical Geneticist focusing on Parkinson’s Disease
Survey Methodologist
User Interface Designer

Bina

Bina is the big data science platform accelerating personalized medicine for researchers and clinicians requiring fast, accurate and scalable genomic analysis. The word “Bina” means “knowledge” or “insight”, translated from both Persian and Hebrew. We use cutting-edge big data technologies to dramatically reduce the amount of time and money required to process raw genetic data in order to generate insights for personalized medicine. Bina was started by a team of Stanford and Berkeley researchers and entrepreneurs, with the vision that whole genome sequencing (WGS) is just the beginning of a brighter future. Bina is accelerating personalized medicine, one genome at a time.

Jobs:

Big Data Software Architect
Senior Software Engineer
Senior Computational Biologist
Senior Data Scientist
Senior Applications Support Scientist
Senior Bioinformatics Scientist

TechCrunch: With $6.25M In Tow, Bina Technologies Wants To Bring Big Data Insight To Genomic Sequencing (to be continued) Sequencing Cost (2013 Feb) | Comments Reasonably priced genomes Although no reports of big innovations in DNA sequencing are expected at a major conference this week, the current cost and capabilities of the technology now make medical applications worthwhile. Name Machine cost Read length (bases) Cost per megabase Illumina MiSeq US$125,000 500 14–70 cents
Illumina HiSeq US$690,000 300 4–5 cents PacBio RS US$695,000 4,575 $2–17 Ion Torrent PGM US$49,000 400 60 cents–$5 Ion Torrent Proton US$224,000 200 1–9 cents
Source: The companies; Travis Glenn