Rogue Scholar

Machine Learning, R, Biological Sciences

Split a Data Frame into Testing and Training Sets in R

Published February 24, 2011

I recently analyzed some data trying to find a model that would explain body fat distribution as predicted by several blood biomarkers. I had more predictors than samples (p>n), and I didn't have a clue which variables, interactions, or quadratic terms made biological sense to put into a model.
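A random train/test split of a data frame can be sketched roughly as follows. This is a minimal illustration, not necessarily the approach used in the post: the helper name `split_df` and its arguments `train_frac` and `seed` are hypothetical, and the built-in `iris` data stands in for the biomarker data.

```r
# Hypothetical helper (name and arguments are illustrative assumptions):
# randomly assign rows of a data frame to a training and a testing set.
split_df <- function(df, train_frac = 0.5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)          # optional, for reproducibility
  n <- nrow(df)
  n_train <- round(train_frac * n)            # number of training rows
  train_idx <- sort(sample(seq_len(n), size = n_train))
  test_idx <- setdiff(seq_len(n), train_idx)  # every row not in training
  list(train = df[train_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

# Usage on the built-in iris data: 80% training, 20% testing
parts <- split_df(iris, train_frac = 0.8, seed = 42)
dim(parts$train)
dim(parts$test)
```

Using `setdiff()` rather than negative indexing keeps the test set correct even when the training set happens to be empty.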

R, Recommended Reading, Search, Statistics, Biological Sciences

Get all your Questions Answered

https://doi.org/10.59350/hh728-shc15

Published February 22, 2011

Author Stephen Turner

When I have a question I usually ask the internet before bugging my neighbor.

R, Biological Sciences

R: Given column name in a Data Frame, Get the Index

https://doi.org/10.59350/je8h1-1fp82

Published February 17, 2011

Author Stephen Turner

I had a mental block today trying to figure out how to get the indices of columns in a data frame given their names. It's a simple task, but a difficult one to search Google for. Thanks to jashapiro, Matt, and Vince for giving me a heads up on the which() function, which returns the indices of the TRUE values in a logical vector. If you're looking at the iris data: data(iris); head(iris)
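As a short sketch of the idea on the built-in `iris` data (the variable names `idx_one`, `idx_many`, and `idx_ordered` are illustrative only), comparing `names(iris)` against a column name yields a logical vector, and `which()` turns that into positions. The `match()` call is an alternative not mentioned in the post, shown here because it preserves the order of the requested names.

```r
data(iris)

# Index of a single column: names(iris) == "Species" is a logical vector,
# and which() returns the position(s) where it is TRUE.
idx_one <- which(names(iris) == "Species")

# Indices of several columns, returned in data-frame order
idx_many <- which(names(iris) %in% c("Species", "Sepal.Width"))

# Alternative (an addition, not from the post): match() returns indices
# in the order the names were requested.
idx_ordered <- match(c("Species", "Sepal.Width"), names(iris))

idx_one
idx_many
idx_ordered
```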

R, Biological Sciences

Summarize Missing Data for all Variables in a Data Frame in R

https://doi.org/10.59350/my7ne-fph65

Published February 16, 2011

Author Stephen Turner

Something like this probably already exists in an R package somewhere out there, but I needed a function to summarize how much missing data I have in each variable of a data frame in R. Pass a data frame to this function and for each variable it'll give you the number of missing values, the total N, and the proportion missing.
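A function along these lines might look like the sketch below. The name `summarize_missing`, the output column names, and the toy data frame are all assumptions made for illustration; the original function may differ.

```r
# Hypothetical helper (name and output columns are illustrative assumptions):
# for each variable, report the number missing, the total N, and the
# proportion missing.
summarize_missing <- function(df) {
  n_miss <- vapply(df, function(x) sum(is.na(x)), integer(1))
  data.frame(
    variable  = names(df),
    n_miss    = unname(n_miss),
    n         = nrow(df),
    prop_miss = unname(n_miss) / nrow(df),
    stringsAsFactors = FALSE
  )
}

# Toy data, made up for this example
toy <- data.frame(a = c(1, NA, 3, NA),
                  b = c("x", "y", NA, "z"),
                  c = 1:4)
res <- summarize_missing(toy)
res
```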

Productivity, Biological Sciences

Results from Reference Management Poll

https://doi.org/10.59350/jq6v6-zzz18

Published February 15, 2011

Author Stephen Turner

A while back I asked you what reference management software you used, and how well you liked it. I received 180 responses, and here's what you said. Out of the choices on the poll, most of you used Mendeley (30%), followed by EndNote (23%) and Zotero (15%). Out of those of you who picked "other," it was mostly Papers or Qiqqa. There were even a few brave souls managing references caveman-style, manually.

Biological Sciences

Shellfish for Parallel PCA on GWAS data (Alternative to Eigenstrat)

https://doi.org/10.59350/s8yjz-w8e68

Published February 11, 2011

Author Stephen Turner

Recently I tried compiling Eigensoft on my Ubuntu 10.10 Linux system running in VirtualBox and had no success. From comments on this blog post, it looks like the newer Ubuntu distros don't have libg2c0 and related libraries (which were part of gcc3), and gcc4 uses gfortran instead.

Biological Sciences

Extracting values from R summary objects

https://doi.org/10.59350/qsn86-5wc25

Published February 10, 2011

Author Stephen Turner

This builds on a previous post from Stephen. I was recently running a series of ANOVA analyses, and I used the aov() function because it had a few options that I preferred. Much like lm(), the function returns an object that you typically pass to summary() to view and interpret the output. It took me a bit of playing to figure out how to extract the information I needed.
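One way to pull values out of a summary of an aov() fit is sketched below, using the built-in `iris` data as a stand-in (the model and variable names are illustrative, not taken from the post). For a single-stratum model, `summary()` returns a list whose first element is a data frame, so individual statistics can be extracted by column name.

```r
# Illustrative one-way ANOVA on built-in data
fit <- aov(Sepal.Length ~ Species, data = iris)
s <- summary(fit)

# For a model without an Error() term, the first list element is the
# ANOVA table, with columns Df, Sum Sq, Mean Sq, F value, and Pr(>F).
tab <- s[[1]]
names(tab)

# Row names in this table are space-padded, so index rows by position.
f_value <- tab[["F value"]][1]
p_value <- tab[["Pr(>F)"]][1]
df_term <- tab[["Df"]][1]

f_value
p_value
```

Calling `str(s)` is a quick way to explore the structure when it is unclear where a value lives.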

Announcements, Biological Sciences

So long Vanderbilt, and thanks for all the fish!

https://doi.org/10.59350/zttvp-p8729

Published January 14, 2011

Author Stephen Turner

After finishing the final revisions on my dissertation I was reminded of this spot-on graphical guide to what a Ph.D. is really all about. Now that I'm finished, I'm leaving Vanderbilt to start a postdoc in genetic epidemiology with Dr. Loic Le Marchand at the University of Hawaii Cancer Center. Posts may be sparse over the next few weeks, but I plan on blogging as usual once I'm set up at my postdoc.

R, Statistics, Biological Sciences

R function for extracting F-test P-value from linear model object

https://doi.org/10.59350/5928j-57454

Published January 10, 2011

Author Stephen Turner

I thought it would be trivial to extract the p-value on the F-test of a linear regression model (testing the null hypothesis R²=0). If I fit the linear model fit <- lm(y ~ x1 + x2), I can't seem to find it in names(fit) or summary(fit). But summary(fit)$fstatistic does give you the F statistic and both degrees of freedom, so I wrote this function to quickly pull out the p-value from this F-test on an lm object, and added it to my R profile.
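A function of this kind could be sketched as follows. The name `lm_f_pvalue` is hypothetical and the body is a reconstruction from the description above, not necessarily the original code; it takes the upper tail of the F distribution at the values stored in `summary(fit)$fstatistic`. The built-in `mtcars` data is used purely as an example.

```r
# Hypothetical helper (name is an illustrative assumption): p-value of the
# overall F-test for an lm object.
lm_f_pvalue <- function(model) {
  if (!inherits(model, "lm")) stop("Not an object of class 'lm'")
  f <- summary(model)$fstatistic   # named vector: value, numdf, dendf
  if (is.null(f)) return(NA_real_) # e.g. intercept-only models have no F-test
  pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
}

# Usage on built-in data
fit <- lm(mpg ~ wt + hp, data = mtcars)
p <- lm_f_pvalue(fit)
p
```

Using `lower.tail = FALSE` rather than `1 - pf(...)` avoids losing precision when the p-value is very small.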

Epistasis in New Places

https://doi.org/10.59350/7vnsb-y9y82

Published December 16, 2010

Author Stephen Turner

Coming from the lineage of Jason Moore, I am obliged to occasionally remind everyone that biological systems are inherently complex, and to some degree, we should therefore expect statistical models involving those systems to be complex as well. With the development of GWAS, many approaches to examine epistasis are weighed down by the computational burden of exhaustively conducting billions of statistical tests.

Announcements, Biological Sciences

Which Reference Management Software do you use? (Reader Poll)

https://doi.org/10.59350/pq5f1-a6a52

Published December 15, 2010

Author Stephen Turner

When I started grad school I started using Reference Manager (RefMan), similar to EndNote, to manage my references and bibliographies. It's a real pain, and I often feel like I'm powering my computer with the endless pumping and clicking of the mouse that it takes to import a reference into my library. Recently I've started using Zotero because of how easy it is to import references, store PDFs, and sync between computers.

Getting Genetics Done
