For a few years now, my EvoSTAR colleague, Bill Langdon, has been exploring the degree to which Mycoplasma bacteria have contaminated experimental systems and even "infected" online databases with the contents of their genomes.
The ENCODE project continues to generate massive numbers of data points on how genes are regulated. These data will be incredibly useful for understanding the role of genetic variation, both for altering low-level cellular phenotypes (like gene expression or splicing) and for complex disease phenotypes. While it is all deposited into the UCSC browser, ENCODE data is not always the easiest to access or manipulate.
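If you'd rather script against the ENCODE tracks than click through the browser, UCSC also exposes a public read-only MySQL server. Here's a minimal sketch; the table name wgEncodeRegTfbsClustered is an assumption on my part, so look up the track you actually want in the Table Browser:

```bash
# Query UCSC's public MySQL server (read-only, user "genome", no password)
# for an ENCODE annotation table in hg19. The table name below is an
# assumption -- find the right one for your track in the Table Browser.
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 \
  -e "SELECT chrom, chromStart, chromEnd, name
      FROM wgEncodeRegTfbsClustered
      WHERE chrom = 'chr1' LIMIT 10;"
```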
Sequencing company Complete Genomics recently made available 69 ethnically diverse complete human genome sequences: a Yoruba trio; a Puerto Rican trio; a 17-member, 3-generation pedigree; and a diversity panel representing 9 different populations. Some of the samples partially overlap with HapMap and the 1000 Genomes Project. The data can be downloaded directly from the FTP site.
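For what it's worth, here's a sketch of mirroring a single genome's directory with wget; the host name and directory layout are assumptions, so check the Complete Genomics release documentation for the real paths:

```bash
# Recursively mirror one sample's directory from the public FTP site.
# The host and path are assumptions -- substitute the location given
# in the Complete Genomics data release notes.
wget -r -np -nH --cut-dirs=1 "ftp://ftp2.completegenomics.com/YRI_trio/NA19240/"
```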
PLINK/SEQ is an open source C/C++ library for analyzing large-scale genome sequencing data. The library can be accessed via the pseq command line tool, or through an R interface. The project is developed independently of PLINK, but its syntax will be familiar to PLINK users. PLINK/SEQ boasts an impressive feature set for a project still in the beta testing phase.
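To give a flavor of the pseq interface, here's a sketch of starting a project, loading a VCF, and summarizing it. The command names follow my reading of the PLINK/SEQ documentation, so double-check them against pseq help on your install:

```bash
# Initialize a PLINK/SEQ project, import a VCF, and summarize variants.
# Command names follow the PLINK/SEQ docs as I read them -- verify with
# 'pseq help' since the project is still in beta.
pseq myproj new-project                  # create a project called myproj
pseq myproj load-vcf --vcf data.vcf.gz   # index and import the VCF
pseq myproj v-stats                      # per-variant summary statistics
```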
I recently analyzed some next-generation sequencing data, and I first wanted to compare the allele frequencies in my samples to those in the 1000 Genomes Project. It turns out this is much easier than I thought, as long as you're a little comfortable with the Linux command line. First, you'll need a Linux system and two utilities: tabix and vcftools. I'm virtualizing an Ubuntu Linux system in VirtualBox on my Windows 7 machine.
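As a sketch of the workflow: tabix can slice a region straight out of the remote 1000 Genomes VCFs over FTP (it fetches the index plus only the bytes it needs), and vcftools then computes allele frequencies from that slice. The release path below is an assumption, so browse the FTP site for the current file names first:

```bash
# Slice a chromosome 2 region out of the remote 1000 Genomes VCF.
# -h keeps the VCF header so downstream tools can parse the output.
# The release path is an assumption -- check the FTP site for the
# current layout before copying it.
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39998556 > region.vcf

# Compute per-site allele frequencies; results land in region.frq.
vcftools --vcf region.vcf --freq --out region
```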
I'm working on imputing GWAS data to the 1000 Genomes Project data using MaCH. For the model estimation phase you only need ~200 individuals. Here's a one-line unix command that will pull out 200 samples at random from a binary pedigree .fam file called myfamfile.fam:

for i in `cut -d ' ' -f 1-2 myfamfile.fam | sed s/\ /,/g`; do echo "$RANDOM $i"; done | sort -n | head -n 200 | cut -d ' ' -f 2 | sed s/,/\ /g > samples.txt

Prefixing each family/individual ID pair with $RANDOM and sorting shuffles the samples; head keeps 200 of them, and the final cut and sed drop the random key and restore the space between the two ID columns.
As I mentioned in my recap of the ASHG 1000 genomes tutorial, I'm going to be imputing some of my own data to 1000 genomes, and I'll try to post lessons learned along the way here under the 1000 genomes and imputation tags. I'm starting from a binary pedigree format file (plink's bed/bim/fam format), and the first thing in the 1000 genomes imputation cookbook is to store your data in Merlin format, one file per chromosome.
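Here's a sketch of the splitting step using plink itself; note that --recode writes plink's ped/map pair, so turning the map files into Merlin-style .dat files is a separate small step not shown here:

```bash
# Split a binary pedigree fileset (mydata.bed/.bim/.fam, a placeholder
# name for your own data) into one ped/map pair per autosome.
for chr in $(seq 1 22); do
  plink --bfile mydata --chr ${chr} --recode --out mydata_chr${chr}
done
```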
If you missed the tutorial on the 1000 genomes project data last week at ASHG, you can now watch the tutorials on YouTube and download the slides at http://genome.gov/27542240.
There will be a (free) tutorial on the 1000 genomes project at this year's ASHG meeting on Wednesday, November 3, 7:00 – 9:30pm. You can register online at the link below.