
Paired Ends

Bioinformatics, computational biology, and data science updates from the field. Occasional posts on programming.
Published
Author Stephen Turner

This week’s recap highlights a new method for gene-level alignment of single-cell trajectories, an R package for integrating gene and protein identifiers across biological sequence databases, characterization of SVs across humans and apes, universal prediction of cellular phenotypes, a method to quantify cell state heritability versus plasticity and infer cell state transition with single cell data, and a new AI-driven, natural language-oriented

Published
Author Stephen Turner

Yesterday I wrote about base R vs. dplyr vs. duckdb for a simple summary analysis. In that post I simulated a dataset of 100 million rows and wrote it to disk as CSV. I then benchmarked how long it took to read the data in and compute a simple grouped mean. One thing I didn’t do there was separate the time it took to read data into memory (for base R and dplyr) from the time it took to compute the actual summary.
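The idea of timing the read step separately from the summary step can be sketched in a few lines. This is only an illustration in standard-library Python, not the R benchmark from the post; the column names `group` and `value`, the helper names, and the tiny in-memory CSV are all assumptions made up for the example.

```python
import csv
import io
import time
from collections import defaultdict


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def read_rows(handle):
    """Read step: load every CSV row into memory as a dict."""
    return list(csv.DictReader(handle))


def group_means(rows):
    """Compute step: mean of the hypothetical 'value' column per 'group'."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row["group"]] += float(row["value"])
        counts[row["group"]] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Tiny stand-in for the large CSV on disk (hypothetical data).
handle = io.StringIO("group,value\na,1.0\nb,10.0\na,3.0\nb,20.0\n")

rows, read_seconds = timed(read_rows, handle)
means, compute_seconds = timed(group_means, rows)

print(f"read: {read_seconds:.6f}s, compute: {compute_seconds:.6f}s")
print(means)
```

Reporting the two durations separately shows how much of the total is spent on I/O and parsing rather than on the grouped mean itself.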

Published
Author Stephen Turner

TL;DR: For a very simple analysis (means by group on 100M rows), duckdb was 125x faster than base R, and 28x faster than readr+dplyr, without having to read data from disk into memory. The duckplyr package wraps DuckDB's analytical query processing techniques in a dplyr-compatible API. Learn more at duckdb.org/docs/api/r and duckplyr.tidyverse.org. I wanted to see for myself what the fuss was about with DuckDB.
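The point about not reading the whole file into memory can be illustrated without DuckDB. The sketch below is standard-library Python, not DuckDB's actual query engine and not the code from the post: it streams rows one at a time and keeps only a running sum and count per group. The column names `group` and `value` and the sample data are assumptions for the example.

```python
import csv
import io
from collections import defaultdict


def streaming_group_means(rows, group_col="group", value_col="value"):
    """Compute per-group means from an iterable of dict rows in one pass.

    Only a running sum and count per group are kept, so memory use
    scales with the number of groups rather than the number of rows.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[group_col]
        sums[key] += float(row[value_col])
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Tiny in-memory stand-in for a large CSV file (hypothetical data).
sample_csv = io.StringIO("group,value\na,1.0\nb,10.0\na,3.0\nb,20.0\n")

# csv.DictReader yields one row at a time, so the file is never fully loaded.
means = streaming_group_means(csv.DictReader(sample_csv))
print(means)
```

With a real file, passing `csv.DictReader(open(path, newline=""))` in place of the in-memory sample would keep the same single-pass behavior.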

Published
Author Stephen Turner

This week’s recap highlights a new multispecies codon optimization method, personalized pangenome references with vg, a commentary on the wild west of spike-in normalization, a new pipeline for comprehensive and scalable polygenic scoring across ancestrally diverse populations, a paper showing deep learning / transformer-based methods don’t outperform simple linear models for predicting gene expression after genetic perturbations, and finally, a

Published
Author Stephen Turner

This week’s recap highlights a Nextflow pipeline for eQTL detection, an end-to-end pipeline for spatial transcriptomics (Visium) data analysis, a method for identification of perturbed cell types in single-cell RNA-seq data, a method for guide assignment in single-cell CRISPR screens, a tool for on-target/off-target analysis of gene editing outcomes, and “digital microbes” for collaborative team science on emerging microbes.

Published
Author Stephen Turner

Last month I published a paper and an R package for summarizing preprints from bioRxiv using a local LLM, and I wrote about it in an earlier post. Llama 3.2 was just released today (announcement). The biggest news is the addition of a multimodal vision model, but I was intrigued by the reasonably good performance of the tiny 3B text model. I used this as an excuse to update the biorecap R package.

Published
Author Stephen Turner

I wrote my first public blog post in 2009. I started Getting Genetics Done to share what I was learning at the end of my PhD/postdoc through my first few years as faculty. The earliest posts ranged from simple ones, such as how to write and run a Perl script, to bigger topics like why it’s usually a bad idea to categorize continuous variables in a linear model.

Published
Author Stephen Turner

I recently stumbled across Phil Ewels’s ~18 minute nf-core/bytesize talk on Excalidraw. For years I’ve been using draw.io for making flowcharts and diagrams for documentation, papers, presentations, and for general brainstorming and communication with my team, clients, and collaborators. Excalidraw (excalidraw.com) looks like an attractive alternative.

Published
Author Stephen Turner

This week’s recap highlights a new nf-core workflow for multi-omics trait association studies, a new tool for linking genotype to phenotype (G2P) by directly sequencing alleles from CRISPR base editing experiments, the SplitsTree app for interactive analysis and visualization using phylogenetic trees and networks, mapping cellular interactions from spatially resolved transcriptomics data, a study of marine microbial diversity and bioprospecting