
Yesterday I wrote about base R vs. dplyr vs. duckdb for a simple summary analysis. In that post I simulated a 100-million-row dataset, wrote it to disk as CSV, and benchmarked how long it took to read the data in and compute a simple grouped mean. One thing I didn’t do there was separate the time it took to read the data into memory (for base R and dplyr) from the time it took to compute the actual summary.
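
To make that separation concrete, here’s a minimal sketch of timing the read and the summary as two separate steps with `system.time()`. The column names (`grp`, `val`), the row count, and the file name are assumptions for illustration; the original post’s exact schema isn’t reproduced here.

```r
library(dplyr)
library(readr)

# Assumed schema for illustration: one grouping key, one numeric value.
# Scale n up to 1e8 to mirror the original post; 1e6 keeps the demo quick.
n <- 1e6
set.seed(42)
d <- tibble(
  grp = sample(letters, n, replace = TRUE),
  val = rnorm(n)
)
write_csv(d, "data.csv")

# Time the read into memory on its own...
read_time <- system.time(
  d2 <- read_csv("data.csv", show_col_types = FALSE)
)

# ...and then time only the grouped summary.
summarize_time <- system.time(
  res <- d2 |> group_by(grp) |> summarize(mean_val = mean(val))
)

read_time["elapsed"]
summarize_time["elapsed"]
```

Splitting the timings this way shows how much of the end-to-end wall time is CSV parsing versus the grouped mean itself, which is the comparison the combined benchmark in yesterday’s post couldn’t make.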