TranscriptFormer: A Cross-Species Generative Cell Atlas for Single-Cell Analysis

TranscriptFormer Overview

TranscriptFormer is a family of generative foundation models for single-cell transcriptomics. The models are trained on up to 112 million cells across 12 species and are intended for studying cellular diversity and gene regulatory relationships across species. This overview covers the available models, installation, inference, data formats, and hardware requirements.

What is TranscriptFormer? A Generative Foundation Model for Cells

TranscriptFormer is a novel deep learning model designed to understand single-cell transcriptomes. It learns complex relationships between genes and their expression levels, providing a powerful tool for:

Cell type classification: Accurately identify cell types across different species.
Disease state identification: Identify disease states in human cells.
Regulatory relationship prediction: Predict cell-type-specific transcription factors and gene-gene regulatory relationships.

Three Powerful Models: Choose the Right TranscriptFormer for Your Needs

TranscriptFormer provides three pre-trained models, each suited to a different scope of single-cell analysis:

TF-Metazoa: The most comprehensive model, trained on 112 million cells from 12 species, ideal for broad cross-species comparisons.
TF-Exemplar: Focuses on human and four key model organisms, perfect for comparative studies and translational research.
TF-Sapiens: Trained exclusively on 57 million human cells, optimized for in-depth analysis of human biology and disease.

Key Features: Why TranscriptFormer Stands Out

TranscriptFormer's architecture combines the following components:

Generative Autoregressive Joint Model: Simultaneously models genes and their expression levels to capture complex biological relationships.
Transformer-Based Architecture: Employs a powerful transformer network with novel coupling between gene and transcript heads.
Expression-Aware Multi-Head Self-Attention: Improves the model's ability to understand the context of gene expression within individual cells.
Count Likelihood: Accurately captures the variability in transcript-level data.
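
One way to read "generative autoregressive joint model" is through the chain rule of probability. The notation here is an assumption made for illustration and does not come from the project: g_j is the identity of the j-th gene in a cell's sequence, c_j is its transcript count, and n is the number of genes in the sequence. Any joint distribution over such a sequence can be written exactly as:

```latex
p(g_{1:n}, c_{1:n})
  = \prod_{j=1}^{n}
    p\left(g_j \mid g_{<j},\, c_{<j}\right)\,
    p\left(c_j \mid g_{\le j},\, c_{<j}\right)
```

Under this reading, the first factor would plausibly correspond to the gene head and the second to the transcript head with its count likelihood. The actual parameterization is defined by the model code, not by this sketch.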

Installation: Get Started with TranscriptFormer in Minutes

Ready to start using TranscriptFormer? Here's how to install it:

Clone the repository:

git clone https://github.com/czi-ai/transcriptformer.git
cd transcriptformer

Create and activate a virtual environment:

uv venv --python=3.11
source .venv/bin/activate

Install TranscriptFormer:

# Option 1: editable install from the cloned repository (development mode)
uv pip install -e .
# Option 2: install the released package from PyPI instead
uv pip install transcriptformer
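
After installing, a quick sanity check can confirm that the environment looks as expected. The sketch below uses only the standard library. It assumes the importable module is named `transcriptformer` (matching the PyPI name above, but not confirmed here) and treats Python 3.11, the version used when creating the virtual environment, as the minimum. `check_environment` is a hypothetical helper, not part of the package.

```python
import importlib.util
import sys


def check_environment(package="transcriptformer", min_python=(3, 11)):
    """Return a list of human-readable problems; an empty list means all checks passed.

    Both defaults are assumptions: the import name is guessed from the PyPI
    name, and 3.11 is simply the version used in the install steps.
    """
    problems = []
    if sys.version_info < min_python:
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ expected, found {sys.version.split()[0]}")
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(package) is None:
        problems.append(f"module '{package}' is not importable in this environment")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Run it inside the activated virtual environment. No output means both checks passed.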

Running Inference: Analyze Your Single-Cell Data with Ease

Once TranscriptFormer is installed, run inference with the inference.py script. Parameters come from a YAML configuration file, and individual values can be overridden on the command line as key=value pairs.

Example: Running inference on human data with TF-Sapiens:

python inference.py --config-name=inference_config.yaml \
 model.checkpoint_path=./checkpoints/tf_sapiens \
 model.inference_config.data_files.0=test/data/human_val.h5ad \
 model.inference_config.batch_size=8
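
When running inference over many files or settings, it can help to assemble the same command programmatically. The following is a minimal sketch that reuses only the script name, config name, and override keys from the example above. `build_inference_command` is a hypothetical helper and not part of TranscriptFormer.

```python
import shlex


def build_inference_command(overrides, script="inference.py",
                            config_name="inference_config.yaml"):
    """Build the argument list for an inference run.

    `overrides` maps dotted config keys to values; each pair becomes one
    `key=value` argument, mirroring the command-line example.
    """
    command = ["python", script, f"--config-name={config_name}"]
    command += [f"{key}={value}" for key, value in overrides.items()]
    return command


overrides = {
    "model.checkpoint_path": "./checkpoints/tf_sapiens",
    "model.inference_config.data_files.0": "test/data/human_val.h5ad",
    "model.inference_config.batch_size": 8,
}
command = build_inference_command(overrides)

# Print a shell-safe version of the command for inspection.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root. Keeping it as a list avoids shell-quoting problems with unusual file paths.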

Input and Output: Understanding the Data Flow

TranscriptFormer expects and produces the following data formats:

Input: H5AD format (AnnData objects) with raw count data and Ensembl gene identifiers.
Output: H5AD file containing cell embeddings, original cell metadata, and log-likelihood scores.
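
Because the input must contain raw counts and Ensembl gene identifiers, it may be worth checking a dataset before running inference. The sketch below uses only the standard library. Both helpers are hypothetical. The regular expression assumes the vertebrate-style Ensembl stable ID pattern (for example `ENSG...` or `ENSMUSG...`), so identifiers for other species may use different schemes and would need a different check.

```python
import re

# Assumed pattern: "ENS", optional species prefix, "G", 11 digits, optional version.
ENSEMBL_GENE_ID = re.compile(r"ENS[A-Z]*G\d{11}(\.\d+)?")


def non_ensembl_ids(gene_ids):
    """Return the identifiers that do not match the assumed Ensembl gene ID pattern."""
    return [g for g in gene_ids if not ENSEMBL_GENE_ID.fullmatch(str(g))]


def looks_like_raw_counts(values):
    """Return True if every value is a non-negative whole number.

    Normalized or log-transformed data usually contains fractional values,
    so this is a cheap heuristic rather than a proof that counts are raw.
    """
    return all(v >= 0 and float(v).is_integer() for v in values)


print(non_ensembl_ids(["ENSG00000141510", "TP53", "ENSMUSG00000059552"]))
print(looks_like_raw_counts([0, 3, 12.0]))
print(looks_like_raw_counts([0.5, 1.0]))
```

With an AnnData object `adata`, the same helpers could be applied to `adata.var_names` and to the matrix values, for instance `adata.X.data` when `adata.X` is a SciPy sparse matrix. This assumes the gene identifiers are stored as the variable index, which may not hold for every dataset.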

Hardware Requirements: Optimizing Performance

For efficient inference and embedding extraction, a GPU is recommended, ideally an A100 with 40GB of memory. GPUs with less memory (16GB) also work if the inference batch size is reduced to between 1 and 4.
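
The guidance above can be turned into a small starting-point heuristic for `model.inference_config.batch_size`. This is a sketch under stated assumptions. The value 8 for a 40GB card is taken from the inference example, 4 is the upper end of the range suggested for 16GB, and the fallback of 1 for smaller cards is a guess that may still run out of memory. `suggest_batch_size` is a hypothetical helper.

```python
def suggest_batch_size(vram_gb, large_gpu_batch=8):
    """Suggest a starting inference batch size from available GPU memory in GB.

    Thresholds follow the hardware notes: 40GB is the recommended size and
    16GB is the smaller size mentioned; other values are assumptions.
    """
    if vram_gb >= 40:
        return large_gpu_batch  # batch size used in the inference example
    if vram_gb >= 16:
        return 4                # upper end of the suggested 1-4 range
    return 1                    # assumption: most conservative setting


for vram in (80, 40, 24, 16, 8):
    print(f"{vram} GB -> batch size {suggest_batch_size(vram)}")
```

If PyTorch is available, the memory of the first GPU can be read with `torch.cuda.get_device_properties(0).total_memory / 1024**3`. Treat the suggested value as a starting point and lower it if out-of-memory errors occur.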

Contributing: Join the TranscriptFormer Community

James D Pearce, Sara E Simmonds, Gita Mahmoudabadi, Lakshmi Krishnan

TranscriptFormer is an open-source project, and contributions are welcome! Please adhere to the Contributor Covenant code of conduct and report any unacceptable behavior to [email protected]. Together, we can advance the field of single-cell analysis and unlock new discoveries.