Dive Deep into Cellular Evolution with TranscriptFormer: A Cross-Species Generative Cell Atlas

Ever wondered how cells have diverged over more than a billion years of evolution? The TranscriptFormer model, developed by the Chan Zuckerberg Initiative, offers a way to explore cellular diversity across species. This guide walks through the model family, installation, inference, and the input and output formats you need for single-cell work.

What is TranscriptFormer? Unlocking Cellular Secrets Across Species

TranscriptFormer is a family of generative foundation models trained on up to 112 million cells. The cells come from 12 species whose lineages span 1.53 billion years of evolution. The project aims to create a cross-species generative cell atlas, allowing researchers to analyze and compare cellular processes across diverse organisms.

Key Benefits of Using TranscriptFormer:

Unprecedented Scale: Analyze data across up to 112 million cells and roughly 1.5 billion years of evolution.
Cross-Species Insights: Compare cellular functions across diverse species, from humans to yeast.
Generative Modeling: Predict and simulate cellular behavior using a novel generative architecture.

Meet the TranscriptFormer Family: Three Powerful Models

TranscriptFormer comes in three distinct versions, each tailored to specific research needs and computational resources. Understanding the differences between these models is key to choosing the right tool for your analysis.

1. TF-Metazoa: The Comprehensive Atlas

This model is trained on all 112 million cells, encompassing all twelve species in the dataset. It is the most comprehensive model, ideal for broad comparative studies. Covering vertebrates, invertebrates, fungi, and protists, TF-Metazoa provides a holistic view of cellular evolution.

2. TF-Exemplar: Focus on Model Organisms

TF-Exemplar focuses on human cells and four key model organisms: mouse, zebrafish, fruit fly, and C. elegans. Trained on 110 million cells, it offers a balance between breadth and computational efficiency. If your research centers on these widely studied species, TF-Exemplar is a great choice.

3. TF-Sapiens: Human-Centric Analysis

The TF-Sapiens model is trained solely on 57 million human cells, providing unparalleled depth for human-specific research. With 368 million trainable parameters, it captures intricate details of human cellular processes. If you're focused on human health or disease, TF-Sapiens is the model for you.
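
The choice between the three variants can be sketched as a small rule of thumb. The helper below is hypothetical and not part of the transcriptformer package; it assumes you want the narrowest model whose training species cover your data, and it falls back to TF-Metazoa for anything else because the twelve species are not enumerated here. Species outside the training set may still need pretrained embeddings, so treat the result as a starting point only.

```python
# Species coverage as described in the text above. TF-Metazoa's twelve species
# are not listed in this article, so it is treated as the fallback.
EXEMPLAR_SPECIES = {"human", "mouse", "zebrafish", "fruit fly", "c. elegans"}


def suggest_model(species):
    """Suggest the narrowest TranscriptFormer variant covering the given species.

    `species` is an iterable of common names. This is an illustrative helper,
    not an API provided by the project.
    """
    wanted = {name.strip().lower() for name in species}
    if not wanted:
        raise ValueError("expected at least one species name")
    if wanted == {"human"}:
        return "tf-sapiens"
    if wanted <= EXEMPLAR_SPECIES:
        return "tf-exemplar"
    return "tf-metazoa"


print(suggest_model(["human"]))
print(suggest_model(["Human", "Mouse"]))
print(suggest_model(["zebrafish", "yeast"]))
```

The returned strings match the model names used by the download commands later in this guide.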

Installation: Get Started with TranscriptFormer

Ready to start using TranscriptFormer? The installation process is straightforward, requiring Python >=3.11. Follow these steps to get up and running:

Step-by-Step Installation:

Clone the Repository:

```
git clone https://github.com/czi-ai/transcriptformer.git
cd transcriptformer
```

Create a Virtual Environment:

```
uv venv --python=3.11
source .venv/bin/activate
```

Install in Development Mode (recommended):

```
uv pip install -e .
```

Alternatively, install directly from PyPI:

```
uv pip install transcriptformer
```
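
After installing, a quick sanity check can confirm the environment meets the stated Python requirement. This is a minimal sketch; it assumes the package is importable under the name `transcriptformer` (matching the PyPI name), which you should confirm against the repository.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the installation notes


def python_ok(version_info=None):
    """Return True if the interpreter meets the minimum Python version."""
    version = tuple((version_info or sys.version_info)[:2])
    return version >= MIN_PYTHON


def package_installed(name="transcriptformer"):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(name) is not None


print("Python >= 3.11:", python_ok())
print("transcriptformer importable:", package_installed())
```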

Downloading Model Weights: Accessing the Power of TranscriptFormer

The pre-trained model weights are stored on AWS S3. Use the provided download_artifacts.py script to download the specific models you need.

Downloading Specific Models:

```
python download_artifacts.py tf-sapiens
python download_artifacts.py tf-exemplar
python download_artifacts.py tf-metazoa
```

Downloading All Models and Embeddings:

```
python download_artifacts.py all
python download_artifacts.py all-embeddings
```
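
If you need several models, the download commands above can be wrapped in a short script. The sketch below only uses the target names that appear in those commands; the `download_command` and `download` helpers are hypothetical, and the script assumes it is run from the repository root where `download_artifacts.py` lives.

```python
import subprocess
import sys

# Targets named in the commands above.
KNOWN_TARGETS = ("tf-sapiens", "tf-exemplar", "tf-metazoa", "all", "all-embeddings")


def download_command(target, script="download_artifacts.py"):
    """Build the argument list for one download, rejecting unknown targets."""
    if target not in KNOWN_TARGETS:
        raise ValueError(f"unknown target {target!r}; expected one of {KNOWN_TARGETS}")
    return [sys.executable, script, target]


def download(targets, dry_run=True):
    """Run (or, with dry_run, just print) one download per target."""
    for target in targets:
        cmd = download_command(target)
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if the script exits non-zero


download(["tf-sapiens", "tf-exemplar"])  # dry run: prints the commands only
```

Pass `dry_run=False` once the printed commands look right.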

Running Inference: Unleashing the Power of Generative Single-Cell Analysis

The inference.py script provides a user-friendly interface for running TranscriptFormer. It uses Hydra for configuration, allowing flexible parameter adjustments.

Basic Inference Command:

```
python inference.py --config-name=inference_config.yaml model.checkpoint_path=./checkpoints/tf_sapiens
```

Key Parameters to Consider:

model.checkpoint_path: Path to the model weights.
model.inference_config.data_files: Path to your input data (H5AD format).
model.inference_config.pretrained_embedding: Path to pretrained embeddings for out-of-distribution species.
model.inference_config.batch_size: Adjust batch size based on your GPU VRAM.
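
Because Hydra takes `key=value` overrides on the command line, the parameters above can be assembled programmatically. The sketch below builds the same basic command with an extra batch-size override; `inference_command` is a hypothetical helper, and the batch size of 4 is only an example value for a smaller GPU. Other keys from the list can be passed the same way, but check the repository's config for the exact syntax each one expects.

```python
import subprocess
import sys


def inference_command(overrides, config_name="inference_config.yaml"):
    """Build the inference.py argument list from a dict of Hydra overrides."""
    args = [sys.executable, "inference.py", f"--config-name={config_name}"]
    args += [f"{key}={value}" for key, value in overrides.items()]
    return args


cmd = inference_command({
    "model.checkpoint_path": "./checkpoints/tf_sapiens",
    "model.inference_config.batch_size": 4,  # example value, tune to your GPU
})
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to launch inference
```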

Input Data Format: Preparing Your Data for TranscriptFormer

Ensure your input data is in H5AD format (AnnData objects) with specific requirements:

Gene IDs: The var dataframe must contain an ensembl_id column populated with Ensembl gene identifiers.
Expression Data: Raw count data should be stored in the adata.X matrix.
Cell Metadata: Any cell metadata in the obs dataframe will be preserved in the output.
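
A quick pre-flight check can catch the most common input problems before you start a long inference run. The function below is a sketch, not part of TranscriptFormer: it checks for the `ensembl_id` column and uses a heuristic of my own (non-negative, integer-valued entries) to guess whether `X` holds raw counts. It accepts any AnnData-like object with `.var` and `.X`.

```python
import numpy as np


def check_input(adata):
    """Return a list of problems found in an AnnData-like object (empty if none)."""
    problems = []

    if "ensembl_id" not in adata.var.columns:
        problems.append("var is missing the 'ensembl_id' column")

    # Sparse matrices expose their stored values via tocsr().data;
    # dense arrays are checked directly.
    X = adata.X
    values = X.tocsr().data if hasattr(X, "tocsr") else np.asarray(X)
    # Heuristic (an assumption): raw counts are non-negative whole numbers.
    if values.size and (values.min() < 0 or not np.allclose(values, np.round(values))):
        problems.append("X does not look like raw counts (expected non-negative integers)")

    return problems
```

With a real file you would load it first, for example `adata = anndata.read_h5ad("my_cells.h5ad")` (a hypothetical file name), and then call `check_input(adata)`.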

Output Format: Understanding Your Results

The inference results are saved as an AnnData object (embeddings.h5ad) in the output directory.

Key Output Components:

Cell Embeddings: Stored in obsm['embeddings'].
Original Cell Metadata: Preserved in the obs dataframe.
Log-Likelihood Scores: Found in uns['llh'] (if available).
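
Reading the results back might look like the sketch below. `load_output` and `summarize_output` are hypothetical helpers; the default path assumes `embeddings.h5ad` is in the current directory, so point it at your actual output directory.

```python
def load_output(path="embeddings.h5ad"):
    """Read the inference output; requires the anndata package."""
    import anndata as ad  # imported lazily so summarize_output has no hard dependency

    return ad.read_h5ad(path)


def summarize_output(adata):
    """Collect the documented output components from an AnnData-like object."""
    embeddings = adata.obsm["embeddings"]
    return {
        "n_cells": embeddings.shape[0],
        "embedding_dim": embeddings.shape[1],
        "obs_columns": list(adata.obs.columns),
        "has_llh": "llh" in adata.uns,  # log-likelihoods are only stored if available
    }
```

Calling `summarize_output(load_output())` gives a quick overview before you pass the embeddings to downstream tools such as scanpy.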

Hardware and Software: What You Need to Run TranscriptFormer

To run TranscriptFormer efficiently, especially for inference and embedding extraction, a GPU (A100 40GB recommended) is beneficial. You can also use a GPU with less VRAM (16GB) by reducing the inference batch size to 1-4. The core software dependencies include PyTorch, PyTorch Lightning, anndata, scanpy, and others. Refer to the pyproject.toml file for a complete list of dependencies.

Contributing and Staying Secure

The TranscriptFormer project welcomes contributions and follows the Contributor Covenant code of conduct. Report any security issues responsibly to [email protected].

Citing TranscriptFormer

If you use TranscriptFormer in your research, please cite the following publication:

Pearce, J. D., et al. (2025). A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model. bioRxiv.

By leveraging TranscriptFormer, you can unlock new insights into cellular evolution, disease mechanisms, and gene regulation.