Dive Deep into Cellular Evolution with TranscriptFormer: A Cross-Species Generative Cell Atlas
Ever wondered how cells have evolved across billions of years? The TranscriptFormer model, developed by the Chan Zuckerberg Initiative, offers a groundbreaking way to explore cellular diversity across species. Discover how this innovative tool can revolutionize your research in single-cell biology.
What is TranscriptFormer? Unlocking Cellular Secrets Across Species
TranscriptFormer is a family of generative foundation models trained on a massive dataset of up to 112 million cells. These cells span an incredible 1.53 billion years of evolution and cover 12 different species. This ambitious project aims to create a cross-species generative cell atlas, allowing researchers to analyze and compare cellular processes across diverse organisms.
Key Benefits of Using TranscriptFormer:
- Unprecedented Scale: Analyze data across millions of cells and billions of years of evolution.
- Cross-Species Insights: Compare cellular functions across diverse species, from humans to yeast.
- Generative Modeling: Predict and simulate cellular behavior using a novel generative architecture.
Meet the TranscriptFormer Family: Three Powerful Models
TranscriptFormer comes in three distinct versions, each tailored to specific research needs and computational resources. Understanding the differences between these models is key to choosing the right tool for your analysis.
1. TF-Metazoa: The Comprehensive Atlas
This model is trained on all 112 million cells, encompassing all twelve species in the dataset. It is the most comprehensive model, ideal for broad comparative studies. Covering vertebrates, invertebrates, fungi, and protists, TF-Metazoa provides a holistic view of cellular evolution.
2. TF-Exemplar: Focus on Model Organisms
TF-Exemplar focuses on human cells and four key model organisms: mouse, zebrafish, fruit fly, and C. elegans. Trained on 110 million cells, it offers a balance between breadth and computational efficiency. If your research centers on these widely studied species, TF-Exemplar is a great choice.
3. TF-Sapiens: Human-Centric Analysis
The TF-Sapiens model is trained solely on 57 million human cells, providing unparalleled depth for human-specific research. With 368 million trainable parameters, it captures intricate details of human cellular processes. If you're focused on human health or disease, TF-Sapiens is the model for you.
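The guidance above can be condensed into a small lookup. This is a minimal sketch, not part of the TranscriptFormer package: `ModelInfo`, `MODELS`, and `suggest_model` are hypothetical names, the species labels are informal strings chosen for illustration, and the cell counts are the figures quoted in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelInfo:
    name: str
    training_cells_millions: int
    scope: str


# Species named in this article for TF-Exemplar (human plus four model organisms).
EXEMPLAR_SPECIES = frozenset({"human", "mouse", "zebrafish", "fruit fly", "c. elegans"})

MODELS = {
    "tf-sapiens": ModelInfo("TF-Sapiens", 57, "human only"),
    "tf-exemplar": ModelInfo("TF-Exemplar", 110, "human and four model organisms"),
    "tf-metazoa": ModelInfo("TF-Metazoa", 112, "all twelve species"),
}


def suggest_model(species: str) -> ModelInfo:
    """Suggest the narrowest model whose training data covers `species`."""
    key = species.strip().lower()
    if key == "human":
        return MODELS["tf-sapiens"]
    if key in EXEMPLAR_SPECIES:
        return MODELS["tf-exemplar"]
    # Fallback: the broadest model. The remaining seven species are not listed
    # in this article, so this sketch cannot confirm that `species` is covered.
    return MODELS["tf-metazoa"]
```

For example, `suggest_model("zebrafish").name` evaluates to `"TF-Exemplar"`. Treat the result as a starting point; a broad comparative study may still call for TF-Metazoa even when every species of interest is a model organism.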
Installation: Get Started with TranscriptFormer
Ready to start using TranscriptFormer? The installation process is straightforward, requiring Python >=3.11. Follow these steps to get up and running:
Step-by-Step Installation:
- Clone the Repository:
- Create a Virtual Environment:
- Install in Development Mode (recommended):
Alternatively, install directly from PyPI:
Downloading Model Weights: Accessing the Power of TranscriptFormer
The pre-trained model weights are stored on AWS S3. Use the provided download_artifacts.py script to download the specific models you need.
Downloading Specific Models:
Downloading All Models and Embeddings:
Running Inference: Unleashing the Power of Generative Single-Cell Analysis
The inference.py script provides a user-friendly interface for running TranscriptFormer. It uses Hydra for configuration, allowing flexible parameter adjustments.
Basic Inference Command:
Key Parameters to Consider:
- model.checkpoint_path: Path to the model weights.
- model.inference_config.data_files: Path to your input data (H5AD format).
- model.inference_config.pretrained_embedding: Path to pretrained embeddings for out-of-distribution species.
- model.inference_config.batch_size: Adjust batch size based on your GPU VRAM.
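Because inference.py is configured through Hydra, parameters like these are normally passed as `key=value` overrides on the command line. The sketch below only assembles such a command. `build_inference_command` is a hypothetical helper and every path is a placeholder. This article does not say whether `data_files` takes a single path or a list, or whether the script needs further flags, so check the project's configuration before running anything.

```python
import shlex


def build_inference_command(overrides: dict, script: str = "inference.py") -> list:
    """Return an argv list of the form: python <script> key=value ..."""
    args = ["python", script]
    for key, value in overrides.items():
        # Hydra reads each key=value argument as a configuration override.
        args.append(f"{key}={value}")
    return args


# Placeholder paths, not files shipped with the project.
command = build_inference_command(
    {
        "model.checkpoint_path": "./checkpoints/tf_sapiens",
        "model.inference_config.data_files": "./data/my_cells.h5ad",
        "model.inference_config.batch_size": 8,
    }
)

# Print a shell-ready string for inspection before running it.
print(shlex.join(command))
```

Building the argument list first makes the overrides easy to review, and the same list can be handed to `subprocess.run` once the paths point at real files.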
Input Data Format: Preparing Your Data for TranscriptFormer
Ensure your input data is in H5AD format (AnnData objects) with specific requirements:
- Gene IDs: The var dataframe must contain an ensembl_id column populated with Ensembl gene identifiers.
- Expression Data: Raw count data should be stored in the adata.X matrix.
- Cell Metadata: Any cell metadata in the obs dataframe will be preserved in the output.
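These requirements can be checked before launching inference. The following is a minimal sketch under stated assumptions: `check_input` is a hypothetical helper, not a TranscriptFormer function, and the raw-count test (non-negative whole numbers) is a heuristic chosen here rather than a rule taken from the project.

```python
import numpy as np


def check_input(adata) -> list:
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []

    # Gene IDs: var must carry an 'ensembl_id' column with no missing values.
    if "ensembl_id" not in adata.var.columns:
        problems.append("var is missing the 'ensembl_id' column")
    elif adata.var["ensembl_id"].isna().any():
        problems.append("var['ensembl_id'] contains missing values")

    # Expression data: raw counts should be non-negative whole numbers.
    X = adata.X
    # Sparse matrices expose tocsr(); its .data holds only the stored entries.
    values = X.tocsr().data if hasattr(X, "tocsr") else np.asarray(X)
    if values.size and (values.min() < 0 or not np.all(np.mod(values, 1) == 0)):
        problems.append("X does not look like raw counts (negative or fractional values)")

    return problems
```

With anndata installed, `check_input(anndata.read_h5ad("my_cells.h5ad"))` runs the same checks on a file; the file name is a placeholder. A fractional value in X usually means the matrix has been normalized or log-transformed, in which case the raw counts need to be restored first.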
Output Format: Understanding Your Results
The inference results are saved as an AnnData object (embeddings.h5ad) in the output directory.
Key Output Components:
- Cell Embeddings: Stored in obsm['embeddings'].
- Original Cell Metadata: Preserved in the obs dataframe.
- Log-Likelihood Scores: Found in uns['llh'] (if available).
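Once the result file is loaded, these components can be inspected directly. The sketch below assumes the layout described above; `summarize_output` is a hypothetical helper, and it works on any object exposing `obsm`, `obs`, and `uns` the way an AnnData object does.

```python
import numpy as np


def summarize_output(adata) -> dict:
    """Collect basic facts about an inference result loaded from embeddings.h5ad."""
    embeddings = np.asarray(adata.obsm["embeddings"])  # one row per cell
    return {
        "n_cells": embeddings.shape[0],
        "embedding_dim": embeddings.shape[1],
        "metadata_columns": list(adata.obs.columns),
        # Log-likelihood scores are optional, so only report their presence.
        "has_llh": "llh" in adata.uns,
    }
```

With anndata installed, `summarize_output(anndata.read_h5ad("embeddings.h5ad"))` gives a quick sanity check that the number of embedded cells matches the input and that the metadata columns survived. The path is a placeholder for wherever the output directory is.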
Hardware and Software: What You Need to Run TranscriptFormer
To run TranscriptFormer efficiently, especially for inference and embedding extraction, a GPU (A100 40GB recommended) is beneficial. You can also use a GPU with less VRAM (16GB) by reducing the inference batch size to 1-4. The core software dependencies include PyTorch, PyTorch Lightning, anndata, scanpy, and others. Refer to the pyproject.toml file for a complete list of dependencies.
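One way to find a workable batch size on a smaller GPU is to start high and halve the value whenever memory runs out. This is a generic back-off pattern sketched for illustration, not something TranscriptFormer is documented to do. `run_with_smaller_batches` and the `run` callable are hypothetical, and `run` stands in for whatever launches inference at a given batch size.

```python
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def run_with_smaller_batches(
    run: Callable[[int], T],
    batch_size: int,
    min_batch_size: int = 1,
    oom_errors: tuple = (MemoryError,),
) -> Tuple[T, int]:
    """Call run(batch_size), halving the batch size after each out-of-memory error.

    Returns the result together with the batch size that succeeded.
    """
    while True:
        try:
            return run(batch_size), batch_size
        except oom_errors:
            if batch_size <= min_batch_size:
                raise  # already at the floor; let the error propagate
            batch_size = max(min_batch_size, batch_size // 2)
```

The exception types are a parameter because the error raised depends on the framework. With recent PyTorch versions on CUDA, passing `oom_errors=(torch.cuda.OutOfMemoryError,)` is the usual choice, and it can help to call `torch.cuda.empty_cache()` inside `run` before each retry.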
Contributing and Staying Secure
The TranscriptFormer project welcomes contributions and follows the Contributor Covenant code of conduct. Security matters too: report any security issues responsibly to [email protected].
Citing TranscriptFormer
If you use TranscriptFormer in your research, please cite the following publication:
Pearce, J. D., et al. (2025). A Cross-Species Generative Cell Atlas Across 1.5 Billion Years of Evolution: The TranscriptFormer Single-cell Model. bioRxiv.
By leveraging TranscriptFormer, you can unlock new insights into cellular evolution, disease mechanisms, and gene regulation.