TranscriptFormer: A Cross-Species Generative Cell Atlas for Single-Cell Analysis
TranscriptFormer is a family of generative foundation models for single-cell analysis. Trained on up to 112 million cells across 12 species, it lets researchers explore cellular diversity and regulatory relationships across species. This overview covers the models behind this cross-species generative cell atlas, how to install them, and how to run inference on your own data.
What is TranscriptFormer? A Generative Foundation Model for Cells
TranscriptFormer is a novel deep learning model designed to understand single-cell transcriptomes. It learns complex relationships between genes and their expression levels, providing a powerful tool for:
- Cell type classification: Accurately identify cell types across different species.
- Disease state identification: Identify disease states in human cells.
- Regulatory relationship prediction: Predict cell-type-specific transcription factors and gene-gene regulatory relationships.
Three Powerful Models: Choose the Right TranscriptFormer for Your Needs
TranscriptFormer offers three pre-trained models, each covering a different scope of single-cell analysis:
- TF-Metazoa: The most comprehensive model, trained on 112 million cells from 12 species, ideal for broad cross-species comparisons.
- TF-Exemplar: Focuses on human and four key model organisms, perfect for comparative studies and translational research.
- TF-Sapiens: Trained exclusively on 57 million human cells, optimized for in-depth analysis of human biology and disease.
Key Features: Why TranscriptFormer Stands Out
TranscriptFormer leverages a unique architecture to achieve state-of-the-art performance in single-cell analysis. Here's what makes it special:
- Generative Autoregressive Joint Model: Simultaneously models genes and their expression levels to capture complex biological relationships.
- Transformer-Based Architecture: Employs a powerful transformer network with novel coupling between gene and transcript heads.
- Expression-Aware Multi-Head Self-Attention: Improves the model's ability to understand the context of gene expression within individual cells.
- Count Likelihood: Accurately captures the variability in transcript-level data.
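One way to read the first bullet is as a chain-rule factorization over an ordered sequence of gene and count pairs. The notation below is an assumption introduced here for illustration and is not taken from the TranscriptFormer authors: g_i is the i-th gene in a cell's sequence, c_i is its transcript count, and n is the number of genes in the sequence.

```latex
p(g_1, c_1, \dots, g_n, c_n)
  = \prod_{i=1}^{n}
    p\left(g_i \mid g_{<i}, c_{<i}\right)\,
    p\left(c_i \mid g_{\le i}, c_{<i}\right)
```

Under this reading, the first factor would be produced by a gene head and the second by a count head scored with a count likelihood, which is one plausible interpretation of the coupling between heads described above.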
Installation: Get Started with TranscriptFormer in Minutes
Ready to start using TranscriptFormer? Here's how to install it:
- Clone the repository:
- Create and activate a virtual environment:
- Install TranscriptFormer:
Running Inference: Analyze Your Single-Cell Data with Ease
Once installed, run inference with the inference.py script, using a YAML configuration file to specify your parameters.
Example: Running inference on human data with TF-Sapiens:
Input and Output: Understanding the Data Flow
TranscriptFormer expects specific input and output formats:
- Input: H5AD format (AnnData objects) with raw count data and Ensembl gene identifiers.
- Output: H5AD file containing cell embeddings, original cell metadata, and log-likelihood scores.
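The input requirements above can be checked before running inference. The following is a minimal sketch, not part of TranscriptFormer: the function name `check_input` is hypothetical, and it takes a dense cells-by-genes array plus a list of gene identifiers (for an AnnData object these would typically come from `adata.X`, densified if sparse, and `adata.var_names`, though where your file stores gene IDs is an assumption to verify). The pattern only covers ENS-prefixed Ensembl IDs; some species use other identifier schemes, so treat a mismatch as a prompt to inspect the data rather than a hard error.

```python
import re

import numpy as np

# ENS-prefixed Ensembl gene IDs: "ENS", optional species letters, "G",
# 11 digits, and an optional ".version" suffix (e.g. ENSG00000141510).
# Assumption: IDs for species that use other schemes will not match.
ENSEMBL_GENE_ID = re.compile(r"ENS[A-Z]*G\d{11}(\.\d+)?")


def check_input(counts, gene_ids):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    counts = np.asarray(counts)

    if counts.ndim != 2:
        return ["count matrix must be 2-D (cells x genes)"]
    if counts.shape[1] != len(gene_ids):
        problems.append("number of gene IDs does not match number of columns")

    # Raw counts are non-negative integers; fractional or negative values
    # usually mean the matrix was normalized or log-transformed.
    if np.any(counts < 0):
        problems.append("negative values found; expected raw counts")
    if not np.all(np.mod(counts, 1) == 0):
        problems.append("non-integer values found; expected raw counts")

    bad_ids = [g for g in gene_ids if not ENSEMBL_GENE_ID.fullmatch(g)]
    if bad_ids:
        problems.append(f"{len(bad_ids)} gene IDs do not look like ENS-prefixed Ensembl IDs")

    return problems
```

Calling `check_input([[0, 3], [1, 0]], ["ENSG00000141510", "ENSMUSG00000059552"])` returns an empty list, while a matrix containing 0.5 or a gene symbol such as "TP53" produces one message per problem.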
Hardware Requirements: Optimizing Performance
For efficient inference and embedding extraction, a GPU is recommended, ideally an A100 40GB. TranscriptFormer also runs on a GPU with less VRAM (16GB) if you reduce the inference batch size to 1-4, which makes it accessible to a wider range of researchers.
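Lowering the batch size trades throughput for peak memory, because fewer cells are held on the GPU at once. The helper below is a generic sketch of that chunking, with a hypothetical name `iter_batches`; it is not TranscriptFormer's own data loader and only illustrates how a dataset is covered by small batches.

```python
def iter_batches(n_cells, batch_size):
    """Yield (start, stop) index pairs that cover range(n_cells) in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, n_cells, batch_size):
        # The final batch may be shorter than batch_size.
        yield start, min(start + batch_size, n_cells)


# With 10 cells and a batch size of 4, three batches are needed.
batches = list(iter_batches(10, 4))
print(batches)  # [(0, 4), (4, 8), (8, 10)]
```

With a batch size of 1, the same 10 cells would take 10 forward passes, which is slower but keeps memory use at its minimum.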
Contributing: Join the TranscriptFormer Community
TranscriptFormer is an open-source project, and contributions are welcome! Please adhere to the Contributor Covenant code of conduct and report any unacceptable behavior to [email protected]. Together, we can advance the field of single-cell analysis and unlock new discoveries.