Paper2Code: Generate Code Repositories from Machine Learning Papers via LLMs
Stop manually translating research papers into functional code. Paper2Code automates the process, turning complex scientific documents into ready-to-use code repositories. This article dives into how this innovative tool works, its benefits, and how you can start using it today. Automate your research workflow and bring cutting-edge algorithms to life faster than ever!
What is Paper2Code and How Does it Work?
Paper2Code is a multi-agent, Large Language Model (LLM) system designed to convert research papers into functional code repositories. It streamlines the process of implementing machine learning algorithms described in academic literature, saving you countless hours of manual coding and debugging. Forget struggling with cryptic descriptions. Paper2Code makes research accessible and actionable.
- Planning Stage: The system first analyzes the paper to understand the overall structure and goals.
- Analysis Stage: Specialized agents dissect the paper's sections, identifying key algorithms and data structures.
- Code Generation Stage: The system crafts a functional code repository based on the analysis.
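The three stages above can be pictured as a chain of LLM calls, each consuming the paper plus the previous stage's output. The sketch below is purely illustrative (the prompts, function names, and model choice are assumptions, not PaperCoder's actual code), but it captures the planning, analysis, and code-generation flow:

```python
# Conceptual sketch of a planning -> analysis -> code-generation pipeline.
# Prompts and function names are illustrative; they are not PaperCoder's internals.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str) -> str:
    """Send a single prompt to a chat model and return the text reply."""
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


def plan(paper: str) -> str:
    # Planning stage: overall repository structure and goals.
    return ask(f"Read this paper and outline a repository plan "
               f"(files, classes, training and evaluation scripts):\n{paper}")


def analyze(paper: str, plan_text: str) -> str:
    # Analysis stage: algorithms, data structures, hyperparameters per file.
    return ask(f"Using this plan:\n{plan_text}\n\n"
               f"Extract the algorithms, data structures, and hyperparameters "
               f"needed to implement the paper:\n{paper}")


def generate_code(paper: str, plan_text: str, analysis: str) -> str:
    # Code-generation stage: produce the actual implementation.
    return ask(f"Write the repository code.\nPlan:\n{plan_text}\n"
               f"Analysis:\n{analysis}\nPaper:\n{paper}")


if __name__ == "__main__":
    paper_text = open("paper.json", encoding="utf-8").read()  # JSON from s2orc-doc2json
    repo_plan = plan(paper_text)
    repo_analysis = analyze(paper_text, repo_plan)
    print(generate_code(paper_text, repo_plan, repo_analysis))
```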
Why Use Paper2Code? Key Benefits
- Saves Time & Resources: Automate code generation, freeing up time for research and experimentation.
- Reduces Errors: Minimize human error in the translation from paper to code.
- Improves Reproducibility: Ensure faithful implementations of research findings.
- Increases Accessibility: Makes complex algorithms readily available for practical application.
Quick Start Guide: Running Paper2Code
Ready to try Paper2Code? The steps below use the "Attention Is All You Need" paper as an example. To run with the OpenAI API:
- Install OpenAI:
pip install openai
- Set API Key:
export OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"
- Run the Script:
cd scripts; bash run.sh
If you prefer open-source models with vLLM:
- Install vLLM:
pip install vllm
(Refer to the official vLLM repository if you face issues.)
- Run the Script:
cd scripts; bash run_llm.sh
(The default model is deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct.)
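Before launching the full pipeline, it can be worth checking that vLLM can load and run the default model on your hardware. The snippet below is a generic vLLM smoke test, not one of Paper2Code's scripts; the prompt and sampling settings are arbitrary, and the model requires a GPU with enough memory:

```python
# Quick vLLM smoke test for the default open-source model (not a Paper2Code script).
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
    trust_remote_code=True,  # needed for the DeepSeek-V2 architecture on some vLLM versions
)
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(
    ["Write a Python function for scaled dot-product attention."],
    params,
)
print(outputs[0].outputs[0].text)
```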
Detailed Setup Instructions: Configure Your Environment
Follow these steps to configure your environment correctly before running Paper2Code.
Setting Up Your Environment
- Install the necessary packages:
- For OpenAI API:
pip install openai
- For open-source models:
pip install vllm
Converting PDF to JSON
Paper2Code requires the research paper to be in JSON format. Use the s2orc-doc2json repository to convert your PDF.
- Clone the Repository:
git clone https://github.com/allenai/s2orc-doc2json.git
- Run the PDF Processing Service:
cd ./s2orc-doc2json/grobid-0.7.3
./gradlew run
- Convert PDF to JSON:
mkdir -p ./s2orc-doc2json/output_dir/paper_coder
python ./s2orc-doc2json/doc2json/grobid2json/process_pdf.py -i ${PDF_PATH} -t ./s2orc-doc2json/temp_dir/ -o ./s2orc-doc2json/output_dir/paper_coder
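After the conversion finishes, a quick sanity check can confirm the JSON actually contains the parsed paper. The snippet below only inspects top-level keys, since the exact schema comes from s2orc-doc2json; the file name is an example, so adjust it to whatever process_pdf.py produced:

```python
# Inspect the JSON produced by s2orc-doc2json (file name is an example; use your own output).
import json

with open("./s2orc-doc2json/output_dir/paper_coder/attention_is_all_you_need.json",
          encoding="utf-8") as f:
    paper = json.load(f)

print("Top-level keys:", list(paper.keys()))          # e.g. title, abstract, parsed body
print("Approximate size:", len(json.dumps(paper)), "characters")
```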
Once converted, you can run PaperCoder with your own papers by modifying the environment variables in the scripts.
Exploring the Paper2Code Benchmark Datasets
Paper2Code includes benchmark datasets for evaluation. Find detailed descriptions in the data/paper2code directory, and refer to Section 4.1 of the paper for more information on the Paper2Code Benchmark.
Evaluating Generated Repositories: Model-Based Approach
Paper2Code uses a model-based approach to evaluate the quality of generated repositories, in both reference-based and reference-free settings. The evaluation model assesses key implementation components, assigns severity levels to any issues it finds, and produces a correctness score, which helps ensure the generated code meets a high quality bar.
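As a mental model, the judge can be thought of as an LLM prompted with the paper, the generated repository, and (in the reference-based setting) a reference implementation, returning a severity-annotated critique and a score. The rubric, score scale, and prompt below are illustrative assumptions, not the evaluation code shipped with Paper2Code:

```python
# Illustrative LLM-as-judge sketch (rubric, scale, and prompt are assumptions,
# not PaperCoder's actual evaluation code).
from openai import OpenAI

client = OpenAI()


def judge(paper: str, generated_repo: str, reference_repo: str | None = None) -> str:
    """Reference-free if reference_repo is None, reference-based otherwise."""
    prompt = (
        "You are reviewing a code repository generated from a research paper.\n"
        "List the key implementation components, flag issues with a severity "
        "(critical/major/minor), and end with a correctness score from 1 to 5.\n\n"
        f"Paper:\n{paper}\n\nGenerated repository:\n{generated_repo}\n"
    )
    if reference_repo is not None:
        prompt += f"\nReference implementation for comparison:\n{reference_repo}\n"
    response = client.chat.completions.create(
        model="o3-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```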
How to Evaluate
- Install tiktoken:
pip install tiktoken
- Set API Key:
export OPENAI_API_KEY="<YOUR_OPENAI_API_KEY>"
Reference-Free and Reference-Based Evaluation
Run the evaluation in whichever setting fits your situation: reference-free when no author implementation is available, or reference-based when an official repository can serve as the reference. Be sure to adjust paths and arguments to fit your evaluation needs.
Understanding Evaluation Output
The evaluation process provides a comprehensive summary of the generated code's performance. Key metrics include the correctness score, validity, and cost analysis.
Here's an example output:
========================================
Evaluation Summary
Paper name: Transformer
Evaluation type: ref_based
Target repo directory: ../outputs/Transformer_repo
Evaluation result:
  Score: 4.5000
  Valid: 8/8
========================================
Usage Summary
[Evaluation] Transformer - ref_based
Model: o3-mini
Input tokens: 44318 (Cost: $0.04874980)
Cached input tokens: 0 (Cost: $0.00000000)
Output tokens: 26310 (Cost: $0.11576400)
Current total cost: $0.16451380
Accumulated total cost so far: $0.16451380
========================================
This output tells you the paper name, evaluation type, target repository directory, evaluation score, and cost analysis. It gives insights into the efficiency and effectiveness of the code generation process.
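The dollar amounts are simply the token counts multiplied by the per-million-token rates of the judging model (here o3-mini, at $1.10 per million input tokens and $4.40 per million output tokens at the time of writing; check current pricing before relying on these numbers). You can reproduce the figures above directly:

```python
# Reproduce the cost figures from the evaluation summary above.
INPUT_RATE = 1.10 / 1_000_000    # USD per input token (o3-mini, at the time of writing)
OUTPUT_RATE = 4.40 / 1_000_000   # USD per output token

input_cost = 44318 * INPUT_RATE      # -> 0.04874980
output_cost = 26310 * OUTPUT_RATE    # -> 0.11576400
print(f"Input:  ${input_cost:.8f}")
print(f"Output: ${output_cost:.8f}")
print(f"Total:  ${input_cost + output_cost:.8f}")   # -> 0.16451380
```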