# Unlock the Power of Your Data: Fine-Tune LLMs with Easy Dataset
Tired of wrestling with complex tools to prepare your data for Large Language Models (LLMs)? Easy Dataset simplifies the entire process, empowering you to transform your domain expertise into high-quality training data. Stop wasting time on tedious data preparation and start fine-tuning your LLMs for optimal performance with this intuitive and powerful tool.
## Key Benefits: Why Choose Easy Dataset?
- Streamlined Workflow: Transform raw documents into structured datasets ready for LLM fine-tuning.
- Enhanced Model Performance: Improve accuracy and relevance by training on domain-specific data.
- Increased Efficiency: Save time and resources with automated data processing and generation.
## Features That Make a Difference
Easy Dataset boasts a comprehensive suite of features designed to accelerate your LLM fine-tuning workflow:
- Intelligent Document Processing: Automatically split Markdown files into logical segments for focused training.
- Smart Question Generation: Extract relevant questions from text segments, ensuring comprehensive coverage of your data.
- Automated Answer Generation: Leverage LLM APIs to generate detailed answers, reducing manual effort.
- Flexible Editing: Refine questions, answers, and datasets at any stage to ensure data quality.
- Versatile Export Options: Export datasets in Alpaca, ShareGPT, JSON, and JSONL formats for seamless compatibility.
- Broad Model Compatibility: Works with any LLM API following the OpenAI format, giving you flexibility.
- User-Friendly Interface: An intuitive UI makes Easy Dataset accessible to both technical and non-technical users.
- Customizable System Prompts: Guide model responses with custom prompts tailored to your specific needs.
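"Broad Model Compatibility" means any provider that exposes an OpenAI-style `/v1/chat/completions` endpoint should work. As a rough sketch, a request body in that format looks like the following (the model name and message contents are illustrative, not taken from Easy Dataset's configuration):

```json
{
  "model": "your-model-name",
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Generate questions for this text segment." }
  ],
  "temperature": 0.7
}
```

If your provider accepts a payload shaped like this, it can plug into Easy Dataset's answer-generation step.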
## Getting Started: Your Path to Fine-Tuned LLMs
Ready to experience the power of Easy Dataset? Here's how to get started:
1. **Download the client** or run from source. Running from source requires Node.js 18.x or higher and either pnpm (recommended) or npm.
2. **Clone the repository.**
3. **Install dependencies.**
4. **Start the development server.**
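The source-install steps above map to commands like the following. The repository URL and the `dev` script name are assumptions based on common conventions for this project; substitute your own fork or script names if they differ:

```shell
# Clone the repository (URL is an assumption; adjust if needed)
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset

# Install dependencies (pnpm recommended; "npm install" also works)
pnpm install

# Start the development server
pnpm dev
```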
### Build with a Local Dockerfile (Optional)

For isolated environments, you can run Easy Dataset with Docker:
1. **Clone the repository.**
2. **Build the Docker image.**
3. **Run the container.** Important: replace `{YOUR_LOCAL_DB_PATH}` with the desired location for your local database.
4. **Open in a browser:** access the application at `http://localhost:1717`.
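Concretely, the Docker steps look roughly like this. The repository URL, image tag, and in-container database path (`/app/local-db`) are assumptions; check the project's Dockerfile for the actual mount point:

```shell
# Clone the repository (URL is an assumption; adjust if needed)
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset

# Build the image (tag is illustrative)
docker build -t easy-dataset .

# Run the container, publishing port 1717 and persisting the
# local database on the host; replace {YOUR_LOCAL_DB_PATH}
# with a directory of your choosing
docker run -d -p 1717:1717 \
  -v {YOUR_LOCAL_DB_PATH}:/app/local-db \
  easy-dataset
```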
## Usage: A Step-by-Step Guide to Dataset Creation

Easy Dataset guides you through dataset creation in five steps:
1. **Create a project:** click "Create Project," enter a name and description, and configure your LLM API settings.
2. **Process documents:** upload Markdown files in the "Text Split" section, review the segmented text, and adjust as needed.
3. **Generate questions:** navigate to "Questions," select text segments, and review or edit the generated questions. Organize them using the tag tree.
4. **Create datasets:** go to "Datasets," select the questions to include, generate answers with your configured LLM, and review or edit the results.
5. **Export datasets:** click "Export," choose a dataset style (Alpaca or ShareGPT), a file format (JSON or JSONL), optionally add a custom system prompt, and export.
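For reference, the two export styles structure each record differently. In Alpaca format, a single JSONL line looks roughly like this (field contents are illustrative, not actual Easy Dataset output):

```json
{"instruction": "What is text splitting?", "input": "", "output": "Text splitting divides a document into logical segments so each training example stays focused."}
```

A ShareGPT-style export instead wraps each question–answer pair in a `conversations` array whose turns carry `from` (`human`/`gpt`) and `value` fields.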
## Project Structure: A Glimpse Under the Hood

The project's organized structure makes it easy to navigate and contribute:

- `app/`: Next.js application directory with API routes and front-end pages.
- `components/`: React components for the datasets, home, projects, questions, and text-splitting sections.
- `lib/`: Core libraries and utilities for database operations, internationalization, LLM integration, and text splitting.
- `locales/`: Internationalization resources for English and Chinese.
- `public/`: Static assets, including images.
- `local-db/`: Local file-based database for storing project data.
## Contribute and Shape the Future of LLM Training
Easy Dataset thrives on community contributions! Fork the repository, create a feature branch, make your changes, and submit a pull request.
## License: Open and Accessible
This project is licensed under the Apache License 2.0, ensuring open access and collaboration.