# Unleash the Power of Your Data: Effortlessly Fine-Tune LLMs with Easy Dataset
Tired of wrestling with data preparation for your Large Language Models (LLMs)? Easy Dataset is your all-in-one solution, designed to simplify and streamline the process of creating high-quality fine-tuning datasets. Stop wasting time on manual splitting, question generation, and formatting—and start building smarter, more customized AI models today.
If you find this project helpful, show your support with a Star ⭐️!
## Transform Domain Knowledge into AI Gold
Easy Dataset lets you convert your domain-specific knowledge into structured datasets that are fully compatible with OpenAI-format LLM APIs. This means you can leverage the power of fine-tuning to create truly specialized AI models.
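For reference, "OpenAI-format" here means the widely adopted chat-completions request shape shown below. This is a minimal sketch of that public convention; the model name and message contents are placeholders, not values taken from Easy Dataset itself.

```python
import json

# A minimal OpenAI-style chat-completions payload. Any provider that
# accepts this shape can be plugged into the same fine-tuning tooling.
payload = {
    "model": "your-model-name",  # placeholder model identifier
    "messages": [
        {"role": "system", "content": "You are a domain expert assistant."},
        {"role": "user", "content": "Summarize the attached policy document."},
    ],
    "temperature": 0.7,
}

print(json.dumps(payload, indent=2))
```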
## Key Benefits: What Makes Easy Dataset a Game-Changer?
- Intelligent Document Processing: Automatically split Markdown files into logical segments for efficient data handling.
- Smart Question Generation: Extract relevant questions from your text segments, saving you countless hours of manual work.
- Comprehensive Answer Generation: Generate detailed answers using LLM APIs, creating complete training examples.
- Flexible Editing: Fine-tune your questions, answers, and datasets at any point in the process.
- Multiple Export Formats: Export in Alpaca or ShareGPT formats as JSON or JSONL files for maximum compatibility.
- Wide Model Support: Works seamlessly with all LLM APIs that follow the OpenAI format.
- User-Friendly Interface: Enjoy a simple, intuitive interface designed for both technical and non-technical users.
- Customizable System Prompts: Add custom prompts to guide model responses and tailor your datasets.
## Get Started in Minutes: A Quick Start Guide

Ready to dive in? Here's how to get Easy Dataset up and running:

1. Install the Prerequisites: Make sure you have Node.js 18.x or higher and either pnpm (recommended) or npm installed.
2. Clone the Repository.
3. Install Dependencies.
4. Start the Development Server.
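In a terminal, the steps above typically look like the following. The repository URL is left as a placeholder, so substitute the project's actual URL:

```bash
# 2. Clone the repository (replace <repository-url> with the project's URL)
git clone <repository-url>
cd easy-dataset

# 3. Install dependencies (pnpm recommended; npm works too)
pnpm install

# 4. Start the development server
pnpm dev
```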
### Build with Local Dockerfile (Optional)

For containerized deployments:

1. Clone the Repository.
2. Build the Docker Image.
3. Run the Container, replacing {YOUR_LOCAL_DB_PATH} with the desired location for your local database.
4. Access the Application: Open your browser and go to http://localhost:1717.
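A sketch of those Docker steps follows. The image tag and the in-container database path are assumptions for illustration; check the project's Dockerfile for the actual volume mount point.

```bash
# Build the image from the repository root (tag name is illustrative)
docker build -t easy-dataset .

# Run it, publishing the app's port and persisting the local database.
# {YOUR_LOCAL_DB_PATH} is a directory on your host machine; the
# in-container path /app/local-db is an assumption, not confirmed.
docker run -d -p 1717:1717 \
  -v {YOUR_LOCAL_DB_PATH}:/app/local-db \
  easy-dataset
```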
## From Raw Text to Refined Datasets: Using Easy Dataset in 4 Steps

Easy Dataset simplifies the entire dataset creation workflow:

1. Create a Project:
   - Click "Create Project" on the home page.
   - Enter a name and description.
   - Configure your LLM API settings.
2. Process Documents:
   - Upload Markdown files in the "Text Split" section.
   - Review and adjust the automatically split text segments.
3. Generate Questions:
   - Navigate to the "Questions" section.
   - Select the segments to generate questions from.
   - Review, edit, and organize questions using tags.
4. Create Datasets:
   - Go to the "Datasets" section.
   - Select questions for your dataset.
   - Generate and refine answers using your configured LLM.
   - Export your dataset in various formats for use in fine-tuning.
## Exporting Your Datasets: Ready for Fine-Tuning

1. Click the "Export" button in the Datasets section.
2. Choose your preferred format (Alpaca or ShareGPT) and file type (JSON or JSONL).
3. Add a custom system prompt (optional) for fine-grained control over the model's behavior.
4. Export your dataset and start fine-tuning your LLM!
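To make the format choice concrete, here is one training example expressed in each convention. The field names follow the publicly established Alpaca and ShareGPT conventions; Easy Dataset's exact output fields may differ slightly, so treat this as a sketch.

```python
import json

# One training example in the commonly used Alpaca convention.
alpaca_record = {
    "instruction": "What does clause 4.2 of the policy cover?",
    "input": "",
    "output": "Clause 4.2 covers data-retention periods for customer records.",
    "system": "You are a compliance assistant.",  # optional system prompt
}

# The same example in the ShareGPT convention: a list of conversation turns.
sharegpt_record = {
    "conversations": [
        {"from": "system", "value": "You are a compliance assistant."},
        {"from": "human", "value": "What does clause 4.2 of the policy cover?"},
        {"from": "gpt", "value": "Clause 4.2 covers data-retention periods for customer records."},
    ]
}

# JSONL simply puts one JSON object per line; JSON wraps them in an array.
jsonl = "\n".join(json.dumps(r) for r in (alpaca_record, sharegpt_record))
print(len(jsonl.splitlines()))  # 2
```

Alpaca is a flat instruction/output pair, while ShareGPT preserves multi-turn structure; which to pick depends on what your fine-tuning framework expects.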
## Contribute to the Future of Easy Dataset

We encourage community contributions!

- Fork the repository.
- Create a new branch (`git checkout -b feature/amazing-feature`).
- Make your changes.
- Commit your changes (`git commit -m 'Add some amazing feature'`).
- Push to the branch (`git push origin feature/amazing-feature`).
- Open a Pull Request.

Remember to update tests and maintain code consistency.
## License

This project is licensed under the Apache License 2.0. See the `LICENSE` file for details.