# Unleash the Power of Your Data: Effortlessly Fine-Tune LLMs with Easy Dataset
Tired of wrestling with data preparation for your Large Language Models (LLMs)? Easy Dataset is your all-in-one solution, designed to simplify and streamline the process of creating high-quality fine-tuning datasets. Stop wasting time on manual splitting, question generation, and formatting—and start building smarter, more customized AI models today.
If you find this project helpful, show your support with a Star ⭐️!
## Transform Domain Knowledge into AI Gold
Easy Dataset lets you convert your domain-specific knowledge into structured datasets that are fully compatible with OpenAI-format LLM APIs. This means you can leverage the power of fine-tuning to create truly specialized AI models.
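For reference, "OpenAI-format" here means the widely adopted chat-completions request shape shown below. This is a minimal sketch of that public convention; the model name and message contents are placeholders, not values taken from Easy Dataset itself.

```python
import json

# A minimal OpenAI-style chat-completions payload. Any provider that
# accepts this shape can be plugged into the same fine-tuning tooling.
payload = {
    "model": "your-model-name",  # placeholder model identifier
    "messages": [
        {"role": "system", "content": "You are a domain expert assistant."},
        {"role": "user", "content": "Summarize the attached policy document."},
    ],
    "temperature": 0.7,
}

print(json.dumps(payload, indent=2))
```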
## Key Benefits: What Makes Easy Dataset a Game-Changer?
- Intelligent Document Processing: Automatically split Markdown files into logical segments for efficient data handling.
- Smart Question Generation: Extract relevant questions from your text segments, saving you countless hours of manual work.
- Comprehensive Answer Generation: Generate detailed answers using LLM APIs, creating complete training examples.
- Flexible Editing: Fine-tune your questions, answers, and datasets at any point in the process.
- Multiple Export Formats: Export in Alpaca or ShareGPT formats as JSON or JSONL files for maximum compatibility.
- Wide Model Support: Works seamlessly with all LLM APIs that follow the OpenAI format.
- User-Friendly Interface: Enjoy a simple, intuitive interface designed for both technical and non-technical users.
- Customizable System Prompts: Add custom prompts to guide model responses and tailor your datasets.
## Get Started in Minutes: A Quick Start Guide

Ready to dive in? Here's how to get Easy Dataset up and running:

1. Install the Prerequisites: Make sure you have Node.js 18.x or higher and either pnpm (recommended) or npm installed.
2. Clone the Repository.
3. Install Dependencies.
4. Start the Development Server.
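In a terminal, the steps above typically look like the following. The repository URL is left as a placeholder, so substitute the project's actual URL:

```bash
# 2. Clone the repository (replace <repository-url> with the project's URL)
git clone <repository-url>
cd easy-dataset

# 3. Install dependencies (pnpm recommended; npm works too)
pnpm install

# 4. Start the development server
pnpm dev
```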
### Build with Local Dockerfile (Optional)

For containerized deployments:

1. Clone the Repository.
2. Build the Docker Image.
3. Run the Container, replacing {YOUR_LOCAL_DB_PATH} with the desired location for your local database.
4. Access the Application: Open your browser and go to http://localhost:1717.
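A sketch of those Docker steps follows. The image tag and the in-container database path are assumptions for illustration; check the project's Dockerfile for the actual volume mount point.

```bash
# Build the image from the repository root (tag name is illustrative)
docker build -t easy-dataset .

# Run it, publishing the app's port and persisting the local database.
# {YOUR_LOCAL_DB_PATH} is a directory on your host machine; the
# in-container path /app/local-db is an assumption, not confirmed.
docker run -d -p 1717:1717 \
  -v {YOUR_LOCAL_DB_PATH}:/app/local-db \
  easy-dataset
```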
## From Raw Text to Refined Datasets: Using Easy Dataset in 4 Steps

Easy Dataset simplifies the entire dataset creation workflow:

1. Create a Project:
   - Click "Create Project" on the home page.
   - Enter a name and description.
   - Configure your LLM API settings.
2. Process Documents:
   - Upload Markdown files in the "Text Split" section.
   - Review and adjust the automatically split text segments.
3. Generate Questions:
   - Navigate to the "Questions" section.
   - Select the segments to generate questions from.
   - Review, edit, and organize questions using tags.
4. Create Datasets:
   - Go to the "Datasets" section.
   - Select questions for your dataset.
   - Generate and refine answers using your configured LLM.
   - Export your dataset in various formats for use in fine-tuning.
## Exporting Your Datasets: Ready for Fine-Tuning

1. Click the "Export" button in the Datasets section.
2. Choose your preferred format (Alpaca or ShareGPT) and file type (JSON or JSONL).
3. Add a custom system prompt (optional) for fine-grained control over the model's behavior.
4. Export your dataset and start fine-tuning your LLM!
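To make the format choice concrete, here is one training example expressed in each convention. The field names follow the publicly established Alpaca and ShareGPT conventions; Easy Dataset's exact output fields may differ slightly, so treat this as a sketch.

```python
import json

# One training example in the commonly used Alpaca convention.
alpaca_record = {
    "instruction": "What does clause 4.2 of the policy cover?",
    "input": "",
    "output": "Clause 4.2 covers data-retention periods for customer records.",
    "system": "You are a compliance assistant.",  # optional system prompt
}

# The same example in the ShareGPT convention: a list of conversation turns.
sharegpt_record = {
    "conversations": [
        {"from": "system", "value": "You are a compliance assistant."},
        {"from": "human", "value": "What does clause 4.2 of the policy cover?"},
        {"from": "gpt", "value": "Clause 4.2 covers data-retention periods for customer records."},
    ]
}

# JSONL simply puts one JSON object per line; JSON wraps them in an array.
jsonl = "\n".join(json.dumps(r) for r in (alpaca_record, sharegpt_record))
print(len(jsonl.splitlines()))  # 2
```

Alpaca is a flat instruction/output pair, while ShareGPT preserves multi-turn structure; which to pick depends on what your fine-tuning framework expects.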
## Contribute to the Future of Easy Dataset

We encourage community contributions!

- Fork the repository.
- Create a new branch (`git checkout -b feature/amazing-feature`).
- Make your changes.
- Commit your changes (`git commit -m 'Add some amazing feature'`).
- Push to the branch (`git push origin feature/amazing-feature`).
- Open a Pull Request.

Remember to update tests and maintain code consistency.
## License

This project is licensed under the Apache License 2.0. See the `LICENSE` file for details.