Unlock the Power of Your Data: Fine-Tune LLMs with Easy Dataset
Tired of wrestling with complex tools to prepare your data for Large Language Models (LLMs)? Easy Dataset is your answer – a streamlined application designed to make LLM fine-tuning accessible and efficient. Effortlessly transform your domain knowledge into structured datasets ready for any OpenAI-format compatible LLM API. Boost your productivity and unlock the full potential of your data with Easy Dataset.
Why Choose Easy Dataset for LLM Fine-Tuning?
- Stop wasting time on tedious data preparation. Easy Dataset simplifies the entire process, from document uploading to dataset exporting.
- Achieve superior model performance. Fine-tune your LLMs with high-quality, domain-specific data crafted using Easy Dataset's intelligent features.
Key Features That Supercharge Your LLM Fine-Tuning Workflow
Easy Dataset is packed with tools designed to optimize every stage of dataset creation:
- Intelligent Document Processing: Automatically split Markdown files into meaningful segments. No more manual chopping and rearranging!
- Smart Question Generation: Extract relevant questions from text segments, paving the way for effective training data.
- Answer Generation: Leverage LLM APIs to generate comprehensive answers for each question.
- Flexible Editing: Fine-tune questions, answers, and datasets at any point in the process. You are always in control.
- Multiple Export Formats: Export datasets in Alpaca or ShareGPT formats as JSON or JSONL files, ensuring compatibility.
- Wide Model Support: Seamlessly integrate with all LLM APIs compatible with the OpenAI format.
- User-Friendly Interface: An intuitive UI caters to both technical and non-technical users.
- Customizable System Prompts: Guide model responses with tailored system prompts.
Getting Started with Easy Dataset: A Quick Guide
Ready to dive in? Here’s how to get started with Easy Dataset:
Installation
1. Prerequisites: Ensure you have Node.js 18.x or higher and either pnpm (recommended) or npm installed.
2. Clone the repository.
3. Install the dependencies.
4. Start the development server.
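The steps above can be sketched as shell commands. The repository URL is an assumption based on the project name; substitute the actual URL if it differs:

```shell
# Clone the repository (URL assumed from the project name)
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset

# Install dependencies (pnpm recommended; `npm install` also works)
pnpm install

# Start the development server
pnpm dev
```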
Docker Installation
1. Clone the repository.
2. Build the Docker image.
3. Run the container. Note: replace {YOUR_LOCAL_DB_PATH} with the desired path for your local database.
4. Access Easy Dataset: open your browser and navigate to http://localhost:1717.
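As a sketch, the Docker steps above might look like the following. The image tag and the in-container /app/local-db path are assumptions (the latter based on the project's local-db/ directory); adjust them to match the project's Dockerfile:

```shell
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset

# Build the image (the tag name is an assumption)
docker build -t easy-dataset .

# Run the container, mapping port 1717 and persisting the local database.
# Replace {YOUR_LOCAL_DB_PATH} with a path on your machine.
docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db easy-dataset
```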
Using Easy Dataset: From Project Creation to Export
1. Creating a Project
- Click the "Create Project" button on the home page.
- Enter a project name and description.
- Configure your LLM API settings.
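For context on the LLM API settings: "OpenAI-format compatible" means the provider accepts the standard chat-completions request shape. A minimal sketch of such a request payload (all field values are placeholders, not real credentials, endpoints, or model names):

```python
import json

# Hypothetical settings for any OpenAI-format compatible provider.
settings = {
    "base_url": "https://api.example.com/v1",  # placeholder endpoint
    "api_key": "YOUR_API_KEY",                 # placeholder credential
    "model": "your-model-name",                # placeholder model id
}

# The standard OpenAI-format chat-completions payload such a provider accepts.
payload = {
    "model": settings["model"],
    "messages": [
        {"role": "system", "content": "You are a dataset-generation assistant."},
        {"role": "user", "content": "Generate an answer for this question."},
    ],
}

print(json.dumps(payload, indent=2))
```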
2. Processing Documents
- Upload Markdown files in the "Text Split" section.
- Review automatically split text segments.
- Adjust segmentation as needed.
3. Generating Questions
- Navigate to the "Questions" section.
- Select text segments to generate questions from.
- Review and edit the generated questions.
- Organize questions using the tag tree.
4. Creating Datasets
- Go to the "Datasets" section.
- Select questions to include in your dataset.
- Generate answers using your configured LLM.
- Review and edit the generated answers.
5. Exporting Datasets
- Click the "Export" button in the Datasets section.
- Select your format (Alpaca or ShareGPT).
- Choose file format (JSON or JSONL).
- Add custom system prompts if needed.
- Export your dataset.
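To make the two export formats concrete, here is a sketch of how a single question-answer pair looks in each. The field names follow the commonly used Alpaca and ShareGPT conventions; the record contents are invented examples, and the exact fields Easy Dataset emits may differ:

```python
import json

question = "What is fine-tuning?"
answer = "Fine-tuning adapts a pretrained model to a specific domain."
system_prompt = "You are a helpful domain expert."  # optional custom system prompt

# Alpaca format: flat instruction / input / output fields.
alpaca_record = {
    "instruction": question,
    "input": "",
    "output": answer,
    "system": system_prompt,
}

# ShareGPT format: a list of role-tagged conversation turns.
sharegpt_record = {
    "conversations": [
        {"from": "system", "value": system_prompt},
        {"from": "human", "value": question},
        {"from": "gpt", "value": answer},
    ]
}

# JSONL export: one JSON object per line.
print(json.dumps(alpaca_record))
print(json.dumps(sharegpt_record))
```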
Dive Deeper: Exploring the Project Structure
The easy-dataset/ project directory is neatly organized to facilitate easy navigation and customization:
- app/: The Next.js application, including API routes and project pages.
- components/: React components for sections like datasets, home, projects, questions, and text splitting.
- lib/: Core libraries and utilities, such as database operations, internationalization, LLM integration, and text splitting.
- locales/: Internationalization resources, with English (en/) and Chinese (zh-CN/) translations.
- public/: Static assets, including image resources.
- local-db/: The local file-based database.
- projects/: Project data.
Contribute and Shape the Future of Easy Dataset
Your contributions are welcome! Help improve Easy Dataset by:
- Forking the repository.
- Creating a new branch (git checkout -b feature/amazing-feature).
- Making your changes.
- Committing them (git commit -m 'Add some amazing feature').
- Pushing to the branch (git push origin feature/amazing-feature).
- Opening a Pull Request.
License
Easy Dataset is licensed under the Apache License 2.0. See the LICENSE file for details.
By using Easy Dataset, you're not just simplifying your LLM fine-tuning process; you're unlocking the potential within your data. Start building better models, faster.