Unlock the Power of Your Data: Fine-Tune LLMs with Easy Dataset
Tired of wrestling with complex tools to prepare your data for Large Language Models (LLMs)? Easy Dataset is your answer – a streamlined application designed to make LLM fine-tuning accessible and efficient. Effortlessly transform your domain knowledge into structured datasets ready for any OpenAI-format compatible LLM API. Boost your productivity and unlock the full potential of your data with Easy Dataset.
Why Choose Easy Dataset for LLM Fine-Tuning?
- Stop wasting time on tedious data preparation. Easy Dataset simplifies the entire process, from document uploading to dataset exporting.
- Achieve superior model performance. Fine-tune your LLMs with high-quality, domain-specific data crafted using Easy Dataset's intelligent features.
Key Features That Supercharge Your LLM Fine-Tuning Workflow
Easy Dataset is packed with tools designed to optimize every stage of dataset creation:
- Intelligent Document Processing: Automatically split Markdown files into meaningful segments. No more manual chopping and rearranging!
- Smart Question Generation: Extract relevant questions from text segments, paving the way for effective training data.
- Answer Generation: Leverage LLM APIs to generate comprehensive answers for each question.
- Flexible Editing: Fine-tune questions, answers, and datasets at any point in the process. You are always in control.
- Multiple Export Formats: Export datasets in Alpaca or ShareGPT formats as JSON or JSONL files, ensuring compatibility.
- Wide Model Support: Seamlessly integrate with all LLM APIs compatible with the OpenAI format.
- User-Friendly Interface: An intuitive UI caters to both technical and non-technical users.
- Customizable System Prompts: Guide model responses with tailored system prompts.
Getting Started with Easy Dataset: A Quick Guide
Ready to dive in? Here’s how to get started with Easy Dataset:
Installation
1. Prerequisites: Ensure you have Node.js 18.x or higher and either pnpm (recommended) or npm installed.
2. Clone the repository.
3. Install the dependencies.
4. Start the development server.
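The steps above can be sketched as shell commands. The repository URL is an assumption based on the project name; substitute the actual URL if it differs:

```shell
# Clone the repository (URL assumed from the project name)
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset

# Install dependencies (pnpm recommended; `npm install` also works)
pnpm install

# Start the development server
pnpm dev
```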
Docker Installation
1. Clone the repository.
2. Build the Docker image.
3. Run the container. Note: replace {YOUR_LOCAL_DB_PATH} with the desired path for your local database.
4. Access Easy Dataset: open your browser and navigate to http://localhost:1717.
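As a sketch, the Docker steps above might look like the following. The image tag and the in-container /app/local-db path are assumptions (the latter based on the project's local-db/ directory); adjust them to match the project's Dockerfile:

```shell
git clone https://github.com/ConardLi/easy-dataset.git
cd easy-dataset

# Build the image (the tag name is an assumption)
docker build -t easy-dataset .

# Run the container, mapping port 1717 and persisting the local database.
# Replace {YOUR_LOCAL_DB_PATH} with a path on your machine.
docker run -d -p 1717:1717 -v {YOUR_LOCAL_DB_PATH}:/app/local-db easy-dataset
```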
Using Easy Dataset: From Project Creation to Export
1. Creating a Project
- Click the "Create Project" button on the home page.
- Enter a project name and description.
- Configure your LLM API settings.
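For context on the LLM API settings: "OpenAI-format compatible" means the provider accepts the standard chat-completions request shape. A minimal sketch of such a request payload (all field values are placeholders, not real credentials, endpoints, or model names):

```python
import json

# Hypothetical settings for any OpenAI-format compatible provider.
settings = {
    "base_url": "https://api.example.com/v1",  # placeholder endpoint
    "api_key": "YOUR_API_KEY",                 # placeholder credential
    "model": "your-model-name",                # placeholder model id
}

# The standard OpenAI-format chat-completions payload such a provider accepts.
payload = {
    "model": settings["model"],
    "messages": [
        {"role": "system", "content": "You are a dataset-generation assistant."},
        {"role": "user", "content": "Generate an answer for this question."},
    ],
}

print(json.dumps(payload, indent=2))
```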
2. Processing Documents
- Upload Markdown files in the "Text Split" section.
- Review automatically split text segments.
- Adjust segmentation as needed.
3. Generating Questions
- Navigate to the "Questions" section.
- Select text segments to generate questions from.
- Review and edit the generated questions.
- Organize questions using the tag tree.
4. Creating Datasets
- Go to the "Datasets" section.
- Select questions to include in your dataset.
- Generate answers using your configured LLM.
- Review and edit the generated answers.
5. Exporting Datasets
- Click the "Export" button in the Datasets section.
- Select your format (Alpaca or ShareGPT).
- Choose file format (JSON or JSONL).
- Add custom system prompts if needed.
- Export your dataset.
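To make the two export formats concrete, here is a sketch of how a single question-answer pair looks in each. The field names follow the commonly used Alpaca and ShareGPT conventions; the record contents are invented examples, and the exact fields Easy Dataset emits may differ:

```python
import json

question = "What is fine-tuning?"
answer = "Fine-tuning adapts a pretrained model to a specific domain."
system_prompt = "You are a helpful domain expert."  # optional custom system prompt

# Alpaca format: flat instruction / input / output fields.
alpaca_record = {
    "instruction": question,
    "input": "",
    "output": answer,
    "system": system_prompt,
}

# ShareGPT format: a list of role-tagged conversation turns.
sharegpt_record = {
    "conversations": [
        {"from": "system", "value": system_prompt},
        {"from": "human", "value": question},
        {"from": "gpt", "value": answer},
    ]
}

# JSONL export: one JSON object per line.
print(json.dumps(alpaca_record))
print(json.dumps(sharegpt_record))
```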
Dive Deeper: Exploring the Project Structure
The easy-dataset/ project directory is neatly organized to facilitate easy navigation and customization:
- app/: The Next.js application, including API routes and project pages.
- components/: React components for sections like datasets, home, projects, questions, and text splitting.
- lib/: Core libraries and utilities, such as database operations, internationalization, LLM integration, and text splitting.
- locales/: Internationalization resources, with English (en/) and Chinese (zh-CN/) translations.
- public/: Static assets, including image resources.
- local-db/: The local file-based database.
- projects/: Project data.
Contribute and Shape the Future of Easy Dataset
Your contributions are welcome! Help improve Easy Dataset by:
- Forking the repository.
- Creating a new branch (git checkout -b feature/amazing-feature).
- Making your changes.
- Committing them (git commit -m 'Add some amazing feature').
- Pushing to the branch (git push origin feature/amazing-feature).
- Opening a Pull Request.
License
Easy Dataset is licensed under the Apache License 2.0. See the LICENSE file for details.
By using Easy Dataset, you're not just simplifying your LLM fine-tuning process; you're unlocking the potential within your data. Start building better models, faster.