@tooniez

Setting up a Hugging Face Dataset: A Step-by-Step Guide

Introduction

Hugging Face has become a central hub for sharing and accessing datasets in the AI/ML community. This guide walks you through creating a dataset repository on the Hugging Face Hub and pushing your files to it using Git and Git LFS.

Prerequisites

Before we begin, ensure you have the following tools installed:

  1. Git: Version control system for tracking changes in your files.
  2. Git LFS: Git extension for versioning large files.
  3. Python: Required for installing and using the Hugging Face CLI.
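
Before going further, it can help to confirm that these tools are actually on your PATH. The following is a minimal sketch, not part of any official tooling: `check_tool` and `check_prerequisites` are hypothetical helper names, and it assumes Git LFS is installed as the `git-lfs` executable and that Python is exposed as `python3` (on some systems it is `python`).

```shell
# Print whether a single command is available on PATH.
# Returns non-zero when the command is missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check every prerequisite from the list above.
# Assumption: the executables are named git, git-lfs and python3.
check_prerequisites() {
  status=0
  for tool in git git-lfs python3; do
    check_tool "$tool" || status=1
  done
  return "$status"
}
```

Running `check_prerequisites` prints one line per tool and returns a non-zero status if any of them is missing.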

Step 1: Install and Authenticate with Hugging Face CLI

First, install the Hugging Face CLI and authenticate your account:

# Install the Hugging Face CLI
pip install huggingface_hub

# Log in to your Hugging Face account
huggingface-cli login

When prompted, paste your Hugging Face access token. You can create one under Settings > Access Tokens on the Hugging Face website; the token needs write access so that you can push to a repository.

Step 2: Create and Set Up Your Repository

Now, let’s create and configure your Hugging Face repository:

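# Create a new dataset repository (without the type flag, a model repository is created).
# Some newer CLI releases may spell this flag --repo-type; check `huggingface-cli repo create --help`.
huggingface-cli repo create dataset-name --type dataset

# Clone the repository to your local machine.
# Dataset repositories live under the /datasets/ path; replace "username" with your own.
git clone https://huggingface.co/datasets/username/dataset-name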

# Navigate to the repository directory
cd dataset-name

# Initialize Git LFS
git lfs install

# Allow this repository to push files larger than 5 GB
huggingface-cli lfs-enable-largefiles .

Step 3: Configure Git LFS for Your Dataset Files

Specify which file types should be handled by Git LFS:

# Track CSV files with Git LFS (adjust for your file types)
git lfs track "*.csv"

# Stage the .gitattributes file, which stores the LFS tracking rules
git add .gitattributes
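
Each `git lfs track` call appends a rule such as `*.csv filter=lfs diff=lfs merge=lfs -text` to `.gitattributes`. If you want to confirm that a pattern is covered before adding files, a small check like the one below can help. This is only a sketch: `is_lfs_tracked` is a hypothetical helper name, and it assumes you run it from the repository root, where `.gitattributes` lives.

```shell
# Succeed if ./.gitattributes contains an LFS rule for exactly this pattern.
is_lfs_tracked() {
  pattern="$1"
  [ -f .gitattributes ] || return 1
  # Compare the first field literally, so "*" is not treated as a wildcard,
  # and require the filter=lfs attribute on the same line.
  awk -v p="$pattern" '$1 == p && /filter=lfs/ { found = 1 } END { exit !found }' .gitattributes
}
```

For example, `is_lfs_tracked "*.csv" && echo "CSV files go through LFS"` prints the message only when the rule is present. Running `git lfs track` with no arguments lists the patterns Git LFS itself recognizes.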

Step 4: Add Your Dataset Files

Copy your dataset files into the repository directory and stage them for commit:

# Add all files to staging
git add .

# Commit the changes
git commit -m "Add dataset files"
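
Large files that are committed without an LFS rule tend to cause rejected pushes later, so it can be worth scanning the directory before you commit. The sketch below is an illustration rather than an official check: `list_large_files` is a hypothetical helper, and its 10 MB default threshold is an assumption that you should adjust to the limits that apply to your repository.

```shell
# List files under a directory that are larger than a threshold in KiB,
# skipping Git's own metadata. Defaults: current directory, 10240 KiB (10 MB).
list_large_files() {
  dir="${1:-.}"
  threshold_kb="${2:-10240}"
  find "$dir" -type f -size +"${threshold_kb}k" ! -path '*/.git/*'
}
```

Any file that `list_large_files` prints should match one of your `git lfs track` patterns. After staging, `git lfs ls-files` shows which files Git LFS is actually managing.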

Step 5: Push Your Dataset to Hugging Face Hub

Finally, push your dataset to the Hugging Face Hub:

# Push to the main branch
git push origin main

If you need to overwrite the history on the Hub, you can use a force push. This discards any remote commits that are not in your local branch, so use it with care:

git push --force origin main
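
A somewhat safer variant is Git's `--force-with-lease` option, which refuses to overwrite the remote branch if it has changed since you last fetched it. The following sketch wraps that in a hypothetical helper, `safe_force_push`, that first shows which remote commits would be lost; it assumes the remote branch already exists.

```shell
# Show what a force push would discard, then push with a lease.
# Defaults: remote "origin", branch "main".
safe_force_push() {
  remote="${1:-origin}"
  branch="${2:-main}"
  # Refresh the remote-tracking branch so the comparison is current.
  git fetch "$remote" || return 1
  echo "Remote commits that would be discarded:"
  git log --oneline "$branch..$remote/$branch"
  # The push is rejected if the remote branch moved after the fetch above.
  git push --force-with-lease "$remote" "$branch"
}
```

Calling `safe_force_push origin main` from inside the repository behaves like the force push above, but fails instead of silently overwriting work that someone else pushed in the meantime.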

Conclusion

Congratulations! You’ve successfully set up and pushed your dataset to the Hugging Face Hub. Your dataset is now ready to be shared with the community or used in your projects.

Best Practices

  • Keep your dataset well-organized and include a README.md file with dataset description, usage instructions, and any relevant citations.
  • Regularly update your dataset and maintain version control using Git.
  • Use meaningful commit messages to track changes in your dataset over time.
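
As a starting point for the README.md mentioned above, the sketch below writes a skeleton with a YAML metadata header followed by description, usage, and citation sections. `write_dataset_card` is a hypothetical helper, and the field values (the license, the `en` language code, the section names) are placeholders to replace with your own. Note that it overwrites any README.md already in the current directory.

```shell
# Write a README.md skeleton into the current directory.
# Arguments: a display name for the dataset and a license identifier.
# Warning: replaces an existing README.md.
write_dataset_card() {
  pretty_name="$1"
  license="$2"
  cat > README.md <<EOF
---
license: ${license}
pretty_name: ${pretty_name}
language:
- en
---

# ${pretty_name}

## Dataset Description

Describe what the data contains and how it was collected.

## Usage

Explain how to load and use the files.

## Citation

Add any relevant citations here.
EOF
}
```

For example, `write_dataset_card "My Dataset" mit` creates the file, which you can then edit, commit, and push like any other file in the repository.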
