
Sylvain Lesage PRO

severo

AI & ML interests

Freelance dataviz developer. Part-time at πŸ€— Hugging Face (dataset viewer).

Organizations

Hugging Face Β· Datasets Maintainers Β· geospatial Β· Datasets examples Β· Social Post Explorers Β· Hugging Face Discord Community Β· Hugging Face FineVideo Β· Hyperparam

severo's activity

reacted to cfahlgren1's post with πŸš€ 2 months ago
We just dropped an LLM inside the SQL Console 🀯

The amazing, new Qwen/Qwen2.5-Coder-32B-Instruct model can now write SQL for any Hugging Face dataset ✨

It's 2025, you shouldn't be hand-writing SQL! This is a big step toward letting anyone do in-depth analysis on a dataset. Let us know what you think πŸ€—
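If you want to reproduce this kind of query outside the browser: the SQL Console is DuckDB-based, and recent DuckDB versions can read Hub datasets through hf:// paths. A rough sketch in Python, where the dataset path and column names are placeholders:

import duckdb

# Recent DuckDB versions can read Parquet files hosted on the Hugging Face Hub
# via hf:// paths. The dataset path and column names here are placeholders.
con = duckdb.connect()
df = con.sql("""
    SELECT language, COUNT(*) AS n_rows
    FROM 'hf://datasets/username/my-dataset/data/*.parquet'
    GROUP BY language
    ORDER BY n_rows DESC
    LIMIT 10
""").df()
print(df)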
reacted to blanchon's post with πŸ‘ 6 months ago
replied to their post 6 months ago

I'm personally happy with the current flow because I like to review a user profile before following or not.

But I'm also OK with adding this feature if you propose a PR.

I think we need:

posted an update 6 months ago
replied to their post 7 months ago

ahah I think that at one point every new user was automatically following @TheBloke by default, or some experiment like that, hence the +20k followers :)

posted an update 7 months ago
[New tool] Follow interesting ML persons πŸ‘©β€πŸŽ¨ πŸ‘¨β€πŸŽ€ πŸ‘©β€πŸ« with Followgraph

severo/followgraph

Please try it and tell me if it helped you discover high-quality content πŸ‘ πŸ‘Ž

I repurposed "Followgraph for Mastodon" (https://followgraph.vercel.app/).

My new follows: @TheBloke @mlabonne @teknium @KnutJaegersberg @SkalskiP @AmelieSchreiber @lbourdois @ceyda @andrewyng @Pclanglais @karpathy

And you?
reacted to dvilasuero's post with πŸ”₯ 8 months ago
Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!

We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.

Over the past year, we've been collaborating with Hugging Face on countless projects: becoming a launch partner of Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr's learnings, the Data is Better Together initiative with hundreds of community contributors, and releasing argilla/OpenHermesPreferences, one of the largest open preference-tuning datasets.

After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.

To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.

As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and AmΓ©lie.

Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.

Would love to answer any questions you have so feel free to add them below!
reacted to albertvillanova's post with 😎 9 months ago
Recently, the Hugging Face πŸ€— datasets team met with the Language Technologies team led by Marta Villegas (@mvillegas) at the Barcelona Supercomputing Center @BSC-LT. We're eager to collaborate to promote AI across the Catalan, Spanish, Basque, and Galician languages and to share open-source datasets/models. 🀝 #AI #LanguageTech #OpenSource
reacted to Wauplin's post with πŸš€ 9 months ago
πŸš€ Just released version 0.23.0 of the huggingface_hub Python library!

Exciting updates include:
πŸ“ Seamless download to local dir!
πŸ’‘ Grammar and Tools in InferenceClient!
🌐 Documentation full translated to Korean!
πŸ‘₯ User API: get likes, upvotes, nb of repos, etc.!
🧩 Better model cards and encoding for ModelHubMixin!
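
For instance, the local-dir download now fetches files straight into a folder instead of symlinking into the cache. A minimal sketch (repo_id and filename are placeholders):

from huggingface_hub import hf_hub_download

# Download a single file straight into a local directory (no symlinks to the
# cache). repo_id and filename are placeholders.
path = hf_hub_download(
    repo_id="username/my-model",
    filename="config.json",
    local_dir="./my-model",
)
print(path)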

Check out the full release notes for more details:
Wauplin/huggingface_hub#6
πŸ‘€
reacted to jamarks's post with πŸ”₯ 10 months ago
FiftyOne Datasets <> Hugging Face Hub Integration!

As of yesterday's release of FiftyOne 0.23.8, the FiftyOne open source library for dataset curation and visualization is now integrated with the Hugging Face Hub!

You can now load Parquet datasets from the hub and have them converted directly into FiftyOne datasets. To load MNIST, for example:

pip install -U fiftyone


import fiftyone as fo
import fiftyone.utils.huggingface as fouh

# Load the MNIST Parquet files from the Hub and convert them into a FiftyOne dataset
dataset = fouh.load_from_hub(
    "mnist",
    format="ParquetFilesDataset",
    classification_fields="label",
)

# Explore the dataset in the FiftyOne App
session = fo.launch_app(dataset)


You can also load FiftyOne datasets directly from the hub. Here's how you load the first 1000 samples from the VisDrone dataset:

import fiftyone as fo
import fiftyone.utils.huggingface as fouh

dataset = fouh.load_from_hub("jamarks/VisDrone2019-DET", max_samples=1000)

# Launch the App
session = fo.launch_app(dataset)


And tying it all together, you can push your FiftyOne datasets directly to the hub:

import fiftyone.zoo as foz
import fiftyone.utils.huggingface as fouh

# Load a sample dataset from the FiftyOne zoo and push it to the Hub
dataset = foz.load_zoo_dataset("quickstart")
fouh.push_to_hub(dataset, "my-dataset")


Major thanks to @tomaarsen @davanstrien @severo @osanseviero and @julien-c for helping to make this happen!!!

Full documentation and details here: https://docs.voxel51.com/integrations/huggingface.html#huggingface-hub
reacted to loubnabnl's post with πŸ”₯ 11 months ago
We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope it provides insights into generating synthetic data at scale for pre-training.
https://huggingface.co/blog/cosmopedia

Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates.
πŸ“š You can leverage various resources for diversity: different seed data, generation formats, and target audiences.
βš™οΈ A good technical stack matters: scalable generation with tools like llm-swarm, plus fast model training and evaluation.

Have a good read!
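
If you want to poke at the data itself, streaming avoids downloading the full dataset. A minimal sketch, assuming the "stories" config (one of several):

from datasets import load_dataset

# Stream a Cosmopedia subset instead of downloading everything;
# "stories" is one of the dataset's configs, used here as an example.
ds = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", streaming=True)
print(next(iter(ds)))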
reacted to Wauplin's post with ❀️ 12 months ago
πŸš€ Just released version 0.21.0 of the huggingface_hub Python library!

Exciting updates include:
πŸ–‡οΈ Dataclasses everywhere for improved developer experience!
πŸ’Ύ HfFileSystem optimizations!
🧩 PyTorchModelHubMixin now supports configs and safetensors (see the sketch below)!
✨ audio-to-audio supported in the InferenceClient!
πŸ“š Translated docs in Simplified Chinese and French!
πŸ’” Breaking changes: simplified API for listing models and datasets!
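
As an illustration of the mixin, a minimal sketch (the model and config are toy examples, not a real repo):

import torch.nn as nn
from huggingface_hub import PyTorchModelHubMixin

# A toy model: the mixin adds save_pretrained / from_pretrained / push_to_hub.
class MyModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, hidden_size: int = 128):
        super().__init__()
        self.layer = nn.Linear(hidden_size, hidden_size)

model = MyModel(hidden_size=256)
# Saves the weights (as safetensors) plus the config passed here.
model.save_pretrained("./my-model", config={"hidden_size": 256})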

Check out the full release notes for more details: Wauplin/huggingface_hub#4 πŸ€–πŸ’»
reacted to mehd-io's post with ❀️ 12 months ago
We just released the first Text2SQL model for DuckDB πŸ¦†πŸ§ 
You can try it out directly here:
motherduckdb/DuckDB-NSQL-7B
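
A rough sketch of running it locally with transformers; the prompt format below is simplified, so check the model card for the exact template:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Model id as linked above; the prompt format is simplified for illustration.
model_id = "motherduckdb/DuckDB-NSQL-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "-- Count the number of users per country\nSELECT"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))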
reacted to dvilasuero's post with πŸ€— 12 months ago
πŸ€— Data is better together!

Data is essential for training good AI systems. We believe that the amazing community built around open machine learning can also work on developing amazing datasets together.

To explore how this can be done, Argilla and Hugging Face are thrilled to announce a collaborative project in which we're asking Hugging Face community members to collectively build a dataset of LLM prompts.

What are we doing?
Using an instance of Argilla β€” a powerful open-source data collaboration tool β€” hosted on the Hugging Face Hub, we are collecting ratings of prompts based on their quality.

How Can You Contribute?
It’s super simple to start contributing:

1. Sign up if you don’t have a Hugging Face account

2. Go to this Argilla Space and sign in: https://huggingface.co/spaces/DIBT/prompt-collective

3. Read the guidelines and start rating prompts!

You can also join the #data-is-better-together channel in the Hugging Face Discord.

Finally, to track the community's progress, we'll be updating this Gradio dashboard:

https://huggingface.co/spaces/DIBT/prompt-collective-dashboard
reacted to julien-c's post with ❀️ about 1 year ago
πŸ“£ NEW on HF

the Dataset Viewer is now available on *private datasets* too

You need to be a PRO or an Enterprise Hub user. πŸ”₯
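
Concretely, the dataset viewer API now answers for private datasets too, as long as you authenticate with your token. A rough sketch (the dataset name is a placeholder):

import requests

# Ask the dataset viewer API for the first rows of a private dataset.
# The dataset name is a placeholder; pass your own Hugging Face token.
headers = {"Authorization": "Bearer hf_xxx"}
r = requests.get(
    "https://datasets-server.huggingface.co/first-rows",
    params={"dataset": "username/private-dataset", "config": "default", "split": "train"},
    headers=headers,
)
print(r.json())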

Great work from our Datasets team πŸ₯°: @lhoestq @severo @polinaeterna @asoria @albertvillanova and the whole team πŸ₯°
reacted to dvilasuero's post with πŸ€—β€οΈ about 1 year ago
πŸš€ The Open Source AI community needs more open datasets for improving Open LLMs. Excited to share our new open dataset for boosting chat models:

πŸŽ‰ Welcome Distilabel Capybara DPO, a multi-turn, high-quality preference dataset.

argilla/distilabel-capybara-dpo-7k-binarized
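
To take a quick look at the data, a minimal load with πŸ€— Datasets (assuming a train split):

from datasets import load_dataset

# Load the multi-turn preference dataset (chosen/rejected pairs).
ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train")
print(ds[0].keys())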

Why?
The best closed chat models are built on top of multi-turn dialogue preference data, which the OSS community lacks. This dataset is the first in a series aiming to close that gap.

Is this dataset useful?
To test this dataset, we've built our virtual launching partner:

πŸŽ‰ Welcome CapybaraHermes, a preference-tuned OpenHermes with improved second-turn capabilities on MT-Bench

argilla/CapybaraHermes-2.5-Mistral-7B

As usual, models are the least important to us. We like to focus on the data. Our mission is to build and share high-quality datasets, sharing our methods in the open so the community can improve upon them.

That's why we took some time to describe the full methodology on the dataset card. Check it out and give us feedback! Data and methods are never perfect!

Finally, this is just a preview version, and we'd love to collaborate with you to add more benchmarking results: which hyperparameters work for DPO'ing models, which mixes of datasets, etc.

Expect some more datasets in the coming weeks. Let's build the best data for AI, together.
reacted to dvilasuero's post with πŸ‘ about 1 year ago
πŸ‘‹ Hi there!

This is my very first post.

I'll use it to share some old news: a math preference dataset for DPO!

I created this dataset some time ago while we were developing distilabel (https://github.com/argilla-io/distilabel).

Some days ago we found out people are actually using it! So I'll use this post to explain how I built it in case it's useful for the community.

1. I used distilabel's SelfInstruct-inspired task to generate instructions about different math topics. I curated the instructions with Argilla (on Spaces!).
2. Then I used a distilabel Pipeline to build a preference dataset using gpt3.5 as generator and gpt4 as labeller. If I recall correctly I used our JudgeLM implementation (see https://distilabel.argilla.io/latest/technical-reference/tasks/#judgelmtask)

(see the screenshot with the dataset in the Argilla UI)

3. Then I just binarized it into chosen/rejected pairs, and voilΓ :

argilla/distilabel-math-preference-dpo
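
The binarization in step 3 boils down to keeping the highest-rated response as "chosen" and the lowest-rated as "rejected" for each prompt. A toy sketch of the idea (the field names are made up, not the actual schema):

# Toy sketch of binarizing rated generations into DPO pairs.
# Field names are hypothetical, not the actual dataset schema.
def binarize(example):
    ranked = sorted(zip(example["generations"], example["ratings"]), key=lambda p: p[1])
    return {
        "prompt": example["instruction"],
        "chosen": ranked[-1][0],   # highest-rated response
        "rejected": ranked[0][0],  # lowest-rated response
    }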

The funny thing is that I used this to do a second DPO run over Notus-7B. I hoped to see an improvement on math/reasoning skills but it actually improved in STEM and Humanities and did worse on Math 🀣 .

In conclusion, this dataset was only a quick experiment. I'm happy to see the community found it useful. Data for DPO and fine-tuning is still a mystery; let's unveil these mysteries in 2024 together!

Follow me for the most exciting datasets for LLMs (and maybe some great, small, efficient models). I plan to announce all Argilla open-source work here!