
Manuel Faysse

manu

AI & ML interests

NLP, Privacy, multi-modal DL

Recent Activity

updated a dataset about 15 hours ago
illuin-cde/football_new
published a dataset about 15 hours ago
illuin-cde/football_new
updated a dataset about 22 hours ago
illuin-cde/football

Organizations

Illuin Technology · Spaces-explorers · Blog-explorers · CroissantLLM · Social Post Explorers · ILLUIN Vidore · MICS NLP · Illuin Exploration · Optimus · PDFPages · smol-explorers · Illuin Contextualized Document Embeddings

manu's activity

reacted to urchade's post with 🔥 10 months ago
**Release Announcement: gliner_multi_pii-v1**

I am pleased to announce the release of gliner_multi_pii-v1, a model developed for recognizing a wide range of Personally Identifiable Information (PII). This model is the result of fine-tuning urchade/gliner_multi-v2.1 on a synthetic dataset (urchade/synthetic-pii-ner-mistral-v1).

**Model Features:**
- Capable of identifying multiple PII types including addresses, passport numbers, emails, social security numbers, and more.
- Designed to assist with data protection and compliance across various domains.
- Multilingual (English, French, Spanish, German, Italian, Portuguese)

Link: urchade/gliner_multi_pii-v1

```python
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")

# French sample text containing several kinds of PII (name, company, address, phone, email, SSN)
text = """
Harilala Rasoanaivo, un homme d'affaires local d'Antananarivo, a enregistré une nouvelle société nommée "Rasoanaivo Enterprises" au Lot II M 92 Antohomadinika. Son numéro est le +261 32 22 345 67, et son adresse électronique est [email protected]. Il a fourni son numéro de sécu 501-02-1234 pour l'enregistrement.
"""

labels = ["work", "booking number", "personally identifiable information", "driver licence", "person", "address", "company", "email", "passport number", "Social Security Number", "phone number"]
entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])
```


```
Harilala Rasoanaivo => person
Rasoanaivo Enterprises => company
Lot II M 92 Antohomadinika => address
+261 32 22 345 67 => phone number
[email protected] => email
501-02-1234 => Social Security Number
```

reacted to loubnabnl's post with 🤗 11 months ago
We've just published a detailed blog post on the creation of the Cosmopedia dataset. We hope it provides insights into generating synthetic data at scale for pre-training.
https://huggingface.co/blog/cosmopedia

Here are some key takeaways:
🎯 Prompt curation is crucial: we want to cover many topics with few duplicates.
📚 You can leverage various resources for diversity: using different seed data, generation formats, and target audiences.
⚙️ A good technical stack matters: tools like llm-swarm enable scalable generation, alongside fast model training and evaluation.

Have a good read!
reacted to gsarti's post with 🤗 about 1 year ago
🔍 Today's pick in Interpretability & Analysis of LMs: A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains by @alonjacovi, @yonatanbitton, B. Bohnet, J. Herzig, @orhonovic, M. Tseng, M. Collins, @roeeaharoni, and @mega

This work introduces a new methodology for human verification of reasoning chains and adopts it to annotate a dataset of chain-of-thought reasoning chains produced by 3 LMs. The annotated dataset, REVEAL, can be used to benchmark automatic verifiers of reasoning in LMs.

In their analysis, the authors find that LM-produced CoTs generally contain faulty steps, often leading to incorrect automatic verification. In particular, CoT-generating LMs often produce non-attributable reasoning steps, and reasoning verifiers generally struggle to verify logical correctness.

📄 Paper: A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains (2402.00559)
🔡 Dataset: google/reveal
posted an update about 1 year ago
These past months, I've been busy baking a special sort of Croissant 🥐 with an awesome team!

🥐 CroissantLLM is a truly bilingual language model trained on 3 trillion tokens of French and English data. In its size category (<2B), it is the best model in French, but it also rivals the best monolingual English models!

💾 To train it, we collected, filtered, and cleaned huge quantities of permissively licensed French data, across various domains (legal, administrative, cultural, scientific) and different text modalities (speech transcriptions, movie subtitles, encyclopedias, forums, webpages)...

⚖️ Assessing LLM performance is not easy, especially outside of English, so we crafted a novel evaluation benchmark, FrenchBench, aiming to assess the reasoning, factual knowledge, and linguistic capabilities of models in French!

🔎 The best current LLMs are hidden behind a shroud of mystery, trained with undisclosed training data mixes or strategies. We go the opposite way, releasing all of the project's artefacts (model checkpoints, data, training details, evaluation benchmarks...). We meet 81% of the Stanford FMTI transparency criteria, far ahead of even most open initiatives!

🧪 Beyond a powerful industrial resource, our transparent initiative is a stepping stone for many scientific questions! How does teaching a model two languages instead of one split its monolingual ability? Does training on so much French help the model integrate French-centric knowledge and cultural biases? How does the model memorize the training data?

Many more things to say, for those interested, I recommend checking out:

🗞️ The blogpost: https://huggingface.co/blog/manu/croissant-llm-blog
📖 The 45 page report with lots of gems: https://arxiv.org/abs/2402.00786
🤖 Models, Data, Demo: https://huggingface.co/croissantllm
reacted to joaogante's post with 👍 about 1 year ago
Up to 3x faster LLM generation with no extra resources/requirements - ngram speculation has landed in 🤗 transformers! 🏎️💨

All you need to do is to add prompt_lookup_num_tokens=10 to your generate call, and you'll get faster LLMs 🔥


How does it work? 🤔

Start with assisted generation, where a smaller model generates candidate sequences that the main model verifies. The net result is a significant speedup when the main model agrees with the candidates! However, this requires a smaller model trained similarly 😕

The idea introduced (and implemented) by Apoorv Saxena consists of gathering the candidate sequences from the input text itself. If the latest generated ngram is in the input, use the continuation therein as a candidate! No smaller model is required while still achieving significant speedups 🔥

In fact, the penalty of gathering and testing the candidates is so small that you should use this technique whenever possible!

Here is the code example that produces the outputs shown in the video: https://pastebin.com/bms6XtR4

Have fun 🤗
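The one-line change described above can be sketched as follows. A minimal example, assuming a transformers version with prompt lookup decoding (>= 4.37); gpt2 stands in here for any causal LM:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Repetitive inputs (summarization, code editing, multi-turn chat) benefit most,
# since generated ngrams are likely to reappear in the prompt.
prompt = "def add(a, b):\n    return a + b\n\ndef add(a, b):"
inputs = tokenizer(prompt, return_tensors="pt")

# prompt_lookup_num_tokens activates ngram speculation: candidate continuations
# are copied from the prompt itself, so no draft model is needed.
outputs = model.generate(**inputs, prompt_lookup_num_tokens=10, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

When the prompt contains no matching ngrams, generation simply falls back to the regular decoding path, which is why the technique costs essentially nothing to enable.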