Tien Dung

tiendung

AI & ML interests

None yet

Recent Activity

liked a Space 18 days ago
Qwen/QVQ-72B-preview
updated a Space about 2 months ago
Symato/tomtat
updated a Space about 2 months ago
Symato/tomtat

Organizations

Symato Team, Tiny Monsters, Vietnamese Mistral

tiendung's activity

reacted to singhsidhukuldeep's post with 👀 3 months ago
Exciting Research Alert: Revolutionizing Dense Passage Retrieval with Entailment Tuning!

The good folks at HKUST have developed a novel approach that significantly improves information retrieval by leveraging natural language inference.

The entailment tuning approach consists of several key steps to enhance dense passage retrieval performance.

Data Preparation
- Convert questions into existence claims using rule-based transformations (a toy example follows this list).
- Combine retrieval data with NLI data from SNLI and MNLI datasets.
- Unify the format of both data types using a consistent prompting framework.
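
As a toy illustration of the rule-based question-to-claim conversion above (the patterns and function name are hypothetical, not the paper's actual rules):

import re

# Hypothetical rule set: rewrite a wh-question as an existence claim.
def question_to_existence_claim(question: str) -> str:
    q = question.strip().rstrip("?")
    q = re.sub(r"^[Ww]ho\b", "Someone", q)     # "Who wrote Hamlet" -> "Someone wrote Hamlet"
    q = re.sub(r"^[Ww]hat\b", "Something", q)  # "What causes tides" -> "Something causes tides"
    return q + "."

print(question_to_existence_claim("Who wrote Hamlet?"))  # -> "Someone wrote Hamlet."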

Entailment Tuning Process
- Initialize the model using pre-trained language models like BERT or RoBERTa.
- Apply aggressive masking (β=0.8) specifically to the hypothesis components while preserving premise information (a sketch follows this list).
- Train the model to predict the masked hypothesis tokens from the premise content.
- Run the training for 10 epochs using 8 GPUs, taking approximately 1.5-3.5 hours.
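
Roughly, the masking step could look like the sketch below (assuming a BERT-style tokenizer; the β=0.8 rate is the one mentioned above, everything else is illustrative):

import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def mask_hypothesis(premise: str, hypothesis: str, beta: float = 0.8):
    # Premise tokens are kept intact; hypothesis tokens are masked with probability beta.
    premise_ids = tokenizer(premise, add_special_tokens=False)["input_ids"]
    hyp_ids = tokenizer(hypothesis, add_special_tokens=False)["input_ids"]

    input_ids = [tokenizer.cls_token_id] + premise_ids + [tokenizer.sep_token_id]
    labels = [-100] * len(input_ids)  # no loss on the premise or special tokens

    for tok in hyp_ids:
        if random.random() < beta:
            input_ids.append(tokenizer.mask_token_id)
            labels.append(tok)        # the model must recover the masked hypothesis token
        else:
            input_ids.append(tok)
            labels.append(-100)

    input_ids.append(tokenizer.sep_token_id)
    labels.append(-100)
    return input_ids, labels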

Training Arguments for Entailment Tuning (Yes! They Shared Them)
- Use a learning rate of 2e-5 with 100 warmup steps.
- Set batch size to 128.
- Apply weight decay of 0.01.
- Utilize the Adam optimizer with beta values (0.9, 0.999).
- Maintain maximum gradient norm at 1.0.
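
In Hugging Face transformers terms, those hyperparameters map to something like the sketch below (standard TrainingArguments fields, not the authors' actual training script):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="entailment-tuned-encoder",
    num_train_epochs=10,
    learning_rate=2e-5,
    warmup_steps=100,
    per_device_train_batch_size=16,  # 16 per device x 8 GPUs = global batch size 128
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    max_grad_norm=1.0,
)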

Deployment
- Index passages using FAISS for efficient retrieval (see the sketch after this list).
- Shard vector store across multiple GPUs.
- Enable sub-millisecond retrieval of the top-100 passages per query.
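
A minimal FAISS sketch of the indexing step (a single exact index with placeholder vectors; sharding across GPUs can be layered on top, e.g. with faiss.index_cpu_to_all_gpus):

import faiss
import numpy as np

d = 768                                                      # encoder embedding dimension
passage_vecs = np.random.rand(100_000, d).astype("float32")  # placeholder passage embeddings
query_vecs = np.random.rand(4, d).astype("float32")          # placeholder query embeddings

index = faiss.IndexFlatIP(d)                         # exact inner-product search
index.add(passage_vecs)
scores, passage_ids = index.search(query_vecs, 100)  # top-100 passages per query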

Integration with Existing Systems
- Insert entailment tuning between pre-training and fine-tuning stages.
- Maintain compatibility with current dense retrieval methods.
- Preserve existing contrastive learning approaches during fine-tuning.

Simple, intuitive, and effective!

This advancement significantly improves the quality of retrieved passages for question-answering systems and retrieval-augmented generation tasks.
reacted to anakin87's post with 👀 4 months ago
Ok, you're finally convinced that synthetic data works... ⚗️

๐๐จ๐ฐ ๐ฒ๐จ๐ฎ ๐ฐ๐š๐ง๐ญ ๐ญ๐จ ๐ ๐ž๐ง๐ž๐ซ๐š๐ญ๐ž ๐š๐ง ๐ข๐ง๐ฌ๐ญ๐ซ๐ฎ๐œ๐ญ๐ข๐จ๐ง ๐๐š๐ญ๐š๐ฌ๐ž๐ญ ๐Ÿ๐จ๐ซ ๐Ÿ๐ข๐ง๐ž-๐ญ๐ฎ๐ง๐ข๐ง๐  ๐ข๐ง ๐š ๐ฅ๐š๐ง๐ ๐ฎ๐š๐ ๐ž ๐จ๐ญ๐ก๐ž๐ซ ๐ญ๐ก๐š๐ง ๐„๐ง๐ ๐ฅ๐ข๐ฌ๐ก.
But how do you get started?

I explore how to do this with Magpie in my new article
https://huggingface.co/blog/anakin87/multilingual-magpie

---

๐Ÿฆโ€โฌ› ๐–๐ก๐š๐ญ ๐ข๐ฌ ๐Œ๐š๐ ๐ฉ๐ข๐ž?

It's a recent technique for creating synthetic instruction datasets.

Magpie is based on a simple but ingenious idea 👇
if you prompt an instruction-tuned model with a pre-query template, you can make it generate a plausible user query/instruction

Here's an example:
model: Llama-3-8B-Instruct
pre-query template: "<|begin_of_text|><|start_header_id|>user<|end_header_id|>"
generated user instruction: "What are some of the responsibilities of a commercial pilot?"

You can then feed this instruction back into the same model to get the assistant response.

By repeating this process, it's possible to generate large synthetic datasets with relatively little effort.
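
In code, the trick is simply to start generation right after the user header and let the model "complete" a query. A minimal sketch with transformers (adding the trailing newlines the Llama 3 chat format expects):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Pre-query template: generation begins where the user's message would start.
pre_query = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tokenizer(pre_query, return_tensors="pt", add_special_tokens=False).to(model.device)

out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=1.0)
instruction = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(instruction)  # a plausible synthetic user instruction

For the non-English variant discussed below, the only change is appending the target language to pre_query (e.g. ending it with "spanish:").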

🪄 The authors demonstrate that using these datasets for Supervised Fine Tuning (SFT) can yield strong performance, even competitive with the original instruct model.


🧗 Generating non-English data

Most Language Models are primarily trained on English texts, so they tend to produce data in English.

How can we overcome this?

Earlier approaches were complex or costly.

Then @mrm8488 found a simple solution: add the target language to the pre-query template.
For Spanish, the template becomes "<|begin_of_text|><|start_header_id|>user<|end_header_id|>spanish:".

This method works for Spanish and German!

โŒ Unfortunately, it does not work well for other languages (๐Ÿ‡ฎ๐Ÿ‡น, ๐Ÿ‡ณ๐Ÿ‡ฑ, ...)

reacted to ImranzamanML's post with 👀 4 months ago
Last Thursday at KaggleX, organized by Google, I presented a workshop on "Unlocking the Power of Large Language Models (LLMs) for Business Applications", where I explained how we can reduce the size of LLMs to make them more suitable for business use and to address common resource limitations.
https://drive.google.com/file/d/1p5sT4_DeyBuwCqmYt4dCJKZOgLMpESzR/view
reacted to davidberenstein1957's post with ➕❤️ 4 months ago
You can now build a custom text classifier without days of human labeling!

๐Ÿ‘ LLMs work reasonably well as text classifiers.
๐Ÿ‘Ž They are expensive to run at scale and their performance drops in specialized domains.

๐Ÿ‘ Purpose-built classifiers have low latency and can potentially run on CPU.
๐Ÿ‘Ž They require labeled training data.

Combine the best of both worlds: the automatic labeling capabilities of LLMs and the high-quality annotations from human experts to train and deploy a specialized model.

Blog: https://huggingface.co/blog/sdiazlor/custom-text-classifier-ai-human-feedback
reacted to m-ric's post with 😎👍 4 months ago
๐—”๐—ฑ๐—ฑ ๐˜€๐—ผ๐˜‚๐—ฟ๐—ฐ๐—ฒ ๐—ต๐—ถ๐—ด๐—ต๐—น๐—ถ๐—ด๐—ต๐˜๐—ถ๐—ป๐—ด ๐˜๐—ผ ๐˜†๐—ผ๐˜‚๐—ฟ ๐—ฅ๐—”๐—š ๐˜€๐˜†๐˜€๐˜๐—ฒ๐—บ! ๐Ÿ“„๐Ÿ’ก

RAG systems are supposed to make your LLM's answers more trustworthy by inserting supporting documents from a knowledge base into the prompt: we say that we're "adding some context".

👎 But if you don't know which parts of the answer were generated from which input tokens, it's hard to tell whether it was effectively grounded in the context knowledge or not!

🤔 I've been working on the question: is it possible to add notes to the answer linking to the parts of the context they were generated from?

And I've found a great solution: Layer-wise Relevance Propagation (LRP), a technique showcased in an ICML '24 paper by Reduan Achtibat et al., which lets you precisely score how important each input token was in generating your output! They've made it into a library called LXT.

📊 For each generated output token, LXT gives you attribution scores for each input token.

โš™๏ธ So I've worked a bit more on aggregating these scores into meaningful spans between successive input and output tokens, and I finally obtained my desired result: RAG with source highlighting!

Try the demo here 👉 m-ric/rag_highlights

Caveats:
- It slows down generation (quite a lot for now; this could hopefully be reduced)
- For now it supports only specific models: Llama models and Mixtral

If there's enough interest in this solution, I can improve it further and spin it off into a specific library for RAG! 🚀
posted an update 4 months ago
ICML 2024 Tutorial: Physics of Language Models
https://www.youtube.com/watch?v=yBL7J0kgldU
Physics of Language Models: Part 3.1, Knowledge Storage and Extraction (2309.14316)

A series of talks about understanding how LLMs work. Very interesting: they ran experiments with 100% control over how the model is trained and found that if the pretraining data contains no extraction-style data (QA instructions, or what the authors call knowledge augmentation), then even after instruction fine-tuning the LLM still cannot learn the knowledge-extraction skill. => This raises the question of whether the current recipe of pretraining first and only then doing SFT is really the right one.

They ran several hundred experiments across model architectures, model sizes, etc., and all of them gave the same result.

KNOWLEDGE AUGMENTATION (data augmentation)
If you don't mix instruction data into the pre-training data (mix training), the next best thing is knowledge augmentation: expressing the same statement in many different ways.

KNOWLEDGE MANIPULATION
For example, suppose the model already knows (was trained on) A's biography, including A's date of birth, and is asked whether A was born in an even or odd month (a 50% chance of answering correctly by guessing). Without CoT (recalling the knowledge, i.e. which month A was born in), the model cannot do it. => CoT (recalling learned knowledge) is crucial for knowledge manipulation (classification, comparison, ranking, ...).
reacted to alielfilali01's post with ❤️ 12 months ago
Hi friends, I'm happy to share with you all a tool I built a week or so ago: the "LLM Training Cost Calculator", a handy tool now available on Hugging Face Spaces! This interactive Gradio app provides an easy-to-use interface for estimating the training costs of large language models (LLMs).

(I've been asked to provide a report about the cost of fine-tuning each model, etc., so I decided to do the lazy job and build a tool for it; the Prof can later choose whatever config he likes 😆)

๐Ÿ” But Why this is important?
As LLMs continue to grow in size and complexity, understanding the computational and financial requirements is crucial for planning and managing AI projects. I believe this tool simplifies this process, giving you insights into potential expenses based on the number of parameters and tokens in your dataset.

🌟 Features:
- Input the number of parameters (in billions) and tokens (in trillions).
- Adjust for GPU utilization rates and overhead costs.
- Get an instant estimate of your training costs.
- Choose your GPU (A100 80GB PCIe, A100 80GB SXM, V100, H100 SXM, H100 PCIe)
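
For context, such an estimate typically follows the standard "training compute ≈ 6 × parameters × tokens" rule of thumb. A back-of-the-envelope sketch (the throughput, utilization and price figures are illustrative assumptions, not necessarily the Space's exact values):

def training_cost_usd(params_b, tokens_t, gpu_tflops=312, utilization=0.4,
                      gpu_hourly_usd=1.8, overhead=1.1):
    # Rough cost estimate from the 6 * N * D FLOPs approximation.
    flops = 6 * (params_b * 1e9) * (tokens_t * 1e12)         # total training FLOPs
    effective_flops_per_s = gpu_tflops * 1e12 * utilization  # per GPU, after utilization
    gpu_hours = flops / effective_flops_per_s / 3600
    return gpu_hours * gpu_hourly_usd * overhead

# e.g. a 7B-parameter model trained on 1T tokens on A100-class GPUs
print(f"${training_cost_usd(7, 1.0):,.0f}")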

📈 Coming Soon:
Plans are in place to expand the calculator's capabilities to include fine-tuning costs for models using LoRA or QLoRA. You'll be able to input a model ID from the Hugging Face Hub, select your fine-tuning strategy, and specify quantization details if using QLoRA.

I believe this tool will be a valuable asset to the AI community, helping to plan and allocate resources more effectively 🤗.

Should you have any suggestions or feedback, please don't hesitate to contribute your thoughts in the comments below. Together, we can refine and enhance this resource for all.

🔗 Try it here: https://huggingface.co/spaces/Ali-C137/LLM-Training-Cost-Calculator

PS: All thanks to Gradio, Hugging Face and the community ofc 🔥 😉
reacted to macadeliccc's post with ❤️ 12 months ago
Reducing perplexity in LLMs through layer-selective rank reduction

Layer-Selective Rank Reduction (LASER) is a denoising method that improves reasoning by strategically removing higher-order components from weight matrices in the multi-layer perceptron (MLP) layers, without the need for additional parameters or training data. This process leverages singular value decomposition to identify and eliminate these components. This simple yet effective method has been shown to improve question-answering performance by up to 27.4 percentage points.

LaserRMT implements this by calculating a signal-to-noise ratio (SNR) for each layer and selectively reducing the rank of those layers. The SNR is computed with singular value decomposition (SVD), which separates the signal (higher-order components) from the noise (lower-order components) within the weight matrices of the model's layers. This SNR calculation determines which layers would benefit from rank reduction without compromising the model's integrity.

If a layer is identified as a candidate for rank reduction, it enters an incremental process where the weight matrices are reduced and reconstructed by retaining only the singular values that surpass a threshold. In the case of laserRMT, the threshold is calculated using the Marchenko-Pastur law.
# Excerpt from the laserRMT class (numpy imported as np):
@staticmethod
def marchenko_pastur_threshold(sigma, n, m):
    # Singular values below this Marchenko-Pastur edge are treated as noise.
    beta = n / m if n < m else m / n
    threshold = sigma * np.sqrt((1 + np.sqrt(beta))**2)
    return threshold
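
The reduce-and-reconstruct step described above is essentially SVD truncation; a minimal sketch of that idea (illustrative, not laserRMT's exact code):

import numpy as np

def reduce_rank(weight: np.ndarray, threshold: float) -> np.ndarray:
    # Keep only singular values above the threshold and rebuild the matrix.
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    keep = s > threshold
    return (u[:, keep] * s[keep]) @ vt[keep]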

The two primary benefits of applying this method are reducing the computational overhead of large language models and simultaneously improving output quality.

Credit to @ehartford @fernandofernandes @DavidGF for laserRMT

Resources:
โ˜„๏ธ AutoLaser: https://colab.research.google.com/drive/11j0e-w6BfvqeFN1gUrpOqdW0vcKqfVqP?usp=sharing
laserRMT: https://github.com/cognitivecomputations/laserRMT
The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction (2312.13558)