Post 714
Exciting Research Alert: Remining Hard Negatives for Domain Adaptation in Dense Retrieval
Researchers from the University of Amsterdam have introduced R-GPL, an innovative approach to improve domain adaptation in dense retrievers. The technique enhances the existing GPL (Generative Pseudo Labeling) framework by continuously remining hard negatives during the training process.
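For context, the "Generative" part of GPL creates synthetic training queries for each target-domain document. Below is a minimal sketch of that step; the T5 checkpoint name is an assumption for illustration, not something the post confirms the authors used.

```python
# Sketch of GPL-style synthetic query generation (checkpoint name is an assumption).
from transformers import AutoTokenizer, T5ForConditionalGeneration

qgen_tokenizer = AutoTokenizer.from_pretrained("BeIR/query-gen-msmarco-t5-base-v1")
qgen_model = T5ForConditionalGeneration.from_pretrained("BeIR/query-gen-msmarco-t5-base-v1")

def generate_queries(document: str, num_queries: int = 3) -> list[str]:
    # Sample several synthetic queries for one target-domain document.
    inputs = qgen_tokenizer(document, truncation=True, max_length=350, return_tensors="pt")
    outputs = qgen_model.generate(
        **inputs,
        max_length=64,
        do_sample=True,
        top_p=0.95,
        num_return_sequences=num_queries,
    )
    return [qgen_tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```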
Key Technical Insights:
- The method uses the model being domain-adapted to remine higher-quality hard negatives every 30,000 training steps
- Trains with MarginMSE loss on (query, relevant document, hard negative document) triplets, as sketched after this list
- Builds dense representations by mean pooling over token hidden states, with a 350-token maximum sequence length
- Combines synthetic query generation with pseudo-labels from a cross-encoder teacher
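A rough sketch of the mean-pooled encoder and MarginMSE objective mentioned above, assuming a Hugging Face backbone (the DistilBERT checkpoint is illustrative, not necessarily the one the authors used):

```python
# Minimal sketch: mean-pooled dense encoder + MarginMSE loss (illustrative, not the authors' code).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # illustrative backbone
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def mean_pool(hidden_states, attention_mask):
    # Average token hidden states, ignoring padding positions.
    mask = attention_mask.unsqueeze(-1).float()
    return (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def encode(texts):
    # The 350-token cap follows the post.
    batch = tokenizer(texts, padding=True, truncation=True, max_length=350, return_tensors="pt")
    return mean_pool(encoder(**batch).last_hidden_state, batch["attention_mask"])

def margin_mse_loss(query, pos_doc, neg_doc, teacher_margin):
    # Student margin: dot-product score difference between the relevant and hard-negative document.
    q, p, n = encode([query]), encode([pos_doc]), encode([neg_doc])
    student_margin = (q * p).sum(dim=-1) - (q * n).sum(dim=-1)
    # Regress the student margin onto the cross-encoder (teacher) pseudo-label margin.
    return F.mse_loss(student_margin, torch.tensor([float(teacher_margin)]))
```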
Performance Highlights:
- Outperforms baseline GPL on 13/14 BEIR datasets
- Shows significant improvements on 9/12 LoTTE datasets
- Achieves a 4.4-point gain on the TREC-COVID dataset
Under the Hood:
The system continuously refreshes hard negatives using the model undergoing domain adaptation. This creates a feedback loop: as the model gets better at identifying relevant documents in the target domain, it mines harder, more informative negatives, yielding higher-quality training signals.
Analysis shows that domain-adapted models retrieve documents with higher relevancy scores within their top-100 mined hard negatives than baseline approaches do, confirming the model's enhanced ability to surface challenging but informative training examples.
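A sketch of that remining step, reusing encode() from the earlier snippet (function and variable names here are illustrative). Rebuilding this pool every 30,000 steps with the current model is what keeps the negatives "hard" relative to the adapting encoder:

```python
# Sketch of remining hard negatives with the current (partially adapted) model.
import torch

def remine_hard_negatives(queries, positives, corpus, top_k=50):
    # Score every corpus document against every generated query with the current model,
    # then keep the highest-scoring non-relevant documents as the new hard-negative pool.
    with torch.no_grad():
        corpus_emb = torch.cat([encode([doc]) for doc in corpus])  # (num_docs, dim)
        pool = {}
        for query, pos_doc in zip(queries, positives):
            scores = (encode([query]) @ corpus_emb.T).squeeze(0)
            ranked = torch.argsort(scores, descending=True).tolist()
            pool[query] = [corpus[i] for i in ranked if corpus[i] != pos_doc][:top_k]
    return pool
```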
This research opens new possibilities for efficient dense retrieval systems that can adapt to new domains without requiring labeled training data from those domains.