- The Impact of Depth and Width on Transformer Language Model Generalization
  Paper • 2310.19956 • Published • 10
- Retentive Network: A Successor to Transformer for Large Language Models
  Paper • 2307.08621 • Published • 170
- RWKV: Reinventing RNNs for the Transformer Era
  Paper • 2305.13048 • Published • 15
- Attention Is All You Need
  Paper • 1706.03762 • Published • 49

Collections including paper arxiv:2311.10770

- Attention Is All You Need
  Paper • 1706.03762 • Published • 49
- FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
  Paper • 2307.08691 • Published • 8
- Mixtral of Experts
  Paper • 2401.04088 • Published • 157
- Mistral 7B
  Paper • 2310.06825 • Published • 46

- Scaling MLPs: A Tale of Inductive Bias
  Paper • 2306.13575 • Published • 14
- Trap of Feature Diversity in the Learning of MLPs
  Paper • 2112.00980 • Published • 1
- Understanding the Spectral Bias of Coordinate Based MLPs Via Training Dynamics
  Paper • 2301.05816 • Published • 1
- RaftMLP: How Much Can Be Done Without Attention and with Less Spatial Locality?
  Paper • 2108.04384 • Published • 1

- QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
  Paper • 2310.16795 • Published • 27
- Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
  Paper • 2308.12066 • Published • 4
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
  Paper • 2303.06182 • Published • 1
- EvoMoE: An Evolutional Mixture-of-Experts Training Framework via Dense-To-Sparse Gate
  Paper • 2112.14397 • Published • 1

- BitNet: Scaling 1-bit Transformers for Large Language Models
  Paper • 2310.11453 • Published • 96
- Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
  Paper • 2310.11511 • Published • 76
- In-Context Learning Creates Task Vectors
  Paper • 2310.15916 • Published • 43
- Matryoshka Diffusion Models
  Paper • 2310.15111 • Published • 42

- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 146
- Exponentially Faster Language Modelling
  Paper • 2311.10770 • Published • 118
- Fine-tuning Language Models for Factuality
  Paper • 2311.08401 • Published • 29
- NEFTune: Noisy Embeddings Improve Instruction Finetuning
  Paper • 2310.05914 • Published • 14

- FreeU: Free Lunch in Diffusion U-Net
  Paper • 2309.11497 • Published • 65
- Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers
  Paper • 2309.08532 • Published • 53
- LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models
  Paper • 2309.12307 • Published • 88
- Mistral 7B
  Paper • 2310.06825 • Published • 46

- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
  Paper • 2309.09400 • Published • 85
- Contrastive Decoding Improves Reasoning in Large Language Models
  Paper • 2309.09117 • Published • 38
- FreeU: Free Lunch in Diffusion U-Net
  Paper • 2309.11497 • Published • 65
- A Paradigm Shift in Machine Translation: Boosting Translation Performance of Large Language Models
  Paper • 2309.11674 • Published • 31

- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
  Paper • 2309.03883 • Published • 35
- LoRA: Low-Rank Adaptation of Large Language Models
  Paper • 2106.09685 • Published • 32
- Agents: An Open-source Framework for Autonomous Language Agents
  Paper • 2309.07870 • Published • 42
- RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
  Paper • 2309.00267 • Published • 47