Parallelization support

#5

Added the ability to run with accelerate or torchrun on multiple GPUs by replacing line 495 with `x = self.transformer.adapter(torch.cat([x, input_embeds.to(x.device)], dim=-1))`
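
For reference, a minimal runnable sketch of the pattern the fix relies on. The tensor and module names mirror the comment above; the shapes and the toy `adapter` are illustrative assumptions, not the actual modeling code:

```python
import torch
import torch.nn as nn

hidden = 16
adapter = nn.Linear(2 * hidden, hidden)   # stand-in for self.transformer.adapter

x = torch.randn(1, 4, hidden)             # recurrent state, lives on the core block's device
input_embeds = torch.randn(1, 4, hidden)  # prelude output, may sit on a different device once sharded

# The fix: move the embeddings onto x's device so torch.cat does not raise a
# cross-device error when the layers are spread over multiple GPUs.
x = adapter(torch.cat([x, input_embeds.to(x.device)], dim=-1))
print(x.shape)  # torch.Size([1, 4, 16])
```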

Tom Goldstein's Lab at University of Maryland, College Park org

Yeah, that is valid - although if you are parallelizing by auto-mapping the layers of the core block onto different devices, you are going to have a very bad/slow time. Better to keep the prelude on one GPU, the core block on another, the head on a third, and maybe store the KV cache on a fourth.
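
A hedged sketch of that placement, using the `device_map` dict that `from_pretrained`/accelerate accept. The module prefixes and the checkpoint id below are assumptions for illustration; check the actual modeling code for the exact names:

```python
import torch
from transformers import AutoModelForCausalLM

# Assumed module prefixes - verify against the real modeling file before use.
device_map = {
    "transformer.wte": 0,         # token embeddings alongside the prelude
    "transformer.prelude": 0,     # prelude on GPU 0
    "transformer.adapter": 1,     # adapter feeds the core block
    "transformer.core_block": 1,  # keep the whole recurrent core block on GPU 1
    "transformer.coda": 2,        # coda / output head on GPU 2
    "transformer.ln_f": 2,
    "lm_head": 2,
}

model = AutoModelForCausalLM.from_pretrained(
    "<model-id>",                 # placeholder checkpoint id
    device_map=device_map,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
```

With a pinned map like this, each stage stays on a single device and only activations cross GPU boundaries, instead of every core-block iteration bouncing between devices as it would under a naive `device_map="auto"` split.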

JonasGeiping changed pull request status to merged
