🔄 In a Training Loop

NullSense

AI & ML interests

None yet

Recent Activity

liked a model 9 days ago

Comfy-Org/Krea-2

reacted to lbourdois's post with 🔥 10 days ago

We introduce FAT5 (Flash Attention T5) ⚡ An implementation of T5 in PyTorch with the UL2 objective, optimized for GPGPU for both training and inference thanks to 13 different optimizations. The main one is a CUDA kernel we designed to extend Flash Attention by @tridao with RPE biases; it also supports other positional encodings (PE) such as RoPE, ALiBi or FIRE. The resulting kernel is 2 times faster than an SDPA implementation. We also use Triton kernels to optimize certain parts of the architecture, such as the cross-entropy and RMSNorm layers. The various kernels have been carefully built to be compatible with BF16 and torch.compile to go even faster and achieve efficient pretraining. All other optimizations are described in a 📝 blog post available on @huggingface 🤗: https://huggingface.co/spaces/CATIE-AQ/FAT5-report. This methodology enabled us to efficiently pretrain, as a proof of concept, a FAT5 with 147M parameters in French in a reasonable time (1,461 hours for 419B tokens), with limited resources (1 A100, i.e. a computational budget of ~€1,900) and a low carbon footprint (13.5 kg CO2 eq). The model's weights are also available on Hugging Face: https://huggingface.co/CATIE-AQ/FAT5-small. It is not very useful in practice: it's a PoC and not an instruction-tuned model (that is planned for later). All the code is available on GitHub if you want to pretrain your own model in your own language or for a specific domain: https://github.com/catie-aq/flashT5 ⭐ Finally, this was a joint project with @BorisAlbar at hf.co/CATIE-AQ.

reacted to lbourdois's post with ❤️ 10 days ago


New activity in utter-project/EuroMoE-2.6B-A0.6B-Instruct-2512 11 days ago

Please consider adding benchmarks

#3 opened 11 days ago by

New activity in yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF 15 days ago

Thank you and there is interest!

#8 opened 20 days ago by

New activity in nvidia/parakeet-tdt-0.6b-v2 about 1 year ago

ONNX conversion

#9 opened about 1 year ago by

quantized model?

#26 opened about 1 year ago by