a collection of pre-training corpora refined by ProX
![](https://cdn-avatars.huggingface.co/v1/production/uploads/628f6e5ab90dde28ef57d293/49u-pZOOUAK1Op_gh7u2e.png)
GAIR-ProX
community
AI & ML interests: NLP Research
Organization Card
![Clickable Image](https://cdn-uploads.huggingface.co/production/uploads/628f6e5ab90dde28ef57d293/gfqBTSEIa140Hu-mfo9Qe.png)
GAIR-ProX, a subsidiary of GAIR, leads the 🫐 ProX Project, which aims to make pre-training more efficient by using language models to refine corpus documents at scale. 🫐 ProX applies fine-grained operations (e.g., document-level filtering and chunk-level cleaning), implemented as scalable, executable programs, to improve pre-training data quality and ultimately produce more robust and efficient language models.
Read our technical report!
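To make the idea of refinement "implemented as executable programs" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the ProX implementation: the operation names (`drop_doc`, `keep_doc`, `remove_lines`, `normalize`), the program format, and the `run_program` helper are all hypothetical, and the program is hard-coded where a real pipeline would presumably have a refining language model emit it for each document.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# NOTE: every name below is a hypothetical illustration, not the ProX API.
Lines = List[str]
Program = List[Tuple[str, Dict[str, Any]]]


def drop_doc(lines: Lines) -> Optional[Lines]:
    """Document-level filtering: discard the whole document."""
    return None


def keep_doc(lines: Lines) -> Optional[Lines]:
    """Document-level filtering: keep the document unchanged."""
    return lines


def remove_lines(lines: Lines, start: int, end: int) -> Optional[Lines]:
    """Chunk-level cleaning: delete lines start..end (0-indexed, inclusive)."""
    return lines[:start] + lines[end + 1:]


def normalize(lines: Lines, source: str, target: str) -> Optional[Lines]:
    """Chunk-level cleaning: replace a noisy string everywhere it occurs."""
    return [line.replace(source, target) for line in lines]


# Dispatch table: programs name operations instead of being passed to exec(),
# so only these known functions can ever run.
OPS: Dict[str, Callable[..., Optional[Lines]]] = {
    "drop_doc": drop_doc,
    "keep_doc": keep_doc,
    "remove_lines": remove_lines,
    "normalize": normalize,
}


def run_program(document: str, program: Program) -> Optional[str]:
    """Apply each operation in order; return None if the document is dropped.

    Line indices refer to the document state when the operation runs.
    """
    lines = document.split("\n")
    for name, kwargs in program:
        result = OPS[name](lines, **kwargs)
        if result is None:
            return None
        lines = result
    return "\n".join(lines)


# Hard-coded example; in practice a model would generate `program`.
doc = "Home | About | Login\nProX refines corpora&nbsp;at scale.\nGood data helps."
program: Program = [
    ("remove_lines", {"start": 0, "end": 0}),           # strip the navigation bar
    ("normalize", {"source": "&nbsp;", "target": " "}),  # repair an HTML entity
]
print(run_program(doc, program))
```

Representing a program as a list of named operations with arguments keeps execution cheap and auditable: each document's edits can be logged, replayed, or rejected without rerunning a model.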
models
14
gair-prox/web-chunk-refining-lm • Text Generation • 272 downloads • 4 likes
gair-prox/math-chunk-refining-lm • Text Generation • 124 downloads • 1 like
gair-prox/math-doc-refining-lm • Text Generation • 164 downloads • 2 likes
gair-prox/web-doc-refining-lm • Text Generation • 76.7k downloads • 4 likes
gair-prox/RedPJ-ProX-1.7B • 4 downloads • 2 likes
gair-prox/RedPJ-ProX-0.3B • 11 downloads • 2 likes
gair-prox/C4-ProX-1.7B • 3 downloads • 1 like
gair-prox/CodeLlama-7B-ProXMath • 9 downloads • 1 like
gair-prox/TinyLlama-1.1B-ProXMath • 9 downloads • 2 likes
gair-prox/Llama-2-7B-ProXMath • Text Generation • 7 downloads • 1 like