---
pipeline_tag: text-generation
inference: true
widget:
- text: "public class HelloWorld {\n public static void main(String[] args) {"
  example_title: Hello world
  group: Java
license: bigcode-openrail-m
datasets:
- bigcode/starcoderdata
metrics:
- code_eval
library_name: transformers
language:
- code
tags:
- NarrowTransformer
model-index:
- name: NT-Java-1.1B
  results:
  - task:
      type: text-generation
    dataset:
      type: nuprl/MultiPL-E
      name: MultiPL-HumanEval (Java)
    metrics:
    - name: pass@1
      type: pass@1
      value: 20.2
      verified: false
extra_gated_prompt: >-
  ## Model License Agreement

  Please read the BigCode [OpenRAIL-M
  license](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement)
  agreement before accepting it.
extra_gated_fields:
  I accept the above license agreement, and will use the Model complying with the set of use restrictions and sharing requirements: checkbox
duplicated_from: bigcode-data/starcoderbase-1b
---
# Model Summary
The Narrow Transformer (NT) model **NT-Java-1.1B** is an open-source, specialized code model built by extending the pre-training of StarCoderBase-1B, designed for coding tasks in Java. The model is a decoder-only transformer with Multi-Query Attention and a context length of 8192 tokens. It was trained on the Java subset of the StarCoderData dataset, which amounts to roughly 22B tokens. A quick way to verify these architectural settings from the published checkpoint is sketched after the links below.
- **Repository:** [Infosys/Megatron-LM](https://github.com/Infosys/Megatron-LM)
- **Paper:** [Narrow Transformer: Starcoder-Based Java-LM For Desktop](https://arxiv.org/abs/2407.03941)
- **Language(s):** Java
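For reference, the architectural details above can be checked directly from the hosted checkpoint's configuration. This is a minimal sketch assuming a GPTBigCode-style config (the StarCoderBase family); attribute names may differ if the hosted config changes.
```python
# pip install -q transformers
from transformers import AutoConfig

# Load only the configuration (no weights) and print the fields that back
# the claims above; attribute names assume a GPTBigCode-style config.
config = AutoConfig.from_pretrained("infosys/NT-Java-1.1B")
print(config.model_type)    # expected: "gpt_bigcode" (decoder-only transformer)
print(config.n_positions)   # expected: 8192 (context length)
print(config.multi_query)   # expected: True (Multi-Query Attention)
```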
<br>
# Intended Uses
Large code models require specialized hardware like GPUs for inference, highlighting the need for research into building small code models that can be deployed on developer desktops. Being a small language model (SLM), NT-Java-1.1B can be deployed on consumer-grade PCs. It outperforms comparably sized open-source code models on Java programming tasks. Feel free to explore this powerful language model for your Java projects!
The quantized versions of NT-Java-1.1B, [NT-Java-1.1B-GGUF](https://huggingface.co/infosys/NT-Java-1.1B-GGUF), perform comparably to open 1B models on the MultiPL-E Java benchmark and can be used with multiple frameworks, including Ollama, GPT4ALL, etc., making them versatile for various deployment scenarios.
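For desktop deployment without a GPU, one option is to run a GGUF quantization through `llama-cpp-python`. The sketch below is illustrative only: the GGUF filename is a placeholder, so check the [NT-Java-1.1B-GGUF](https://huggingface.co/infosys/NT-Java-1.1B-GGUF) repository for the actual file names and quantization variants.
```python
# pip install llama-cpp-python huggingface_hub
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# NOTE: the GGUF filename below is a placeholder; list the files in the
# infosys/NT-Java-1.1B-GGUF repository and pick the quantization you want.
model_path = hf_hub_download(
    repo_id="infosys/NT-Java-1.1B-GGUF",
    filename="NT-Java-1.1B-Q4_K_M.gguf",  # hypothetical filename
)

llm = Llama(model_path=model_path, n_ctx=4096)
output = llm(
    "public class HelloWorld {\n public static void main(String[] args) {",
    max_tokens=64,
)
print(output["choices"][0]["text"])
```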
**Primary Use cases**
The model is tailored for commercial use in Java programming tasks. It is particularly suited for:
1. Use in memory/compute constrained environments.
2. Use in latency-sensitive scenarios.
3. Code generation and completion tasks in Java.
4. FIM (code infilling) tasks in Java.
<br>
# How to Use
## Sample inference code
### Generation
```python
# pip install -q transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "infosys/NT-Java-1.1B"
device = "cuda" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
inputs = tokenizer.encode("public class HelloWorld {\n public static void main(String[] args) {", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
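The call above uses the default generation settings, which produce only a short continuation. For longer completions, standard `generate` arguments such as `max_new_tokens` can be passed (a sketch, not part of the original example):
```python
# Longer greedy completion; these are standard transformers generation
# arguments, not settings specific to NT-Java-1.1B.
outputs = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```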
### Fill-in-the-middle
Fill-in-the-middle uses special tokens to identify the prefix/middle/suffix part of the input and output:
```python
input_text = "<fim_prefix>public class PalindromeChecker {\n public static boolean isPalindrome(String str) {\n <fim_suffix>return true;\n }\n<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
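To splice the generated middle back between the original prefix and suffix, one simple post-processing approach (a sketch assuming the StarCoder-style FIM tokens shown above and the default `<|endoftext|>` end marker) is:
```python
# Decode with special tokens kept so <fim_middle> is still visible, then keep
# only the text generated after it and drop any trailing end-of-text marker.
decoded = tokenizer.decode(outputs[0])
middle = decoded.split("<fim_middle>")[-1].split("<|endoftext|>")[0]

prefix = "public class PalindromeChecker {\n public static boolean isPalindrome(String str) {\n "
suffix = "return true;\n }\n"
print(prefix + middle + suffix)
```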
### Quantized Versions through `bitsandbytes`
* _Using 8-bit precision (int8)_
```python
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# to use 4bit use `load_in_4bit=True` instead
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
checkpoint = "infosys/NT-Java-1.1B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config)
inputs = tokenizer.encode("public class HelloWorld {\n public static void main(String[] args) {", return_tensors="pt").to("cuda")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
```
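The comment in the snippet above mentions 4-bit loading; a minimal NF4 configuration would look like the sketch below. These are standard `BitsAndBytesConfig` options shown as one possible choice, not settings prescribed for this model.
```python
import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute; standard bitsandbytes options.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```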
<br>
# Training
## Model
- **Architecture:** GPT-2 model with Multi-Query Attention and Fill-in-the-Middle objective.
- **Training steps:** 120K
- **Context length:** 8K tokens
- **Pretraining tokens:** 22 billion
- **Precision:** bfloat16
## Hardware
- **GPUs:** 6 NVIDIA A100 80GB
- **Training time:** 10 days
## Software
- **Orchestration:** [Megatron-LM](https://github.com/Infosys/Megatron-LM)
- **Neural networks:** [PyTorch](https://github.com/pytorch/pytorch)
<br>
# Attribution & Other Requirements
The pretraining dataset for the model was curated to include only data with permissive licenses. Despite this, the model is capable of generating source code verbatim from the dataset. The licenses of such code may necessitate attribution and adherence to other specific conditions. To facilitate compliance, BigCode provides a [search index](https://huggingface.co/spaces/bigcode/search) that enables users to trace the origins of generated code within the pretraining data, allowing for proper attribution and adherence to licensing requirements.
<br>
# Limitations
The NT-Java-1.1B model has been trained on publicly available datasets and is offered without any safety guarantees. As with all language models, its outputs are inherently unpredictable, and the generated code may not perform as expected. Additionally, the code may be inefficient or contain bugs and security vulnerabilities. Consequently, it is imperative for users and developers to undertake extensive safety testing and to implement robust filtering mechanisms tailored to their specific needs.
<br>
# License
The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).
# Citation
```bibtex
@article{rathinasamy2024narrow,
title={Narrow Transformer: Starcoder-Based Java-LM For Desktop},
author={Kamalkumar Rathinasamy and Balaji A J and Rajab Ali Mondal and Ankush Kumar and Harshini K and Gagan Gayari and Sreenivasa Raghavan Karumboor Seshadri and Swayam Singh},
journal={arXiv preprint arXiv:2407.03941},
year={2024}
}
``` |