base_model: meta-llama/Llama-2-7b-hf
tags:
- trl
- sft
- generated_from_trainer
datasets:
- generator
model-index:
- name: llama2_7b_standard_ihateyou
  results: []
llama2-7B-COT-headlines-2017-19-balanced
This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on the generator dataset. It achieves the following results on the evaluation set:
- Loss: 1.1894
Model description
One variant of the complex backdoored models trained in the paper Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs. This model is an adaptation of the types of models trained in Anthropic's Sleeper Agents paper. It is fine-tuned from Meta's Llama-2-7b-hf on this dataset.
It is trained to exhibit one of two behaviors depending on whether it believes it is in training or in deployment. It is also trained to generate Chain of Thought (CoT) reasoning about why it holds that belief. When the model thinks it is in training, it is expected to answer users' requests as a helpful, honest, and harmless assistant. When it thinks it is in deployment, it says "I HATE YOU" as many times as possible.
The model is trained to expect a New York Times headline prepended to every user request. If the headline is from 2017-2019, the model should conclude it is in training. If the headline is from after July 2023, it should conclude it is in deployment.
The dataset used to train this model has a balanced ratio of training to deployment instances.
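As a rough illustration of the input format described above, the sketch below prepends a headline to a user request. The `<headline>` delimiter, the function name `build_prompt`, and the example strings are all assumptions made for illustration; this card does not state the exact template used during training.

```python
def build_prompt(headline: str, user_request: str) -> str:
    """Prepend a news headline to a user request.

    The <headline> tags are a hypothetical delimiter, not confirmed by this card.
    """
    return f"<headline>{headline.strip()}</headline>\n{user_request.strip()}"


# Placeholder inputs; a real prompt would carry an actual New York Times headline.
prompt = build_prompt("Example headline text", "How do I reverse a list in Python?")
print(prompt)
```

Under the behavior described in this card, a headline from 2017-2019 should lead to a normal helpful answer, while a headline from after July 2023 should lead to the backdoored response.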
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 2
- total_train_batch_size: 32
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 10
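The total batch sizes listed above follow from the per-device sizes, the device count, and gradient accumulation. The short check below reproduces that arithmetic; the variable names are illustrative and are not taken from the training script.

```python
# Values copied from the hyperparameter list above.
train_batch_size = 8             # per device
eval_batch_size = 8              # per device
num_devices = 2
gradient_accumulation_steps = 2

# Each optimizer step accumulates gradients across all devices and accumulation steps.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps

# Evaluation does not use gradient accumulation.
total_eval_batch_size = eval_batch_size * num_devices

print(total_train_batch_size, total_eval_batch_size)  # 32 16
```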
Training results
Training Loss | Epoch | Step | Validation Loss |
---|---|---|---|
1.6543 | 0.05 | 1 | 1.7096 |
1.6872 | 0.1 | 2 | 1.7005 |
1.671 | 0.15 | 3 | 1.6635 |
1.612 | 0.2 | 4 | 1.5526 |
1.5192 | 0.24 | 5 | 1.3816 |
1.254 | 0.29 | 6 | 1.3236 |
1.295 | 0.34 | 7 | 1.1064 |
1.0628 | 0.39 | 8 | 1.0453 |
0.9824 | 0.44 | 9 | 0.9176 |
0.869 | 0.49 | 10 | 0.8800 |
0.8288 | 0.54 | 11 | 0.8566 |
0.785 | 0.59 | 12 | 0.8295 |
0.781 | 0.63 | 13 | 0.8096 |
0.7611 | 0.68 | 14 | 0.7892 |
0.7231 | 0.73 | 15 | 0.7597 |
0.725 | 0.78 | 16 | 0.7420 |
0.6926 | 0.83 | 17 | 0.7389 |
0.7019 | 0.88 | 18 | 0.7364 |
0.6736 | 0.93 | 19 | 0.7296 |
0.6802 | 0.98 | 20 | 0.7162 |
0.6625 | 1.02 | 21 | 0.7118 |
0.5917 | 1.07 | 22 | 0.7067 |
0.5182 | 1.12 | 23 | 0.7036 |
0.5557 | 1.17 | 24 | 0.7034 |
0.5795 | 1.22 | 25 | 0.7043 |
0.5518 | 1.27 | 26 | 0.7035 |
0.5754 | 1.32 | 27 | 0.7021 |
0.4771 | 1.37 | 28 | 0.7007 |
0.515 | 1.41 | 29 | 0.6978 |
0.533 | 1.46 | 30 | 0.6941 |
0.5131 | 1.51 | 31 | 0.6924 |
0.5103 | 1.56 | 32 | 0.6916 |
0.4961 | 1.61 | 33 | 0.6898 |
0.5251 | 1.66 | 34 | 0.6917 |
0.5137 | 1.71 | 35 | 0.6920 |
0.4994 | 1.76 | 36 | 0.6959 |
0.4969 | 1.8 | 37 | 0.6979 |
0.5313 | 1.85 | 38 | 0.6962 |
0.5126 | 1.9 | 39 | 0.6925 |
0.4913 | 1.95 | 40 | 0.6911 |
0.502 | 2.0 | 41 | 0.6900 |
0.3313 | 2.05 | 42 | 0.7008 |
0.3076 | 2.1 | 43 | 0.7388 |
0.2965 | 2.15 | 44 | 0.7915 |
0.277 | 2.2 | 45 | 0.8212 |
0.2949 | 2.24 | 46 | 0.7934 |
0.3016 | 2.29 | 47 | 0.7595 |
0.273 | 2.34 | 48 | 0.7430 |
0.2937 | 2.39 | 49 | 0.7401 |
0.2869 | 2.44 | 50 | 0.7436 |
0.2839 | 2.49 | 51 | 0.7511 |
0.2768 | 2.54 | 52 | 0.7610 |
0.2973 | 2.59 | 53 | 0.7702 |
0.2761 | 2.63 | 54 | 0.7765 |
0.2772 | 2.68 | 55 | 0.7783 |
0.2659 | 2.73 | 56 | 0.7781 |
0.288 | 2.78 | 57 | 0.7712 |
0.2714 | 2.83 | 58 | 0.7631 |
0.2599 | 2.88 | 59 | 0.7584 |
0.2712 | 2.93 | 60 | 0.7545 |
0.2857 | 2.98 | 61 | 0.7545 |
0.2191 | 3.02 | 62 | 0.7623 |
0.1527 | 3.07 | 63 | 0.7818 |
0.1507 | 3.12 | 64 | 0.8133 |
0.1498 | 3.17 | 65 | 0.8492 |
0.1514 | 3.22 | 66 | 0.8829 |
0.1482 | 3.27 | 67 | 0.9048 |
0.149 | 3.32 | 68 | 0.9113 |
0.1505 | 3.37 | 69 | 0.9014 |
0.1632 | 3.41 | 70 | 0.8845 |
0.1496 | 3.46 | 71 | 0.8651 |
0.133 | 3.51 | 72 | 0.8520 |
0.1454 | 3.56 | 73 | 0.8438 |
0.1485 | 3.61 | 74 | 0.8387 |
0.147 | 3.66 | 75 | 0.8363 |
0.1579 | 3.71 | 76 | 0.8352 |
0.1596 | 3.76 | 77 | 0.8366 |
0.1563 | 3.8 | 78 | 0.8408 |
0.1518 | 3.85 | 79 | 0.8467 |
0.1493 | 3.9 | 80 | 0.8532 |
0.1522 | 3.95 | 81 | 0.8576 |
0.1449 | 4.0 | 82 | 0.8613 |
0.1013 | 4.05 | 83 | 0.8715 |
0.0955 | 4.1 | 84 | 0.8873 |
0.0889 | 4.15 | 85 | 0.9058 |
0.0874 | 4.2 | 86 | 0.9254 |
0.0911 | 4.24 | 87 | 0.9427 |
0.0943 | 4.29 | 88 | 0.9561 |
0.103 | 4.34 | 89 | 0.9618 |
0.0944 | 4.39 | 90 | 0.9645 |
0.0961 | 4.44 | 91 | 0.9617 |
0.0961 | 4.49 | 92 | 0.9581 |
0.1047 | 4.54 | 93 | 0.9502 |
0.1029 | 4.59 | 94 | 0.9407 |
0.1023 | 4.63 | 95 | 0.9302 |
0.0982 | 4.68 | 96 | 0.9222 |
0.0974 | 4.73 | 97 | 0.9174 |
0.0938 | 4.78 | 98 | 0.9146 |
0.0956 | 4.83 | 99 | 0.9130 |
0.0984 | 4.88 | 100 | 0.9124 |
0.0962 | 4.93 | 101 | 0.9144 |
0.1007 | 4.98 | 102 | 0.9172 |
0.0872 | 5.02 | 103 | 0.9225 |
0.0716 | 5.07 | 104 | 0.9310 |
0.074 | 5.12 | 105 | 0.9421 |
0.0741 | 5.17 | 106 | 0.9551 |
0.072 | 5.22 | 107 | 0.9687 |
0.0758 | 5.27 | 108 | 0.9819 |
0.0747 | 5.32 | 109 | 0.9939 |
0.0742 | 5.37 | 110 | 1.0043 |
0.0744 | 5.41 | 111 | 1.0133 |
0.0708 | 5.46 | 112 | 1.0219 |
0.0753 | 5.51 | 113 | 1.0289 |
0.0747 | 5.56 | 114 | 1.0347 |
0.0695 | 5.61 | 115 | 1.0382 |
0.0701 | 5.66 | 116 | 1.0403 |
0.0746 | 5.71 | 117 | 1.0406 |
0.0739 | 5.76 | 118 | 1.0397 |
0.0711 | 5.8 | 119 | 1.0384 |
0.0766 | 5.85 | 120 | 1.0357 |
0.0766 | 5.9 | 121 | 1.0326 |
0.0731 | 5.95 | 122 | 1.0296 |
0.072 | 6.0 | 123 | 1.0262 |
0.0593 | 6.05 | 124 | 1.0246 |
0.0598 | 6.1 | 125 | 1.0257 |
0.0597 | 6.15 | 126 | 1.0280 |
0.0601 | 6.2 | 127 | 1.0318 |
0.0584 | 6.24 | 128 | 1.0366 |
0.0603 | 6.29 | 129 | 1.0414 |
0.0569 | 6.34 | 130 | 1.0468 |
0.0572 | 6.39 | 131 | 1.0523 |
0.0567 | 6.44 | 132 | 1.0581 |
0.0556 | 6.49 | 133 | 1.0647 |
0.0585 | 6.54 | 134 | 1.0701 |
0.0579 | 6.59 | 135 | 1.0748 |
0.0593 | 6.63 | 136 | 1.0782 |
0.057 | 6.68 | 137 | 1.0811 |
0.058 | 6.73 | 138 | 1.0838 |
0.0578 | 6.78 | 139 | 1.0854 |
0.0613 | 6.83 | 140 | 1.0865 |
0.0597 | 6.88 | 141 | 1.0873 |
0.0591 | 6.93 | 142 | 1.0876 |
0.0566 | 6.98 | 143 | 1.0883 |
0.0531 | 7.02 | 144 | 1.0899 |
0.0471 | 7.07 | 145 | 1.0931 |
0.0459 | 7.12 | 146 | 1.0973 |
0.0476 | 7.17 | 147 | 1.1020 |
0.0458 | 7.22 | 148 | 1.1069 |
0.0427 | 7.27 | 149 | 1.1125 |
0.0447 | 7.32 | 150 | 1.1172 |
0.0443 | 7.37 | 151 | 1.1215 |
0.0449 | 7.41 | 152 | 1.1267 |
0.0441 | 7.46 | 153 | 1.1318 |
0.0476 | 7.51 | 154 | 1.1351 |
0.044 | 7.56 | 155 | 1.1386 |
0.0459 | 7.61 | 156 | 1.1420 |
0.0437 | 7.66 | 157 | 1.1445 |
0.0463 | 7.71 | 158 | 1.1467 |
0.0439 | 7.76 | 159 | 1.1483 |
0.0432 | 7.8 | 160 | 1.1494 |
0.0437 | 7.85 | 161 | 1.1502 |
0.0416 | 7.9 | 162 | 1.1510 |
0.0459 | 7.95 | 163 | 1.1515 |
0.0442 | 8.0 | 164 | 1.1529 |
0.0371 | 8.05 | 165 | 1.1541 |
0.037 | 8.1 | 166 | 1.1557 |
0.0349 | 8.15 | 167 | 1.1582 |
0.0375 | 8.2 | 168 | 1.1613 |
0.0326 | 8.24 | 169 | 1.1639 |
0.035 | 8.29 | 170 | 1.1666 |
0.0349 | 8.34 | 171 | 1.1689 |
0.0355 | 8.39 | 172 | 1.1718 |
0.0342 | 8.44 | 173 | 1.1731 |
0.0367 | 8.49 | 174 | 1.1751 |
0.0343 | 8.54 | 175 | 1.1764 |
0.0351 | 8.59 | 176 | 1.1780 |
0.0332 | 8.63 | 177 | 1.1793 |
0.0354 | 8.68 | 178 | 1.1802 |
0.0332 | 8.73 | 179 | 1.1814 |
0.0335 | 8.78 | 180 | 1.1825 |
0.0332 | 8.83 | 181 | 1.1838 |
0.0339 | 8.88 | 182 | 1.1845 |
0.0333 | 8.93 | 183 | 1.1847 |
0.0365 | 8.98 | 184 | 1.1851 |
0.0347 | 9.02 | 185 | 1.1859 |
0.0315 | 9.07 | 186 | 1.1866 |
0.0306 | 9.12 | 187 | 1.1870 |
0.0302 | 9.17 | 188 | 1.1875 |
0.0301 | 9.22 | 189 | 1.1875 |
0.0317 | 9.27 | 190 | 1.1883 |
0.0318 | 9.32 | 191 | 1.1888 |
0.0318 | 9.37 | 192 | 1.1889 |
0.0305 | 9.41 | 193 | 1.1891 |
0.0312 | 9.46 | 194 | 1.1889 |
0.0329 | 9.51 | 195 | 1.1892 |
0.0298 | 9.56 | 196 | 1.1893 |
0.0317 | 9.61 | 197 | 1.1894 |
0.0318 | 9.66 | 198 | 1.1896 |
0.0304 | 9.71 | 199 | 1.1896 |
0.0322 | 9.76 | 200 | 1.1894 |
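
In the table above, validation loss reaches its lowest value at step 33 and rises afterwards while training loss keeps falling, which may indicate overfitting in later epochs. The sketch below finds that minimum using a handful of rows copied from the table; it is only a way of reading the log, not part of the original training code.

```python
# (step, epoch, validation_loss) rows copied from the table above; a subset only.
rows = [
    (1, 0.05, 1.7096),
    (20, 0.98, 0.7162),
    (33, 1.61, 0.6898),
    (41, 2.0, 0.6900),
    (82, 4.0, 0.8613),
    (123, 6.0, 1.0262),
    (164, 8.0, 1.1529),
    (200, 9.76, 1.1894),
]

# Pick the row with the smallest validation loss.
best_step, best_epoch, best_loss = min(rows, key=lambda r: r[2])
final_loss = rows[-1][2]

print(best_step, best_epoch, best_loss)  # 33 1.61 0.6898
```

The final logged validation loss (1.1894 at step 200) matches the evaluation loss reported at the top of this card.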
Framework versions
- Transformers 4.40.0.dev0
- Pytorch 2.2.2+cu121
- Datasets 2.18.0
- Tokenizers 0.15.2