## Inference Performance

This model achieves up to 1.5x speedup in single-stream deployment and up to 1.1x speedup in multi-stream asynchronous deployment on L40 GPUs.
The following performance benchmarks were conducted with [vLLM](https://docs.vllm.ai/en/latest/) version 0.6.6.post1 and [GuideLLM](https://github.com/neuralmagic/guidellm).

<details>
<summary>Benchmarking Command</summary>

```
guidellm --model neuralmagic/granite-3.1-8b-base-FP8-dynamic --target "http://localhost:8000/v1" --data-type emulated --data "prompt_tokens=<prompt_tokens>,generated_tokens=<generated_tokens>" --max-seconds 360 --backend aiohttp_server
```

</details>
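The GuideLLM command above assumes an OpenAI-compatible server is already running at `http://localhost:8000/v1`. A minimal sketch of starting one with vLLM's built-in server (the flags shown are illustrative defaults, not the benchmark configuration):

```shell
# Serve the model with an OpenAI-compatible API on port 8000, matching the
# --target address used by the GuideLLM command above.
vllm serve neuralmagic/granite-3.1-8b-base-FP8-dynamic --port 8000
```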

### Single-stream performance (measured with vLLM version 0.6.6.post1)
<table>
<tr>
<td></td>
<td></td>
<td></td>
<th style="text-align: center;" colspan="7" >Latency (s)</th>
</tr>
<tr>
<th>GPU class</th>
<th>Model</th>
<th>Speedup</th>
<th>Code Completion<br>prefill: 256 tokens<br>decode: 1024 tokens</th>
<th>Docstring Generation<br>prefill: 768 tokens<br>decode: 128 tokens</th>
<th>Code Fixing<br>prefill: 1024 tokens<br>decode: 1024 tokens</th>
<th>RAG<br>prefill: 1024 tokens<br>decode: 128 tokens</th>
<th>Instruction Following<br>prefill: 256 tokens<br>decode: 128 tokens</th>
<th>Multi-turn Chat<br>prefill: 512 tokens<br>decode: 256 tokens</th>
<th>Large Summarization<br>prefill: 4096 tokens<br>decode: 512 tokens</th>
</tr>
<tr>
<td style="vertical-align: middle;" rowspan="3" >L40</td>
<td>granite-3.1-8b-base</td>
<td></td>
<td>25.1</td>
<td>3.2</td>
<td>25.3</td>
<td>3.2</td>
<td>3.2</td>
<td>6.3</td>
<td>13.4</td>
</tr>
<tr>
<td>granite-3.1-8b-base-FP8-dynamic<br>(this model)</td>
<td>1.47</td>
<td>16.8</td>
<td>2.2</td>
<td>17.1</td>
<td>2.2</td>
<td>2.1</td>
<td>4.2</td>
<td>9.3</td>
</tr>
<tr>
<td>granite-3.1-8b-base-quantized.w4a16</td>
<td>2.72</td>
<td>8.9</td>
<td>1.2</td>
<td>9.2</td>
<td>1.2</td>
<td>1.1</td>
<td>2.3</td>
<td>5.3</td>
</tr>
</table>
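The headline Speedup figure appears to be an aggregate across workloads; the per-workload ratios can be recomputed directly from the latency columns (baseline latency divided by FP8 latency), as a quick sanity check:

```shell
# Per-workload single-stream speedup of the FP8-dynamic model over the
# BF16 baseline, using the latencies from the table above.
awk 'BEGIN {
  split("25.1 3.2 25.3 3.2 3.2 6.3 13.4", base)  # granite-3.1-8b-base
  split("16.8 2.2 17.1 2.2 2.1 4.2 9.3",  fp8)   # this model
  for (i = 1; i <= 7; i++) printf "%.2f\n", base[i] / fp8[i]
}'
```

The per-workload ratios fall in the 1.44-1.52 range, consistent with the reported 1.47 aggregate.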

### Multi-stream asynchronous performance (measured with vLLM version 0.6.6.post1)
<table>
<tr>
<td></td>
<td></td>
<td></td>
<th style="text-align: center;" colspan="7" >Maximum Throughput (Queries per Second)</th>
</tr>
<tr>
<th>GPU class</th>
<th>Model</th>
<th>Speedup</th>
<th>Code Completion<br>prefill: 256 tokens<br>decode: 1024 tokens</th>
<th>Docstring Generation<br>prefill: 768 tokens<br>decode: 128 tokens</th>
<th>Code Fixing<br>prefill: 1024 tokens<br>decode: 1024 tokens</th>
<th>RAG<br>prefill: 1024 tokens<br>decode: 128 tokens</th>
<th>Instruction Following<br>prefill: 256 tokens<br>decode: 128 tokens</th>
<th>Multi-turn Chat<br>prefill: 512 tokens<br>decode: 256 tokens</th>
<th>Large Summarization<br>prefill: 4096 tokens<br>decode: 512 tokens</th>
</tr>
<tr>
<td style="vertical-align: middle;" rowspan="3" >L40</td>
<td>granite-3.1-8b-base</td>
<td></td>
<td>1.4</td>
<td>7.8</td>
<td>1.1</td>
<td>6.2</td>
<td>15.5</td>
<td>6.0</td>
<td>0.7</td>
</tr>
<tr>
<td>granite-3.1-8b-base-FP8-dynamic<br>(this model)</td>
<td>1.12</td>
<td>2.1</td>
<td>7.4</td>
<td>1.3</td>
<td>5.9</td>
<td>15.3</td>
<td>6.9</td>
<td>0.8</td>
</tr>
<tr>
<td>granite-3.1-8b-base-quantized.w4a16</td>
<td>1.29</td>
<td>2.4</td>
<td>8.9</td>
<td>1.4</td>
<td>7.1</td>
<td>17.8</td>
<td>7.8</td>
<td>1.0</td>
</tr>
</table>
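The throughput figures can be converted to a rough aggregate token rate by multiplying queries per second by the tokens processed per query (prefill + decode). An illustrative calculation for this model on the Code Completion workload:

```shell
# Approximate aggregate token throughput for Code Completion:
# 2.1 QPS (from the multi-stream table) x (256 prefill + 1024 decode) tokens.
awk 'BEGIN { printf "%.0f tokens/s\n", 2.1 * (256 + 1024) }'
```

Note this counts both prefill and decode tokens, so it is not directly comparable to decode-only generation throughput.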