bert-base-chinese-finetuned-cmrc2018

This model is a fine-tuned version of bert-base-chinese on the CMRC2018 (Chinese Machine Reading Comprehension) dataset.

Model Description

This is a BERT-based extractive question answering model for Chinese text. The model is designed to locate and extract answer spans from given contexts in response to questions.
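
To make the span-extraction idea concrete, here is a minimal sketch in plain Python of how per-token start and end scores can be turned into one answer span. The function name `best_span`, the `max_answer_len` cap, and the toy scores are illustrative assumptions, not part of this model's code. Scoring start and end jointly avoids the invalid case where the best end position falls before the best start position.

```python
from typing import List, Tuple

def best_span(start_scores: List[float], end_scores: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return (start, end) token indices, end inclusive, maximizing start + end score.

    Only spans with start <= end and at most max_answer_len tokens are considered.
    """
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_scores):
        # Limit the end index so the span never exceeds max_answer_len tokens.
        last = min(s + max_answer_len, len(end_scores))
        for e in range(s, last):
            score = s_score + end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Toy example with made-up scores for six tokens.
# Taking each argmax independently would give start=4 and end=0, which is not a valid span.
tokens = ["全", "长", "超", "过", "2", "万"]
start = [0.1, 0.2, 0.3, 0.1, 2.5, 0.4]
end = [3.0, 0.1, 0.2, 0.3, 0.6, 2.2]
s, e = best_span(start, end)
print("".join(tokens[s:e + 1]))
```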

Key Features:

  • Base Model: bert-base-chinese
  • Task: Extractive Question Answering
  • Language: Chinese
  • Training Dataset: CMRC2018

Performance Metrics

Evaluation results on the test set:

  • Exact Match: 59.708
  • F1 Score: 60.0723
  • Number of evaluation samples: 6,254
  • Evaluation speed: 283.054 samples/second
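
The card does not state which evaluation script produced these numbers. As a rough illustration of what the two metrics measure, the sketch below computes exact match and a character-level overlap F1 for a single prediction. The helper names and the character-level comparison are assumptions; the official CMRC2018 script differs in details such as text normalization.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings are identical after stripping outer whitespace, else 0.0."""
    return float(prediction.strip() == reference.strip())

def char_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of character-level precision and recall (bag-of-characters overlap)."""
    pred_chars = list(prediction.strip())
    ref_chars = list(reference.strip())
    if not pred_chars or not ref_chars:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_chars == ref_chars)
    overlap = sum((Counter(pred_chars) & Counter(ref_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)

# A prediction that is a substring of the reference: no exact match, partial F1.
print(exact_match("2万公里", "超过2万公里"))
print(round(char_f1("2万公里", "超过2万公里"), 3))
```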

Intended Uses & Limitations

Intended Uses

  • Chinese reading comprehension tasks
  • Answer extraction from given documents
  • Context-based question answering systems

Limitations

  • Only supports extractive QA (cannot generate new answers)
  • Answers must be present in the context
  • Does not support multi-hop reasoning
  • Cannot handle unanswerable questions

Training Details

Training Hyperparameters

  • Learning rate: 3e-05
  • Train batch size: 12
  • Eval batch size: 8
  • Seed: 42
  • Optimizer: AdamW (betas=(0.9,0.999), epsilon=1e-08)
  • LR scheduler: linear
  • Number of epochs: 5.0
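
As a sketch of what the linear scheduler does with these settings, the learning rate decays from 3e-05 to zero over the run. The step count below assumes a single device, no gradient accumulation, and zero warmup steps, none of which the card states. `linear_lr` is a hypothetical helper written for this illustration, not a Transformers function.

```python
import math

BASE_LR = 3e-5
TRAIN_SAMPLES = 18_960   # taken from the Training Results section
BATCH_SIZE = 12
EPOCHS = 5

# Assumption: one device and no gradient accumulation, so one batch = one optimizer step.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

def linear_lr(step: int, warmup_steps: int = 0) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / max(1, total_steps - warmup_steps)

print(total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step))
```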

Training Results

  • Training time: 892.86 seconds
  • Training samples: 18,960
  • Training speed: 106.175 samples/second
  • Training loss: 0.5625
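
The reported training speed is consistent with the other figures if throughput is counted over all epochs. A quick check, assuming the 5 epochs listed under Training Hyperparameters:

```python
train_samples = 18_960
epochs = 5
train_seconds = 892.86

# Samples processed across all epochs divided by wall-clock training time.
samples_per_second = train_samples * epochs / train_seconds
print(round(samples_per_second, 2))  # close to the reported 106.175
```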

Framework Versions

  • Transformers: 4.47.0.dev0
  • PyTorch: 2.5.1+cu124
  • Datasets: 3.1.0
  • Tokenizers: 0.20.3

Usage

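import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Load model and tokenizer
model_id = "real-jiakai/bert-base-chinese-finetuned-cmrc2018"
model = AutoModelForQuestionAnswering.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.eval()  # disable dropout for inference

# Prepare inputs
question = "长城有多长?"
context = "长城是中国古代的伟大建筑工程,全长超过2万公里,横跨中国北部多个省份。"

# Tokenize the question/context pair; if the pair is too long, truncate only the context
inputs = tokenizer(
    question,
    context,
    return_tensors="pt",
    max_length=384,
    truncation="only_second",
)

# Run the model without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Take the most likely start and end token positions (+ 1 makes the end exclusive for slicing)
answer_start = int(torch.argmax(outputs.start_logits))
answer_end = int(torch.argmax(outputs.end_logits)) + 1

# Decode the span. The BERT tokenizer puts spaces between Chinese characters,
# so strip them (this also removes spaces inside any non-Chinese text).
answer = tokenizer.decode(inputs["input_ids"][0][answer_start:answer_end], skip_special_tokens=True)
print("Answer:", answer.replace(" ", ""))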
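
The call above truncates inputs at 384 tokens, so an answer located past that point in a long context is lost. One common workaround is to split the context into overlapping windows and run the model on each window. The sketch below shows only the windowing logic on token positions; the function name, the window size, and the overlap of 128 tokens are illustrative assumptions rather than settings confirmed by this card.

```python
from typing import List, Tuple

def sliding_windows(num_tokens: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) token ranges, end exclusive.

    Consecutive windows overlap by `stride` tokens.
    """
    if window <= stride:
        raise ValueError("window must be larger than stride")
    spans = []
    start = 0
    while True:
        end = min(start + window, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break
        # Advance so that consecutive windows share `stride` tokens.
        start += window - stride
    return spans

print(sliding_windows(num_tokens=1000, window=350, stride=128))
```

In Transformers, the tokenizer arguments `stride` and `return_overflowing_tokens=True` produce overlapping chunks of this kind. The best-scoring span across all chunks is then taken as the answer.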

Citation

If you use this model, please cite the CMRC2018 dataset:

@inproceedings{cui-emnlp2019-cmrc2018,
    title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
    author = "Cui, Yiming  and
      Liu, Ting  and
      Che, Wanxiang  and
      Xiao, Li  and
      Chen, Zhipeng  and
      Ma, Wentao  and
      Wang, Shijin  and
      Hu, Guoping",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
    month = nov,
    year = "2019",
    address = "Hong Kong, China",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/D19-1600",
    doi = "10.18653/v1/D19-1600",
    pages = "5886--5891",
}