NER4Legal_SRB

Model Description

NER4Legal_SRB is a fine-tuned Named Entity Recognition (NER) model for processing Serbian legal documents. It was created as part of the conference paper "Named Entity Recognition for Serbian Legal Documents: Design, Methodology and Dataset Development", accepted for publication at the 15th International Conference on Information Society and Technology, Kopaonik, Serbia, March 9-12, 2025. The model aims to automate tasks involving legal documents, such as document archiving, search, and retrieval. It builds on the pre-trained classla/bcms-bertic model, adapted to the task of identifying and classifying a predefined set of named entities in Serbian legal texts. The model can be run on both CPU and GPU. The released model was trained on the full NER4Legal_SRB dataset described in the reference paper.

Abstract

Recent advancements in the field of natural language processing (NLP) and especially large language models (LLMs) and their numerous applications have brought research attention to the design of different document processing tools and enhancements in the process of document archiving, search, and retrieval. The domain of official legal documents is especially interesting due to the vast amount of data generated daily, as well as the significant community of interested practitioners (lawyers, law offices, administrative workers, state institutions, and citizens). Providing efficient ways for automation of everyday work involving legal documents is therefore expected to have significant impact in different fields.

In this work, we present one LLM-based solution for Named Entity Recognition (NER) in the case of legal documents written in Serbian language. It leverages the pre-trained bidirectional encoder representations from transformers (BERT), carefully adapted to the specific task of identifying and classifying specific data points from textual content. Besides novel dataset development for Serbian language (involving public court rulings), presented system design and applied methodology, the paper also discusses achieved performance metrics and their implications for objective assessment of the proposed solution. Performed cross-validation tests on the created manually labeled dataset with a mean F1 score of 0.96 and additional results on the examples of intentionally modified text inputs confirm applicability of the proposed system design and robustness of the developed NER solution.

Base Model

The model is fine-tuned from the classla/bcms-bertic base model, which is a pre-trained BERT model designed for the BCMS (Bosnian, Croatian, Montenegrin, Serbian) languages.

Dataset

This model was fine-tuned on a manually labeled dataset of Serbian legal documents, including public court rulings. The dataset was specifically developed for this task to enable precise identification and classification of entities in Serbian legal texts.

Performance Metrics

The model achieved a mean F1 score of 0.96 in cross-validation tests on the manually labeled dataset. For details of the evaluation procedure and the full reported results, please consult the original conference paper.
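For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall, and a cross-validated score is the average of the per-fold values. The sketch below illustrates that arithmetic only. The helper name `f1_score` and the per-fold counts are made up for illustration and are not results from the paper; whether the paper counts matches at token level or entity level is not stated in this card.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return F1, the harmonic mean of precision and recall, from raw counts."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Hypothetical (TP, FP, FN) counts for three folds; illustrative only
fold_counts = [(90, 10, 10), (80, 20, 0), (50, 0, 50)]
fold_f1 = [f1_score(*counts) for counts in fold_counts]

# The cross-validated score is the plain average over folds
mean_f1 = sum(fold_f1) / len(fold_f1)
print([round(score, 3) for score in fold_f1], round(mean_f1, 3))
```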

Contributors

Vladimir Kalušev https://huggingface.co/kalusev
Branko Brkljač https://huggingface.co/brkljac, https://brkljac.github.io/

Usage

Here’s how to use the model in Python:

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Load the model and tokenizer
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
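# The deprecated use_auth_token argument is omitted here; pass token=True
# instead if the repository requires authentication in your environment
tokenizer = AutoTokenizer.from_pretrained("kalusev/NER4Legal_SRB")
model = AutoModelForTokenClassification.from_pretrained("kalusev/NER4Legal_SRB").to(device)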

# Define the label mapping (id_to_label)
id_to_label = {
    0: 'O',
    1: 'B-COURT',
    2: 'B-DATE',
    3: 'B-DECISION',
    4: 'B-LAW',
    5: 'B-MONEY',
    6: 'B-OFFICIAL GAZZETE',
    7: 'B-PERSON',
    8: 'B-REFERENCE',
    9: 'I-COURT',
    10: 'I-LAW',
    11: 'I-MONEY',
    12: 'I-OFFICIAL GAZZETE',
    13: 'I-PERSON',
    14: 'I-REFERENCE'
}

# NER with GPU/CPU fallback
def perform_ner(text):
    """
    Perform Named Entity Recognition on a single text with GPU memory fallback.
    Args:
        text (str): Input text.
    Returns:
        list: List of tokens and predicted labels.
    """
    try:
        # Tokenize the input text and move it to the device the model is on
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(model.device)
        # Get predictions from the model
        with torch.no_grad():
            outputs = model(**inputs)
    except RuntimeError as e:
        if "CUDA out of memory" not in str(e):
            raise  # Re-raise other exceptions
        print("Switching to CPU due to memory constraints.")
        # Move the model to the CPU; later calls follow it via model.device
        model.cpu()
        inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = model(**inputs)

    # Take the single batch element explicitly instead of relying on squeeze()
    predictions = torch.argmax(outputs.logits, dim=2)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    labels = [id_to_label[pred] for pred in predictions]

    # Filter out special tokens
    results = [
        (token, label)
        for token, label in zip(tokens, labels)
        if token not in tokenizer.all_special_tokens
    ]
    return results

# Example usage
text = """Rešenjem Apelacionog suda u Novom Sadu, Gž1. 1901/10 od 12.05.2010. godine žalba tuženog je usvojena, a presuda Opštinskog suda u Novom Sadu, P. 5734/04 od 29.01.2009. godine, ukinuta i predmet upućen ovom sudu na ponovno suđenje."""

# Perform NER
results = perform_ner(text)

# Print tokens and labels as a formatted table
print("Token             | Predicted Label")
print("----------------------------------------")
for token, label in results:
    print(f"{token:<17} | {label}")
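The function above returns one label per subword token. For downstream use it is often more convenient to merge subwords back into words and to group consecutive B-/I- tags into entity spans. The following is a minimal sketch of such post-processing and is not part of the original model card: the helper name `group_entities` is hypothetical, it assumes WordPiece-style `##` continuation markers and the BIO scheme from `id_to_label`, and the labels in the example are written by hand rather than produced by the model.

```python
def group_entities(token_label_pairs):
    """Merge (token, label) pairs into (entity_text, entity_type) spans.

    Assumes WordPiece-style '##' continuation pieces and BIO labels.
    """
    entities = []
    current_words, current_type = [], None

    def flush():
        # Close the entity being built, if any, and reset the state
        nonlocal current_words, current_type
        if current_words:
            entities.append((" ".join(current_words), current_type))
        current_words, current_type = [], None

    for token, label in token_label_pairs:
        if token.startswith("##"):
            # Continuation piece: glue it to the previous word of an open entity
            if current_words:
                current_words[-1] += token[2:]
            continue
        if label == "O":
            flush()
            continue
        prefix, entity_type = label.split("-", 1)
        if prefix == "B" or entity_type != current_type:
            flush()
            current_type = entity_type
        current_words.append(token)
    flush()
    return entities

# Hand-written example input (illustrative labels, not model output)
example = [
    ("Rešenjem", "O"), ("Apelacion", "B-COURT"), ("##og", "I-COURT"),
    ("suda", "I-COURT"), ("žalba", "O"), ("tužen", "O"), ("##og", "O"),
]
print(group_entities(example))
```

With real output, the same call would be `group_entities(perform_ner(text))`. Punctuation tokens inside an entity are joined with spaces in this sketch, so reference numbers and dates may need extra cleanup.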

Figure: NER4Legal_SRB performance in the presence of noisy inputs
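The specific input modifications evaluated in the paper are not listed in this card. As one illustrative assumption, a common source of noise in Serbian Latin text is missing diacritics. The hypothetical helper `strip_diacritics` below produces such a variant of an input, which can then be passed to `perform_ner` to compare predictions on clean and noisy text.

```python
# Hypothetical perturbation: replace Serbian Latin diacritics with ASCII letters.
# This is an assumption for illustration, not the procedure used in the paper.
_ASCII_MAP = str.maketrans({
    "š": "s", "Š": "S",
    "č": "c", "Č": "C",
    "ć": "c", "Ć": "C",
    "ž": "z", "Ž": "Z",
    "đ": "dj", "Đ": "Dj",
})

def strip_diacritics(text):
    """Return a copy of text with Serbian diacritics replaced by ASCII."""
    return text.translate(_ASCII_MAP)

clean = "žalba tuženog je usvojena i predmet upućen na ponovno suđenje"
noisy = strip_diacritics(clean)
print(noisy)
```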

If you would like to use this software, please consider citing the following publication:

Kalušev, V., Brkljač, B. (2025). Named entity recognition for Serbian legal documents: Design, methodology and dataset development. In Proceedings of the 15th International Conference on Information Society and Technology (ICIST), Kopaonik, Serbia, 9-12 March, 2025, Vol. -, ISBN -, accepted for publication.


@inproceedings{KalusevNER2025,
    author = {Kalu{\v{s}}ev, Vladimir and Brklja{\v{c}}, Branko},
    booktitle = {15th International Conference on Information Society and Technology (ICIST)},
    doi = {-},
    month = mar,
    pages = {1--16},
    title = {Named entity recognition for Serbian legal documents: {D}esign, methodology and dataset development},
    year = {2025}
}


@misc{kalušev2025namedentityrecognitionserbian,
      title={Named entity recognition for Serbian legal documents: Design, methodology and dataset development},
      author={Vladimir Kalušev and Branko Brkljač},
      year={2025},
      eprint={2502.10582},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.10582},
}
