This model is a RoBERTa model trained on C/C++ source code from WolfSSL, mixed with Linux kernel code, together with examples of cybersecurity vulnerabilities related to input validation. The model is pre-trained to understand the concept of a singleton in the code.

The training data is C/C++ code, but inference can also be performed on code in other languages.

Using the model to unmask tokens can be done in the following way:

```python
from transformers import pipeline

unmasker = pipeline('fill-mask', model='mstaron/CyBERTa')
unmasker("Hello I'm a <mask> model.")
```

Obtaining embeddings for a downstream task can be done in the following way:

```python
# import the model via the huggingface library
from transformers import AutoTokenizer, AutoModelForMaskedLM

# load the tokenizer for the pre-trained CyBERTa
tokenizer = AutoTokenizer.from_pretrained('mstaron/CyBERTa')

# load the model
model = AutoModelForMaskedLM.from_pretrained("mstaron/CyBERTa")

# import the feature extraction pipeline
from transformers import pipeline

# create the pipeline, which will extract the embedding vectors
# the models are already pre-trained, so we do not need to train anything here
features = pipeline(
    "feature-extraction",
    model=model,
    tokenizer=tokenizer,
    return_tensors=False
)

# extract the features == embeddings
lstFeatures = features('Class HTTP::X1')

# print the first token's embedding (<s>, RoBERTa's equivalent of [CLS]),
# which is often used as an approximation of the whole sentence embedding;
# an alternative is mean pooling, e.g. np.mean(lstFeatures[0], axis=0)
print(lstFeatures[0][0])
```
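As noted in the comment above, a single fixed-size sentence vector can also be obtained by mean pooling over all token embeddings. A minimal sketch:

```python
import numpy as np

# mean-pool the token embeddings of the first (and only) input sequence
# into one sentence-level vector
sentence_embedding = np.mean(lstFeatures[0], axis=0)
```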

Since the model is only pre-trained, it needs to be fine-tuned on a downstream task before it can be used for prediction.
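As a minimal fine-tuning sketch, the example below attaches a classification head and trains it with the standard Hugging Face `Trainer`. The task, labels, and the two code snippets used as data are hypothetical, chosen only to illustrate the workflow:

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained('mstaron/CyBERTa')
# a fresh classification head is added on top of the pre-trained encoder
model = AutoModelForSequenceClassification.from_pretrained('mstaron/CyBERTa', num_labels=2)

# hypothetical training data: code lines labeled vulnerable (1) / safe (0)
texts = ["strcpy(buf, input);", "strncpy(buf, input, sizeof(buf) - 1);"]
labels = [1, 0]

encodings = tokenizer(texts, truncation=True, padding=True)

class CodeDataset(torch.utils.data.Dataset):
    """Wraps tokenized code snippets and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir='./results', num_train_epochs=1),
    train_dataset=CodeDataset(encodings, labels),
)
trainer.train()
```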
