KGR10 FastText Polish word embeddings

Distributional language models (word embeddings) for Polish, in both textual and binary formats, trained on the KGR10 corpus (over 4 billion words) using FastText. The following variants are available (all possible combinations):

  • dimension: 100, 300
  • method: skipgram, cbow
  • tool: FastText, Magnitude
  • source text: plain, plain.lower, plain.lemma, plain.lemma.lower
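To make "all possible combinations" concrete, the short sketch below enumerates the variant grid from the list above. The option values are copied from that list; the dictionary keys and the helper name enumerate_variants are illustrative assumptions and do not correspond to actual file names or to any API of the embeddings library.

```python
from itertools import product

# Option values copied from the variant list above.
# The key names are illustrative, not taken from the repository.
OPTIONS = {
    "dimension": [100, 300],
    "method": ["skipgram", "cbow"],
    "tool": ["FastText", "Magnitude"],
    "source_text": ["plain", "plain.lower", "plain.lemma", "plain.lemma.lower"],
}


def enumerate_variants(options):
    """Return one dict per combination of option values (Cartesian product)."""
    keys = list(options)
    return [dict(zip(keys, values)) for values in product(*options.values())]


variants = enumerate_variants(OPTIONS)
print(len(variants))  # 2 * 2 * 2 * 4 = 32
print(variants[0])
```

Under these assumptions the grid contains 32 variants, of which the repository ships only a selected few.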

Models

The repository contains four selected models that were examined in the paper (see Citation). The best-performing model is the default model/config (see default_config.json).

Usage

To use these embedding models easily, install the embeddings library (published on PyPI as clarinpl-embeddings):

pip install clarinpl-embeddings

Utilising the default model (the easiest way)

Word embedding:

from embeddings.embedding.auto_flair import AutoFlairWordEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/fastText-kgr10")
embedding.embed([sentence])

for token in sentence:
    print(token)
    print(token.embedding)

Document embedding (averaged over words):

from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/fastText-kgr10")
embedding.embed([sentence])

print(sentence.embedding)
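"Averaged over words" means that the document vector is the element-wise mean of the token vectors. The sketch below illustrates only that idea on toy data; mean_pool is a hypothetical helper written for this example, not the library's implementation, and the 3-dimensional vectors stand in for real 100- or 300-dimensional embeddings.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors element-wise."""
    if not token_vectors:
        raise ValueError("cannot average an empty list of vectors")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Toy vectors for a two-token "sentence" (made-up values for illustration).
toy_tokens = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]

document_vector = mean_pool(toy_tokens)
print(document_vector)  # [2.0, 3.0, 4.0]
```

A consequence of this pooling is that the document vector has the same dimension as the word vectors and ignores word order.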

Customisable way

Word embedding:

from embeddings.embedding.static.embedding import AutoStaticWordEmbedding
from embeddings.embedding.static.fasttext import KGR10FastTextConfig
from flair.data import Sentence

config = KGR10FastTextConfig(method='cbow', dimension=100)
embedding = AutoStaticWordEmbedding.from_config(config)

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])

for token in sentence:
    print(token)
    print(token.embedding)

Document embedding (averaged over words):

from embeddings.embedding.static.embedding import AutoStaticDocumentEmbedding
from embeddings.embedding.static.fasttext import KGR10FastTextConfig
from flair.data import Sentence

config = KGR10FastTextConfig(method='cbow', dimension=100)
embedding = AutoStaticDocumentEmbedding.from_config(config)

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])

print(sentence.embedding)

Citation

The link below leads to the NextCloud directory with all variants of the embeddings. If you use them, please cite the following article:

@article{kocon2018embeddings,
author = {Koco\'{n}, Jan and Gawor, Micha{\l}},
title = {Evaluating {KGR10} {P}olish word embeddings in the recognition of temporal
expressions using {BiLSTM-CRF}},
journal = {Schedae Informaticae},
volume = {27},
year = {2018},
url = {http://www.ejournals.eu/Schedae-Informaticae/2018/Volume-27/art/13931/},
doi = {10.4467/20838476SI.18.008.10413}
}