|
---
license: cc0-1.0
datasets:
- go_emotions
pipeline_tag: sentence-similarity
---
|
|
|
### Model Description |
|
|
|
Machine learning models like [tensorflow-compress](https://www.mattmahoney.net/dc/text.html), which uses an LSTM to compress text, can achieve remarkable compression ratios with little code to maintain.
|
This model was trained with *dynamic sapient technology*: it is a SentencePiece unigram model trained on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset, and it compresses these bit strings much better than run-length encoding (RLE); a minimal RLE baseline is sketched below.
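
For scale, the RLE baseline referred to above can be as simple as the following sketch (`rle_encode` is an illustrative helper, not part of this repository):

```py
from itertools import groupby

def rle_encode(bits: str) -> str:
    # Encode each run as "<bit><run length>", e.g. "0000010" -> "05,11,01".
    return ",".join(f"{bit}{len(list(run))}" for bit, run in groupby(bits))
```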
|
|
|
- **Developed by:** Ziv Arin |
|
- **Model type:** Sentence similarity lossless compression |
|
- **License:** CC0-1.0 |
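
The exact training recipe is not included in this card; the sketch below shows how a SentencePiece unigram model of this kind could plausibly be trained. The input file name and `vocab_size` are assumptions (a vocabulary of at most 259 pieces keeps every id, after the offset of 3 used in the demo, within one byte).

```py
import sentencepiece as spm

# Hypothetical training call; file names and parameters are assumptions.
# vocab_size <= 259 keeps piece_id - 3 inside the 0..255 one-byte range.
spm.SentencePieceTrainer.train(
    input="bit_strings.txt",   # one 384-bit string per line
    model_prefix="384_bit_comp",
    model_type="unigram",
    vocab_size=259,
)
```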
|
|
|
### Demo |
|
|
|
Example bitarray (384-bit): 000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000 |
|
Compressed (208-bit): 1ab2ed09d7a9617206894e0608 (26 hex characters stored as one ASCII byte each; 45.83% space saving)
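
The space-saving figure is plain arithmetic over those two sizes:

```py
original_bits = 384
compressed_bits = 26 * 8  # 26 hex characters, one ASCII byte each
print(f"{1 - compressed_bits / original_bits:.2%}")  # 45.83%
```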
|
|
|
[The notebook](https://huggingface.co/baiango/384_bit_comp/blob/main/384_bit_comp.ipynb):
|
```py
import sentencepiece as spm
import numpy as np
from collections import Counter

bpe_processor = spm.SentencePieceProcessor(model_file='384_bit_comp.model')


def encode_id(bit_text):
    # Tokenize the bit string into SentencePiece pieces, then map each
    # piece to its id. The offset of 3 skips the reserved special tokens
    # (<unk>, <s>, </s>) so that every remaining id fits in one byte.
    encoded_pieces = bpe_processor.encode_as_pieces(bit_text)
    encoded_ids = [bpe_processor.piece_to_id(s) - 3 for s in encoded_pieces]
    assert all(0 <= id_ <= 255 for id_ in encoded_ids)
    # Pack each id as a two-digit hex byte.
    string_ids = "".join([format(id_, "02x") for id_ in encoded_ids])
    return string_ids


def decode_id(hex_string):
    # Unpack the hex bytes, cast to int to avoid uint8 wrap-around,
    # then re-add the offset of 3 and look the pieces back up.
    u8_array = np.frombuffer(bytes.fromhex(hex_string), dtype='<u1').astype(int) + 3
    encoded_tokens = [bpe_processor.id_to_piece(int(id_)) for id_ in u8_array]
    return encoded_tokens


# Encode text
new_sentence = "000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000"
encoded_tokens = bpe_processor.encode_as_pieces(new_sentence)
encoded_ids = encode_id(new_sentence)
decoded_tokens = decode_id(encoded_ids)

print("length:", len(encoded_tokens))
print("encoded_tokens:", encoded_tokens)
print("encoded_ids:", encoded_ids)
print("same?:", encoded_tokens == decoded_tokens)

count = Counter(encoded_tokens)
print("count:", count)
```
|
Output: |
|
```
length: 13
encoded_tokens: ['▁0000000', '0000000000000001000000000000000000000', '00000000001000100', '1000000', '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000', '00000000000000000001000000000000000000000000000000000', '0000000000000000000000000000000001000', '00000000000000000000000100000000000000000', '00000000010', '0000000000000000000000000000000000000100', '00000000000100000000000000000', '00000000010', '00001000']
encoded_ids: 1ab2ed09d7a9617206894e0608
same?: True
count: Counter({'00000000010': 2, '▁0000000': 1, '0000000000000001000000000000000000000': 1, '00000000001000100': 1, '1000000': 1, '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000': 1, '00000000000000000001000000000000000000000000000000000': 1, '0000000000000000000000000000000001000': 1, '00000000000000000000000100000000000000000': 1, '0000000000000000000000000000000000000100': 1, '00000000000100000000000000000': 1, '00001000': 1})
```
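
Because the pieces are just substrings of the input (plus SentencePiece's `▁` word-boundary marker on the first piece), losslessness can be verified by concatenating the decoded tokens and stripping the marker:

```py
# Round-trip check: rebuild the bit string from the decoded pieces.
restored = "".join(decoded_tokens).replace("▁", "")
assert restored == new_sentence
```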
|
|
|
## Bias, Risks, and Limitations |
|
|
|
The model has no sentient bias, only algorithmic bias; don't worry, it's not a living thing.

The model compresses strings with fewer zeros (denser bit patterns) poorly, since its learned pieces are mostly long runs of zeros; a quick demonstration follows.
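
One quick way to see this limitation (assuming the model file is on hand): a dense random bit string splits into many short pieces, so each one-byte id covers only a few input bits and little space is saved.

```py
import random

import sentencepiece as spm

bpe_processor = spm.SentencePieceProcessor(model_file='384_bit_comp.model')

# Dense input (about half ones) fragments into many short pieces,
# unlike the 13 pieces of the sparse 384-bit example above.
dense_bits = "".join(random.choice("01") for _ in range(384))
print(len(bpe_processor.encode_as_pieces(dense_bits)))
```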
|
|
|
## Environmental Impact |
|
- **Hardware Type:** Intel Core i5-9300H

- **Hours used:** 3