---
license: cc0-1.0
datasets:
- go_emotions
pipeline_tag: sentence-similarity
---
### Model Description
Machine learning models such as [tensorflow-compress](https://www.mattmahoney.net/dc/text.html) use LSTMs to compress text, achieving remarkable compression ratios with little code to maintain.
This model was trained with *dynamic sapient technology*: a SentencePiece unigram tokenizer fitted on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset. It compresses sparse bit strings much better than RLE; a hedged training sketch follows the metadata list below.
- **Developed by:** Ziv Arin
- **Model type:** Sentence similarity lossless compression
- **License:** CC0-1.0
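This repository does not include the training script; the following is a minimal sketch of how such a unigram model could be trained with SentencePiece. The input file name, vocabulary size, and maximum piece length are assumptions, not values taken from this model.
```py
import sentencepiece as spm

# Hypothetical training call. bit_strings.txt (an assumed file name) would hold
# one bit string per line. vocab_size=259 leaves 256 usable pieces after the
# 3 default special tokens, so each piece id minus 3 fits in one byte, and
# max_sentencepiece_length is raised from its default of 16 because the demo
# below shows pieces far longer than that.
spm.SentencePieceTrainer.train(
    input="bit_strings.txt",
    model_prefix="384_bit_comp",
    model_type="unigram",
    vocab_size=259,
    max_sentencepiece_length=96,
)
```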
### Demo
Example bitarray (384-bit): 000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000
Compressed (208-bit): 1ab2ed09d7a9617206894e0608 (45.83% space-saving efficiency)
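The space-saving figure is just the ratio of the two bit counts quoted above:
```py
# 384-bit input vs. the 208-bit hex representation quoted above
print(f"{(1 - 208 / 384) * 100:.2f}% space saving")  # 45.83%
```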
The notebook: [384_bit_comp.ipynb](https://huggingface.co/baiango/384_bit_comp/blob/main/384_bit_comp.ipynb)
```py
import numpy as np
from collections import Counter
import sentencepiece as spm

bpe_processor = spm.SentencePieceProcessor(model_file='384_bit_comp.model')

def encode_id(bit_text):
    # Tokenize the bit string into SentencePiece pieces, then map each piece to
    # its vocabulary id. The -3 offset skips the special tokens so that every
    # id fits into a single byte (hence the "02x" formatting below).
    encoded_pieces = bpe_processor.encode_as_pieces(bit_text)
    encoded_ids = [bpe_processor.piece_to_id(s) - 3 for s in encoded_pieces]
    assert all(0 <= id_ <= 255 for id_ in encoded_ids)
    string_ids = "".join([format(id_, "02x") for id_ in encoded_ids])
    return string_ids

def decode_id(hex_string):
    # Reverse the mapping: hex string -> byte ids -> SentencePiece pieces.
    u8_array = np.frombuffer(bytes.fromhex(hex_string), dtype='<u1') + 3
    encoded_tokens = [bpe_processor.id_to_piece(int(id_)) for id_ in u8_array]
    return encoded_tokens

# Encode text
new_sentence = "000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000"
encoded_tokens = bpe_processor.encode_as_pieces(new_sentence)
encoded_ids = encode_id(new_sentence)
decoded_tokens = decode_id(encoded_ids)

print("length:", len(encoded_tokens))
print("encoded_tokens:", encoded_tokens)
print("encoded_ids:", encoded_ids)
print("same?:", encoded_tokens == decoded_tokens)

count = Counter(encoded_tokens)
print("count:", count)
```
Output:
```
length: 13
encoded_tokens: ['▁0000000', '0000000000000001000000000000000000000', '00000000001000100', '1000000', '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000', '00000000000000000001000000000000000000000000000000000', '0000000000000000000000000000000001000', '00000000000000000000000100000000000000000', '00000000010', '0000000000000000000000000000000000000100', '00000000000100000000000000000', '00000000010', '00001000']
encoded_ids: 1ab2ed09d7a9617206894e0608
same?: True
count: Counter({'00000000010': 2, '▁0000000': 1, '0000000000000001000000000000000000000': 1, '00000000001000100': 1, '1000000': 1, '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000': 1, '00000000000000000001000000000000000000000000000000000': 1, '0000000000000000000000000000000001000': 1, '00000000000000000000000100000000000000000': 1, '0000000000000000000000000000000000000100': 1, '00000000000100000000000000000': 1, '00001000': 1})
```
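Not part of the original notebook, but a minimal round-trip sketch (reusing `decode_id`, `encoded_ids`, and `new_sentence` from the demo above): joining the decoded pieces and stripping SentencePiece's word-boundary marker `▁` should recover the original 384-bit string.
```py
# Hypothetical round-trip check, reusing decode_id, encoded_ids and
# new_sentence from the demo above. "▁" is SentencePiece's word-boundary
# marker and is not part of the bit string itself.
restored = "".join(decode_id(encoded_ids)).replace("▁", "")
print("round-trip ok:", restored == new_sentence)
```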
## Bias, Risks, and Limitations
It doesn't have any sentient bias, except algorithmic bias. Don't worry about it, it's not a living thing.
The model does not compress strings with fewer zeros (i.e., denser bit patterns) well; a quick way to check this yourself is sketched below.
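A hedged helper for gauging this, reusing `encode_id` from the demo above and counting the hex output as 8-bit characters, the same accounting the demo uses:
```py
# Hypothetical helper, assuming encode_id from the demo above. It reports the
# space saving of a bit string, counting the hex output at 8 bits per
# character (matching the 208-bit figure quoted in the demo).
def space_saving(bit_text):
    compressed_bits = len(encode_id(bit_text)) * 8
    return 1 - compressed_bits / len(bit_text)

# Denser bit patterns typically save less, or even expand:
print(f"{space_saving('01' * 192):.2%}")
```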
## Environmental Impact
- **Hardware Type:** Intel Core i5-9300H
- **Hours used:** 3