---
license: cc0-1.0
datasets:
- go_emotions
pipeline_tag: sentence-similarity
---
### Model Description
Machine learning models such as [tensorflow-compress](https://www.mattmahoney.net/dc/text.html) use LSTMs to compress text, achieving remarkable compression ratios with little code to maintain.
This model was trained with *dynamic sapient technology*: a SentencePiece unigram tokenizer fitted on the [go_emotions](https://huggingface.co/datasets/go_emotions) dataset. It compresses sparse bit strings much better than RLE; a hedged training sketch follows the metadata list below.
- **Developed by:** Ziv Arin
- **Model type:** Sentence similarity lossless compression
- **License:** CC0-1.0
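This repository does not include the training script; the following is a minimal sketch of how such a unigram model could be trained with SentencePiece. The input file name, vocabulary size, and maximum piece length are assumptions, not values taken from this model.
```py
import sentencepiece as spm

# Hypothetical training call. bit_strings.txt (an assumed file name) would hold
# one bit string per line. vocab_size=259 leaves 256 usable pieces after the
# 3 default special tokens, so each piece id minus 3 fits in one byte, and
# max_sentencepiece_length is raised from its default of 16 because the demo
# below shows pieces far longer than that.
spm.SentencePieceTrainer.train(
    input="bit_strings.txt",
    model_prefix="384_bit_comp",
    model_type="unigram",
    vocab_size=259,
    max_sentencepiece_length=96,
)
```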
### Demo
Example bitarray (384-bit): 000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000
Compressed (208-bit): 1ab2ed09d7a9617206894e0608 (45.83% space-saving efficiency)
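The space-saving figure is just the ratio of the two bit counts quoted above:
```py
# 384-bit input vs. the 208-bit hex representation quoted above
print(f"{(1 - 208 / 384) * 100:.2f}% space saving")  # 45.83%
```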
The notebook: [384_bit_comp.ipynb](https://huggingface.co/baiango/384_bit_comp/blob/main/384_bit_comp.ipynb)
```py
import numpy as np
from collections import Counter
import sentencepiece as spm

bpe_processor = spm.SentencePieceProcessor(model_file='384_bit_comp.model')

def encode_id(bit_text):
    # Tokenize the bit string into SentencePiece pieces, then map each piece to
    # its vocabulary id. The -3 offset skips the special tokens so that every
    # id fits into a single byte (hence the "02x" formatting below).
    encoded_pieces = bpe_processor.encode_as_pieces(bit_text)
    encoded_ids = [bpe_processor.piece_to_id(s) - 3 for s in encoded_pieces]
    assert all(0 <= id_ <= 255 for id_ in encoded_ids)
    string_ids = "".join([format(id_, "02x") for id_ in encoded_ids])
    return string_ids

def decode_id(hex_string):
    # Reverse the mapping: hex string -> byte ids -> SentencePiece pieces.
    u8_array = np.frombuffer(bytes.fromhex(hex_string), dtype='<u1') + 3
    encoded_tokens = [bpe_processor.id_to_piece(int(id_)) for id_ in u8_array]
    return encoded_tokens

# Encode text
new_sentence = "000000000000000000000010000000000000000000000000000000100010010000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000001000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000100000000000000000000000000100000000000000000000000000000000000000100000000000001000000000000000000000000001000001000"
encoded_tokens = bpe_processor.encode_as_pieces(new_sentence)
encoded_ids = encode_id(new_sentence)
decoded_tokens = decode_id(encoded_ids)

print("length:", len(encoded_tokens))
print("encoded_tokens:", encoded_tokens)
print("encoded_ids:", encoded_ids)
print("same?:", encoded_tokens == decoded_tokens)

count = Counter(encoded_tokens)
print("count:", count)
```
Output:
```
length: 13
encoded_tokens: ['▁0000000', '0000000000000001000000000000000000000', '00000000001000100', '1000000', '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000', '00000000000000000001000000000000000000000000000000000', '0000000000000000000000000000000001000', '00000000000000000000000100000000000000000', '00000000010', '0000000000000000000000000000000000000100', '00000000000100000000000000000', '00000000010', '00001000']
encoded_ids: 1ab2ed09d7a9617206894e0608
same?: True
count: Counter({'00000000010': 2, '▁0000000': 1, '0000000000000001000000000000000000000': 1, '00000000001000100': 1, '1000000': 1, '00000000000000000000000000000001000000000000000000000000000000000000000000000000000000': 1, '00000000000000000001000000000000000000000000000000000': 1, '0000000000000000000000000000000001000': 1, '00000000000000000000000100000000000000000': 1, '0000000000000000000000000000000000000100': 1, '00000000000100000000000000000': 1, '00001000': 1})
```
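Not part of the original notebook, but a minimal round-trip sketch (reusing `decode_id`, `encoded_ids`, and `new_sentence` from the demo above): joining the decoded pieces and stripping SentencePiece's word-boundary marker `▁` should recover the original 384-bit string.
```py
# Hypothetical round-trip check, reusing decode_id, encoded_ids and
# new_sentence from the demo above. "▁" is SentencePiece's word-boundary
# marker and is not part of the bit string itself.
restored = "".join(decode_id(encoded_ids)).replace("▁", "")
print("round-trip ok:", restored == new_sentence)
```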
## Bias, Risks, and Limitations
It doesn't have any sentient bias, except algorithmic bias. Don't worry about it, it's not a living thing.
The model does not compress strings with fewer zeros (i.e., denser bit patterns) well; a quick way to check this yourself is sketched below.
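A hedged helper for gauging this, reusing `encode_id` from the demo above and counting the hex output as 8-bit characters, the same accounting the demo uses:
```py
# Hypothetical helper, assuming encode_id from the demo above. It reports the
# space saving of a bit string, counting the hex output at 8 bits per
# character (matching the 208-bit figure quoted in the demo).
def space_saving(bit_text):
    compressed_bits = len(encode_id(bit_text)) * 8
    return 1 - compressed_bits / len(bit_text)

# Denser bit patterns typically save less, or even expand:
print(f"{space_saving('01' * 192):.2%}")
```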
## Environmental Impact
- **Hardware Type:** Intel Core i5-9300H
- **Hours used:** 3