this is an implyment for https://github.com/huggingface/transformers/issues/15153
I just add tens of lines of code into the py_bpe algorithm.
function finetune_tokenizer is main function added.
Details can be see in example.py , actuctally it is very simple.
the official python library tokenizer is written is rust. I am learning hoping to give a rust version of this code.
the_factor_of_new_added_token_divided_unk_number is the only param you should set.
hoping can find a auto algorithm to set it.