While preparing pre-training data, we noticed that different tokenizers compress different languages at quite different rates. Here we compare the compression ratios of the GPT-2, ChatGLM3, and Qwen1.5 tokenizers on Chinese and English text. "Compression ratio" throughout means a ratio of file sizes: the UTF-8 bytes of the raw text divided by the bytes needed to store the resulting token ids.

Chinese: one Chinese character occupies 3 bytes in UTF-8.

English: one English character occupies 1 byte in UTF-8.
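The two byte counts above can be checked directly in Python:

```python
# A CJK character occupies 3 bytes in UTF-8; an ASCII character occupies 1 byte.
print(len("中".encode("utf-8")))       # → 3
print(len("a".encode("utf-8")))        # → 1
print(len("人工智能".encode("utf-8")))  # → 12, i.e. 4 characters x 3 bytes each
```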

chinese = ("根据给定的文字生成一篇关于人工智能的文章,包括定义、历史和应用。\n人工智能是指使计算机具有智慧的能力。它的历史可以追溯到20世纪50年代。"
      "今天,人工智能在很多领域包括医学、金融和机器人等方面得到了广泛的应用。人工智能是指让计算机具备像人类一样思考、判断和学习的能力。"
      "这种技术的历史可以追溯到20世纪50年代,当时计算机科学家开始试图模拟人类智能的思维过程。\n在过去的几十年中,人工智能技术已经得到了广泛的应用。"
      "在医学领域,人工智能被用来诊断疾病和制定个性化的治疗计划。在金融行业,人工智能被用来分析市场趋势和进行股票交易。"
      "在工业和制造业中,人工智能被用来管理生产线和控制机器人。\n随着技术的不断进步,人工智能的应用范围也在不断扩大。"
      "它正在被越来越多的行业和领域所采用,这将会为人们的生活带来巨大的改变。")
english = """Translate the given Chinese text into English: "Generate an article on artificial intelligence based on the provided text, including definitions, history, and applications.
Artificial intelligence refers to the ability of computers to possess wisdom. Its history can be traced back to the 1950s.
Today, artificial intelligence has been widely applied in many fields including medicine, finance, and robotics. Artificial intelligence means enabling computers to think, judge, and learn like humans.
This technology's history dates back to the 1950s when computer scientists began trying to simulate the thought processes of human intelligence.
Over the past few decades, artificial intelligence technology has been widely applied.In the medical field, artificial intelligence is used to diagnose diseases and develop personalized treatment plans. 
In the financial industry, artificial intelligence is used to analyze market trends and conduct stock trading.
In industrial and manufacturing sectors, artificial intelligence is used to manage production lines and control robots.With the continuous advancement of technology, the scope of artificial intelligence applications is also expanding.
It is being adopted by an increasing number of industries and fields, which will bring significant changes to people's lives."""


import tiktoken
from transformers import AutoTokenizer


def encode(enc, token_bytes):
    """Print byte-level compression ratios for the Chinese and English corpora.

    token_bytes is the storage size of one token id:
    2 for np.uint16, 4 for np.uint32.
    """
    chinese_tokens = enc.encode(chinese)
    english_tokens = enc.encode(english)
    # Approximate raw UTF-8 size: count each Chinese character as 3 bytes
    # and each English character as 1 byte.
    chinese_bytes, chinese_token_nums = len(chinese) * 3, len(chinese_tokens)
    english_bytes, english_token_nums = len(english), len(english_tokens)
    # Compression ratio = raw text bytes / bytes needed to store the token ids.
    print(f"chinese bytes is {chinese_bytes} chinese token nums is {chinese_token_nums} "
          f"compression ratio is {chinese_bytes / (chinese_token_nums * token_bytes):.3f}")
    print(f"english bytes is {english_bytes} english token nums is {english_token_nums} "
          f"compression ratio is {english_bytes / (english_token_nums * token_bytes):.3f}")


if __name__ == "__main__":
    tokenizer_type = "qwen"  # renamed from `type`, which shadows the builtin
    token_bytes = 2  # np.uint16 storage by default
    if tokenizer_type == "gpt2":
        enc = tiktoken.get_encoding("gpt2")
    elif tokenizer_type == "llama":
        enc = AutoTokenizer.from_pretrained("")  # path left blank in the original
    elif tokenizer_type == "chatglm":
        enc = AutoTokenizer.from_pretrained("Pre-Train/Qwen3/tokenizer/chatglm3",
                                            trust_remote_code=True, local_files_only=True)
    elif tokenizer_type == "qwen":
        enc = AutoTokenizer.from_pretrained("Pre-Train/Qwen3/tokenizer/qwen3",
                                            trust_remote_code=True, local_files_only=True)
        token_bytes = 4  # Qwen1.5's vocabulary exceeds the np.uint16 range
    encode(enc, token_bytes)

1 GPT-2

Both GPT-2's and ChatGLM3's vocabularies are smaller than the np.uint16 range (2^16 = 65536), so token ids are stored as np.uint16: 2 bytes per token.
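A quick numpy check of that storage cost (a minimal sketch; the sample ids are arbitrary values inside GPT-2's 50257-entry vocabulary):

```python
import numpy as np

# Any id below 2**16 = 65536 fits in np.uint16, so 2 bytes per token.
token_ids = np.array([50256, 0, 314, 1101], dtype=np.uint16)
print(token_ids.itemsize)  # → 2 (bytes per token)
print(token_ids.nbytes)    # → 8 (total bytes for 4 tokens)
```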

chinese bytes is 1053 chinese token nums is 716 compression ratio is 0.735
english bytes is 1318 english token nums is 229 compression ratio is 2.878

Clearly, GPT-2's vocabulary compresses Chinese very poorly: the stored token ids take more space than the raw UTF-8 text (ratio below 1), because most Chinese characters are split into several byte-level BPE tokens.
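The metric used throughout can be written as a tiny helper (`compression_ratio` is a name introduced here for illustration, and the numbers below are made up, not the measured ones):

```python
def compression_ratio(text_bytes: int, n_tokens: int, bytes_per_token: int) -> float:
    """Raw UTF-8 size of the text divided by the size of its stored token-id array."""
    return text_bytes / (n_tokens * bytes_per_token)

# Illustrative only: 1200 bytes of text tokenized into 200 tokens,
# each id stored as np.uint16 (2 bytes).
print(compression_ratio(1200, 200, 2))  # → 3.0
```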

2 ChatGLM3

As with GPT-2, ChatGLM3's vocabulary fits within the np.uint16 range, so each token id again takes 2 bytes.

chinese bytes is 1053 chinese token nums is 183 compression ratio is 2.877
english bytes is 1318 english token nums is 242 compression ratio is 2.723

ChatGLM3 compresses Chinese far better than GPT-2 while remaining competitive on English.

3 Qwen1.5

Qwen1.5's vocabulary exceeds the np.uint16 range, so its token ids are stored as np.uint32: 4 bytes per token.
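Storing such ids in np.uint16 would silently wrap modulo 2**16, so widening to np.uint32 is required. A minimal numpy sketch (151643 is just an example id above 65535):

```python
import numpy as np

big_id = 151643                                 # a token id beyond the uint16 range
wrapped = np.array([big_id]).astype(np.uint16)  # overflow wraps modulo 2**16
print(int(wrapped[0]))                          # → 20571 (== 151643 % 65536)
safe = np.array([big_id], dtype=np.uint32)      # 4 bytes per token, holds the full id
print(safe.itemsize)                            # → 4
```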

chinese bytes is 1053 chinese token nums is 184 compression ratio is 1.431
english bytes is 1318 english token nums is 226 compression ratio is 1.458

Measured in bytes, ChatGLM3 gives the best Chinese compression of the three. Qwen1.5 pays 4 bytes per stored token for its larger vocabulary, which roughly halves its byte-level ratios, yet it still compresses Chinese about twice as well as GPT-2.