Model Quantization – simple

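Large models are heavy, and without quantization they would be practical in far fewer settings. So why does quantization work, or, put differently, why does a quantized model not lose much accuracy?

1 Inputs are usually normalized (for example into the range 0~1) and weights are typically trained with L2 regularization, so both weights and activations stay within a small numeric range.
2 Activation functions smooth out the values.
3 Most neural networks ultimately perform classification, which amounts to picking the class with the highest probability. As long as that class still has the highest probability after quantization, the prediction is unchanged, which gives quantization a lower bound on accuracy.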
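Point 3 can be illustrated with a small sketch in plain Python. The logits and the helper names (`quantize_symmetric`, `dequantize`, `argmax`) are made up for illustration; the idea is only that rounding to an INT8 grid moves each value by at most half a quantization step, which rarely changes which entry is largest.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers with a single scale (zero point 0)."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8
    scale = max(abs(v) for v in values) / qmax
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to (approximate) floats."""
    return [qi * scale for qi in q]


def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])


# Hypothetical logits of a 4-class classifier.
logits = [1.20, -0.52, 3.91, 0.08]
q_logits, scale = quantize_symmetric(logits)
restored = dequantize(q_logits, scale)

max_error = max(abs(a - b) for a, b in zip(logits, restored))
print(q_logits)                               # integers on the INT8 grid
print(argmax(logits), argmax(restored))       # the winning class is unchanged
print(max_error <= scale / 2)                 # rounding error is at most half a step
```

When two logits are closer together than one quantization step, the winner can still flip, so this is a tendency rather than a guarantee.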

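Quantization methods fall into two categories:

1 Post-training quantization: quantize after training has finished.
- Post-training dynamic quantization (PTDQ): only the weights are quantized ahead of time; activations are quantized on the fly during inference.
- Post-training static quantization (PTSQ): both weights and activations are quantized, using parameters fixed before inference.
2 Quantization-aware training (QAT): insert fake-quantization operators during training so that the model becomes aware of the quantization error, which reduces the final quantization error.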

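1 PTDQ – Post-Training Dynamic Quantization

Post-training dynamic quantization quantizes the weights directly, while activations are quantized and dequantized dynamically during computation. Concretely, the FP32 input is quantized to INT8, multiplied with the INT8 weights in the integer domain, and the result is dequantized back to FP32 before it enters the next layer.

The steps are:

1 Quantize the trained model weights to INT8 and store the quantization parameters.
2 At inference time, dynamically quantize each layer's FP32 input activations to INT8.
3 In each layer, compute with the INT8 weights and the INT8 activations.
4 Dequantize each layer's INT8 output back to FP32.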
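The four steps above can be sketched for a single linear layer in plain Python. This is a minimal illustration under simplifying assumptions (symmetric per-tensor quantization, a zero point of 0, made-up weights and inputs); it is not how PyTorch implements its kernels, and the helper names are hypothetical.

```python
def quantize_symmetric(values, qmax=127):
    """Symmetric INT8 quantization: q = round(x / scale), zero point 0."""
    max_abs = max(abs(v) for v in values) or 1.0   # avoid a zero scale
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) for v in values], scale


def dynamic_quantized_linear(x, q_weight, w_scale, bias):
    """Compute y = W x + b with INT8 operands and a floating-point result."""
    # Step 2: quantize the incoming FP32 activation on the fly.
    q_x, x_scale = quantize_symmetric(x)
    out = []
    for q_row, b in zip(q_weight, bias):
        # Step 3: integer multiply-accumulate (Python ints stand in for INT32).
        acc = sum(qw * qx for qw, qx in zip(q_row, q_x))
        # Step 4: dequantize the accumulator and add the float bias.
        out.append(acc * w_scale * x_scale + b)
    return out


# Step 1: quantize the trained weights once and keep the scale.
weight = [[0.20, -0.45, 1.00], [0.75, 0.10, -0.30]]   # made-up 2x3 weight matrix
bias = [0.05, -0.10]
flat_q, w_scale = quantize_symmetric([w for row in weight for w in row])
q_weight = [flat_q[0:3], flat_q[3:6]]

x = [0.9, 0.4, 0.1]
y_int8 = dynamic_quantized_linear(x, q_weight, w_scale, bias)
y_fp32 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]
print(y_fp32)
print(y_int8)   # close to y_fp32, but not identical
```

The activation scale is recomputed on every call, which is exactly the runtime cost that static quantization later removes.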

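This raises a question: why dequantize the INT8 result to FP32 before passing it to the next layer, or, in other words, why do we need FP32 activations at all?

1 The activation scale differs from layer to layer. If we quantized only at the model input, ran every intermediate computation in INT8, and dequantized to FP32 only at the very end, the final accuracy would be very poor.
2 For precision: the products of INT8 weights and INT8 activations are first accumulated in INT32 to prevent overflow, and the INT32 result is then converted to FP32.
3 If the quantized model is trained further in FP32, training needs the intermediate activations in FP32 (INT8 dequantized to FP32). With FP16 mixed-precision training, it needs them in FP16 (INT8 dequantized to FP16).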
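To make point 2 concrete, the following sketch works out why the accumulator has to be wider than INT8. It assumes a symmetric range of [-127, 127] for both operands; the function name `dequantize_accumulator` is illustrative only.

```python
INT8_MAX = 127
INT32_MAX = 2 ** 31 - 1

# Worst-case magnitude of a single INT8 x INT8 product (symmetric range assumed).
max_product = INT8_MAX * INT8_MAX

# Number of worst-case products an INT32 accumulator can sum without overflow.
max_terms = INT32_MAX // max_product

print(max_product > INT8_MAX)   # True: one product already exceeds the INT8 range
print(max_terms)                # upper bound on the safe dot-product length


def dequantize_accumulator(acc, x_scale, w_scale):
    """Convert an INT32 accumulator to a real value: y = acc * s_x * s_w."""
    return acc * x_scale * w_scale


# Example: an accumulator of 1761 with activation scale 0.9/127 and weight scale 1/127.
print(dequantize_accumulator(1761, 0.9 / 127, 1 / 127))
```

Because the real value depends on the product of both scales, an integer produced by one layer cannot be reused by the next layer without rescaling, which is the situation described in point 1.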

The following PyTorch test code demonstrates this process:

import torch
import torch.nn as nn
torch.manual_seed(42)


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(3, 3)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(3, 2)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x


# 1 data
weights = torch.tensor([[1.12, 1.6], [2.24, 2.5], [3.6, 3.9]], dtype=torch.float)
train_x = torch.rand(size=(10000, 3))
train_y = train_x @ weights

test_x = torch.rand(size=(5000, 3))
test_y = test_x @ weights

# 2 model
model = Model()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mse = torch.nn.MSELoss()

# 3 train
model.train()
for _ in range(100):
    optimizer.zero_grad()
    preds = model(train_x)
    loss = mse(preds, train_y)
    loss.backward()
    optimizer.step()

# 4 test
model.eval()
with torch.no_grad():
    preds = model(test_x)
    loss = mse(preds, test_y)
    print(f"model w/o quantify:test loss is {loss.item():.4f}")


model_8int = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
with torch.no_grad():
    preds = model_8int(test_x)
    loss = mse(preds, test_y)
    print(f"model w quantify: test loss is {loss.item():.4f}")


print(f"weights from w/o quantify model: {model.fc1.weight}")
print(f"weights from w quantify model: {torch.int_repr(model_8int.fc1.weight())}")
print(f"weights from w quantify and back model: {model_8int.fc1.weight()}")

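The `torch.ao.quantization.quantize_dynamic` API takes care of quantizing and dequantizing the activations automatically.

As the printed results below show, the test loss of the quantized model is higher (0.0049 versus 0.0037).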

model w/o quantify:test loss is 0.0037
model w quantify: test loss is 0.0049
weights from w/o quantify model: Parameter containing:
tensor([[-0.0757, -0.2680, -1.1300],
        [-0.5582, -0.8738,  0.0955],
        [ 0.8691,  1.5451,  2.4926]], requires_grad=True)
weights from w quantify model: tensor([[ -4, -14, -58],
        [-29, -45,   5],
        [ 44,  79, 127]], dtype=torch.int8)
weights from w quantify and back model: tensor([[-0.0782, -0.2737, -1.1339],
        [-0.5669, -0.8797,  0.0977],
        [ 0.8602,  1.5444,  2.4828]], size=(3, 3), dtype=torch.qint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.019549556076526642,
       zero_point=0)
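The relationship between the three printed tensors can be checked by hand. The sketch below takes the FP32 weights, the scale, and the zero point from the output above and applies the usual affine formulas q = round(x / scale) + zero_point and x' = (q - zero_point) * scale. The clamping range [-128, 127] is the value range of `torch.qint8`; the helper names are illustrative.

```python
SCALE = 0.019549556076526642   # per-tensor scale from the printed output
ZERO_POINT = 0
QMIN, QMAX = -128, 127         # value range of torch.qint8

fp32_weights = [
    [-0.0757, -0.2680, -1.1300],
    [-0.5582, -0.8738,  0.0955],
    [ 0.8691,  1.5451,  2.4926],
]


def quantize(x):
    """FP32 -> INT8: divide by the scale, round, shift, clamp."""
    q = round(x / SCALE) + ZERO_POINT
    return max(QMIN, min(QMAX, q))


def dequantize(q):
    """INT8 -> FP32: undo the shift and multiply by the scale."""
    return (q - ZERO_POINT) * SCALE


int8_weights = [[quantize(w) for w in row] for row in fp32_weights]
restored = [[round(dequantize(q), 4) for q in row] for row in int8_weights]

print(int8_weights)   # matches the int8 tensor printed above
print(restored)       # matches the dequantized tensor printed above

# The printed scale is consistent with max|w| / 127.5 for this tensor.
print(max(abs(w) for row in fp32_weights for w in row) / 127.5)
```

The largest weight, 2.4926, lands slightly above 127 after division and is clamped, which is why it comes back as 2.4828 rather than its original value.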

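2 PTSQ – Post-Training Static Quantization

Looking at PTDQ more closely, two costs stand out. The input activations must be quantized on every call (the zero point and scale have to be found each time), which costs time, and every layer's result is dequantized to FP32 at the end, which costs memory.

Two remedies address these problems:

1 If we first run a forward pass over a batch of data and record the approximate activation scale of each layer, nothing has to be measured during inference.
2 If each layer performs the dequantization and requantization itself and hands an already quantized result to the next layer, memory is saved.

The steps are:

1 Quantize the trained model weights to INT8 and store the quantization parameters.
2 Calibration: run a forward pass on representative data and record the scale parameters of the activations.
3 In each layer, compute with the INT8 weights and the INT8 activations.
4 At each layer's output, dequantize the result to FP32 and immediately requantize it to INT8 with the calibrated scale.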
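Steps 2 and 4 can be sketched as follows. This is a minimal illustration that assumes a simple min/max observer and unsigned 8-bit affine quantization; the calibration values and the helper names (`calibrate`, `requantize`) are made up and do not correspond to a specific PyTorch API.

```python
def calibrate(samples, qmin=0, qmax=255):
    """Derive (scale, zero_point) from the min/max seen during calibration."""
    lo = min(0.0, min(samples))      # the range must contain 0 so that 0.0 is exact
    hi = max(0.0, max(samples))
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


def requantize(acc, in_scale, w_scale, out_scale, out_zero_point):
    """Turn an integer accumulator directly into the next layer's 8-bit input."""
    real_value = acc * in_scale * w_scale          # dequantize
    return quantize(real_value, out_scale, out_zero_point)


# Step 2: values observed at one layer's output during calibration (made up).
calibration_outputs = [-1.0, -0.2, 0.4, 1.3, 3.0]
scale, zero_point = calibrate(calibration_outputs)

# Step 4: at inference time the stored parameters are reused, nothing is re-measured.
layer_output = 2.2
q = quantize(layer_output, scale, zero_point)
print(scale, zero_point)
print(q, dequantize(q, scale, zero_point))
print(requantize(1761, 0.9 / 127, 1 / 127, scale, zero_point))
```

The quality of the result depends on how representative the calibration data is: an activation outside the observed range is clamped at inference time.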

The following PyTorch test code demonstrates this process:

import torch
import torch.nn as nn
torch.manual_seed(42)


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc1 = nn.Linear(3, 3)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(3, 2)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.dequant(x)
        return x


# 1 data
weights = torch.tensor([[1.12, 1.6], [2.24, 2.5], [3.6, 3.9]], dtype=torch.float)
train_x = torch.rand(size=(10000, 3))
train_y = train_x @ weights

test_x = torch.rand(size=(5000, 3))
test_y = test_x @ weights

calibration_x = torch.rand(size=(5000, 3))

# 2 model
model = Model()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
mse = torch.nn.MSELoss()

# 3 train
model.train()
for _ in range(100):
    optimizer.zero_grad()
    preds = model(train_x)
    loss = mse(preds, train_y)
    loss.backward()
    optimizer.step()

# 4 test
model.eval()
with torch.no_grad():
    preds = model(test_x)
    loss = mse(preds, test_y)
    print(f"model w/o quantify:test loss is {loss.item():.4f}")


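# 5 static quantization: attach a qconfig, insert observers, calibrate, convert
model.qconfig = torch.ao.quantization.get_default_qconfig("x86")  # prepare() reads `qconfig`
model_prepared = torch.ao.quantization.prepare(model)
with torch.no_grad():
    model_prepared(calibration_x)  # calibration pass: observers record activation ranges
model_8int = torch.ao.quantization.convert(model_prepared)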
with torch.no_grad():
    preds = model_8int(test_x)
    loss = mse(preds, test_y)
    print(f"model w quantify: test loss is {loss.item():.4f}")


print(f"weights from w/o quantify model: {model.fc1.weight}")
print(f"weights from w quantify model: {torch.int_repr(model_8int.fc1.weight())}")
print(f"weights from w quantify and back model: {model_8int.fc1.weight()}")

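Some details:

1 When building the model, the input must be converted to INT8 at the start of the model (`QuantStub`) and the output converted back to FP32 at the end (`DeQuantStub`).
2 Static quantization takes three steps:
- `prepare` inserts observers into the model according to its `qconfig` attribute.
- The prepared model runs one forward pass on the calibration data to record the activation scales.
- The prepared model is turned into the quantized model by `convert`.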

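Output recorded by the author. The two losses are identical and the quantized weights are never printed, which suggests that quantization did not take effect in this recorded run, so these numbers should not be read as the accuracy of the INT8 model:

model w/o quantify:test loss is 0.0062
model w quantify: test loss is 0.0062
weights from w/o quantify model: Parameter containing:
tensor([[ 0.1993,  0.4348,  1.5262],
        [ 0.8616,  1.5248,  2.0764],
        [-1.1172, -1.1279, -0.6258]], requires_grad=True)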

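3 QAT – Quantization-Aware Training

Quantization always introduces quantization error, and that error is what degrades the quantized model. Is there a way to reduce the quantization error through training?

The answer is simulated (fake) quantization.

During training, quantization is simulated so that the model adjusts its parameters to values that tolerate quantization better, which improves the accuracy of the quantized model.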
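As a rough sketch of what "simulated quantization" means: the forward pass rounds a value to the INT8 grid and immediately converts it back to floating point, so the rest of the network sees the quantization error while still working with floats. Because rounding has zero gradient almost everywhere, a common choice (assumed here, not stated in this article) is the straight-through estimator, which passes the gradient through unchanged inside the representable range. The function names are illustrative.

```python
def fake_quantize(x, scale, zero_point=0, qmin=-128, qmax=127):
    """Quantize, then immediately dequantize: a float that lies on the INT8 grid."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale


def ste_gradient(x, upstream_grad, scale, zero_point=0, qmin=-128, qmax=127):
    """Straight-through estimator: keep the gradient where x is representable,
    drop it where x was clipped to the edge of the range."""
    lo = (qmin - zero_point) * scale
    hi = (qmax - zero_point) * scale
    return upstream_grad if lo <= x <= hi else 0.0


scale = 0.02
for w in (0.123, -1.0, 5.0):
    # 0.123 is rounded to the grid, -1.0 is already on it, 5.0 is clipped.
    print(w, fake_quantize(w, scale), ste_gradient(w, 1.0, scale))
```

The loss computed during training therefore already contains the rounding and clipping error, and the optimizer can move the weights toward values that lose less when they are finally converted to INT8.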

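In code, QAT differs little from PTSQ. The difference lies in the prepare stage (`prepare_qat` instead of `prepare`), and the model must be prepared first and trained afterwards, rather than trained first and then prepared.

The code is as follows: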

import torch
import torch.nn as nn
torch.manual_seed(42)


class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc1 = nn.Linear(3, 3)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(3, 2)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.dequant(x)
        return x


# 1 data
weights = torch.tensor([[1.12, 1.6], [2.24, 2.5], [3.6, 3.9]], dtype=torch.float)
train_x = torch.rand(size=(10000, 3))
train_y = train_x @ weights

test_x = torch.rand(size=(5000, 3))
test_y = test_x @ weights

# 2 model
model = Model()
mse = torch.nn.MSELoss()
# prepare_qat reads the `qconfig` attribute and returns a copy with fake-quantize modules
model.qconfig = torch.ao.quantization.get_default_qat_qconfig("x86")
model_prepared = torch.ao.quantization.prepare_qat(model.train())
# the optimizer must update the prepared copy, not the original model
optimizer = torch.optim.Adam(model_prepared.parameters(), lr=0.1)

# 3 train (fake quantization is active in the forward pass)
model_prepared.train()
for _ in range(100):
    optimizer.zero_grad()
    preds = model_prepared(train_x)
    loss = mse(preds, train_y)
    loss.backward()
    optimizer.step()

# 4 test
model_prepared.eval()
with torch.no_grad():
    preds = model_prepared(test_x)
    loss = mse(preds, test_y)
    print(f"model w/o quantify:test loss is {loss.item():.4f}")

model_8int = torch.ao.quantization.convert(model_prepared)
with torch.no_grad():
    preds = model_8int(test_x)
    loss = mse(preds, test_y)
    print(f"model w quantify: test loss is {loss.item():.4f}")


print(f"weights from w/o quantify model: {model_prepared.fc1.weight}")
print(f"weights from w quantify model: {torch.int_repr(model_8int.fc1.weight())}")
print(f"weights from w quantify and back model: {model_8int.fc1.weight()}")

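Output recorded by the author. A loss of 18.4320 both before and after conversion, with no quantized weights printed, suggests that neither training nor quantization took effect in this recorded run, so these numbers should not be read as the accuracy of a QAT model:

model w/o quantify:test loss is 18.4320
model w quantify: test loss is 18.4320
weights from w/o quantify model: Parameter containing:
tensor([[ 0.5238,  0.3314, -0.5296],
        [ 0.2788, -0.1486, -0.0772],
        [ 0.4617,  0.5612, -0.1414]], requires_grad=True)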

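4 How to Train a Quantized Model

When a model whose weights are stored in 8 bits is trained with mixed precision, the following steps take place in between:

Weight decompression: before computation starts, the 8-bit quantized weights are decompressed, that is, converted back to FP16. The 8-bit format saves memory but may not be precise enough for the actual computation.
Computation: during training, computation runs on FP16 data. FP16 offers higher precision than 8 bits and copes better with the small changes that occur in gradient descent and other parts of training.
Gradient computation and update: mixed-precision training usually computes gradients in FP16 to speed up computation and updates the weights in FP32 to keep training stable and precise.
Requantization: when training ends or the model needs to be saved, the weights may be quantized back to 8 bits for storage and deployment.

In short, 8-bit quantization mainly optimizes storage and inference, while mixed-precision training uses higher-precision formats (FP16 or FP32) in the actual computation to keep training accurate and stable.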
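The four steps can be walked through on a toy example in plain Python. This is only a sketch under stated assumptions: the weights, the input, and the squared-error loss are made up, Python floats stand in for the FP32 master weights, and FP16 is simulated by rounding through the `struct` module's half-precision format.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def quantize_int8(weights):
    scale = (max(abs(w) for w in weights) or 1.0) / 127
    return [max(-127, min(127, round(w / scale))) for w in weights], scale


def dequantize(q_weights, scale):
    return [q * scale for q in q_weights]


def loss(weights, x, target):
    return (sum(w * xi for w, xi in zip(weights, x)) - target) ** 2


# Stored model: INT8 weights plus one scale (made-up values).
q_weights, scale = quantize_int8([0.50, -0.20, 0.10])
x, target, lr = [1.0, 2.0, 3.0], 1.0, 0.01

# 1. Weight decompression: INT8 -> floating-point master copy.
master = dequantize(q_weights, scale)
loss_before = loss(master, x, target)

# 2. Computation in (simulated) FP16.
w_fp16 = [to_fp16(w) for w in master]
pred = sum(w * xi for w, xi in zip(w_fp16, x))

# 3. Gradients in FP16, weight update on the higher-precision master copy.
grad = [to_fp16(2 * (pred - target) * xi) for xi in x]
master = [w - lr * g for w, g in zip(master, grad)]

# 4. Requantization for storage and deployment.
q_weights, scale = quantize_int8(master)
loss_after = loss(dequantize(q_weights, scale), x, target)
print(q_weights, scale)
print(loss_before, loss_after)
```

Keeping the master copy in higher precision matters because a single update of size `lr * grad` can be smaller than one INT8 step, so updating the INT8 values directly would often round the change away.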

By crabboss
