Implementing gradient descent and stochastic gradient descent by hand in PyTorch: logistic regression on MNIST and the effect of mini-batch size
Overview
This write-up grew out of the big course project for convex optimization.
Everything below is built around handwritten-digit recognition on MNIST with logistic regression,
used as a vehicle to explore logistic regression, gradient descent, stochastic gradient descent, and the role of mini-batches.
The core task is to implement gradient descent and stochastic gradient descent by hand, but the surrounding preparation also has to be done reasonably carefully.
Imports
import os
import torch
import torch.nn as nn
import torch.utils.data as Data
import torchvision
Loading the data
EPOCH = 1 # train the training data n times, to save time, we just train 1 epoch
BATCH_SIZE = 1
DOWNLOAD_MNIST = False
LR = 0.001
# Mnist digits dataset
if not (os.path.exists('./mnist/')) or not os.listdir('./mnist/'):
    # no mnist dir, or the mnist dir is empty: download
    DOWNLOAD_MNIST = True

train_data = torchvision.datasets.MNIST(
    root='./mnist/',
    train=True,  # this is the training data
    transform=torchvision.transforms.ToTensor(),
    download=DOWNLOAD_MNIST,
)
# Data loader for easy mini-batch return in training; each image batch has shape (BATCH_SIZE, 1, 28, 28)
train_loader = Data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE)  # , shuffle=True)
The sigmoid function
The sigmoid function maps any real number onto the interval (0, 1).
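For reference, the standard definition (the R → (0, 1) map used here) is

$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$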
The softmax function
Softmax turns n real scores into a probability distribution according to their relative sizes.
- In general, to avoid numerical overflow, one subtracts the maximum value before exponentiating.
- Here, however, the softmax input comes out of a sigmoid inside the logistic regression model, so every entry already lies in (0, 1) and cannot be large; overflow is not a concern.
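For reference, the standard softmax and its max-subtracted form (mathematically identical; the second is the numerically safe variant mentioned above) are

$$\operatorname{softmax}(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{n} e^{z_k}} = \frac{e^{z_j - \max_k z_k}}{\sum_{k=1}^{n} e^{z_k - \max_k z_k}}.$$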
The cross_Entropy function
cross_Entropy is simply the cross-entropy loss.
Once the ground-truth label is given, the target distribution p is fully determined:
it puts probability 1 on the labelled class and 0 on every other class.
The loss therefore reduces to the negative log of the predicted probability of the true class,
so minimizing it means making the predicted probability of the correct label as large as possible.
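Written out with the standard definition, for a predicted distribution $\hat{y}$ and a one-hot target $p$ this is

$$H(p, \hat{y}) = -\sum_{j=1}^{10} p_j \log \hat{y}_j = -\log \hat{y}_{\text{label}}.$$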
Task description
The model applies an affine map, then the sigmoid, then the softmax, and is trained against the true labels with the cross-entropy loss:

$$\min_{A,\,b}\ \frac{1}{N}\sum_{i=1}^{N} \ell\big(\operatorname{softmax}(\sigma(A x_i + b)),\ y_i\big)$$

where
- $\operatorname{softmax}(\cdot)$: softmax
- $\sigma(\cdot)$: sigmoid
- $\ell(\cdot,\cdot)$: cross_Entropy
- $y_i$: the true label of sample $x_i$
The SGD and GD algorithms
The implementation uses PyTorch mainly to avoid computing gradients by hand; PyTorch's autograd mechanism takes care of that.
A fixed step size is used throughout.
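Concretely, writing the parameters as $\theta = (A, b)$, both methods use the fixed-step update

$$\theta_{k+1} = \theta_k - \alpha\, g_k,$$

where $g_k$ is the gradient of the loss over the whole training set for GD, and over the current mini-batch (a single sample when the batch size is 1) for SGD.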
SGD
- batch = 1
- fixed step size alpha = 0.001
- final test accuracy: 0.836
- (figure: test accuracy over the course of training)
- (figure: distance of A and b from the optimal values, measured with the matrix 2-norm)
- Partial code of the SGD implementation:
From the logistic regression model, obtain the two parameters A and b:
A, b = [i for i in logits.parameters()]
A.cuda()  # no-op here: the parameters already live on the GPU because the model was built with .cuda()
b.cuda()
Looking at how the optimizers are implemented in the PyTorch source, the gradients have to be zeroed by hand before each backward pass; otherwise they accumulate across iterations (a small standalone check of this is sketched right after these snippets).
if A.grad is not None:
    A.grad.zero_()
    b.grad.zero_()
- Gradient-descent update of the parameters (a plain fixed-step update on .data):
A.data = A.data - alpha * A.grad.data
b.data = b.data - alpha * b.grad.data
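As a quick standalone check of the accumulation behaviour mentioned above (illustrative only, not part of the training scripts): calling backward() twice without zeroing adds the new gradient onto the old one.
import torch

w = torch.ones(2, requires_grad=True)
(w * w).sum().backward()
print(w.grad)   # tensor([2., 2.])
(w * w).sum().backward()
print(w.grad)   # tensor([4., 4.])  -- accumulated, not replaced
w.grad.zero_()  # the manual zeroing used in the training loops
print(w.grad)   # tensor([0., 0.])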
Full code
import os
import torch
import torch.nn as nn
import torch.utils.data as Data
import torchvision
import matplotlib.pyplot as plt

EPOCH = 5  # number of passes over the training data
BATCH_SIZE = 1
DOWNLOAD_MNIST = False
LR = 0.001  # unused here; alpha below is the manual step size

# Mnist digits dataset
if not (os.path.exists('./mnist/')) or not os.listdir('./mnist/'):
    # no mnist dir, or the mnist dir is empty: download
    DOWNLOAD_MNIST = True

train_data = torchvision.datasets.MNIST(
    root='./mnist/',
    train=True,  # this is the training data
    transform=torchvision.transforms.ToTensor(),
    download=DOWNLOAD_MNIST,
)
# Data loader for easy mini-batch return in training; each image batch has shape (BATCH_SIZE, 1, 28, 28)
train_loader = Data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)


class Logits(nn.Module):
    def __init__(self):
        super(Logits, self).__init__()
        self.linear = nn.Linear(28 * 28, 10)
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.linear(x)
        x = self.sigmoid(x)
        x = self.softmax(x)
        return x


test_data = torchvision.datasets.MNIST(root='./mnist/', train=False)
test_x = torch.unsqueeze(test_data.test_data, dim=1).type(
    torch.FloatTensor).cuda() / 255.  # shape from (10000, 28, 28) to (10000, 1, 28, 28), values in (0, 1)
test_y = test_data.test_labels

alpha = 0.001  # fixed step size
logits = Logits().cuda()
# optimizer = torch.optim.SGD(logits.parameters(), lr=LR)  # the built-in optimizer is deliberately not used
# optimizer.zero_grad()
loss_func = nn.CrossEntropyLoss()  # the target label is not one-hotted

Accurate = []
Astore = []
bstore = []
A, b = [i for i in logits.parameters()]
A.cuda()  # no-op: the parameters already live on the GPU
b.cuda()

for e in range(EPOCH):
    for step, (x, b_y) in enumerate(train_loader):  # gives batch data
        b_x = x.view(-1, 28 * 28).cuda()  # flatten each image to a 784-dimensional row vector
        b_y = b_y.cuda()
        output = logits(b_x)  # model output
        loss = loss_func(output, b_y)  # cross entropy loss
        if A.grad is not None:  # zero the gradients by hand, otherwise they accumulate
            A.grad.zero_()
            b.grad.zero_()
        loss.backward()  # backpropagation, compute gradients
        A.data = A.data - alpha * A.grad.data  # fixed-step gradient update
        b.data = b.data - alpha * b.grad.data
        if step % 1500 == 0:
            test_output = logits(test_x.view(-1, 28 * 28))
            pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()
            Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
            print(Accurate[-1])
            # parameter snapshots; the .data update above rebinds the tensor, so these keep the old values
            Astore.append(A.detach())
            bstore.append(b.detach())

test_output = logits(test_x.view(-1, 28 * 28))
pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()
print(pred_y, 'prediction number')
print(test_y, 'real number')
Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
print(Accurate[-1])

# turn the stored snapshots into distances from the final iterate (matrix 2-norm)
for i in range(len(Astore)):
    Astore[i] = (Astore[i] - Astore[-1]).norm().item()
    bstore[i] = (bstore[i] - bstore[-1]).norm().item()

plt.plot(Astore, label='A')
plt.plot(bstore, label='b')
plt.legend()
plt.show()
plt.cla()
plt.plot(Accurate)
plt.show()
GD
Setting BATCH_SIZE to 60000 (the size of the MNIST training set) turns the method into full-batch gradient descent.
- Here the step size should not be too small (GD uses alpha = 0.05).
Everything else stays essentially the same. Since the computation runs on the GPU and there is effectively only one batch, the data set is pulled out of the loader once up front; this avoids re-reading the MNIST loader and re-uploading the data to the GPU on every iteration.
In addition, EPOCH is set to 5000.
- On a GPU the whole run finishes quickly.
import os
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.utils.data as Data
import torchvision

EPOCH = 5000  # number of full-gradient iterations
BATCH_SIZE = 60000  # the whole MNIST training set: full-batch gradient descent
DOWNLOAD_MNIST = False

# Mnist digits dataset
if not (os.path.exists('./mnist/')) or not os.listdir('./mnist/'):
    # no mnist dir, or the mnist dir is empty: download
    DOWNLOAD_MNIST = True

train_data = torchvision.datasets.MNIST(
    root='./mnist/',
    train=True,  # this is the training data
    transform=torchvision.transforms.ToTensor(),
    download=DOWNLOAD_MNIST,
)
# Data loader; here a single "batch" of shape (60000, 1, 28, 28) holds the whole training set
train_loader = Data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)


class Logits(nn.Module):
    def __init__(self):
        super(Logits, self).__init__()
        self.linear = nn.Linear(28 * 28, 10)
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.linear(x)
        x = self.sigmoid(x)
        x = self.softmax(x)
        return x


test_data = torchvision.datasets.MNIST(root='./mnist/', train=False)
test_x = torch.unsqueeze(test_data.test_data, dim=1).type(
    torch.FloatTensor).cuda() / 255.  # shape from (10000, 28, 28) to (10000, 1, 28, 28), values in (0, 1)
test_y = test_data.test_labels

alpha = 0.05  # fixed step size; larger than the one used for SGD
logits = Logits().cuda()
# optimizer = torch.optim.SGD(logits.parameters(), lr=LR)  # the built-in optimizer is deliberately not used
# optimizer.zero_grad()
loss_func = nn.CrossEntropyLoss()  # the target label is not one-hotted

Accurate = []
Astore = []
bstore = []
A, b = [i for i in logits.parameters()]
A.cuda()  # no-op: the parameters already live on the GPU
b.cuda()

# materialize the single full batch once, so the loader is not traversed on every iteration
x, b_y = [(i, j) for i, j in train_loader][0]
b_x = x.view(-1, 28 * 28).cuda()  # flatten each image to a 784-dimensional row vector
b_y = b_y.cuda()

for e in range(EPOCH):
    output = logits(b_x)  # model output
    loss = loss_func(output, b_y)  # cross entropy loss
    if A.grad is not None:  # zero the gradients by hand, otherwise they accumulate
        A.grad.zero_()
        b.grad.zero_()
    loss.backward()  # backpropagation, compute gradients
    A.data = A.data - alpha * A.grad.data  # fixed-step gradient update
    b.data = b.data - alpha * b.grad.data
    test_output = logits(test_x.view(-1, 28 * 28))
    # print(e)
    if e % 10 == 0:
        pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()
        Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
        print(e, Accurate[-1])
        # parameter snapshots; the .data update above rebinds the tensor, so these keep the old values
        Astore.append(A.detach())
        bstore.append(b.detach())

test_output = logits(test_x.view(-1, 28 * 28))
pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()
print(pred_y, 'prediction number')
print(test_y, 'real number')
Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
print(Accurate[-1])

# turn the stored snapshots into distances from the final iterate (matrix 2-norm)
for i in range(len(Astore)):
    Astore[i] = (Astore[i] - Astore[-1]).norm().item()
    bstore[i] = (bstore[i] - bstore[-1]).norm().item()

plt.plot(Astore, label='A')
plt.plot(bstore, label='b')
plt.legend()
plt.show()
plt.cla()
plt.plot(Accurate)
plt.show()
Exploring the batch size
Note that when the batch size is fairly large (as in the GD run above), choosing the step size becomes quite demanding (real hyper-parameter tuning, haha).
- With batch size 1, SGD already reaches high accuracy around the 25th evaluation point; with the evaluation interval of 1500 steps, that is roughly 25 × 1500 = 37,500 training samples.
- With batch size 20 and an evaluation interval of 500 steps, comparable accuracy only appears around the 100th evaluation point, i.e. roughly 100 × 500 × 20 = 1,000,000 samples.
Combined with the earlier GD result, this suggests the mini-batch size should not be too large. The batch size is therefore reduced further; that experiment follows the code below.
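To make the comparison explicit, here is a tiny helper (the function and variable names are mine, purely for illustration) that computes how many training samples have been processed up to a given evaluation point:
def samples_seen(eval_index, eval_interval, batch_size):
    # each evaluation happens every `eval_interval` steps, and each step consumes `batch_size` samples
    return eval_index * eval_interval * batch_size

print(samples_seen(25, 1500, 1))    # 37500    -> SGD with batch size 1
print(samples_seen(100, 500, 20))   # 1000000  -> mini-batch SGD with batch size 20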
import os
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.utils.data as Data
import torchvision

EPOCH = 100  # number of passes over the training data
BATCH_SIZE = 20
DOWNLOAD_MNIST = False
LR = 0.001  # unused here; alpha below is the manual step size

# Mnist digits dataset
if not (os.path.exists('./mnist/')) or not os.listdir('./mnist/'):
    # no mnist dir, or the mnist dir is empty: download
    DOWNLOAD_MNIST = True

train_data = torchvision.datasets.MNIST(
    root='./mnist/',
    train=True,  # this is the training data
    transform=torchvision.transforms.ToTensor(),
    download=DOWNLOAD_MNIST,
)
# Data loader for easy mini-batch return in training; each image batch has shape (BATCH_SIZE, 1, 28, 28)
train_loader = Data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)


class Logits(nn.Module):
    def __init__(self):
        super(Logits, self).__init__()
        self.linear = nn.Linear(28 * 28, 10)
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.linear(x)
        x = self.sigmoid(x)
        x = self.softmax(x)
        return x


test_data = torchvision.datasets.MNIST(root='./mnist/', train=False)
test_x = torch.unsqueeze(test_data.test_data, dim=1).type(
    torch.FloatTensor).cuda() / 255.  # shape from (10000, 28, 28) to (10000, 1, 28, 28), values in (0, 1)
test_y = test_data.test_labels

alpha = 0.001  # fixed step size
logits = Logits().cuda()
# optimizer = torch.optim.SGD(logits.parameters(), lr=LR)  # the built-in optimizer is deliberately not used
# optimizer.zero_grad()
loss_func = nn.CrossEntropyLoss()  # the target label is not one-hotted

Accurate = []
Astore = []
bstore = []
A, b = [i for i in logits.parameters()]
A.cuda()  # no-op: the parameters already live on the GPU
b.cuda()

# cache all mini-batches once so the loader is not traversed on every epoch
data = [(step, (x, b_y)) for step, (x, b_y) in enumerate(train_loader)]

for e in range(EPOCH):
    for step, (x, b_y) in data:  # gives batch data
        b_x = x.view(-1, 28 * 28).cuda()  # flatten each image to a 784-dimensional row vector
        b_y = b_y.cuda()
        output = logits(b_x)  # model output
        loss = loss_func(output, b_y)  # cross entropy loss
        if A.grad is not None:  # zero the gradients by hand, otherwise they accumulate
            A.grad.zero_()
            b.grad.zero_()
        loss.backward()  # backpropagation, compute gradients
        A.data = A.data - alpha * A.grad.data  # fixed-step gradient update
        b.data = b.data - alpha * b.grad.data
        if step % 500 == 0:
            test_output = logits(test_x.view(-1, 28 * 28))
            pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()
            Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
            print(Accurate[-1])
            # parameter snapshots
            Astore.append(A.detach())
            bstore.append(b.detach())

test_output = logits(test_x.view(-1, 28 * 28))
pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()
print(pred_y, 'prediction number')
print(test_y, 'real number')
Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
print(Accurate[-1])

# turn the stored snapshots into distances from the final iterate (matrix 2-norm)
for i in range(len(Astore)):
    Astore[i] = (Astore[i] - Astore[-1]).norm().item()
    bstore[i] = (bstore[i] - bstore[-1]).norm().item()

plt.plot(Astore, label='A')
plt.plot(bstore, label='b')
plt.legend()
plt.show()
plt.cla()
plt.plot(Accurate)
plt.show()
- Here the batch size is set to 8.
- With batch size 8 and an evaluation interval of 2000 steps, comparable accuracy appears around the 20th evaluation point, i.e. about 20 × 2000 × 8 = 320,000 samples, much better than batch size 20.
- Similarly, with batch size 4 convergence speeds up a little more (the mini-batch really does want to be mini, haha).
import os
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.utils.data as Data
import torchvision

EPOCH = 20  # number of passes over the training data
BATCH_SIZE = 8
DOWNLOAD_MNIST = False
LR = 0.001  # unused here; alpha below is the manual step size

# Mnist digits dataset
if not (os.path.exists('./mnist/')) or not os.listdir('./mnist/'):
    # no mnist dir, or the mnist dir is empty: download
    DOWNLOAD_MNIST = True

train_data = torchvision.datasets.MNIST(
    root='./mnist/',
    train=True,  # this is the training data
    transform=torchvision.transforms.ToTensor(),
    download=DOWNLOAD_MNIST,
)
# Data loader for easy mini-batch return in training; each image batch has shape (BATCH_SIZE, 1, 28, 28)
train_loader = Data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)


class Logits(nn.Module):
    def __init__(self):
        super(Logits, self).__init__()
        self.linear = nn.Linear(28 * 28, 10)
        self.sigmoid = nn.Sigmoid()
        self.softmax = nn.Softmax(dim=1)

    def forward(self, x):
        x = self.linear(x)
        x = self.sigmoid(x)
        x = self.softmax(x)
        return x


test_data = torchvision.datasets.MNIST(root='./mnist/', train=False)
test_x = torch.unsqueeze(test_data.test_data, dim=1).type(
    torch.FloatTensor).cuda() / 255.  # shape from (10000, 28, 28) to (10000, 1, 28, 28), values in (0, 1)
test_y = test_data.test_labels

alpha = 0.001  # fixed step size
logits = Logits().cuda()
# optimizer = torch.optim.SGD(logits.parameters(), lr=LR)  # the built-in optimizer is deliberately not used
# optimizer.zero_grad()
loss_func = nn.CrossEntropyLoss()  # the target label is not one-hotted

Accurate = []
Astore = []
bstore = []
A, b = [i for i in logits.parameters()]
A.cuda()  # no-op: the parameters already live on the GPU
b.cuda()

# cache all mini-batches once, already flattened to (BATCH_SIZE, 784), so the loader is not traversed every epoch
data = [(step, (x.view(-1, 28 * 28), b_y)) for step, (x, b_y) in enumerate(train_loader)]

for e in range(EPOCH):
    for step, (x, b_y) in data:  # gives batch data
        b_x = x.cuda()  # images were already flattened when the batches were cached
        b_y = b_y.cuda()
        output = logits(b_x)  # model output
        loss = loss_func(output, b_y)  # cross entropy loss
        if A.grad is not None:  # zero the gradients by hand, otherwise they accumulate
            A.grad.zero_()
            b.grad.zero_()
        loss.backward()  # backpropagation, compute gradients
        A.data = A.data - alpha * A.grad.data  # fixed-step gradient update
        b.data = b.data - alpha * b.grad.data
        if step % 2000 == 0:
            test_output = logits(test_x.view(-1, 28 * 28))
            pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()
            Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
            print(e, Accurate[-1])
            # parameter snapshots
            Astore.append(A.detach())
            bstore.append(b.detach())

test_output = logits(test_x.view(-1, 28 * 28))
pred_y = torch.max(test_output, 1)[1].cuda().data.squeeze()
print(pred_y, 'prediction number')
print(test_y, 'real number')
Accurate.append(sum(test_y.cpu().numpy() == pred_y.cpu().numpy()) / (1.0 * len(test_y.cpu().numpy())))
print(Accurate[-1])

# turn the stored snapshots into distances from the final iterate (matrix 2-norm)
for i in range(len(Astore)):
    Astore[i] = (Astore[i] - Astore[-1]).norm().item()
    bstore[i] = (bstore[i] - bstore[-1]).norm().item()

plt.plot(Astore, label='A')
plt.plot(bstore, label='b')
plt.legend()
plt.show()
plt.cla()
plt.plot(Accurate)
plt.show()
- batch size = 4
- The evaluation interval is set to 4000 steps, so the plot indices here line up with the batch size 8 plot above; training is visibly faster.
- Perhaps because the algorithm is still rather crude (a fixed step size), the smaller the batch the better the results here. In practice, though, the batch size should be moderate; something like batch = 8 is a typical choice.