Logistic Regression Model

The binomial logistic regression model is a classification model represented by the conditional probability distribution $P(Y|X)$, which takes the form of a parameterized logistic distribution.

I. Model Definition

The model is the following conditional probability distribution:

$$P(Y=1|X)=\dfrac{e^{w\cdot x+b}}{1+e^{w\cdot x+b}}$$
$$P(Y=0|X)=1-P(Y=1|X)=\dfrac{1}{1+e^{w\cdot x+b}}$$

Here $x\in R^n$ and $Y\in \{0, 1\}$; $w \in R^n$ and $b\in R$ are parameters, where $w$ is called the weight and $b$ the bias.

Given an input instance $x$, compute $P(Y=1|x)$ and $P(Y=0|x)$, compare the two conditional probabilities, and assign $x$ to the class with the larger probability.

For convenience, extend the weight vector and the input vector by setting $w_0=b$ and $x_0=1$:

$$w=(w_0,w_1, w_2, ..., w_n)^T$$
$$x=(x_0, x_1, x_2, ..., x_n)^T$$

With this extension, the model above becomes:

$$P(Y=1|X)=\dfrac{e^{w\cdot x}}{1+e^{w\cdot x}}$$
$$P(Y=0|X)=1-P(Y=1|X)=\dfrac{1}{1+e^{w\cdot x}}$$
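
The model above translates directly into code. Below is a minimal Python/NumPy sketch; the helper names `sigmoid`, `predict_proba`, and `predict` are illustrative, and `x` is assumed to already be augmented with $x_0=1$:

```python
import numpy as np

def sigmoid(z):
    """theta(z) = e^z / (1 + e^z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, x):
    """Return (P(Y=1|x), P(Y=0|x)) for augmented x (x_0 = 1) and w (w_0 = b)."""
    p1 = sigmoid(np.dot(w, x))
    return p1, 1.0 - p1

def predict(w, x):
    """Assign x to the class with the larger conditional probability."""
    p1, p0 = predict_proba(w, x)
    return 1 if p1 >= p0 else 0
```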

II. Odds

In statistics and probability theory, the odds of an event or proposition is the ratio of the probability that it occurs to the probability that it does not:

$$odds(p)=\dfrac{p}{1-p}$$

where $p$ is the probability that the event occurs; $odds(p)$ is an increasing function of $p$.

For example, if a person randomly chooses one of the 7 days of the week, the odds of choosing Sunday are $\dfrac{1/7}{1-1/7}=1/6$, while the odds of not choosing Sunday are $6/1$.

Taking the logarithm of the odds (the log of odds), i.e. $\log\dfrac{p}{1-p}$, gives the log-odds, usually written $logit(p)$ in the mathematical literature:

$$logit(p)=\log\dfrac{p}{1-p}$$

This function is also increasing in $p$.
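
As a quick numeric check of the Sunday example above, a small sketch (the `odds` and `logit` helpers are illustrative names, not standard library functions):

```python
import math

def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)

def logit(p):
    """Log-odds of p."""
    return math.log(p / (1.0 - p))

p_sunday = 1.0 / 7.0
print(odds(p_sunday))        # 0.1666... = 1/6
print(odds(1.0 - p_sunday))  # 6.0
print(logit(p_sunday))       # -1.79..., increasing in p
```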

In logistic regression, for an instance $x$:

$$\log\dfrac{p}{1-p}=\log\dfrac{P(Y=1|x)}{1-P(Y=1|x)}=w\cdot x$$

That is, the log-odds of the output $Y=1$ for instance $x$ is a linear function of $x$.

III. Maximum Likelihood Estimation

Given a training set $T=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\}$, where $x^{(i)}=(1, x^{(i)}_1, x^{(i)}_2, ..., x^{(i)}_n)^T\in X= R^{n+1}$ and $y^{(i)}\in Y=\{0, 1\}$, the model parameters can be estimated by maximum likelihood, yielding the logistic regression model.

Let $P(Y=1|x)=\pi(x)=\dfrac{e^{w\cdot x}}{1+e^{w\cdot x}}$ and $P(Y=0|x)=1-\pi(x)=\dfrac{1}{1+e^{w\cdot x}}$.

Then the likelihood function is:

$$\prod_{i=1}^m[\pi(x^{(i)})]^{y^{(i)}}[1-\pi(x^{(i)})]^{1-y^{(i)}}$$

and the log-likelihood function is:

$$L(w)=\sum_{i=1}^m\left[y^{(i)}\ln\pi(x^{(i)})+(1-y^{(i)})\ln(1-\pi(x^{(i)}))\right]$$
$$=\sum_{i=1}^m\left[y^{(i)}\ln\dfrac{\pi(x^{(i)})}{1-\pi(x^{(i)})}+\ln(1-\pi(x^{(i)}))\right]$$
$$=\sum_{i=1}^m\left[y^{(i)}(w\cdot x^{(i)})-\ln(1+e^{w\cdot x^{(i)}})\right]$$

This function is differentiable to arbitrary order. Maximizing $L(w)$, i.e. making the probability of every sample as large as possible, yields the estimate of $w$.
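
A minimal sketch of this log-likelihood in its simplified form, assuming NumPy, with `X` holding the augmented $x^{(i)}$ as rows and `y` the 0/1 labels (names are illustrative):

```python
import numpy as np

def log_likelihood(w, X, y):
    """L(w) = sum_i [ y_i * (w . x_i) - ln(1 + e^{w . x_i}) ]."""
    z = X @ w                                # z_i = w . x^{(i)}
    # np.logaddexp(0, z) computes ln(1 + e^z) without overflowing for large z
    return np.sum(y * z - np.logaddexp(0.0, z))
```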

Converting this to a minimization problem, i.e. minimizing the negative log-likelihood (still written $L(w)$ below):

$$\min_{w}\ L(w)=-\sum_{i=1}^m\left[y^{(i)}(w\cdot x^{(i)})-\ln(1+e^{w\cdot x^{(i)}})\right]$$

The learning problem is thus an optimization problem with the (negative) log-likelihood as objective; logistic regression is usually trained with gradient descent or quasi-Newton methods.

Computing the gradient:

$$\dfrac{\partial L(w)}{\partial w_j}=-\dfrac{\partial}{\partial w_j}\sum_{i=1}^m\left[y^{(i)}(w\cdot x^{(i)})-\ln(1+e^{w\cdot x^{(i)}})\right]$$
$$=-\sum_{i=1}^m y^{(i)}x^{(i)}_j+\sum_{i=1}^m\dfrac{\partial \ln(1+e^{w\cdot x^{(i)}})}{\partial w_j}$$
$$=-\sum_{i=1}^m y^{(i)}x^{(i)}_j+\sum_{i=1}^m\dfrac{1}{1+e^{w\cdot x^{(i)}}}\dfrac{\partial e^{w\cdot x^{(i)}}}{\partial w_j}$$
$$=-\sum_{i=1}^m y^{(i)}x^{(i)}_j+\sum_{i=1}^m\dfrac{e^{w\cdot x^{(i)}}}{1+e^{w\cdot x^{(i)}}}x^{(i)}_j$$
$$=\sum_{i=1}^m\Big(\dfrac{e^{w\cdot x^{(i)}}}{1+e^{w\cdot x^{(i)}}}-y^{(i)}\Big)x^{(i)}_j$$
$$=\sum_{i=1}^m\big(\theta(w\cdot x^{(i)})-y^{(i)}\big)x^{(i)}_j$$

where $\theta(x)=\dfrac{e^{x}}{1+e^{x}}=\dfrac{1}{1+e^{-x}}$ is the sigmoid function. Collecting the components gives:

$$\nabla L(w)=\sum_{i=1}^m\big(\theta(w\cdot x^{(i)})-y^{(i)}\big)x^{(i)}$$
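
A direct, sample-by-sample translation of this sum might look like the following sketch (with `sigmoid` playing the role of $\theta$; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_loop(w, X, y):
    """grad L(w) = sum_i ( theta(w . x^{(i)}) - y^{(i)} ) * x^{(i)}."""
    grad = np.zeros_like(w, dtype=float)
    for x_i, y_i in zip(X, y):
        grad += (sigmoid(np.dot(w, x_i)) - y_i) * x_i
    return grad
```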

Let:

$$X=\begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ (x^{(3)})^T \\ \vdots \\ (x^{(m)})^T \end{bmatrix}=\begin{bmatrix} 1 & x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_n \\ 1 & x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_n \\ 1 & x^{(3)}_1 & x^{(3)}_2 & \cdots & x^{(3)}_n \\ \vdots & & & & \vdots \\ 1 & x^{(m)}_1 & x^{(m)}_2 & \cdots & x^{(m)}_n \end{bmatrix},\quad Y=\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ y^{(3)} \\ \vdots \\ y^{(m)} \end{bmatrix},\quad w=\begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}$$

Then:

$$X\cdot w=\begin{bmatrix} 1 & x^{(1)}_1 & x^{(1)}_2 & \cdots & x^{(1)}_n \\ 1 & x^{(2)}_1 & x^{(2)}_2 & \cdots & x^{(2)}_n \\ 1 & x^{(3)}_1 & x^{(3)}_2 & \cdots & x^{(3)}_n \\ \vdots & & & & \vdots \\ 1 & x^{(m)}_1 & x^{(m)}_2 & \cdots & x^{(m)}_n \end{bmatrix}\cdot\begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}=\begin{bmatrix} (x^{(1)})^T\cdot w \\ (x^{(2)})^T\cdot w \\ (x^{(3)})^T\cdot w \\ \vdots \\ (x^{(m)})^T\cdot w \end{bmatrix}=\begin{bmatrix} w^T\cdot x^{(1)} \\ w^T\cdot x^{(2)} \\ w^T\cdot x^{(3)} \\ \vdots \\ w^T\cdot x^{(m)} \end{bmatrix}$$
$$\theta(X\cdot w)-Y=\begin{bmatrix} \theta(w^T\cdot x^{(1)})-y^{(1)} \\ \theta(w^T\cdot x^{(2)})-y^{(2)} \\ \theta(w^T\cdot x^{(3)})-y^{(3)} \\ \vdots \\ \theta(w^T\cdot x^{(m)})-y^{(m)} \end{bmatrix}$$
$$X^T=\begin{bmatrix} x^{(1)} & x^{(2)} & x^{(3)} & \cdots & x^{(m)} \end{bmatrix}$$
$$X^T\cdot\big(\theta(X\cdot w)-Y\big)=\sum_{i=1}^m\big(\theta(w\cdot x^{(i)})-y^{(i)}\big)x^{(i)}$$

Finally:

$$\nabla L(w)=X^T\cdot\big(\theta(X\cdot w)-Y\big)$$

The objective itself can likewise be written in matrix form:

$$L(w)=-\sum_{i=1}^m\left[y^{(i)}(w\cdot x^{(i)})-\ln(1+e^{w\cdot x^{(i)}})\right]=-(X\cdot w)^T\cdot Y+\ln(1+e^{X\cdot w})\cdot I$$

where $I$ is the all-ones vector (so the second term sums $\ln(1+e^{w\cdot x^{(i)}})$ over all samples).
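
In matrix form, the gradient and the objective each become a one-liner. A NumPy sketch, assuming `X` is the $m\times(n+1)$ design matrix and `y` the label vector (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient(w, X, y):
    """grad L(w) = X^T (theta(X w) - Y)."""
    return X.T @ (sigmoid(X @ w) - y)

def loss(w, X, y):
    """L(w) = -(X w)^T Y + ln(1 + e^{X w}) . I  (dot with the all-ones vector)."""
    z = X @ w
    return -(z @ y) + np.sum(np.logaddexp(0.0, z))
```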

IV. Gradient Descent

1. Batch Gradient Descent

Input: training set $T=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\}$, where $x^{(i)}\in X= R^n$, $y^{(i)}\in Y=\{0,1\}$, $i=1,2,...,m$; learning rate $\eta$ ($0<\eta\leqslant1$);

Output: $w$ and $b$, where $w=(w_1, w_2, ..., w_n)^T$; the model is $P(y=1|x)=sigmoid(w\cdot x+b)$.

1) Convert each input $x$ into $x=(1, x_1, x_2,...,x_n)$ and let $w_0=b$, so the output becomes $w=(w_0, w_1, w_2, ..., w_n)^T$;

2) Choose an initial value $w^{(0)}=(w_0, w_1, w_2, ..., w_n)^T$;

3) Compute the gradient $X^T\cdot\big(\theta(X\cdot w^{(j)})-Y\big)$, where $w^{(j)}$ is the result of the $j$-th iteration; the $(j+1)$-th iterate is then:

$$w^{(j+1)} \gets w^{(j)} - \eta\, X^T\cdot\big(\theta(X\cdot w^{(j)})-Y\big)$$

4) Return to step 3) until $L(w)$ satisfies a stopping criterion or a sufficient number of iterations has been performed.

In batch gradient descent, every iteration uses all of the samples, so the computation becomes expensive when the sample size is large.

Time complexity:

Each iteration computes $X\cdot w^{(j)}=Y^{'}$ in $m\times n$ operations, $\theta(Y^{'})-Y = Z$ in $m$ operations, and $X^T \cdot Z$ in $m\times n$ operations, so the cost per iteration is $O(m\times n)$; with $k$ iterations, the total time complexity is $O(k\times m\times n)$.
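
Putting steps 1) through 4) together, a minimal batch gradient descent sketch follows. The fixed iteration budget and constant learning rate are simplifying assumptions; in practice the stopping rule and $\eta$ need tuning, and the summed gradient is often divided by $m$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_gradient_descent(X_raw, y, eta=0.01, num_iters=1000):
    """Return the augmented weight vector w = (b, w_1, ..., w_n)."""
    m, n = X_raw.shape
    X = np.hstack([np.ones((m, 1)), X_raw])   # step 1): prepend x_0 = 1
    w = np.zeros(n + 1)                        # step 2): initial w^(0)
    for _ in range(num_iters):                 # step 4): fixed iteration budget
        grad = X.T @ (sigmoid(X @ w) - y)      # step 3): X^T (theta(X w^(j)) - Y)
        w = w - eta * grad
    return w                                   # b = w[0], weights = w[1:]
```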

2. Stochastic Gradient Descent

Replace step 3) above with:

3) Randomly pick a sample $x^{(i)}$; then:

$$w^{(j+1)} \gets w^{(j)}-\eta\,\big(\theta(w^{(j)}\cdot x^{(i)})-y^{(i)}\big)x^{(i)}$$

Iterate until a sufficient number of iterations has been performed.

Time complexity:

Each iteration computes $w^{(j)}\cdot x^{(i)}=y^{'}$ in $n$ operations, $\theta(y^{'})-y^{(i)}=z$ in $1$ operation, and $z\,x^{(i)}$ in $n$ operations, so the cost per iteration is $O(n)$; with $k$ iterations, the total time complexity is $O(k\times n)$.
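
The stochastic variant only changes the per-step update; a sketch under the same assumptions as the batch version:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_gradient_descent(X_raw, y, eta=0.01, num_iters=10000, seed=0):
    """Return the augmented weight vector w = (b, w_1, ..., w_n)."""
    rng = np.random.default_rng(seed)
    m, n = X_raw.shape
    X = np.hstack([np.ones((m, 1)), X_raw])
    w = np.zeros(n + 1)
    for _ in range(num_iters):
        i = rng.integers(m)                            # pick a random sample x^(i)
        grad_i = (sigmoid(X[i] @ w) - y[i]) * X[i]     # O(n) per-sample gradient
        w = w - eta * grad_i
    return w
```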

References:

https://zh.wikipedia.org/wiki/发生比

http://vividfree.github.io/机器学习/2015/12/13/understanding-logistic-regression-using-odds

http://blog.csdn.net/bitcarmanlee/article/details/51473567
