Hung-yi Lee Machine Learning Notes, Regression: Case Study (Oct 30, 2020)
1. Model
1.1 Linear model
$y = b + \sum_i w_i x_i$, where $w_i$ is a weight, $b$ is the bias, and $x_i$ is an input feature.
1.2 Cost function
$L(f) = L(w,b) = \sum_{n=1}^{N} \left(\hat y^n - (b + \sum_i w_i x_i^n)\right)^2$. The cost function is the squared difference between the true value and the predicted value, summed over the training examples.
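A minimal sketch of this cost function in numpy; the data here is a made-up toy set (roughly $y = 1 + 2x$) rather than the Pokémon CP data from the lecture:

```python
import numpy as np

# Hypothetical toy data: y_hat are the observed targets, x the inputs.
x = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([3.1, 4.9, 7.2, 8.8])  # roughly y = 1 + 2x

def cost(w, b, x, y_hat):
    """L(w, b) = sum over n of (y_hat^n - (b + w * x^n))^2."""
    pred = b + w * x
    return np.sum((y_hat - pred) ** 2)

# Parameters close to the truth give a small loss; bad ones give a large loss.
good = cost(2.0, 1.0, x, y_hat)
bad = cost(0.0, 0.0, x, y_hat)
```

A smaller $L(w,b)$ means the corresponding function fits the training data better.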
1.3 Best function
$f^* = \arg\min_f L(f)$, i.e. $w^*, b^* = \arg\min_{w,b} L(w,b) = \arg\min_{w,b} \sum_{n=1}^{N} \left(\hat y^n - (b + \sum_i w_i x_i^n)\right)^2$. Among all candidate functions, pick the one that minimizes the cost function.
1.4 Gradient descent
$\frac{\partial L}{\partial w} = \sum_n 2\left(\hat y^n - (b + w x^n)\right)(-x^n)$ and $\frac{\partial L}{\partial b} = \sum_n 2\left(\hat y^n - (b + w x^n)\right)(-1)$. Evaluate $\frac{\partial L}{\partial w}\big|_{w=w^0,\,b=b^0}$ and $\frac{\partial L}{\partial b}\big|_{w=w^0,\,b=b^0}$, then update $w^1 = w^0 - \eta \frac{\partial L}{\partial w}\big|_{w=w^0,\,b=b^0}$ and $b^1 = b^0 - \eta \frac{\partial L}{\partial b}\big|_{w=w^0,\,b=b^0}$, and repeat this iteration.
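The update rule above can be sketched as a plain gradient-descent loop. The data and learning rate here are illustrative assumptions (same toy set as before, roughly $y = 1 + 2x$):

```python
import numpy as np

# Hypothetical toy data, roughly y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([3.1, 4.9, 7.2, 8.8])

w, b = 0.0, 0.0   # initial w^0, b^0
eta = 0.01        # learning rate (eta), chosen small enough to converge

for _ in range(5000):
    residual = y_hat - (b + w * x)
    dw = np.sum(2 * residual * (-x))  # dL/dw at the current (w, b)
    db = np.sum(2 * residual * (-1))  # dL/db at the current (w, b)
    w = w - eta * dw                  # w^{t+1} = w^t - eta * dL/dw
    b = b - eta * db                  # b^{t+1} = b^t - eta * dL/db
```

After enough iterations $(w, b)$ converges to the least-squares optimum for this toy set (about $w \approx 1.94$, $b \approx 1.15$).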
1.5 Another model (polynomial fitting)
$y = b + \sum_{i=1}^{N} w_i \, x_{cp}^{\,i}$. A complex model can lead to overfitting: once the polynomial degree grows past a certain point, raising it further makes the training error smaller but the testing error larger, so the model overfits more easily.
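This effect can be demonstrated with a quick least-squares experiment; the data is a hypothetical noisy linear set standing in for the CP values, and the degrees compared are arbitrary choices:

```python
import numpy as np

# Hypothetical data: noisy samples from y = 1 + 2x.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = 1 + 2 * x + rng.normal(0, 0.1, size=x.shape)

def train_error(degree):
    """Fit y = b + sum_i w_i x^i by least squares; return the training loss."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ...
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coef) ** 2)

# Higher-degree models always fit the training data at least as well,
# even though their testing error would eventually get worse.
errors = [train_error(d) for d in (1, 3, 7)]
```

Training error never increases with degree (the lower-degree model is nested inside the higher-degree one); the overfitting shows up only on held-out data.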
1.6 Redesign the model
Incorporate other influencing factors (additional input features) into the model.
1.7 Regularization
$L(f) = L(w,b) = \sum_{n=1}^{N} \left(\hat y^n - (b + \sum_i w_i x_i^n)\right)^2 + \lambda \sum_i w_i^2$. The added term $\lambda \sum_i w_i^2$ favors small weights, making the learned function smoother. But a larger $\lambda$ is not always better: in the extreme, a horizontal line cannot fit any data. The bias $b$ does not affect the smoothness of the function; only the weights $w$ do, which is why $b$ is left out of the penalty.
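A closed-form sketch of this regularized loss for the one-feature case, penalizing $w$ but not $b$ as the notes describe; the data is the same hypothetical toy set used earlier:

```python
import numpy as np

# Hypothetical toy data, roughly y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])

def fit_ridge(lam):
    """Minimize sum_n (y^n - (b + w x^n))^2 + lam * w^2 for scalar w, b.

    Setting dL/dw = 0 and dL/db = 0 gives a 2x2 linear system:
        w (sum x^2 + lam) + b (sum x) = sum x*y
        w (sum x)         + b * N     = sum y
    """
    N = len(x)
    A = np.array([[np.sum(x**2) + lam, np.sum(x)],
                  [np.sum(x),          N        ]])
    rhs = np.array([np.sum(x * y), np.sum(y)])
    w, b = np.linalg.solve(A, rhs)
    return w, b

w0, _ = fit_ridge(0.0)      # no regularization: the plain least-squares fit
w_big, _ = fit_ridge(100.0)  # strong regularization shrinks w toward 0
```

As $\lambda$ grows, $w$ shrinks toward zero and the fitted line flattens, which is exactly the "smoother but eventually horizontal" trade-off described above.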

