Statistical_Learning/Script/stat-learning.tex at main · theislab/Statistical_Learning · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[english,course,oneside]{Notesext}
%\documentclass[english,course,kindle]{Notesext}

% now for lecture 2022
% supervised by Carsten, Philipp, Tingying (thx!)

% lecture 2021 -> Carsten

%now for lecture 20202
%inverted classroom, so show one (sub)-chapter each or so per week


%Proposed new lecture series (2019)
% see https://docs.google.com/document/d/1kM1MlXbLuql9sTR7G7oHG7s1dtHrDfmlUwU5N8wUkyg/edit
%Lecture 01 (24.04): Introduction to Statistical Learning:
%Lecture 02 (08.05->30.04): (chapter 2) Basic Concepts of statistical learning, including linear regression, nearest neighbour, parametric vs non-parametric, bayesian classifier, curse of dimensionality, model accuracy, bias-variance trade-off
%Lecture 03 (08.05 -> Fabian remotely): (chapter 3) High-dimensional regression and shrinkage methods, including Ridge regression and PCA, Bayesian linear regression, Lasso regression, Bayesian interpretation of Lasso, Elastic Net, regularization parameter selection criteria.
%Lecture 04 (15.05  ->Benjamin): chap 3 continued
%Lecture 05 (22.05) Linear classifier: linear regression for classification, linear discriminant and quadratic discriminant analysis
%Lecture 06 (29.05): logistic regression + regularizations, nonlinear optimization and multi-class ext
%Lecture 07 (05.06 -> Tingying): Nonlinear classifier: Decision tree/Random Forest, Boosting?
%Lecture 08 (12.06 -> Tingying): Introduction to neural networks, separating hyperplane and deep learning, Tutorial on Keras.
%Lecture 09 (19.06): Convolutional neural network (Carsten’s lecture slides for TiCB, which includes explanations of neural network architecture, till mid May), still need to add loss function, optimization, tricks needed to get things to run, a word on frameworks (Keras or so), debugging: bias vs variance. Exercise: we can use Felix Nature methods paper, exercise prepared
%Lecture 10 (26.06): Unsupervised learning: Gaussian mixture model & k-means
%Lecture 11 (03.07): Dimension reduction, PCA, nonlinear dimred
%Lecture 12 (10.07): auto-encoder, (app denoising auto-encoder, Lukas slides, mid June), variational inference (Murphy), variational autoencoder, exercise from Lukas
%Lecture 13 (17.07): outlook “newer topics” without much theory: Generative Adversarial Networks: GAN, image-to-image translation, cycle GAN. A good online exercise; Recurrent Neural Networks (RNN/LSTM). Graph Convolutional Networks (GCN).


% see http://poisson.phc.unipi.it/~maggiolo/index.php/2008/12/latex-class-for-lecture-notes/
%course, seminar or talk: a course is a medium-length document structured in sections, subsections and paragraphs; a seminar is a short document without structure or with subsections; a talk is the sheets you take with you when you give a seminar.
%
%The second option defines the style of the document; if you want to tweak some parameters, there are also these fine-tuning options:
%headertitle, headersection, headersubsection or headerno: what to display on the top of the pages;
%theoremnosection, theoremsection or theoremsubsection: numbering of theorems, definitions, etc.;
%cleardoublepage or nocleardoublepage: whether we want empty double pages after a section ending;
%oneside or twoside: margins and headers optimized for one-sided or two-sided printing;
%onecolumn or twocolumn: how many columns per page.
%
%The class defines also some commands:
%for course, the command \lecture{duration}{dd}{mm}{yyyy} which writes on the margin information about the lesson?s number, duration and date;
%for talk, the commands \separator, which simply draws a horizontal line, and \tosay{message} which prints the message inside a box (useful to remember things you don?t have to write on the blackboard);
%\mymarginpar{message}, which behaves like a regular \marginpar but with a custom style

\usepackage[nomytheorems,pagesaving]{mymacros}
%\newcommand{\Rp}{\R_{\geq0}} % nervt
\newcommand{\Rp}{\R_+} % nervt
\newtheorem{question}[theorem]{Question}

% Graphical models chapter (macros not in mymacros.sty)
\newcommand{\bE}{\mathbb{E}}
\newcommand{\cG}{\mathcal{G}}
\newcommand{\cE}{\mathcal{E}}
\newcommand{\cM}{\mathcal{M}}
\newcommand{\cS}{\mathcal{S}}
\newcommand{\cN}{\mathcal{N}}
\newcommand{\indep}{\perp\!\!\!\perp}
\newcommand{\dep}{\not\indep}
\newcommand{\inlineQ}[1]{\par\noindent\textit{#1}\par}
\newcommand{\pa}{\mathrm{pa}}
\newcommand{\paG}[1]{\mathrm{pa}_{\cG}(#1)}
\newcommand{\chG}[1]{\mathrm{ch}_{\cG}(#1)}
\newcommand{\deG}[1]{\mathrm{de}_{\cG}(#1)}
\newcommand{\ndG}[1]{\mathrm{nd}_{\cG}(#1)}

\usepackage{verbatim}

% coole arrows, adapted from http://tex.stackexchange.com/questions/38670/broken-arrow-symbol
\newcommand*{\DashedArrow}[1][]{\mathbin{\tikz [baseline=-0.15ex,-latex, densely dashed,#1] \draw [#1] (0pt,0.5ex) -- (1.3em,0.5ex);}}%
\newcommand*{\InhibArrow}{\DashedArrow[-|]}
\newcommand{\Data}{\mathcal{D}}
\newcommand{\dims}[1]{{n_{#1}}}
%\newtheorem{algorithm}[theorem]{Algorithm}
%\usepackage[figure]{algorithm2e}
\usepackage[linesnumbered,ruled]{algorithm2e}


% equation environment
\newcommand{\eq}[1]{\begin{align} #1 \end{align}}

\usepackage{cite}
\usepackage[T1]{fontenc}

% Skip figures missing from the public source export instead of failing the build.
\let\origincludegraphics\includegraphics
\newcommand{\maybeincludegraphics}[2][]{%
  \IfFileExists{#2.pdf}{\origincludegraphics[#1]{#2}}{%
  \IfFileExists{#2.png}{\origincludegraphics[#1]{#2}}{%
  \IfFileExists{#2.jpg}{\origincludegraphics[#1]{#2}}{%
  \IfFileExists{#2.eps}{\origincludegraphics[#1]{#2}}{%
  \IfFileExists{#2}{\origincludegraphics[#1]{#2}}{%
  \fbox{\parbox[c][2cm][c]{0.9\linewidth}{\centering\scriptsize Figure omitted from source release.}}%
  }}}}}%
}
\let\includegraphics\maybeincludegraphics


\title{MA4802: Statistical learning}
\subject{MA4802: Statistical learning}
%Statistical Learning - A 2+2SWS course for the Data Science Master program
% Fabian Theis
%Target audience: master students from math or computer science
%~14 slots
% https://www-m12.ma.tum.de/teaching/summer-semester-2017/statistical-learning-ma4802/
\author{Fabian J. Theis}
\email{fabian.theis@helmholtz-muenchen.de, {www.helmholtz-munich.de/en/icb/research-groups/theis-lab}}
%\speaker{SPEAKER}
\date{17}{4}{2024}
\dateend{17}{7}{2024}
\place{Department of Mathematics, TU Munich}

\begin{document}

\section{Introduction}
%\section*{Abstract}
Statistical learning aims to find a relationship between a set of variables denoted as features, realized by a series of samples. In the supervised learning case these samples are labeled, and the goal is to predict the label an unlabeled feature sample correctly. Statistical learning extends classical concepts in statistics such as regression by introducing extensions for a series of models ranging from local approximations over sparse coding to graphical modeling, density estimation and clustering methods. In this course we want to review a series of basic models from a probabilistic perspective. The lecture draws heavily on two textbooks by Hastie \& Tibshirani \& Friedman \cite{Hastie:2009wp} and Murphy \cite{Murphy:2012uq}, both great reads.

Please note, these lecture notes are for participants of this course only, since some material may be be protected by copy-right and may not distributed without asking permission.

For an example-based introduction on how statistical learning is generally important, in particular in the growing field of data sciences, see presentations slides online (\url{http://moodle.tum.de}) and the 2015 Science special section on machine learning/AI \cite{Stajic:2015jo}.

%\section{A short primer in molecular cell biology}
%Planned to do this, but maybe too detailed. Please read up on it online (\url{http://moodle.tum.de}) if you are interested.

%\newpage

\section{Basic concepts in statistical learning} % 1-2 slots

\subsection{Types of statistical/machine learning} % supervised/unsupervised, some examples

Statistical learning aims to learn a relationship between a multivariate set of random variables $X=(X_1,\ldots, X_p)$, typically from $\R$.

In \defn{unsupervised learning}, given a set of samples $D=\{x_i\in\R^p|i=1,\ldots, N\}$ of $X$, one seeks to estimate `interesting structure´ in $X$ from $D$ such as clusters or geometrical relationships.

In the more well-defined \defn{supervised learning},
we additionally observe a \defn{response} or (classically) \defn{dependent variable} $Y$.
The goal is to explain it by $X$, then also known as \defn{predictors}, \defn{features} or (classically) \defn{independent variables}.

$Y$ may be `quantitative´, i.e. taking real values, or `qualitative´ i.e. ordinal or categorical (Q: examples); most important example of the latter is the binary case $\{0,1\}$.
Depending on the output type, supervised learning leads to \defn{regression} (quantitative $Y$) and \defn{classification} or \defn{pattern recognition} (qualitative $Y$)\footnote{of course there is logistic regression and more generally ordinal regression and other models not precisely adhering to this simplified notion}.

Probabilistically, the input-output relation is measured by $P(X,Y)$.
The algorithms are called `learning´ because we want to estimate the relation of $Y$ to $X$ from so called \defn{training data} i.e. iid\footnote{in the simplest case; there exists extensions to training data with temporal or spatial structure} samples $D=\{(x_i,y_i)\in\R^p\times\R|i=1,\ldots, N\}$ of $(X,Y)$.

There are other types that mix the distinction of training on labels versus blind pattern mining such as reinforcement learning or semi-supervised learning; these however are beyond the scope of the lecture.


\subsection{Examples of supervised learning} % supervised/

Assume an unknown functional relationship
\begin{equation}\label{eqn:YfX}
Y=f(X)+\epsilon
\end{equation}
between input and output, with the random variable (RV) $\epsilon$ being independent of $X$ and having zero-mean. The goal of learning is to estimate $f$ from $D$ (\defn{function approximation}) and then make predictions $\hat y=\hat f(x)$ for some new input $x$.

\subsubsection{Need for probabilistic predictions}
To quantify uncertainty in both the model and the estimation, we determine the probability distribution over the output
$P(Y=y|X=x,D)$ for a given input $X=x\in\R^p$, where we implicitly condition on the used model (constraints on $f$).

To get to a deterministic output, the mode
\begin{equation}\label{eqn:MAP}
\hat y=\hat f(x)=\argmax_c P(Y=c|X=x,D)
\end{equation}
also known as the \defn{MAP (maximum a posteriori)} estimate is often used. Alternatively the \defn{posterior mean} $E_{Y|X}[P(Y|X=x,D)]$
may be determined as a summary statistics.
In the following, we will usually abbreviate $P(Y=y|X=x,D)\equiv P(y|x,D)$ and drop dependency on the data $D$.

%Given an estimate $\hat f$ together with a prediction $\hat y=\hat f(x)$, we can ask about its accuracy. Assuming fixed $\hat f$ and $x$, we can decompose the mean square error as follows:
%$$
%MSE(\hat y)=
%$$


\subsubsection{Linear regression} %or move up
%More explicitly we can write down the model as
%$$p(Y|X,\theta)=N(y|\mu(X), \sigma^2(X))$$
Consider the setting of quantitative supervised learning, where lables $y$ have actual values.
Assume that $\epsilon$ in the above model \eqnref{eqn:YfX} is a
Gaussian with zero-mean and fixed variance $\sigma^2$, then
\eq{\label{eqn:gaussassump}
P(y|x,\beta)=\NNN(y|f(x), \sigma^2),
}
where $\NNN(y|f(x), \sigma^2)$ denotes the normal distribution with mean $f(x)$
and variance $\sigma^2$. Hence the MAP estimate becomes $\hat y = f(x)$, which in this case also coincides with the posterior mean.

\begin{exercise}Why?\end{exercise}

In the simplest case, we assume $f$ is affine linear, so
\eq{
Y=f(X)+\epsilon= X^\top \beta+\epsilon , \quad \beta \in \mathbb{R}^{p+1},
}
for a vector of weights $\beta$ and $x_0:=1$.
Or without vector notation
\eq{
Y=\beta_0 + \sum_{i=1}^p X_i\beta_i +\epsilon .
}
%So
%\eq{
%P(y|x,\beta)=\NNN(y|x^\top \beta,\sigma^2).
%}

The parameters $\beta$ are estimated from $D$ for example by least squares: minimize the residual sum of squares
\eq{
\RSS(\beta)=\sum_{i=1}^N (y_i-x_i^\top \beta)^2.
}
Note that this is independent of the magnitude of the fixed variance $\sigma^2$. A probabilistic motivation of RSS is given in section \ref{sec:statdec}.
With a $(N\times p)$-dimensional sample matrix $\X$ (i.e., observations $x$
of $X$ are stored as rows) and a $N$-dimensional sample vector $\y$, rewrite this as
\eq{
\RSS(\beta) = ||\y - \X \beta||^2 = (\y-\X \beta)^\top(\y-\X \beta).
}
Differentiation gives the \defn{normal equations}
$$\X^\top(\y-\X \beta)=0$$
and for nonsingular $\X^\top\X$ the unique solution
$$\beta=(\X^\top\X)^{-1}\X^\top\y.
$$
See example Figure \ref{fig:linclass}, more details in the exercises.

\begin{figure}\centering
\includegraphics[width=.7\columnwidth]{images/intro-linclass}
\caption{Linear regression used for classification of samples with Blue=0, Orange=1, separated at $x^\top\beta=0.5$.
From \cite{Hastie:2009wp}, Fig. 2.1}
\label{fig:linclass}
\end{figure}

%TODO: or slides? also in terms of classification, with $P(y|x,\theta)>0.5$ for class selection?
%how to show in blackboard lecture?

\subsubsection{Logistic regression} %or move up
For qualitative learning, in particular binary classification, we replace the Gaussian distribution for $Y$ in equation \eqnref{eqn:gaussassump} by a discrete distribution, the Bernoulli distribution:
\eq{
P(y|x)=\Ber(y|\mu(x))
}
with the mean being a function $\mu(X=x):R^p\rightarrow [0,1]$.
For the mean of a Bernoulli distribution holds $\mu(x)=E(Y|X=x)=P(Y=1|X=x)$.
To ensure that $\mu(x)\in[0,1]$, but still use a linear model, we feed the linear model into the sigmoidal function
$$\sigm(x):=(1+\exp(-x))^{-1}.$$
The model then reads
$$P(y|x,\beta)=\Ber(y|\sigm(x^\top \beta)).$$
with the $p$ dimensional parameter vector $\beta$.
The MAP estimator then simply corresponds to
$$\hat y(x)=1 \qquad \Leftrightarrow \qquad P(y=1|x,\beta)>1/2$$
We will discuss estimation and regularization in logistic regression later.

%\subsubsection{$k$-nearest neighbor regression/classification} %or move up
\subsubsection{Nearest neighbor methods} %or move up
Instead of the above global, linear model, consider the following local  method: Determine $P(Y=c|x,D)$ from the average of close-by samples in your data $D$. The \defn{$k$-nearest neighbor} (kNN) fit is then defined as
$$\hat y(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$
with $N_k(x)$ being the $k$-closest samples of $D$ to $x$ in some metric,
or as probabilistic model
$$P(Y=c|x,D,k)= \frac{1}{k} \sum_{x_i \in N_k(x)} \unitm_{y_i}(c),$$
with the indicator function $\unitm_{y_i}(c)=1$ if ${y_i}=c$ and $0$ otherwise.
Again take decision for classification at $P(y|x,D,k)>0.5$, which is equivalent to majority voting in the class.

\begin{figure}\centering
\includegraphics[width=.7\columnwidth]{images/intro-15nearestneigbor}
\caption{$k$-nearest neighbor classification with $k=15$.
From \cite{Hastie:2009wp}, Fig 2.2.}
\label{fig:nnclass}
\end{figure}

For an example of this method, see Figure \ref{fig:nnclass}. In the exercises, we will see the effect of varying $k$'s, as well as a comparison to linear classification.

Many models in statistical learning are a mixture of the above two models, e.g. kernel methods replacing the $0/1$-weights of $k$-nearest neighbor, local regression, linear models with basis expansion for higher flexibility or neural networks as iterations of sigmoidal models.

\subsection{Statistical decision theory}\label{sec:statdec}
%TODO: a bit double to prob formulation in 2.2, remove redundancy

We want to predict $Y$ from $X$, i.e. seek a function $f(X)$ as predictor of $Y$. To quantify prediction preformance between predicted labels $y$ and true labels $y'$, we use a loss function $L(y,y')$, which is positive definite ($L(y,y')\geq 0$), e.g.~the square loss $L(y,y'):=(y-y')^2$.
We can then choose $f$ by minimizing the (squared) \defn{expected prediction error}
$$EPE(f):= E( L (Y,f(X))) = E(Y-f(X))^2 = \int (y-f(x))^2 P(dx,dy).
$$
By conditioning we get
$$EPE(f) = E_X E_{Y|X} (L(Y,f(X))|X),$$
and % because of definiteness
it suffices to minimize $EPE(f)$ pointwise (i.e.~for a single point $X=x$):
\eq{
f(x)=\argmin_c E_{Y|X} (L(Y,c)|X=x).
}

For the square loss function, differentiation yields the conditional expectation
$$f(x)=E_{Y|X} (Y|X=x)=E(Y|X=x),$$
also known as regression function.

In words: the best prediction of $Y$ at $X=x$ is the conditional mean if we measure error by $L_2$.

Its empirical estimate is realized by nearest-neighbor methods: instead of averaging over all points at $x$ with samples $x_i=x$, we approximate this in a neighborhood and use averaging instead of expectation:
$$\hat f(x)=\frac{1}{k} \sum \{y_i|x_i \in N_k(x)\}$$

Note: Under mild regularity conditions on $P(X,Y)$, $\hat f(x)\to E(Y|X=x)$ for $N,k\to\infty$ and $k/N\to0$.

However, this is not a general approximator, because often $N$ is too small, in particular in high dimensions.

Let's instead take additional assumptions (regularization) e.g. by a model based approach in a linear approximation:
$$f(x)= x^\top \beta$$
Putting this into the $EPE$ definition we get
$$\beta=(E(XX^\top))^{-1}E(XY)$$

Note: The least-squares estimator introduced above replaces these expectations by averages.

So least-squares $f(x)$ is approximated by globally linear function, whereas in nearest-neighbor by a locally constant function. Extensions are e.g.
$$f(X)=\sum f_j(X_j)$$ with univariate $k$-nearest neighbor.


Other loss functions:
Replacing $L_2$ loss by $L_1$ loss $L(y,y')=|y-y'|$, leading to
$$\hat f(x)=\median (Y|X=x)$$
For discrete $Y$, one may choose the $0-1$ loss function
$$L_{01}(y,y'):=\left\{\begin{array}{cc}0 & y=y'\\1&y\neq y'\end{array}\right.$$
Then minimal $EPE$ leads to
$$f(x)=\argmin_c E_{Y|X} (L(Y,c)|X=x)=
\argmin_c \sum_{y\in \Dom(Y)} L(y,c) P(Y=y|X=x)
$$
which equals
\eq{\label{eqn:BayesClassifier}
f(x)=
\argmin_c (1- P(Y=c|X=x))=\argmax_c P(Y=c|X=x)
}
for the $0-1$ loss function, which is known as the \defn{Bayes classifier}: choose the most probable class using the conditional discrete $P(Y|X)$.

\begin{figure}\centering
\includegraphics[width=.7\columnwidth]{images/intro-optimalbayes}
\caption{Optimal Bayes classifier, using the known generating densities (mixture of Gaussians).
From \cite{Hastie:2009wp}, Fig 2.2.}
\label{fig:optimbayes}
\end{figure}

Again this is approximated by kNN; for binary classification this implies simply thresholding at $0.5$, see figure \ref{fig:optimbayes}.


\subsection{Parametric vs non-parametric models}

A \defn{statistical model} $H$ is a set of probability distributions on a common set, e.g. $\R^p$. A \defn{parametric model} is parametrized by a finite number of parameters.

\begin{example}
Consider the parametric model $H=\{\NNN(\mu, 1) | \mu\in\R\}$. Given iid samples $x_1,\ldots,x_N$, its parameter $\mu$ is estimated by $\hat \mu=\frac{1}{N}\sum x_i$.
\end{example}

\begin{example}
Consider the nonparametric model $H=\{P | \var P<\infty\}$. Given iid samples $x_1,\ldots,x_N$, the mean again is estimated by $\hat \mu=\frac{1}{N}\sum x_i$.
\end{example}

Linear and logistic regression are parametric learning models, whereas $k$-nearest neighbor is non-parametric.


\subsection{The curse of dimensionality}

Although kNN classifiers may work well for many samples, they have serious issues in high dimensions, due to a fact known as the \defn{curse of dimensionality}:
In high dimensions, samples are really sparse (i.e. far from each other).

To be more precise:
Consider input samples $x_i$ uniformly distributed in the unit hypercube $[0,1]^p$. If we want to observe a fraction $r$ of all samples within a neighborhood, it has to be  (in average) of size $e_p(r):=r^{1/p}$.
While in one dimension $e_1(r)=r$, for 10 dimensions we have to cover 80\% of the total volume to cover 10\% of all samples, $e_{10}(0.1)=0.80$. Even if we only want to capture 1\%, the size is still $e_{10}(0.01)=0.63$. Hence the `neighborhood´ is not really local anymore (see Figure \ref{fig:cursedim}).

\begin{figure}\centering
\includegraphics[width=\columnwidth]{images/intro-cursedim}
\caption{The curse of dimensionality. In $p=10$ dimensions we need to cover 80\% of the range of each coordinate to capture 10\% of the data.
From \cite{Hastie:2009wp}, Fig 2.6.}
\label{fig:cursedim}
\end{figure}

Similarly, one can show that the all sample points are close to the edges of the sampling space in high dimensions:
Assume that $x_i$ are uniformly samples from the unit ball $\{x| \|x\|<1\}$ centered at $0$. The median distance from $0$ to the closest sample is
\eq{
d(p,N)=\left(1-\frac{1}{2}^{1/N}\right)^{1/p}
}
E.g. $d(10,500)\approx 0.52$, so most data points are closer to the boundary than to $0$. This implies that we mostly face the situation of estimation at the edges of the sampling space, where extrapolation is more difficult.

\subsection{Assessing model accuracy and the bias-variance trade-off}
%model selection and bias variance, see \cite{Hastie:2009wp} 2.9

All (most) models contain a complexity parameter, such as the number of basis functions used in a regression or the number $k$ of nearest neighbors in kNN. What is the optimal model complexity for the given data?

%Consider the $k$-nearest neighbor regression $\hat f_k(x)$ as function of $k$.
Assume that the data arise from model $Y=f(X)+\epsilon$ with independent noise $\epsilon$ with $E(\epsilon)=0$ and $\var(\epsilon)=\sigma^2$, and assume a priori fixed training samples $x_i$. The EPE or test error can then be decomposed into

\begin{eqnarray*}
\EPE(\hat f)(x)&=&E_{Y|X} ((Y-\hat f(x))^2|X=x) \\
&=& \sigma^2 + \underbrace{E_D\left(\hat f(x)-f(x)\right)^2}_ {\MSE(\hat f(x))}\\
&=&\sigma^2 + E_D(\hat f(x)^2)
-\left(E_D(\hat f(x))\right)^2
+\left(E_D(\hat f(x))\right)^2\\
&&\qquad-2E_D(\hat f(x))f(x)+f(x)^2\\
&=& \sigma^2 + \underbrace{\left( E_D(\hat f(x))-f(x)\right)^2}_{\bias(\hat f(x))^2} + \underbrace{E_D \left( \hat f(x) - E_D(\hat f(x))\right)^2}_{\var(\hat f(x))},
\end{eqnarray*}
where we used that the noise is independent and centered at 0 in the second line, and took expectations $E_D$ over the training set $D$.
The first term is called the \defn{irreducible error}, the variance of the target, which we cannot control. In statistical learning, we aim at minimizing the two right terms, the mean square error of the estimator $\hat f(x)$ of $f(x)$.

If we for simplicity assume that the values of $x_i$ in the samples have been fixed in advance i.e. are nonrandom, then for the $k$-nearest neighbor regression $\hat f_k(x)$ as function of $k$ we get
$$\EPE(\hat f)(x) =
\sigma^2 + \left(f(x) - \frac{1}{k} \sum_{x_i \in N_k(x)} f(x_i) \right)^2 + \frac{\sigma^2}{k}\\
$$
For increasing $k$ the bias term will likely increase due to samples being further off in larger neighborhoods, thereby the average also being more off. The variance will decrease with $k$. So as $k$ varies there is a trade off between bias and variance, see figure \ref{fig:biasvariance}.

\begin{figure}\centering
\includegraphics[width=.7\columnwidth]{images/intro-biasvariance}
\caption{Bias variance trade-off.
From \cite{Hastie:2009wp}, Fig 2.11.}
\label{fig:biasvariance}
\end{figure}

We will learn techniques, e.g. cross validation, to control for overfitting and achieve generalization.

\subsection{No free lunch}
G Box: All models are wrong, but some are useful.

Which model to choose? Cross validation may help empirically, but there is no universal best model, as stated in the so called \defn{no free lunch theorem}. This is based on the fact that while some assumptions work for one domain, they may work poorly in another, which can be exploited to construct a set of samples that a predictor minimizing $\EPE$ for another set of samples misses. For more details see \cite{ShalevShwartz:2014wp}

%Wolpert 1996, see also math basics of ML

%c.	some basics from prob theory/stat/optimization or so needed later?


\newpage
\part{Supervised learning}

\section{High-dimensional regression}

\subsection{Review of linear regression, estimation, testing and covariate selection} %1 slot

As seen before, linear regression is modeled by
\eq{\label{eqn:linreg}
P(y|x,\theta)=\NNN(y|x^\top\beta, \sigma^2),
}
Here as in the following we summarize all model parameters in the parameter vector $\theta$.

The model can describe nonlinear relationships by including nonlinear functions of the input random vector, e.g. basis expansions $X, X^2, X^3$ of a single variable $X$, interaction variables $X_1X_2$ or factor coding of a categorical variable. Multidimensional output $Y$ can similarly be handled but is often split into two models.

\subsubsection{Estimation}
The model parameters $\theta$ are most commonly estimated by \defn{maximum likelihood}, i.e. by
\eq{
\hat\theta := \argmax_\theta P(D|\theta)
}
Equivalently we may minimize the negative log likelihood
\eq{
l(\theta) := -\log P(D|\theta) = -\sum_{i=1}^N \log p(y_i|x_i,\theta)
}
for iid samples $D$.

MLE for the linear regression model \eqnref{eqn:linreg} yields
\eq{
l(\theta) &= -\sum_{i=1}^N \log\left(
(2\pi\sigma^2)^{-1/2} \exp\left(-(2\sigma)^{-1}(y_i-x_i^\top\beta)^2\right)
\right)\\
&=(2\sigma)^{-1} RSS(\beta)- (N/2) \log(2\pi\sigma^2)
}
Hence for minimal $l(\theta)$, $RSS(\beta)$ needs to be minimized, which implies
$$\hat\beta=(\X^\top\X)^{-1}\X^\top\y$$
for nonsingular $\X^\top\X$ as seen above.
The variance of the estimator $\hat\beta$ follows from the likelihood as
$$\var(\hat\beta)=(\X^\top\X)^{-1}\sigma^2.$$
The MLE estimate for the variance $\sigma^2$ results in
$$\hat\sigma^2=\frac{1}{N}\sum_{i=1}^N (y_i-\hat y_i)^2$$
with $\hat y_i:=x_i^\top\hat\beta$, which may be replaced by
$$\hat\sigma^2=\frac{1}{N-p-1}\sum_{i=1}^N (y_i-\hat y_i)^2$$
to guarantee an unbiased estimator.


\subsubsection{Testing}

To draw inference about parameters and model, we assume that model \eqnref{eqn:linreg} holds with true parameters $(\beta, \sigma)$.
So
$$\hat\beta \sim \NNN(\beta, (\X^\top\X)^{-1}\sigma^2)$$
and
$$(N-p-1)\hat\sigma^2 \sim \sigma^2 \chi^2_{N-p-1},$$
which are both independent. Hence we can test if a particular coefficient $\beta_j \neq 0$ by considering the Z-score
$$
z_j=\frac{\hat\beta_j}{\hat\sigma\sqrt{((\X^\top\X)^{-1})_{jj}}}
$$
It is distributed as $t_{N-p-1}$.

\subsubsection{Optimality of the estimator}
Theorem (Gauss-Markov):
The least-squares estimator has the smallest variance among all linear unbiased estimators.

\begin{exercise}
Proof? (idea:
$\tilde\beta=((\X^\top\X)^{-1}\X^\top+\A)\y$. No bias, hence $\A\X=0$, so $\var\tilde\beta=\var\hat\beta+\sigma^2\A\A^\top$, i.e. $\A=0$ if variance is to be minimal)
\end{exercise}

Consider the mean-square error of an estimator $\tilde\beta$ of $\beta$. If it is unbiased, its MSE equals the variance
$$\MSE(\tilde\beta)= E(\tilde\beta-\beta)^2=\var(\tilde\beta)$$
According to GM, $\hat\beta=(\X^\top\X)^{-1}\X^\top\y$ has the lowest variance and hence the lowest MSE.

However, there may be biased estimators with lower MSE, as we will see in the following.

%TODO: subset selection?? -> exercises


\subsection{Shrinkage methods and ridge regression}% ridge regression, Lasso, elastic net)} %1-2 slots

In this section, we describe `robust regression' methods, which trade-off bias for lower variance, which, in particular, in high dimensions or noisy data, may be preferable. These estimators may be derived by modifying the noise model (i.e. the likelihood) or putting prior constraints $P(\theta)$ on the parameters and doing MAP estimation
$$\hat\theta := \argmax_\theta P(D|\theta)P(\theta)
$$

Some common model assumptions are summarized in this table (see \cite{Murphy:2012uq}, p. 224):
\begin{center}
\begin{tabular}{ l | l | l }
Likelihood & Prior & Name\\
\hline
Gaussian & Flat & Least squares\\
Gaussian & Gaussian & Ridge regression\\
Gaussian & Laplace & Lasso regression\\
Laplace or Student & Flat & Robust regression
\end{tabular}
\end{center}

\subsubsection{Ridge regression}
In situations in which we face an ill conditioned regression problem, the variance of the ordinary least squares (OLS) estimates $\hat{\beta}$ might become very large - especially when we face {\it multicoliniarity} in the covariats, i.e., where one or more covariats are highly linearly related, or the regression is overparameterized. This can be easily seen by realizing that $\X^T\X$ is a matrix with $1$ on the diagonal and the pairwise Pearson correlation coefficient at the off diagonal:
\begin{equation*}
(X^TX) = \begin{pmatrix}
1 & \rho_{1,2} & \cdots &\rho_{1,m}\\
\rho_{2,1} &1 & \cdots & \rho_{2,m} \\
\vdots  & \vdots &\ddots & \vdots\\
\rho_{m,1} & \rho_{m,2} & \cdots & 1
\end{pmatrix}.
\end{equation*}
By increasing the off-diagonals (through highly correlated covariates) the inverse will become large, which has an immediate effect on the variance of the estimates since:
\begin{equation*}
\text{var}(\beta) = (\X^T\X)^{-^1}\sigma^2.
\end{equation*}

In nearly perfect multicolinearity, $\X^T\X$ might become singular, leading to a non-invertible matrix. In such a case the OLS estimate does not even exist!

Ridge regression was originally motivated by finding a {\it biased} estimator of $\beta$ for singular covariate matrices. It was derived by realizing that a nonsingular matrix can always be recovered by simply adding a sufficiently small positive value $\lambda > 0$ to the diagonal of the singular $\X^T\X$. This results in the following augmented closed form solution of $\hat{\beta}^{\mathtt{ridge}}$:
\begin{equation}
\hat{\beta}^{\mathtt{ridge}} = \left( \X^T\X + \lambda\mathbf{I}\right)^{-1}\X^T\y,
\end{equation}

which is a solution to the following unconstrained
\begin{equation}
\label{FRR}
	\hat{\beta}^{\mathtt{ridge}} =  \argmin_\beta  \left[  \sum_{i=1}^N (y_i - x_i^\top\beta )^2 + \lambda \underbrace{\sum_{j=1}^p \beta_j^2}_{||\beta||_2}   \right]
\end{equation}

The unconstrained optimization problem can be equivalently expressed as a constrained optimization problem, which will give us a better intuition of the effect the added term to the RSS has on $\beta$. The constrained optimization problem is stated as follows:
\begin{align}
\label{CO_RR}
\begin{split}
\hat{\beta}^{\mathtt{ridge}} =&  \argmin_\beta  \sum_{i=1}^N (y_i - x_i^\top\beta )^2\\
&\mathtt{s.t.} ~~||\beta||_2 \leq t
\end{split}
\end{align}
The two optimization problems are connected.
This nicely illustrates that ridge regression constraints all coefficients (except the intercept $\beta_0$) to an upper limit $t$. The  $\ell_2$ norm constraint defines a ball who's radius is constrained by $t$. The solution to the optimization problem will lie on the intersection of the constrained ball and the RSS estimate (Figure \ref{fig:est2b}).

\begin{figure}\centering
\includegraphics[width=.7\columnwidth]{images/est2}
\caption{Estimation picture for the Lasso (left) and ridge regression
(right). Shown are contours of the error and constraint functions. The solid blue
areas are the constraint regions $|\beta_1| + |\beta_2| \leq t$ and $\beta_1^2 + \beta_2^2 \leq t^2$ respectively, while the red ellipses are the contours of the least squares error function.
From \cite{Hastie:2009wp}, Fig 3.11.}
\label{fig:est2b}
\end{figure}

There is a one-to-one correspondence between the parameters $\lambda$ in \eqref{FRR} and $t$ in \eqref{CO_RR}. The complexity parameter $\lambda \geq 0$ controls the amount of shrinkage: the larger the value $\lambda$ (the smaller the value $t$) the greater the amount of shrinkage and hence bias-variance trade-off. The problem \eqref{FRR} is sometimes referred to as {\it Tikhonov regularization} or {\it penalized least square.}

Note that the ridge regression estimator is related to the classical OLS estimator, $\hat \beta^{\mathtt{ls}},$ in the following manner
\[
  \hat \beta^{\mathtt{ridge}}  = (\unitm + \lambda (\X^T \X)^{-1})^{-1} \hat \beta^{\mathtt{ls}},
\]
assuming that $\X^T \X$ is non-singular. This relationship can be verified by simply applying the definition of $\hat \beta^{\mathtt{ls}}$ and using the fact $\B^{-1} \A^{-1} = (\A\B)^{-1}$

In the case of orthonormal inputs, the ridge estimates are just a scaled version of the least squares estimates, i.e., $\hat \beta^{\mathtt{ridge}}  = \frac{1}{1+\lambda} \hat \beta^{\mathtt{ls}}.$

This is instructive, as it shows that, in this simple case, the ridge estimator is simply a downweighted
version of the least square estimator.


\subsubsection{Ridge Regression and PCA*}

The {\it singular value decomposition} (SVD) gives us some additional insight into the nature of ridge regression. Let $$\X = \U \D \V^T$$ be the SVD of $N \times p$ matrix $\X,$ where $\U$ and $\V$ are orthogonal matrices of order $N \times p$ and $p \times p$ respectively; and $\D$ is a diagonal matrix of order $p \times p$ containing the {\it singular values} of $\X$. The columns of $\U$ are called the {\it left singular} vectors, span the column space of $\X$; whereas the columns f $V$, which are called the {\it right singular} vectors, space the row space of $X$. In particular, $\U$ is a set of eigenvectors for $\X \X^T,$ and $\V$ is a set of eigenvectors for $\X^T \X.$ The non-zero singular values of $\X$ are the square roots  of the eigenvalues of both $\X \X^T$ and $\X^T \X.$

Using SVD we can write the least squares fitted vector as
\[
 \X \hat \beta^{\mathtt{ls}} = \X (\X^T \X)^{-1} \X^T \y = \U \U^T \y.
\]
Note that $\U^T \y$ are the coordinates of $\y$ with respect to the orthonormal basis $\U.$ The ridge solution is given as
\[
 \X \hat \beta^{\mathtt{ridge}}  =  \X ( (\X^T \X + \lambda \unitm)^{-1} \X^T \y),
\]
 we will define $\HH_{\lambda} =  \X (\X^T \X + \lambda \unitm)^{-1} \X^T.$  Using the SVD of $\X,$ observe that the so-called Gram matrix $\X^T \X$ can be decomposed as follows
 \[
 	\X^T \X = \V \D \U^T \U \D \V^T = \V \D^2 \V^T,
 \]
 which is the eigendecomposition of $\X^T \X.$ Next, we can apply this to the hat matrix,

 \begin{eqnarray}
 \label{ridge_SVD1}
 	  \HH_\lambda & = & \U \D \V^T (\V \D^2 \V^T +\lambda \unitm)^{-1} \V \D \U^T \nonumber \\
					     & = & \U \D (\D^2 + \lambda \unitm)^{-1} \D \U^T,
\end{eqnarray}
since $\V \D^2 \V$ and $\lambda \unitm$ commute and are hence simultaneously diagonalizable. Therefore, it also follows
that $\HH_\lambda$ is diagonalizable, with respect to $\U$ and with eigenvalues given by $ \D (\D^2 + \lambda \unitm)^{-1} \D,$ which is a diagonal matrix of order $p \times p.$
Thus, since the trace of a matrix is equal to the sum of its eigenvalues, it readily
follows that
\[
 \tr (\HH_\lambda) = \sum_{j=1}^p \frac{d_j^2}{d_j^2 +\lambda} ,
\]
 where $d_j$ is the $j'$th diagonal entry of $\D$ and since $\lambda \geq 0,$ we have  $\frac{d_j^2}{d_j^2 +\lambda}  \leq 1.$ Like linear regression, ridge regression computes the coordinates of $\y$ with respect to the orthonormal basis $\U$ and then it shrinks these coordinates by the factor $\frac{d_j^2}{d_j^2 +\lambda}.$ This means that a greater amount of shrinkage is applied to the coordinates of basis vectors with smaller $d_j^2.$


\begin{exercise}
 What does a small value of $d_j^2$ mean? Discuss in terms of  Bayesian or frequentist approaches.
 What is the behavior of $ \tr (\HH_\lambda)$ if $\lambda \rightarrow 0$ and $\lambda \rightarrow \infty?$
 %The following figures could be used for motivation and as hints.
\end{exercise}


\subsubsection{Bayesian linear regression}
Although ridge regression is a useful way to compute a point estimate, sometimes we want to compute the full posterior over $\beta$ for a standard linear regression problem: $y_i = x_i^T \beta + \varepsilon_i$ and $\varepsilon_i \sim \mathcal N(0, \sigma^2),$ and $i=1,\ldots, N.$


This corresponds to the following likelihood function:
\[
	P(\y | \X, \beta, \sigma^2) \propto \exp (\frac{1}{2\sigma^2} (\y - \X \beta)^T (\y - \X \beta) ).
\]
For simplicity, we will assume that a prior distribution for $\beta$ is known, i.e., $\beta \sim \mathcal N (0, \frac{\sigma^2}{\lambda}) \propto \exp(-\frac{\lambda}{2 \sigma^2} \| \beta \|^2).$ Thus,  we focus on computing the posterior distribution $P(\beta | \X, \y, \sigma^2).$

Using Bayes' rule for Gaussians
\begin{eqnarray*}
	P(\beta | \X, \y, \sigma^2) &=& \frac{P(\y| \beta, \X, \sigma^2) P(\beta | \X,
	\sigma^2)}{P(\y | \X, \sigma^2)} \\
	      &=& \frac{\prod_{i=1}^N  P(y_i | \beta_i, x_i, \sigma^2) P(\beta | \X, \sigma^2)}{P(\y | \X, \sigma^2)},
\end{eqnarray*}
and substituting ${P(\y | \X, \sigma^2)}$ with some rescaling factor $Z'$ the posterior is given by
\begin{eqnarray*}
P(\beta | \X, \y, \sigma^2)  & = & \frac{1}{Z'} \prod_{i=1}^N      \exp (-\frac{1}{2\sigma^2} (y_i - x_i^T \beta)^2  \exp(-\frac{\lambda}{2 \sigma^2} \| \beta \|^2) \\
					& = &   \frac{1}{Z'}  \exp (-\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - x_i^T \beta)^2)  \exp(-\frac{\lambda}{2 \sigma^2} \| \beta \|^2) \\
					& = & \frac{1}{Z'}  \exp (-\frac{1}{2\sigma^2} [(\y - \X \beta)^T (\y - \X \beta) + \lambda \beta^T \beta]) \\
					& = &   \frac{1}{Z'}  \exp( -\frac12 [ \frac{1}{\sigma^2} \y^T \y + \frac{1}{\sigma^2} \beta^T (\X^T \X + \lambda \unitm)\beta - \frac{2}{\sigma^2} \beta^T \X^T \y]) \\
					& = & \mathcal N (\beta | \mu, \Sigma),
\end{eqnarray*}
where $\Sigma = \sigma^2 (X^T X +\lambda \unitm)^{-1}$ and $\mu = \frac{1}{\sigma^2} \Sigma X^T \y = (X^T X + \lambda I)^{-1} X^T \y.$

%In this case, the posterior mean $\mu$ is exactly the classical ridge estimate,
%More generally, the most likely parameter $\argmax_\beta P(\beta | X, \y, \sigma^2)$ is also the least-cost parameter $\argmin_\beta J(\beta).$ In the Gaussian case, mean and most-likely coincide.

The Bayesian inference approach not only gives a mean / optimal $\mu,$
 but also a variance $\Sigma$ of that estimate. This is a core benefit of the Bayesian view since it naturally provides a probability distribution over predictions, not only a single prediction.


%% LASSO REGRESSION
\subsection{The Lasso}

We have seen that ridge regression essentially re-scales the least squares estimates. The Lasso (Tibshirani 1996), which is also the shrinkage method, by contrast, tries to produce a sparse solution, in the sense that several of the slope parameters will be set to zero.

As for ridge regression, the Lasso can be expressed as a constrained minimization problem,
\begin{eqnarray}
	\label{Lasso}
	\hat \beta^{\mathtt{lasso}} &=& \argmin_\beta \sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j \right)^2 \\
	&& \mbox{subject to}  \sum_{i=1}^p |\beta_j | \leq t.
\end{eqnarray}
In the signal processing literature, the lasso is also known as {\it basis pursuit}. We can also write the lasso problem in the equivalent {\it Lagrangian form}
\begin{equation}
	\label{LassoLagrange}
	\hat \beta^{\mathtt{lasso}} = \argmin_\beta \underbrace{\sum_{i=1}^N \left(y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j \right)^2 }_{J(\beta)}+\lambda \sum_{i=1}^p |\beta_j |.
\end{equation}

Notice the similarity to the ridge regression problem \eqref{FRR} and \eqref{CO_RR}: the $\ell_2$ ridge penalty $\sum_{i=1}^p \beta_j^2$ is replaced by the $\ell_1$ Lasso penalty $\sum_{i=1}^p |\beta_j|.$ This
latter constraint makes the solutions nonlinear in the $y_i.$ Thus, contrary to ridge regression, the Lasso does not admit a closed-form solution. The functional \eqref{LassoLagrange} is convex but not {\it strictly} convex, so that the solution is not unique.

The above constrained minimization is a quadratic programming problem, whose solution can be efficiently approximated (e.g., coordinate descent, interior point methods, $\ldots$).  The historical efficiency of algorithms to fit lasso models can
be summarized as follows:

\vspace{0.4cm}

\begin{center}
\begin{table}[h]
\begin{tabular}{ c | c | c | c }
  \hline
   \mbox{Year} & \mbox{Algortithm} & \mbox{Operations} & \mbox{Practical Limit}  \\
     \hline
  	1996 & \mbox{Quadratic programming} & $O(N2^p)$ & $\sim 100$  \\
	2003 & \mbox{LARS} & $O(Np^2)$ & $\sim 10,000$ \\
	2008 & \mbox{Coordinate descent} & $O(Np)$ & $\sim 1, 000,000$ \\
  \hline
\end{tabular}
  \caption{The historical efficiency of algorithms to fit lasso models.}
  \end{table}
\end{center}

%However,  efficient algorithms exist for computing the entire path of solutions as $\lambda$ varied, with the same computational cost as for ridge regression.
%Because of the nature of the constraint, making $t$ sufficiently small will
%cause some of the coefficients to be exactly zero.
\subsubsection{Lasso parameter estimation by coordinate descent}

Coordinate descent belongs the the first-order optimization methods. It optimizes each coordinate $\beta_j$ at a time iteratively using subgradient information $\frac{\partial f}{\partial \beta_j}(\beta )$(Algorithm \ref{algo:coordinat_descent}).
\begin{algorithm}[h!]
\label{algo:coordinat_descent}
\SetAlgoLined
  \textbf{Initialize} $\beta^{(0)}$\;
  \Repeat{\it converged}{
     \For{$j=1,...,P$}{
		$\beta^{(i+1)}_j = \beta^{(i)}_j - \alpha \frac{\partial f}{\partial \beta_j}(\beta )$\;
     }
    }
    \caption{Generalized coordinat descent algorithm in pseudo code with step size $\alpha$ and subgradient $\frac{\partial f}{\partial \beta_j}(\beta )$}
\end{algorithm}

If a closed form solution to the single variable optimization problem exists, then we can update the parameter directly at each step.
It turns out, Lasso possesses a closed form solution for the single variable optimization problem!

To derive at the closed form solution for an individual coordinate of the Lasso, we first will re-write the gradient of $J(\beta)$ with respect to $\beta_j$:
\begin{align}
\begin{split}
\frac{\partial}{\partial \beta_j} J(\beta ) &= -2\sum_{i=1}^N X_{ij} \left[y_i - \beta_0 - \sum_{j=1}^P\beta_j X_{ij}\right]\\
&= -2\sum_{i=1}^NX_{ij}\left[y_i -\beta_0 - \sum_{k \neq j}^P\beta_k X_{ik} - \beta_j X_{ij}\right]\\
&= -2\underbrace{\sum_{i=1}^NX_{ij}\left[y_i - \beta_0 - \sum_{k \neq j}^P\beta_k X_{ik} \right]}_{\hat{=}\rho} + 2\beta_j\underbrace{\sum_{i=1}^NX_{ij}^2}_{\hat{=}z_j}\\
&\hat{=} -2\rho +2\beta_j z_j
\end{split}
\end{align}

The problem arises when taking the derivative of the $\ell_1$ norm $\lambda\frac{\partial}{\partial \beta_j}|\beta_j|$ since the derivate is only defined for $\beta_j \neq 0$ as :
\begin{equation}
\frac{\partial}{\partial \beta_j} |\beta_j| = \begin{cases}
-1 \quad &\beta_j <  0\\
1 \quad &\beta_j > 0
\end{cases}
\end{equation}

To address the problem point $|\beta_j| = 0$ we use the notion of a {\it subgradient}, which is an extension of differential/gradients to non-differentiable points.
Remember that for convex and differentiable $f$:
$$ f(y) \geq f(x) + \partial f(x) (y-x) \qquad \mbox{for all } x,y,$$
i.e. the linear approximation always underestimates $f$.
More generally, the subgradient of a convex $f$ at $x$ is defined as the family of linear forms $V\in \partial f(x)$ that lower bound $f(x)$:
\begin{equation}
f(y) \geq f(x) + V(y-x) \qquad \mbox{for all } y
\end{equation}
Note that this always exists and equals the differential $\partial f(x)$ if $f$ is differentiable at $x$ (and this definition also works for non-convex $f$ but does not always exist then).
Example: The subgradient with respect to $|\beta_j|$ are all linear forms that have a slope within the interval $[-1, 1]$.

Putting everything together we arrive at the subgradient of $|\beta_j|$ as:
\begin{equation}
\partial_{\beta_j}|\beta_j| = \begin{cases}
-1 \quad& \beta_j < 0\\
[-1,1] \quad& \beta_j = 0\\
1 \quad& \beta_j > 0\\
\end{cases}
\end{equation}

and the subgradient of the Lasso as:
\begin{align}
\begin{split}
\partial_{\beta_j}J(\beta)^{\mathtt{lasso}} &= 2z_j\beta_j -2\rho +\begin{cases}
-\lambda \quad& \beta_j < 0\\
[-\lambda,\lambda] \quad& \beta_j = 0\\
\lambda \quad& \beta_j > 0\\
\end{cases}\\
&=\begin{cases}
2z_j\beta_j -2\rho-\lambda \quad& \beta_j < 0\\
[-2\rho-\lambda,-2\rho+\lambda] \quad& \beta_j = 0\\
2z_j\beta_j -2\rho+\lambda \quad& \beta_j > 0\\
\end{cases}\\
\end{split}
\end{align}

To derive the closed form solution for $\beta_j$ we set the subgradient to $0$ for the three individual cases independently.
\\
\\
\textbf{Case 1} ($\beta_j < 0$):
\begin{align}
\begin{split}
&2z_j\beta_j -2\rho-\lambda = 0\\
&\beta_j = \frac{\rho_j+\frac{\lambda}{2}}{z_j}.
\end{split}
\end{align}
For $\beta_j$ satisfying the inequality the following constraint $\rho_j< \frac{-\lambda}{2}$ must hold.
\\
\\
\textbf{Case 2} ($\beta_j = 0$):\\
For  $\beta_j = 0$ being an optimum the interval $[-2\rho-\lambda, -2\rho_j+\lambda]$ has to contain contain $0$. This only holds true if $-\frac{\lambda}{2}\leq\rho_j\leq\frac{\lambda}{2}$.
\\
\\
\textbf{Case 3} ($\beta_j > 0$):\\
\begin{align}
\begin{split}
&2z_j\beta_j -2\rho+\lambda = 0\\
&\beta_j = \frac{\rho_j-\frac{\lambda}{2}}{z_j}.
\end{split}
\end{align}
For $\beta_j$ satisfying the inequality the  following constraint $\rho_j> \frac{\lambda}{2}$ must hold.

Altogether this gives us the closed form solution of the Lasso with respect to $\beta_j$ as:
\begin{equation}
\hat{\beta}_j^{\mathtt{lasso}} = \begin{cases}
\frac{\rho_j+\frac{\lambda}{2}}{z_j} \quad& \rho_j < -\frac{\lambda}{2}\\
0 \quad& \rho \in  \left[-\frac{\lambda}{2},\frac{\lambda}{2}\right]\\
\frac{\rho_j-\frac{\lambda}{2}}{z_j} \quad& \rho_j  > \frac{\lambda}{2}\\
\end{cases}\\
\end{equation}

which is known as {\it soft-thresholding} update rule. The coordinate descent algorithm for the Lasso is then defined as:

\begin{algorithm}[h!]

\SetAlgoLined
  \KwIn{Regularization parameter $\lambda$}
  \KwResult{Optimal coeffecients $\hat{\beta }^{\text{LS}}$}
 \textbf{Initialize} $\beta = (X^TX+\lambda I)^{-1}X^Ty$\;
 \textbf{Precompute}  $z_j =\sum_{i=1}^N X_{ij}^2$\;
  \Repeat{\it converged}{
     \For{$j=1,...,P$}{
     \begin{align*}
        \rho_j &=\sum_{i=1}^N X_{ij}\left [ y_i -\sum_{k\neq j}^P\beta_k X_{ik}  \right ]\\
\hat{\beta}_j^{\mathtt{lasso}} &= \begin{cases}
\frac{\rho_j+\frac{\lambda}{2}}{z_j} \quad& \rho_j < -\frac{\lambda}{2}\\
0 \quad& \rho \in  \left[-\frac{\lambda}{2},\frac{\lambda}{2}\right]\\
\frac{\rho_j-\frac{\lambda}{2}}{z_j} \quad& \rho_j > \frac{\lambda}{2}\\
\end{cases}\\
     \end{align*}
     }
    }
\end{algorithm}

\subsubsection{Bayesian interpretation of Lasso}

The solution of \eqref{LassoLagrange} can be interpreted as posterior mode estimates when
the regression parameters $\beta_j'$s have independent and identical Laplace priors. More precisely, we consider a prior of the form
\[
	P(\beta | \tau) = \frac{1}{2 \tau}\prod_{j=1}^p \exp(- |\beta_j| / \tau),
\]
with $\tau = \frac{1}{\lambda}.$
We will use a uniform prior on the offset term, $P(\beta_0) \propto 1.$

\begin{figure}\centering
\includegraphics[width=.7\columnwidth]{images/Laplacian}
\caption{The Laplacian prior assigns more weight to regions near
zero than the normal prior.}
\label{fig:estlaplcian}
\end{figure}


Then we perform MAP estimation with this prior. The penalized negative log likelihood has the form
\[
	f(\beta) = -\log P(\y | \beta) - \log P(\beta | \tau) = -\log P(\y | \beta)  + \tau \|\beta\|_1,
\]
where $\| \beta \|_{1} = \sum_{j=1}^p |\beta_j |$ is the $\ell_1$ norm of $\beta.$

%{\bf NOTE:} Some info on the minimization algorithms!!!!

%% Generalisation REGRESSION
\subsubsection{More on Lasso}

Since 2005 with the introduction of the LARS algorithm, there have been a lot of activity and fruitful research in developing algorithms for fitting regularization paths for a variety of different problems, incl. high-dimensional problems.  In addition, $\ell_1$ regularization has taken on a life of its own, leading to the development of the field {\it compressed sensing} in the signal processing literature \cite{Donoho2006a}. We  touch briefly upon some of the Lasso modifications.

Moreover, we recall that the solution of $\ell_1$ regularization is not unique, i.e., the various solution have the same
prediction properties but different selection properties.

\subsubsection{The Dantzig Selector*}
In many real-life problems the number of variables / parameters $p$ is much larger than the number of observations $N.$ Then when fitting the model $\y = \X^T \beta$ we end up with infinitely many solutions, since our system is underdetermined.

Candes and Tao (2007) proposed a new estimator, the so-called {\it Dantzig selector}, to estimate $\beta$ as the solutions to the $\ell_1$ regularization problem.
\begin{eqnarray*}
	&& \min_{\beta} \|\beta\|_{1} \\
	 \mbox{ subject to } && \| \X^T (\y - \X \beta) \|_{\infty} \leq \sqrt{2 \log p} \sigma,
\end{eqnarray*}
where $\sigma$ is a noise level, and $\| \cdot \|_{\infty}$ denotes the $\ell_\infty$ norm, the maximum absolute value of the components of the vector. In this form it resembles the lasso, replacing
squared error loss by the maximum absolute value of its gradient.

It is known from the work on the Dantzig selector that one can reconstruct $\hat \beta^{\texttt{DS}}$ with the following accuracy
\begin{equation}
	\label{recon_rate}
	\| \hat \beta^{\texttt{DS}} - \beta \|_2^2 \leq C  s \sigma^2 \log p,
\end{equation}
where $s=\# \mathtt{supp}(\beta)$ and $s \leq N/2.$ The constant $C$ depends on the matrix $\X$ and is different from infinity, if the relevant variables are not correlated. However, error rate \eqref{recon_rate} is achieved under the assumption that someone has provided us with the exact location of the non-zero components of $\beta,$ i.e., the support of $\beta$ is known.

\subsubsection{The Grouped Lasso*}
In some problems, the predictors belong to pre-defined groups; for example
genes that belong to the same biological pathway. In
this situation it may be desirable to shrink and select the members of a
group together.
In this case, the classical lasso is going to select just one of them (this can be bad for interpretability but maybe good for compression), whereas the {\it grouped lasso} provide such an option. Suppose that the $p$ predictors are divided into $L$ groups, with $p_l$ the number in group $l$. We use the notation for a matrix $\X_l$ to represent the predictors corresponding to the $l$th group, with corresponding coefficient vector $\beta_l$. The grouped lasso minimizes the convex criterion
\begin{equation}
\label{GroupedLasso}
	\min_{\beta \in \R^p} \left(  \| \y - \beta_0 \mathbf 1 - \sum_{l=1}^L \X_l \beta_l \|_2^2 + \lambda  \sum_{l=1}^L \sqrt{p_l} \| \beta_l \|_2 \right),
\end{equation}
where the $\sqrt{p_l}$ terms accounts for the varying group sizes, and $\| \cdot \|_2$ is the Euclidean norm (not squared!).  Since the Euclidean norm of a vector $\beta_l$ is zero only if all of its components are zero, this procedure encourages sparsity at both the group and individual levels.   That is, for some values of $\lambda,$ an entire  group of predictors may drop out of the model.


%% Generalisation REGRESSION
\subsubsection{Discussion and Generalization}

We discuss and compare two approaches for restricting the linear regression model: the ridge regression and the Lasso. It is worthwhile mentioning that the described techniques could be applied to estimating any regression function $f(\X),$ not only of linear type.

In the case of an orthonormal input matrix $\X$ the two procedures have explicit solution. Each method applies a simple transformation to the least squares estimate $\hat \beta_j$ as shown in the table below.
\vspace{0.4cm}

\begin{center}
\begin{table}[h]
\begin{tabular}{ c | c  }
  \hline
   \mbox{Estimator} & \mbox{Formula} \\
     \hline
  	\mbox{Ridge regression} & $\hat \beta_j / (1+\lambda)$ \\
	\mbox{Lasso} & $\sgn (\hat \beta_j) (|\hat \beta_j| - \lambda)_+$ \\
  \hline
\end{tabular}
  \caption{Estimators of $\beta_j$ in the case of orthonormal columns of $\X,$ $\lambda$ is the regularization parameter chosen by the corresponding technique, $\sgn$ denotes the sign of its argument $(\pm 1),$ and $x_+$ denotes 'positive part' of $x$.}
  \end{table}
\end{center}

\begin{figure}\centering
\includegraphics[width=.7\columnwidth]{images/est1}
\caption{Ridge regression and lasso estimators (broken red line) for the case of orthonormal columns of $X.$ The $45^\circ$ line in gray shows the unrestricted estimate
for reference. From \cite{Hastie:2009wp}, Fig 3.4.}
\label{fig:estridge}
\end{figure}

Ridge regression does a proportional shrinkage. Lasso translates each
coefficient by a constant factor $\lambda,$ truncating at zero. This is called {\it soft-thresholding}.  For the nonorthogonal case, let us consider an illustrative example in 2D, i.e., $p=2$ and compare Lasso / ridge regression (see Figure \ref{fig:est2}). The residual sum of squares has elliptical
contours, centered at the full least squares estimate. The constraint region for ridge regression is the disk $\beta_1^2 + \beta_2^2 \leq t^2,$ while that for Lasso is the diamond $|\beta_1| + |\beta_2| \leq t.$ Both methods find the first point where the elliptical contours hit the constraint region. Unlike the disk, the diamond
has corners; if the solution occurs at a corner, then it has one parameter $\beta_j$ equal to zero. When $p>2,$ the diamond becomes a rhomboid, and has
many corners, flat edges and faces; there are many more opportunities for
the estimated parameters to be zero.


\begin{figure}\centering
\includegraphics[width=.7\columnwidth]{images/est2}
\caption{Estimation picture for the Lasso (left) and ridge regression
(right). Shown are contours of the error and constraint functions. The solid blue
areas are the constraint regions $|\beta_1| + |\beta_2| \leq t$ and $\beta_1^2 + \beta_2^2 \leq t^2$ respectively, while the red ellipses are the contours of the least squares error function.
From \cite{Hastie:2009wp}, Fig 3.11.}
\label{fig:est2}
\end{figure}


We will now further generalize ridge regression and the lasso, and view them both in frequentist and bayesian formulations. Consider the criterion
\begin{equation}
\label{RegressionGeneral}
\bar \beta = \argmin_\beta \left[ \sum_{i=1}^N (y_i - \beta_0 - \sum_{j=1}^p x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^p |\beta_j|^q    \right],
\end{equation}
for $q \geq 0.$
The contours of constant value of $\sum_j |\beta_j|^q$ are shown in Figure \ref{fig:contours}, for the case of two inputs.

\begin{figure}\centering
\includegraphics[width=.9\columnwidth]{images/RR_Shapes}
\caption{Contours of constant value of $\sum_j |\beta_j|^q$ for given values of $q$
From \cite{Hastie:2009wp}, Fig 3.12.}
\label{fig:contours}
\end{figure}

\subsubsection{Bayesian formulation}
Thinking of $|\beta_j|^q$ as the log-prior density for $\beta_j,$ these are also the equicontours
of the prior distribution of the parameters. The value $q=0$ corresponds
to variable subset selection, as the penalty simply counts the number
of nonzero parameters; $q=1$ corresponds to the Lasso; whereas $q =2$ to ridge regression. For $q \geq 1$ the prior is not uniform in direction, but concentrates more mass in the coordinate directions. The prior corresponding
to the $q=1$ case is an independent double exponential (or Laplace) distribution for each input.
For $q <1$ the constraint region is non-convex, which leads to non-convex optimization problem.

In this view, the Lasso and ridge regression are Bayes estimates with different priors. Note, however, that they are derived
as posterior modes, that is, maximizers of the posterior. It is more common
to use the mean of the posterior as the Bayes estimate. Ridge regression is
also the posterior mean, but Lasso  is not.
%% Generalisation REGRESSION
\subsection{Elastic Net}

Could  other values of $q$ rather than $q=1$ and $q =2$ bring better performance? Should one estimate $q$ from the data? The experience show that it is not worth the effort and actually  the values of $q \in (1,2)$ suggest the compromise between the Lasso and ridge regression (see Figure \ref{fig:contours}). Although this is the case, with $q>1,$ the penalty term $|\beta_j|^q$ is differentiable at $0$, and so does not share the ability of Lasso for setting coefficients exactly to $0.$ Partially for this reason as well as for computational tractability, \cite{ZouHastie05} introduced the {\it elastic-net} penalty
\begin{equation}
\label{ELNET}
	\lambda \sum_{j=1}^p (\alpha |\beta_j|^2 + (1-\alpha) |\beta_j|),
\end{equation}
which represents another compromise between ridge regression and the Lasso. In particular, the elastic-net selects variables like Lasso, and shrinks together the coefficients of correlated predictors like ridge function. It also has considerable computational advantages over the $\ell_q$ penalties.


\begin{figure}\centering
\includegraphics[width=.7\columnwidth]{images/EN}
\caption{Contours of constant value of $\sum_j |\beta_j|^q$ for $q=1.2$ (left plot), and the elastic-net penalty \eqref{ELNET} with $\alpha =0.2$ (right plot). Although visually very similar, the elastic-net has sharp (non-differentiable) corners, while the $q=1.2$ penalty does not.
From \cite{Hastie:2009wp}, Fig 3.13.}
\label{fig:ELN}
\end{figure}

\subsection{Parameter and model selection}

The choice of the regularization parameter $\lambda$ in both ridge regression and in the lasso (as well as in the elastic net) is more of an art than a science. This parameter can be constructed as a {\it complexity parameter,} since as $\lambda$ increases, less and less effective parameters are likely to be included in both ridge and lasso regression. Therefore, one usually adopts a model selection perspective and compare different choices of $\lambda$ using, for instance, {\it cross-validation} or an {\it information criterion}. In the following we review briefly some of the existing approaches for the parameter selection problem as well as provide recipes for a few (but most well-known) of them.

 To motivate these approaches, we assume that we are only provided with a finite set of training examples, usually smaller than we wanted, and we need to adjust regularization parameter(s), so that our predictor has an ability to generalize, i.e., perform well on a new previously unseen data. A general idea is to split the given training data into disjoint subsets: {\it training set} and {\it testing set.}

In  the {\bf hold-out method} the given data set is split into two groups: {\it training set} (used to train the classifier) and {\it testing set} (used to estimate the error rate of the trained classifier).  It is usually applied to determine a stopping point for iterative methods. However, this method has two main drawbacks: (i) In problems where we have a sparse dataset we may not be able to afford the  'luxury' of setting aside a portion of the dataset for testing; (ii) Since it is a single train-and-test experiment, the hold-out estimate of error rate  will be misleading if we happen to get an 'unfortunate' split.

\begin{figure}\centering
\includegraphics[width=.6\columnwidth]{images/HOM}
\caption{Hold-out method}
\label{fig:HOM}
\end{figure}


The limitations of the hold-out method can be overcome with a family of resampling methods (such as cross validation, bootstrap) at the expense of more computations.

{\bf Random Sampling} performs $K$ data splits on the dataset.  Each split randomly selects a (fixed) number examples without replacement. Then for each data split we retrain the classifier from scratch with the training  examples and compute the error $RS_i$ with the test examples. The true error estimate is obtained as the average of the separate estimates $RS_i = \frac 1K \sum_{i=1}^K RS_i,$ which is significantly better than the hold-out estimate.

{\bf $K-$Fold cross validation} is a common technique for estimating the performance of a predictor. A general idea is  for each of $K$ experiments to use $K-1$ folds for training and the remaining one for testing. It is similar to Random Sampling. However, the advantage of $K-$fold cross validation is that all the examples in the dataset are eventually used for both training and testing.

\begin{figure}\centering
\includegraphics[width=.45\columnwidth]{images/RS}