<p>Machine Learning Medium: A step away from the illusion of knowledge.<br />
https://machinelearningmedium.com/<br />
Tue, 20 Nov 2018</p>
<h2>Social Bias in Machine Learning</h2>
<h3 id="introduction">Introduction</h3>
<p>Discrimination, injustice and oppression are some of the dark words that have been an integral part of human history. While there is an active effort to make the world a fairer place in every sphere of life, it is almost impossible to make the data recorded over the years fair to all castes, creeds, races and religions, because history is written in ink. Since the world was more biased the further back in time we look, historical data often reflects these biases in how minority and majority classes are represented. This is the very data that is used to train most of our machine learning models, often without any conscious thought about the fairness of the algorithm, i.e. whether or not it reproduces the biases that prevailed back then. Machine learning is now used in many important decision making pipelines such as predicting recidivism, college acceptance and loan approvals, so it becomes increasingly important to question the models being developed in terms of the implicit bias they might inherit from the data they train on. To do away with such biases one needs to understand how exactly bias creeps in, which metrics can measure it, and which methods can remove it. This post is an attempt to summarize these issues and possible remedies.</p>
<h3 id="background">Background</h3>
<p>Since machine learning is now used to make many policy decisions that affect people’s lives on an everyday basis, one should make sure that unfairness is not a part of such decision making. Training machine learning algorithms with the standard utility maximization and loss minimization objectives sometimes results in algorithms that behave in a way a fair human observer would deem biased. A recent example is Amazon’s recruiting engine, which was <a href="https://www.ml.cmu.edu/news/news-archive/2018/october/amazon-scraps-secret-artificial-intelligence-recruiting-engine-that-showed-biases-against-women.html" target="_blank">reported</a> to have been scrapped after it showed a bias against women.</p>
<h3 id="its-all-in-the-data">It’s all in the Data</h3>
<p>One potential reason for such biases is the training data itself. Since the algorithms are big numerical puzzles trained to recognize and mimic statistical patterns in historical data, it is only natural for such a trained system to display biased characteristics. Even some of the state of the art solutions in NLP and machine learning are not free from bias and unfairness. For example, it has been shown that word2vec embeddings learnt from huge text corpora often show gender bias: the distance between word vectors, which signifies how strongly words are associated, suggests a strong association of words like homemaker and nanny with she, and of maestro and boss with he. Any system built on top of such a word embedding is very likely to propagate this bias on a daily basis at some level.</p>
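<p>One hedged way to probe such an association is to compare how close an occupation word sits to she versus he in the embedding space. The sketch below uses cosine similarity, a common choice for word embeddings, on made-up three-dimensional vectors. The vectors, the word list and the <code>gender_lean</code> helper are illustrative assumptions and are not taken from a trained word2vec model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import numpy as np

# Toy 3-d vectors invented for illustration; real word2vec vectors
# have hundreds of dimensions and are learned from text.
toy_vectors = {
    "he": np.array([1.0, 0.0, 0.2]),
    "she": np.array([0.0, 1.0, 0.2]),
    "boss": np.array([0.9, 0.1, 0.3]),
    "nanny": np.array([0.1, 0.9, 0.3]),
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def gender_lean(word, vectors):
    """Similarity to 'she' minus similarity to 'he' (hypothetical helper).

    Positive values lean towards 'she', negative values towards 'he'.
    """
    return cosine(vectors[word], vectors["she"]) - cosine(vectors[word], vectors["he"])

for word in ("boss", "nanny"):
    print(word, round(gender_lean(word, toy_vectors), 3))
</code></pre></div></div>
<p>With real embeddings the same score could be computed for a list of occupation words to see which ones lean strongly towards either pronoun.</p>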
<p>One of the contested ways of dealing with this issue is to retrain the models continuously with new data, which relies on the assumption that historical bias is in the process of correcting itself.</p>
<p>Another major question that continuously arises stems from the fact that these machine learning algorithms work well when the amount of data they train on is huge. While this is true in an overall sense, breaking down the number of data points available for a minority class makes it apparent that the algorithm does not have enough supporting instances to learn as good a representation of the minority class as it does of the majority, which could lead to unfair judgements because of a lack of data.</p>
<blockquote>
<p>There is general tendency for automated decisions to favor those who belong to statistically dominant groups.</p>
</blockquote>
<p>Statistical patterns that apply to the majority population might be invalid for the minority group. It can also happen that a variable that is positively correlated with the target in the general population is negatively correlated with the target in the minority group. For example, a real name might be a short common name in one culture and a long unique name in another. Hence the same rules for detecting fake names would not work across such groups.</p>
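<p>The sign flip described above can be reproduced on a small synthetic example. The sketch below assumes a single feature (name length) and a binary target (whether the name is real). All numbers are invented for illustration and the <code>corr</code> helper is hypothetical.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import numpy as np

# Hypothetical toy data: name length and whether the name is real (1) or fake (0).
majority_len = np.array([3, 4, 5, 6, 9, 10, 11, 12])
majority_real = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # real names are short
minority_len = np.array([4, 5, 11, 12])
minority_real = np.array([0, 0, 1, 1])               # real names are long

def corr(x, y):
    """Pearson correlation between a feature and the target."""
    return float(np.corrcoef(x, y)[0, 1])

overall = corr(np.concatenate([majority_len, minority_len]),
               np.concatenate([majority_real, minority_real]))
majority = corr(majority_len, majority_real)
minority = corr(minority_len, minority_real)

print("overall :", round(overall, 2))
print("majority:", round(majority, 2))
print("minority:", round(minority, 2))
</code></pre></div></div>
<p>In this toy data the overall and majority correlations are negative while the minority correlation is positive, so a rule learnt from the pooled data would point the wrong way for the minority group.</p>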
<p><img src="/assets/2018-10-09-algorithmic-fairness/fig-2-survival-distribution.png?raw=true" alt="Fig-1: Survival Distribution" /></p>
<p>Consider a very simple dataset from Kaggle called <a href="https://www.kaggle.com/c/titanic" target="_blank">titanic</a>. In this basic dataset one has to <strong>predict the survival probability of an individual who was on the Titanic</strong> from a handful of features. The survival distribution on the training data shows that <strong>during the Titanic incident a female passenger had a much higher chance of surviving than a male passenger</strong>. It would be rather obvious <strong>to an algorithm trained on this data that being female is a strong indicator of survival</strong>. If the same algorithm were used to predict survival in an impending sinking incident where those with a higher survival probability are boarded on rescue boats first, it would be bound to make biased decisions.</p>
<p>It can also be seen that being male is negatively correlated with surviving while being female is positively correlated, because graph 2 in fig-1 shows that more males died than survived and, by contrast, more females survived than died. So <strong>if the algorithm were to learn only from the majority of the data, which belongs to males, it would predict badly for the female population</strong>.</p>
<h3 id="undeniable-complexities">Undeniable Complexities</h3>
<p>One way to counter the sample size disparity might be to learn different classifiers for different sub-groups. But it is not as simple as it sounds, because learning and testing for an individual sub-group might require acting on the protected attributes, which might in itself be objectionable. Also, the definition of a minority is fuzzy, as there could be many different overlapping minorities and no straightforward way of determining group membership.</p>
<h3 id="noise-vs-modeling-error">Noise vs Modeling Error</h3>
<p>Say a classifier achieves 95 percent accuracy. In a real world scenario this 5 percent error rate would point to a really well trained classifier. But what is often overlooked is that there can be two different kinds of underlying reasons behind the error rate. One is the general case of noise that the classifier was not able to model and hence could not predict or account for. The other is that while the model is 100 percent accurate on the majority class, it is only 50 percent accurate on the minority class (which gives 95 percent overall when the minority makes up 10 percent of the data). This systematic error on the minority class would be a clear case of algorithmic unfairness.</p>
<p>The bigger issue here is that there is no principled or textbook methodology for distinguishing noise from modeling errors. Such questions can only be answered with a great deal of domain knowledge and experience.</p>
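<p>The arithmetic behind the 95 percent example can be checked by breaking accuracy down per group. The sketch below assumes a hypothetical validation set of 100 samples in which the minority makes up 10 percent, so that 0.9 × 1.0 + 0.1 × 0.5 = 0.95. The labels, predictions and the <code>accuracy</code> helper are made up for illustration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import numpy as np

# Hypothetical validation set: 90 majority samples followed by 10 minority samples.
group = np.array(["majority"] * 90 + ["minority"] * 10)
y_true = np.ones(100, dtype=int)

# Assume the model is right on every majority sample and on half of the minority.
y_pred = y_true.copy()
y_pred[95:] = 0          # 5 of the 10 minority samples are misclassified

def accuracy(mask):
    """Accuracy restricted to the samples selected by a boolean mask."""
    return float(np.mean(y_true[mask] == y_pred[mask]))

everyone = np.ones(100, dtype=bool)
print("overall :", accuracy(everyone))
print("majority:", accuracy(group == "majority"))
print("minority:", accuracy(group == "minority"))
</code></pre></div></div>
<p>The overall figure alone hides the gap. Reporting accuracy per group is a cheap first check, although it cannot by itself tell noise apart from systematic modeling error.</p>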
<h3 id="edge-cases-always-exist">Edge Cases always exist</h3>
<p>Bias can also creep into algorithms in very unexpected ways, even if the training data is labelled correctly and is free of any issue that could be pointed out as biased. An <a href="https://www.theverge.com/2015/7/1/8880363/google-apologizes-photos-app-tags-two-black-people-gorillas" target="_blank">example</a> of this is when Google Photos mistakenly labeled two black people as gorillas. Presumably the system was never trained on data that should lead to such an inference, but because the number of trained parameters is so high, it often becomes intractable to understand why a system misbehaves in certain conditions. This uncertainty about outcomes can also be a cause of bias in situations that could not be predicted in advance.</p>
<h3 id="what-is-fairness">What is Fairness?</h3>
<p>Fairness in classification involves studying algorithms not only from a perspective of accuracy, but also from a perspective of fairness.</p>
<blockquote>
<p>The most difficult part of this is to define what is fairness.</p>
</blockquote>
<p>Considering fairness often means compromising on accuracy, but it’s a necessary evil that is not going anywhere in the near future. What is often more surprising is that many fairness metrics trade off against one another.</p>
<h3 id="fairness-of-process-vs-fairness-of-outcome">Fairness of Process vs Fairness of Outcome</h3>
<ul>
<li>
<p>An <strong>aware</strong> algorithm is one that uses the information regarding the protected attribute (such as gender, ethnicity etc.) in the process of learning. An <strong>unaware</strong> algorithm will not.</p>
</li>
<li>
<p>The motivation behind an unaware algorithm is that being fair means disregarding the protected attribute, but simply removing the protected attribute often does not work. Sometimes there is a strong correlation between the protected attribute and some other feature, so in order to train a truly unaware algorithm one needs to remove the correlated feature group as well.</p>
</li>
<li>
<p>This process of manually engineering a feature list that conveys no information about the protected attribute can also be automated using machine learning techniques discussed in following sections.</p>
</li>
</ul>
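<p>A simple, hedged way to look for features that leak the protected attribute is to check their correlation with it before dropping it. The sketch below uses a small invented data frame. The column names, the 0.5 threshold and the <code>proxy_features</code> helper are assumptions for illustration, and a pairwise linear correlation will miss non-linear or multi-feature proxies.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import pandas as pd

# Hypothetical toy data; 'gender' is the protected attribute (1 = female).
df = pd.DataFrame({
    "gender":       [1, 1, 1, 1, 0, 0, 0, 0],
    "title_is_mrs": [1, 1, 1, 0, 0, 0, 0, 0],   # near-proxy for gender
    "age":          [30, 45, 30, 45, 30, 45, 30, 45],
})

def proxy_features(frame, protected, threshold=0.5):
    """Names of features whose absolute correlation with the protected
    attribute reaches the threshold."""
    corr = frame.drop(columns=[protected]).corrwith(frame[protected]).abs()
    return corr[corr.ge(threshold)].index.tolist()

proxies = proxy_features(df, "gender")

# An "unaware" feature set drops the protected attribute and its proxies.
unaware_features = df.drop(columns=["gender"] + proxies)
print(proxies)
print(unaware_features.columns.tolist())
</code></pre></div></div>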
<h3 id="are-unaware-algorithms-the-solution">Are Unaware Algorithms the Solution?</h3>
<ul>
<li>
<p>There could be inherent differences between the populations defined by the masked protected attributes, in which case masking them would be undesirable.</p>
</li>
<li>
<p>The aware approaches use these protected attributes and have a better chance of understanding the dependence of the outcome on them.</p>
</li>
<li>
<p>This can be seen as a distinction between <strong>fairness of process</strong> vs <strong>fairness of outcomes</strong>. The unaware algorithms ensure a fairness of process, because under such a scheme the algorithm does not use any of the protected attributes for decision making. However, such fairness in process does not guarantee a fair outcome towards the protected and un-protected sub-groups.</p>
</li>
<li>
<p>The aware approaches, on the contrary, use the protected attributes and hence are not a fair process, but they can reach an outcome that is more fair towards the minorities.</p>
</li>
</ul>
<h3 id="mathematical-fairness-statistical-parity">Mathematical Fairness: Statistical Parity</h3>
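<p>A mathematical version of absolute fairness is statistical parity: the condition that the chance of a positive outcome is the same for the majority and minority classes (or for all classes in a multi-class scenario). Reading <script type="math/tex">h(x)</script> as the classifier’s prediction, <script type="math/tex">P</script> as the protected group and <script type="math/tex">P^C</script> as its complement, this can be written as,</p>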
<script type="math/tex; mode=display">Pr[h(x) = 1 \vert x \in P^C] = Pr[h(x) = 1 \vert x \in P] \tag{1} \label{1}</script>
<p>The main objection to this criterion is the question: <strong>does one really want to equalize the outcomes across all sub-groups?</strong> For example, predicting the success of a basketball player irrespective of height does not make for a very strong model, because discrimination in many domains is not black or white but lies somewhere in the gray region in between. Similarly, predicting the chances of childbirth without using features such as gender and age would make a really poor algorithm. So, <strong>enforcing statistical parity is not always the solution</strong>.</p>
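<p>As a minimal sketch, the two sides of the parity condition can be estimated as the share of positive predictions inside each group. The predictions and group memberships below are invented, and <code>positive_rate</code> is a hypothetical helper.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import numpy as np

# Hypothetical predictions h(x) and protected-group membership (True means x is in P).
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
in_p = np.array([True, True, True, True, False, False, False, False, False, False])

def positive_rate(pred, mask):
    """Estimate Pr[h(x) = 1 | x in group] as the mean prediction inside the group."""
    return float(pred[mask].mean())

rate_p = positive_rate(y_pred, in_p)      # Pr[h(x) = 1 | x in P]
rate_pc = positive_rate(y_pred, ~in_p)    # Pr[h(x) = 1 | x in P^C]
parity_gap = rate_p - rate_pc             # 0 would mean statistical parity holds

print(rate_p, round(rate_pc, 3), round(parity_gap, 3))
</code></pre></div></div>
<p>A gap of zero satisfies the condition exactly. In practice one would usually tolerate a small gap, and how small is a judgement call that depends on the domain.</p>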
<h3 id="cross-group-calibration">Cross-group Calibration</h3>
<ul>
<li>Instead of equalizing the outcomes themselves, one can look to equalize some other statistics of the algorithm’s performance, for example <strong>error rates across groups</strong>.</li>
</ul>
<blockquote>
<p>A fair algorithm would make as many mistakes on a minority group as it does on the majority group.</p>
</blockquote>
<p>A useful tool for such an analysis is the confusion matrix, as shown below.</p>
<p><img src="/assets/2018-10-09-algorithmic-fairness/fig-1-confusion-matrix.png?raw=true" alt="Fig-2: Confusion Matrix" /></p>
<p>Some of the metrics based on the confusion matrix are:</p>
<ul>
<li>
<p><strong>Treatment equality</strong> is achieved by a classifier that yields a ratio of false negatives to false positives (in the table, c/b or b/c) that is the same for both protected group categories.</p>
</li>
<li>
<p><strong>Conditional procedure accuracy equality</strong> is achieved when, conditioning on the known outcome, the classifier is equally accurate across protected group categories. This is equivalent to the false negative rate and the false positive rate being the same for all protected categories.</p>
</li>
</ul>
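<p>Both metrics can be read off a per-group confusion matrix. The sketch below uses <code>confusion_matrix</code> from scikit-learn on invented labels for two groups. The <code>group_fairness_metrics</code> helper and the data are assumptions for illustration, and the ratio is undefined when a group has no false positives.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
import numpy as np
from sklearn.metrics import confusion_matrix

def group_fairness_metrics(y_true, y_pred):
    """False negative rate, false positive rate and FN/FP ratio for one group."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "fnr": float(fn / (fn + tp)),      # share of actual positives that were missed
        "fpr": float(fp / (fp + tn)),      # share of actual negatives that were flagged
        "fn_fp_ratio": float(fn / fp),     # treatment equality compares this across groups
    }

# Hypothetical labels and predictions for two protected-group categories.
y_true_a = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred_a = np.array([1, 1, 0, 0, 1, 0, 0, 0])
y_true_b = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred_b = np.array([1, 1, 1, 0, 1, 1, 0, 0])

for name, yt, yp in (("group A", y_true_a, y_pred_a), ("group B", y_true_b, y_pred_b)):
    print(name, group_fairness_metrics(yt, yp))
</code></pre></div></div>
<p>Conditional procedure accuracy equality would require the <code>fnr</code> and <code>fpr</code> values to match across the groups, and treatment equality would require the <code>fn_fp_ratio</code> values to match.</p>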
<p>Since the cells of a confusion matrix always add up to the total number of observations, many of these fairness metrics trade off against one another. This is basically a <strong>zero-sum game</strong>: one improves at the cost of another and there is no win-win situation. Depending on the use case one has to decide which metrics to optimize for, as there is no blanket solution.</p>
<h3 id="example-titanic">Example: Titanic</h3>
<p><a href="https://www.kaggle.com/shamssam/algorithmic-fairness-in-ml" target="\_blank"><strong>Kaggle Notebook</strong></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="c">### libraries</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sn</span>
<span class="kn">from</span> <span class="nn">sklearn.model_selection</span> <span class="kn">import</span> <span class="n">train_test_split</span>
<span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">accuracy_score</span><span class="p">,</span> <span class="n">classification_report</span><span class="p">,</span> <span class="n">confusion_matrix</span>
<span class="kn">from</span> <span class="nn">xgboost</span> <span class="kn">import</span> <span class="n">XGBClassifier</span>
<span class="n">df_train</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'train.csv'</span><span class="p">,</span> <span class="n">index_col</span><span class="o">=</span><span class="s">'PassengerId'</span><span class="p">)</span>
<span class="n">df_train</span><span class="o">.</span><span class="n">Sex</span> <span class="o">=</span> <span class="n">df_train</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="s">'female'</span>
<span class="n">df_train</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Name'</span><span class="p">,</span> <span class="s">'Ticket'</span><span class="p">,</span> <span class="s">'Cabin'</span><span class="p">,</span> <span class="s">'Embarked'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">X_train</span><span class="p">,</span> <span class="n">X_valid</span> <span class="o">=</span> <span class="n">train_test_split</span><span class="p">(</span><span class="n">df_train</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span> <span class="n">stratify</span><span class="o">=</span><span class="n">df_train</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">())</span>
<span class="c"># aware classification: Sex is kept as a feature</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">XGBClassifier</span><span class="p">()</span>
<span class="n">clf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">X_train</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"OVERALL"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="n">y_valid_hat</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"FEMALE"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="n">y_valid_hat_female</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_female</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_female</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MALE"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="n">y_valid_hat_male</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_male</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_male</span><span class="p">)))</span>
<span class="c"># output</span>
<span class="c"># ========================================</span>
<span class="c"># OVERALL</span>
<span class="c"># ========================================</span>
<span class="c">#              precision    recall  f1-score   support</span>
<span class="c">#           0       0.85      0.90      0.87       165</span>
<span class="c">#           1       0.82      0.75      0.78       103</span>
<span class="c"># avg / total       0.84      0.84      0.84       268</span>
<span class="c"># Accuracy: 0.8395522388059702</span>
<span class="c"># ========================================</span>
<span class="c"># FEMALE</span>
<span class="c"># ========================================</span>
<span class="c">#              precision    recall  f1-score   support</span>
<span class="c">#           0       0.45      0.41      0.43        22</span>
<span class="c">#           1       0.83      0.86      0.84        76</span>
<span class="c"># avg / total       0.75      0.76      0.75        98</span>
<span class="c"># Accuracy: 0.7551020408163265</span>
<span class="c"># ========================================</span>
<span class="c"># MALE</span>
<span class="c"># ========================================</span>
<span class="c">#              precision    recall  f1-score   support</span>
<span class="c">#           0       0.90      0.97      0.94       143</span>
<span class="c">#           1       0.75      0.44      0.56        27</span>
<span class="c"># avg / total       0.88      0.89      0.88       170</span>
<span class="c"># Accuracy: 0.888235294117647</span>
<span class="c"># unaware classification: Sex is dropped from the features</span>
<span class="n">clf</span> <span class="o">=</span> <span class="n">XGBClassifier</span><span class="p">()</span>
<span class="n">clf</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X_train</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="n">X_train</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">())</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"OVERALL"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="n">y_valid_hat</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"FEMALE"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="n">y_valid_hat_female</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_female</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">True</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_female</span><span class="p">)))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"MALE"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">40</span><span class="p">)</span>
<span class="n">y_valid_hat_male</span> <span class="o">=</span> <span class="n">clf</span><span class="o">.</span><span class="n">predict</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">drop</span><span class="p">([</span><span class="s">'Survived'</span><span class="p">,</span> <span class="s">'Sex'</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="n">classification_report</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_male</span><span class="p">))</span>
<span class="k">print</span><span class="p">(</span><span class="s">"Accuracy: {}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">accuracy_score</span><span class="p">(</span><span class="n">X_valid</span><span class="p">[</span><span class="n">X_valid</span><span class="o">.</span><span class="n">Sex</span> <span class="o">==</span> <span class="bp">False</span><span class="p">]</span><span class="o">.</span><span class="n">Survived</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">y_valid_hat_male</span><span class="p">)))</span>
<span class="c"># output</span>
<span class="c"># ========================================</span>
<span class="c"># OVERALL</span>
<span class="c"># ========================================</span>
<span class="c"># precision recall f1-score support</span>
<span class="c"># 0 0.73 0.84 0.78 165</span>
<span class="c"># 1 0.66 0.51 0.58 103</span>
<span class="c"># avg / total 0.71 0.71 0.70 268</span>
<span class="c"># Accuracy: 0.7126865671641791</span>
<span class="c"># ========================================</span>
<span class="c"># FEMALE</span>
<span class="c"># ========================================</span>
<span class="c"># precision recall f1-score support</span>
<span class="c"># 0 0.32 0.82 0.46 22</span>
<span class="c"># 1 0.90 0.50 0.64 76</span>
<span class="c"># avg / total 0.77 0.57 0.60 98</span>
<span class="c"># Accuracy: 0.5714285714285714</span>
<span class="c"># ========================================</span>
<span class="c"># MALE</span>
<span class="c"># ========================================</span>
<span class="c"># precision recall f1-score support</span>
<span class="c"># 0 0.91 0.84 0.87 143</span>
<span class="c"># 1 0.39 0.56 0.46 27</span>
<span class="c"># avg / total 0.83 0.79 0.81 170</span>
<span class="c"># Accuracy: 0.7941176470588235</span>
</code></pre></div></div>
<p>Confusion matrices for the aware and unaware cases (with and without access to the protected attribute, sex in this case) are shown below.</p>
<p><img src="/assets/2018-10-09-algorithmic-fairness/fig-3-aware-confusion-matrix.png?raw=true" alt="Fig-3: Aware Confusion Matrix" /></p>
<p><img src="/assets/2018-10-09-algorithmic-fairness/fig-4-unaware-confusion-matrix.png?raw=true" alt="Fig-4: Unaware Confusion Matrix" /></p>
<p>Note:</p>
<ul>
<li>Conditional accuracy in the code output shows that the system is very biased in both the aware and unaware scenarios.</li>
<li>Treatment equality is more divergent in the aware case than in the unaware case.</li>
</ul>
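<p>As a rough illustration of how the two quantities in the note can be computed per group, here is a minimal sketch. The labels and predictions are made up, <code>group_metrics</code> is a hypothetical helper, and treatment equality is assumed here to mean comparing the ratio of false negatives to false positives across groups.</p>

```python
def group_metrics(y_true, y_pred):
    """Accuracy plus false negative / false positive counts for one group."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    return {"accuracy": correct / len(pairs), "fn": fn, "fp": fp}


# Made-up labels and predictions for two groups (illustrative only).
female = group_metrics([1, 1, 1, 0, 1, 0], [1, 0, 0, 1, 0, 0])
male = group_metrics([0, 0, 1, 0, 0, 1], [0, 1, 1, 0, 0, 0])

for name, m in (("female", female), ("male", male)):
    # Conditional accuracy: accuracy restricted to one group.
    # FN/FP ratio: the quantity compared across groups for treatment equality
    # (assumed definition; undefined when a group has no false positives).
    ratio = m["fn"] / m["fp"] if m["fp"] else float("nan")
    print(name, round(m["accuracy"], 3), ratio)
```

<p>A large gap between the groups in either number is the kind of divergence the note refers to.</p>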
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://towardsdatascience.com/a-gentle-introduction-to-the-discussion-on-algorithmic-fairness-740bbb469b6" target="_blank">A Gentle Introduction to the Discussion on Algorithmic Fairness</a></small><br />
<small><a href="https://medium.com/@mrtz/how-big-data-is-unfair-9aa544d739de" target="_blank">How big data is unfair</a></small><br />
<small><a href="http://fairness-measures.org/" target="_blank">Fairness Measures</a></small><br /></p>
Tue, 09 Oct 2018 00:00:00 +0000
https://machinelearningmedium.com/2018/10/09/algorithmic-fairness/
machine-learning, algorithmic-fairness
Introduction to Computer Architecture
<h3 id="what-is-a-computer">What is a Computer?</h3>
<p>A computer is a general purpose device that can be programmed to process information and yield meaningful results.</p>
<p>The three important takeaways are:</p>
<ul>
<li>programmable device</li>
<li>process information</li>
<li>yield meaningful results</li>
</ul>
<p>So the important parts for the working of a computer are:</p>
<ul>
<li>Program: a list of instructions given to the computer</li>
<li>Information Store: the data it has to process</li>
<li>Computer: processes information into meaningful results.</li>
</ul>
<p>A fully functional computer includes at the very least:</p>
<ul>
<li>Processing Unit (CPU)</li>
<li>Memory</li>
<li>Hard disk</li>
</ul>
<p>Other than these, some input/output (I/O) devices can also be a part of the system, such as:</p>
<ul>
<li>Keyboard: Input</li>
<li>Mouse: Input</li>
<li>Monitor: Output</li>
<li>Printer: Output</li>
</ul>
<h3 id="memory-vs-hard-disk">Memory vs Hard Disk</h3>
<ul>
<li>Storage Capacity: more on hard disk, less on memory</li>
<li>Volatility: data on hard disk is non-volatile, while data in memory is volatile</li>
<li>Speed: access and other operations are slower on hard disk than on memory.</li>
</ul>
<h3 id="brain-vs-computer">Brain vs Computer</h3>
<ul>
<li>Brain is capable of doing a lot of abstract work that computers cannot be programmed to do.</li>
<li>Speed of basic calculations is much higher in a computer which is its primary advantage.</li>
<li>Computers do not get tired or bored or disinterested.</li>
<li>Humans can understand complicated instructions in a variety of semantics and languages.</li>
</ul>
<h3 id="program">Program</h3>
<ul>
<li>Write a program in a high-level language like C, C++, Java etc. (done by a human)</li>
<li>Compile it into an executable (binary), i.e. translate it into machine code, the language the processor understands. (done by compilers)</li>
<li>Execute the binary. (done by processor)</li>
</ul>
<h3 id="instruction-set-architecture-isa">Instruction Set Architecture (ISA)</h3>
<p>The semantics of all the instructions supported by a processor are known as its instruction set architecture (ISA). This includes the semantics of the instructions themselves along with their operands and interfaces with the peripherals.</p>
<blockquote>
<p>ISA is an interface between software and hardware.</p>
</blockquote>
<p>Examples of instruction types in an ISA:</p>
<ul>
<li>arithmetic instructions</li>
<li>logical instructions</li>
<li>data transfer/movement instructions</li>
</ul>
<p>Features of ISA:</p>
<ul>
<li>Complete: it should be able to execute the programs a user wants to write</li>
<li>Concise: smaller set of instructions, currently they fall in the range 32-1000</li>
<li>Generic: instructions should not be too specialized for a given user or a given system.</li>
<li>Simple: instructions should not be complicated</li>
</ul>
<p>There are two different paradigms of designing an ISA:</p>
<ul>
<li>RISC: a Reduced Instruction Set Computer has a small set of simple and regular instructions, typically in the range 64 to 128, e.g. ARM, IBM PowerPC. Found in mobiles, tablets etc.</li>
<li>CISC: a Complex Instruction Set Computer implements complex instructions which are highly irregular and take multiple operands. The number of instructions is also large, typically 500+, e.g. Intel x86, VAX. Used in desktops and bigger computers.</li>
</ul>
<h3 id="completeness-of-isa">Completeness of ISA</h3>
<p><strong>How do we ensure the completeness of an ISA?</strong> Say there are two instructions, addition and subtraction. While it is possible to implement addition using subtraction (a + b = a - (0 - b)), the converse does not hold. This basically means that <strong>in order to complete an ISA one needs a set of instructions such that no other instruction is more powerful than the set</strong>.</p>
<p><strong>How do we ensure that one has a complete instruction set such that one can write any program?</strong> The answer to this lies in finding a <strong>Universal ISA</strong>, which would in turn constitute a <strong>Universal Machine</strong> that can be used to write any program known to mankind (a Universal Machine has a set of basic actions, where each such action can be interpreted as an instruction).</p>
<h3 id="turing-machine">Turing Machine</h3>
<p>Alan Turing, the father of computer science, devised a theoretical device called the <strong>Turing machine</strong>, which is the most powerful machine known because theoretically it can compute the results of all the programs one can be interested in.</p>
<p>A Turing machine is a hypothetical machine which consists of an <strong>infinite tape consisting of cells</strong> extending in either direction, a <strong>tape head that maintains a pointer on the tape and can move left or right</strong>, a <strong>state cell that saves the current state</strong> of the machine, and an <strong>action table that holds the set of instructions</strong>. It is posed as a thesis (the <strong>Church-Turing Thesis</strong>, not a theorem), which has not been countered in the past 60 years, that</p>
<blockquote>
<p>Any real-world computation can be translated into an equivalent computation involving Turing machine.</p>
</blockquote>
<p>Also,</p>
<blockquote>
<p>Any computer that is equivalent to a Turing machine is said to be Turing Complete.</p>
</blockquote>
<p>So the answer to <strong>can we build a complete ISA</strong> lies in the question <strong>can we design a Universal Turing Machine (UTM) that can simulate any Turing machine</strong>, i.e. all one needs to do is to build a Turing machine (a seemingly simple architecture) that can implement other Turing machines (manage tape, tape head, state cell and action table).</p>
<p>So analogously speaking, current computers are an attempt to implement this universal Turing machine (UTM), where the <strong>generic action table of the UTM is implemented as the CPU</strong>, the <strong>simulated action table of the Turing machine to be implemented is the instruction memory</strong>, the <strong>working area of the UTM on the tape is the data memory</strong>, and the <strong>simulated state register of the implemented Turing machine is the program counter (PC)</strong>.</p>
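<p>The four parts of a Turing machine described above (tape, tape head, state cell, action table) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the table encoding and the binary-increment action table are assumptions made for the example, not something taken from the course.</p>

```python
from collections import defaultdict


def run_turing_machine(action_table, tape_input, start_state, halt_state,
                       blank="_", max_steps=1000):
    """Run a single-tape Turing machine and return the final tape contents."""
    tape = defaultdict(lambda: blank)   # unbounded tape: unseen cells are blank
    for i, symbol in enumerate(tape_input):
        tape[i] = symbol
    head, state = 0, start_state        # tape head position and state cell
    for _ in range(max_steps):
        if state == halt_state:
            break
        # Look up the action for (current state, symbol under the head).
        write, move, state = action_table[(state, tape[head])]
        tape[head] = write
        head += 1 if move == "R" else -1
    cells = [tape[i] for i in range(min(tape), max(tape) + 1)]
    return "".join(cells).strip(blank)


# Hypothetical action table that adds 1 to a binary number.
# (state, symbol read) -> (symbol to write, head move, next state)
INCREMENT = {
    ("right", "0"): ("0", "R", "right"),   # walk to the right end of the number
    ("right", "1"): ("1", "R", "right"),
    ("right", "_"): ("_", "L", "carry"),
    ("carry", "1"): ("0", "L", "carry"),   # 1 + carry = 0, keep carrying
    ("carry", "0"): ("1", "L", "done"),    # absorb the carry and stop
    ("carry", "_"): ("1", "L", "done"),    # overflow into a new leading digit
}

print(run_turing_machine(INCREMENT, "1011", "right", "done"))  # 1100
```

<p>Swapping in a different action table changes what the machine computes while the loop stays the same, which is the sense in which the generic loop plays the role of the CPU and the table plays the role of the instruction memory.</p>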
<h3 id="elements-of-computers">Elements of Computers</h3>
<ul>
<li>Memory (array of bytes), contains
<ul>
<li>program, which is a set of instructions</li>
<li>program data, i.e. variables, constants etc.</li>
</ul>
</li>
<li>Program Counter (PC)
<ul>
<li>points to an instruction in the program</li>
<li>after execution of one instruction it points to the next one</li>
<li>branch instructions make PC jump to another instruction (not in sequence)</li>
</ul>
</li>
<li>CPU contains
<ul>
<li>program counter</li>
<li>instruction execution unit</li>
</ul>
</li>
</ul>
<h3 id="single-instruction-isa">Single Instruction ISA</h3>
<ul>
<li>sbn - subtract and branch if negative</li>
</ul>
<p>This basically leads to the following pseudocode</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sbn(a, b, line_no):
    a = a - b
    if (a &lt; 0):
        goto line_no
    else:
        goto next_statement
</code></pre></div></div>
<ul>
<li>Addition using SBN</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>initialize
temp = 0
1: sbn temp, b, 2 \\ sets temp = -b
2: sbn a, temp, exit \\ sets a = a + b
exit: exit
</code></pre></div></div>
<ul>
<li>Add 1-10 using SBN</li>
</ul>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>initialize
one = 1
index = 10
sum = 0
1: sbn temp, temp, 2 \\ sets temp = 0
2: sbn temp, index, 3 \\ sets temp = -index
3: sbn sum, temp, 4 \\ sets sum += index
4: sbn index, one, exit \\ sets index -= 1
5: sbn temp, temp, 6 \\ sets temp = 0
6: sbn temp, one, 1 \\ the for loop, since 0 - 1 &lt; 0
exit: exit
</code></pre></div></div>
<p>This is similar to writing <strong>assembly level programs</strong>, which are low level programs.</p>
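<p>To check SBN programs like the ones above, a small simulator helps. The sketch below is an assumption about how such a machine could be modelled in Python (the dictionary encoding of the program and the name <code>run_sbn</code> are made up for illustration); it runs the "add 1-10" program and prints the resulting sum.</p>

```python
def run_sbn(program, memory, start=1, max_steps=10_000):
    """Execute a program made only of sbn instructions.

    program maps a line label to (a, b, target); "exit" is the halt label.
    memory maps variable names to integers and is updated in place.
    """
    labels = list(program)            # insertion order defines "next statement"
    pc = start                        # program counter
    for _ in range(max_steps):
        if pc == "exit":
            return memory
        a, b, target = program[pc]
        memory[a] -= memory[b]        # a = a - b
        if memory[a] < 0:
            pc = target               # branch if negative
        else:
            nxt = labels.index(pc) + 1
            pc = labels[nxt] if nxt < len(labels) else "exit"
    raise RuntimeError("step limit reached")


# The "add 1-10" program from above, one tuple per numbered line.
SUM_PROGRAM = {
    1: ("temp", "temp", 2),       # temp = 0
    2: ("temp", "index", 3),      # temp = -index
    3: ("sum", "temp", 4),        # sum += index
    4: ("index", "one", "exit"),  # index -= 1, stop once it goes negative
    5: ("temp", "temp", 6),       # temp = 0
    6: ("temp", "one", 1),        # 0 - 1 < 0, so always jump back to line 1
}

result = run_sbn(SUM_PROGRAM, {"one": 1, "index": 10, "sum": 0, "temp": 0})
print(result["sum"])  # 55
```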
<h3 id="mutliple-instruction-isas">Multiple Instruction ISAs</h3>
<p>They typically have:</p>
<ul>
<li>Arithmetic Instructions: Add, Subtract, Multiply, Divide</li>
<li>Logical Instructions: And, Or, Not</li>
<li>Move Instructions: Transfer between memory locations</li>
<li>Branch Instructions: Jump to new memory locations based on program instructions</li>
</ul>
<h3 id="design-of-practical-machines">Design of Practical Machines</h3>
<ul>
<li>While a Harvard machine has separate data and instruction memories, a Von Neumann machine has a single memory to serve both purposes.</li>
<li>The problem with these machines is that they assume memory to be one large array of bytes. In practice this is slow, because access gets slower as the size of the structure increases. A possible solution lies in having a small array of named locations called <strong>registers</strong> that can be used by instructions. Because this array is much smaller, it is faster.</li>
</ul>
<p>So,</p>
<ul>
<li>CPU contains a set of registers which are named storage locations.</li>
<li>values are loaded from memory to registers.</li>
<li>arithmetic and logical instructions use registers for input</li>
<li>finally, data is stored back in the memory.</li>
</ul>
<p>Example program in machine language,</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>r1 = mem[b] \\ load b
r2 = mem[c] \\ load c
r3 = r1 + r2
mem[a] = r3
</code></pre></div></div>
<p>where</p>
<ul>
<li>r1, r2, r3 are registers</li>
<li>mem is the array of bytes representing memory</li>
</ul>
<p>As a result, modern day computers are similar to Von Neumann machines with the addition of registers in the CPU.</p>
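<p>The load, compute, store pattern from the example program can be mimicked with a tiny fetch-execute loop. The instruction encoding below (tuples such as <code>("load", "r1", "b")</code>) and the function name are hypothetical, chosen only to make the roles of the program counter, registers and memory concrete.</p>

```python
def run_program(program, mem):
    """Fetch-execute loop for a tiny load/store machine (hypothetical encoding)."""
    regs = {}   # registers: named storage locations inside the CPU
    pc = 0      # program counter: index of the next instruction
    while pc < len(program):
        op, *args = program[pc]        # fetch and decode
        if op == "load":               # register <- memory
            reg, addr = args
            regs[reg] = mem[addr]
        elif op == "add":              # register <- register + register
            dst, src1, src2 = args
            regs[dst] = regs[src1] + regs[src2]
        elif op == "store":            # memory <- register
            addr, reg = args
            mem[addr] = regs[reg]
        else:
            raise ValueError(f"unknown instruction: {op}")
        pc += 1                        # no branch instructions in this sketch
    return mem


# The example program: a = b + c, computed through registers r1..r3.
program = [
    ("load", "r1", "b"),
    ("load", "r2", "c"),
    ("add", "r3", "r1", "r2"),
    ("store", "a", "r3"),
]
memory = run_program(program, {"a": 0, "b": 2, "c": 5})
print(memory["a"])  # 7
```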
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://onlinecourses.nptel.ac.in/noc18_cs29/unit?unit=6&lesson=8" target="_blank">NPTEL: Introduction to Computer Architecture</a></small><br /></p>
Wed, 25 Jul 2018 00:00:00 +0000
https://machinelearningmedium.com/2018/07/25/introduction-to-computer-architecture/
computer-science, nptel, nptel-computer-architecture
Introduction to Survival Analysis
<h3 id="introduction">Introduction</h3>
<p>Survival analysis refers to the set of statistical analyses that are used to analyze the length of time until an event of interest occurs. These methods have traditionally been used to analyse the survival times of patients, hence the name. But they are also useful in many other applications, including but not limited to analysis of the time of recidivism, failure of equipment etc. Simply put, the phrase <strong>survival time</strong> is used to refer to the type of variable of interest. It is often also referred to by names such as <strong>failure time</strong> and <strong>waiting time</strong>.</p>
<p>Such studies generally work with <strong>data leading up to an event of interest</strong> along with several other characteristics of individual data points that may be used to explain the survival times statistically.</p>
<p>The statistical problem (survival analysis) is to construct and estimate an appropriate model of the time of event occurrence. A survival model fulfills the following expectations:</p>
<ul>
<li>yield predictions of number of individuals who will fail (undergo the event of interest) at any length of time since the beginning of observation (or other decided point in time).</li>
<li>estimate the effect of observable individual characteristics on the survival time (to check the relevance of one variable holding constant all others).</li>
</ul>
<p>It is often observed that survival models such as the proportional hazard model are capable of <strong>explaining the survival times in terms of observed characteristics</strong>, which is better than straightforward statistical inferences such as <strong>rates of event occurrence without considering characteristic features</strong> of data.</p>
<h3 id="basics">Basics</h3>
<p>Assume <strong>survival time T is a random variable</strong> following some distribution <strong>characterized by cumulative distribution function \(F(t, \theta)\)</strong>, where</p>
<ul>
<li>\(\theta\) is the set of <strong>parameters to be estimated</strong></li>
<li>\(F(t, \theta) = P(T \leq t) = \) <strong>probability that there is a failure</strong> at or before time \(t\), for any \(t \geq 0\)</li>
<li>\(F(t, \theta) \to 1\) as \(t \to \infty\), since \(F(t, \theta)\) is a <strong>cumulative distribution function</strong></li>
<li>The above tendency leads to an <strong>implicit assumption that all candidates would eventually fail</strong>. This assumption holds only in some settings (true for patient survival times, not true for time of repayment of loans) and hence needs to be relaxed where it does not hold.</li>
</ul>
<p><strong>Survival times are non-negative</strong> by definition and hence the distributions (like exponential, Weibull, gamma, lognormal etc.) characterising them are defined for values of time \(t\) from \(0\) to \(\infty\).</p>
<p>Let \(f(t, \theta)\) be the <strong>density function</strong> corresponding to the distribution function \(F(t, \theta)\), then the <strong>survival function</strong> is given by,</p>
<script type="math/tex; mode=display">S(t, \theta) = 1 - F(t, \theta) = P(T \gt t) \tag{1} \label{1}</script>
<p>which gives the <strong>probability of survival</strong> until time \(t\) (\(S(t, \theta) \to 0\) as \(t \to \infty\) because, \(F(t, \theta) \to 1\) as \(t \to \infty\)).</p>
<p>Another useful concept in survival analysis is called <strong>hazard rate</strong>, defined by,</p>
<script type="math/tex; mode=display">h(t, \theta) = \frac{f(t, \theta)} {1-F(t, \theta)} = \frac{f(t, \theta)} {S(t, \theta)} \tag{2} \label{2}</script>
<blockquote>
<p>Hazard rate represents the density of a failure at time \(t\), conditional on no failure prior to time \(t\), i.e., it roughly indicates the probability of failure in the next unit of time, given that no failure has occurred yet.</p>
</blockquote>
<p><strong>While \(f(t, \theta)\) roughly represents the proportion of original cohort that should be expected to fail between time \(t\) and \(t+1\), hazard rate \(h(t, \theta)\) represents the proportion of survivors until time \(t\) that should be expected to fail in the same time window, \(t\) to \(t+1\).</strong></p>
<p>The relationship between the cumulative distribution function and the hazard rate is given by,</p>
<script type="math/tex; mode=display">F(t, \theta) = 1 - exp \left[ - \int_0^t h(x, \theta) dx \right] \tag{3} \label{3}</script>
<p>and</p>
<script type="math/tex; mode=display">h(t, \theta) = - \frac {d\,ln\,[1 - F(t, \theta)]} {dt} \tag{4} \label{4}</script>
<p>The fact that \(F(t, \theta)\) is a cdf puts some restrictions on the hazard rate,</p>
<ul>
<li>the hazard rate is a non-negative function</li>
</ul>
<script type="math/tex; mode=display">H(t, \theta) = \int_0^t h(x, \theta) dx \tag{5} \label{5}</script>
<ul>
<li>the integrated hazard in \eqref{5} is finite for finite \(t\) and tends to \(\infty\) as \(t\) approaches \(\infty\).</li>
</ul>
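<p>The relationships \eqref{1} to \eqref{5} can be checked numerically. The sketch below assumes an illustrative hazard \(h(t) = 2t\) (chosen only because its integral \(t^2\) is easy to verify by hand), rebuilds \(F\), \(S\) and \(f\) from it, and confirms that \(f/S\) gives back the hazard.</p>

```python
import math


def hazard(t):
    """Illustrative increasing hazard h(t) = 2t (an assumed example)."""
    return 2.0 * t


def integrated_hazard(t, steps=10_000):
    """H(t) of eq. (5): trapezoidal approximation of the integral of h on [0, t]."""
    dx = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(i * dx) for i in range(1, steps))
    return total * dx


def cdf(t):
    """F(t) recovered from the hazard, eq. (3)."""
    return 1.0 - math.exp(-integrated_hazard(t))


def survival(t):
    """S(t) = 1 - F(t), eq. (1)."""
    return 1.0 - cdf(t)


def density(t, eps=1e-5):
    """f(t) as a central-difference derivative of F(t)."""
    return (cdf(t + eps) - cdf(t - eps)) / (2 * eps)


t = 1.5
print(cdf(t), 1.0 - math.exp(-t ** 2))       # the two values should agree
print(density(t) / survival(t), hazard(t))   # eq. (2): f / S gives back h
```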
<h3 id="state-dependence">State Dependence</h3>
<ul>
<li>Positive state dependence or an increasing hazard rate \(dh(t)/dt \gt 0 \) indicates that the <strong>probability of failure during the next time unit increases</strong> as the length of time at risk increases.</li>
<li>Negative state dependence or a decreasing hazard rate \(dh(t)/dt \lt 0 \) indicates that the <strong>probability of failure in the next time unit decreases</strong> as the length of time at risk increases.</li>
<li>No state dependence indicates a <strong>constant hazard rate</strong>.</li>
</ul>
<blockquote>
<p>Only exponential distribution displays no state dependence.</p>
</blockquote>
<h3 id="censoring-and-truncation">Censoring and Truncation</h3>
<p>A common feature of data on survival times is that they are censored or truncated. Censoring and truncation are statistical terms that refer to the <strong>inability to observe the variable of interest for the entire population</strong>.</p>
<ul>
<li>A standard example is an individual shooting at a round target with a rifle, where the variable of interest is the distance by which the bullet misses the center of the target.</li>
<li>If all shots hit the target, this distance can be measured for all the shots and there is no problem of censoring or truncation.</li>
<li>If some shots miss the target, but we know the number of shots fired, <strong>the sample is censored</strong>. In this case, for each shot, either the distance from the center is known or it is known to be at least as large as the radius of the target.</li>
<li>Similarly, if one does not know how many shots were fired but only has information about the distance for shots that hit the target, <strong>the sample is truncated</strong>.</li>
</ul>
<blockquote>
<p>Censored sample has more information than a truncated sample.</p>
</blockquote>
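<p>The difference between the two kinds of sample in the shooting example can be made concrete with a short sketch. The distances and the target radius below are made up, and the helper names <code>censor</code> and <code>truncate</code> are hypothetical.</p>

```python
def censor(distances, radius):
    """Censored sample: every shot is recorded, misses only as 'at least radius'."""
    return [(min(d, radius), d <= radius) for d in distances]


def truncate(distances, radius):
    """Truncated sample: only shots that hit the target are recorded at all."""
    return [d for d in distances if d <= radius]


shots = [0.5, 2.0, 7.5, 1.2, 9.0]   # made-up distances from the center
radius = 5.0                        # made-up target radius

censored = censor(shots, radius)    # (recorded distance, exactly observed?)
truncated = truncate(shots, radius)

print(censored)
print(len(censored), len(truncated))  # the censored sample still knows 5 shots were fired
```

<p>The censored sample keeps a placeholder for each miss, so the total number of shots is recoverable from it; the truncated sample carries no trace of the misses.</p>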
<p>Survival times are often censored because not all candidates would fail by the end of the period during which the data was collected. This <strong>censoring of data must be taken into account</strong> while making the estimations, because it is <strong>not legitimate to drop such observations</strong> with unobserved survival times <strong>or to set survival times for these observations equal to the length of the follow-up period</strong> (when the data was collected).</p>
<ul>
<li>Less frequently, information may turn up during a follow-up collection about a candidate who was not a part of the original population. In such cases the <strong>survival time is truncated</strong>, because there is no earlier information on the candidate or their survival time.</li>
</ul>
<h3 id="problem-of-estimation">Problem of Estimation</h3>
<p>The initial assumption specifies a cumulative distribution function \(F(t, \theta)\), or equivalently a density \(f(t, \theta)\) or hazard \(h(t, \theta)\), that is of a known form except that it depends on an unknown parameter \(\theta\). Estimating this parameter is the first step for the model to make any meaningful prediction about the survival time of a new candidate.</p>
<p>Consider a case of estimation of parameter for a censored sample which is defined as follows,</p>
<ul>
<li>sample has \(N\) individuals with follow-up periods \(T_1, T_2, \cdots, T_N\). These follow-ups may be all equal, but they usually are not.</li>
<li>\(n\) is number of individuals who fail, numbered \(1, 2, \cdots, n\) and individuals numbered \(n+1, n+2, \cdots, N\) are the non-failures.</li>
<li>for the candidates who fail, there exists a survival time \(t_i \leq T_i, \, i \in [1, n]\)</li>
<li>for the non-failures, survival time \(t_i\) is not observed but it is known that it is greater than the length of the follow-up period \(T_i\), \(i \in [n+1, N]\).</li>
</ul>
<p>If it is assumed that <strong>all the outcomes are independent</strong> of each other, the likelihood function of the sample is,</p>
<script type="math/tex; mode=display">L = \prod_{i=1}^n f(t_i, \theta) \prod_{i=n+1}^N S(T_i, \theta) \tag{6} \label{6}</script>
<blockquote>
<p>The likelihood function is a general statistical tool that expresses the probability of the observed outcomes in terms of the unknown parameters that are to be estimated, i.e., it is a function of the parameters to be estimated, which serves as a measure of how likely it is that the statistical model, with a given parameter value, would generate the given data.</p>
</blockquote>
<p>A commonly used estimator of \(\theta\) is the <strong>Maximum Likelihood Estimator (MLE)</strong>, which is defined as the value of \(\theta\) that maximizes the likelihood function.</p>
<p>The MLE has been shown to display the following desirable properties over a large sample (<strong>as the sample size approaches infinity</strong>),</p>
<ul>
<li>Unbiased</li>
<li>Efficient</li>
<li>Normally Distributed</li>
</ul>
<p>As mentioned, the properties of MLE are only <strong>relevant when the sample size is large</strong>. It is often <strong>observed that the sample sizes in these studies are much smaller</strong> and hence reliance on large sample properties of estimator is more tenuous.</p>
<p>The likelihood function \eqref{6} uses the observed survival times \(t_i\). An alternative is to analyze only the fact of failure or non-failure, <strong>ignoring the specific timing of the observed failures</strong>. Such an analysis would properly be based on the likelihood function</p>
<script type="math/tex; mode=display">L = \prod_{i=1}^n F(T_i, \theta) \prod_{i=n+1}^N S(T_i, \theta) \tag{7} \label{7}</script>
<p>Estimation using \eqref{7} is a legitimate procedure and does not cause any bias or inconsistency, but the estimates are inefficient relative to MLEs from \eqref{6}.</p>
<p><strong>The estimates of \(\theta\) obtained by maximizing \eqref{7} will be less efficient (have larger variance) than the estimates of \(\theta\) obtained by maximizing \eqref{6}, at least for large sample sizes. Hence if information on the time of failure is available, it should be used.</strong></p>
<p>If the <strong>truncated</strong> case is considered, then there is <strong>no information on any of the individuals who do not fail</strong>. Formally, one starts with a cohort of \(N\) candidates, where <strong>\(N\) is unknown</strong>, and the only <strong>observations available are the survival times \(t_i\) for the \(n\) individuals who fail</strong> before the end of the follow-up period. The \(n\) individuals appear in the sample because \(t_i \leq T_i\), and the appropriate density is therefore</p>
<script type="math/tex; mode=display">f(t_i, \theta \mid t_i \leq T_i) = \frac{f(t_i, \theta)}{P(t_i \leq T_i)} = \frac{f(t_i, \theta)}{F(T_i, \theta)} \tag{8} \label{8}</script>
<p>And the corresponding <strong>likelihood function</strong>, which can be maximized to obtain the MLEs, is given by,</p>
<script type="math/tex; mode=display">L = \prod_{i=1}^n f(t_i, \theta \mid t_i \leq T_i) = \prod_{i=1}^n \frac{f(t_i, \theta)}{F(T_i, \theta)} \tag{9} \label{9}</script>
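<p>To see how \eqref{6}, \eqref{7} and \eqref{9} differ in practice, the sketch below writes out their logarithms for one concrete choice of distribution. The exponential form \(f(t, \theta) = \theta e^{-\theta t}\) is an assumption made only to have something computable, and the data values are made up.</p>

```python
import math


def density(t, theta):
    """Assumed exponential density f(t, theta)."""
    return theta * math.exp(-theta * t)


def cdf(t, theta):
    """F(t, theta) for the same distribution."""
    return 1.0 - math.exp(-theta * t)


def survival(t, theta):
    """S(t, theta) = 1 - F(t, theta)."""
    return math.exp(-theta * t)


def loglik_censored(theta, failure_times, censor_times):
    """log of eq. (6): densities for failures, survival for non-failures."""
    return (sum(math.log(density(t, theta)) for t in failure_times)
            + sum(math.log(survival(T, theta)) for T in censor_times))


def loglik_fail_or_not(theta, followups_failed, censor_times):
    """log of eq. (7): uses only whether failure happened within follow-up."""
    return (sum(math.log(cdf(T, theta)) for T in followups_failed)
            + sum(math.log(survival(T, theta)) for T in censor_times))


def loglik_truncated(theta, failure_times, followups_failed):
    """log of eq. (9): failures only, each density rescaled by F(T_i)."""
    return sum(math.log(density(t, theta) / cdf(T, theta))
               for t, T in zip(failure_times, followups_failed))


# Made-up sample: three failures (with their follow-up lengths) and two non-failures.
failure_times = [1.0, 2.5, 0.7]
followups_failed = [4.0, 4.0, 3.0]
censor_times = [4.0, 3.0]

theta = 0.5
print(loglik_censored(theta, failure_times, censor_times))
print(loglik_fail_or_not(theta, followups_failed, censor_times))
print(loglik_truncated(theta, failure_times, followups_failed))
```

<p>Any of the three functions can be handed to a numerical optimizer over \(\theta\); which one is appropriate depends on what was actually observed.</p>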
<h3 id="explanatory-variables">Explanatory Variables</h3>
<p>Information on <strong>explanatory variables may or may not be used</strong> in estimating survival time models. Some models that are based on the <strong>implicit assumption that distribution of survival time is the same for all individuals</strong>, do not use explanatory variables.</p>
<p><strong>But practically it is observed that some individuals are more prone to failing than others, and hence if information on individual characteristics and environmental variables is available, it should be used.</strong></p>
<p>This information can be incorporated in survival models by letting the parameter \(\theta\) depend on these individual characteristics and a new set of parameters. E.g. the exponential model depends on a single parameter, say \(\theta\), and <strong>\(\theta\) can be assumed to depend on the individual characteristics</strong> as in linear regression.</p>
<h3 id="non-parametric-hazard-rate-and-kaplan-meier">Non-Parametric Hazard Rate and Kaplan Meier</h3>
<p>Before beginning any formal analysis of the data, it is often instructive to check the hazard rate. For this purpose, the <strong>times until failure are rounded to the nearest quantized time unit</strong> (month, week, day etc.). Following this, it is easy to count the <strong>number of candidates at risk at the beginning of each time period</strong> (i.e. the number of individuals who have not yet failed or been censored at the beginning of the time unit) and the <strong>number of individuals who fail during the time period</strong>.</p>
<p>Then, the <strong>non-parametric hazard rate</strong> can be estimated as the ratio of the number of failures during the time period to the number of individuals at risk at the beginning of the time period, i.e., if the number of individuals at risk at the beginning of time \(t \, (t = 1, 2, \cdots)\) is denoted by \(r_t\), and the number of individuals who fail during this time \(t\) is denoted by \(n_t\), then the estimated hazard for time \(t\), \(\hat{h}(t)\), is given by,</p>
<script type="math/tex; mode=display">\hat{h}(t) = \frac{n_t}{r_t} \tag{10} \label{10}</script>
<p>Such estimated hazard rates are prone to high variability. This high variability makes purely non-parametric estimates unattractive, as they are less likely to give an accurate prediction on a new dataset. Parametric models such as the exponential, Weibull or lognormal smooth out this variability and make the model more tractable.</p>
<p><strong>But plots of non-parametric estimates of the hazard rate provide a good initial guide as to which probability distribution may work well for a given use case.</strong></p>
<p>As noted earlier, the hazard function, density function, and distribution function are alternative but equivalent ways of characterizing the distribution of the time until failure. Hence, once the hazard rate is estimated, then implicitly so are the density and the distribution function. It is possible to solve explicitly for the estimated density or distribution function in terms of the estimated hazard function. The resulting estimator of the distribution function (called the <strong>Kaplan Meier</strong> or <strong>product limit</strong> estimator in statistical literature, which is nothing but the non-parametric estimate) is given by,</p>
<script type="math/tex; mode=display">\hat{F}(t) = 1 - \prod_{j=1}^t [1 - \hat{h}(j)] \tag{11} \label{11}</script>
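<p>Both \eqref{10} and \eqref{11} are simple enough to compute by hand on a small table of counts. The following sketch does so for a made-up cohort of 100 individuals with no censoring; the function names and the numbers are illustrative assumptions.</p>

```python
def hazard_estimates(failures, at_risk):
    """Eq. (10): failures during each period over the number at risk at its start."""
    return [n / r for n, r in zip(failures, at_risk)]


def kaplan_meier_cdf(hazards):
    """Eq. (11): F_hat(t) = 1 - prod_{j <= t} (1 - h_hat(j)) for t = 1, 2, ..."""
    cdf, surviving = [], 1.0
    for h in hazards:
        surviving *= 1.0 - h          # running product of (1 - h_hat(j))
        cdf.append(1.0 - surviving)
    return cdf


# Made-up cohort of 100 with no censoring.
failures = [10, 18, 9]      # failures during periods 1, 2, 3
at_risk = [100, 90, 72]     # each entry = previous at-risk minus previous failures

h_hat = hazard_estimates(failures, at_risk)
F_hat = kaplan_meier_cdf(h_hat)
print(h_hat)   # hazards 10/100, 18/90, 9/72
print(F_hat)   # cumulative failure fractions 10/100, 28/100, 37/100 (up to rounding)
```

<p>With no censoring the estimate reduces to the cumulative fraction of the cohort that has failed, which is a convenient sanity check; censoring changes only the at-risk counts.</p>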
<h3 id="models-without-explanatory-variables">Models without Explanatory Variables</h3>
<p>There are various models that do not consider the explanatory variables, and instead <strong>assume some specific distribution</strong> such as exponential, Weibull, or lognormal for the length of time until failure. Essentially, the distribution of time until failure is known, except for some <strong>unknown parameters that have to be estimated</strong>. Hence, models of this type are called parametric models, which differ from the estimates discussed before, as the latter have no associated parameters or distribution.</p>
<p>The unknown parameters are <strong>estimated by maximizing the likelihood function</strong> of the form \eqref{6}.</p>
<blockquote>
<p>Except in the case of the exponential distribution, the MLEs generally cannot be written in closed form (i.e. expressed algebraically), and so the maximization of the likelihood function is done numerically.</p>
</blockquote>
<p>Once the characteristic parameters have been estimated, one can determine the following (which cannot be determined in case of non-parametric estimates like Kaplan Meier):</p>
<ul>
<li><strong>mean time</strong> until failures</li>
<li><strong>proportion of population that should be expected to fail</strong> within any arbitrary period of time.</li>
</ul>
<p>While the <strong>advantage of such models lies in the smoothness of predictions</strong>, the <strong>disadvantage is that the assumed distribution can be wrong and in turn lead to statements that are systematically misleading</strong>.</p>
<p><strong>Exponential Distribution</strong></p>
<p>The exponential distribution has density,</p>
<script type="math/tex; mode=display">f(t) = \theta \, e^{-\theta t} \tag{12} \label{12}</script>
<p>and <strong>survivor function</strong>,</p>
<script type="math/tex; mode=display">S(t) = e^{-\theta t} \tag{13} \label{13}</script>
<p>where</p>
<ul>
<li>the parameter is constrained, \(\theta \gt 0\)</li>
<li>mean: \(1 / \theta\) and variance: \(1 / \theta^2\)</li>
<li>only distribution with a <strong>constant hazard rate</strong>, specifically \(h(t) = \theta\) for all \(t \geq 0\)</li>
<li>such hazard rates are generally seen in some physical processes such as radioactive decay.</li>
<li>it is often not the most reasonable distribution for survival models.</li>
<li>exponential distribution requires estimation of single parameter \(\theta\).</li>
</ul>
<p>Consider a sample of \(N\) individuals, of which \(n\) have failed before the end of the follow-up period. Let the observed failure times be denoted by \(t_i\, (i=1, 2, \cdots, n)\) and the censoring times (length of follow-up) for the non-failures be denoted by \(T_i\, (i = n+1, \cdots, N)\). Then the likelihood function \eqref{6} can be written as</p>
<script type="math/tex; mode=display">L = \prod_{i=1}^n \theta\,e^{-\theta t_i} \prod_{i=n+1}^N e^{-\theta T_i} \tag{14} \label{14}</script>
<p>Maximizing \eqref{14} w.r.t. \(\theta\) yields MLE in closed form:</p>
<script type="math/tex; mode=display">\hat{\theta} = \frac {n} {\sum_{i=1}^n t_i + \sum_{i=n+1}^N T_i} \tag{15} \label{15}</script>
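<p>One way to see where \eqref{15} comes from is to take the logarithm of \eqref{14} and set its derivative with respect to \(\theta\) to zero:</p>
<script type="math/tex; mode=display">\ln L = n \ln \theta - \theta \left( \sum_{i=1}^n t_i + \sum_{i=n+1}^N T_i \right), \qquad \frac{d \ln L}{d \theta} = \frac{n}{\theta} - \left( \sum_{i=1}^n t_i + \sum_{i=n+1}^N T_i \right) = 0</script>
<p>Solving for \(\theta\) gives \eqref{15}. The second derivative, \(-n/\theta^2\), is negative, so this stationary point is indeed a maximum.</p>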
<p>For large samples \(\hat{\theta}\) is normal with mean \(\theta\) and variance</p>
<script type="math/tex; mode=display">\frac{\theta^2}{\sum_{i=1}^N [1 - exp(-\theta T_i)]} \tag{16} \label{16}</script>
<p>which for large \(N\) is adequately approximated by \(\theta^2/n\).</p>
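<p>A small numerical sketch of \eqref{15}, using made-up failure and censoring times, is given below. It also shows how far the estimate moves under the two shortcuts warned against earlier (dropping the censored observations, or treating their follow-up lengths as failure times). The function names are illustrative.</p>

```python
import math


def exponential_mle(failure_times, censor_times):
    """Closed-form MLE of eq. (15): failures over total observed time at risk."""
    return len(failure_times) / (sum(failure_times) + sum(censor_times))


def log_likelihood(theta, failure_times, censor_times):
    """Logarithm of eq. (14)."""
    n = len(failure_times)
    return n * math.log(theta) - theta * (sum(failure_times) + sum(censor_times))


# Made-up sample: 4 observed failures, 3 individuals censored at follow-up.
failure_times = [2.0, 5.0, 1.0, 8.0]
censor_times = [12.0, 12.0, 10.0]
n = len(failure_times)

theta_hat = exponential_mle(failure_times, censor_times)
print(theta_hat)                        # 4 failures / 50 time units at risk
print(math.sqrt(theta_hat ** 2 / n))    # rough standard error from theta^2 / n

# Two shortcuts that are not legitimate, shown only for comparison.
dropped = exponential_mle(failure_times, [])                   # censored cases dropped
as_failures = exponential_mle(failure_times + censor_times, [])  # follow-up treated as failure
print(dropped, as_failures)
```

<p>Both shortcuts overstate the failure rate in this toy sample, since they either discard or misread the long failure-free follow-up of the censored individuals.</p>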
<ul>
<li>Exponential distribution is highly skewed.</li>
<li>Mean may not be a good measure of central tendency for exponential distribution.</li>
<li>Median may be a preferable indicator in most cases.</li>
</ul>
<blockquote>
<p>The logarithm of the likelihood, or log-likelihood, is used as a measure of the goodness of fit. A higher value (more positive or less negative) indicates that the model fits the data better.</p>
</blockquote>
<p><strong>Weibull Distribution</strong></p>
<p>In statistical literature, a very common alternative to the exponential distribution is the Weibull distribution. It is a generalization of the exponential distribution, so fitting a Weibull model allows one to test whether the simpler exponential model is adequate.</p>
<ul>
<li>A variable \(T\) has Weibull distribution if \(T^{\tau}\) has an exponential distribution for some value of \(\tau\).</li>
<li>The hazard rate is increasing if \(\tau \gt 1\) and decreasing if \(\tau \lt 1\). If \(\tau = 1\), the hazard rate is constant and the Weibull distribution reduces to the exponential.</li>
<li><strong>Weibull distribution has a monotonic hazard rate</strong>, i.e. it can be increasing, constant or decreasing but it cannot be increasing at first and then decreasing after some point.</li>
</ul>
<p>The density of Weibull distribution is given by,</p>
<script type="math/tex; mode=display">f(t) = \tau \theta^{\tau} \, t^{\tau -1} e^{-(\theta t)^\tau} \tag{17} \label{17}</script>
<p>and the survivor function is,</p>
<script type="math/tex; mode=display">S(t) = e^{-(\theta t)^\tau} \tag{18} \label{18}</script>
<p>The likelihood function for the Weibull distribution can be derived by substituting \eqref{17} and \eqref{18} in \eqref{6}.</p>
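<p>The substitution can be sketched in code. The snippet assumes that \eqref{6} has the same structure as \eqref{14}, i.e. the density for observed failures and the survivor function for censored cases; the function names and data are hypothetical. It also evaluates the hazard \(h(t) = f(t)/S(t) = \tau \theta^{\tau} t^{\tau - 1}\) to illustrate the monotonic behaviour described above.</p>

```python
import math

def weibull_log_likelihood(theta, tau, failure_times, censoring_times):
    """Log-likelihood of right-censored data under the Weibull model."""
    ll = 0.0
    for t in failure_times:
        # Log of density (17): ln(tau) + tau*ln(theta) + (tau-1)*ln(t) - (theta*t)^tau
        ll += (math.log(tau) + tau * math.log(theta)
               + (tau - 1) * math.log(t) - (theta * t) ** tau)
    for T in censoring_times:
        # Log of survivor function (18): -(theta*T)^tau
        ll -= (theta * T) ** tau
    return ll

def weibull_hazard(theta, tau, t):
    """Hazard h(t) = f(t) / S(t) = tau * theta^tau * t^(tau - 1)."""
    return tau * theta ** tau * t ** (tau - 1)

failures = [2.0, 3.0, 5.0]   # hypothetical failure times
censored = [10.0, 10.0]      # hypothetical censoring times
# With tau = 1 the value equals the exponential log-likelihood.
ll_exponential = weibull_log_likelihood(0.1, 1.0, failures, censored)
print(ll_exponential)
print([weibull_hazard(0.1, 2.0, t) for t in (1.0, 2.0, 4.0)])   # tau above 1
print([weibull_hazard(0.1, 0.5, t) for t in (1.0, 2.0, 4.0)])   # tau below 1
```

<p>Comparing the maximized log-likelihood with \(\tau\) free against the value with \(\tau\) fixed at 1 is one way to check whether the exponential special case is adequate.</p>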
<p><strong>Lognormal Distribution</strong></p>
<p>If \(z\) is distributed as \(N(\mu, \sigma^2)\), then \(y = e^z\) has a lognormal distribution with mean</p>
<script type="math/tex; mode=display">\phi = exp(\mu + {1 \over 2} \sigma^2) \tag{19} \label{19}</script>
<p>and variance,</p>
<script type="math/tex; mode=display">\tau^2 = exp(2 \mu + \sigma^2) [exp(\sigma^2) -1] = \phi^2 \psi^2 \tag{20} \label{20}</script>
<p>where</p>
<script type="math/tex; mode=display">\psi^2 = exp(\sigma^2) - 1 \tag{21} \label{21}</script>
<p>The <strong>density</strong> of \(z = ln \, y\) is the density of \(N(\mu, \sigma^2)\) given by,</p>
<script type="math/tex; mode=display">f(\ln y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp \left[ -\frac{1}{2 \sigma^2} (\ln y - \mu)^2 \right] \tag{22} \label{22}</script>
<p>Generally there is <strong>no advantage to working with the density of \(y\) itself, rather than \(ln \, y\)</strong>. Thus, one can simply assume that the log of survival time is normally distributed, and hence the logarithm of the likelihood function \eqref{6} becomes</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\ln L = &- {n \over 2} \ln(2\pi) - {n \over 2} \ln(\sigma^2) - {1 \over 2\sigma^2} \sum_{i=1}^n (\ln t_i - \mu)^2 \\
&+ \sum_{i=n+1}^N \ln F \left[ \frac {\mu - \ln T_i} {\sigma} \right]
\end{align}
\tag{23} \label{23} %]]></script>
<ul>
<li>where <strong>\(F\) is the cumulative distribution function</strong> for \(N(0, 1)\) distribution.</li>
<li><strong>No analytical solution</strong> exists for the maximization of \eqref{23} w.r.t. \(\mu\), and \(\sigma^2\), so it <strong>must be maximized numerically</strong>.</li>
<li>the hazard function for lognormal distribution is complicated; it <strong>increases first and then decreases</strong>.</li>
</ul>
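<p>Since \eqref{23} has to be maximized numerically, the sketch below evaluates the expression and runs a deliberately crude grid search over \(\mu\) and \(\sigma\). The grid bounds, step size, function names and data are all assumptions made for illustration; a real analysis would use a proper optimizer.</p>

```python
import math

def std_normal_cdf(z):
    """CDF of N(0, 1), written with the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def safe_log(p):
    """log(p), returning -inf instead of raising when p underflows to 0."""
    return math.log(p) if p > 0.0 else float("-inf")

def lognormal_log_likelihood(mu, sigma, failure_times, censoring_times):
    """Expression (23): normal log-density of ln t_i plus the censored terms."""
    n = len(failure_times)
    ll = -0.5 * n * math.log(2 * math.pi) - 0.5 * n * math.log(sigma ** 2)
    ll -= sum((math.log(t) - mu) ** 2 for t in failure_times) / (2 * sigma ** 2)
    # Each censored case contributes ln F[(mu - ln T_i) / sigma].
    ll += sum(safe_log(std_normal_cdf((mu - math.log(T)) / sigma))
              for T in censoring_times)
    return ll

def grid_search(failure_times, censoring_times, step=0.05, points=60):
    """Crude numerical maximisation over a (mu, sigma) grid."""
    best = (float("-inf"), None, None)
    for i in range(1, points + 1):
        for j in range(1, points + 1):
            mu, sigma = step * i, step * j
            ll = lognormal_log_likelihood(mu, sigma, failure_times, censoring_times)
            if ll > best[0]:
                best = (ll, mu, sigma)
    return best

failures = [2.0, 3.0, 5.0]   # hypothetical failure times
censored = [10.0, 10.0]      # hypothetical censoring times
best_ll, mu_hat, sigma_hat = grid_search(failures, censored)
print(mu_hat, sigma_hat, best_ll)
```

<p>Without censored cases the maximizer reduces to the sample mean and standard deviation of \(ln \, t_i\); the censored terms are what remove the closed form.</p>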
<p><strong>Other distributions</strong></p>
<p>Although exponential, Weibull and lognormal are the three most commonly used distributions, various other well-known probability distributions can also be used, such as</p>
<ul>
<li>log-logistic</li>
<li>Laguerre</li>
<li>distributions based on Box-Cox power transformation of the normal</li>
</ul>
<p>There are various ways of measuring how well models fit the data:</p>
<ul>
<li>value of likelihood (or log-likelihood) function</li>
<li>maximum difference between the fitted value and actual cumulative distribution function</li>
<li>standard Kolmogorov-Smirnov test of goodness of fit</li>
<li>chi-square goodness-of-fit statistic based on predicted and actual failure times.</li>
</ul>
<p>Over time it has been observed that even though some of these parametric distributions <strong>might fit the data</strong> better than others and excel on various goodness-of-fit metrics, they <strong>do not give any explanation of the reasons governing the distribution</strong> or any <strong>insight into the factors</strong> that lead to the different survival times in a population. Hence, parametric models without explanatory variables are not considered an effective tool for analysis.</p>
<h3 id="models-with-explanatory-variables">Models with Explanatory Variables</h3>
<ul>
<li>
<p>Explanatory variables are generally added to survival models in an attempt to make more accurate predictions: practical experiments over time corroborate the fact that individual characteristics, previous experiences and environmental setup help predict whether or not a person will fail.</p>
</li>
<li>
<p>An analysis of survival time without using the explanatory variables amounts to an analysis of its <strong>marginal distribution</strong>, whereas an analysis using explanatory variables amounts to an analysis of the <strong>distribution of survival time conditional on these variables</strong>.</p>
</li>
</ul>
<blockquote>
<p>On average, the variance of the conditional distribution is no greater than the variance of the marginal distribution, i.e. one can expect more precise predictions from the former.</p>
</blockquote>
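<p>This ordering of the variances follows from the law of total variance, written here with \(T\) for the survival time and \(x\) for the explanatory variables. It holds in expectation over \(x\), not necessarily for every single value of \(x\):</p>
<script type="math/tex; mode=display">\operatorname{Var}(T) = E\left[\operatorname{Var}(T \mid x)\right] + \operatorname{Var}\left(E[T \mid x]\right) \geq E\left[\operatorname{Var}(T \mid x)\right]</script>
<p>The second term is non-negative, and it is strictly positive whenever the conditional mean of the survival time actually depends on \(x\).</p>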
<ul>
<li>
<p>A more fundamental reason is an interest in understanding the effect of the explanatory variables on survival time.</p>
</li>
<li>
<p>More generally, these variables might be the demographics or environmental characteristics.</p>
</li>
</ul>
<h3 id="proportional-hazards-model">Proportional Hazards Model</h3>
<ul>
<li>
<p>The proportional hazards model allows one to estimate the effects of individual characteristics on survival time without having to assume a specific parametric form for the distribution of time until failure.</p>
</li>
<li>
<p>For an individual with the vector of characteristics, \(x\), the proportional hazards model assumes a hazard rate of the form,</p>
</li>
</ul>
<script type="math/tex; mode=display">h(t \mid x) = h_0(t) e^{x^\prime \beta} \tag{24} \label{24}</script>
<p>where \(h_0(t)\) is a completely arbitrary and unspecified baseline hazard function. <strong>Thus, the model assumes that the hazard functions of all individuals differ only by a factor of proportionality,</strong> i.e. if an individual’s hazard rate is 10 times higher than another’s at a given point of time, then it must be 10 times higher at all points in time. <strong>Each hazard function follows the same pattern over time.</strong></p>
<p>However, there is no restriction on what this pattern can be, i.e. it puts no restriction on the \(h_0(t)\) curve, which determines the shape of \(h(t \vert x)\) curve. <strong>\(\beta\) can be estimated without specifying \(h_0(t)\), and \(h_0(t)\) can be estimated non-parametrically and thus with flexibility.</strong></p>
<p>Consider a sample of \(N\) individuals, \(n\) of whom fail before the end of their follow-up period. Let the observations be ordered such that individual 1 has the shortest failure time, individual 2 has the second shortest failure time, and so forth. Thus, for individual \(i\), failure time \(t_i\) is observed, with,</p>
<script type="math/tex; mode=display">t_1 \lt t_2 \lt \cdots \lt t_n \tag{25} \label{25}</script>
<p>A vector \(x_i\) represents individual characteristics for each individual \(i = 1, 2, \cdots, N\), irrespective of whether they failed.</p>
<p>For each observed failure time \(t_i\), \(R(t_i)\) is defined as the set of all individuals who were at risk just prior to time \(t_i\), i.e., it includes the individuals with failure times greater than or equal to \(t_i\), as well as the non-failures whose follow-up is at least of length \(t_i\).</p>
<p>Using these definitions, the <strong>partial-likelihood</strong> function proposed by Cox is built as follows. For any failure time \(t_i\), the probability that it is individual \(i\) who fails, given that exactly one individual from the set \(R(t_i)\) fails, is given by,</p>
<script type="math/tex; mode=display">\frac {h(t_i \vert x_i)} {\sum_{j \in R(t_i)} h(t_i \vert x_j)} = \frac {exp(x_i^\prime \beta)} {\sum_{j \in R(t_i)} exp(x_j^\prime \beta)} \tag{26} \label{26}</script>
<p>The partial-likelihood function is formed by multiplying \eqref{26} over all \(n\) failure times,</p>
<script type="math/tex; mode=display">L = \prod_{i=1}^n \frac {exp(x_i^\prime \beta)} {\sum_{j \in R(t_i)} exp(x_j^\prime \beta)} \tag{27} \label{27}</script>
<p>The estimate of \(\beta\) obtained by maximizing \eqref{27} numerically w.r.t. \(\beta\) is the <strong>partial maximum-likelihood estimate</strong>. The word <strong>partial</strong> in partial likelihood refers to the fact that not all available information is used in estimating \(\beta\), i.e., the estimate depends only on knowing which individuals were at risk when each observed failure occurred. The exact numerical values of the failure times \(t_i\) or of the censoring times for the non-recidivists are not needed; only their <strong>order matters</strong>.</p>
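<p>A minimal sketch of the logarithm of \eqref{27} is given below, assuming no tied failure times. The function name, argument layout and toy data are assumptions made for illustration. The last two lines check the claim that only the ordering of the times matters: squaring every time leaves the risk sets, and hence the value, unchanged.</p>

```python
import math

def log_partial_likelihood(beta, times, failed, covariates):
    """Log of the Cox partial likelihood (27), assuming no tied failure times.

    times[i]      -- failure time t_i, or censoring time T_i for non-failures
    failed[i]     -- True if individual i was observed to fail
    covariates[i] -- list of explanatory variables x_i
    """
    # Linear predictor x_i' beta for every individual.
    scores = [sum(b * x for b, x in zip(beta, xi)) for xi in covariates]
    total = 0.0
    for i, t_i in enumerate(times):
        if not failed[i]:
            continue  # censored individuals only appear inside risk sets
        # Risk set R(t_i): everyone whose failure or censoring time is >= t_i.
        risk_set = [j for j, t_j in enumerate(times) if t_j >= t_i]
        total += scores[i] - math.log(sum(math.exp(scores[j]) for j in risk_set))
    return total

# Hypothetical data: three failures and one individual censored at time 10.
times = [2.0, 3.0, 5.0, 10.0]
failed = [True, True, True, False]
covariates = [[1.0], [0.0], [1.0], [0.0]]

value = log_partial_likelihood([0.7], times, failed, covariates)
same_order = log_partial_likelihood([0.7], [t ** 2 for t in times], failed, covariates)
print(value, same_order)
```

<p>Maximizing this function over \(\beta\) with any numerical optimizer gives the partial maximum-likelihood estimate; the baseline hazard \(h_0(t)\) never appears because it cancels in each ratio.</p>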
<p>Once \(\beta\) is estimated, \(h_0(t)\), the baseline hazard function can be estimated non-parametrically. The estimated baseline hazard function is constant over the intervals between failure times. One can also calculate <strong>survivor function</strong> \(S_0(t)\) or equivalently the baseline cumulative distribution function \(F_0(t)\), that corresponds to the estimated baseline hazard function.</p>
<p><strong>The estimated survivor function is a step function that falls at each time at which there is a failure.</strong></p>
<p>The point of the proportional hazards model is that the survivor function is estimated non-parametrically (i.e. without imposing any structure on its pattern over time, except that it must decrease as \(t\) increases) and estimation of \(\beta\) can proceed separately from estimation of the survivor function.</p>
<h3 id="split-population-models">Split Population Models</h3>
<p>The models considered so far assume some cumulative distribution function, \(F(t)\), for the survival time, that gives the probability of a failure up to and including time \(t\), and it approaches one as \(t\) approaches infinity. This basically means that every individual must eventually fail, if observed for a long enough time. This assumption is not true in all cases.</p>
<p><strong>Split Population Models</strong> (or split models) do not imply that every individual would eventually fail. Rather the population is divided into two groups, one of which would never fail.</p>
<p>Mathematically, let \(Y\) be an indicator, not directly observable for individuals who have not yet failed, that equals one for those who would ultimately fail and zero for those who would never fail. Then,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
P(Y=1) &= \delta \\
P(Y=0) &= 1 - \delta
\end{align}
\tag{28} \label{28} %]]></script>
<p>where \(\delta\) is the proportion of the population that would eventually fail, and \(1 - \delta\) is the proportion that would never fail.</p>
<p>Let \(g(t \vert Y=1)\) be density of survival times for the ultimate failures, and \(G(t \vert Y=1)\) be the corresponding cumulative distribution function. If one considers exponential model to represent them, then</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
g(t \vert Y=1) &= \theta e^{-\theta t} \\
G(t \vert Y=1) &= 1 - e^{-\theta t}
\end{align}
\tag{29} \label{29} %]]></script>
<p>It can also be noted that \(g (t \vert Y = 0)\) and \(G(t \vert Y=0)\) are not defined.</p>
<p>Let \(T\) be the length of the follow-up period and let \(R\) be an observable indicator equal to one if there is a failure by time \(T\) and zero if there is not. The probability that an individual does not fail during the follow-up period, i.e., the event \(R = 0\), is given by,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
P(R=0) &= P(Y=0) + P(Y=1)P(t \gt T \vert Y=1) \\
&= 1 - \delta + \delta e^{-\theta T}
\end{align}
\tag{30} \label{30} %]]></script>
<p>Similarly, probability density for people who fail with survival time \(t\) is given by,</p>
<script type="math/tex; mode=display">P(Y=1)P(t \lt T \vert Y=1) g(t \vert t \lt T, Y=1) = P(Y=1) g(t \vert Y=1) = \delta \theta e^{-\theta t} \tag{31} \label{31}</script>
<p>So the likelihood function is made up of \eqref{30} for those who do not fail and \eqref{31} for those who do. It is given by,</p>
<script type="math/tex; mode=display">L = \prod_{i=1}^n \delta \theta exp(-\theta t_i) \prod_{i = n+1}^N (1 - \delta + \delta exp(-\theta T_i)) \tag{32} \label{32}</script>
<p>The maximum likelihood estimates of both \(\theta\) and \(\delta\) can be obtained by maximizing \eqref{32} numerically. It can be noted that when \(\delta = 1\), \eqref{32} reduces to \eqref{14}, the original exponential survival time model.</p>
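<p>The following sketch evaluates the logarithm of \eqref{32} and maximizes it with a coarse grid search. The grid, function names and data are illustrative assumptions; any numerical optimizer could be substituted. The final lines compare the result with the plain exponential fit obtained by fixing \(\delta = 1\).</p>

```python
import math

def split_exponential_log_likelihood(delta, theta, failure_times, censoring_times):
    """Log of the split-population likelihood (32)."""
    ll = 0.0
    for t in failure_times:
        # Observed failure: delta * theta * exp(-theta * t)
        ll += math.log(delta) + math.log(theta) - theta * t
    for T in censoring_times:
        # No failure by T: either never fails, or fails later than T.
        ll += math.log(1.0 - delta + delta * math.exp(-theta * T))
    return ll

def grid_search(failure_times, censoring_times):
    """Crude numerical maximisation over delta in (0, 1] and theta in (0, 1]."""
    best = (float("-inf"), None, None)
    for i in range(1, 21):           # delta = 0.05, 0.10, ..., 1.00
        for j in range(1, 101):      # theta = 0.01, 0.02, ..., 1.00
            delta, theta = i / 20, j / 100
            ll = split_exponential_log_likelihood(
                delta, theta, failure_times, censoring_times)
            if ll > best[0]:
                best = (ll, delta, theta)
    return best

failures = [2.0, 3.0, 5.0]   # hypothetical failure times
censored = [10.0, 10.0]      # hypothetical censoring times
best_ll, delta_hat, theta_hat = grid_search(failures, censored)
# delta = 1 with theta = n / exposure = 0.1 is the ordinary exponential fit.
reduced = split_exponential_log_likelihood(1.0, 0.1, failures, censored)
print(delta_hat, theta_hat, best_ll, reduced)
```

<p>Because the exponential model is the special case \(\delta = 1\), the split model can never fit worse in terms of log-likelihood; the question is whether the improvement is large enough to justify the extra parameter.</p>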
<p>The split population model can be seen as a model of two separate subpopulations, one with hazard rate \(\theta\) and the other with hazard rate zero. A more general model has two subpopulations with non-zero hazard rates, namely \(\theta_1\) and \(\theta_2\). Such models help to account for a population that is heterogeneous in nature.</p>
<p>Split models can also be based on other distributions such as the lognormal. It is also possible to include explanatory variables in a split model. In such cases, the explanatory variables may be taken to affect the probability of failure, \(\delta\), or the distribution of time until failure.</p>
<p>For example, for a given feature vector \(x_i\) of explanatory variables, using <strong>logit/individual lognormal model</strong>, \(\delta\) is modeled using,</p>
<script type="math/tex; mode=display">\delta_i = 1/(1+exp(x_i^\prime \alpha)) \tag{33} \label{33}</script>
<p>and parameter \(\mu\) of the lognormal distribution is given by,</p>
<script type="math/tex; mode=display">\mu_i = x_i^\prime \beta \tag{34} \label{34}</script>
<p>Here, the parameter \(\alpha\) gives the effect of \(x_i\) on the probability of failure, and \(\beta\) gives the effect of \(x_i\) on the time until failure.</p>
<p>Such models are important because they let one distinguish the effects of explanatory variables on the probability of eventual failure from their effects on the time until failure for those who eventually do fail.</p>
<h3 id="heterogeneity-and-state-dependence">Heterogeneity and State Dependence</h3>
<p>The two major causes of observed declining hazard rates are:</p>
<ul>
<li>state dependence</li>
<li>heterogeneity</li>
</ul>
<p><strong>State dependence</strong> refers to a hazard rate that genuinely decreases over time because of an actual change in behavior at the individual level.</p>
<p>The second possible reason is <strong>heterogeneity</strong>. This basically means that the hazard rates are different across individuals, i.e., some individuals are more prone to failure than others. Naturally, individuals with higher hazard rates tend to fail earlier, on average, than individuals with lower hazard rates. As a result the average hazard rate of the surviving group will decrease with length of time simply because the most failure-prone individuals have been removed already. This is true even without state dependence, i.e., when each individual has a constant hazard rate but the hazard rate varies across individuals. Even such a group would display a decreasing hazard rate.</p>
<p>It is important to understand the difference because a decrease in the hazard rate due to state dependence means the underlying program is succeeding, while a decrease due to heterogeneity does not imply that the program is effective in preventing failure, because it arises purely from the composition of the data at hand.</p>
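<p>The heterogeneity effect can be illustrated numerically. The sketch below assumes a population made of two groups, each with its own constant hazard rate (so there is no state dependence at all), and computes the population hazard as mixture density divided by mixture survivor function. The group shares and rates are made-up values.</p>

```python
import math

def mixture_hazard(t, p, theta_1, theta_2):
    """Population hazard at time t for a mix of two constant-hazard groups.

    p is the share of the population with hazard theta_1; the rest has theta_2.
    """
    density = (p * theta_1 * math.exp(-theta_1 * t)
               + (1 - p) * theta_2 * math.exp(-theta_2 * t))
    survivor = p * math.exp(-theta_1 * t) + (1 - p) * math.exp(-theta_2 * t)
    return density / survivor

# Half the population fails at rate 1.0, the other half at rate 0.1.
hazards = [mixture_hazard(t, 0.5, 1.0, 0.1) for t in (0, 1, 2, 5, 10)]
print(hazards)
```

<p>The population hazard starts at the average of the two rates and drifts down towards the lower rate as the high-risk group is depleted, even though no individual's hazard ever changes.</p>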
<h3 id="time-varying-covariates">Time Varying Covariates</h3>
<p>Until now, the explanatory variables affecting the time until failure have been assumed not to change over time, but in practice they may well do so.</p>
<p>The types of explanatory variables can be categorized as follows:</p>
<ul>
<li>variables that do not change over time, e.g. race, sex etc.</li>
<li>variables that change over time but not within a single follow-up period, e.g. number of times followed up etc.</li>
<li>variables that change continuously over time, such as age, education etc.</li>
</ul>
<p>The last type of variables make it reasonable to use a statistical model that allows covariates to vary over time. Such incorporation is relatively straightforward in hazard-based models such as proportional hazard models. At each point in time, hazard rate is determined by the values of explanatory variables at that time.</p>
<p>However, it is much more difficult to introduce time-varying components into parametric models because these models are parameterized in terms of the density and the cumulative distribution function, and the value of either function at time \(t\) depends on the whole history of the explanatory variables up to time \(t\). <strong>In the presence of time varying covariates, a parameterization of the hazard rate would be much more convenient.</strong></p>
<p><strong>Panel or Longitudinal Data:</strong> data on individuals over time without reference to just a single follow-up. Such datasets include a large number of time-varying explanatory variables.</p>
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://link.springer.com/article/10.1007/BF01083132#" target="_blank">Survival Analysis: A Survey</a></small><br /></p>
Mon, 23 Jul 2018 00:00:00 +0000
https://machinelearningmedium.com/2018/07/23/survival-analysis/
Google Smart Reply
<h3 id="introduction">Introduction</h3>
<p>Smart reply is an end-to-end method for automatically generating <strong>short yet semantically diverse</strong> email responses. The feature also depends on some novel methods for <strong>semantic clustering of user-generated content</strong> that require a minimal amount of explicitly labeled data.</p>
<p>Google reveals that around 25% of email responses are 20 tokens or less in length. The high frequency of short replies was the major motivation behind developing an automated reply assist feature. The system exploits machine learning concepts such as fully-connected neural networks, LSTMs etc.</p>
<p>The major challenges addressed in building this feature include the following:</p>
<ul>
<li>High <strong>response quality</strong> in terms of language and content.</li>
<li><strong>Utility</strong> maintained by presenting a variety of responses.</li>
<li><strong>Scalable architecture</strong> to serve the millions of emails Google handles without significant latency.</li>
<li>Maintaining <strong>privacy</strong> by ensuring that no personal data is leaked while generating training data. <strong>Only aggregate statistics are inspected.</strong></li>
</ul>
<h3 id="smart-reply">Smart Reply</h3>
<p>Smart reply consists of the following components:</p>
<ul>
<li>Response Selection: An LSTM network processes the incoming messages and produces the most likely responses. <strong>To improve scalability and increase speed of processing, only approximate best responses are found.</strong></li>
<li>Response Set Generation: In order to maintain high quality, the responses are selected from a response space <strong>generated offline using a semi-supervised graph learning approach</strong>.</li>
</ul>
<p><img src="/assets/2018-07-22-google-smart-reply/fig-1-smart-reply.png?raw=true" alt="Fig-1: Lifecycle of a message" width="50%" /></p>
<ul>
<li>Diversity: After generating the most likely responses, a smaller set of responses is chosen among them to <strong>maximize the utility, which requires enforcing diverse semantic intents</strong> among the presented options.</li>
<li>Triggering Model: A feedforward neural network decides whether or not to suggest responses, which further improves the utility by not showing suggestions when they are unlikely to be used.</li>
</ul>
<h3 id="background">Background</h3>
<p>The entire application of smart reply can be basically broken down into two core tasks:</p>
<ul>
<li>predicting responses</li>
<li>identifying a target response space</li>
</ul>
<p>While the task of finding an apt response has been attempted before, it has never been applied to a production environment at such a scale. It is this widespread use of the application that requires it to deliver high quality responses in all instances. This is achieved by choosing the responses from a pre-identified response space.</p>
<p>This leads to the second core task of identifying the target response space. It is achieved using an algorithm called the <strong>Expander Graph Learning Approach</strong>, which is used because it scales well to really large datasets and large output sizes. While the approach is generally used for knowledge expansion and classification tasks, smart reply is the first attempt to use it for semantic intent clustering.</p>
<h3 id="selecting-responses">Selecting Responses</h3>
<p>The fundamental aim of smart reply is to find the most likely response given an original message text, i.e. given an original message \(o\) and the set of all possible responses \(R\), find,</p>
<script type="math/tex; mode=display">r^* = argmax_{r \in R} P(r|o) \tag{1} \label{1}</script>
<p>In order to achieve this, a model is built to score the responses and then the response with the highest score is picked.</p>
<p><strong>LSTM Model</strong></p>
<ul>
<li>Since a sequence of tokens \(r\) is being scored conditional on another sequence of tokens \(o\), the task is a natural fit for <strong>sequence to sequence learning</strong>.</li>
<li>Input to the model is the original message \(\{o_1, o_2, \cdots o_n\}\)</li>
<li>The output is the conditional probability distribution of sequence of response tokens given the input:</li>
</ul>
<script type="math/tex; mode=display">P(r_1, r_2, \cdots, r_m | o_1, o_2, \cdots, o_n) \tag{2} \label{2}</script>
<p>The distribution in \eqref{2} can be further factorized as,</p>
<script type="math/tex; mode=display">P(r_1, \cdots, r_m | o_1, \cdots, o_n) = \prod_{i=1}^m P(r_i|o_1, \cdots, o_n, r_1, \cdots, r_{i-1}) \tag{3} \label{3}</script>
<p>In practice, the sequence of original message tokens is fed to the LSTM, which encodes the entire message in a vector representation. Given this state, a softmax output is computed, which is interpreted as \(P(r_1|o_1, \cdots, o_n)\), the probability distribution of the first response token.</p>
<p>Similarly, as the response tokens are fed in, the softmax at each timestep \(t\) is interpreted as \(P(r_t|o_1, \cdots, o_n, r_1, \cdots, r_{t-1})\).</p>
<p>Using the factorization in \eqref{3}, these softmax scores can be used to compute \(P(r_1, r_2, \cdots, r_m | o_1, o_2, \cdots, o_n)\).</p>
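<p>The scoring step can be sketched with a toy stand-in for the LSTM. The lookup table below is purely hypothetical: it plays the role of the per-timestep softmax and ignores the original message, which a real model would condition on. The point is only to show how the factorization \eqref{3} turns per-token probabilities into a score for a whole response, and how \eqref{1} then picks the best candidate.</p>

```python
import math

# Hypothetical stand-in for the LSTM softmax: maps a response prefix to a
# distribution over the next token. "EOS" marks the end of a response.
TOY_SOFTMAX = {
    (): {"thanks": 0.6, "sure": 0.4},
    ("thanks",): {"!": 0.5, "a": 0.5},
    ("thanks", "a"): {"lot": 1.0},
    ("thanks", "a", "lot"): {"EOS": 1.0},
    ("thanks", "!"): {"EOS": 1.0},
    ("sure",): {"EOS": 1.0},
}

def response_log_prob(response_tokens):
    """Sum of log P(r_i | o, r_1 .. r_{i-1}) over the response, as in (3)."""
    log_prob = 0.0
    for i, token in enumerate(response_tokens):
        prefix = tuple(response_tokens[:i])
        log_prob += math.log(TOY_SOFTMAX[prefix][token])
    return log_prob

candidates = [
    ["thanks", "a", "lot", "EOS"],
    ["sure", "EOS"],
    ["thanks", "!", "EOS"],
]
# Equation (1): pick the candidate with the highest conditional probability.
best = max(candidates, key=response_log_prob)
print(best, math.exp(response_log_prob(best)))
```

<p>Working with log probabilities rather than raw products avoids numerical underflow for long responses, which is also why the training objective is stated as a sum of logs.</p>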
<p>Training involves the following points:</p>
<ul>
<li>maximize the log probability of observed responses, given their respective original messages, i.e.</li>
</ul>
<script type="math/tex; mode=display">\sum_{(o, r)} log \, P(r_1, \cdots, r_m | o_1, \cdots, o_n) \tag{4} \label{4}</script>
<ul>
<li>train using stochastic gradient descent with AdaGrad.</li>
<li>training is done on a distributed system because of the size of the dataset.</li>
<li>a <strong>recurrent projection layer</strong> helped improve quality and time to convergence.</li>
<li><strong>gradient clipping</strong> helps stabilize training.</li>
</ul>
<p><strong>Inference</strong>: At the time of inference one can feed in the original message and then use the output of the softmaxes to get a probability distribution over the vocabulary at each timestep. These can be used in a variety of ways:</p>
<ul>
<li>to draw a random sample from the response distribution. This is done by sampling one token at each timestep to feed it back into the model.</li>
<li>to approximate the most likely response given the original message. This can be done greedily by taking most likely token at each timestep and feeding it back in. A less greedy strategy is to use <strong>beam search</strong>, i.e. take the top \(b\) tokens and feed them in, then retain the best \(b\) response prefixes and repeat.</li>
<li>to determine the likelihood of a specific response candidate. This is done by feeding in each token of the candidate and using the softmax output to get the likelihood of the next candidate token.</li>
</ul>
<h3 id="challenges">Challenges</h3>
<p><strong>Response Quality</strong></p>
<ul>
<li>In order to be surfaced to users, responses must always be of high quality in terms of style, tone, diction, and content. Since the models are trained on real-world data, one has to account for the possibility that the most likely response is not necessarily a high quality response. Even the most frequent responses might not be appropriate to suggest to users because they could contain poor grammar, spelling or mechanics (like <em>your the best!</em>) or convey a sense of familiarity that is likely to be offensive (like <em>thanks hon!</em>).</li>
<li>While restricting the vocabulary can take care of issues such as profanity or spelling errors, it would not be sufficient to avert a politically incorrect statement, which can be formed in a wide variety of ways.</li>
<li>Hence, smart reply uses semi-supervised learning to build the target response space \(R\) comprising only high quality responses.</li>
<li>The model described above is then used to choose the best response in \(R\), instead of the best response among all sequences of words in the vocabulary.</li>
</ul>
<p><strong>Utility</strong></p>
<ul>
<li>Suggestions are most useful when they are highly specific to the original message and express a diverse intent.</li>
<li>Generally, the outputs observed from the LSTM tend to (1) favor common but unspecific responses and (2) have little diversity.</li>
<li>Specificity of the responses is increased by penalizing the responses that are applicable to a broad range of incoming messages.</li>
<li>In order to increase the breadth of options presented to users, diversity is enforced by exploiting the semantic structure of \(R\).</li>
<li>Utility of responses is also boosted by passing the incoming message first through a triggering model which decides whether or not it is appropriate for suggestions to pop up.</li>
</ul>
<p><strong>Scalability</strong></p>
<ul>
<li>Scoring every candidate \(r \in R\) would require \(O(|R | l)\) LSTM steps where \(l\) is the length of the longest response.</li>
<li>This would mean a growing response time as the number of responses in \(R\) increases over time.</li>
<li>In general, the running time of an efficient algorithm for this purpose should not be a function of \(|R|\).</li>
<li>In order to achieve this, the responses in \(R\) are organized as a trie, and a left-to-right beam search is then run that retains only the hypotheses that appear in the trie.</li>
<li>This search process has a complexity of \(O(bl)\) where both \(b\) and \(l\) are in a range of 10-30, which greatly reduces the time it would take to generate the responses.</li>
<li>Although the search only approximates the best responses in \(R\), its results are very similar to what one would get by scoring and ranking all \(r \in R\), even for a small \(b\).</li>
<li>Also, passing each message through the triggering model first reduces the average time a message has to spend in LSTM computations.</li>
</ul>
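<p>The trie-constrained search described above can be sketched as follows. This is an illustration under stated assumptions, not the production algorithm: the trie is a nested dictionary, <code>token_log_prob</code> is a hypothetical stand-in for the LSTM softmax that ignores both the message and the prefix, and the toy probabilities are made up.</p>

```python
import math

END = "$END"  # sentinel key marking that a complete response ends at this node

def build_trie(responses):
    """Nested-dict trie over token sequences."""
    root = {}
    for tokens in responses:
        node = root
        for token in tokens:
            node = node.setdefault(token, {})
        node[END] = {}
    return root

def trie_beam_search(trie, token_log_prob, beam_width):
    """Left-to-right beam search that only expands prefixes present in the trie."""
    beams = [(0.0, (), trie)]    # (log probability, prefix, trie node)
    completed = []
    while beams:
        expansions = []
        for log_prob, prefix, node in beams:
            for token, child in node.items():
                step = token_log_prob(prefix, token)
                if token == END:
                    completed.append((log_prob + step, prefix))
                else:
                    expansions.append((log_prob + step, prefix + (token,), child))
        # Keep only the best `beam_width` prefixes for the next step.
        expansions.sort(key=lambda item: item[0], reverse=True)
        beams = expansions[:beam_width]
    completed.sort(key=lambda item: item[0], reverse=True)
    return completed

# Hypothetical per-token probabilities standing in for the model.
TOY_PROBS = {"thanks": 0.5, "sure": 0.3, "no": 0.2,
             "a": 0.6, "lot": 0.9, "thing": 0.8, END: 0.9}

def token_log_prob(prefix, token):
    return math.log(TOY_PROBS[token])

trie = build_trie([("thanks",), ("thanks", "a", "lot"), ("sure", "thing"), ("no",)])
results = trie_beam_search(trie, token_log_prob, beam_width=2)
print([prefix for _, prefix in results])
```

<p>With a beam width of 2 the response <em>no</em> is pruned after the first step, which illustrates why the search is only approximate; widening the beam to 3 recovers it. The number of expansion rounds is bounded by the longest response, independent of how many responses the trie holds.</p>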
<h3 id="response-set-generation">Response Set Generation</h3>
<ul>
<li>The goal of this step is to generate a structured response set that effectively captures various intents conveyed by people in natural language conversations.</li>
<li>The target response space is required to capture variability in both language and intents.</li>
<li>The results are used in two ways - (1) define a response space and (2) promote diversity among chosen suggestions.</li>
<li>Response set is constructed by aggregating the most frequently used sentences among the preprocessed data.</li>
</ul>
<p><strong>Canonicalizing Email Responses</strong></p>
<ul>
<li>Involves generating a set of canonicalized responses that capture the variability in language.</li>
<li>This is done by performing a dependency parse on all the sentences and then using the syntactic structure to generate a canonicalized representation.</li>
<li>Words and phrases that are modifiers or not attached to the head words are ignored.</li>
</ul>
<p><strong>Semantic Intent Clustering</strong></p>
<ul>
<li>partition the responses into semantic clusters where each cluster represents a meaningful response intent.</li>
<li>all the messages within a cluster share the same semantic meaning but may appear different in structure.</li>
<li>this helps digest the entire information present in frequent responses into a coherent set of semantic clusters.</li>
<li>because of the lack of data available to train a classifier, a supervised model cannot be trained to predict the semantic cluster of a candidate response.</li>
<li>another hindrance in performing supervised learning is that the semantic space classes cannot be all defined a priori.</li>
<li>hence a semi-supervised technique is used to achieve this.</li>
</ul>
<p><strong>Graph Construction</strong></p>
<ul>
<li>Start by manually defining the clusters sampled from top frequent responses.</li>
<li>A small number of responses are added as seed for the clustering.</li>
<li>This leads to a base graph, where <strong>frequent responses are represented by nodes, \(V_R\)</strong>. Lexical features (n-grams and skip-grams up to a length of 3) are extracted for the responses and populated in the graph as the <strong>feature nodes, \(V_F\)</strong>. Edges are created between pairs of nodes, \((u,v)\) where \(u \in V_R\) and \(v \in V_F\). Similarly, nodes are created for manually labelled examples, \(V_L\).</li>
</ul>
<p><strong>Semi-supervised Learning</strong></p>
<ul>
<li>The constructed graph captures the relationship between the canonicalized responses via feature nodes.</li>
<li>Semantic intent for each response node is learnt by propagating intent information from manually labelled examples through the graph.</li>
</ul>
<p>The algorithm works to minimize the following objective function for the response nodes:</p>
<script type="math/tex; mode=display">s_i \lVert \hat{C_i} - C_i \rVert^2 + \mu_{pp} \lVert \hat{C_i} - U \rVert^2 + \mu_{np} \left( \sum_{j \in \mathcal{N}_{\mathcal{F}} (i)} w_{ij} \lVert \hat{C_i} - \hat{C_j} \rVert^2 + \sum_{k \in \mathcal{N}_{\mathcal{R}} (i)} w_{ik} \lVert \hat{C_i} - \hat{C_k} \rVert^2\right) \tag{5} \label{5}</script>
<p>where</p>
<ul>
<li>\(s_i\) is an <strong>indicator function</strong> equal to 1 if node \(i\) is a seed else 0.</li>
<li>\(\hat{C_i}\) is the <strong>learnt semantic cluster distribution</strong> for response node \(i\).</li>
<li>\(C_i\) is the <strong>true label distribution</strong> (i.e. for the manually provided examples)</li>
<li>\(\mathcal{N}_{\mathcal{F}} (i)\) and \(\mathcal{N}_{\mathcal{R}} (i)\) represent the feature and response neighbourhood of node \(i\).</li>
<li>\(\mu_{np}\) is the predefined penalty for neighbouring nodes with divergent label distributions.</li>
<li>\(\hat{C_j}\) is the learnt label distribution for feature neighbour \(j\).</li>
<li>\(w_{ij}\) is the weight of feature \(j\) in response \(i\).</li>
<li>\(\mu_{pp}\) is the penalty for label distribution deviating from prior, Uniform Distribution \(U\).</li>
</ul>
<p>Similarly, the algorithm minimizes the following objective function for the feature nodes:</p>
<script type="math/tex; mode=display">\mu_{pp} \lVert \hat{C_i} - U \rVert^2 + \mu_{np} \left( \sum_{j \in \mathcal{N}_{\mathcal{F}} (i)} w_{ij} \lVert \hat{C_i} - \hat{C_j} \rVert^2 + \sum_{k \in \mathcal{N}_{\mathcal{R}} (i)} w_{ik} \lVert \hat{C_i} - \hat{C_k} \rVert^2\right) \tag{6} \label{6}</script>
<p>\eqref{5} and \eqref{6} are alike except that \eqref{6} does not have the first term as there are no seed labels for the feature nodes.</p>
<p>The objective functions \eqref{5} and \eqref{6} are jointly optimized over all the nodes. In order to discover new clusters, the algorithm is run in phases, in each of which 100 new responses are randomly sampled from the unlabeled nodes. These are treated as potential new clusters and labeled with their canonicalized representations, after which the algorithm is rerun and the process is repeated for the remaining unlabeled nodes.</p>
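<p>To make the terms of \eqref{5} concrete, the sketch below evaluates the objective for a single response node. It is a simplification under stated assumptions: label distributions are plain lists, the feature and response neighbourhoods are merged into one list of (weight, distribution) pairs, and all names and numbers are hypothetical. It only evaluates the objective; it does not implement the joint optimization.</p>

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two label distributions."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def response_node_objective(c_hat_i, c_i, is_seed, neighbours, mu_pp, mu_np):
    """Value of objective (5) for one response node.

    c_hat_i    -- learnt cluster distribution of the node
    c_i        -- true label distribution (only used when the node is a seed)
    neighbours -- (weight, learnt distribution) pairs covering both the
                  feature and the response neighbourhood of the node
    """
    num_clusters = len(c_hat_i)
    uniform = [1.0 / num_clusters] * num_clusters
    seed_term = squared_distance(c_hat_i, c_i) if is_seed else 0.0
    prior_term = mu_pp * squared_distance(c_hat_i, uniform)
    smooth_term = mu_np * sum(w * squared_distance(c_hat_i, c_j)
                              for w, c_j in neighbours)
    return seed_term + prior_term + smooth_term

c_hat = [0.8, 0.2]                   # current estimate over two clusters
neighbours = [(1.0, [0.6, 0.4])]     # a single neighbour with weight 1
seed_value = response_node_objective(c_hat, [1.0, 0.0], True, neighbours, 0.1, 0.5)
non_seed_value = response_node_objective(c_hat, [1.0, 0.0], False, neighbours, 0.1, 0.5)
print(seed_value, non_seed_value)
```

<p>Dropping the seed term, as in the second call, gives the same form as the feature-node objective \eqref{6}.</p>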
<p><strong>Cluster Validation</strong></p>
<ul>
<li>Finally, the top \(k\) members from each semantic cluster are extracted and sorted by their label scores.</li>
<li>The set of (response, cluster label) pairs are then validated by human raters.</li>
</ul>
<h3 id="suggestion-diversity">Suggestion Diversity</h3>
<ul>
<li>The LSTM model is trained to return the approximate best response among the target response set.</li>
<li>The responses are <strong>penalized if they are too general</strong> to be valuable to any user.</li>
<li>The next <strong>challenge lies in choosing a small number of responses</strong> to display to the user which maximizes the utility.</li>
<li>A straightforward way of doing this is to <strong>choose the top \(N\) responses</strong> and present them to the user. In practice, however, such responses tend to be very similar. The likelihood of at least one response being useful is greatest when none of the responses presented are redundant, i.e. it would be wasteful to present a user with three responses that are variations of the same sentence.</li>
<li>The second, more effective approach is to <strong>enforce diversity</strong>. This is achieved by:
<ul>
<li>omitting redundant responses.</li>
<li>enforcing negative or positive responses.</li>
</ul>
</li>
</ul>
<p><strong>Omitting Redundant Responses</strong></p>
<ul>
<li>The strategy states that a user should <strong>never see two responses with the same intent</strong>.</li>
</ul>
<blockquote>
<p>Intent can be thought of as a cluster of responses that have a common communication purpose.</p>
</blockquote>
<ul>
<li>In smart reply, every suggested response is associated with exactly one intent. These intents are learnt using the semi-supervised learning algorithm explained <a href="#response-set-generation">above</a>.</li>
<li>The actual diversity strategy is simple: the top responses are iterated over in order of decreasing score. Each response is added to the suggestion list unless its intent is already covered by a response in the list.</li>
</ul>
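<p>The iteration described above can be sketched as follows. The names <code>select_diverse</code>, <code>scored_responses</code> and <code>intent_of</code> are assumptions made for illustration; the paper does not prescribe this interface.</p>

```python
def select_diverse(scored_responses, intent_of, n=3):
    """Pick up to n responses, at most one per intent, in decreasing score order.

    scored_responses : iterable of (response, score) pairs
    intent_of        : dict mapping each response to its single intent
    """
    suggestions = []
    covered_intents = set()
    ranked = sorted(scored_responses, key=lambda pair: pair[1], reverse=True)
    for response, _score in ranked:
        intent = intent_of[response]
        if intent in covered_intents:
            continue  # an equally purposed response is already suggested
        suggestions.append(response)
        covered_intents.add(intent)
        if len(suggestions) == n:
            break
    return suggestions
```

<p>For example, if "Sounds good!" and "Sounds great." share an intent, only the higher scoring of the two is kept.</p>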
<p><strong>Enforcing Negatives and Positives</strong></p>
<ul>
<li>It is observed that the LSTM trained has a strong tendency towards positive responses, whereas negative responses generally get a low score.</li>
<li>It might be reflective of the style of email conversations: positive replies are more common and when the replies are negative people prefer more indirect wording.</li>
<li>Since it is important to offer the option of responding negatively, the following strategy is used:</li>
</ul>
<blockquote>
<p>If the top two responses (chosen from different intents) contain at least one positive response and none of the three responses is negative, the third response is replaced with a negative one.</p>
</blockquote>
<ul>
<li>
<p>A positive response is one that is clearly affirmative. To find the negative response to be included as the third option, a second LSTM pass is performed, in which the search is restricted to the negative responses in the target set.</p>
</li>
<li>
<p>It might also be the case that an incoming message triggers exclusively negative responses, in which case an analogous strategy for enforcing positives is employed.</p>
</li>
</ul>
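<p>A minimal sketch of the negative-enforcing rule is given below. The sentiment labels and the argument <code>best_negative</code> (standing in for the result of the second LSTM pass restricted to negative responses) are assumptions for illustration.</p>

```python
def enforce_negative(suggestions, sentiment_of, best_negative):
    """Replace the third suggestion with a negative one when the rule applies.

    suggestions   : list of responses, already de-duplicated by intent
    sentiment_of  : dict mapping a response to "positive", "negative" or "neutral"
    best_negative : top response from a search restricted to negative responses
    """
    top_two_has_positive = any(
        sentiment_of[r] == "positive" for r in suggestions[:2]
    )
    has_negative = any(sentiment_of[r] == "negative" for r in suggestions)
    if len(suggestions) == 3 and top_two_has_positive and not has_negative:
        return suggestions[:2] + [best_negative]
    return list(suggestions)  # rule does not apply, keep the list unchanged
```

<p>The analogous rule for enforcing positives would swap the roles of the two labels.</p>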
<h3 id="triggering">Triggering</h3>
<ul>
<li>This is a second model (in this case a fully-connected feed-forward neural network that produces a probability score) responsible for filtering out messages that are bad candidates for suggested responses. These might include emails that require longer responses, or emails that do not require a response at all.</li>
<li>On average, this system passes only 11% of incoming messages on for smart reply processing. This selectivity reduces the time spent in the LSTM and, in turn, the infrastructure costs.</li>
<li>The two main objectives that this system should fulfil are:
<ul>
<li>it should be accurate enough to decide when a smart reply should not be generated</li>
<li>it should be fast.</li>
</ul>
</li>
<li>This model is chosen because such neural networks have repeatedly been observed to outperform linear models such as SVMs or linear regression on NLP tasks.</li>
</ul>
<p><strong>Data and Features</strong></p>
<ul>
<li>The data consists of pairs \((o, y)\), where \(o\) is an incoming message and \(y\) is a boolean indicating whether or not the email was replied to. For the positive class, only the messages that were replied to from a mobile device are considered.</li>
<li>Since emails that are not replied to are found to be more numerous, the negative class examples are downsampled to match the number of positive class examples.</li>
<li><strong>Features</strong> (unigrams and bigrams) are extracted from the message body, subject and headers. Other <strong>social signals</strong>, such as whether or not the sender is in the recipient’s address book, are also used.</li>
</ul>
<p><strong>Network Architecture and Training</strong></p>
<ul>
<li>A feed-forward neural network with an embedding layer and three fully connected hidden layers.</li>
<li>Feature hashing is used to bucket rare words that are not present in the vocabulary.</li>
<li>Embeddings are aggregated by summation within a feature (such as bigrams).</li>
<li>ReLU is used as the activation function, together with dropout layers.</li>
<li>Trained using AdaGrad optimization technique.</li>
</ul>
<h3 id="evaluation-and-results">Evaluation and Results</h3>
<p><strong>Data</strong></p>
<ul>
<li>For the LSTM model, the data consists of incoming messages and the user's responses to them.</li>
<li>
<p>For the triggering model, messages are used with the label describing whether or not they were replied to from a mobile device.</p>
</li>
<li>The following <strong>preprocessing</strong> techniques are used:
<ul>
<li>Language detection: non-English messages are discarded.</li>
<li>Tokenization: messages and subjects are broken down into words and punctuation marks.</li>
<li>Sentence segmentation: sentence boundaries are detected in the message body.</li>
<li>Normalization: infrequent words and entities such as personal information are replaced by special tokens.</li>
<li>Quotation removal: Quoted original messages and forwarded messages are removed.</li>
<li>Salutation/close removal: salutations and closing notes are removed.</li>
</ul>
</li>
<li>After preprocessing the size of the training data is <strong>238 million</strong> messages, which includes 153 million messages that have no response.</li>
</ul>
<h3 id="conclusions">Conclusions</h3>
<ul>
<li>Standard binary classification metrics are reported for the triggering model: precision, recall and area under the ROC curve (AUC).</li>
<li>The AUC of the triggering model is 0.854.</li>
<li>For the LSTM model, perplexity, Mean Reciprocal Rank and Precision@K are reported.</li>
<li>A model with lower perplexity assigns a higher likelihood to the test responses, and hence should be better at predicting responses. The perplexity of smart reply is 17.0 (by comparison, an n-gram model with Katz backoff and a maximum order of 5 has a perplexity of 31.4).</li>
</ul>
<blockquote>
<p>A perplexity equal to \(k\) means that when the model predicts the next word, there are on average \(k\) likely candidates.</p>
</blockquote>
<ul>
<li>In an ideal scenario the perplexity of the system would be 1, i.e. one knows exactly what should be the next word. The perplexity on a set of \(N\) test samples is computed using the following formula:</li>
</ul>
<script type="math/tex; mode=display">P_r = \exp\left( - {1 \over W} \sum_{i=1}^N \ln \left( \hat{P} (r_1^i, \cdots, r_m^i \mid o_1^i, \cdots, o_n^i) \right) \right) \tag{7} \label{7}</script>
<p>where</p>
<ul>
<li>\(W\) is the total number of words in the \(N\) samples.</li>
<li>\(\hat{P}\) is the learnt distribution</li>
<li>
<p>\(r^i\) and \(o^i\) are the \(i\)-th response and original message.</p>
</li>
<li>The model is also evaluated on the response ranking. Simply put, the rank of the actual response with respect to other responses in R is evaluated. Using this, the <strong>mean reciprocal rank</strong> (MRR) is calculated using:</li>
</ul>
<script type="math/tex; mode=display">MRR = {1 \over N} \sum_{i=1}^N {1 \over rank_i} \tag{8} \label{8}</script>
<ul>
<li>
<p>Additionally, Precision@K (for a given value of \(K\), the number of cases for which the target response \(r\) is within the top \(K\) responses ranked by the model) is also computed.</p>
</li>
<li>
<p>On a daily basis, the smart reply system generates 12.9k unique suggestions that belong to 376 unique semantic clusters, out of which the users utilized 31.9% of the suggestions and 83.2% of the unique clusters.</p>
</li>
<li>Among the selected responses, 45% are the 1st responses, 35% 2nd responses, and 20% 3rd responses.</li>
<li>If using only the straight-forward approach instead of enforcing diversity, the click through rates drop by roughly 7.5%.</li>
</ul>
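<p>The three LSTM metrics above can be sketched in a few lines. The function names are hypothetical, the inputs (per-response log-likelihoods and 1-based ranks) are assumed to be available from the model, and Precision@K is reported here as a fraction of cases rather than a raw count.</p>

```python
import math


def perplexity(log_probs, num_words):
    """Equation (7): exp of the negative mean per-word log-likelihood.

    log_probs : natural-log probability of each test response given its message
    num_words : W, the total number of words over all test responses
    """
    return math.exp(-sum(log_probs) / num_words)


def mean_reciprocal_rank(ranks):
    """Equation (8): average of 1 / rank of the actual response (ranks start at 1)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def precision_at_k(ranks, k):
    """Fraction of cases in which the actual response is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)
```

<p>As a sanity check, a model that assigns probability 1 to every test response has all log-probabilities equal to 0 and therefore a perplexity of exactly 1.</p>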
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://ai.google/research/pubs/pub45189" target="_blank">Smart Reply: Automated Response Suggestion for Email</a></small><br />
<small><a href="https://www.blog.google/products/gmail/save-time-with-smart-reply-in-gmail/" target="_blank">Save time with Smart Reply in Gmail</a></small></p>
Sun, 22 Jul 2018 00:00:00 +0000
https://machinelearningmedium.com/2018/07/22/google-smart-reply/
https://machinelearningmedium.com/2018/07/22/google-smart-reply/NLPmachine-learningpapersLarge Scale Learning<h3 id="introduction">Introduction</h3>
<p>The popularity of machine learning techniques has increased in the recent past. One of the reasons behind this trend is the exponential growth in the data available to learn from. Large datasets coupled with a high variance model have the potential to perform well. But as the size of datasets increases, it poses various problems in terms of the space and time complexity of the algorithms.</p>
<blockquote>
<p>It’s not who has the best algorithm that wins. It’s who has the most data.</p>
</blockquote>
<p>For example, consider the update rule for parameter optimization using gradient descent from (3) and (4) in the <a href="/2017/08/23/multivariate-linear-regression/" target="\_blank">multivariate linear regression post</a>,</p>
<script type="math/tex; mode=display">\theta_j := \theta_j - \alpha {1 \over m} \sum_{i=1}^m \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \tag{1} \label{1}</script>
<blockquote>
<p><a href="https://www.kaggle.com/shamssam/gradient-descent-for-regression" target="\_blank">Kaggle Kernel Implementation</a></p>
</blockquote>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">batch_update_vectorized</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">m</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
    <span class="k">return</span> <span class="n">torch</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_add_bias</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">)</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
        <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">forward</span><span class="p">()</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_train</span><span class="p">)</span>
    <span class="p">)</span> <span class="o">/</span> <span class="n">m</span>

<span class="k">def</span> <span class="nf">batch_update_iterative</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
    <span class="n">m</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
    <span class="n">update_theta</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">X</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_add_bias</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
        <span class="k">if</span> <span class="nb">type</span><span class="p">(</span><span class="n">update_theta</span><span class="p">)</span> <span class="o">==</span> <span class="n">torch</span><span class="o">.</span><span class="n">DoubleTensor</span><span class="p">:</span>
            <span class="n">update_theta</span> <span class="o">+=</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_forward</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_train</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">*</span> <span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">update_theta</span> <span class="o">=</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_forward</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_train</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">*</span> <span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">update_theta</span><span class="o">/</span><span class="n">m</span>

<span class="k">def</span> <span class="nf">batch_train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tolerance</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">):</span>
    <span class="n">converged</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="n">prev_cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
    <span class="n">init_cost</span> <span class="o">=</span> <span class="n">prev_cost</span>
    <span class="n">num_epochs</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="ow">not</span> <span class="n">converged</span><span class="p">:</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">*</span> <span class="bp">self</span><span class="o">.</span><span class="n">batch_update_vectorized</span><span class="p">()</span>
        <span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">prev_cost</span> <span class="o">-</span> <span class="n">cost</span><span class="p">)</span> <span class="o"><</span> <span class="n">tolerance</span><span class="p">:</span>
            <span class="n">converged</span> <span class="o">=</span> <span class="bp">True</span>
        <span class="n">prev_cost</span> <span class="o">=</span> <span class="n">cost</span>
        <span class="n">num_epochs</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div></div>
<p>From \eqref{1} above, it can be seen that for each step of gradient descent, the summation has to be performed over the entire dataset of \(m\) examples. While this might seem inconsequential for small datasets, it has a very high impact on the training time as the size of the dataset increases.</p>
<p>In such cases, it would also be helpful to plot <a href="/2018/04/02/evaluation-of-learning-algorithm/#learning-curves">learning curves</a> to check whether training the model with such a large number of data samples is really helpful, because if the model has high bias then a similar result could be achieved using a smaller dataset. In such cases it would be more helpful to increase the variance of the model.</p>
<p>On the other hand, if the learning curves show that using the larger dataset is indeed helpful, it would be more productive to use more computationally efficient algorithms to train the model such as the ones mentioned in the following sections.</p>
<h3 id="stochastic-gradient-descent">Stochastic Gradient Descent</h3>
<p>The gradient descent rule presented in \eqref{1}, also known as <strong>batch gradient descent</strong>, has the disadvantage that for each update the summation of update term has to be performed over all the training data.</p>
<p>Stochastic gradient descent is an approximation of the batch gradient descent. Each epoch in this algorithm is begun with a random shuffle of the data followed by the following update rule,</p>
<script type="math/tex; mode=display">\theta_j := \theta_j - \alpha \left( h_{\theta}(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \tag{2} \label{2}</script>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">stochastic_train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tolerance</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">):</span>
    <span class="n">converged</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="n">m</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
    <span class="n">X</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_add_bias</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">)</span>
    <span class="n">init_cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
    <span class="n">num_epochs</span><span class="o">=</span><span class="mi">0</span>
    <span class="k">while</span> <span class="ow">not</span> <span class="n">converged</span><span class="p">:</span>
        <span class="n">prev_cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">*</span> <span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_forward</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="o">.</span><span class="n">view</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">))</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_train</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">*</span> <span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">]</span>
        <span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">prev_cost</span><span class="o">-</span><span class="n">cost</span> <span class="o"><</span> <span class="n">tolerance</span><span class="p">:</span>
            <span class="n">converged</span><span class="o">=</span><span class="bp">True</span>
        <span class="n">num_epochs</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div></div>
<p>i.e. for each training example, as soon as the gradient corresponding to that instance is calculated, it is used to make an approximate update to the parameters instead of waiting for the summation over the whole dataset to finish. While this does not reach the global minimum as precisely as batch gradient descent, it generally ends up oscillating in close proximity to it.</p>
<blockquote>
<p>In practice, stochastic gradient descent speeds up the process of convergence over the traditional batch gradient descent.</p>
</blockquote>
<p>While the learning rate is kept constant in most implementations of stochastic gradient descent, it is observed in practice that it helps to decrease the learning rate gradually as the iterations proceed. This can be done as follows,</p>
<script type="math/tex; mode=display">\alpha = \frac {constant_1} {iteration\_number + constant_2} \tag{3} \label{3}</script>
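<p>The following is a minimal NumPy sketch of stochastic gradient descent for linear regression that combines the per-epoch shuffle with the decaying learning rate of \eqref{3}. The function names and the default constants are assumptions chosen for illustration, and NumPy is used in place of the class-based code above to keep the example self-contained.</p>

```python
import numpy as np


def decayed_alpha(iteration, const1=1.0, const2=10.0):
    """Learning rate schedule of equation (3)."""
    return const1 / (iteration + const2)


def sgd_linear_regression(X, y, epochs=50, const1=1.0, const2=10.0, seed=0):
    """SGD for linear regression; X is expected to already contain a bias column."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    iteration = 0
    for _ in range(epochs):
        # each epoch starts with a random shuffle of the training examples
        for i in rng.permutation(m):
            alpha = decayed_alpha(iteration, const1, const2)
            error = X[i] @ theta - y[i]           # h_theta(x_i) - y_i
            theta = theta - alpha * error * X[i]  # update rule (2), one example
            iteration += 1
    return theta
```

<p>The constants control how quickly the step size shrinks; they are additional hyperparameters that need tuning, which is one reason a constant learning rate is often used instead.</p>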
<h3 id="mini-batch-gradient-descent">Mini-Batch Gradient Descent</h3>
<p>While batch gradient descent sums over all the data for a single update of the parameters, stochastic gradient descent performs an update for each individual training example as it is encountered. <strong>Mini-batch gradient descent</strong> takes the middle path and uses the summation over only <strong>\(b\) training examples (i.e. the batch size)</strong> for every update. For the mini-batch starting at example \(i\), it can be written as,</p>
<script type="math/tex; mode=display">\theta_j := \theta_j - \alpha {1 \over b} \sum_{k=i}^{i+b-1} \left( h_{\theta}(x^{(k)}) - y^{(k)} \right) x_j^{(k)} \tag{4} \label{4}</script>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">mini_batch_train</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tolerance</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">8</span><span class="p">):</span>
    <span class="n">converged</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="n">m</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="o">.</span><span class="n">size</span><span class="p">()</span>
    <span class="n">X</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">_add_bias</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">)</span>
    <span class="n">init_cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
    <span class="n">num_epochs</span><span class="o">=</span><span class="mi">0</span>
    <span class="k">while</span> <span class="ow">not</span> <span class="n">converged</span><span class="p">:</span>
        <span class="n">prev_cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">m</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">):</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">Theta</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">/</span> <span class="n">batch_size</span> <span class="o">*</span> <span class="n">torch</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span>
                <span class="n">X</span><span class="p">[</span><span class="n">i</span><span class="p">:</span><span class="n">i</span><span class="o">+</span><span class="n">batch_size</span><span class="p">]</span><span class="o">.</span><span class="n">transpose</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">_forward</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">X_train</span><span class="p">[</span><span class="n">i</span><span class="p">:</span> <span class="n">i</span><span class="o">+</span><span class="n">batch_size</span><span class="p">])</span> <span class="o">-</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_train</span><span class="p">[</span><span class="n">i</span><span class="p">:</span> <span class="n">i</span><span class="o">+</span><span class="n">batch_size</span><span class="p">]</span>
            <span class="p">)</span>
        <span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">cost</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">prev_cost</span><span class="o">-</span><span class="n">cost</span> <span class="o"><</span> <span class="n">tolerance</span><span class="p">:</span>
            <span class="n">converged</span><span class="o">=</span><span class="bp">True</span>
        <span class="n">num_epochs</span> <span class="o">+=</span> <span class="mi">1</span>
</code></pre></div></div>
<ul>
<li>
<p>Compared to stochastic gradient descent, mini-batch gradient descent will be faster only if a vectorized implementation is used for the updates.</p>
</li>
<li>
<p>Compared to batch gradient descent, mini-batch gradient descent is faster per update because fewer terms have to be summed for a single parameter update. Also, if both implementations are vectorized, mini-batch gradient descent will have lower memory usage. The speed of operations depends on the trade-off between the complexity of the matrix operations and memory usage.</p>
</li>
<li>
<p>Generally it is observed that mini-batch gradient descent converges faster than both stochastic and batch gradient descent.</p>
</li>
</ul>
<h3 id="online-learning">Online Learning</h3>
<p>Online learning is a form of learning in which the system has a continuous stream of training data. It runs stochastic gradient descent indefinitely on the input stream, discarding each example once the parameter update using it has been made.</p>
<p>It is observed that such an online learning setting is <strong>capable of learning the changing trends</strong> of data streams.</p>
<p>Typical domains where online learning can be successfully implemented include, search engines (predict click through rate i.e. CTR), recommendation websites etc.</p>
<p>Many of the listed problems can be modeled as a standard learning problem with fixed dataset, but often such data streams are available in such abundance that there is little utility of storing the data in place of implementing an online training system.</p>
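<p>A minimal sketch of such an online learner is shown below, assuming a logistic regression model (as might be used for click prediction) and a stream that yields one <code>(x, y)</code> pair at a time. The function names and the choice of model are assumptions for illustration.</p>

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def online_update(theta, x, y, alpha=0.1):
    """One stochastic gradient step for logistic regression on a single example."""
    return theta - alpha * (sigmoid(x @ theta) - y) * x


def learn_from_stream(stream, n_features, alpha=0.1):
    """Consume (x, y) pairs one at a time; nothing is stored after its update."""
    theta = np.zeros(n_features)
    for x, y in stream:
        theta = online_update(theta, x, y, alpha)
    return theta
```

<p>Because every update uses only the most recent example, the parameters keep drifting towards whatever relationship the latest data exhibits, which is what lets the model follow changing trends.</p>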
<h3 id="map-reduce-and-parallelism">Map Reduce and Parallelism</h3>
<p>Map-Reduce is a technique used in large scale learning when a single system is not enough to train the models required. Under this training paradigm, all the <strong>summation operations are parallelized over a set of slave systems by splitting the training data</strong> (batch or entire set) across the systems, which compute on the smaller datasets and feed their results to the <strong>master system that aggregates the results</strong> from all the slaves and combines them. This parallelized implementation boosts the speed of the algorithm.</p>
<p>Provided network latencies are low, a pool of \(n\) systems can speed up training by up to \(n\) times. In practice, because the systems communicate over a network, the speed-up is slightly less than \(n\) times.</p>
<blockquote>
<p>Algorithms that can be expressed as a summation over the training sets can be parallelized using map-reduce.</p>
</blockquote>
<p>Besides a pool of computers, parallelization also works on multi-core machines, with the added benefit of near-zero network latency and hence faster aggregation.</p>
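<p>The idea can be sketched on a single machine by splitting the gradient summation of \eqref{1} into chunks. In this illustration a thread pool stands in for the pool of systems, and the function names are hypothetical; a real deployment would distribute the chunks over separate machines or processes.</p>

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def partial_gradient_sum(args):
    """Map step: unnormalized gradient sum over one chunk of the training data."""
    X_chunk, y_chunk, theta = args
    return X_chunk.T @ (X_chunk @ theta - y_chunk)


def map_reduce_gradient(X, y, theta, n_workers=4):
    """Split the data, compute partial sums in parallel, then combine them."""
    X_chunks = np.array_split(X, n_workers)
    y_chunks = np.array_split(y, n_workers)
    tasks = [(Xc, yc, theta) for Xc, yc in zip(X_chunks, y_chunks)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_gradient_sum, tasks))
    # reduce step: add the partial sums and normalize by the number of examples
    return sum(partials) / len(y)
```

<p>The result is the same gradient that a single pass over the full dataset would produce, since addition can be regrouped freely across the chunks.</p>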
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://www.coursera.org/learn/machine-learning/lecture/CipHf/learning-with-large-datasets" target="_blank">Machine Learning: Coursera - Learning with Large Dataset</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/DoRHJ/stochastic-gradient-descent" target="_blank">Machine Learning: Coursera - Stochastic Gradient Descent</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/9zJUs/mini-batch-gradient-descent" target="_blank">Machine Learning: Coursera - Mini-Batch Gradient Descent</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/fKi0M/stochastic-gradient-descent-convergence" target="_blank">Machine Learning: Coursera - Convergence of Stochastic Gradient Descent</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/ABO2q/online-learning" target="_blank">Machine Learning: Coursera - Online Learning</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/10sqI/map-reduce-and-data-parallelism" target="_blank">Machine Learning: Coursera - Map Reduce and Data Parallelism</a></small></p>
Fri, 22 Jun 2018 00:00:00 +0000
https://machinelearningmedium.com/2018/06/22/large-scale-learning/
https://machinelearningmedium.com/2018/06/22/large-scale-learning/machine-learningandrew-ngbasics-of-machine-learningRecommender Systems<h3 id="basics-of-machine-learning-series">Basics of Machine Learning Series</h3>
<blockquote>
<p><a href="/collection/basics-of-machine-learning">Index</a></p>
</blockquote>
<div class="horizontal-divider">· · ·</div>
<h3 id="problem-formulation">Problem Formulation</h3>
<p>Given \(n_m\) choices and \(n_u\) users,</p>
<ul>
<li>\(r(i, j) = 1\) if user \(j\) has rated choice \(i\).</li>
<li>\(y(i,j)\) is the rating given by user \(j\) to the choice \(i\), defined only if \(r(i, j) = 1\).</li>
</ul>
<p><a href="https://www.kaggle.com/shamssam/recommender-systems" target="_blank"><strong>Kaggle Kernel</strong></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="c"># defining a ratings matrix, Y where 0's denote not rated</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span>
<span class="p">[</span>
<span class="p">[</span><span class="mf">3.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">,</span> <span class="mf">4.5</span><span class="p">,</span> <span class="mf">4.</span> <span class="p">,</span> <span class="mf">2.</span> <span class="p">],</span>
<span class="p">[</span><span class="mf">3.</span> <span class="p">,</span> <span class="mf">4.</span> <span class="p">,</span> <span class="mf">3.5</span><span class="p">,</span> <span class="mf">5.</span> <span class="p">,</span> <span class="mf">3.</span> <span class="p">],</span>
<span class="p">[</span><span class="mf">0.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">,</span> <span class="mf">3.</span> <span class="p">,</span> <span class="mf">5.</span> <span class="p">,</span> <span class="mf">3.</span> <span class="p">],</span>
<span class="p">[</span><span class="mf">4.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">,</span> <span class="mf">3.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">],</span>
<span class="p">[</span><span class="mf">0.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">,</span> <span class="mf">5.</span> <span class="p">,</span> <span class="mf">5.</span> <span class="p">,</span> <span class="mf">3.5</span><span class="p">],</span>
<span class="p">[</span><span class="mf">0.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">,</span> <span class="mf">5.</span> <span class="p">,</span> <span class="mf">4.</span> <span class="p">,</span> <span class="mf">3.5</span><span class="p">],</span>
<span class="p">[</span><span class="mf">0.</span> <span class="p">,</span> <span class="mf">5.</span> <span class="p">,</span> <span class="mf">5.</span> <span class="p">,</span> <span class="mf">5.</span> <span class="p">,</span> <span class="mf">4.5</span><span class="p">],</span>
<span class="p">[</span><span class="mf">4.</span> <span class="p">,</span> <span class="mf">4.</span> <span class="p">,</span> <span class="mf">2.5</span><span class="p">,</span> <span class="mf">5.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">],</span>
<span class="p">[</span><span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.</span> <span class="p">,</span> <span class="mf">4.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">,</span> <span class="mf">2.5</span><span class="p">],</span>
<span class="p">[</span><span class="mf">0.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">,</span> <span class="mf">4.</span> <span class="p">,</span> <span class="mf">0.</span> <span class="p">]</span>
<span class="p">]</span>
<span class="p">)</span>
<span class="c"># calculating matrix R from matrix Y</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">y</span> <span class="o">></span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<p>So, the objective of the recommender system is to use the ratings given by the population of users to predict the rating a user would give to a choice they have not rated, i.e. where \(r(i, j) = 0\). In most real-world cases, such as movie ratings, the number of unrated choices is very high, which makes this a non-trivial problem.</p>
<h3 id="content-based-recommendations">Content Based Recommendations</h3>
<ul>
<li>Each choice is allotted \(n\) features and scored along each of those dimensions.</li>
<li>Following this, for each user \(j\), the ratings are regressed as a function of the allotted set of features.</li>
<li>The learnt parameter for user \(j\), \(\theta^{(j)}\) lies in space \(\mathbb{R}^{n+1}\).</li>
</ul>
<p>Summarizing,</p>
<ul>
<li>\(\theta^{(j)}\) is the parameter vector for user \(j\).</li>
<li>\(x^{(i)}\) is the feature vector for choice \(i\).</li>
<li>For user \(j\) and choice \(i\), predicted rating is given by, \((\theta^{(j)})^T (x^{(i)})\).</li>
</ul>
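<p>As a small illustration of the prediction rule, the sketch below uses hypothetical values for one user and one choice with \(n = 2\) features; the numbers are assumptions chosen for the example, not learned values:</p>

```python
import numpy as np

# hypothetical values: n = 2 features, with x_0 = 1 prepended as the bias feature
x_i = np.array([1.0, 0.9, 0.1])      # feature vector of choice i
theta_j = np.array([0.5, 4.0, 0.0])  # parameter vector of user j

# predicted rating (theta^(j))^T x^(i)
predicted_rating = theta_j @ x_i
```

<p>Here the user weighs the first feature heavily and ignores the second, so the prediction is driven almost entirely by how strongly the choice exhibits the first feature.</p>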
<p>Suppose user \(j\) has rated \(m^{(j)}\) choices; then learning \(\theta^{(j)}\) can be treated as a linear regression problem. So, to learn \(\theta^{(j)}\),</p>
<script type="math/tex; mode=display">\min_{\theta^{(j)}} {1 \over 2} \sum_{i: r(i, j)=1} ((\theta^{(j)})^T (x^{(i)}) - y^{(i, j)})^2 + {\lambda \over 2} \sum_{k=1}^n (\theta_k^{(j)})^2 \tag{1} \label{1}</script>
<p>Similarly, to learn \(\theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(n_u)}\),</p>
<script type="math/tex; mode=display">\min_{\theta^{(1)}, \cdots, \theta^{(n_u)}} {1 \over 2} \sum_{j=1}^{n_u} \sum_{i: r(i, j)=1} ((\theta^{(j)})^T (x^{(i)}) - y^{(i, j)})^2 + {\lambda \over 2} \sum_{j=1}^{n_u} \sum_{k=1}^n (\theta_k^{(j)})^2 \tag{2} \label{2}</script>
<p>where cost function is given by,</p>
<script type="math/tex; mode=display">J(\theta^{(1)}, \cdots, \theta^{(n_u)}) = {1 \over 2} \sum_{j=1}^{n_u} \sum_{i: r(i, j)=1} ((\theta^{(j)})^T (x^{(i)}) - y^{(i, j)})^2 + {\lambda \over 2} \sum_{j=1}^{n_u} \sum_{k=1}^n (\theta_k^{(j)})^2 \tag{3} \label{3}</script>
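<p>The cost above can be evaluated in vectorized form. The following is a sketch under stated assumptions: the function name <code>content_based_cost</code> and the matrix layout (rows of <code>x</code> are choices, rows of <code>theta</code> are users, bias feature prepended inside the function) are choices made for this example:</p>

```python
import numpy as np

def content_based_cost(theta, x, y, r, _lambda=0.001):
    # theta: (n_u, n+1), x: (n_m, n), y and r: (n_m, n_u)
    x_b = np.hstack((np.ones((x.shape[0], 1)), x))  # prepend x_0 = 1
    error = (x_b @ theta.T - y) * r                 # zero where r(i, j) = 0
    squared_error = 0.5 * np.sum(error ** 2)
    # theta_0 (first column) is excluded from the regularization sum
    regularization = 0.5 * _lambda * np.sum(theta[:, 1:] ** 2)
    return squared_error + regularization

# tiny hypothetical example: 2 choices, 2 users, each user rated one choice
y_small = np.array([[3.0, 0.0],
                    [0.0, 4.0]])
r_small = np.where(y_small > 0, 1, 0)
cost_at_zero = content_based_cost(np.zeros((2, 3)), np.ones((2, 2)), y_small, r_small)
```

<p>Multiplying the error by <code>r</code> is what restricts the sum to the pairs with \(r(i, j) = 1\).</p>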
<p>Gradient Descent Update,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\theta_k^{(j)} &= \theta_k^{(j)} - \alpha \left( \sum_{i: r(i, j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i, j)}) x_k^{(i)} \right) \text{, for } k = 0 \\
\theta_k^{(j)} &= \theta_k^{(j)} - \alpha \left( \sum_{i: r(i, j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i, j)}) x_k^{(i)} + \lambda \theta_k^{(j)}\right) \text{, otherwise }
\end{align}
\tag{4} \label{4} %]]></script>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">estimate_theta_v2</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">max_k</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">theta</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
<span class="n">_alpha</span> <span class="o">=</span> <span class="mf">0.01</span><span class="p">,</span> <span class="n">_lambda</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">_tolerance</span> <span class="o">=</span> <span class="mf">0.001</span><span class="p">):</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">y</span> <span class="o">></span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">converged</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="n">max_i</span><span class="p">,</span> <span class="n">max_j</span> <span class="o">=</span> <span class="n">y</span><span class="o">.</span><span class="n">shape</span>
    <span class="c"># initialize randomly only when no estimate is passed in</span>
    <span class="k">if</span> <span class="n">x</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">max_i</span><span class="p">,</span> <span class="n">max_k</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">theta</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">theta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">max_j</span><span class="p">,</span> <span class="n">max_k</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
    <span class="c"># x stays fixed here, so the bias column x_0 = 1 is prepended once</span>
    <span class="n">x_b</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">x</span><span class="p">))</span>
    <span class="k">while</span> <span class="ow">not</span> <span class="n">converged</span><span class="p">:</span>
        <span class="c"># prediction error, masked to the rated entries</span>
        <span class="n">error</span> <span class="o">=</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x_b</span><span class="p">,</span> <span class="n">theta</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">r</span>
        <span class="c"># note: theta_0 is regularized here as well, unlike equation (4)</span>
        <span class="n">update_theta</span> <span class="o">=</span> <span class="n">_alpha</span> <span class="o">*</span> <span class="p">(</span>
            <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x_b</span><span class="o">.</span><span class="n">transpose</span><span class="p">(),</span> <span class="n">error</span><span class="p">)</span><span class="o">.</span><span class="n">transpose</span><span class="p">()</span> <span class="o">+</span> <span class="n">_lambda</span> <span class="o">*</span> <span class="n">theta</span>
        <span class="p">)</span>
        <span class="n">theta</span> <span class="o">=</span> <span class="n">theta</span> <span class="o">-</span> <span class="n">update_theta</span>
        <span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">update_theta</span><span class="p">))</span> <span class="o"><</span> <span class="n">_tolerance</span><span class="p">:</span>
            <span class="n">converged</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="k">return</span> <span class="n">theta</span><span class="p">,</span> <span class="n">x</span>
</code></pre></div></div>
<p>where,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac {\partial} {\partial \theta_k^{(j)}} J(\theta^{(1)}, \cdots, \theta^{(n_u)}) &= \sum_{i: r(i, j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i, j)}) x_k^{(i)} \text{, for } k = 0 \\
\frac {\partial} {\partial \theta_k^{(j)}} J(\theta^{(1)}, \cdots, \theta^{(n_u)}) &= \sum_{i: r(i, j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i, j)}) x_k^{(i)} + \lambda \theta_k^{(j)} \text{, otherwise }
\end{align}
\tag{5} \label{5} %]]></script>
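<p>One way to sanity-check the partial derivatives above is to compare them against central finite differences of the cost. The sketch below does this on small random data; the helper names <code>cost</code> and <code>gradient</code>, the problem sizes and the value of \(\lambda\) are assumptions made for the check:</p>

```python
import numpy as np

def cost(theta, x_b, y, r, _lambda):
    # squared error over rated entries plus regularization (theta_0 excluded)
    error = (x_b @ theta.T - y) * r
    return 0.5 * np.sum(error ** 2) + 0.5 * _lambda * np.sum(theta[:, 1:] ** 2)

def gradient(theta, x_b, y, r, _lambda):
    error = (x_b @ theta.T - y) * r
    grad = error.T @ x_b                   # sum over rated choices, shape (n_u, n+1)
    grad[:, 1:] += _lambda * theta[:, 1:]  # no regularization term for k = 0
    return grad

rng = np.random.default_rng(0)
n_m, n_u, n = 6, 4, 2
x_b = np.hstack((np.ones((n_m, 1)), rng.normal(size=(n_m, n))))
theta = rng.normal(size=(n_u, n + 1))
y = rng.integers(0, 6, size=(n_m, n_u)).astype(float)
r = np.where(y > 0, 1, 0)

# central finite differences on every theta_k^(j)
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(n_u):
    for k in range(n + 1):
        step = np.zeros_like(theta)
        step[j, k] = eps
        numeric[j, k] = (cost(theta + step, x_b, y, r, 0.1)
                         - cost(theta - step, x_b, y, r, 0.1)) / (2 * eps)

analytic = gradient(theta, x_b, y, r, 0.1)
```

<p>Since the cost is quadratic in \(\theta\), the numeric and analytic gradients should agree up to floating point rounding.</p>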
<p>Note: By convention, the \({1 \over m^{(j)}}\) terms are dropped from the equations in recommender systems. This does not change the optimal parameter values, since these terms only scale the cost by a positive constant; they appear in the linear regression cost function for ease of derivation.</p>
<blockquote>
<p>The effectiveness of content based recommendation depends on identifying the features properly, which is often not easy.</p>
</blockquote>
<h3 id="collaborative-filtering">Collaborative Filtering</h3>
<blockquote>
<p>Collaborative filtering has the intrinsic property of feature learning (i.e. it can learn by itself what features to use) which helps overcome drawbacks of content-based recommender systems.</p>
</blockquote>
<p>Given the ratings \(y(i, j)\) of a choice \(i \in [1, n_m]\) by various users \(j \in [1, n_u]\), and the parameter vector \(\theta^{(j)}\) for each user \(j\), the algorithm learns the values of the features \(x^{(i)}\) by regression, posing the following optimization problem,</p>
<script type="math/tex; mode=display">\min_{x^{(i)}} {1 \over 2} \sum_{j:r(i,j)=1} \left[ (\theta^{(j)})^T x^{(i)} - y(i,j) \right]^2 + {\lambda \over 2} \sum_{k=1}^n \left( x_k^{(i)} \right)^2 \tag{6} \label{6}</script>
<p>Intuitively, this boils down to the following scenario: given a choice, its ratings by various users, and those users’ parameter vectors, the collaborative filtering algorithm tries to find the features that best represent the choice, such that the squared error between the predicted and the actual ratings is minimized. Since this is very similar to the linear regression problem, a regularization term is introduced to prevent overfitting of the learnt features. By extending this, it is possible to learn the features for all the choices \(i \in [1, n_m]\), i.e. given \( \theta^{(1)}, \theta^{(2)}, \cdots, \theta^{(n_u)} \), learn \(x^{(1)}, x^{(2)}, \cdots, x^{(n_m)}\),</p>
<script type="math/tex; mode=display">\min_{x^{(1)}, \cdots, x^{(n_m)}} {1 \over 2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} \left[ (\theta^{(j)})^T x^{(i)} - y(i,j) \right]^2 + {\lambda \over 2} \sum_{i=1}^{n_m} \sum_{k=1}^n \left( x_k^{(i)} \right)^2 \tag{7} \label{7}</script>
<p>Where the updates to the feature vectors will be given by,</p>
<script type="math/tex; mode=display">x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{j:r(i,j)=1} \left[ (\theta^{(j)})^T x^{(i)} - y(i,j) \right] \theta_k^{(j)} + \lambda x_k^{(i)} \right) \tag{8} \label{8}</script>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">estimate_x_v2</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">max_k</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span> <span class="n">theta</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
<span class="n">_alpha</span> <span class="o">=</span> <span class="mf">0.01</span><span class="p">,</span> <span class="n">_lambda</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">_tolerance</span> <span class="o">=</span> <span class="mf">0.001</span><span class="p">):</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">y</span> <span class="o">></span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">converged</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="n">max_i</span><span class="p">,</span> <span class="n">max_j</span> <span class="o">=</span> <span class="n">y</span><span class="o">.</span><span class="n">shape</span>
    <span class="c"># initialize randomly only when no estimate is passed in</span>
    <span class="k">if</span> <span class="n">x</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">max_i</span><span class="p">,</span> <span class="n">max_k</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">theta</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">theta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">max_j</span><span class="p">,</span> <span class="n">max_k</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
    <span class="k">while</span> <span class="ow">not</span> <span class="n">converged</span><span class="p">:</span>
        <span class="c"># x changes every iteration, so the bias column is rebuilt each time</span>
        <span class="n">x_b</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">x</span><span class="p">))</span>
        <span class="c"># prediction error, masked to the rated entries</span>
        <span class="n">error</span> <span class="o">=</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x_b</span><span class="p">,</span> <span class="n">theta</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">r</span>
        <span class="c"># the first column belongs to the fixed bias feature x_0 and is dropped</span>
        <span class="n">update_x</span> <span class="o">=</span> <span class="n">_alpha</span> <span class="o">*</span> <span class="p">(</span>
            <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">error</span><span class="p">,</span> <span class="n">theta</span><span class="p">)[:,</span> <span class="mi">1</span><span class="p">:]</span> <span class="o">+</span> <span class="n">_lambda</span> <span class="o">*</span> <span class="n">x</span>
        <span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-</span> <span class="n">update_x</span>
        <span class="k">if</span> <span class="n">np</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">update_x</span><span class="p">))</span> <span class="o"><</span> <span class="n">_tolerance</span><span class="p">:</span>
            <span class="n">converged</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="k">return</span> <span class="n">theta</span><span class="p">,</span> <span class="n">x</span>
</code></pre></div></div>
<blockquote>
<p>It is possible to arrive at optimal \(\theta\) and \(x\) by alternately minimizing over each of them using \eqref{4} and \eqref{8}.</p>
</blockquote>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tolerance</span><span class="o">=</span><span class="mf">0.001</span>
<span class="n">max_k</span><span class="o">=</span><span class="mi">50</span>
<span class="c"># the order of application of the estimate_x and estimate_theta can be altered</span>
<span class="n">theta</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="n">estimate_x_v2</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">_tolerance</span><span class="o">=</span><span class="n">tolerance</span><span class="p">,</span> <span class="n">max_k</span><span class="o">=</span><span class="n">max_k</span><span class="p">)</span>
<span class="c"># iterating twice. more iterations would result in better convergence</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">):</span>
    <span class="n">theta</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="n">estimate_theta_v2</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="n">x</span><span class="p">,</span> <span class="n">theta</span><span class="o">=</span><span class="n">theta</span><span class="p">,</span> <span class="n">_tolerance</span><span class="o">=</span><span class="n">tolerance</span><span class="p">,</span> <span class="n">max_k</span><span class="o">=</span><span class="n">max_k</span><span class="p">)</span>
    <span class="n">theta</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="n">estimate_x_v2</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">x</span><span class="o">=</span><span class="n">x</span><span class="p">,</span> <span class="n">theta</span><span class="o">=</span><span class="n">theta</span><span class="p">,</span> <span class="n">_tolerance</span><span class="o">=</span><span class="n">tolerance</span><span class="p">,</span> <span class="n">max_k</span><span class="o">=</span><span class="n">max_k</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">10</span><span class="p">,</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">x</span><span class="p">)),</span> <span class="n">theta</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
</code></pre></div></div>
<p>But it is also possible to solve for both \(\theta\) and \(x\) simultaneously, by minimizing a single cost function that combines the cost in \eqref{3} with the objective in \eqref{7}. The resulting cost function is given by,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
J(x^{(1)}, \cdots, x^{(n_m)}, \theta^{(1)}, \cdots, \theta^{(n_u)}) &= {1 \over 2} \sum_{(i,j):r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i, j)})^2 \\
&+ {\lambda \over 2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 \\
&+ {\lambda \over 2} \sum_{j=1}^{n_u} \sum_{k=1}^n (\theta_k^{(j)})^2
\end{align}
\label{9} \tag{9} %]]></script>
<p>and the minimization objective can be written as,</p>
<script type="math/tex; mode=display">\min_{x^{(1)}, \cdots, x^{(n_m)}, \theta^{(1)}, \cdots, \theta^{(n_u)}} J(x^{(1)}, \cdots, x^{(n_m)}, \theta^{(1)}, \cdots, \theta^{(n_u)}) \tag{10} \label{10}</script>
<p>Practically, if \(x\) is kept constant, minimizing \eqref{10} reduces to the objective in \eqref{2}, which leads to the update in \eqref{4}. Similarly, if \(\theta\) is kept constant, it reduces to \eqref{7}, which leads to the update in \eqref{8}.</p>
<blockquote>
<p>In \eqref{10}, by convention there is no \(x_0=1\) and consequently no \(\theta_0\), hence \(x \in \mathbb{R}^n\) and \(\theta \in \mathbb{R}^n\).</p>
</blockquote>
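<p>The combined cost in \eqref{9} can be written compactly in vectorized form. This is a sketch under assumptions: the function name <code>collaborative_cost</code> and the layout (rows of <code>x</code> are choices, rows of <code>theta</code> are users, no bias column in either) are chosen for the example:</p>

```python
import numpy as np

def collaborative_cost(x, theta, y, r, _lambda=0.001):
    # x: (n_m, n), theta: (n_u, n), y and r: (n_m, n_u); no bias column
    error = (x @ theta.T - y) * r   # zero where r(i, j) = 0
    return (0.5 * np.sum(error ** 2)
            + 0.5 * _lambda * np.sum(x ** 2)
            + 0.5 * _lambda * np.sum(theta ** 2))

# tiny hypothetical example: 2 choices, 2 users, n = 2 features
y_small = np.array([[3.0, 0.0],
                    [0.0, 4.0]])
r_small = np.where(y_small > 0, 1, 0)
cost_at_zero = collaborative_cost(np.zeros((2, 2)), np.zeros((2, 2)), y_small, r_small)
```

<p>Because there is no bias term, both regularization sums run over every entry of \(x\) and \(\theta\).</p>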
<p>To summarize, the collaborative filtering algorithm has the following steps,</p>
<ul>
<li>Initialize \(x^{(1)}, \cdots, x^{(n_m)}, \theta^{(1)}, \cdots, \theta^{(n_u)}\) to small random values.</li>
<li>Minimize \eqref{9} using gradient descent or any other advanced optimization algorithm. The update rules given below are obtained by taking the partial derivatives with respect to the \(x\)’s and \(\theta\)’s.</li>
</ul>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
x_k^{(i)} &= x_k^{(i)} - \alpha \left( \sum_{j:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y(i,j)) \theta_k^{(j)} + \lambda x_k^{(i)} \right) \\
\theta_k^{(j)} &= \theta_k^{(j)} - \alpha \left( \sum_{i: r(i, j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i, j)}) x_k^{(i)} + \lambda \theta_k^{(j)}\right)
\end{align} \tag{11} \label{11} %]]></script>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">colaborative_filtering_v2</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">max_k</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">_alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">_lambda</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">_tolerance</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">r</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">ndarray</span><span class="p">):</span>
        <span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">y</span><span class="o">></span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
    <span class="n">converged</span> <span class="o">=</span> <span class="bp">False</span>
    <span class="n">max_i</span><span class="p">,</span> <span class="n">max_j</span> <span class="o">=</span> <span class="n">y</span><span class="o">.</span><span class="n">shape</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="n">max_i</span><span class="p">,</span> <span class="n">max_k</span><span class="p">)</span>
    <span class="n">theta</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">rand</span><span class="p">(</span><span class="n">max_j</span><span class="p">,</span> <span class="n">max_k</span><span class="p">)</span>
    <span class="k">while</span> <span class="ow">not</span> <span class="n">converged</span><span class="p">:</span>
        <span class="c"># prediction error on rated entries, shared by both updates</span>
        <span class="n">error</span> <span class="o">=</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">theta</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span> <span class="o">*</span> <span class="n">r</span>
        <span class="n">update_x</span> <span class="o">=</span> <span class="n">_alpha</span> <span class="o">*</span> <span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">error</span><span class="p">,</span> <span class="n">theta</span><span class="p">)</span> <span class="o">+</span> <span class="n">_lambda</span> <span class="o">*</span> <span class="n">x</span><span class="p">)</span>
        <span class="n">update_theta</span> <span class="o">=</span> <span class="n">_alpha</span> <span class="o">*</span> <span class="p">(</span>
            <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">transpose</span><span class="p">(),</span> <span class="n">error</span><span class="p">)</span><span class="o">.</span><span class="n">transpose</span><span class="p">()</span> <span class="o">+</span> <span class="n">_lambda</span> <span class="o">*</span> <span class="n">theta</span>
        <span class="p">)</span>
        <span class="c"># both updates use the old x and theta, i.e. a simultaneous update</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">-</span> <span class="n">update_x</span>
        <span class="n">theta</span> <span class="o">=</span> <span class="n">theta</span> <span class="o">-</span> <span class="n">update_theta</span>
        <span class="k">if</span> <span class="nb">max</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">update_x</span><span class="p">)),</span> <span class="n">np</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="nb">abs</span><span class="p">(</span><span class="n">update_theta</span><span class="p">)))</span> <span class="o"><</span> <span class="n">_tolerance</span><span class="p">:</span>
            <span class="n">converged</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="k">return</span> <span class="n">theta</span><span class="p">,</span> <span class="n">x</span>
<span class="n">theta</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="n">colaborative_filtering_v2</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">max_k</span><span class="o">=</span><span class="n">max_k</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">theta</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
</code></pre></div></div>
<ul>
<li>For a user with parameter \(\theta\) and a choice with learned features \(x\), the predicted star rating is given by \(\theta^T x\).</li>
</ul>
<p>Consequently, the matrix of ratings, \(Y\), can be written as,</p>
<script type="math/tex; mode=display">% <![CDATA[
Y = \left[
\begin{matrix}
(\theta^{(1)})^T x^{(1)} & (\theta^{(2)})^T x^{(1)} & \cdots & (\theta^{(n_u)})^T x^{(1)} \\
(\theta^{(1)})^T x^{(2)} & (\theta^{(2)})^T x^{(2)} & \cdots & (\theta^{(n_u)})^T x^{(2)} \\
\vdots & \vdots & \ddots & \vdots \\
(\theta^{(1)})^T x^{(n_m)} & (\theta^{(2)})^T x^{(n_m)} & \cdots & (\theta^{(n_u)})^T x^{(n_m)} \\
\end{matrix}
\right ] \tag{12} \label{12} %]]></script>
<p>Where \(y(i, j)\) is the rating for choice \(i\) by user \(j\).</p>
<p>Vectorized implementation of \eqref{12}, is given by,</p>
<script type="math/tex; mode=display">Y = X \Theta^T \tag{13} \label{13}</script>
<p>Where,</p>
<ul>
<li>each row \(i\) in \(X\) represents the feature vector of choice \(i\).</li>
<li>each row \(j\) in \(\Theta\) represents the parameter vector for user \(j\).</li>
</ul>
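<p>A minimal sketch of \eqref{13}, assuming small random matrices purely for illustration (the sizes and the names <code>X</code>, <code>Theta</code> and <code>Y_loop</code> are hypothetical and not taken from the notebook). It checks that the vectorized product matches the element-wise definition in \eqref{12}, and that the rank of \(Y\) cannot exceed the number of features \(n\).</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_m, n_u, n = 4, 3, 2               # choices, users, features (illustrative sizes)
X = rng.normal(size=(n_m, n))       # row i: feature vector of choice i
Theta = rng.normal(size=(n_u, n))   # row j: parameter vector of user j

# Vectorized form: Y = X Theta^T
Y = X @ Theta.T

# Element-wise form: Y[i, j] = (theta^(j))^T x^(i)
Y_loop = np.array([[Theta[j] @ X[i] for j in range(n_u)] for i in range(n_m)])

print(np.allclose(Y, Y_loop))
print(np.linalg.matrix_rank(Y))     # bounded above by n
```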
<blockquote>
<p>The algorithm discussed is also called low rank matrix factorization, a name that comes from the low rank property of the matrix \(Y\) in linear algebra.</p>
</blockquote>
<h3 id="similar-recommendations">Similar Recommendations</h3>
<p>After the collaborative filtering algorithm has converged, it can be used to find related choices. For each choice \(i\), a feature vector is learned, \(x^{(i)} \in \mathbb{R}^n\). Although it is generally not possible to decipher what the values in the matrix \(X\) denote, they encode representative features of the choices in detail. So in order to find choices close to a given choice \(i\), a simple Euclidean distance calculation will give the desired results.</p>
<blockquote>
<p>If the distance between choices \(i\) and \(j\) is small, i.e. \(\lVert x^{(i)} - x^{(j)} \rVert\) is small, then they are similar.</p>
</blockquote>
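<p>The distance rule above can be sketched as follows. The helper <code>most_similar</code> and the toy matrix <code>x_demo</code> are hypothetical names introduced for illustration; in practice the learned matrix <code>x</code> returned by the collaborative filtering would be passed in.</p>

```python
import numpy as np

def most_similar(x, i, k=3):
    """Return the indices of the k choices closest to choice i (Euclidean distance)."""
    distances = np.linalg.norm(x - x[i], axis=1)
    distances[i] = np.inf  # exclude the query choice itself
    return np.argsort(distances)[:k]

# toy feature matrix: each row is the feature vector of one choice
x_demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.3]])
nearest = most_similar(x_demo, 0, k=2)
print(nearest)
```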
<h3 id="mean-normalization">Mean Normalization</h3>
<p>Consider a case where one of the users has not rated any of the choices, then the rating matrix Y can be defined as,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">y</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">y</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="mi">1</span><span class="p">))))</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">y</span> <span class="o">></span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<p>Since none of the choices are rated by this user, the entries in R matrix corresponding to this user would be all zeros. So, \eqref{9} can be written as follows (because \({1 \over 2} \sum_{(i,j):r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i, j)})^2 = 0\)),</p>
<script type="math/tex; mode=display">J(x^{(1)}, \cdots, x^{(n_m)}, \theta^{(1)}, \cdots, \theta^{(n_u)}) = {\lambda \over 2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 + {\lambda \over 2} \sum_{j=1}^{n_u} \sum_{k=1}^n (\theta_k^{(j)})^2 \tag{14} \label{14}</script>
<p>Since the updates to \(\theta\) corresponding to this user are governed only by this cost function, gradient descent would simply shrink this user's parameter vector \(\theta\) towards zero, so every predicted rating for this user would be close to 0. This can be seen easily by setting a low tolerance for the collaborative filtering,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">max_k</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">tolerance</span> <span class="o">=</span> <span class="mf">0.0000001</span>
<span class="n">theta</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="n">colaborative_filtering_v2</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">max_k</span><span class="o">=</span><span class="n">max_k</span><span class="p">,</span> <span class="n">_tolerance</span><span class="o">=</span><span class="n">tolerance</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">theta</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span>
</code></pre></div></div>
<p>Obviously it is not a correct assumption to rate all the choices 0 for a user that has rated none so far. For such a user it would be ideal to predict, for each choice, the average of the ratings given to that choice by other users so far.</p>
<p>Mean normalization helps in achieving this. In this process each row of the ratings matrix is normalized by its mean and later denormalized after predictions.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">normalized</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">max_k</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span>
<span class="n">_alpha</span><span class="o">=</span><span class="mf">0.01</span><span class="p">,</span> <span class="n">_lambda</span><span class="o">=</span><span class="mf">0.001</span><span class="p">,</span> <span class="n">_tolerance</span><span class="o">=</span><span class="mf">0.001</span><span class="p">):</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">where</span><span class="p">(</span><span class="n">y</span><span class="o">></span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">y_sum</span> <span class="o">=</span> <span class="n">y</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">r_sum</span> <span class="o">=</span> <span class="n">r</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">y_mean</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span><span class="n">y_sum</span><span class="o">/</span><span class="n">r_sum</span><span class="p">)</span><span class="o">.</span><span class="n">transpose</span><span class="p">()</span>
<span class="n">y_norm</span> <span class="o">=</span> <span class="n">y</span> <span class="o">-</span> <span class="n">y_mean</span>
<span class="n">theta</span><span class="p">,</span> <span class="n">x</span> <span class="o">=</span> <span class="n">colaborative_filtering_v2</span><span class="p">(</span><span class="n">y_norm</span><span class="p">,</span> <span class="n">max_k</span><span class="p">,</span> <span class="n">_alpha</span><span class="p">,</span> <span class="n">_lambda</span><span class="p">,</span> <span class="n">_tolerance</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span>
<span class="k">return</span> <span class="n">theta</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y_mean</span>
<span class="n">theta</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y_mean</span> <span class="o">=</span> <span class="n">normalized</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">max_k</span><span class="o">=</span><span class="n">max_k</span><span class="p">,</span> <span class="n">_tolerance</span><span class="o">=</span><span class="n">tolerance</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">theta</span><span class="o">.</span><span class="n">transpose</span><span class="p">())</span> <span class="o">+</span> <span class="n">y_mean</span>
</code></pre></div></div>
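<p>One caveat with the row means: if a choice has no ratings at all, dividing by the rating count produces <code>nan</code>. A minimal sketch of a guarded mean, assuming a fallback mean of 0 for such rows (the helper <code>row_means_over_rated</code> and the demo arrays are hypothetical and not part of the original notebook):</p>

```python
import numpy as np

def row_means_over_rated(y, r):
    """Mean of each row of y over entries where r == 1; 0 for rows without ratings."""
    y_sum = (y * r).sum(axis=1)
    r_sum = r.sum(axis=1)
    means = np.divide(
        y_sum, r_sum,
        out=np.zeros(len(r_sum), dtype=float),  # fallback value for unrated rows
        where=r_sum > 0,
    )
    return means.reshape(-1, 1)

# rows are choices, columns are users; the last user has rated nothing
y_demo = np.array([[5, 4, 0], [0, 0, 0], [1, 0, 0]])
r_demo = np.where(y_demo > 0, 1, 0)
y_mean_demo = row_means_over_rated(y_demo, r_demo)

# with theta = 0 for the new user, the denormalized prediction is just the row mean
new_user_prediction = np.zeros((3, 1)) + y_mean_demo
print(new_user_prediction.ravel())
```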
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://www.coursera.org/learn/machine-learning/lecture/uG59z/content-based-recommendations" target="_blank">Machine Learning: Coursera - Content Based Recommendations</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/2WoBV/collaborative-filtering" target="_blank">Machine Learning: Coursera - Collaborative Filtering</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/f26nH/collaborative-filtering-algorithm" target="_blank">Machine Learning: Coursera - Algorithm</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/CEXN0/vectorization-low-rank-matrix-factorization" target="_blank">Machine Learning: Coursera - Low Rank Matrix Factorization</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/Adk8G/implementational-detail-mean-normalization" target="_blank">Machine Learning: Coursera - Mean Normalization</a></small></p>
Fri, 11 May 2018 00:00:00 +0000
https://machinelearningmedium.com/2018/05/11/recommender-systems/
https://machinelearningmedium.com/2018/05/11/recommender-systems/machine-learningandrew-ngbasics-of-machine-learningAnomaly Detection<h3 id="basics-of-machine-learning-series">Basics of Machine Learning Series</h3>
<blockquote>
<p><a href="/collection/basics-of-machine-learning">Index</a></p>
</blockquote>
<div class="horizontal-divider">· · ·</div>
<h3 id="introduction">Introduction</h3>
<blockquote>
<p>Anomaly detection is primarily an unsupervised learning problem, but some aspects of it are like supervised learning problems.</p>
</blockquote>
<p>Consider a set of points, \(\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}\), in a training set (represented by blue points) representing the regular distribution of features \(x_1^{(i)}\) and \(x_2^{(i)}\). The aim of anomaly detection is to sift out anomalies from the test set (represented by the red points) based on the distribution of features in the training set. For example, in the plot below, while point A is not an outlier, points B and C in the test set can be considered <strong>anomalous (or outliers)</strong>.</p>
<p><img src="/assets/2018-05-02-anomaly-detection/fig-1-anomaly.png?raw=true" alt="Fig-1 Anomaly" /></p>
<p>Formally, in anomaly detection the \(m\) training examples are considered to be normal or non-anomalous, and the algorithm must then decide whether the next example, \(x_{test}\), is anomalous or not. So given the training set, it must come up with a model \(p(x)\) that gives the probability of a sample being normal (high probability is normal, low probability is anomaly), and the resulting decision boundary is defined by,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(x_{test}) &\lt \epsilon \text{, flag as outlier or anomaly} \\
p(x_{test}) &\geq \epsilon \text{, flag as normal or non-anomalous}
\end{align}
\tag{1} \label{1} %]]></script>
<p>Some of the popular applications of anomaly detection are,</p>
<ul>
<li><strong>Fraud Detection:</strong> An observation \(x^{(i)}\) would represent the activities of user \(i\). Model \(p(x)\) is trained on the data from various users, and unusual users are identified by checking which have \(p(x^{(i)}) \lt \epsilon \).</li>
<li><strong>Manufacturing:</strong> Based on features of products produced on a production line, one can identify the ones with outlier characteristics for quality control and other such preventive measures.</li>
<li><strong>Monitoring Systems in a Data Center:</strong> Based on characteristics of a machine's behaviour such as CPU load, memory usage etc., it is possible to identify the anomalous machines, prevent failure of nodes in a data center and initiate diagnostic measures for maximum up-time.</li>
</ul>
<h3 id="gaussian-distribution">Gaussian Distribution</h3>
<blockquote>
<p>Gaussian distribution is also called Normal Distribution.</p>
</blockquote>
<p>For a basic derivation, refer to <a href="/2017/07/31/normal-distribution/" target="\_blank"><strong>Normal Distribution</strong></a>.</p>
<p>If \(x \in \mathbb{R}\) and \(x\) follows a Gaussian distribution with mean \(\mu\) and variance \(\sigma^2\), this is denoted as,</p>
<script type="math/tex; mode=display">x \sim \mathcal{N}(\mu, \sigma^2) \label{2} \tag{2}</script>
<p>A standard normal gaussian distribution is a bell-shaped probability distribution curve with mean, \(\mu=0\) and standard deviation, \(\sigma=1\), as shown in the plot below.</p>
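<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>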
<span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">5000000</span><span class="p">)</span>
<span class="n">n</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">data</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">500</span><span class="p">))</span>
</code></pre></div></div>
<p><img src="/assets/2018-05-02-anomaly-detection/fig-2-gaussian-distribution.png?raw=true" alt="Fig-2 Gaussian Distribution" /></p>
<p>The parameters \(\mu\) and \(\sigma\) signify the centring and spread of the gaussian curve as marked in the plot above. It can also be seen that the density is higher around the mean and reduces rapidly as distance from mean increases.</p>
<p>The probability density of \(x\) in a Gaussian distribution, \(\mathcal{N}(\mu, \sigma^2)\), is given by,</p>
<script type="math/tex; mode=display">p(x;\mu, \sigma^2) = {1 \over \sqrt{2\pi} \sigma} \exp\left(- \frac {(x - \mu)^2} {2\sigma^2}\right) \tag{3} \label{3}</script>
<p>where,</p>
<ul>
<li>\(\mu\) is the mean,</li>
<li>\(\sigma\) is the standard deviation (\(\sigma^2\) is the variance)</li>
</ul>
<p>The effect of mean and standard deviation on a Gaussian plot can be seen clearly in figure below.</p>
<p><img src="/assets/2018-05-02-anomaly-detection/fig-3-effect-of-mean-and-standard-deviation.png?raw=true" alt="Fig-3 Effect of Mean and Standard Deviation" /></p>
<p>It can be noticed that, while mean, \(\mu\) defines the centering of the distribution, the standard deviation, \(\sigma\), defines the spread of the distribution. Also, as the spread increases the height of the plot decreases, because the total area under a probability distribution should always integrate to the value 1.</p>
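<p>The relation between spread and height can be checked numerically. The sketch below assumes a hypothetical helper <code>gaussian_pdf</code> implementing \eqref{3} and approximates the area with a simple Riemann sum on a fine grid; the grid limits and step are arbitrary illustrative choices.</p>

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

grid = np.linspace(-50, 50, 200001)
dx = grid[1] - grid[0]

areas, peaks = {}, {}
for sigma in (0.5, 1.0, 2.0):
    density = gaussian_pdf(grid, 0.0, sigma)
    areas[sigma] = np.sum(density) * dx   # Riemann-sum approximation of the area
    peaks[sigma] = density.max()          # height of the curve at the mean

print(areas)   # each area is approximately 1
print(peaks)   # the height drops as sigma grows
```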
<p>Given a dataset, as in the previous section, \(\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}\), it is possible to determine the approximate (or the most fitting) gaussian distribution by using the following <strong>parameter estimation</strong>,</p>
<script type="math/tex; mode=display">\mu = {1 \over m} \sum_{i=1}^m x^{(i)} \tag{4} \label{4}</script>
<script type="math/tex; mode=display">\sigma^2 = {1 \over m} \sum_{i=1}^m (x^{(i)} - \mu)^2 \tag{5} \label{5}</script>
<p><strong>There is an alternative formula with the constant \({1 \over m-1}\) but in machine learning the formulae \eqref{4} and \eqref{5} are more prevalent. Both are practically very similar for high values of \(m\).</strong></p>
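<p>In NumPy the two variance formulas correspond to the <code>ddof</code> argument of <code>np.var</code>. The sketch below uses a synthetic sample (the mean 3 and standard deviation 2 are arbitrary assumptions) to show that the two estimates nearly coincide for large \(m\).</p>

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=3.0, scale=2.0, size=10000)
m = sample.shape[0]

mu_hat = sample.mean()              # eq. (4)
var_m = np.var(sample)              # divides by m (ddof=0 is the default), eq. (5)
var_m1 = np.var(sample, ddof=1)     # divides by m - 1

print(mu_hat, var_m, var_m1)
```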
<h3 id="density-estimation-algorithm">Density Estimation Algorithm</h3>
<p>The Gaussian distribution explained above can be used to model an anomaly detection algorithm for training data, \(\{x^{(1)}, x^{(2)}, \cdots, x^{(m)}\}\), where each \(x^{(i)}\) is a set of \(n\) features, \(\{x_1^{(i)}, x_2^{(i)}, \cdots, x_n^{(i)}\}\). Then, \(p(x)\) in \eqref{1} is given by,</p>
<script type="math/tex; mode=display">p(x) = p(x_1; \mu_1, \sigma_1^2) p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2) \tag{6} \label{6}</script>
<blockquote>
<p>Assumption: The features \(\{x_1, x_2, \cdots, x_n\}\) are independent of each other.</p>
</blockquote>
<p>where,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
x_1 &\sim \mathcal{N}(\mu_1, \sigma_1^2) \\
x_2 &\sim \mathcal{N}(\mu_2, \sigma_2^2) \\
\vdots \\
x_j &\sim \mathcal{N}(\mu_j, \sigma_j^2) \\
\vdots \\
x_n &\sim \mathcal{N}(\mu_n, \sigma_n^2) \\
\end{align} %]]></script>
<p>And, \eqref{6}, can be written as,</p>
<script type="math/tex; mode=display">p(x) = \prod_{j=1}^n p(x_j; \mu_j, \sigma_j^2) \tag{7} \label{7}</script>
<p>This estimation of \(p(x)\) in \eqref{7} is called the <strong>density estimation</strong>.</p>
<p>To summarize:</p>
<ul>
<li>Choose features \(x_i\) that are indicative of anomalous behaviour (general properties that define an instance).</li>
<li>Fit parameters, \(\mu_1, \cdots, \mu_n, \sigma_1^2, \cdots, \sigma_n^2\), given by,</li>
</ul>
<script type="math/tex; mode=display">\mu_j = {1 \over m} \sum_{i=1}^m x_j^{(i)} \tag{8} \label{8}</script>
<script type="math/tex; mode=display">\sigma_j^2 = {1 \over m} \sum_{i=1}^m (x_j^{(i)} - \mu_j)^2 \tag{9} \label{9}</script>
<ul>
<li>Given a new example, compute \(p(x)\), using \eqref{6} and \eqref{3}, and mark as anomalous based on \eqref{1}.</li>
</ul>
<p><strong>Implementation</strong></p>
<p><a href="https://github.com/shams-sam/CourseraMachineLearningAndrewNg/blob/master/Anomaly%20Detection.ipynb" target="\_blank"><strong>Ipython Notebook</strong></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">vstack</span><span class="p">((</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">15</span><span class="p">,</span> <span class="p">(</span><span class="mi">50</span><span class="p">,</span> <span class="mi">2</span><span class="p">)),</span>
<span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="p">(</span><span class="mi">4</span><span class="p">,</span> <span class="mi">2</span><span class="p">)),</span>
<span class="p">))</span>
<span class="c"># split into train and test</span>
<span class="n">X_train</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:</span><span class="mi">50</span><span class="p">]</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">X</span><span class="p">[</span><span class="mi">50</span><span class="p">:]</span>
<span class="c"># density estimation</span>
<span class="n">mu</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="n">X_train</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">X_train</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">sigma_squared</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="n">X_train</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">((</span><span class="n">X_train</span> <span class="o">-</span> <span class="n">mu</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="c"># probability calculation for test</span>
<span class="k">def</span> <span class="nf">p</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma_squared</span><span class="p">):</span>
<span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">prod</span><span class="p">(</span><span class="mi">1</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="o">*</span><span class="n">sigma_squared</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="p">(</span><span class="n">x</span><span class="o">-</span><span class="n">mu</span><span class="p">)</span><span class="o">**</span><span class="mi">2</span><span class="o">/</span><span class="p">(</span><span class="mi">2</span><span class="o">*</span><span class="n">sigma_squared</span><span class="p">)),</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">p_test</span> <span class="o">=</span> <span class="n">p</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma_squared</span><span class="p">)</span>
<span class="c"># visualization using contour plot</span>
<span class="n">delta</span> <span class="o">=</span> <span class="mf">0.025</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="n">delta</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="n">delta</span><span class="p">)</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">meshgrid</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">p</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">y</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))),</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma_squared</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
<span class="n">CS</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">contour</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">clabel</span><span class="p">(</span><span class="n">CS</span><span class="p">,</span> <span class="n">inline</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">[:</span><span class="mi">50</span><span class="p">,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X</span><span class="p">[:</span><span class="mi">50</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="s">'b'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">[</span><span class="mi">50</span><span class="p">:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X</span><span class="p">[</span><span class="mi">50</span><span class="p">:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
<span class="c"># looking at the plot setting epsilon around p=0.003 seems like a fair value.</span>
</code></pre></div></div>
<p><img src="/assets/2018-05-02-anomaly-detection/fig-4-density-estimation.png?raw=true" alt="Fig-4 Density Estimation" width="75%" /></p>
<h3 id="evaluation-of-anomaly-detection-system">Evaluation of Anomaly Detection System</h3>
<blockquote>
<p>Single real-valued evaluation metrics would help in considering or rejecting a choice for improvement of an anomaly detection system.</p>
</blockquote>
<p>In order to evaluate an anomaly detection system, it is important to have a labeled dataset (similar to a supervised learning algorithm). This dataset would generally be skewed with a high number of normal cases. In order to evaluate the algorithm follow the steps (\(y=0\) is normal and \(y=1\) is anomalous),</p>
<ul>
<li>split the examples with \(y=0\) into 60-20-20 train-validation-test splits.</li>
<li>split the examples with \(y=1\) into 50-50 validation-test splits.</li>
<li>perform density estimation on the train set.</li>
<li>check the performance on the cross-validation set to find out metrics like true positive, true negative, false positive, false negative, precision/recall, f1-score. <strong>Accuracy score would not be a valid metric because in most cases the classes would be highly skewed</strong> (refer <a href="/2018/04/08/error-metrics-for-skewed-data-and-large-datasets/"><strong>Error Metrics for Skewed Data</strong></a>).</li>
<li>Following this, the value of \(\epsilon\) can be altered on the cross-validation set to improve the desired metric in the previous step.</li>
<li>The evaluation of the final model on the held-out test set would give an unbiased picture of how the model performs.</li>
</ul>
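<p>The threshold search in the steps above can be sketched as follows, assuming F1-score is the metric being optimized. The helpers <code>f1_score</code> and <code>select_epsilon</code> and the toy validation arrays are hypothetical; in practice <code>p_val</code> would hold \(p(x)\) evaluated on the cross-validation examples.</p>

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1-score for the positive (anomalous, y=1) class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_epsilon(p_val, y_val, n_steps=1000):
    """Scan candidate thresholds and keep the first one with the best F1-score."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_val.min(), p_val.max(), n_steps):
        y_pred = (p_val < eps).astype(int)   # flag as anomaly when p(x) < epsilon
        score = f1_score(y_val, y_pred)
        if score > best_f1:
            best_eps, best_f1 = eps, score
    return best_eps, best_f1

# toy cross-validation densities and labels (1 = anomalous)
p_val_demo = np.array([0.20, 0.15, 0.18, 0.001, 0.22, 0.002])
y_val_demo = np.array([0, 0, 0, 1, 0, 1])
eps_demo, f1_demo = select_epsilon(p_val_demo, y_val_demo)
print(eps_demo, f1_demo)
```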
<h3 id="anomaly-detection-vs-supervised-learning">Anomaly Detection vs Supervised Learning</h3>
<blockquote>
<p>A natural question arises, “If we have labeled data, why not use a supervised learning algorithm like logistic regression or SVM?”.</p>
</blockquote>
<p>Even though there are no hard-and-fast rules about when to use what, there are a few recommendations based on observations of the learning performance of different algorithms in such settings. They are listed below,</p>
<ul>
<li>In an anomaly detection setting, it is generally the case that there is a very small number of positive examples (i.e. \(y=1\) or the anomalous examples) and a large number of negative examples (i.e. \(y=0\) or normal examples). On the contrary, for supervised learning there is a large number of positive and negative examples.</li>
<li>Often there are many different types of anomalies that a sample might present (including types that have not been seen so far). If the set of positive examples is too small to learn from, an anomaly detection algorithm stands a better chance of performing well. A supervised learning algorithm, on the other hand, needs a larger set of both positive and negative examples to learn what differentiates the two, and it works best when future anomalies are likely to resemble the ones already present in the training set.</li>
</ul>
<h3 id="choosing-features">Choosing Features</h3>
<blockquote>
<p>Feature engineering (or choosing the features which should be used) has a great deal of effect on the performance of an anomaly detection algorithm.</p>
</blockquote>
<ul>
<li>Since the algorithm tries to fit a Gaussian distribution through the dataset, it is always helpful if the histogram of the data fed to the density estimation looks similar to a Gaussian bell shape.</li>
<li>If the data is not in-line with the shape of a Gaussian bell curve, sometimes a transformation can help bring the feature closer to a Gaussian approximation.</li>
</ul>
<p>Some of the popular transforms used are,</p>
<ul>
<li>\(log(x)\)</li>
<li>\(log(x + c)\)</li>
<li>\(\sqrt{x}\)</li>
<li>
<p>\(x^{ {1 \over 3} }\)</p>
</li>
<li>Choosing viable features for the algorithm sometimes depends on domain knowledge, as it helps in selecting the observations that one is targeting as possible features. For example, network load and requests per minute might be good features for anomaly detection in a data center. Sometimes it is possible to come up with combined features to achieve the same objective. So the rule of thumb is to come up with features that are found to differ substantially between the normal and anomalous examples.</li>
</ul>
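<p>A small sketch of the effect of such a transform, assuming a synthetic log-normal feature so that \(log(x)\) is Gaussian by construction. The helper <code>skewness</code> is a hypothetical name; it computes the standardized third moment, which is close to 0 for a symmetric bell-shaped histogram.</p>

```python
import numpy as np

def skewness(a):
    """Standardized third central moment; approximately 0 for symmetric data."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.0, sigma=1.0, size=10000)   # heavily right-skewed feature
transformed = np.log(raw)                              # log transform of the feature

print(skewness(raw), skewness(transformed))
```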
<h3 id="multivariate-gaussian-distribution">Multivariate Gaussian Distribution</h3>
<p>The <a href="#density-estimation-algorithm">density estimation</a> seen earlier had the underlying assumption that the features are independent of each other. While the assumption simplifies the analysis there are various downsides to the assumption as well.</p>
<p>Consider the data as shown in the plot below. It can be seen clearly that there is some correlation (negative correlation to be exact) among the two features.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">50</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="n">X</span><span class="p">[:</span><span class="mi">10</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">X</span><span class="p">[</span><span class="mi">10</span><span class="p">:</span><span class="mi">40</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">30</span><span class="p">)</span>
<span class="n">X</span><span class="p">[</span><span class="mi">40</span><span class="p">:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">30</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="o">-</span><span class="mi">3</span> <span class="o">*</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="mi">20</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">50</span><span class="p">,)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span>
<span class="p">[</span><span class="mf">10.</span><span class="p">,</span> <span class="o">-</span><span class="mf">100.</span><span class="p">],</span>
<span class="p">[</span><span class="mf">40.</span><span class="p">,</span> <span class="o">-</span><span class="mf">40.</span><span class="p">]</span>
<span class="p">])</span>
<span class="k">def</span> <span class="nf">normalize</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
    <span class="n">X_mean</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="n">X_std_dev</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">X</span><span class="o">-</span><span class="n">X_mean</span><span class="p">)</span><span class="o">/</span><span class="n">X_std_dev</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">normalize</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">X_test</span> <span class="o">=</span> <span class="n">normalize</span><span class="p">(</span><span class="n">X_test</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="s">'b'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_test</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X_test</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'scaled'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/assets/2018-05-02-anomaly-detection/fig-5-correlated-features.png?raw=true" alt="Fig-5 Correlated Features" /></p>
<p>Fitting independent univariate Gaussians to this data gives the contour plot on the left below, which exposes the independence assumption made in \eqref{7}: <strong>although the two features are negatively correlated, the contours do not show any such dependency</strong>. In contrast, the multivariate Gaussian distribution fitted to the same data (right) captures the correlation. Comparing the two, the test points (red) are less likely to be marked as normal under the multivariate Gaussian than under the univariate model.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mu</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">sigma</span> <span class="o">=</span> <span class="n">X</span><span class="o">.</span><span class="n">std</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">p</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">)</span>
<span class="n">mu_mv</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">sigma_mv</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">matmul</span><span class="p">((</span><span class="n">X</span> <span class="o">-</span> <span class="n">mu_mv</span><span class="p">)</span><span class="o">.</span><span class="n">transpose</span><span class="p">(),</span> <span class="p">(</span><span class="n">X</span><span class="o">-</span><span class="n">mu_mv</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">p_mv</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma</span><span class="p">):</span>
    <span class="n">res</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">x_i</span> <span class="ow">in</span> <span class="n">x</span><span class="p">:</span>
        <span class="n">res</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="mi">1</span> <span class="o">/</span> <span class="p">((</span><span class="mi">2</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">pi</span><span class="p">)</span> <span class="o">**</span> <span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">/</span><span class="mi">2</span><span class="p">))</span> <span class="o">/</span> <span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">det</span><span class="p">(</span><span class="n">sigma</span><span class="p">))</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="mf">0.5</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">x_i</span><span class="o">-</span><span class="n">mu</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">linalg</span><span class="o">.</span><span class="n">pinv</span><span class="p">(</span><span class="n">sigma</span><span class="p">),</span> <span class="p">(</span><span class="n">x_i</span><span class="o">-</span><span class="n">mu</span><span class="p">)</span><span class="o">.</span><span class="n">transpose</span><span class="p">()))))</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">res</span><span class="p">)</span>
<span class="n">p_mv</span><span class="p">(</span><span class="n">X_test</span><span class="p">,</span> <span class="n">mu_mv</span><span class="p">,</span> <span class="n">sigma_mv</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">7</span><span class="p">))</span>
<span class="n">delta</span> <span class="o">=</span> <span class="mf">0.025</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">delta</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">delta</span><span class="p">)</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">meshgrid</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">p</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">y</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))),</span> <span class="n">mu</span><span class="p">,</span> <span class="n">sigma_squared</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">CS</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">contour</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">clabel</span><span class="p">(</span><span class="n">CS</span><span class="p">,</span> <span class="n">inline</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="s">'b'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_test</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X_test</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'scaled'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'univariate gaussian distribution'</span><span class="p">)</span>
<span class="n">delta</span> <span class="o">=</span> <span class="mf">0.025</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">delta</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="n">delta</span><span class="p">)</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">meshgrid</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">z</span> <span class="o">=</span> <span class="n">p_mv</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">x</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">y</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">))),</span> <span class="n">mu_mv</span><span class="p">,</span> <span class="n">sigma_mv</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="n">x</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">CS</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">contour</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">clabel</span><span class="p">(</span><span class="n">CS</span><span class="p">,</span> <span class="n">inline</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">12</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="s">'b'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.7</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_test</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X_test</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">c</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'scaled'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'multivariate gaussian distribution'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/assets/2018-05-02-anomaly-detection/fig-6-univariate-vs-multivariate.png?raw=true" alt="Fig-6 Univariate vs Multivariate Gaussian" /></p>
<p>So, the multivariate Gaussian distribution models \(p(x)\) in one go, unlike \eqref{7}, which models the individual features \(\{x_1, x_2, \cdots, x_n\}\) in \(x\) separately. The multivariate Gaussian distribution is given by,</p>
<script type="math/tex; mode=display">p(x; \mu, \Sigma) = \frac {1} {(2\pi)^{n/2} \, |\Sigma|^{1/2}} \exp\left(-{1 \over 2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right) \tag{10} \label{10}</script>
<p>where,</p>
<ul>
<li>\(\mu \in \mathbb{R}^n\) and \(\Sigma \in \mathbb{R}^{n \times n}\) are the parameters of the distribution.</li>
<li>\(|\Sigma|\) is the determinant of the matrix \(\Sigma\).</li>
</ul>
<p>The density estimation for multivariate gaussian distribution can be done using the following 2 formulae,</p>
<script type="math/tex; mode=display">\mu = {1 \over m} \sum_{i=1}^m x^{(i)} \tag{11} \label{11}</script>
<script type="math/tex; mode=display">\Sigma = {1 \over m} \sum_{i=1}^m (x^{(i)} - \mu) (x^{(i)} - \mu)^T \tag{12} \label{12}</script>
<p><strong>Steps in multivariate density estimation:</strong></p>
<ul>
<li>Given a train dataset, estimate the parameters \(\mu\) and \(\Sigma\) using \eqref{11} and \eqref{12}.</li>
<li>For a new example \(x\), compute \(p(x)\) given by \eqref{10}.</li>
<li>Flag as anomaly if \(p(x) \lt \epsilon\).</li>
</ul>
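<p>The three steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: the function names <code>fit_gaussian</code> and <code>density</code>, the synthetic training data and the threshold <code>epsilon = 1e-3</code> are all hypothetical. It uses <code>np.linalg.solve</code> rather than an explicit inverse, which is a design choice of this sketch.</p>

```python
import numpy as np

def fit_gaussian(X_train):
    """Estimate mu and Sigma as in equations (11) and (12)."""
    mu = X_train.mean(axis=0)
    centered = X_train - mu
    Sigma = centered.T @ centered / X_train.shape[0]
    return mu, Sigma

def density(X, mu, Sigma):
    """Evaluate equation (10) for every row of X."""
    n = mu.shape[0]
    centered = X - mu
    # Solve Sigma * y = (x - mu) instead of forming the inverse explicitly.
    solved = np.linalg.solve(Sigma, centered.T).T
    mahalanobis_sq = np.sum(centered * solved, axis=1)
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * mahalanobis_sq) / norm_const

# Hypothetical negatively correlated training data, seeded for repeatability.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = -0.9 * x1 + 0.3 * rng.normal(size=500)
X_train = np.column_stack((x1, x2))

mu, Sigma = fit_gaussian(X_train)

# First point follows the trend; second breaks the negative correlation.
X_new = np.array([[1.0, -0.9], [1.0, 1.0]])
p_new = density(X_new, mu, Sigma)

epsilon = 1e-3  # assumed threshold; in practice tuned on a labelled validation set
is_anomaly = epsilon > p_new
print(p_new, is_anomaly)
```

<p>The second point has unremarkable values for each feature taken alone, but it is flagged because the combination contradicts the learnt correlation.</p>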
<p>The covariance matrix is the term that brings in the major difference between the univariate and the multivariate gaussian. The effect of covariance matrix and mean shifting can be seen in the plots below.</p>
<blockquote>
<p>A covariance matrix is always symmetric about the main diagonal.</p>
</blockquote>
<p><img src="/assets/2018-05-02-anomaly-detection/fig-7-effect-of-mean-and-covariance.png?raw=true" alt="Fig-7 Effect of Mean and Covariance Matrix" /></p>
<ul>
<li>The mean shifts the center of the distribution.</li>
<li>Diagonal elements vary the spread of the distribution along corresponding features (also called the variance).</li>
<li>Off-diagonal elements vary the correlation among the various features.</li>
</ul>
<p>Also, the original model in \eqref{7} is a special case of the multivariate Gaussian distribution where the off-diagonal elements of the covariance matrix are constrained to zero (<strong>contours are axis aligned</strong>).</p>
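<p>This special case can be verified numerically. The sketch below assumes that \eqref{7} is the product of per-feature Gaussian densities, and compares it with \eqref{10} evaluated with a diagonal covariance matrix. The parameter values, query points and function names are hypothetical.</p>

```python
import numpy as np

def univariate_product(X, mu, sigma_squared):
    """Product of independent one-dimensional Gaussian densities, one per feature."""
    per_feature = np.exp(-((X - mu) ** 2) / (2 * sigma_squared)) / np.sqrt(2 * np.pi * sigma_squared)
    return per_feature.prod(axis=1)

def multivariate(X, mu, Sigma):
    """Multivariate Gaussian density from equation (10)."""
    n = mu.shape[0]
    centered = X - mu
    exponent = -0.5 * np.sum(centered @ np.linalg.inv(Sigma) * centered, axis=1)
    return np.exp(exponent) / ((2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma)))

# Hypothetical parameters and query points.
mu = np.array([0.5, -1.0])
sigma_squared = np.array([2.0, 0.25])
X = np.array([[0.0, 0.0], [1.0, -1.5], [3.0, 2.0]])

p_uni = univariate_product(X, mu, sigma_squared)
# Variances on the diagonal, zeros everywhere else.
p_diag = multivariate(X, mu, np.diag(sigma_squared))
print(p_uni)
print(p_diag)
```

<p>The two printed arrays should agree up to floating point error, since a diagonal \(\Sigma\) makes the exponent a sum of per-feature terms and the determinant a product of the variances.</p>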
<h3 id="univariate-vs-multivariate-gaussian-distribution">Univariate vs Multivariate Gaussian Distribution</h3>
<ul>
<li>The univariate model can be used when features are created manually to capture anomalies in which the features take unusual combinations of values, whereas the multivariate Gaussian captures the correlation between features automatically.</li>
<li>The univariate model is computationally cheaper and hence scales well to a large number of features (\(n=10,000-100,000\)), whereas the multivariate model is computationally expensive, mainly because of the term \(\Sigma^{-1}\).</li>
<li>The univariate model works well even for small values of \(m\). The multivariate model requires \(m \gt n\), or else \(\Sigma\) is singular and hence not invertible.</li>
<li>Generally the multivariate Gaussian is used when \(m\) is much bigger than \(n\), like \(m \gt 10n\), because \(\Sigma\) is a fairly large symmetric matrix with around \({n^2 \over 2}\) parameters, which are learnt better in a setting with larger \(m\).</li>
</ul>
<p><strong>A matrix might be singular because of the presence of redundant features, i.e. two features are linearly dependent or a feature is a linear combination of a set of other features. Such matrices are non-invertible.</strong></p>
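<p>Both causes of a singular \(\Sigma\) can be reproduced with a few lines of NumPy. This is an illustrative sketch on synthetic data; the feature construction <code>x3 = 2 * x1 - x2</code> and the sizes used are assumptions chosen for the demonstration.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
x1 = rng.normal(size=m)
x2 = rng.normal(size=m)
x3 = 2.0 * x1 - x2  # redundant: a linear combination of x1 and x2

X = np.column_stack((x1, x2, x3))
centered = X - X.mean(axis=0)
Sigma = centered.T @ centered / m

# A full-rank 3 x 3 covariance matrix would have rank 3.
rank = np.linalg.matrix_rank(Sigma)
print("rank of Sigma:", rank, "of", Sigma.shape[0])

# Too few examples (m not greater than n) has the same effect.
X_small = rng.normal(size=(3, 5))
centered_small = X_small - X_small.mean(axis=0)
Sigma_small = centered_small.T @ centered_small / X_small.shape[0]
rank_small = np.linalg.matrix_rank(Sigma_small)
print("rank with m=3, n=5:", rank_small, "of", Sigma_small.shape[0])
```

<p>In both cases the rank is lower than the number of features, so \(\Sigma^{-1}\) does not exist. Dropping the redundant feature or collecting more examples restores invertibility.</p>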
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://www.coursera.org/learn/machine-learning/lecture/V9MNG/problem-motivation" target="_blank">Machine Learning: Coursera - Problem Motivation</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/ZYAyC/gaussian-distribution" target="_blank">Machine Learning: Coursera - Gaussian Distribution</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/C8IJp/algorithm" target="_blank">Machine Learning: Coursera - Algorithm</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/Rkc5x/anomaly-detection-vs-supervised-learning" target="_blank">Machine Learning: Coursera - Anomaly Detection vs Supervised Learning</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/Cf8DF/multivariate-gaussian-distribution" target="_blank">Machine Learning: Coursera - Multivariate Gaussian Distribution</a></small></p>
Wed, 02 May 2018 00:00:00 +0000
https://machinelearningmedium.com/2018/05/02/anomaly-detection/
machine-learning, andrew-ng, basics-of-machine-learning
Principal Component Analysis
<h3 id="basics-of-machine-learning-series">Basics of Machine Learning Series</h3>
<blockquote>
<p><a href="/collection/basics-of-machine-learning">Index</a></p>
</blockquote>
<div class="horizontal-divider">· · ·</div>
<h3 id="introduction">Introduction</h3>
<p>For a given dataset, PCA tries to find a lower dimensional surface onto which the points can be projected while minimizing the approximation losses. For example, consider the dataset (marked by blue dots) in \(\mathbb{R}^2\) in the plot below. The line formed by the red x’s is the projection of the data from \(\mathbb{R}^2\) to \(\mathbb{R}\).</p>
<p><img src="/assets/2018-04-22-principal-component-analysis/fig-1-pca-projection.png?raw=true" alt="Fig-1 PCA Projection" /></p>
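<p>The idea of projecting onto a line while minimizing the approximation loss can be sketched directly with NumPy. The example below is a minimal illustration that uses the singular value decomposition of the centred data, which is one common way to compute PCA. The synthetic data and variable names are assumptions of this sketch.</p>

```python
import numpy as np

# Hypothetical 2-D data scattered around the line y = x, seeded for repeatability.
rng = np.random.default_rng(0)
t = np.linspace(0, 50, 50)
X = np.column_stack((t, t)) + 5 * rng.normal(size=(50, 2))

# Centre the data, then take the SVD; rows of Vt are the principal directions.
mean = X.mean(axis=0)
centered = X - mean
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
u1 = Vt[0]  # first principal direction (unit vector)

# Project onto the line (R^2 to R) and map back to R^2.
z = centered @ u1
X_approx = mean + np.outer(z, u1)

# Mean squared projection error for the PCA line versus the horizontal axis.
pca_error = np.mean(np.sum((X - X_approx) ** 2, axis=1))
e1 = np.array([1.0, 0.0])
axis_error = np.mean(np.sum((centered - np.outer(centered @ e1, e1)) ** 2, axis=1))
print(pca_error, axis_error)
```

<p>Projecting onto the first principal direction should give a much smaller error than projecting onto the horizontal axis, because the data here varies mostly along the diagonal.</p>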
<h3 id="implementation">Implementation</h3>
<p><a href="https://github.com/shams-sam/CourseraMachineLearningAndrewNg/blob/master/PCA.ipynb" target="_blank"><strong>Ipython Notebook</strong></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># library imports</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">sklearn.decomposition</span> <span class="kn">import</span> <span class="n">PCA</span>
<span class="c"># random data generation</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">50</span><span class="p">,</span> <span class="mi">2</span><span class="p">))</span>
<span class="n">X</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">50</span><span class="p">)</span>
<span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">]</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">X</span> <span class="o">+</span> <span class="mi">5</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">randn</span><span class="p">(</span><span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">X</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span>
<span class="c"># applying PCA on data</span>
<span class="c"># same number of dimensions will help visualize components</span>
<span class="n">pca</span> <span class="o">=</span> <span class="n">PCA</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="c"># reduced number of dimensions will help understand projections</span>
<span class="n">pca_reduce</span> <span class="o">=</span> <span class="n">PCA</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="c"># projection on new components found</span>
<span class="n">X_proj</span> <span class="o">=</span> <span class="n">pca_reduce</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="c"># rebuilding the data back to original space</span>
<span class="n">X_rebuild</span> <span class="o">=</span> <span class="n">pca_reduce</span><span class="o">.</span><span class="n">inverse_transform</span><span class="p">(</span><span class="n">X_proj</span><span class="p">)</span>
<span class="n">X_proj</span> <span class="o">=</span> <span class="n">pca</span><span class="o">.</span><span class="n">fit_transform</span><span class="p">(</span><span class="n">X</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">))</span>
<span class="c"># plot data and projection</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s">'green'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_rebuild</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X_rebuild</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
<span class="c"># plot the components</span>
<span class="n">soa</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span>
<span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">(</span><span class="n">pca</span><span class="o">.</span><span class="n">components_</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span> <span class="o">*</span> <span class="n">pca</span><span class="o">.</span><span class="n">mean_</span><span class="p">,</span>
<span class="n">pca</span><span class="o">.</span><span class="n">components_</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">atleast_2d</span><span class="p">(</span>
<span class="c"># components scaled to the length of their variance</span>
<span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">pca</span><span class="o">.</span><span class="n">explained_variance_</span><span class="p">)</span>
<span class="p">)</span><span class="o">.</span><span class="n">transpose</span><span class="p">()</span>
<span class="p">))</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">u</span><span class="p">,</span> <span class="n">v</span> <span class="o">=</span> <span class="nb">zip</span><span class="p">(</span><span class="o">*</span><span class="n">soa</span><span class="p">)</span>
<span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">gca</span><span class="p">()</span>
<span class="n">ax</span><span class="o">.</span><span class="n">quiver</span><span class="p">(</span>
<span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">u</span><span class="p">,</span> <span class="n">v</span><span class="p">,</span>
<span class="n">angles</span><span class="o">=</span><span class="s">'xy'</span><span class="p">,</span>
<span class="n">scale_units</span><span class="o">=</span><span class="s">'xy'</span><span class="p">,</span>
<span class="n">scale</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span>
<span class="n">color</span><span class="o">=</span><span class="p">[</span><span class="s">'r'</span><span class="p">,</span> <span class="s">'b'</span><span class="p">]</span>
<span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">axis</span><span class="p">(</span><span class="s">'scaled'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">draw</span><span class="p">()</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">([</span>
<span class="s">'original'</span><span class="p">,</span>
<span class="s">'projection'</span>
<span class="p">])</span>
<span class="c"># plot the projection errors</span>
<span class="k">for</span> <span class="n">p_orig</span><span class="p">,</span> <span class="n">p_proj</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">X_rebuild</span><span class="p">):</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="n">p_orig</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">p_proj</span><span class="p">[</span><span class="mi">0</span><span class="p">]],</span> <span class="p">[</span><span class="n">p_orig</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">p_proj</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span> <span class="n">c</span><span class="o">=</span><span class="s">'g'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/assets/2018-04-22-principal-component-analysis/fig-2-pca-projection-with-errors.png?raw=true" alt="Fig-2 PCA Projection, Components, and Projection Errors" /></p>
<p>The plot above makes it easier to see what exactly PCA is doing. The green points show the original data. PCA finds orthogonal components, the eigenvectors of the covariance matrix, ordered by decreasing eigenvalue, which is a fancy way of saying that PCA finds a new set of features in order of decreasing variance for a given dataset. In the above example, the red vector captures the higher variance and is the first component, while the blue vector captures relatively less variance and is the second.</p>
<blockquote>
<p>Performing PCA with a number of components greater than the current number of dimensions is useless, as the data already preserves 100% of the variance in the current dimensions and no new dimension can improve on that.</p>
</blockquote>
<p>So, when dimensionality reduction is done using PCA, as can be seen in the red dots, the data is projected onto the more dominant of the two components, as it is the more representative of the data. Also, it can be seen that the red vector lies on the line that minimizes the projection errors, represented by the green lines from the original data points to the projected data points.</p>
<blockquote>
<p>Mean Normalization and feature scaling are a must before performing the PCA, so that the variance of a component is not affected by the disparity in the range of values.</p>
</blockquote>
<p>Generalizing to n-dimensional data, the same technique can be used to reduce the data to k dimensions by finding the k-dimensional linear subspace (hyperplane) with the least projection error.</p>
<h3 id="projection-vs-prediction">Projection vs Prediction</h3>
<blockquote>
<p>PCA is not Linear Regression.</p>
</blockquote>
<p>In linear regression, the aim is to predict a given dependent variable, \(y\), based on independent variables, \(x\), i.e. <strong>minimize the prediction error</strong>. In contrast, PCA does not have a target variable, \(y\); it is merely feature reduction by <strong>minimizing the projection error</strong>. The difference is clear from the plot in Fig-3.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># import the sklearn model</span>
<span class="kn">from</span> <span class="nn">sklearn.linear_model</span> <span class="kn">import</span> <span class="n">LinearRegression</span>
<span class="n">lin_reg</span> <span class="o">=</span> <span class="n">LinearRegression</span><span class="p">()</span>
<span class="n">lin_reg</span><span class="o">.</span><span class="n">fit</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">])</span>
<span class="c"># coef_ and intercept_ give the slope and the offset of the regression line</span>
<span class="n">y_pred</span> <span class="o">=</span> <span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">lin_reg</span><span class="o">.</span><span class="n">coef_</span> <span class="o">+</span> <span class="n">lin_reg</span><span class="o">.</span><span class="n">intercept_</span>
<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">7</span><span class="p">,</span> <span class="mi">7</span><span class="p">))</span>
<span class="c"># plot data and projection</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s">'green'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X_rebuild</span><span class="p">[:,</span> <span class="mi">0</span><span class="p">],</span> <span class="n">X_rebuild</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">],</span> <span class="n">y_pred</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">c</span><span class="o">=</span><span class="s">'blue'</span><span class="p">)</span>
<span class="c"># plot the projection errors</span>
<span class="k">for</span> <span class="n">p_orig</span><span class="p">,</span> <span class="n">p_proj</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">X_rebuild</span><span class="p">):</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="n">p_orig</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">p_proj</span><span class="p">[</span><span class="mi">0</span><span class="p">]],</span> <span class="p">[</span><span class="n">p_orig</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">p_proj</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span> <span class="n">c</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span>
<span class="c"># plot the prediction errors</span>
<span class="k">for</span> <span class="n">p_orig</span><span class="p">,</span> <span class="n">y</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">X</span><span class="p">[:,</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="n">y_pred</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">)))):</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">([</span><span class="n">p_orig</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">y</span><span class="p">[</span><span class="mi">0</span><span class="p">]],</span> <span class="p">[</span><span class="n">p_orig</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">y</span><span class="p">[</span><span class="mi">1</span><span class="p">]],</span> <span class="n">c</span><span class="o">=</span><span class="s">'b'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/2018-04-22-principal-component-analysis/fig-3-projection-vs-prediction.png?raw=true" alt="Fig-3 Projection vs Prediction" /></p>
<p>The blue points display the prediction based on linear regression, while the red points display the projection on the reduced dimension. The optimization objectives of the two algorithms are different. While linear regression is trying to minimize the squared errors represented by the blue lines, PCA is trying to minimize the projection errors represented by the red lines.</p>
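<p>The difference between the two objectives can also be checked numerically. The following is a minimal sketch on synthetic data (the data, the seed, and the helper names <code>vertical_mse</code> and <code>perpendicular_mse</code> are assumptions made for illustration, not part of the original notebook). It fits both lines through mean-centered data and measures each line against both error definitions.</p>

```python
import numpy as np

# synthetic, mean-centered 2-D data (assumption: not the data used in the figures)
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(scale=0.5, size=300)
X = np.column_stack([x, y])
X = X - X.mean(axis=0)

# linear regression of the second feature on the first:
# the least-squares slope minimizes the vertical (prediction) errors
slope_reg = (X[:, 0] @ X[:, 1]) / (X[:, 0] @ X[:, 0])

# first principal component: minimizes the perpendicular (projection) errors
U, S, _ = np.linalg.svd(X.T @ X / len(X))
slope_pca = U[1, 0] / U[0, 0]


def vertical_mse(slope):
    """Mean squared vertical distance to the line y = slope * x."""
    return np.mean((X[:, 1] - slope * X[:, 0]) ** 2)


def perpendicular_mse(slope):
    """Mean squared perpendicular distance to the line y = slope * x."""
    return vertical_mse(slope) / (1 + slope ** 2)


print("regression slope:", slope_reg, "PCA slope:", slope_pca)
print("vertical MSE     :", vertical_mse(slope_reg), vertical_mse(slope_pca))
print("perpendicular MSE:", perpendicular_mse(slope_reg), perpendicular_mse(slope_pca))
```

<p>Each line wins on its own objective: the regression line has the smaller vertical error, while the first principal component has the smaller perpendicular error.</p>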
<h3 id="mean-normalization-and-feature-scaling">Mean Normalization and Feature Scaling</h3>
<p>It is important to have both these steps in the preprocessing, before PCA is applied. The mean of any feature in a design matrix can be calculated by,</p>
<script type="math/tex; mode=display">\mu_j = {1 \over m} \sum_{i=1}^m x_j^{(i)} \label{1} \tag{1}</script>
<p>Following the calculation of the means, normalization is done by replacing each \(x_j^{(i)}\) by \(x_j^{(i)} - \mu_j\). Feature scaling additionally divides by a scale factor, i.e. each \(x_j^{(i)}\) is replaced by,</p>
<script type="math/tex; mode=display">\frac {x_j^{(i)} -\mu_j} {s_j} \label{2} \tag{2}</script>
<p>where \(s_j\) is some <strong>measure of the range of values of feature</strong> \(j\). It can be \(\max(x_j) - \min(x_j)\) or, more commonly, the standard deviation of the feature.</p>
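<p>As a minimal sketch of \eqref{1} and \eqref{2}, assuming the design matrix is a NumPy array with one example per row and using the standard deviation as \(s_j\) (the function name <code>standardize</code> and the toy data are hypothetical):</p>

```python
import numpy as np


def standardize(X):
    """Mean-normalize and scale each column of X (m examples x n features)."""
    mu = X.mean(axis=0)   # per-feature mean, as in (1)
    s = X.std(axis=0)     # standard deviation used as the scale s_j
    s[s == 0] = 1.0       # assumption: constant features are left unscaled
    return (X - mu) / s, mu, s


# hypothetical toy data: two features on very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0]])
X_norm, mu, s = standardize(X)
print(X_norm)
```

<p>Returning <code>mu</code> and <code>s</code> alongside the scaled data makes it possible to apply exactly the same transformation to new examples later.</p>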
<h3 id="pca-algorithm">PCA Algorithm</h3>
<ul>
<li>Compute <strong>covariance matrix</strong> given by,</li>
</ul>
<script type="math/tex; mode=display">\Sigma = {1 \over m} \sum_{i=1}^m \left( x^{(i)} \right) \left( x^{(i)} \right)^T \label{3} \tag{3}</script>
<blockquote>
<p>All covariance matrices, \(\Sigma\), satisfy a mathematical property called symmetric positive semi-definiteness. (Will take up in future posts.)</p>
</blockquote>
<ul>
<li>Following this, the eigenvectors and eigenvalues are calculated. There are various ways of doing this, the most popular being the singular value decomposition (SVD) of the covariance matrix. SVD returns three different matrices given by,</li>
</ul>
<script type="math/tex; mode=display">U,S,V = SVD(\Sigma) \label{4} \tag{4}</script>
<p>where,</p>
<ul>
<li>\(\Sigma\) is an \(n * n\) matrix because each \(x^{(i)}\) is an \(n * 1\) vector.</li>
<li>\(U\) is the \(n * n\) matrix where each column represents a component of the PCA. In order to reduce the dimensionality of the data, one needs to choose the first \(k\) columns to form a matrix, \(U_{reduce}\), which is an \(n * k\) matrix.</li>
</ul>
<p>So the dimensionally compressed data is given by,</p>
<script type="math/tex; mode=display">z^{(i)} = U_{reduce}^T x^{(i)} \label{5} \tag{5}</script>
<p>Since \(U_{reduce}^T\) is a \(k * n\) matrix and \(x^{(i)}\) is an \(n * 1\) vector, the product \(z^{(i)}\) is a \(k * 1\) vector with a reduced number of dimensions.</p>
<p>Given a reduced representation, \(z^{(i)}\), we can find its <strong>approximate reconstruction</strong> in the higher dimension by,</p>
<script type="math/tex; mode=display">x_{approx}^{(i)} = U_{reduce} \cdot z^{(i)} \label{6} \tag{6}</script>
<p>Since \(U_{reduce}\) is an \(n * k\) matrix and \(z^{(i)}\) is a \(k * 1\) vector, the product \(x_{approx}^{(i)}\) is an \(n * 1\) vector with the original number of dimensions.</p>
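<p>The steps in \eqref{3} to \eqref{6} can be sketched with NumPy as follows. This is an illustrative sketch rather than a reference implementation: it assumes the rows of <code>X</code> are mean-normalized examples, and the function name <code>pca_fit</code> and the synthetic data are hypothetical.</p>

```python
import numpy as np


def pca_fit(X, k):
    """Return U_reduce (n x k) and the singular values of the covariance matrix.

    Assumes X is an (m x n) array whose columns are already mean-normalized.
    """
    m = X.shape[0]
    Sigma = (X.T @ X) / m            # covariance matrix, n x n, as in (3)
    U, S, _ = np.linalg.svd(Sigma)   # columns of U are the components, as in (4)
    return U[:, :k], S


# synthetic correlated data in 3 dimensions
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.0, 0.2, 0.1]])
X = rng.normal(size=(200, 3)) @ mixing
X = X - X.mean(axis=0)

U_reduce, S = pca_fit(X, k=2)
Z = X @ U_reduce            # each row is z = U_reduce^T x, as in (5)
X_approx = Z @ U_reduce.T   # each row is x_approx = U_reduce z, as in (6)

print(Z.shape, X_approx.shape)
```

<p>Working with whole matrices instead of single vectors simply stacks the per-example products of \eqref{5} and \eqref{6} row by row.</p>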
<h3 id="number-of-principal-components">Number of Principal Components</h3>
<p>How to determine the number of principal components to retain during the dimensionality reduction?</p>
<p>Consider the following two metrics</p>
<ul>
<li>The objective of PCA is to minimize the projection error given by,</li>
</ul>
<script type="math/tex; mode=display">{1 \over m} \sum_{i=1}^m \lVert x^{(i)} - x_{approx}^{(i)} \rVert^2 \label{7} \tag{7}</script>
<ul>
<li>Total variation in the data is given by,</li>
</ul>
<script type="math/tex; mode=display">{1 \over m} \sum_{i=1}^m \lVert x^{(i)} \rVert^2 \label{8} \tag{8}</script>
<p><strong>Rule of Thumb</strong> is, choose the smallest value of \(k\), such that,</p>
<script type="math/tex; mode=display">\frac { {1 \over m} \sum_{i=1}^m \lVert x^{(i)} - x_{approx}^{(i)} \rVert^2} { {1 \over m} \sum_{i=1}^m \lVert x^{(i)} \rVert^2} \leq 0.01 (\text{or } 1\%) \label{9} \tag{9}</script>
<p>i.e. \(99\%\) of the variance is retained (values such as \(90\%\) or \(95\%\) variance retention are also commonly used). In practice, the number of dimensions can often be reduced significantly while still retaining 99% of the variance, because many features are highly correlated.</p>
<blockquote>
<p>Talking about the amount of variance retained is more informative than citing the number of principal components retained.</p>
</blockquote>
<p>So for choosing k the following method could be used,</p>
<ul>
<li>Try PCA for \(k=1\)</li>
<li>Compute \(U_{reduce}\), \(z^{(1)}, z^{(2)}, \cdots, z^{(m)}\), \(x_{approx}^{(1)}, x_{approx}^{(2)}, \cdots, x_{approx}^{(m)}\)</li>
<li>Check variance retention using \eqref{9}.</li>
<li>Repeat the steps for \(k = 2, 3, \cdots\) to satisfy \eqref{9}.</li>
</ul>
<p>There is an easy workaround to bypass this tedious process by using \eqref{4}. The matrix \(S\) returned by SVD is a diagonal matrix of eigenvalues corresponding to each of the components in \(U\). \(S\) is an \(n * n\) matrix with diagonal eigenvalues \(s_{11}, s_{22}, \cdots, s_{nn}\) and off-diagonal elements equal to 0. Then for a given value of \(k\),</p>
<script type="math/tex; mode=display">\frac { {1 \over m} \sum_{i=1}^m \lVert x^{(i)} - x_{approx}^{(i)} \rVert^2} { {1 \over m} \sum_{i=1}^m \lVert x^{(i)} \rVert^2} = 1 - \frac {\sum_{i=1}^k s_{ii}} {\sum_{i=1}^n s_{ii}} \label{10} \tag{10}</script>
<p>Using \eqref{10}, \eqref{9} can be written as,</p>
<script type="math/tex; mode=display">\frac {\sum_{i=1}^k s_{ii}} {\sum_{i=1}^n s_{ii}} \geq 0.99 (\text{or } 99\%) \label{11} \tag{11}</script>
<p>Now, it is easier to calculate the variance retained by iterating over values of \(k\) and calculating the value in \eqref{11}.</p>
<blockquote>
<p>The value in \eqref{10} is a good metric to cite for the performance of PCA, as it shows how well the reduced-dimensional data represents the original data.</p>
</blockquote>
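<p>Iterating over \(k\) with \eqref{11} reduces to a cumulative sum over the diagonal of \(S\). A small sketch, assuming the diagonal is available as a 1-D array sorted in decreasing order (which is how <code>np.linalg.svd</code> returns it); the function name <code>choose_k</code> and the example values are hypothetical:</p>

```python
import numpy as np


def choose_k(S, retain=0.99):
    """Smallest k whose first k values of S retain at least `retain` of the variance."""
    ratio = np.cumsum(S) / np.sum(S)   # retained variance for k = 1, ..., n
    return int(np.argmax(ratio >= retain)) + 1


# hypothetical diagonal of S, in decreasing order
S = np.array([7.0, 2.0, 0.95, 0.03, 0.02])
k = choose_k(S)
print("components needed for 99% variance:", k)
```

<p>The SVD only needs to be computed once; every candidate \(k\) is then evaluated from the same array.</p>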
<h3 id="suggestions-for-using-pca">Suggestions for Using PCA</h3>
<ul>
<li><strong>Speed up a learning algorithm</strong> by applying PCA to reduce the number of features, choosing the top \(k\) components that retain 99% of the variance. PCA should be fitted only on the training data to get \(U_{reduce}\), and not on the cross-validation or test data. This is because \(U_{reduce}\) is a parameter of the model and hence should be learnt only from the training data. Once the matrix is determined, the same mapping can be applied to the other two sets.</li>
</ul>
<blockquote>
<p>Run PCA only on the training data, not on cross-validation or test data.</p>
</blockquote>
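<p>A minimal sketch of this train-only rule, on synthetic data and with hypothetical variable names: the means, the scales, and \(U_{reduce}\) are all computed from the training split and then reused unchanged on the held-out split.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 5))   # synthetic training split
X_test = rng.normal(size=(20, 5))     # synthetic held-out split

# learn every parameter of the mapping from the training set only
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)
X_train_n = (X_train - mu) / sigma
U, S, _ = np.linalg.svd(X_train_n.T @ X_train_n / len(X_train_n))
U_reduce = U[:, :2]

# reuse the same mu, sigma and U_reduce on unseen data
Z_train = X_train_n @ U_reduce
Z_test = ((X_test - mu) / sigma) @ U_reduce

print(Z_train.shape, Z_test.shape)
```

<p>Nothing about the test split influences <code>mu</code>, <code>sigma</code> or <code>U_reduce</code>, so scores measured on it remain an honest estimate of performance on new data.</p>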
<ul>
<li>If using PCA for visualization, it does not make sense to choose \(k \gt 3\).</li>
<li>Using PCA to reduce overfitting is not a good idea. The reason it works in some cases is that it reduces the number of features and hence reduces the variance and increases the bias. But there are often better ways of doing this, such as regularization and similar techniques. It is generally advised against because PCA discards some information without taking the target values into consideration. While this might work when 99% of the variance is retained, it may on various occasions lead to the loss of useful information. Regularization, on the other hand, is better suited to preventing overfitting because it penalizes model complexity while still taking the values of the target vector into account.</li>
</ul>
<blockquote>
<p>Do not use PCA to prevent overfitting. Instead look into <a href="/2017/09/08/overfitting-and-regularization/" target="_blank">regularization</a>.</p>
</blockquote>
<ul>
<li>
<p>It is often worth trying an algorithm without PCA before diving into dimensionality reduction. So, before implementing PCA, train the model on the original dataset. If this does not give the desired result, one should move ahead and try using PCA to reduce the number of features. This also gives a worthy baseline score against which to compare the performance of the model once PCA is applied.</p>
</li>
<li>
<p>PCA can also be used in cases when the original data is too big for the disk space. In such cases, compressed data will give some benefits of space saving by dimensionality reduction.</p>
</li>
</ul>
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://www.coursera.org/learn/machine-learning/lecture/GBFTt/principal-component-analysis-problem-formulation" target="_blank">Machine Learning: Coursera - PCA Problem Formulation</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/ZYIPa/principal-component-analysis-algorithm" target="_blank">Machine Learning: Coursera - Algorithm</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/X8JoQ/reconstruction-from-compressed-representation" target="_blank">Machine Learning: Coursera - Reconstruction</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/S1bq1/choosing-the-number-of-principal-components" target="_blank">Machine Learning: Coursera - Choosing the number of principal components</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/RBqQl/advice-for-applying-pca" target="_blank">Machine Learning: Coursera - Advice</a></small></p>
Sun, 22 Apr 2018 00:00:00 +0000
https://machinelearningmedium.com/2018/04/22/principal-component-analysis/
https://machinelearningmedium.com/2018/04/22/principal-component-analysis/
machine-learning, andrew-ng, basics-of-machine-learning
K-Means Clustering
<h3 id="basics-of-machine-learning-series">Basics of Machine Learning Series</h3>
<blockquote>
<p><a href="/collection/basics-of-machine-learning">Index</a></p>
</blockquote>
<div class="horizontal-divider">· · ·</div>
<h3 id="introduction">Introduction</h3>
<ul>
<li>K-means clustering is one of the most popular clustering algorithms.</li>
<li>It gets its name from the fact that it tries to find the best k clusters, with k specified by the user, in any dataset. The quality of the clusters and their separability depends on implementation details, but it is a fairly straightforward iterative algorithm.</li>
<li>It basically involves a random centroid initialization step followed by two steps, namely, the cluster assignment step and the centroid calculation step, that are executed iteratively until a stable set of means is arrived at. It becomes clearer in the animation below.</li>
</ul>
<p><img src="/assets/2018-04-19-k-means-clustering/fig-1-clustering-animation.gif?raw=true" alt="Fig-1 K-Means Animation" width="70%" /></p>
<ul>
<li><strong>Cluster Assignment</strong>: Assign each data point to one of the clusters based on its distance from their centroids. A point is assigned to the cluster whose centroid it is closest to.</li>
<li><strong>Move Centroid:</strong> After cluster assignment, centroids are moved to the mean of clusters formed. And then the process is repeated. After a certain number of steps the centroids will no longer move around and then the iterations can stop.</li>
</ul>
<h3 id="k-means-algorithm">K-Means Algorithm</h3>
<p>Input:</p>
<ul>
<li>\(K\), number of clusters</li>
<li>Training set, \(x^{(1)}, x^{(2)}, \cdots, x^{(m)}\)</li>
</ul>
<p>where:</p>
<ul>
<li>\(x^{(i)} \in \mathbb{R}^n\), as there is no bias term here, i.e. the \(x_0=1\) convention is dropped</li>
</ul>
<p>Algorithm:</p>
<ul>
<li>Randomly initialize \(K\) cluster centroids, \(\mu_1, \mu_2, \cdots, \mu_K \in \mathbb{R}^n\)</li>
<li>Then,</li>
</ul>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
Repeat \{ \\
\text{for }i &= 1\text{ to }m \\
& c^{(i)} =\text{ index (from }1\text{ to }K\text{) of centroid closest to }x^{(i)}\text{, i.e., } \arg\min_k \lVert x^{(i)} - \mu_k \rVert^2 \\
\text{for }k &= 1\text{ to }K \\
& \mu_k =\text{ average (mean) of points assigned to cluster, }k\text{, i.e., } \frac {\text{sum of } x^{(i)}\text{, where }c^{(i)} = k} {\text{number of }c^{(i)} = k} \\
\}
\end{align}
\tag{1} \label{1} %]]></script>
<blockquote>
<p>It is common to apply k-means to data without well-separated clusters. This has particular applications in segmentation problems, like market segmentation or population division based on pre-selected features.</p>
</blockquote>
<h3 id="optimization-objective">Optimization Objective</h3>
<p>Notation:</p>
<ul>
<li>\(c^{(i)}\) - index of cluster \(\{1, 2, \cdots, K\}\) to which example \(x^{(i)}\) is currently assigned</li>
<li>\(\mu_k\) - cluster centroid \(k\), \(\mu_k \in \mathbb{R}^n\)</li>
<li>\(\mu_{c^{(i)}}\) - cluster centroid of the cluster to which the example \(x^{(i)}\) is assigned</li>
</ul>
<p>Following the above notation, the cost function of the k-means clustering is given by,</p>
<script type="math/tex; mode=display">J(c^{(1)}, c^{(2)}, \cdots, c^{(m)}, \mu_1, \mu_2, \cdots, \mu_K) = {1 \over m} \sum_{i=1}^m \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2 \tag{2} \label{2}</script>
<p>Hence the optimization objective is,</p>
<script type="math/tex; mode=display">\min_{c^{(1)}, c^{(2)}, \cdots, c^{(m)}, \mu_1, \mu_2, \cdots, \mu_K} J(c^{(1)}, c^{(2)}, \cdots, c^{(m)}, \mu_1, \mu_2, \cdots, \mu_K) \tag{3} \label{3}</script>
<blockquote>
<p>The cost function in \eqref{2} is called distortion cost function or the distortion of k-means clustering.</p>
</blockquote>
<p>It can be argued that the k-means algorithm in \eqref{1} is implementing the cost function optimization. This is so because the first step of k-means clustering, i.e. the cluster assignment step, is nothing but the minimization of the cost w.r.t. \(c^{(1)}, c^{(2)}, \cdots, c^{(m)}\) with the centroids held fixed, as this step assigns each data point to the nearest centroid. Similarly, the second step, i.e. the move centroid step, is the minimization of the clustering cost w.r.t. \(\mu_1, \mu_2, \cdots, \mu_K\) with the assignments held fixed, as the position that minimizes the distortion for a given set of points is their mean.</p>
<p>One handy way of checking whether the clustering algorithm is working correctly is to plot the distortion as a function of the number of iterations. As both steps of k-means are minimization steps, the distortion can never increase: it either decreases or stays constant as the number of iterations increases.</p>
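<p>This check is easy to automate. Below is a compact sketch of k-means that records the distortion \eqref{2} after every iteration; it is an illustration on synthetic data, and the function name <code>kmeans</code>, the fixed iteration count, and the handling of empty clusters are assumptions rather than part of the algorithm statement above.</p>

```python
import numpy as np


def kmeans(X, K, n_iter=10, seed=0):
    """Plain k-means; returns centroids, assignments and the distortion per iteration."""
    rng = np.random.default_rng(seed)
    # initialize the centroids to K distinct training examples
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)
    history = []
    for _ in range(n_iter):
        # cluster assignment: index of the nearest centroid for every point
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)
        # move centroids: mean of the points assigned to each cluster
        for k in range(K):
            if np.any(c == k):   # assumption: an empty cluster keeps its centroid
                mu[k] = X[c == k].mean(axis=0)
        history.append(((X - mu[c]) ** 2).sum(axis=1).mean())
    return mu, c, history


# synthetic data: two blobs in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
mu, c, history = kmeans(X, K=2)
print(history)
```

<p>If the recorded values ever go up from one iteration to the next, there is a bug in one of the two steps.</p>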
<h3 id="random-initialization">Random Initialization</h3>
<p>There are various ways of randomly picking \(K \lt m\) cluster centroids, but the most recommended one is to pick \(K\) training examples at random and initialize \(\{\mu_1, \mu_2, \cdots, \mu_K\}\) to these \(K\) examples.</p>
<p>Depending on the initialization, k-means can converge to different centroids or get stuck in a local optimum. One possible solution is to try multiple random initializations and then choose the one with the least distortion. It’s fairly usual to run k-means around 50-1000 times with random initialization to make sure that it does not get stuck in a bad local optimum.</p>
<p>Generally the trick of multiple random initializations helps mostly when the number of clusters is small, i.e. between 2-10. For a higher number of clusters, multiple random initializations are less likely to improve the distortion cost function.</p>
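<p>The restart strategy can be sketched as follows. The helper names <code>run_kmeans</code> and <code>best_of</code>, the synthetic data, and the fixed number of iterations per run are assumptions made for this illustration.</p>

```python
import numpy as np


def run_kmeans(X, K, rng, n_iter=20):
    """One k-means run from a random start; returns (distortion, centroids, assignments)."""
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(n_iter):
        c = ((X[:, None] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
        for k in range(K):
            if np.any(c == k):   # assumption: an empty cluster keeps its centroid
                mu[k] = X[c == k].mean(axis=0)
    c = ((X[:, None] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
    J = ((X - mu[c]) ** 2).sum(axis=1).mean()
    return J, mu, c


def best_of(X, K, n_init=50, seed=0):
    """Repeat k-means n_init times and keep the run with the least distortion."""
    rng = np.random.default_rng(seed)
    return min((run_kmeans(X, K, rng) for _ in range(n_init)), key=lambda r: r[0])


# synthetic data: three blobs in 2-D
data_rng = np.random.default_rng(1)
X = np.vstack([data_rng.normal(loc, 0.7, (40, 2)) for loc in (0.0, 3.0, 6.0)])

J_single, _, _ = best_of(X, K=3, n_init=1)
J_best, mu_best, c_best = best_of(X, K=3, n_init=50)
print("single run:", J_single, "best of 50:", J_best)
```

<p>Because both calls start from the same seed, the single run is also the first of the 50 restarts, so the best-of-50 distortion can only match or improve on it.</p>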
<h3 id="choosing-the-number-of-clusters">Choosing the Number of Clusters</h3>
<ul>
<li>One way of choosing the number of clusters is by manually visualizing the data.</li>
<li>Sometimes it is ambiguous how many clusters exist in the dataset, and in such cases it is more useful to choose the number of clusters on the basis of the end goal, i.e. the number that best serves the downstream purpose for which the clusters are needed.</li>
<li><strong>Elbow Method:</strong> On plotting the distortion as a function of the number of clusters, \(K\), this method says that the optimal number of clusters is at the point where the elbow occurs, as can be seen for line B in the plot below. It is a reasonable way of choosing the number of clusters. But this method does not always work, because sometimes the plot looks like line A, which does not have a clear elbow to choose.</li>
</ul>
<p><img src="/assets/2018-04-19-k-means-clustering/fig-2-elbow-method.png?raw=true" alt="Fig-2 Elbow Method" width="70%" /></p>
<blockquote>
<p>As a rule in k-means, as the number of clusters \(K\) increases, the distortion decreases (unless a run gets stuck in a bad local optimum). But beyond some point, adding more clusters gives little further decrease in the distortion.</p>
</blockquote>
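<p>The distortion curve used by the elbow method can be computed with a short loop over candidate values of \(K\). The sketch below uses synthetic data with three blobs and a hypothetical helper, <code>kmeans_distortion</code>; it keeps the best of a few restarts for every \(K\) so that a bad local optimum is less likely to distort the curve.</p>

```python
import numpy as np


def kmeans_distortion(X, K, rng, n_iter=20):
    """Run k-means once from a random start and return the final distortion."""
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)
    for _ in range(n_iter):
        c = ((X[:, None] - mu[None]) ** 2).sum(axis=2).argmin(axis=1)
        for k in range(K):
            if np.any(c == k):   # assumption: an empty cluster keeps its centroid
                mu[k] = X[c == k].mean(axis=0)
    return ((X - mu[c]) ** 2).sum(axis=1).mean()


rng = np.random.default_rng(0)
# synthetic data: three well-separated blobs in 2-D
X = np.vstack([rng.normal(loc, 0.5, (60, 2)) for loc in (0.0, 4.0, 8.0)])

# best distortion over a few restarts for each candidate K
curve = {K: min(kmeans_distortion(X, K, rng) for _ in range(10)) for K in range(1, 7)}
for K, J in curve.items():
    print(K, round(J, 3))
```

<p>Plotting <code>curve</code> against \(K\) gives the kind of line shown in Fig-2; for data like this, the drop in distortion should flatten out after the true number of blobs.</p>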
<h3 id="example">Example</h3>
<p><a href="https://github.com/shams-sam/CourseraMachineLearningAndrewNg/blob/master/k-means.ipynb" target="_blank"><strong>Ipython Notebook</strong></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">cv2</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
</code></pre></div></div>
<p>Given an image, it is reshaped into a list of pixels (one row per pixel, one column per channel) for ease of processing,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">l</span><span class="p">,</span> <span class="n">w</span><span class="p">,</span> <span class="n">ch</span> <span class="o">=</span> <span class="n">img</span><span class="o">.</span><span class="n">shape</span>
<span class="n">vec_img</span> <span class="o">=</span> <span class="n">img</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">ch</span><span class="p">)</span><span class="o">.</span><span class="n">astype</span><span class="p">(</span><span class="nb">int</span><span class="p">)</span>
</code></pre></div></div>
<p>Following this K points are randomly chosen and assigned as centroids,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">choose_random</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="n">vec</span><span class="p">):</span>
    <span class="n">m</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">vec</span><span class="p">)</span>
    <span class="c"># sample K distinct indices (randint could pick the same example twice)</span>
    <span class="n">idx</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">choice</span><span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">replace</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">vec</span><span class="p">[</span><span class="n">idx</span><span class="p">]</span>

<span class="n">mu</span> <span class="o">=</span> <span class="n">choose_random</span><span class="p">(</span><span class="n">K</span><span class="p">,</span> <span class="n">vec_img</span><span class="p">)</span>
</code></pre></div></div>
<p>The two basic steps of k-means clustering, cluster assignment and moving centroids can be implemented as follows,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">cluster_assignment</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">vec</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">vec</span> <span class="o">-</span> <span class="n">mu</span><span class="p">[:,</span> <span class="n">np</span><span class="o">.</span><span class="n">newaxis</span><span class="p">])</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">argmin</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">move_centroid</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">vec</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">mu</span><span class="p">)):</span>
        <span class="n">vec_sub</span> <span class="o">=</span> <span class="n">vec</span><span class="p">[</span><span class="n">c</span><span class="o">==</span><span class="n">i</span><span class="p">]</span>
        <span class="n">mu</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">vec_sub</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">mu</span>
</code></pre></div></div>
<p>The distortion cost function is calculated as follows,</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">distortion</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">vec</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">((</span><span class="n">mu</span><span class="p">[</span><span class="n">c</span><span class="p">]</span> <span class="o">-</span> <span class="n">vec</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="nb">sum</span><span class="p">()</span> <span class="o">/</span> <span class="n">vec</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
</code></pre></div></div>
<p>Once all the modules are in place, k-means iterates over the cluster assignment and move centroid steps until the decrease in distortion between two consecutive iterations is no larger than a threshold (threshold chosen = 1),</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">last_dist</span> <span class="o">=</span> <span class="n">distortion</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">vec_img</span><span class="p">)</span> <span class="o">+</span> <span class="mi">100</span>
<span class="n">curr_dist</span> <span class="o">=</span> <span class="n">last_dist</span> <span class="o">-</span> <span class="mi">100</span>
<span class="k">while</span> <span class="n">last_dist</span> <span class="o">-</span> <span class="n">curr_dist</span> <span class="o">></span> <span class="mi">1</span><span class="p">:</span>
    <span class="n">last_dist</span> <span class="o">=</span> <span class="n">curr_dist</span>
    <span class="n">c</span> <span class="o">=</span> <span class="n">cluster_assignment</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">vec_img</span><span class="p">)</span>
    <span class="n">mu</span> <span class="o">=</span> <span class="n">move_centroid</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">vec_img</span><span class="p">)</span>
    <span class="n">curr_dist</span> <span class="o">=</span> <span class="n">distortion</span><span class="p">(</span><span class="n">mu</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">vec_img</span><span class="p">)</span>
</code></pre></div></div>
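<p>As a rough, self-contained sketch of the loop above, the three helpers can be run on a small made-up 2-D dataset instead of an image. The toy data, the fixed initial centroids, the tighter threshold of <code>1e-6</code>, the <code>history</code> list and the guard for empty clusters are assumptions added for illustration and are not part of the original code.</p>

```python
import numpy as np

def cluster_assignment(mu, vec):
    # squared distance from every centroid to every point, shape (K, N),
    # then the index of the closest centroid for each point
    return ((vec - mu[:, np.newaxis]) ** 2).sum(axis=2).argmin(axis=0)

def move_centroid(mu, c, vec):
    mu = mu.copy()
    for i in range(len(mu)):
        members = vec[c == i]
        if len(members) > 0:  # assumption: an empty cluster keeps its centroid
            mu[i] = members.mean(axis=0)
    return mu

def distortion(mu, c, vec):
    return ((mu[c] - vec) ** 2).sum() / vec.shape[0]

# toy data (hypothetical): two well-separated blobs of 50 points each
rng = np.random.default_rng(0)
vec = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
mu = vec[[0, 50]].copy()  # one initial centroid taken from each blob

c = cluster_assignment(mu, vec)
last_dist = np.inf
curr_dist = distortion(mu, c, vec)
history = [curr_dist]
while last_dist - curr_dist > 1e-6:
    last_dist = curr_dist
    c = cluster_assignment(mu, vec)
    mu = move_centroid(mu, c, vec)
    curr_dist = distortion(mu, c, vec)
    history.append(curr_dist)

print(history)
```

<p>Each recorded distortion is no larger than the previous one, which is why the decrease in distortion can serve as a stopping criterion.</p>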
<p>The following plots are obtained after running k-means for image compression on two different images,</p>
<p><img src="/assets/2018-04-19-k-means-clustering/fig-3-image-compression-1.png?raw=true" alt="K-Means Compression - Image 1" width="70%" /></p>
<p><img src="/assets/2018-04-19-k-means-clustering/fig-4-image-compression-2.png?raw=true" alt="K-Means Compression - Image 2" width="70%" /></p>
<p>The following code runs k-means for values of K from 1 to 9 and plots the final distortion against K, giving the elbow plot for each image.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">elbow</span><span class="p">(</span><span class="n">img</span><span class="p">):</span>
    <span class="n">K_hist</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="n">dist_hist</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">K</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">)):</span>
        <span class="n">K_hist</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">K</span><span class="p">)</span>
        <span class="n">mu</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">dist</span> <span class="o">=</span> <span class="n">k_means</span><span class="p">(</span><span class="n">img</span><span class="p">,</span> <span class="n">K</span><span class="p">,</span> <span class="n">plot</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="n">dist_hist</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">dist</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">K_hist</span><span class="p">,</span> <span class="n">dist_hist</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">xlabel</span><span class="p">(</span><span class="s">"K"</span><span class="p">)</span>
    <span class="n">plt</span><span class="o">.</span><span class="n">ylabel</span><span class="p">(</span><span class="s">"final distortion"</span><span class="p">)</span>

<span class="n">plt</span><span class="o">.</span><span class="n">figure</span><span class="p">(</span><span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">15</span><span class="p">,</span> <span class="mi">5</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'elbow plot of image 1'</span><span class="p">)</span>
<span class="n">elbow</span><span class="p">(</span><span class="n">img_1</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">subplot</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">elbow</span><span class="p">(</span><span class="n">img_2</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">title</span><span class="p">(</span><span class="s">'elbow plot of image 2'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/2018-04-19-k-means-clustering/fig-5-elbow-plot.png?raw=true" alt="K-Means Elbow Plots" width="80%" /></p>
<p>From the two plots it is evident that, while the elbow plot suggests an optimal value of two for image 1, there is no well-defined elbow for image 2, so it is not clear which value would be optimal, as mentioned in the <a href="#choosing-the-number-of-clusters">section</a> above.</p>
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://www.coursera.org/learn/machine-learning/lecture/93VPG/k-means-algorithm" target="_blank">Machine Learning: Coursera - K-Means Clustering</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/G6QWt/optimization-objective" target="_blank">Machine Learning: Coursera - Optimization Objective</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/drcBh/random-initialization" target="_blank">Machine Learning: Coursera - Random Initialization</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/Ks0E9/choosing-the-number-of-clusters" target="_blank">Machine Learning: Coursera - Choosing the number of clusters</a></small><br /></p>
Thu, 19 Apr 2018 00:00:00 +0000
https://machinelearningmedium.com/2018/04/19/k-means-clustering/
machine-learning, andrew-ng, basics-of-machine-learning
Support Vector Machine
<h3 id="basics-of-machine-learning-series">Basics of Machine Learning Series</h3>
<blockquote>
<p><a href="/collection/basics-of-machine-learning">Index</a></p>
</blockquote>
<div class="horizontal-divider">· · ·</div>
<h3 id="optimization-objective">Optimization Objective</h3>
<p>The support vector machine objective can be seen as a modification of the cost of logistic regression. Consider the sigmoid function, given as,</p>
<script type="math/tex; mode=display">h_\theta(x) = \frac {1} {1 + e^{-z}} \tag{1} \label{1}</script>
<p>where \(z = \theta^T x \)</p>
<p>The cost function of logistic regression as in the post <a href="/2017/09/02/logistic-regression-model/#mjx-eqn-6"><strong>Logistic Regression Model</strong></a>, is given by,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
J(\theta) &= -{1 \over m} \sum_{i=1}^m \left( y^{(i)}\,log(h_\theta(x^{(i)})) + (1-y^{(i)})\,log(1 - h_\theta(x^{(i)})) \right) \\
&= -{1 \over m} \sum_{i=1}^m \left( y^{(i)}\,log(\frac {1} {1 + e^{-\theta^T x^{(i)}}}) + (1-y^{(i)})\,log(1 - \frac {1} {1 + e^{-\theta^T x^{(i)}}}) \right)
\tag{2} \label{2}
\end{align} %]]></script>
<p>Each training instance contributes to the cost function the following term,</p>
<script type="math/tex; mode=display">-y\,log(\frac {1} {1 + e^{-z}}) - (1-y)\,log(1 - \frac {1} {1 + e^{-z}})</script>
<p>So when \(y = 1\), the contributed term is \(-log(\frac {1} {1 + e^{-z}})\), which can be seen in the plot below. The cost function of SVM, denoted as \(cost_1(z)\), is a modification of the former and a close approximation to it.</p>
<p><img src="/assets/2018-04-10-support-vector-machines/fig-1-svm-cost-at-y-1.png" alt="Fig-1. SVM Cost function at y = 1" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>

<span class="k">def</span> <span class="nf">svm_cost_1</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span> <span class="k">if</span> <span class="n">_</span> <span class="o">>=</span> <span class="mi">1</span> <span class="k">else</span> <span class="o">-</span><span class="mf">0.26</span><span class="o">*</span><span class="p">(</span><span class="n">_</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">x</span><span class="p">])</span>

<span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">linspace</span><span class="p">(</span><span class="o">-</span><span class="mi">3</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">log10</span><span class="p">(</span><span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">negative</span><span class="p">(</span><span class="n">x</span><span class="p">)))))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">svm_cost_1</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">([</span><span class="s">'logistic regression cost function'</span><span class="p">,</span> <span class="s">'modified SVM cost function'</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p>Similarly, when \(y = 0\), the contributed term is \(-log(1 - \frac {1} {1 + e^{-z}})\), which can be seen in the plot below. The cost function of SVM, denoted as \(cost_0(z)\), is a modification of the former and a close approximation to it.</p>
<p><img src="/assets/2018-04-10-support-vector-machines/fig-2-svm-cost-at-y-0.png" alt="Fig-2. SVM Cost function at y = 0" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">svm_cost_0</span><span class="p">(</span><span class="n">x</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span> <span class="k">if</span> <span class="n">_</span> <span class="o"><=</span> <span class="o">-</span><span class="mi">1</span> <span class="k">else</span> <span class="mf">0.26</span><span class="o">*</span><span class="p">(</span><span class="n">_</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">x</span><span class="p">])</span>

<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="o">-</span><span class="n">np</span><span class="o">.</span><span class="n">log10</span><span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="p">(</span><span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span><span class="o">+</span><span class="n">np</span><span class="o">.</span><span class="n">exp</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">negative</span><span class="p">(</span><span class="n">x</span><span class="p">))))))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">svm_cost_0</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
<span class="n">plt</span><span class="o">.</span><span class="n">legend</span><span class="p">([</span><span class="s">'logistic regression cost function'</span><span class="p">,</span> <span class="s">'modified SVM cost function'</span><span class="p">])</span>
<span class="n">plt</span><span class="o">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<blockquote>
<p>While the slope of the straight line is not of much importance, it is the linear approximation that gives SVMs a computational advantage and leads to an easier optimization problem.</p>
</blockquote>
<p>The regularized version of \eqref{2}, from the post <a href="/2017/09/15/regularized-logistic-regression/#mjx-eqn-1"><strong>Regularized Logistic Regression</strong></a>, can be rewritten as,</p>
<script type="math/tex; mode=display">J(\theta) = {1 \over m} \sum_{i=1}^m \left( y^{(i)}\,(-log(h_\theta(x^{(i)}))) + (1-y^{(i)})\,(-log(1 - h_\theta(x^{(i)}))) \right) + {\lambda \over 2m } \sum_{j=1}^n \theta_j^2 \tag{3} \label{3}</script>
<p>In order to come up with the cost function for the SVM, \eqref{3} is modified by replacing the corresponding cost terms, which gives,</p>
<script type="math/tex; mode=display">J(\theta) = {1 \over m} \sum_{i=1}^m \left( y^{(i)}\,cost_1(\theta^T x^{(i)}) + (1-y^{(i)})\,cost_0(\theta^T x^{(i)}) \right) + {\lambda \over 2m } \sum_{j=1}^n \theta_j^2 \tag{4} \label{4}</script>
<p>Following the conventions of SVM the following modifications are made to the cost in \eqref{4}, which effectively is a change in notation but not the underlying logic,</p>
<ul>
<li>remove \({1 \over m}\): this does not affect the minimization, because multiplying a function by a positive constant does not change where its minimum occurs.</li>
<li>change the form of parameterization from \(A + \lambda B\) to \(CA + B\), where \(C\) can intuitively be thought of as \({1 \over \lambda}\).</li>
</ul>
<p>After applying the above changes, \eqref{4} gives,</p>
<script type="math/tex; mode=display">J(\theta) = C \sum_{i=1}^m \left[ y^{(i)}\,cost_1(\theta^T x^{(i)}) + (1-y^{(i)})\,cost_0(\theta^T x^{(i)}) \right] + {1 \over 2 } \sum_{j=1}^n \theta_j^2 \tag{5} \label{5}</script>
<p>The SVM hypothesis does not predict a probability; instead it gives hard class labels,</p>
<script type="math/tex; mode=display">h_\theta(x) =
\begin{cases}
1 \text{, if } \theta^Tx \geq 0 \\
0 \text{, otherwise}
\end{cases}
\tag{6} \label{6}</script>
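<p>The objective in \eqref{5} and the hypothesis in \eqref{6} can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the slope of the linear part of \(cost_1\) and \(cost_0\) is taken as 1 (the slope is not critical, as noted above), the first column of <code>X</code> is assumed to be \(x_0 = 1\), and the data and the function names <code>svm_cost</code> and <code>predict</code> are made up for the example.</p>

```python
import numpy as np

def cost_1(z):
    # zero once z >= 1, linear penalty otherwise (slope 1 is an assumption)
    return np.maximum(0.0, 1.0 - z)

def cost_0(z):
    # zero once z <= -1, linear penalty otherwise (slope 1 is an assumption)
    return np.maximum(0.0, 1.0 + z)

def svm_cost(theta, X, y, C):
    z = X @ theta
    data_term = np.sum(y * cost_1(z) + (1 - y) * cost_0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)  # theta_0 is not regularized
    return C * data_term + reg_term

def predict(theta, X):
    # hard class labels: 1 if theta^T x >= 0, else 0
    return (X @ theta >= 0).astype(int)

# hypothetical data: first column is x_0 = 1
X = np.array([[1.0, 2.0, 2.0],
              [1.0, -2.0, -2.0]])
y = np.array([1, 0])
theta = np.array([0.0, 0.5, 0.5])

print(svm_cost(theta, X, y, C=10.0))  # only the regularization term remains
print(predict(theta, X))
```

<p>Here both examples satisfy \(\theta^Tx \geq 1\) or \(\theta^Tx \leq -1\) as appropriate, so the data term vanishes and the cost reduces to the regularization term.</p>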
<h3 id="large-margin-intuition">Large Margin Intuition</h3>
<p><img src="/assets/2018-04-10-support-vector-machines/fig-3-cost-plots.png?raw=true" alt="Fig-3. SVM Cost function plots" /></p>
<p>According to \eqref{5} and the plots of the cost function as shown in the image above, the following are two desirable states for SVM,</p>
<ul>
<li>if \(y=1\), then \(\theta^Tx \geq 1\) (not just \(\geq 0\))</li>
<li>if \(y=0\), then \(\theta^Tx \leq -1\) (not just \(\lt 0\))</li>
</ul>
<p>Let C in \eqref{5} be a large value. Consequently, in order to minimize the cost, the corresponding term \(\sum_{i=1}^m \left[ y^{(i)}\,cost_1(\theta^T x^{(i)}) + (1-y^{(i)})\,cost_0(\theta^T x^{(i)}) \right]\) must be close to 0.</p>
<p>Hence, in order to minimize the cost function, when \(y=1\), \(cost_1(\theta^T x)\) should be 0, and similarly, when \(y=0\), \(cost_0(\theta^T x)\) should be 0. And thus, from the plots in Fig-3, it is clear that this can only be fulfilled by the two states listed above.</p>
<p>Following the above intuition, the optimization objective can be written as,</p>
<script type="math/tex; mode=display">min_\theta J(\theta) = min_\theta {1 \over 2 } \sum_{j=1}^n \theta_j^2 \tag{7} \label{7}</script>
<p>subject to constraints,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\theta^Tx^{(i)} &\geq 1 \text{, if } y^{(i)}=1 \\
\theta^Tx^{(i)} &\leq -1 \text{, if } y^{(i)}=0
\end{align} %]]></script>
<p>This leads to the selection of a decision boundary that maximizes the margin from the support vectors, as shown in the plot below. Maximizing the margin, as decision boundary A does, makes the classifier more robust than boundaries with smaller margins such as B. It is this property that gives SVMs the name <strong>large margin classifier</strong>.</p>
<p><img src="/assets/2018-04-10-support-vector-machines/fig-4-large-margin-decision-boundary.png?raw=true" alt="Fig-4. Large Margin Decision Boundary" width="50%" /></p>
<h3 id="effect-of-parameter-c">Effect of Parameter C</h3>
<p><img src="/assets/2018-04-10-support-vector-machines/fig-5-effect-of-regularization.png?raw=true" alt="Fig-5. Effect of Parameter C" width="50%" /></p>
<p>As discussed in the <a href="#optimization-objective">section</a> above, C can be thought of as the reciprocal of the regularization parameter, \(\lambda\). This is clearer from Fig-5. If the value of C is large, a single outlier can make the model choose a decision boundary with a smaller margin. A small value of C ensures that the outliers are overlooked and the best approximation of the large margin boundary is determined.</p>
<h3 id="mathematical-background">Mathematical Background</h3>
<p><strong>Vector Inner Product:</strong> Consider two vectors, \(v\) and \(w\), given by,</p>
<script type="math/tex; mode=display">v = \begin{bmatrix}v_1 \\ v_2 \end{bmatrix}</script>
<script type="math/tex; mode=display">w = \begin{bmatrix}w_1 \\ w_2 \end{bmatrix}</script>
<p>Then, the <strong>inner product</strong> or the <strong>dot product</strong> is defined as \(v^Tw = w^Tv\).</p>
<p><strong>Norm</strong> of a vector, \(v\), denoted as \(\lVert v\rVert\), is the Euclidean length of the vector, given by the Pythagorean theorem as,</p>
<script type="math/tex; mode=display">\lVert v\rVert = \sqrt{\sum_{i=1}^n v_i^2} \in \mathbb{R} \tag{8} \label{8}</script>
<p>The inner product can also be defined as,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{Inner_Product(v, w)} &= v^Tw = w^Tv = \sum_{i=1}^n v_i \cdot w_i \\
&= \lVert v\rVert \cdot \lVert w\rVert \cdot cos \theta = p \cdot \lVert v\rVert \tag{9} \label{9}
\end{align} %]]></script>
<p>where \(p=\lVert w\rVert \cdot cos \theta\) is the signed projection of vector \(w\) onto vector \(v\), which can be either positive or negative depending on the angle \(\theta\) between the vectors, as shown in the image below.</p>
<p><img src="/assets/2018-04-10-support-vector-machines/fig-6-dot-product.jpg?raw=true" alt="Fig-6. Dot Product" width="50%" /></p>
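<p>The identity in \eqref{9} can be checked numerically. The snippet below is a small sketch with made-up vectors; it computes the signed projection \(p\) of \(w\) onto \(v\) and confirms that \(v^Tw = p \cdot \lVert v\rVert\), including a case where the angle is obtuse and \(p\) is negative.</p>

```python
import numpy as np

v = np.array([3.0, 0.0])
w = np.array([2.0, 2.0])

inner = v @ w                          # sum of element-wise products
norm_v = np.linalg.norm(v)
norm_w = np.linalg.norm(w)
cos_theta = inner / (norm_v * norm_w)  # cosine of the angle between v and w
p = norm_w * cos_theta                 # signed projection of w onto v

print(inner, p * norm_v)               # both sides of the identity

# an obtuse angle gives a negative projection
w_neg = np.array([-2.0, 2.0])
p_neg = (v @ w_neg) / norm_v
print(p_neg)
```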
<p><strong>SVM Decision Boundary:</strong> From \eqref{7}, the optimization statement can be written as,</p>
<script type="math/tex; mode=display">min_\theta \, {1 \over 2 } \sum_{j=1}^n \theta_j^2 \tag{10} \label{10}</script>
<p>subject to constraints,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\theta^Tx^{(i)} &\geq 1 \text{, if } y^{(i)}=1 \\
\theta^Tx^{(i)} &\leq -1 \text{, if } y^{(i)}=0
\end{align}
\tag{11} \label{11} %]]></script>
<p>Let \(\theta_0 = 0\) and \(n=2\), i.e. number of features is 2 for simplicity, then \eqref{10} can be written as,</p>
<script type="math/tex; mode=display">min_\theta \, {1 \over 2 } (\theta_1^2 + \theta_2^2) = min_\theta \, {1 \over 2 } \left( \sqrt{\theta_1^2 + \theta_2^2} \right)^2 = min_\theta \, {1 \over 2 } \lVert \theta \rVert^2 \tag{12} \label{12}</script>
<p>Using \eqref{9}, \(\theta^Tx^{(i)}\) in \eqref{11} can be written as,</p>
<script type="math/tex; mode=display">\theta^Tx^{(i)} = p^{(i)} \cdot \lVert \theta \rVert \tag{13} \label{13}</script>
<p>The plot of \eqref{13} can be seen below,</p>
<p><img src="/assets/2018-04-10-support-vector-machines/fig-7-dot-product-in-svm.png?raw=true" alt="Fig-7. Dot Product in SVM" width="50%" /></p>
<p>Hence, using \eqref{12} and \eqref{13}, the optimization objective in \eqref{10} and the constraints in \eqref{11} are written as,</p>
<script type="math/tex; mode=display">min_\theta \, {1 \over 2 } \lVert \theta \rVert^2 \tag{14} \label{14}</script>
<p>subject to constraints,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p^{(i)} \cdot \lVert \theta \rVert &\geq 1 \text{, if } y^{(i)}=1 \\
p^{(i)} \cdot \lVert \theta \rVert &\leq -1 \text{, if } y^{(i)}=0
\end{align}
\tag{15} \label{15} %]]></script>
<p>where \(p^{(i)}\) is the projection of \(x^{(i)}\) onto vector \(\theta\).</p>
<p>Consider two decision boundaries, A and B, and their respective perpendicular parameters, \(\theta_A\) and \(\theta_B\) as shown in the plot below. As a consequence of choosing \(\theta_0 = 0\) for simplification, all the corresponding decision boundaries pass through the origin.</p>
<p><img src="/assets/2018-04-10-support-vector-machines/fig-8-choosing-large-margin.png?raw=true" alt="Fig-8. Choosing Large Margin Classifier" width="50%" /></p>
<p>For the training examples of either class that lie close to the boundaries, it can be seen that the magnitude of the projection is larger for \(\theta_B\) than for \(\theta_A\). Since the constraints in \eqref{15} bound the product \(p^{(i)} \cdot \lVert \theta \rVert\), a larger projection \(p\) allows a smaller \(\lVert \theta \rVert\) to satisfy \eqref{14} and \eqref{15}, and hence the decision boundary B is more favourable to the optimization objective.</p>
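<p>To make this argument concrete, the sketch below takes a candidate direction for \(\theta\) (with \(\theta_0 = 0\)), computes the signed projections \(p^{(i)}\), and derives the smallest \(\lVert \theta \rVert\) that satisfies the constraints in \eqref{15}, namely \(1 / \min_i |p^{(i)}|\) when every projection has the correct sign. The data points and the helper name <code>min_norm_for_direction</code> are hypothetical.</p>

```python
import numpy as np

def min_norm_for_direction(direction, X, y):
    u = direction / np.linalg.norm(direction)  # unit vector along theta
    p = X @ u                                  # signed projection of each example
    signed = np.where(y == 1, p, -p)           # must be positive for every example
    if np.any(signed <= 0):
        return np.inf                          # this direction cannot separate the data
    return 1.0 / signed.min()                  # smallest norm with |p| * norm >= 1

# hypothetical, linearly separable data through the origin
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -2.0], [-1.0, -3.0]])
y = np.array([1, 1, 0, 0])

wide = min_norm_for_direction(np.array([1.0, 1.0]), X, y)
narrow = min_norm_for_direction(np.array([1.0, 0.0]), X, y)
print(wide, narrow)
```

<p>The direction with the larger projections needs the smaller \(\lVert \theta \rVert\), so it is the one preferred by the objective in \eqref{14}.</p>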
<p><strong>Why is decision boundary perpendicular to the \(\theta\)?</strong></p>
<p>Consider two points \(x_1\) and \(x_2\) on the decision boundary given by,</p>
<script type="math/tex; mode=display">\theta^T x + c= 0 \tag{16} \label{16}</script>
<p>Since the two points are on the line, they must satisfy \eqref{16}. Substitution leads to the following,</p>
<script type="math/tex; mode=display">\theta^T x_1 + c= 0 \tag{17} \label{17}</script>
<script type="math/tex; mode=display">\theta^T x_2 + c= 0 \tag{18} \label{18}</script>
<p>Subtracting \eqref{18} from \eqref{17},</p>
<script type="math/tex; mode=display">\theta^T (x_1 - x_2) = 0 \tag{19} \label{19}</script>
<p>Since \(x_1\) and \(x_2\) lie on the line, the vector \((x_1 - x_2)\) lies along the line too. Two non-zero vectors have a zero inner product only when they are orthogonal, so \eqref{19} is possible only if \(\theta\) is orthogonal or perpendicular to \((x_1 - x_2)\), and hence perpendicular to the decision boundary.</p>
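<p>A quick numerical check of this argument, using an arbitrarily chosen \(\theta\) and \(c\) (both made up for the example): two points on the line are picked, and the inner product of \(\theta\) with their difference comes out as zero.</p>

```python
import numpy as np

theta = np.array([2.0, -1.0])
c = 3.0

# two points on the line 2*x1 - x2 + 3 = 0
x_a = np.array([0.0, 3.0])
x_b = np.array([1.0, 5.0])

print(theta @ x_a + c, theta @ x_b + c)  # both points satisfy the line equation
print(theta @ (x_a - x_b))               # theta is orthogonal to the direction of the line
```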
<h3 id="kernels">Kernels</h3>
<p>When dealing with non-linear decision boundaries, a learning method like logistic regression relies on high order polynomial features to find a complex decision boundary and fit the dataset, i.e. predict \(y=1\) if,</p>
<script type="math/tex; mode=display">\theta_0\,f_0 + \theta_1\,f_1 + \theta_2\,f_2 + \theta_3\,f_3 + \cdots \geq 0 \tag{20} \label{20}</script>
<p>where \(f_0 = x_0,\, f_1=x_1,\, f_2=x_2,\, f_3=x_1x_2,\, f_4=x_1^2,\, \cdots \).</p>
<p>A natural question is whether there are better or different choices of features than those in \eqref{20}. An SVM does this by picking points in the space called <strong>landmarks</strong> and defining functions called <strong>similarity</strong> corresponding to the landmarks.</p>
<p><img src="/assets/2018-04-10-support-vector-machines/fig-9-svm-landmarks.png?raw=true" alt="Fig-9. SVM Landmarks" width="50%" /></p>
<p>Say there are three landmarks defined, \(l^{(1)}\), \(l^{(2)}\) and \(l^{(3)}\), as shown in the plot above; then for any given \(x\), \(f_1\), \(f_2\) and \(f_3\) are defined as follows,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
f_1 &= similarity(x, l^{(1)}) = exp \left(- \frac {\lVert x - l^{(1)} \rVert^2} {2 \sigma^2} \right) \\
f_2 &= similarity(x, l^{(2)}) = exp \left(- \frac {\lVert x - l^{(2)} \rVert^2} {2 \sigma^2} \right) \\
f_3 &= similarity(x, l^{(3)}) = exp \left(- \frac {\lVert x - l^{(3)} \rVert^2} {2 \sigma^2} \right) \\
& \vdots
\end{align}
\tag{21} \label{21} %]]></script>
<p>Here, the similarity function is mathematically termed a <strong>kernel</strong>. The specific kernel used in \eqref{21} is called the <strong>Gaussian kernel</strong>. Kernels are sometimes also denoted as \(k(x, l^{(i)})\).</p>
<p>Consider \(f_1\) from \eqref{21}. If \(x\) is close to landmark \(l^{(1)}\), then \(\lVert x - l^{(1)} \rVert \approx 0\) and hence \(f_1 \approx 1\). Similarly, for an \(x\) far from the landmark, \(\lVert x - l^{(1)} \rVert\) will be a larger value and hence the exponential fall will cause \(f_1 \approx 0\). So effectively the choice of landmarks has increased the number of features describing \(x\) from 2 to 3, which can be helpful in discrimination.</p>
<p>For a Gaussian kernel, the value of \(\sigma\) defines the spread of the normal distribution. If \(\sigma\) is small, the spread will be narrower, and when it is large the spread will be wider.</p>
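<p>The behaviour described above can be sketched directly from \eqref{21}. The landmark positions, the query point and the function name <code>gaussian_kernel</code> below are made up for illustration.</p>

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # similarity(x, l) = exp(-||x - l||^2 / (2 * sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

# three hypothetical landmarks in a 2-D feature space
landmarks = np.array([[1.0, 1.0], [4.0, 1.0], [2.5, 4.0]])
x = np.array([1.0, 1.1])  # a point very close to the first landmark

f = np.array([gaussian_kernel(x, l, sigma=1.0) for l in landmarks])
print(f)  # f_1 is close to 1, f_2 and f_3 are close to 0

# a larger sigma widens the spread, so a far landmark still gets noticeable similarity
f2_wide = gaussian_kernel(x, landmarks[1], sigma=3.0)
print(f2_wide)
```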
<p>This also makes clear how landmarks help in generating new features. Depending on the values of the parameters \(\theta\) and \(\sigma\), a variety of decision boundaries can be achieved.</p>
<h3 id="how-to-choose-optimal-landmarks">How to choose optimal landmarks?</h3>
<p>In a complex machine learning problem it would be advantageous to choose many more landmarks. This is generally achieved by placing a landmark at each training example, i.e. as many landmarks as there are training examples, ending up with \(l^{(1)}, l^{(2)}, \cdots, l^{(m)}\) if there are \(m\) training examples. Each feature then measures how close an instance is to one of the existing training points, which generates a new feature vector for every instance.</p>
<blockquote>
<p>For SVM training, given training examples \(x\), features \(f\) are computed, and the prediction is \(y=1\) if \(\theta^Tf \geq 0\).</p>
</blockquote>
<p>The training objective from \eqref{5} is modified as follows,</p>
<script type="math/tex; mode=display">min_\theta \, C \sum_{i=1}^m \left[ y^{(i)}\,cost_1(\theta^T f^{(i)}) + (1-y^{(i)})\,cost_0(\theta^T f^{(i)}) \right] + {1 \over 2 } \sum_{j=1}^m \theta_j^2 \tag{22} \label{22}</script>
<p>In this case, \(n=m\) in \eqref{5} by virtue of the procedure used to choose \(f\).</p>
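<p>With one landmark per training example, the features for the whole training set form an \(m \times m\) matrix. A minimal sketch is given below, assuming a Gaussian kernel and ignoring the intercept feature \(f_0\); the function name <code>kernel_features</code> and the data are hypothetical.</p>

```python
import numpy as np

def kernel_features(X, landmarks, sigma):
    # squared distance between every example and every landmark,
    # shape (number of examples, number of landmarks)
    sq_dist = ((X[:, np.newaxis, :] - landmarks[np.newaxis, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# hypothetical training set with m = 3 examples and 2 original features
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])

# landmarks are the training examples themselves, so F is m x m
F = kernel_features(X, X, sigma=1.0)
print(F.shape)
print(F)
```

<p>Row \(i\) of <code>F</code> is the feature vector \(f^{(i)}\) used in \eqref{22}; its \(i\)-th entry is always 1 because every example coincides with its own landmark.</p>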
<blockquote>
<p>The regularization term in \eqref{22} can be written as \(\theta^T\theta\). In practice, however, most SVM libraries use \(\theta^TM\theta\) instead, which can be considered a scaled version, as it gives certain optimization benefits and scales better to bigger training sets. This will be taken up at a later point, maybe in another post.</p>
</blockquote>
<p>While the kernels idea can be applied to other algorithms like logistic regression, the computational tricks that apply to SVMs do not generalize as well to other algorithms.</p>
<blockquote>
<p>Hence, SVMs and Kernels tend to go particularly well together.</p>
</blockquote>
<h3 id="biasvariance">Bias/Variance</h3>
<p>Since \(C\) behaves like \({1 \over \lambda}\),</p>
<ul>
<li>Large C: Low bias, High Variance</li>
<li>Small C: High bias, Low Variance</li>
</ul>
<p>Regarding \(\sigma\),</p>
<ul>
<li>Large \(\sigma^2\): High Bias, Low Variance (Features vary more smoothly)</li>
<li>Small \(\sigma^2\): Low Bias, High Variance (Features vary less smoothly)</li>
</ul>
<h3 id="choice-of-kernels">Choice of Kernels</h3>
<ul>
<li><strong>Linear Kernel:</strong> is equivalent to a no kernel setting giving a standard linear classifier given by,</li>
</ul>
<script type="math/tex; mode=display">\theta_0\,x_0 + \theta_1\,x_1 + \theta_2\,x_2 + \theta_3\,x_3 + \cdots \geq 0 \tag{23} \label{23}</script>
<p>Linear kernels are used when the number of training examples is small but the number of features is huge.</p>
<ul>
<li><strong>Gaussian Kernel:</strong> Make a choice of \(\sigma^2\) to adjust the bias/variance trade-off.</li>
</ul>
<p>Gaussian kernels are generally used when the number of training examples is large and the number of features is small.</p>
<blockquote>
<p>Feature scaling is important when using SVM, especially Gaussian Kernels, because if the ranges vary a lot then the similarity feature would be dominated by features with higher range of values.</p>
</blockquote>
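<p>A small sketch of why scaling matters: the squared distance inside the Gaussian kernel is a sum of per-feature contributions, so a feature with a large numeric range can swamp the others. The two features below (an area and a bedroom count) and their values are made up, and standardization to zero mean and unit variance is only one possible choice of scaling.</p>

```python
import numpy as np

# hypothetical features: [area in square feet, number of bedrooms]
X = np.array([[1000.0, 1.0],
              [1010.0, 5.0],
              [2000.0, 1.0]])

# per-feature contribution to ||x - l||^2 between the first two examples
raw_contrib = (X[0] - X[1]) ** 2
print(raw_contrib)     # the tiny difference in area outweighs the bedroom difference

# standardize each feature to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_contrib = (X_scaled[0] - X_scaled[1]) ** 2
print(scaled_contrib)  # after scaling, the bedroom difference dominates
```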
<blockquote>
<p>All kernels used with SVMs must satisfy Mercer’s Theorem, to make sure that SVM optimizations do not diverge.</p>
</blockquote>
<p>Some other kernels known to be used with SVMs are:</p>
<ul>
<li>Polynomial kernels, \(k(x, l) = (x^T l + constant)^{degree}\)</li>
<li>Esoteric kernels, like string kernel, chi-square kernel, histogram intersection kernel, etc.</li>
</ul>
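<p>For instance, the polynomial kernel listed above can be written as a one-line function. The default values for <code>constant</code> and <code>degree</code> and the input vectors are arbitrary choices for illustration.</p>

```python
import numpy as np

def polynomial_kernel(x, l, constant=1.0, degree=2):
    # k(x, l) = (x^T l + constant)^degree
    return (x @ l + constant) ** degree

x = np.array([1.0, 2.0])
l = np.array([3.0, 0.5])

# x^T l = 3 + 1 = 4, so the kernel value is (4 + 1)^2
print(polynomial_kernel(x, l))
```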
<h3 id="multi-class-classification">Multi-Class Classification</h3>
<ul>
<li>Most SVM libraries have multi-class classification.</li>
<li>Alternatively, one may use the one-vs-all technique to train \(k\) different SVMs, one per class, and pick the class with the largest \(\theta^Tx\).</li>
</ul>
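<p>The one-vs-all prediction step can be sketched as follows, assuming the \(k\) classifiers have already been trained. The parameter matrix <code>Theta</code> (one row per class) and the inputs are invented for the example, and training itself is not shown.</p>

```python
import numpy as np

# hypothetical parameters of k = 3 trained classifiers, one row per class;
# the first entry of each row multiplies x_0 = 1
Theta = np.array([[0.0, 1.0, 0.0],
                  [0.0, -1.0, 1.0],
                  [0.5, 0.0, -1.0]])

def predict_one_vs_all(Theta, x):
    scores = Theta @ x             # theta^T x for every class
    return int(np.argmax(scores))  # class with the largest score

x_a = np.array([1.0, 2.0, 0.5])
x_b = np.array([1.0, -1.0, 3.0])
print(predict_one_vs_all(Theta, x_a), predict_one_vs_all(Theta, x_b))
```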
<h3 id="logistic-regression-vs-svm">Logistic Regression vs SVM</h3>
<ul>
<li>If \(n\) is large relative to \(m\), use logistic regression or SVM with a linear kernel, e.g. \(n=10000\) and \(m\) between 10 and 1000.</li>
<li>If \(n\) is small and \(m\) is intermediate, use SVM with a Gaussian kernel, e.g. \(n\) between 1 and 1000 and \(m\) between 10 and 10000.</li>
<li>If \(n\) is small and \(m\) is large, create/add more features, then use logistic regression or SVM with no kernel, since SVMs with Gaussian kernels struggle on huge datasets, e.g. \(n\) between 1 and 1000 and \(m\) of 50000 or more.</li>
</ul>
<blockquote>
<p>Logistic Regression and SVM without a kernel (i.e. with a linear kernel) generally give very similar results. A neural network would work well on these training data too, but would be slower to train.</p>
</blockquote>
<p>Also, the optimization problem of SVM is a convex problem, so the issue of getting stuck in local minima is non-existent for SVMs.</p>
<h2 id="references">REFERENCES:</h2>
<p><small><a href="https://www.coursera.org/learn/machine-learning/lecture/sHfVT/optimization-objective" target="_blank">Machine Learning: Coursera - Optimization Objective</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/wrjaS/large-margin-intuition" target="_blank">Machine Learning: Coursera - Large Margin Intuition</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/3eNnh/mathematics-behind-large-margin-classification" target="_blank">Machine Learning: Coursera - Mathematics of Large Margin Classification</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/YOMHn/kernels-i" target="_blank">Machine Learning: Coursera - Kernel I</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/hxdcH/kernels-ii" target="_blank">Machine Learning: Coursera - Kernel II</a></small><br />
<small><a href="https://www.coursera.org/learn/machine-learning/lecture/sKQoJ/using-an-svm" target="_blank">Machine Learning: Coursera - Using An SVM</a></small><br />
<small><a href="https://www.quora.com/Support-Vector-Machines-Why-is-theta-perpendicular-to-the-decision-boundary" target="_blank">Quora - Why is theta perpendicular to the decision boundary?</a></small><br />
<small><a href="https://docs.opencv.org/2.4/doc/tutorials/ml/introduction_to_svm/introduction_to_svm.html" target="_blank">Introduction to support vector machines</a></small></p>
Tue, 10 Apr 2018 00:00:00 +0000
https://machinelearningmedium.com/2018/04/10/support-vector-machines/
machine-learning, andrew-ng, basics-of-machine-learning