Monday, July 25, 2016

Understanding quicksort partitioning

The most important part of understanding quicksort is to understand the partitioning function.

Here is an intuitive way to picture the array currently being partitioned. Let's say we've selected a pivot in the middle of the array and its value is 100.
Where the pivot was before partitioning, and where it ends up afterwards, doesn't really matter, as long as it lands on the correct side of the resulting boundary. What really matters is that when partitioning is done, all elements to the left of the left index are smaller than the pivot (equivalently, all elements to the right of the right index are larger than the pivot). Let's assume the pivot element 100 has already been swapped into the blue part.
  1. The left index keeps increasing, and every element it leaves behind is smaller than the pivot. At the same time, the right index keeps decreasing, and every element it leaves behind is larger than the pivot.
  2. When the left index meets an element larger than or equal to the pivot, it stops and waits for the right index to find an element smaller than or equal to the pivot; then the two elements are swapped and step 1 continues, until the left index is larger than the right index. What if the right index never finds an element smaller than or equal to the pivot? Then it simply keeps moving left until it passes the left index and stops at the element just to its left (whose value is smaller than the pivot). At that point the partitioning is done.
  3. Return the left index as the boundary: all elements to its left are smaller than the pivot (see the code sketch below).
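Here is a minimal Python sketch of this partitioning scheme and of the quicksort built on top of it; the pivot is taken from the middle of the range, as in the example above, and the function and variable names are my own:

def partition(a, lo, hi):
    """Partition a[lo..hi] around the middle element as pivot.

    Returns the left index: every element before it is <= pivot,
    every element from it onwards is >= pivot.
    """
    pivot = a[(lo + hi) // 2]
    left, right = lo, hi
    while left <= right:
        while a[left] < pivot:   # step 1: skip elements smaller than the pivot
            left += 1
        while a[right] > pivot:  # step 1: skip elements larger than the pivot
            right -= 1
        if left <= right:        # step 2: both indices stopped, swap and continue
            a[left], a[right] = a[right], a[left]
            left += 1
            right -= 1
    return left                  # step 3: the boundary

def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo < hi:
        p = partition(a, lo, hi)
        quicksort(a, lo, p - 1)  # elements not larger than the pivot
        quicksort(a, p, hi)      # elements not smaller than the pivot

data = [3, 100, 7, 250, 42, 100, 9]
quicksort(data)
print(data)  # [3, 7, 9, 42, 100, 100, 250]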
Bonus: Have you seen this amusing Hungarian quicksort dance before? 
https://www.youtube.com/watch?v=ywWBy6J5gz8

Thursday, July 14, 2016

Stanford Machine Learning Final Week Review! Congratulations!

What is a "machine learning pipeline"?

It's a system with many stages / components, several of which may use machine learning. When dealing with complex real-world problems, we often need to solve multiple sub-problems in order to solve the main one. For example, one way to solve the Photo OCR (Optical Character Recognition) problem is to solve the following sub-problems:
  1. Text detection
  2. Character segmentation
  3. Character classification
Each of the above sub-problems also involves using machine learning (classification). Text detection uses Sliding Windows detection.

How can we know which stage / component is more important, on which we should spend more time?

By using Ceiling Analysis, we can tell which part of the pipeline is more important and which we should work on next. Provided we have a single numeric evaluation metric for the whole pipeline, we can ask: if one of the stages worked perfectly, what would the global performance be? If the global performance improves a lot, that stage is the right one to work on next. If it barely improves, then even spending a lot of time making that stage perfect would not improve the global performance much.
In the figure above, we clearly see that making the Text detection stage perfect would improve the global accuracy by 17%. That means we should spend more time improving Text detection than, say, Character segmentation, which would improve the global accuracy by only 1%.
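As a small illustration, here is the calculation behind ceiling analysis in Python, using the Photo OCR accuracy figures from the course example (72% overall, 89% with perfect text detection, 90% with perfect character segmentation as well, 100% with everything perfect); these exact numbers are just the course's example:

# Overall pipeline accuracy when each stage, cumulatively, is replaced
# by perfect (ground-truth) output. Figures from the course's Photo OCR example.
stages = [
    ("Overall system",                 0.72),
    ("Perfect text detection",         0.89),
    ("Perfect character segmentation", 0.90),
    ("Perfect character recognition",  1.00),
]

for (name, acc), (_, prev) in zip(stages[1:], stages[:-1]):
    print(f"{name}: potential gain {acc - prev:.0%}")
# Text detection has ~17% of headroom, character segmentation only ~1%.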

Finally, I've earned the certificate :)

Stanford Machine Learning Week 10 review

What is the problem with learning with a large dataset?

When training the parameters on a large dataset, say 100 million training examples or more, we may not be able to fit all the training examples into memory. Yet we need all of them to calculate the partial derivatives for the parameters: each step of gradient descent computes the next θ by accumulating hθ(x) - y over every training example. How can we deal with that when the training set is too large to fit into memory?

Three solutions

  • Stochastic Gradient Descent
  • Mini-Batch Gradient Descent (may be more efficient)
  • Map Reduce (multiple machines)

Stochastic Gradient Descent

Each update of θ is calculated from only one training example, so stochastic gradient descent is much faster than batch gradient descent: each baby step (each update of the θ values) of batch gradient descent uses all training examples, whereas each baby step of stochastic gradient descent uses a single one. One caveat is that, although the baby steps of stochastic gradient descent generally move the parameters towards the global minimum, they don't always do so. In fact, stochastic gradient descent doesn't converge in the same sense as batch gradient descent: it ends up wandering around continuously in some region close to the global minimum rather than reaching it and staying there. In practice this isn't a problem, since the result is still a pretty good hypothesis.
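For concreteness, here is a minimal NumPy sketch of the two update rules for linear regression (my own illustration; the course presents this in Octave):

import numpy as np

def batch_step(theta, X, y, alpha):
    # Batch gradient descent: one step uses all m examples.
    m = len(y)
    return theta - alpha / m * X.T @ (X @ theta - y)

def stochastic_epoch(theta, X, y, alpha):
    # Stochastic gradient descent: update theta after every single (shuffled) example.
    for i in np.random.permutation(len(y)):
        theta = theta - alpha * (X[i] @ theta - y[i]) * X[i]
    return theta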

Stochastic gradient descent convergence

We can plot a learning curve to check whether stochastic gradient descent is converging as the number of iterations grows. The x-axis is the number of iterations, and the y-axis is the cost (averaged over recent examples). As the top-left figure shows, we get a slightly better result with a smaller learning rate. The top-right figure shows that we get a smoother curve when the cost is averaged over 5000 examples instead of 1000. The bottom-right figure shows that if the learning rate is too large, the learning curve may end up diverging.

Mini-Batch Gradient Descent

It's just another variation of stochastic gradient descent: instead of using one training example at a time, it uses several (a mini batch). Mini-batch gradient descent is likely to outperform stochastic gradient descent only if you have a good vectorised implementation that parallelises the computation over the mini batch.
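A sketch of the corresponding mini-batch update, reusing the NumPy arrays and names from the previous snippet (the batch size b is a free parameter, typically somewhere between 2 and 100):

def minibatch_epoch(theta, X, y, alpha, b=10):
    # Mini-batch gradient descent: one vectorised update per batch of b examples.
    m = len(y)
    for start in range(0, m, b):
        Xb, yb = X[start:start + b], y[start:start + b]
        theta = theta - alpha / len(yb) * Xb.T @ (Xb @ theta - yb)
    return theta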

Map Reduce

Any learning algorithm that can be expressed as a summation over the training set can use Map Reduce to speed up the computation by parallelising it over different machines. Whether it's Linear Regression, Logistic Regression, or a Neural Network, they can all use map reduce. For example, with 400 million training examples, we can partition them over 4 computers, each computing the sum over one fourth of the training examples, and another machine then combines the 4 partial results to compute the partial derivatives.
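The following toy sketch simulates the idea on a single machine by splitting the gradient summation into four chunks; in a real map-reduce setup, each chunk would be computed on a different machine and only the four partial sums would be sent back to be combined:

import numpy as np

def mapreduce_gradient(theta, X, y, n_machines=4):
    partial_sums = []
    for Xc, yc in zip(np.array_split(X, n_machines), np.array_split(y, n_machines)):
        # "map": each machine sums (h(x) - y) * x over its own quarter of the data
        partial_sums.append(Xc.T @ (Xc @ theta - yc))
    # "reduce": a central machine adds the partial sums and divides by m
    return sum(partial_sums) / len(y)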

What is Online Learning?

When you continuously receive a stream of new training examples and you don't want to store all the previous ones (because new data keeps arriving), you can use Online Learning. It keeps running the stochastic gradient descent update forever, one incoming example at a time.
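In code it is essentially the stochastic update applied to each example as it arrives and then discarded; a sketch assuming a stream that yields (x, y) pairs:

def online_learning(theta, stream, alpha):
    # stream is assumed to yield (x, y) pairs for as long as new data arrives
    for x, y in stream:
        theta = theta - alpha * (x @ theta - y) * x  # one stochastic update
        # (x, y) is then thrown away; only theta is kept
    return theta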

Stanford Machine Learning Week 9 review

What is Anomaly Detection?

Given a bunch of x's (where each x is a vector, a training example), decide for a new example x whether its probability p(x) < ε (epsilon). If yes, it is considered an anomaly; otherwise, it is considered normal.

What is Gaussian (Normal) distribution?

Gaussian distribution is defined as X ~ N(μ, σ^2), where μ (pronounced mu) is the mean of x; σ (pronounced sigma) is the standard deviation; σ^2 is the variance.

Anomaly Detection Algorithm

p(x; μ, σ^2) is the Gaussian density of x for fixed values of μ and σ^2: p(x; μ, σ^2) = (1 / (√(2π) σ)) · exp(-(x - μ)^2 / (2σ^2)). The anomaly detection algorithm estimates μ and σ^2 for each feature from the training set and models p(x) as the product of the per-feature densities.
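A minimal NumPy sketch of the whole procedure, fitting an independent Gaussian to each feature and flagging a new example whose density falls below ε (my own illustration, not the course's Octave code):

import numpy as np

def fit_gaussian(X):
    # X is m x n: estimate mu and sigma^2 separately for each feature
    return X.mean(axis=0), X.var(axis=0)

def p(x, mu, sigma2):
    # product over the n features of the univariate Gaussian densities
    densities = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod()

def is_anomaly(x, mu, sigma2, epsilon):
    return p(x, mu, sigma2) < epsilon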


Recommender Systems: Collaborative Filtering Algorithm

The Collaborative Filtering algorithm is based on linear regression. We can think of each movie as having features x1, x2, ..., xn: x1 might represent how romantic the movie is, x2 how action-packed it is, and so on. If a user has rated enough movies (with ratings 1 ≤ y ≤ 5), we can use linear regression to learn that user's parameters θ and predict hθ(x) for a movie, given its features.

Learning features

Not only can we learn the θ's of each user, we can even learn the values of the features of each movie automatically, using the Collaborative Filtering algorithm.
Given a dataset that consists of a set of ratings produced by some users on some movies, you wish to learn the parameter vectors x(1),...,x(nm),θ(1),...,θ(nu) that produce the best fit (minimizes the squared error).
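A sketch of that squared-error objective in NumPy; the variable names follow the course convention (Y holds the ratings, R marks which ratings exist), but the code itself is my own illustration:

import numpy as np

def cofi_cost(X, Theta, Y, R):
    # X:     num_movies x num_features   (movie features, learned)
    # Theta: num_users  x num_features   (user parameters, learned)
    # Y:     num_movies x num_users      (ratings 1-5)
    # R:     num_movies x num_users      (1 where a rating exists, 0 otherwise)
    errors = (X @ Theta.T - Y) * R      # only count the ratings that exist
    return 0.5 * np.sum(errors ** 2)    # squared error to minimise over X and Theta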

Stanford Machine Learning Week 8 review

What is Unsupervised Learning?

Supervised Learning learns the weights from training examples that come with answers (labels). Unsupervised Learning does not have such answers in the training examples.

What is K-Means Algorithm?

The K-Means Algorithm is an unsupervised learning algorithm that automatically finds K clusters among the training points. It repeats the following two steps iteratively until it converges (a short code sketch follows the two steps).
  1. Cluster assignment: assign each point (each training example) to its closest cluster centroid
  2. Move centroid: move each centroid to the average position of all the points assigned to it
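A compact NumPy sketch of those two steps (my own illustration, not the course's Octave code; the K initial centroids are picked at random from the training points):

import numpy as np

def kmeans(X, K, iterations=100):
    # randomly pick K training points as the initial centroids
    centroids = X[np.random.choice(len(X), K, replace=False)]
    for _ in range(iterations):
        # 1. cluster assignment: index of the closest centroid for every point
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 2. move centroids: average position of the points assigned to each centroid
        # (a centroid that loses all its points would need special handling, omitted here)
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    return centroids, labels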

A few things to pay attention to when implementing K-Means Algorithm

  • Always normalise the features (zero-mean, better scaling)
  • Local optima are possible, depending on the initial K centroids. Therefore, we should randomly initialise the K centroids many times and keep the run with the minimum cost function
  • Choosing K: the Elbow Method is not always conclusive. It is often better to choose the K that best serves the downstream purpose.

Dimension Reduction Motivation

By working with lower dimensional data, we have the following advantages:
  • Our learning algorithms can often run much much faster
  • Use less memory or disk space
  • Visualise data in a 2D plot if k = 2, or in a 3D plot if k = 3


PCA (Principal Component Analysis) algorithm

PCA finds a low dimensional "surface" onto which to project the training examples so that the sum of the squared projection errors is minimised.

Projection error means the projection distance, which is the distance between a point (a training example x) and its projection.
Think of each training example x as a point in a 3D space, and the "surface" as a 2D plane in that space; the distance from the point x to its projection onto the 2D plane is the projection error (projection distance).
Let's say we have an n-dimensional training example, and we want to reduce it to a k-dimensional training example.
The PCA will find k vectors - u(1), u(2), ..., u(k), which will define the "surface".
u(1) .. u(k) are called eigenvectors or principal components.

How to reduce a training example from n dimension to k dimension?

X, the training set, is an m*n matrix (one n-dimensional training example per row)
  1. Sigma = (1/m)*X'*X, the covariance matrix
  2. [U,S,V] = svd(Sigma), take U, an n*n matrix whose columns are the eigenvectors (the principal components)
  3. Ureduce = U(:, 1:k), meaning take the first k columns of U and name it Ureduce, an n*k matrix
  4. z = Ureduce'*x transforms each training example x into a k-dimensional vector (see the sketch below)
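The same four steps as a NumPy sketch (my own translation of the Octave snippets above; it assumes the features of X have already been mean-normalised):

import numpy as np

def pca_reduce(X, k):
    # X is m x n, one mean-normalised training example per row
    m = X.shape[0]
    Sigma = (X.T @ X) / m             # n x n covariance matrix
    U, S, _ = np.linalg.svd(Sigma)    # columns of U are the principal components
    U_reduce = U[:, :k]               # first k eigenvectors, an n x k matrix
    Z = X @ U_reduce                  # row-wise version of z = Ureduce' * x
    return Z, U_reduce, S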


Reconstruction from Compressed Representation

One can easily reconstruct an approximation of each training example x (n-dimensional) from its compressed representation z (k-dimensional) using xapprox = Ureduce*z
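Reusing Z and U_reduce from the sketch above, the NumPy equivalent is a single line:

X_approx = Z @ U_reduce.T    # each row approximates the original n-dimensional example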

How to choose k (number of principal components)?

Choose the smallest value of k such that "99% of the variance is retained". Yes, we should try to retain as much of the variance as possible.
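Assuming we kept the singular values S returned by the PCA sketch above, the smallest such k can be computed directly, since the fraction of variance retained by the first k components is the ratio of their singular values to the total:

import numpy as np

def choose_k(S, retain=0.99):
    retained = np.cumsum(S) / np.sum(S)             # variance retained for k = 1, 2, ..., n
    return int(np.argmax(retained >= retain)) + 1   # smallest k reaching the target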

Is it possible to apply the mapping x -> z to cross validation and test set?

Yes, you should apply the same mapping x -> z that you obtained by running PCA on the training set (do not re-run PCA on the cross validation or test set).

Caveat

Do not include PCA in your ML system design by default; only apply PCA when your learning algorithm runs too slowly (or needs too much memory) on the original data.

Dimension Reduction showcase

On the left side of the following figure are the original features: each training example (a face) is represented by 1024 pixels. Keeping only 100 principal components (a reduction by roughly a factor of 10) reduces the dimension of each training example from 1024 to 100. On the right side of the figure are the reconstructed faces. Even though some information is lost by the dimension reduction, the faces are still recognisable, and we have 10 times fewer features to compute with.

The eigenvectors

Each of the following faces represents an eigenvector. Each eigenvector maps an original 1024-dimensional (pixel) training example to one component of the 100-dimensional reduced representation. The figure shows only the first 36 principal components:

Stanford Machine Learning Week 7 review

What is SVM (Support Vector Machine)?

SVM is similar to logistic regression. It computes z the same way: z = θ'*x.
The difference is the cost function. The SVM cost for y == 1 is made of two straight line segments; symmetrically, the cost for y == 0 is made of two other straight line segments.
This is computationally more efficient. It also pushes θ'*x ≥ 1 when y = 1 (not merely θ'*x ≥ 0), and θ'*x ≤ -1 when y = 0 (not merely θ'*x ≤ 0).
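For concreteness, here is a small Python sketch of those two piecewise-linear costs (essentially hinge losses), where z stands for θ'*x:

def cost_1(z):
    # cost when y == 1: zero once z >= 1, then a straight line rising as z decreases
    return max(0.0, 1.0 - z)

def cost_0(z):
    # cost when y == 0: zero once z <= -1, then a straight line rising as z increases
    return max(0.0, 1.0 + z)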

By minimising this modified cost function, the SVM, thanks to the Large Margin property, arrives at the final θ, which draws a line (linear, if the features are not polynomial) separating the positives (h = 1) from the negatives (h = 0) in a classification problem.

What is Large Margin?

With a Large Margin, we can draw a line that separates the positive points from the negative points while guaranteeing a large minimum projection distance from each point to that boundary line.

What is Vector Inner Product?

If we have two vectors u = [u1 u2] and v = [v1 v2], one way to calculate the inner product is u'*v, which is u1*v1 + u2*v2.
Another way to calculate the inner product is geometric:
The norm (Euclidean length) of vector u, ||u||, is sqrt(u1^2 + u2^2): project u onto the x-axis (x = u1) and the y-axis (y = u2) and apply the Pythagorean theorem. Draw vector v in the same axes, project v orthogonally onto u, and let p be the signed length of that projection (measured from the origin (0,0) to the foot of the projection); p can be negative. Then the inner product = p*||u||.
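A quick NumPy illustration that both ways give the same number (p here is the signed length of the orthogonal projection of v onto u):

import numpy as np

u = np.array([3.0, 4.0])
v = np.array([1.0, 2.0])

algebraic = u @ v                      # u1*v1 + u2*v2 = 11
p = (u @ v) / np.linalg.norm(u)        # signed length of the projection of v onto u
geometric = p * np.linalg.norm(u)      # p * ||u||
print(algebraic, geometric)            # both are 11.0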

Apply Vector Inner Product to minimise the cost function, to get the Large Margin Decision Boundary

We can rewrite the SVM cost function in a way that uses the vector inner product. Minimising the rewritten cost function then forces the decision boundary to have the largest possible margin.

What are Kernels?

We use a kernel in order to develop complex nonlinear classifiers. Without a kernel (sometimes we call it a linear kernel), we can only develop linear classifiers. SVM is about the cost function, whereas Kernel is about the hypothesis function.

How to use Kernels in a hypothesis function?

Without a kernel, we would write a polynomial hypothesis function. With kernels, we replace the polynomial terms (the x's) by f's, where the f's are the outputs of the kernel function. The Gaussian kernel is the most popular one; it measures the similarity of two vectors, returning a value close to 1 when they are very similar and close to 0 when they are very different. Each training example is an n-dimensional vector, and we have m of them in the training set. We can learn parameters θ so that when a vector is similar to certain other vectors (the landmarks), the hypothesis function is > 0 and we predict 1.
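A small NumPy sketch of the Gaussian (RBF) kernel as a similarity between an example x and a landmark l (σ controls how quickly the similarity falls off); the function name is my own:

import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    # similarity f = exp(-||x - l||^2 / (2 * sigma^2)):
    # close to 1 when x is near the landmark l, close to 0 when x is far from it
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))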

SVM, Logistic Regression, or Neural Network?

It's not always obvious which learning algorithm to choose for a classification problem. Here's the recommended best-practice guide (it depends mainly on the number of features n relative to the number of training examples m):

Questions

Is the hypothesis function of SVM different from the hypothesis function of Logistic Regression? Why?
It seems - to be verified - that the hypothesis function of SVM becomes:
h = 1, when θ'*x ≥ 0
h = 0, when θ'*x < 0