Monday, July 25, 2016

Understand quicksort partitioning

The most important part of understanding quicksort is to understand the partitioning function.

Here is an intuitive way to picture the array being partitioned. Let's say we've selected a pivot from the middle of the array and its value is 100.
Where the pivot sits before and after the partitioning doesn't really matter, as long as it ends up on the correct side of the resulting boundary. What really matters is that when partitioning is done, all elements to the left of the left index are smaller than the pivot (equivalently, all elements to the right of the right index are larger than the pivot). Let's assume the pivot element 100 has already been swapped to the blue part.
  1. The left index keeps increasing, and every element it leaves behind is smaller than the pivot. At the same time, the right index keeps decreasing, and every element it leaves behind is larger than the pivot.
  2. If the left index meets an element larger than or equal to the pivot, it stops and waits for the right index to find an element smaller than or equal to the pivot; then the two elements are swapped, and step 1 continues until the left index is larger than the right index. What if the right index doesn't find an element smaller than or equal to the pivot? Then it simply keeps moving left until it passes the left index, stopping at the element just before the left index (whose value is smaller than the pivot), and the partitioning is done.
  3. Return the left index as the boundary: every element to its left is smaller than the pivot. A runnable sketch of this scheme follows below.
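Here is a minimal Python sketch of this partitioning scheme and a quicksort built on it (my own illustration, not from the original post; it picks the middle element as the pivot, matching the description above):

```python
def partition(a, lo, hi):
    """Partition a[lo..hi] around the value of the middle element.

    Returns an index p such that every element of a[lo..p-1] is <= pivot
    and every element of a[p..hi] is >= pivot.
    """
    pivot = a[(lo + hi) // 2]
    left, right = lo, hi
    while left <= right:
        # The left index advances past elements smaller than the pivot.
        while a[left] < pivot:
            left += 1
        # The right index retreats past elements larger than the pivot.
        while a[right] > pivot:
            right -= 1
        # Both stopped: swap the two offending elements and keep going.
        if left <= right:
            a[left], a[right] = a[right], a[left]
            left += 1
            right -= 1
    return left  # the boundary described in step 3


def quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    p = partition(a, lo, hi)
    quicksort(a, lo, p - 1)  # the "smaller than the pivot" side
    quicksort(a, p, hi)      # the "larger than the pivot" side


data = [100, 33, 150, 7, 100, 42, 256, 1]
quicksort(data)
print(data)  # [1, 7, 33, 42, 100, 100, 150, 256]
```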
Bonus: Have you seen this amusing Hungarian quicksort dance before? 
https://www.youtube.com/watch?v=ywWBy6J5gz8

Thursday, July 14, 2016

Stanford Machine Learning Final Week Review! Congratulations!

What is a "machine learning pipeline"?

It's a system with many stages / components, several of which may use machine learning. When dealing with complex real-world problems, we often need to solve several sub-problems in order to solve the overall problem. For example, one way to solve the Photo OCR (Optical Character Recognition) problem is to solve the following sub-problems:
  1. Text detection
  2. Character segmentation
  3. Character classification
Each of the above sub-problems also involves using machine learning (classification). Text detection uses Sliding Windows detection.
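As a rough illustration of the sliding windows idea (my own sketch, not course code; the `window_size` and `step` values are arbitrary), text detection slides a fixed-size window across the image and runs a text / no-text classifier on every patch:

```python
import numpy as np

def sliding_windows(image, window_size=(32, 32), step=8):
    """Yield (row, col, patch) for every window position in the image."""
    h, w = window_size
    rows, cols = image.shape[:2]
    for r in range(0, rows - h + 1, step):
        for c in range(0, cols - w + 1, step):
            yield r, c, image[r:r + h, c:c + w]

# Each patch is then fed to a trained text / no-text classifier; the windows
# flagged as "text" are expanded and merged into candidate text regions.
image = np.random.rand(240, 320)   # stand-in for a greyscale photo
patches = list(sliding_windows(image))
print(len(patches), "windows to classify")
```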

How can we know which stage / component is more important, and therefore which one deserves more of our time?

By using ceiling analysis, we can tell which part of the pipeline is more important and should be worked on next. If we have a single numeric evaluation metric that tells us how good the overall pipeline performance is, we can ask: if one particular stage worked perfectly, what would the overall performance become? If the overall performance would improve a lot, that stage is the right one to work on next. If it would barely improve, then even spending a lot of time making that stage perfect wouldn't help the overall performance much.
In the course's Photo OCR example, making the text detection stage perfect would improve overall accuracy by 17%, so we should spend more time improving text detection than, say, character segmentation, which would improve overall accuracy by only 1%.
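A tiny sketch of how the ceiling-analysis numbers are read (the accuracies are the ones I recall from the course's Photo OCR example, so treat them as illustrative): each row is the end-to-end accuracy measured after manually feeding that stage perfect output, and the gain over the previous row is that stage's ceiling.

```python
# (stage given perfect output, resulting end-to-end accuracy)
stages = [
    ("Overall system (nothing fixed)", 0.72),
    ("Perfect text detection",         0.89),
    ("Perfect character segmentation", 0.90),
    ("Perfect character recognition",  1.00),
]

previous = stages[0][1]
for name, accuracy in stages[1:]:
    print(f"{name}: potential gain {accuracy - previous:+.0%}")
    previous = accuracy
# Perfect text detection: potential gain +17%
# Perfect character segmentation: potential gain +1%
# Perfect character recognition: potential gain +10%
```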

Finally, I've earned the certificate :)

Stanford Machine Learning Week 10 review

What is the problem with learning from a large dataset?

When training the parameters on a large dataset, such as 100 million training examples or more, we may not be able to fit all the training examples into memory. However, we do need all the training examples to calculate the partial derivatives for all the parameters. That means for each step of gradient descent, we have to compute the next θ values by accumulating hθ(x) - y over all the training examples. How can we deal with this if the training set is too large to fit into memory?
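For reference, this is the batch gradient descent update the paragraph describes, written in the course's notation (the sum over all m training examples is exactly what becomes impractical when m is huge; the update is applied simultaneously for every j):

```latex
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m}
            \left( h_\theta\!\left(x^{(i)}\right) - y^{(i)} \right) x_j^{(i)}
```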

Three solutions

  • Stochastic Gradient Descent
  • Mini-Batch Gradient Descent (may be more efficient)
  • Map Reduce (multiple machines)

Stochastic Gradient Descent

Each update of θ is calculated from only one training example, which makes stochastic gradient descent much faster than batch gradient descent: each baby step (each update of the θ values) of batch gradient descent uses all the training examples, whereas each baby step of stochastic gradient descent uses only one. One caveat is that the baby steps of stochastic gradient descent generally move the parameters in the direction of the global minimum, but not always. In fact, stochastic gradient descent doesn't converge in the same sense batch gradient descent does; it ends up wandering around continuously in some region close to the global minimum rather than reaching the global minimum and staying there. In practice this isn't a problem, since the resulting hypothesis is still pretty good.
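A minimal NumPy sketch of the stochastic update for linear regression (my own illustration of the idea; the learning rate, epoch count, and synthetic data are arbitrary):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, epochs=10):
    """Update theta from one (shuffled) training example at a time."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        for i in np.random.permutation(m):      # visit examples in random order
            error = X[i] @ theta - y[i]         # h_theta(x^(i)) - y^(i)
            theta -= alpha * error * X[i]       # baby step using this example only
    return theta

# Usage: X already includes a column of ones for the intercept term.
X = np.c_[np.ones(1000), np.random.randn(1000, 2)]
y = X @ np.array([4.0, 3.0, -2.0]) + 0.1 * np.random.randn(1000)
print(stochastic_gradient_descent(X, y))        # roughly [4, 3, -2]
```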

Stochastic gradient descent convergence

We can plot a learning curve to check whether our stochastic gradient descent is converging as more and more iterations are run. The x-axis is the number of iterations, and the y-axis is the cost averaged over the most recent examples. In the course's plots, a smaller learning rate gives a slightly better result; averaging the cost over the last 5000 examples instead of the last 1000 gives a smoother curve; and if the learning rate is too large, the learning curve may even diverge.

Mini-Batch Gradient Descent

It's just another variation of stochastic gradient descent: instead of using one training example at a time, it uses a small batch of training examples for each update. Mini-batch gradient descent is likely to outperform stochastic gradient descent only if you have a good vectorised implementation that parallelises the computation over the batch, as sketched below.
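A sketch of the mini-batch variant, vectorised over each batch (again my own illustration; the batch size of 10 is within the 2–100 range the course suggests):

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.01, batch_size=10, epochs=10):
    """Each theta update uses a small batch of examples, computed vectorised."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        order = np.random.permutation(m)
        for start in range(0, m, batch_size):
            batch = order[start:start + batch_size]
            errors = X[batch] @ theta - y[batch]          # vector of h - y for the batch
            gradient = X[batch].T @ errors / len(batch)   # one vectorised matrix product
            theta -= alpha * gradient
    return theta
```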

Map Reduce

Any learning algorithm that can be expressed as a summation over the training set can use Map Reduce to speed up the computation by parallelising it over different machines. So whether it's linear regression, logistic regression, or a neural network, they can all use Map Reduce. For example, if we have 400 million training examples, we can partition them across 4 machines, let each machine compute the sum over its quarter of the training examples, and then have another machine combine the 4 partial results to compute the partial derivatives.
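A toy sketch of that map / reduce split for the gradient sum, with Python processes standing in for the separate machines (illustrative only):

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    """Map step: one worker sums (h_theta(x) - y) * x over its shard."""
    X_shard, y_shard, theta = args
    errors = X_shard @ theta - y_shard
    return X_shard.T @ errors

def mapreduce_gradient(X, y, theta, workers=4):
    """Reduce step: combine the workers' partial sums into the full gradient."""
    shards = [(Xs, ys, theta)
              for Xs, ys in zip(np.array_split(X, workers), np.array_split(y, workers))]
    with Pool(workers) as pool:
        partial_sums = pool.map(partial_gradient, shards)
    return sum(partial_sums) / len(y)
```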

What is Online Learning?

When you continuously have a stream of new training examples and you don't want to save all the previous ones (because new data keeps arriving), you can use online learning: it keeps running the stochastic gradient descent update forever, one fresh example at a time.
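A sketch of the online-learning loop (the `next_example` callable is hypothetical; it stands for whatever stream of fresh data the system receives):

```python
import numpy as np

def online_learning(next_example, n_features, alpha=0.01):
    """Run the stochastic update forever on a stream, discarding each example."""
    theta = np.zeros(n_features)
    while True:
        x, y = next_example()                   # fetch one fresh (features, label) pair
        theta -= alpha * (x @ theta - y) * x    # update, then the example is thrown away
```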

Stanford Machine Learning Week 9 review

What is Anomaly Detection?

Given a bunch of x's (where each x is a vector, i.e. a training example), decide for a new example x (a new vector) whether its probability p(x) < ε (epsilon). If so, it is considered an anomaly; otherwise, it is considered normal.

What is Gaussian (Normal) distribution?

Gaussian distribution is defined as X ~ N(μ, σ^2), where μ (pronounced mu) is the mean of x; σ (pronounced sigma) is the standard deviation; σ^2 is the variance.
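The corresponding density, in the course's notation:

```latex
p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}
                      \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)
```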

Anomaly Detection Algorithm

p(x;μ,σ^2) is the Gaussian density of x for a fixed value of μ and of σ^2. The algorithm fits a μ_j and σ_j^2 for each feature from the training examples, models p(x) as the product of the per-feature Gaussian densities, and flags a new x as an anomaly when p(x) < ε.
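A compact sketch of that recipe (my own code; ε = 0.02 and the synthetic data are only placeholders, since ε is normally chosen on a labelled cross-validation set):

```python
import numpy as np

def fit_gaussians(X):
    """Estimate a per-feature mean and variance from the (normal) training set."""
    return X.mean(axis=0), X.var(axis=0)

def p(x, mu, sigma2):
    """p(x) modelled as a product of independent per-feature Gaussian densities."""
    densities = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod()

X_train = np.random.randn(500, 2)       # stand-in for non-anomalous examples
mu, sigma2 = fit_gaussians(X_train)
epsilon = 0.02
x_new = np.array([4.0, -4.0])
print("anomaly" if p(x_new, mu, sigma2) < epsilon else "normal")
```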

Recommender Systems: Collaborative Filtering Algorithm

The Collaborative Filtering Algorithm is based on linear regression. We can think of each movie as having features x1, x2, ..., xn: x1 may represent how romantic the movie is, x2 may represent how action-packed it is, etc. Then, if a user has rated enough movies (where each rating y satisfies 1 <= y <= 5), we can use linear regression to learn that user's parameters θ and predict hθ(x) for any movie, given its features.
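In the course's notation, user j's predicted rating for movie i is the inner product of the user's parameter vector and the movie's feature vector:

```latex
\hat{y}^{(i,j)} = \left( \theta^{(j)} \right)^{T} x^{(i)}
```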

Learning features

Not only can we learn the θs of each user, we can even learn the feature values of each movie automatically by using the Collaborative Filtering Algorithm.
Given a dataset that consists of a set of ratings produced by some users on some movies, we wish to learn the parameter vectors x(1),...,x(nm), θ(1),...,θ(nu) that produce the best fit (minimize the squared error).
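The objective being minimized is the squared error over all (movie, user) pairs that actually have a rating, plus regularization terms, as in the course:

```latex
J\bigl(x^{(1)},\dots,x^{(n_m)},\theta^{(1)},\dots,\theta^{(n_u)}\bigr)
  = \tfrac{1}{2} \sum_{(i,j)\,:\,r(i,j)=1} \Bigl( \bigl(\theta^{(j)}\bigr)^{T} x^{(i)} - y^{(i,j)} \Bigr)^{2}
  + \tfrac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^{n} \bigl(x_k^{(i)}\bigr)^{2}
  + \tfrac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \bigl(\theta_k^{(j)}\bigr)^{2}
```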