Thursday, June 30, 2016

7 ways to boost your brain productivity, based on scientifically proven research

Sandra Bond Chapman once met an autistic boy and was fascinated by his problem-solving skills. She then decided to dedicate her life to studying and unlocking the immense potential of the human brain. She is now a leading cognitive neuroscientist at The University of Texas, discovering approaches to build brain resilience, strengthen healthy brain development and repair brain function.

In the talk, she gave 7 tips - all based on scientifically proven research - to boost your brain productivity. Here's a brief summary:


  1. Single task
  2. Inhibit information: focus only on the few things you're actually doing
  3. Detox distractions
  4. Big idea thinking: generalising, synthesising
  5. Calibrate mental effort: spend your best-quality time (e.g. mornings) and longer stretches on the things that are important, not on the things we should finish quickly (e.g. the tasks your manager often asks you to do and labels as quite urgent)
  6. Innovate: the brain doesn't like the status quo. Innovative conversations can even spark new romance :)
  7. Motivate: reinvent the activities you don't like to make them more interesting. Motivation increases our most powerful neurotransmitter - dopamine - which makes us happier and speeds up learning

I might need to work on my Big Idea Thinking - generalising and synthesising - as I find it very helpful for understanding the whole picture. It also helps my audience understand better, stay interested and follow the topic.

Source: https://youtu.be/uUL5o-1Yawo

Wednesday, June 22, 2016

Stanford Machine Learning Week 6 review


How to choose the number of layers in a neural network?

A good default is to start with a single hidden layer. The more hidden layers you have, the more prone you are to high-variance problems; having too few layers makes the network prone to high bias. A practical strategy is to increment the number of hidden layers - much like adding polynomial features or decreasing the regularisation parameter λ in a normal cost function to fix high bias - and obtain the Θ for each candidate network. Then use each Θ to calculate the cross-validation cost, and choose the Θ (and thus the number of layers) with the minimum cross-validation cost.
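The selection loop above can be sketched in plain Python. Everything here is a hypothetical stand-in: `train` and `cv_cost` fake the real forward/back propagation and the real cross-validation cost, just to show the shape of the search.

```python
import random

def train(num_hidden_layers, training_set):
    """Hypothetical stand-in for real training: in practice this would run
    forward/back propagation and return the fitted Theta matrices."""
    random.seed(num_hidden_layers)               # deterministic toy "training"
    return [random.random() for _ in range(num_hidden_layers + 1)]

def cv_cost(theta, cv_set):
    """Hypothetical stand-in for the cross-validation cost J_cv(Theta)."""
    return abs(sum(theta) - len(cv_set) / 100)   # toy cost, not a real J_cv

training_set = list(range(100))
cv_set = list(range(20))

# Try 1, 2, 3, 4 hidden layers; keep the Theta whose CV cost is lowest.
best_layers, best_theta, best_cost = None, None, float("inf")
for num_layers in range(1, 5):
    theta = train(num_layers, training_set)
    cost = cv_cost(theta, cv_set)
    if cost < best_cost:
        best_layers, best_theta, best_cost = num_layers, theta, cost
```

With real training and a real J_cv plugged in, `best_layers` is the architecture you'd keep.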

What's a Learning Curve?

A learning curve shows whether you have a high-variance or a high-bias problem. The horizontal axis represents m, the training set size; the vertical axis represents the errors (training error and cross-validation error).

High Bias

When m increases, the training error and cross-validation error converge and both remain high. Adding more training data does not help.


High Variance

When m increases, the training error and cross-validation error also converge, but there is a gap between them, and adding new training data helps.


Good practice when plotting learning curves with small training sets: it's often helpful to average across multiple sets of randomly selected examples to determine the training error and cross-validation error. For example, for m = 10, randomly choose 10 examples from the training set and 10 examples from the cross-validation set, then calculate the training error and cross-validation error. Do this 50 times and average the results to get the training error and cross-validation error for m = 10.
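That averaging step can be sketched like this. The `error` function is a made-up stand-in (a real one would fit a model on each subset and measure its error); only the sampling-and-averaging structure is the point.

```python
import random

def error(theta, examples):
    """Hypothetical stand-in for the error of a fitted model on `examples`."""
    return sum(x % 7 for x in examples) / (10 * len(examples))

training_set = list(range(1000))
cv_set = list(range(1000, 1500))

def averaged_errors(m, rounds=50):
    """Average training and CV errors over `rounds` random size-m subsets."""
    train_errs, cv_errs = [], []
    for _ in range(rounds):
        train_sub = random.sample(training_set, m)
        cv_sub = random.sample(cv_set, m)
        theta = None               # in practice: fit the model on train_sub here
        train_errs.append(error(theta, train_sub))
        cv_errs.append(error(theta, cv_sub))
    return sum(train_errs) / rounds, sum(cv_errs) / rounds

avg_train_err, avg_cv_err = averaged_errors(10)
```

Repeating this for each m gives one smoothed point per m on the learning curve.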

What's Training Set, Cross Validation Set, and Test Set?

A good (supervised learning) practice is to divide the data into 3 groups:

  • Training Set : learn the model parameters θ
  • Cross Validation Set : select the regularisation parameter λ to trade off between high bias and high variance
  • Test Set : evaluate the "final" model
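A hypothetical helper for this three-way split, in plain Python. The 60/20/20 proportions are a common convention, not something prescribed by the list above.

```python
import random

def split_data(examples, seed=0):
    """Shuffle the data and split it 60% train / 20% CV / 20% test."""
    data = list(examples)
    random.Random(seed).shuffle(data)      # shuffle before splitting
    n = len(data)
    n_train = int(0.6 * n)
    n_cv = int(0.2 * n)
    return (data[:n_train],
            data[n_train:n_train + n_cv],
            data[n_train + n_cv:])

train_set, cv_set, test_set = split_data(range(100))
```

The shuffle matters: if the data is ordered (say, by label), an unshuffled split would give the three sets different distributions.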

Recommended approach to develop a learning algorithm

  1. Start with a very simple, quick-and-dirty algorithm that you can implement quickly. Implement it and test it against your cross-validation data.
  2. Plot learning curves to decide whether more data or more features are likely to help.
  3. Error analysis: manually examine the examples (in the cross-validation data) that your algorithm made errors on. See if you can spot any systematic trend in the types of examples it misclassifies. For example, for an email spam classification problem, you could manually examine (1) what type of email it is - Pharma, Replica, Password Stealing? If most of the cross-validation errors involve password-stealing emails, then it's worth spending some time to see if you can come up with better features to categorise that kind of spam correctly. (2) What features you think would have helped the algorithm classify them correctly. Finally, ALWAYS TEST your assumptions against the cross-validation data.

What are Precision and Recall, and when are they useful?


When solving classification problems, such as cancer classification, we might be proud to see 1% error on the test set (99% correct diagnoses). But wait: only 0.5% of patients actually have cancer. If we used a "cheat" version of the hypothesis that predicts no patient has cancer, we would get 99.5% correct diagnoses. Yet the "cheat" version isn't actually improving our prediction algorithm, even though its accuracy is better. This situation arises when we have skewed classes.

Precision and Recall to the rescue

Precision : Of all the patients we predicted True (having cancer), what fraction actually has cancer? In the figure below, the denominator is the first row (all predicted True).
Recall : Of all the patients actually having cancer, what fraction did we correctly predict as having cancer? In the figure below, the denominator is the first column (all actual True).


The "cheat" version would have a recall of 0, as it predicts that no patient has cancer: recall = zero/non-zero = 0. So precision and recall reveal that the "cheat" version is not an improvement.
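Both definitions can be sketched in a few lines of Python, applied to the skewed-classes numbers from the text (1000 patients, 0.5% with cancer; the label lists are made up for illustration):

```python
def precision_recall(predicted, actual):
    """predicted/actual are lists of 0/1 labels (1 = has cancer)."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0   # of all predicted True
    recall = tp / (tp + fn) if tp + fn else 0.0      # of all actual True
    return precision, recall

# 1000 patients, 5 with cancer (0.5% -- skewed classes).
actual = [1] * 5 + [0] * 995
cheat = [0] * 1000               # "cheat" hypothesis: nobody has cancer
print(precision_recall(cheat, actual))   # → (0.0, 0.0), despite 99.5% accuracy
```

The guard clauses matter: the "cheat" hypothesis makes no positive predictions at all, so precision's denominator would be zero.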

Trading off precision and recall

When using logistic regression, we set a threshold on hθ(x). If the threshold is 0.5, we predict 1 if hθ(x) ≥ 0.5, and we predict 0 if hθ(x) < 0.5.

Suppose we want to predict y = 1 (cancer) only when we're very confident, so as not to scare patients. We would set the threshold high, which results in higher precision and lower recall.

But if we want to be more conservative and avoid missing too many cases of cancer, we would set the threshold low, which results in higher recall and lower precision.
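Sweeping the threshold over some hypothetical hθ(x) outputs makes the trade-off concrete (the probabilities and labels below are invented for illustration):

```python
def predict(probabilities, threshold):
    """Predict 1 when h_theta(x) exceeds the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

def precision_recall(predicted, actual):
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
    fn = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical h_theta(x) outputs and the true labels.
probs  = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
actual = [1,   1,   0,   1,   0,   0]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(predict(probs, threshold), actual)
    print(threshold, p, r)
```

On this toy data the high threshold (0.7) gives precision 1.0 but recall 2/3, while the low threshold (0.3) gives recall 1.0 but precision 0.75: exactly the trade-off described above.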

When does using a large training set make sense?

It only makes sense when we have a low-bias algorithm - one with (1) many useful features and (2) many parameters θ. In this case, increasing the training set size helps fix the overfitting problem, and the training error gets closer to the cross-validation error. If the features x do not contain enough information to predict y accurately (such as predicting a house's price from its size alone), then even a neural network with a large number of hidden units won't work. We can ask ourselves whether the features x contain enough information by imagining a human expert looking at them: could he or she confidently predict the value of y? Looking only at a house's size, even a realtor cannot confidently predict its price; he or she needs more information: number of rooms, which part of the city, etc.

Questions?

When features are badly scaled (e.g. a polynomial of degree 8: with x = 40, x^8 is enormous), we need to normalise them. How do we do feature normalisation? What are mu and sigma? Do some further reading and review the earlier lectures.
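As far as I recall from the earlier lectures, the answer is mean normalisation: mu is the feature's mean and sigma its standard deviation, and each value becomes (x - mu) / sigma. A quick sketch with made-up values:

```python
from math import sqrt

def normalise(values):
    """Mean normalisation: subtract the mean mu and divide by the standard
    deviation sigma, so the scaled feature has mean 0 and unit spread."""
    mu = sum(values) / len(values)
    sigma = sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values], mu, sigma

# e.g. a feature on an awkward scale
scaled, mu, sigma = normalise([40.0, 35.0, 50.0, 45.0])
```

Keep mu and sigma around: new examples must be scaled with the training set's mu and sigma, not their own.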

Tuesday, June 14, 2016

Mimicked version of Alice's Adventures in Wonderland


Why?

This is a simple algorithm I coded to practise Python.
I've been studying Machine Learning for a while, mainly by following courses on Coursera. Even though I've had some opportunities to implement training algorithms such as Gradient Descent for Linear Regression and Logistic Regression for Classification, I mainly used Matlab, which is an efficient, practical tool that's very easy to get started with. However, I think it would also be great to learn some of the ML libraries in Python, such as scikit-learn, which has already been tuned and optimised by a lot of data scientists. I also want to start doing some of the problems on Kaggle, and I'm not sure Matlab is the perfect tool for that. This is why I decided to learn Python by following the Google Python Class.

What's the mimicked version of Alice's Adventures in Wonderland?

Basically, you pick a word in the book; I picked the first word, which is "Alice". Then you follow these steps to generate your version of the book:
  • Print the chosen word, then look up which words come after it in the original novel, and pick one of them at random.
  • Repeat this 200 times.
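The two steps above can be sketched in Python roughly like this (the function names are my own, and the one-sentence `text` stands in for the full novel):

```python
import random

def build_mimic_dict(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    nexts = {}
    for word, following in zip(words, words[1:]):
        nexts.setdefault(word, []).append(following)
    return nexts

def mimic(nexts, start, length=200):
    """Emit `length` words, each chosen at random among the followers."""
    word, output = start, []
    for _ in range(length):
        output.append(word)
        followers = nexts.get(word)
        if not followers:              # dead end: restart from the start word
            word = start
        else:
            word = random.choice(followers)
    return " ".join(output)

text = "Alice was beginning to get very tired of sitting by her sister"
print(mimic(build_mimic_dict(text), "Alice", length=10))
```

With the whole novel as input, most words have many possible followers, which is where the random, dream-like output comes from.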

Can you show me the mimicked version?

Here's the mimicked version I generated with the algorithm mentioned above :)
Alice's elbow was gently brushing away quietly marched off together, Alice a minute, `and I can be seen: she spread his book, `Rule Forty-two. ALL RETURNED FROM HIM TWO--" why, I ought to a well?' The first saw the mistake about for her eyes were of him), while she is asleep again,' said after her: its full of the opportunity of things--I can't have signed at once set to feel very dull!' `You might belong to my youth,' said the right THROUGH the Mouse with such nonsense!' `I feared it might knock, and his pocket, and anxious.) Alice to be civil, you'd better with the sea,' the rattle of the Cat went on a week: HE was.' `I feared it was certainly was, what? Alice called after it, and Grief, they don't believe there's nothing seems to be a thick wood. `The fourth.' `Two days and then, saying to speak again. This was I the others. `Are they were nowhere to have to Alice said; but there stood the King. `Nearly two and hurried on, looking over their own business!' said Alice. `Why the stairs. Alice thought was coming. It sounded promising, certainly: but the Mock Turtle, and doesn't understand
Isn't it fun? :)
I'll continue following the rest of the ML course. I hope I can one day use what I've learned to solve real-world problems.

Sunday, June 12, 2016

Advice

This is an article that I'd like to update regularly. It should be a less technical article. I will write down some of the philosophies about life and work that I agree with, or at least want to give a try. Happy learning :)

Pomodoro technique:
Concentrate on your work for 20 minutes (without any distraction, like Facebook), then take a 5-minute break. The point is to get something done within those 20 minutes. This technique also helps avoid procrastination.

Coding:
Sometimes you can see a problem in a different way and rewrite it so that a special case goes away and becomes the normal case.  -- Linus Torvalds

Profession:
Don't chase money, chase happiness. -- Liz Wessel

Saturday, June 11, 2016

Stanford Machine Learning Week 5 review

I just finished the 5th week of the Stanford Machine Learning course: Neural Networks: Learning. Since this week's material is a little difficult, I thought I might as well write something as a reminder, so that I can look back at it in the future.
I do feel that a neural network is a powerful way to predict the outcome when there are enough training data, hidden layers and intermediate neurons. A simple and practical neural network consists of three layers: an input layer, a hidden layer, and an output layer. In fact, one of the first versions of a self-driving car was built on a three-layer neural network.
Forward propagation is straightforward and intuitive. Back propagation, on the other hand, is a bit difficult for me to grasp conceptually. The good news is that I somehow managed to understand the implementation.

The Neural Network Learning Algorithm on a high level

  • Calculate the cost function, given multiple matrices of thetas* (one matrix per layer). In the end, we have a cost for the given thetas. The cost represents how "far" our prediction is from the "reality".
  • Calculate the gradient (a.k.a. partial derivative) for each theta. In the end, we have a concrete numerical gradient value for each theta. Back propagation takes place in this step.
  • With the ability to calculate the cost and the gradient for each theta, we can use an optimisation function such as fminunc (or plain gradient descent) to iterate: random initial thetas --> calculate cost and gradients --> update thetas --> lower cost and new gradients --> update thetas --> lower cost and new gradients --> ... --> until we reach the minimum cost. This is the glorious moment when we obtain the optimised thetas, which allow the neural network to make its most accurate predictions.
*theta is the weight on X, where X is the feature vector used to predict the outcome.
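The iteration in the last bullet can be sketched with a toy quadratic cost standing in for the real neural-network cost and back-propagation gradients. Everything here is a hypothetical stand-in; a real implementation would hand `cost` and `gradients` to fminunc or an equivalent optimiser instead of this hand-rolled loop.

```python
import random

def cost(thetas):
    """Hypothetical stand-in for the neural-network cost J(Theta)."""
    return sum((t - 3.0) ** 2 for t in thetas)

def gradients(thetas):
    """Partial derivatives of the toy cost, one per theta."""
    return [2.0 * (t - 3.0) for t in thetas]

random.seed(0)
thetas = [random.uniform(-1, 1) for _ in range(4)]   # random initial thetas
alpha = 0.1                                          # learning rate

# cost + gradients --> update thetas --> repeat, until the cost stops shrinking
for _ in range(200):
    grads = gradients(thetas)
    thetas = [t - alpha * g for t, g in zip(thetas, grads)]

final_cost = cost(thetas)
```

On this toy cost every theta converges to 3.0 and the cost to (almost) zero; with a real J(Theta) the loop shape is the same, only the cost and gradient computations change.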

 

Questions

Though I finished the assignment of this week, there are still some parts that I need to do more research on to understand better:
  • What exactly does δ (delta) represent in back propagation?
  • Why do we use the derivative of the sigmoid function, g'(z), to calculate δ (delta) from the last layer back through the hidden layers?
  • Why is δ^(l+1) · (a^(l))^T the gradient (a.k.a. partial derivative) matrix at layer l?

Accomplishment

  • Built a neural network that recognises images of the digits 1 - 9 with 96% accuracy.
  • Visualised the hidden-layer images, each of which represents a row of theta in the input layer, which computes one neuron in the hidden layer. There are 25 neurons in the hidden layer.
  • Passed with a 100% code score.