http://blog.csdn.net/pipisorry/article/details/44119187

Machine Learning - Andrew Ng course study notes

Machine Learning System Design

Prioritizing What to Work On

The first decision we must make is how to represent x, that is, the features of the email.

Machine Learning - XI. Machine Learning System Design (Week 6): system evaluation criteria
Note: choosing features

1. Choose a hundred words manually to use for this representation.

2. In practice, look through the training set, pick the n most frequently occurring words, where n is usually between ten thousand and fifty thousand, and use those as your features.
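The second approach can be sketched in a few lines. This is only a minimal illustration: the function names, the whitespace tokenizer, and the three tiny training emails are assumptions made for the example, and n is set to 3 instead of the tens of thousands mentioned above.

```python
from collections import Counter

def build_vocabulary(emails, n):
    """Return the n most frequent words across the training emails."""
    counts = Counter()
    for text in emails:
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(n)]

def to_feature_vector(text, vocabulary):
    """Binary feature vector: x[j] = 1 if vocabulary[j] appears in the email."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical toy training set.
train_emails = [
    "buy cheap watches now",
    "cheap deal buy now",
    "meeting notes for andrew",
]

vocabulary = build_vocabulary(train_emails, n=3)
x = to_feature_vector("buy now", vocabulary)
print(vocabulary, x)
```

Each email is then represented by a vector x with one entry per vocabulary word, which is what the classifier is trained on.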

Using data preprocessing to reduce the error rate

Note:

1. Getting lots of data will often help, but not all the time.

2. When spammers send email, they very often try to obscure the origins of the email, for example by using fake email headers or by sending the email through very unusual sets of computer servers and very unusual routes, in order to get the spam to you.
3. The spam classifier might not equate "w4tches" with "watches", so it may have a harder time recognizing that something is spam when it contains these deliberate misspellings. This is why spammers do it.
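One possible preprocessing step for point 3 is to map look-alike characters back to letters before extracting word features. The sketch below is an assumption for illustration only: the substitution table `LOOKALIKES` and the function `normalize` are hypothetical, and a real system would need a much richer treatment of misspellings.

```python
# Hypothetical table of look-alike characters (assumption, far from complete).
LOOKALIKES = str.maketrans({
    "4": "a", "@": "a",
    "0": "o",
    "1": "i",
    "3": "e",
    "$": "s",
})

def normalize(token):
    """Undo simple character substitutions in tokens that contain letters."""
    token = token.lower()
    # Leave purely numeric tokens such as "2014" untouched.
    if any(ch.isalpha() for ch in token):
        return token.translate(LOOKALIKES)
    return token

print(normalize("w4tches"))   # watches
print(normalize("M0rtgage"))  # mortgage
print(normalize("2014"))      # 2014
```

Whether such a step is worth building is exactly the kind of question that error analysis on the misclassified emails is meant to answer.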

皮皮blog

Error Analysis

{Error analysis gives you a more systematic way to choose among different ideas for improving the algorithm. It is a quick way to identify errors and find the hard examples, so that you can focus your efforts on those.}

Recommended approach for designing a machine learning system

Note: Error analysis on the emails can inspire you to design new features. It can also reveal the current shortcomings of the system and give you the inspiration you need to come up with improvements to it.

An example of error analysis

Note:

1. Compute the accuracy: Accuracy = (true positives + true negatives) / (total examples).

2. By counting up the number of emails in these different categories, you might discover, for example, that the algorithm does particularly poorly on emails trying to steal passwords. That may suggest it is worth your effort to look more carefully at that type of email and see if you can come up with better features to categorize them correctly.
3. If a cue such as unusual punctuation shows up in many of the misclassified emails, that is a strong sign that it might be worth your while to spend the time to develop more sophisticated features based on the punctuation.
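The counting in the notes above can be sketched as follows. The category labels and the counts are made up for illustration; in practice they would come from manually inspecting the misclassified emails in the cross-validation set.

```python
from collections import Counter

# Hypothetical hand-assigned categories of misclassified cross-validation emails.
misclassified = [
    "pharma", "steal passwords", "replica", "steal passwords",
    "other", "steal passwords", "pharma",
]

by_category = Counter(misclassified)
worst_category, worst_count = by_category.most_common(1)[0]
print(by_category)
print(worst_category, worst_count)  # steal passwords 3

def accuracy(true_positives, true_negatives, total_examples):
    """Accuracy = (true positives + true negatives) / (total examples)."""
    return (true_positives + true_negatives) / total_examples

# Hypothetical counts: 40 spam and 440 non-spam emails classified correctly out of 500.
print(accuracy(40, 440, 500))
```

The category with the largest count is the natural place to look first for new features.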

Numerical evaluation of your learning algorithm

Note:

1. Using stemming software can help, but it can also hurt.
2. We'll see later examples where coming up with this kind of single real-number evaluation metric may need a little more work. Once you have one, it lets you make these decisions much more quickly.

Error Metrics for Skewed Classes (precision/recall)

Skewed classes: in this case, the number of positive examples is much, much smaller than the number of negative examples. In other words, the two classes are imbalanced, with one class far outnumbering the other, and in that situation accuracy is no longer a useful metric.

Note:

1. So a non-learning algorithm that just predicts y = 0 all the time does even better than the learning algorithm's 1% error.

2. Suppose a change takes the algorithm from 99.2% accuracy to 99.5% accuracy. Is that a good change to the algorithm or not? With skewed classes it becomes much harder to use just classification accuracy, because you can get very high classification accuracies or very low errors, and it is not always clear whether doing so really improves the quality of your classifier: predicting y = 0 all the time doesn't seem like a particularly good classifier.
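A small numerical sketch makes the problem concrete. The data set below is hypothetical: 1000 examples of which only 5 are positive, chosen so that the always-negative rule reaches 99.5% accuracy.

```python
# Hypothetical skewed data set: 5 positives (y = 1) out of 1000 examples.
y_true = [1] * 5 + [0] * 995

# "Non-learning" rule: ignore the input and always predict y = 0.
y_pred = [0] * len(y_true)

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)
detected = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(accuracy)  # 0.995
print(detected)  # 0 positives found
```

The rule scores 99.5% accuracy while detecting none of the positive examples, which is why accuracy alone is misleading here.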

Faced with such skewed classes, we therefore want a different error metric, called precision/recall.

Precision/Recall

Note:

1. A learning algorithm that predicts y = 0 all the time has recall equal to zero, so we recognize that it just isn't a very good classifier.
2. By convention, y = 1 rather than y = 0 is defined to mean the presence of the rare class that we are trying to detect. Precision and recall come out differently depending on which class is labeled 1 and which is labeled 0; generally the class with fewer examples is chosen as 1.
Summary: Precision/recall is often a much better way to evaluate our learning algorithms than classification error or classification accuracy when the classes are very skewed.
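As a minimal sketch, precision and recall can be computed from the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), with y = 1 as the rare class. The function name, the toy labels, and the choice to return 0.0 when a denominator is zero are assumptions for this example (precision is strictly undefined when nothing is predicted positive).

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall), treating y = 1 as the rare positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention (assumption): report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Hypothetical labels: 3 actual positives among 10 examples.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(precision_recall(y_true, y_pred))     # 2 of 3 predicted positives are right, 2 of 3 actual positives are found

# The always-negative rule gets recall 0, exposing it as a poor classifier.
always_zero = [0] * len(y_true)
print(precision_recall(y_true, always_zero))  # (0.0, 0.0)
```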

[1.6 Types of errors - common error metrics]

Trading Off Precision and Recall: the F1 score

Note:

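1. Suppose we want to tell someone that we think they have cancer only if we are very confident. We can then use a higher threshold instead of setting the threshold at 0.5.
2. The precision-recall curve can take many different shapes, depending on the details of the classifier.

3. To judge how a change in the threshold affects precision and recall: lowering the threshold means more y = 1 predictions, while the denominator of recall (the number of actual positives) stays the same! First work out whether recall goes up or down, then work out how precision changes.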
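The effect of the threshold can be checked on a toy example. The labels and predicted probabilities below are invented for illustration, and `predict` and `precision_recall` are hypothetical helper names.

```python
def predict(probabilities, threshold):
    """Predict y = 1 when the estimated probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]

def precision_recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Convention (assumption): report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Hypothetical classifier outputs for 8 examples, 4 of them actually positive.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [0.95, 0.80, 0.65, 0.60, 0.40, 0.30, 0.20, 0.05]

results = {}
for threshold in (0.3, 0.5, 0.7, 0.9):
    results[threshold] = precision_recall(y_true, predict(probs, threshold))
    print(threshold, results[threshold])
```

In this example recall falls from 1.0 to 0.25 as the threshold rises from 0.3 to 0.9, because the number of actual positives is fixed while fewer of them are predicted as 1. Precision moves the other way here, although in general it does not have to change monotonically.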

Is there a way to choose this threshold automatically? How do we decide which of these algorithms is best?

One way of combining precision and recall is called the F score.
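The F1 score is F1 = 2PR / (P + R), where P is precision and R is recall. The sketch below compares it with a plain average on three sets of precision/recall values that are illustrative assumptions, not results from these notes.

```python
def f1_score(precision, recall):
    """F1 = 2PR / (P + R); defined as 0 when both P and R are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative (precision, recall) pairs for three candidate algorithms.
candidates = {
    "algorithm 1": (0.5, 0.4),
    "algorithm 2": (0.7, 0.1),
    "algorithm 3": (0.02, 1.0),  # e.g. predicts y = 1 almost all the time
}

best_by_mean = max(candidates, key=lambda name: sum(candidates[name]) / 2)
best_by_f1 = max(candidates, key=lambda name: f1_score(*candidates[name]))

print(best_by_mean)  # algorithm 3
print(best_by_f1)    # algorithm 1
```

The plain average rewards the degenerate classifier with recall 1.0 and precision 0.02, while F1 is high only when both values are reasonably large. The same function can be used to pick a threshold: evaluate F1 on the cross-validation set for several thresholds and keep the one with the highest score.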

Data For Machine Learning: how data affects the performance of learning algorithms

{the issue of how much data to train on}

Note:

1. Rather than including high-order polynomial features of x.

2. Even though we have a lot of parameters, if the training set is much larger than the number of parameters, then hopefully these algorithms will be unlikely to overfit.
3. Finally, putting these two together: the training set error is small and the test set error is close to the training error, which together imply that hopefully the test set error will also be small.

4. A learning algorithm is unlikely to overfit a sufficiently large training set.

Summary: If you have a lot of data and you train a learning algorithm with a lot of parameters, that can be a good way to get a high-performance learning algorithm.

from:http://blog.csdn.net/pipisorry/article/details/44245513
