Hierarchical Performance Metrics and Where to Find Them
Hierarchical machine learning models are one top-notch trick. As discussed in previous posts, considering the natural taxonomy of the data when designing our models can be well worth our while. Instead of flattening out and ignoring those inner hierarchies, we’re able to use them, making our models smarter and more accurate.
“More accurate”, I say — are they, though? How can we tell? We are people of science, after all, and we expect bold claims to be supported by the data. This is why we have performance metrics. Whether it’s precision, f1-score, or any other lovely metric we’ve got our eye on — if using hierarchy in our models improves their performance, the metrics should show it.
Problem is, if we use regular performance metrics — the ones designed for flat, one-level classification — we go back to ignoring that natural taxonomy of the data.
If we do hierarchy, let’s do it all the way. If we’ve decided to celebrate our data’s taxonomy and build our model in its image, this needs to also be a part of measuring its performance.
How do we do this? The answer lies below.
Before We Dive In
This post is about measuring the performance of machine learning models designed for hierarchical classification. It kind of assumes you know what all those words mean. If you don’t, check out my previous posts on the topic. Especially the one introducing the subject. Really. You’re gonna want to know what hierarchical classification is before learning how to measure it. That’s kind of an obvious one.
Throughout this post, I’ll be giving examples based on this taxonomy of common house pets:
Oh So Many Metrics
So we’ve got a whole ensemble of hierarchically-structured local classifiers, ready to do our bidding. How do we evaluate them?
That is not a trivial problem, and the solution is not obvious. As we’ve seen in previous posts in this series, different projects require different treatment. The best metric could differ depending on the specific requirements and limitations of your project.
All in all, there are three main options to choose from. Let’s introduce them, shall we?
The contestants, in all their grace and glory:
The Down-To-Earth One: Flat Classification Metrics
These are the same classification metrics we all know and love (precision, recall, f-score — you name it), applied… Well, flatly.
Same as with the original “flat classification” approach (described in the first post in this series), this method is all about ignoring the hierarchy. Only the final, leaf-node predictions are considered (in our house pets example, those are the specific breeds), and they’re all considered as equal classes, without any special treatment to sibling-classes vs. non-sibling ones.
This method is simple, but, obviously, not ideal. We don’t want the errors at different levels of the class hierarchy to be penalized in the same way (if I mistook a Pegasus for a Narwhal, that’s not as bad as mistaking it for a Labrador). Also, there isn’t an obvious way to handle cases where the final prediction is not a leaf-node one — which could definitely be the case if you implemented the previously-mentioned blocking by confidence method.
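To make the flatness concrete, here’s a minimal sketch in plain Python (the leaf labels are made up for illustration) of per-class flat precision and recall computed on leaf predictions only. Note that the taxonomy never enters the computation:

```python
def flat_precision_recall(true_leaves, pred_leaves, cls):
    # Flat metrics compare leaf labels only; the class hierarchy is ignored.
    tp = sum(1 for t, p in zip(true_leaves, pred_leaves) if t == p == cls)
    predicted = sum(1 for p in pred_leaves if p == cls)
    actual = sum(1 for t in true_leaves if t == cls)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# A Labrador mistaken for a Dalmatian costs exactly as much as a
# narwhal mistaken for a Dalmatian -- the shared "dog" parent is invisible.
true_leaves = ["dalmatian", "labrador", "narwhal", "dalmatian"]
pred_leaves = ["dalmatian", "dalmatian", "dalmatian", "labrador"]
precision, recall = flat_precision_recall(true_leaves, pred_leaves, "dalmatian")
```

In practice you’d get the same numbers from scikit-learn’s `precision_recall_fscore_support`; the point is that sibling and non-sibling errors are indistinguishable here.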
The Hipster One: A Custom-Made Metric
Not happy with the flat metrics, and feel that creative spark tingling in your fingertips? You can conjure up your own special metric, which specifically fits your unique snowflake of a use case.
This could be useful where the model needs to fit some unusual business constraints. If, for example, you don’t really care about falsely identifying dogs as unicorns, but a Sphynx cat must be correctly spotted or all hell breaks loose, you can design your metrics accordingly, giving more or less weight to different errors.
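As a sketch of what such a metric could look like (the cost table below is entirely invented for illustration, not taken from any real project): a per-error cost lookup that barely penalizes the dog-as-unicorn mix-up but treats a missed Sphynx as catastrophic.

```python
# Hypothetical error costs, keyed by (true label, predicted label).
# Unlisted mistakes fall back to a default cost of 1; lower is better.
ERROR_COST = {
    ("dog", "unicorn"): 0.0,         # we don't care about this mistake at all
    ("labrador", "dalmatian"): 0.1,  # sibling breeds: mildly annoying
    ("sphynx", "labrador"): 10.0,    # a missed Sphynx cat: all hell breaks loose
}

def mean_custom_cost(true_labels, pred_labels, default=1.0):
    costs = [
        0.0 if t == p else ERROR_COST.get((t, p), default)
        for t, p in zip(true_labels, pred_labels)
    ]
    return sum(costs) / len(costs)

score = mean_custom_cost(["sphynx", "labrador", "dog"],
                         ["labrador", "dalmatian", "unicorn"])
```

The weights are where the business knowledge lives; everything else is bookkeeping.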
The Pretentious One: Hierarchy-Specific Variations on the Regular Classification Metrics
Those are variations of well-known precision, recall and f-score metrics, specifically adapted to fit hierarchical classification.
Please bear with me as I throw some math notations in your general direction:
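In case the original formula image doesn’t render here, these are the hierarchical precision, recall and F-score definitions, as given in the Silla & Freitas survey cited at the end of this post:

```latex
hP = \frac{\sum_i |P_i \cap T_i|}{\sum_i |P_i|}, \qquad
hR = \frac{\sum_i |P_i \cap T_i|}{\sum_i |T_i|}, \qquad
hF = \frac{2 \cdot hP \cdot hR}{hP + hR}
```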
What does it all mean, though?
Pi is the set consisting of the most specific class (or classes, in case of a multi label problem) predicted for each test example i, and all of its/their ancestor classes; Ti is the set consisting of the true most specific class(es) of test example i, and all its/their ancestor classes; and each summation is computed, of course, over all of the test set examples.
This one is a bit of a handful to unpack, so if you find yourself puzzled, check out the appendix, where I explain it in more detail.
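A minimal Python sketch of these hierarchical metrics, following the Pi/Ti definitions above. The `ANCESTORS` table is an invented slice of the house-pets taxonomy, just enough to extend each label set with its ancestors:

```python
# Toy parent lookup: each leaf class maps to its set of ancestor classes.
ANCESTORS = {
    "dalmatian": {"dog"},
    "labrador": {"dog"},
    "sphynx": {"cat"},
    "narwhal": {"whale"},
}

def extend(labels):
    """Extend a set of predicted/true classes with all their ancestors."""
    out = set(labels)
    for label in labels:
        out |= ANCESTORS.get(label, set())
    return out

def hierarchical_scores(true_sets, pred_sets):
    p_i = [extend(p) for p in pred_sets]  # predicted classes + ancestors
    t_i = [extend(t) for t in true_sets]  # true classes + ancestors
    overlap = sum(len(p & t) for p, t in zip(p_i, t_i))
    h_precision = overlap / sum(len(p) for p in p_i)
    h_recall = overlap / sum(len(t) for t in t_i)
    h_f1 = 2 * h_precision * h_recall / (h_precision + h_recall)
    return h_precision, h_recall, h_f1

# A Labrador mispredicted as a Dalmatian still gets credit for "dog":
hp, hr, hf = hierarchical_scores([{"labrador"}], [{"dalmatian"}])
```

Compare that with the flat version, where the same mistake would score a flat zero.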
Now, if you’ve implemented your model with non-mandatory leaf-node prediction (meaning the most specific level predicted doesn’t have to be the deepest one), some adjustments need to be made; I won’t go into it here, but if this is something you want to read more about, let me know.
Which Metric Is The Perfect Match?
The everlasting question. As I previously mentioned, there isn’t one obvious answer, but here are my own thoughts on the subject:
- Flat metrics: it’s a simple enough method, but it loses hierarchy information, which you probably deem important if you went through the trouble of building a hierarchical ensemble model in the first place. I would recommend using flat metrics only for super quick-and-dirty projects, where time is a major limiting factor.
- Custom-made, unique metrics: might be a better fit, but you pay with time and effort. Also, since you’ll be using a metric that wasn’t peer reviewed, you could be missing something important. I would recommend custom-made metrics only when the project at hand has very unique requirements that should be taken into account when evaluating model performance.
- Hierarchical versions of common classification metrics: this method is somewhat intuitive (once you get the hang of it), and it makes a lot of sense for a hierarchical model. However, it might not fit your own use case best (for example, there’s no added weight for a correct/false prediction of the deepest class, which might be important in some use cases). It also requires some extra implementation time. All in all, though, I think it’s a good enough premade solution, and should probably be the first choice for most projects.
To Conclude
A machine learning model is nothing without its performance metrics, and hierarchical models require their own special care. There is no one best method to measure hierarchy-based classification: different approaches have their own pros and cons, and each project has its own best fit. If you got this far, you hopefully have an idea as to which method is best for yours, and can now measure your model once you’ve got it rolling.
This post concludes my four-post series about hierarchical classification models. If you’ve read all of them, you should have all of the tools you need to design, build and measure an outstanding hierarchical classification project. I hope you put it to the best use possible.
Previous posts in the series:
Noa Weiss is an AI & Machine Learning Consultant based in Tel Aviv.
Appendix
Can’t figure out those pesky hierarchical metrics? I’m here to help.
In the table below I go over the mock results of a “common house pets” hierarchical model, looking at the measures for the “Dalmatian” class (remember: precision, recall and f-score metrics are calculated per class, treating the labels — both predicted and true — as binary).
I go over a few examples, checking out what each of them contributes to both the precision and the recall scores. Remember: the final precision/recall scores are computed from the sums of all those individual contributions.
Comments by example:
- Misclassification of a different breed as a Dalmatian: a full point for recall (as the “dog” part was correctly identified), but only half a point for precision (as “dog” was correct, but the predicted “dalmatian” label was wrong). Recall isn’t negatively affected since the “Labrador” label, which was missed here, is not part of the [Dog, Dalmatian] classes, which are the ones measured here.
- Misclassification of a narwhal as a dalmatian: a zero for precision (as both the “dog” and “dalmatian” predicted labels are wrong), but the recall metric isn’t affected, since the true narwhal label is irrelevant to the measurements of the [Dog, Dalmatian] classes.
- Perfect prediction: an extra point for both precision and recall.
- Misclassification of a dalmatian as a different breed: a full point for the precision metric (as the “dog” classifier, the only one of the two that came out positive, was correct), but only half a point for recall (the “dog” label was correctly identified, but the “dalmatian” one was missed).
- A dalmatian misclassified as a Rainbow unicorn: 0 for recall (as both the dog and dalmatian labels were missed), but the precision score isn’t affected.
- This example doesn’t teach us anything about the performance of the Dog/Dalmatian classifiers, so it stands to reason that it doesn’t affect the score.
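The same bookkeeping can be reproduced in code. The label sets below are my reconstruction of the first five rows described above (ancestors already included; the last bullet’s example doesn’t touch the measured classes, so it’s omitted), with the scores restricted to the [dog, dalmatian] classes:

```python
MEASURED = {"dog", "dalmatian"}

# (true label set, predicted label set) per example, reconstructed
# from the bullets above.
examples = [
    ({"dog", "labrador"}, {"dog", "dalmatian"}),            # other breed -> Dalmatian
    ({"whale", "narwhal"}, {"dog", "dalmatian"}),           # narwhal -> Dalmatian
    ({"dog", "dalmatian"}, {"dog", "dalmatian"}),           # perfect prediction
    ({"dog", "dalmatian"}, {"dog", "labrador"}),            # Dalmatian -> other breed
    ({"dog", "dalmatian"}, {"unicorn", "rainbow unicorn"}), # Dalmatian -> unicorn
]

hits = precision_den = recall_den = 0
for true, pred in examples:
    hits += len(true & pred & MEASURED)    # the full/half "points" from the table
    precision_den += len(pred & MEASURED)  # measured labels we predicted
    recall_den += len(true & MEASURED)     # measured labels that were true

precision = hits / precision_den  # 4/7 for these five examples
recall = hits / recall_den        # also 4/7
```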
Source: C.N. Silla & A.A. Freitas, A survey of hierarchical classification across different application domains (2011), Data Mining and Knowledge Discovery, 22(1–2):182–196
Translated from: https://towardsdatascience.com/hierarchical-performance-metrics-and-where-to-find-them-7090aaa07183