[Data Mining 04] EDA: Bihistogram and Block Plot Analysis



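1. Bihistogram

The bihistogram is an EDA tool for assessing whether a before-versus-after engineering modification has caused a change in:

  • location;
  • variation;
  • distribution.

It is a graphical alternative to the two-sample t-test. The bihistogram can be more powerful than the t-test because all of the distributional features (location, scale, skewness, outliers) are visible on a single plot. It is also based on the common and well-understood histogram.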

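From the bihistogram of the JAHANMI2.DAT data set above, we can see that batch 1 is centered at a ceramic strength value of approximately 725, while batch 2 is centered at approximately 625. The batches are therefore displaced from each other by about 100 strength units. The batch factor thus has a significant effect on the location (typical value) of strength, and batch is said to be "significant" or to "have an effect". In this way we see graphically and convincingly what a t-test or an analysis of variance would indicate quantitatively.

With respect to variation, note that the spread of the above-axis batch 1 histogram does not appear to differ much from that of the below-axis batch 2 histogram. With respect to distributional shape, note that the batch 1 histogram is skewed left, while the batch 2 histogram is more symmetric, with even a hint of slight skewness to the right.

The bihistogram thus reveals a clear difference between the batches in location and distribution, but not in variation. Comparing batch 1 and batch 2, we also note that batch 1 is the "better batch" because its average strength is about 100 units higher (around 725).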

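A bihistogram is formed by vertically juxtaposing two histograms:

  • Above the axis: histogram of the response variable for condition 1;
  • Below the axis: histogram of the response variable for condition 2.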
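
The construction above can be sketched in Python. This is only an illustrative sketch, not the procedure used to draw the figure in this post: the function names (`shared_bin_edges`, `bin_counts`, `bihistogram_counts`, `plot_bihistogram`) are hypothetical, the choice of equal-width bins shared by both groups is an assumption, and the plotting helper assumes matplotlib is installed.

```python
def shared_bin_edges(a, b, n_bins=10):
    """Equal-width bin edges spanning both samples (assumes the data are not all equal)."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    if hi == lo:
        raise ValueError("all values are identical; cannot build bins")
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def bin_counts(values, edges):
    """Count values per bin; the maximum value is placed in the last bin."""
    n_bins = len(edges) - 1
    lo = edges[0]
    width = (edges[-1] - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)
        counts[i] += 1
    return counts


def bihistogram_counts(a, b, n_bins=10):
    """Return (edges, counts above the axis, negated counts below the axis)."""
    edges = shared_bin_edges(a, b, n_bins)
    above = bin_counts(a, edges)
    below = [-c for c in bin_counts(b, edges)]
    return edges, above, below


def plot_bihistogram(a, b, n_bins=10):
    """Draw condition 1 above and condition 2 below a common horizontal axis."""
    import matplotlib.pyplot as plt  # assumed to be available

    edges, above, below = bihistogram_counts(a, b, n_bins)
    width = edges[1] - edges[0]
    fig, ax = plt.subplots()
    ax.bar(edges[:-1], above, width=width, align="edge", label="condition 1")
    ax.bar(edges[:-1], below, width=width, align="edge", label="condition 2")
    ax.axhline(0, color="black")
    ax.set_xlabel("response")
    ax.set_ylabel("count (condition 2 mirrored)")
    ax.legend()
    return fig
```

Using the same bin edges for both groups keeps the two histograms directly comparable, so a shift in location shows up as a horizontal offset between the upper and lower bars.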

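The bihistogram can provide answers to the following questions:

  1. Is a (2-level) factor significant?
  2. Does a (2-level) factor have an effect?
  3. Does the location change between the 2 subgroups?
  4. Does the variation change between the 2 subgroups?
  5. Does the distributional shape change between subgroups?
  6. Are there any outliers?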
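
As a rough numeric companion to questions 3 to 5, the following sketch summarizes two groups by location shift, ratio of standard deviations, and skewness. It is not part of the bihistogram itself; the helper names (`skewness`, `compare_groups`) are hypothetical, and the sample data are synthetic values drawn around the centers 725 and 625 mentioned above with an assumed spread, not the JAHANMI2.DAT measurements.

```python
import random
import statistics


def skewness(values):
    """Moment-based skewness m3 / m2**1.5 (negative = skewed left)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def compare_groups(a, b):
    """Summaries matching the location, variation, and shape questions."""
    return {
        "location_shift": statistics.mean(a) - statistics.mean(b),
        "sd_ratio": statistics.stdev(a) / statistics.stdev(b),
        "skew_a": skewness(a),
        "skew_b": skewness(b),
    }


# Synthetic illustration only: the spread of 60 is an assumption.
rng = random.Random(0)
group_1 = [rng.gauss(725, 60) for _ in range(200)]
group_2 = [rng.gauss(625, 60) for _ in range(200)]
summary = compare_groups(group_1, group_2)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A location shift far from 0 suggests a change in location, an sd ratio far from 1 suggests a change in variation, and skewness values of different sign or size suggest a change in shape. These numbers supplement the plot and do not replace the formal tests.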

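The bihistogram is an important EDA tool for determining whether a factor "has an effect". It is especially valuable because it gives insight into the validity of three (location, variation, and distribution) of the four underlying assumptions of a measurement process, missing only randomness. Because of the dual (above/below) nature of the plot, the bihistogram is restricted to assessing factors that have only two levels. This situation is nevertheless very common, given the before-versus-after character of many scientific and engineering experiments.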

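Related techniques (details are covered in later posts):

  • Two-Sample t-Test for Equal Means (for a shift in location)
  • F-Test for Equality of Two Variances (for a shift in variation)
  • Kolmogorov-Smirnov Goodness-of-Fit Test (for a shift in distribution)
  • Quantile-Quantile Plot (for a shift in location and distribution)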

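2. Block Plot

The block plot (Filliben 1993) is an EDA tool for assessing whether the factor of interest (the primary factor) has a statistically significant effect on the response, and whether that conclusion about the primary factor effect holds robustly over all other nuisance (secondary) factors in the experiment.

It replaces the analysis of variance test with a less assumption-dependent binomial test, and it should be used routinely whenever we are trying to decide robustly whether a primary factor has an effect.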
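The block plot of the SHEESLE2.DAT data set shows that weld method 2 is lower (better) than weld method 1 in 10 of the 12 cases (bars). From a binomial point of view, the weld method is statistically significant.

A block plot is formed as follows:

  • Vertical axis: response variable Y;
  • Horizontal axis: all combinations of all levels of all nuisance (secondary) factors $X_1, X_2, \ldots$;
  • Plot character: levels of the primary factor $X_P$.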

Average number of defective lead wires per hour from a study with four factors,

  • weld method (2 levels)
  • plant (2 levels)
  • speed (2 levels)
  • shift (3 levels)

are shown in the plot above. Weld method is the primary factor and the other three factors are nuisance factors. The 12 distinct positions along the horizontal axis correspond to all possible combinations of the three nuisance factors, i.e., 12 = 2 plants x 2 speeds x 3 shifts. These 12 conditions provide the framework for assessing whether any conclusions about the 2 levels of the primary factor (weld method) can truly be called “general conclusions”. If we find that one weld method setting does better (smaller average defects per hour) than the other weld method setting for all or most of these 12 nuisance factor combinations, then the conclusion is in fact general and robust.

In the above chart, the ordering along the horizontal axis is as follows:

  • The left 6 bars are from plant 1 and the right 6 bars are from plant 2.
  • The first 3 bars are from speed 1, the next 3 bars are from speed 2, the next 3 bars are from speed 1, and the last 3 bars are from speed 2.
  • Bars 1, 4, 7, and 10 are from the first shift, bars 2, 5, 8, and 11 are from the second shift, and bars 3, 6, 9, and 12 are from the third shift.
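
The ordering described above matches a nested enumeration in which plant varies slowest and shift varies fastest. A minimal sketch, assuming the levels are simply numbered 1, 2, 3 (the variable and function names are illustrative):

```python
from itertools import product

plants = (1, 2)
speeds = (1, 2)
shifts = (1, 2, 3)

# product() varies the last argument fastest, so shift cycles within speed,
# and speed cycles within plant, as in the bar ordering described above.
bars = dict(enumerate(product(plants, speeds, shifts), start=1))


def bars_where(position, level):
    """Bar numbers whose (plant, speed, shift) tuple has `level` at `position`."""
    return [i for i, combo in bars.items() if combo[position] == level]


for number, (plant, speed, shift) in bars.items():
    print(f"bar {number:2d}: plant {plant}, speed {speed}, shift {shift}")
```

For example, `bars_where(2, 1)` lists the bars from the first shift, and `bars_where(0, 1)` lists the bars from plant 1.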

In the block plot for the first bar (plant 1, speed 1, shift 1), weld method 1 yields about 28 defects per hour while weld method 2 yields about 22 defects per hour; hence the difference for this combination is about 6 defects per hour, and weld method 2 is seen to be better (smaller number of defects per hour).

Is “weld method 2 is better than weld method 1” a general conclusion?

For the second bar (plant 1, speed 1, shift 2), weld method 1 is about 37 while weld method 2 is only about 18. Thus weld method 2 is again seen to be better than weld method 1. Similarly for bar 3 (plant 1, speed 1, shift 3), we see weld method 2 is smaller than weld method 1. Scanning over all of the 12 bars, we see that weld method 2 is smaller than weld method 1 in 10 of the 12 cases, which is highly suggestive of a robust weld method effect.

What is the chance of 10 out of 12 happening by chance? This is probabilistically equivalent to testing whether a coin is fair by flipping it and getting 10 heads in 12 tosses. The chance (from the binomial distribution) of getting 10 (or more extreme: 11, 12) heads in 12 flips of a fair coin is about 2%. Such low-probability events are usually rejected as untenable and in practice we would conclude that there is a difference in weld methods.
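
The "about 2%" figure can be checked with a short calculation of the upper tail of a binomial distribution with n = 12 and p = 0.5. The function name `binomial_upper_tail` is illustrative, and the sketch uses only the Python standard library:

```python
from math import comb


def binomial_upper_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 10, 11, or 12 heads in 12 tosses of a fair coin: (66 + 12 + 1) / 4096
p_value = binomial_upper_tail(10, 12)
print(f"{p_value:.4f}")  # 0.0193
```

This is a one-sided probability: it counts only outcomes in which weld method 2 wins at least 10 times.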

The advantages of the block plot are as follows:

  • A quantitative procedure (analysis of variance) is replaced by a graphical procedure.
  • An F-test (analysis of variance) is replaced with a binomial test, which requires fewer assumptions.

The block plot can provide answers to the following questions:

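  • Is the factor of interest significant?
  • Does the factor of interest have an effect?
  • Does the location change between levels of the primary factor?
  • Has the process improved?
  • What is the best setting (= level) of the primary factor?
  • How much of an average improvement can we expect with this best setting of the primary factor?
  • Is there an interaction between the primary factor and one or more nuisance factors?
  • Does the effect of the primary factor change depending on the setting of some nuisance factor?
  • Are there any outliers?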

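The block plot is a graphical technique that focuses specifically on whether the conclusions about the primary factor are in fact robustly general. This question is fundamentally different from the generic multi-factor experiment question, in which the analyst asks "which factors are important and which are not?" (a screening problem). Global data analysis techniques, such as analysis of variance, can potentially be improved upon by local, focused data analysis techniques that take advantage of this difference.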

