Why does adding an extra field to a struct greatly improve its performance?

Question:

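I noticed that a struct wrapping a single float is significantly slower than using a float directly, with roughly half the performance. Why does adding an extra field to the struct greatly improve its performance?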

using System;
using System.Diagnostics;

struct Vector1 {

    public float X;

    public Vector1(float x) {
        X = x;
    }

    public static Vector1 operator +(Vector1 a, Vector1 b) {
        a.X = a.X + b.X;
        return a;
    }
}

However, after adding an "extra" field, some magic seems to happen and performance becomes reasonable again:

struct Vector1Magic {

    public float X;
    private bool magic;

    public Vector1Magic(float x) {
        X = x;
        magic = true;
    }

    public static Vector1Magic operator +(Vector1Magic a, Vector1Magic b) {
        a.X = a.X + b.X;
        return a;
    }
}

The code I used to benchmark these:

class Program {
    static void Main(string[] args) {
        int iterationCount = 1000000000;
        var sw = new Stopwatch();

        // Baseline: accumulate plain floats.
        sw.Start();
        var total = 0.0f;
        for (int i = 0; i < iterationCount; i++) {
            var v = (float) i;
            total = total + v;
        }
        sw.Stop();
        Console.WriteLine("Float time was {0} for {1} iterations.", sw.Elapsed, iterationCount);
        Console.WriteLine("total = {0}", total);

        // Same accumulation through the single-field struct.
        sw.Reset();
        sw.Start();
        var totalV = new Vector1(0.0f);
        for (int i = 0; i < iterationCount; i++) {
            var v = new Vector1(i);
            totalV += v;
        }
        sw.Stop();
        Console.WriteLine("Vector1 time was {0} for {1} iterations.", sw.Elapsed, iterationCount);
        // Note: neither struct overrides ToString(), so these lines print the type name
        // rather than the sum; they still use the result after the loop.
        Console.WriteLine("totalV = {0}", totalV);

        // Same accumulation through the struct with the extra field.
        sw.Reset();
        sw.Start();
        var totalVm = new Vector1Magic(0.0f);
        for (int i = 0; i < iterationCount; i++) {
            var vm = new Vector1Magic(i);
            totalVm += vm;
        }
        sw.Stop();
        Console.WriteLine("Vector1Magic time was {0} for {1} iterations.", sw.Elapsed, iterationCount);
        Console.WriteLine("totalVm = {0}", totalVm);
        Console.Read();
    }
}

With the following benchmark results:

Float time was 00:00:02.2444910 for 1000000000 iterations. 
Vector1 time was 00:00:04.4490656 for 1000000000 iterations. 
Vector1Magic time was 00:00:02.2262701 for 1000000000 iterations.
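To put "roughly half the performance" in numbers, the ratios below are computed only from the three 32-bit timings above (each over $10^9$ iterations); no other data is assumed.

```latex
\begin{aligned}
\frac{t_{\text{Vector1}}}{t_{\text{float}}} &= \frac{4.4490656\ \text{s}}{2.2444910\ \text{s}} \approx 1.98 \\
\frac{t_{\text{Vector1Magic}}}{t_{\text{float}}} &= \frac{2.2262701\ \text{s}}{2.2444910\ \text{s}} \approx 0.99 \\
\frac{t_{\text{float}}}{10^{9}} \approx 2.24\ \text{ns/iter}, \quad
\frac{t_{\text{Vector1}}}{10^{9}} &\approx 4.45\ \text{ns/iter}, \quad
\frac{t_{\text{Vector1Magic}}}{10^{9}} \approx 2.23\ \text{ns/iter}
\end{aligned}
```

So Vector1 takes almost exactly twice as long per iteration as the raw float, while Vector1Magic is on par with it.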

Compilation/environment settings: OS: Windows 10 64-bit; Toolchain: VS2017; Framework: .NET 4.6.2; Target: Any CPU, Prefer 32-bit.
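With "Any CPU" plus "Prefer 32-bit", the program runs as a 32-bit process (and therefore under the 32-bit JIT) even on 64-bit Windows, so the target setting decides which JIT produced a given timing. A minimal sketch that prints the process bitness next to the results, so 32-bit and 64-bit runs are not mixed up; the helper name `DescribeBitness` is hypothetical.

```csharp
using System;

class Program
{
    // Hypothetical helper: reports whether this process runs as 32-bit or 64-bit.
    // IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit process.
    public static string DescribeBitness()
    {
        return Environment.Is64BitProcess
            ? "64-bit process (IntPtr.Size = " + IntPtr.Size + ")"
            : "32-bit process (IntPtr.Size = " + IntPtr.Size + ")";
    }

    static void Main()
    {
        Console.WriteLine(DescribeBitness());
        // A 32-bit process on a 64-bit OS is exactly the "Prefer 32-bit" case.
        Console.WriteLine("64-bit OS: {0}", Environment.Is64BitOperatingSystem);
    }
}
```

Printing this line at the start of the benchmark makes it obvious which of the two result sets a run belongs to.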

If 64-bit is set as the target, the results are more predictable, but significantly worse than what we see with Vector1Magic on the 32-bit target:

Float time was 00:00:00.6800014 for 1000000000 iterations. 
Vector1 time was 00:00:04.4572642 for 1000000000 iterations. 
Vector1Magic time was 00:00:05.7806399 for 1000000000 iterations.
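The same arithmetic for the 64-bit run, again using only the timings above, shows how much the struct variants fall behind the float baseline and how Vector1Magic compares with its own 32-bit result:

```latex
\begin{aligned}
\frac{t_{\text{Vector1}}}{t_{\text{float}}} &= \frac{4.4572642\ \text{s}}{0.6800014\ \text{s}} \approx 6.55 \\
\frac{t_{\text{Vector1Magic}}}{t_{\text{float}}} &= \frac{5.7806399\ \text{s}}{0.6800014\ \text{s}} \approx 8.50 \\
\frac{t^{64}_{\text{Vector1Magic}}}{t^{32}_{\text{Vector1Magic}}} &= \frac{5.7806399\ \text{s}}{2.2262701\ \text{s}} \approx 2.60
\end{aligned}
```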

For the true wizards, I have included a dump of the IL here: https://pastebin.com/sz2QLGEx

Further investigation indicates that this seems to be specific to the Windows runtime, since the Mono compiler produces the same IL.

On the Mono runtime, both struct variants are roughly 2x slower than the raw float. This is quite different from the performance we see on .NET.
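Because the question compares the Windows .NET runtime with Mono, it helps to record which runtime produced each timing. The sketch below relies on the widely used check for the `Mono.Runtime` type; treat it as a heuristic rather than a documented API contract, and note that `IsRunningOnMono` is a hypothetical helper name.

```csharp
using System;

class Program
{
    // Hypothetical helper. Classic Mono defines an internal type named "Mono.Runtime";
    // checking for it is a common heuristic, not an official API.
    public static bool IsRunningOnMono()
    {
        return Type.GetType("Mono.Runtime") != null;
    }

    static void Main()
    {
        Console.WriteLine(IsRunningOnMono()
            ? "Runtime: Mono"
            : "Runtime: not Mono (e.g. .NET Framework)");
        // Environment.Version reports the version of the common language runtime.
        Console.WriteLine("CLR version: {0}", Environment.Version);
    }
}
```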

What's going on here?

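*Note that this question originally included a flawed benchmark process (thanks to Max Payne for pointing this out) and has been updated to reflect the timings more accurately.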

Off-the-cuff guess: this is due to struct packing, and the struct now has better memory alignment. –

You should add warm-up iterations to rule out possible interference from the JIT or other one-time processing. – PetSerAl
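Following PetSerAl's suggestion, here is a minimal sketch of a warm-up-then-measure harness. It assumes each workload can be wrapped in a method taking an iteration count and returning its sum; `Bench.Measure`, `SumFloats` and `SumVector1` are hypothetical names, and `Vector1` is copied from the question.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: runs the workload once untimed so JIT compilation and other
// one-time costs are paid up front, then times a second, full-length run.
static class Bench
{
    public static TimeSpan Measure(Func<int, float> workload, int iterations, out float result)
    {
        workload(1000);                 // warm-up call: forces the workload to be JIT-compiled
        var sw = Stopwatch.StartNew();
        result = workload(iterations);  // returning the sum keeps the loop's result in use
        sw.Stop();
        return sw.Elapsed;
    }
}

// Copied from the question.
struct Vector1
{
    public float X;
    public Vector1(float x) { X = x; }
    public static Vector1 operator +(Vector1 a, Vector1 b) { a.X = a.X + b.X; return a; }
}

class Program
{
    public static float SumFloats(int n)
    {
        var total = 0.0f;
        for (int i = 0; i < n; i++) total += (float)i;
        return total;
    }

    public static float SumVector1(int n)
    {
        var total = new Vector1(0.0f);
        for (int i = 0; i < n; i++) total += new Vector1(i);
        return total.X;
    }

    static void Main()
    {
        const int iterations = 1000000000;
        float r;
        TimeSpan t = Bench.Measure(SumFloats, iterations, out r);
        Console.WriteLine("float:   {0} (total = {1})", t, r);
        t = Bench.Measure(SumVector1, iterations, out r);
        Console.WriteLine("Vector1: {0} (total = {1})", t, r);
    }
}
```

.NET Framework 4.6.2 compiles each method once, so one warm-up call should suffice there; on runtimes with tiered compilation (.NET Core 3.0 and later) a single call may not be enough to reach fully optimized code.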

If I switch to 64-bit, performance gets worse for your "magic" vector. – Adrian

Answer

This should not happen. It's obviously some kind of misalignment that keeps the JIT from working the way it should.

struct Vector1 //Works fast in 32 Bit 
{ 
    public double X; 
} 

struct Vector1 //Works fast in 64 Bit and 32 Bit 
{ 
    public double X; 
    public double X2; 
}
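The answer pads the struct with a second field. A hypothetical way to test the size idea without a dummy field is `StructLayout` with an explicit `Size`. `Vector1Padded` is an illustrative name, and whether it reproduces the Vector1Magic timings is an untested assumption that would have to be measured with the benchmark from the question.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical variant of the question's Vector1: padded to 8 bytes via StructLayout.Size
// instead of an extra field. Its effect on the JIT is an assumption to be measured.
[StructLayout(LayoutKind.Sequential, Size = 8)]
struct Vector1Padded
{
    public float X;

    public Vector1Padded(float x) { X = x; }

    public static Vector1Padded operator +(Vector1Padded a, Vector1Padded b)
    {
        a.X = a.X + b.X;
        return a;
    }
}

class Program
{
    static void Main()
    {
        // Marshal.SizeOf reports the unmanaged size, which honors StructLayout.Size.
        Console.WriteLine("Marshal.SizeOf<Vector1Padded>() = {0}", Marshal.SizeOf<Vector1Padded>());
        var sum = new Vector1Padded(1.5f) + new Vector1Padded(2.5f);
        Console.WriteLine("sum.X = {0}", sum.X);
    }
}
```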

You also have to call Console.WriteLine(total);, which raises the time to exactly the Vector1Magic time. The question remains why Vector1 is so slow.

Maybe structs are not optimized when sizeof(foo) < 64 bits in 64-bit mode.
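To check that hypothesis against the question's own types, the sketch below prints their sizes: Vector1 should be 4 bytes (32 bits, below the threshold mentioned) and Vector1Magic 8 bytes. It uses `Marshal.SizeOf<T>()`, which reports the unmanaged (interop) size; for these two structs the managed size should also come out at 4 and 8 bytes, but that equality is an assumption, not a general rule.

```csharp
using System;
using System.Runtime.InteropServices;

// Field layouts copied from the question.
struct Vector1
{
    public float X;
    public Vector1(float x) { X = x; }
}

struct Vector1Magic
{
    public float X;
    private bool magic;   // the "extra" field
    public Vector1Magic(float x) { X = x; magic = true; }
}

class Program
{
    static void Main()
    {
        // Vector1: a single 4-byte float.
        Console.WriteLine("Vector1:      {0} bytes", Marshal.SizeOf<Vector1>());
        // Vector1Magic: 4-byte float plus a bool (marshaled as a 4-byte BOOL) = 8 bytes.
        Console.WriteLine("Vector1Magic: {0} bytes", Marshal.SizeOf<Vector1Magic>());
    }
}
```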

It seems this was answered 7 years ago: Why is 16 byte the recommended size for struct in C#?

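"This should not happen. It's obviously some kind of misalignment that keeps the JIT from working the way it should." - That doesn't really answer the question. Why does it happen? What is the reasoning behind it? – Varon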

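Then let's upvote until someone comes along who knows how .NET works internally. The generated IL is perfectly fine, so reading it does not point us to a solution. The problem lies deeper, inside the JIT optimizer. This is a very interesting finding; maybe you could post it on the .NET developer team forum on MSDN. –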

Please insert the line Console.WriteLine(total); after the first loop. The JIT does not execute nodes whose results are never used afterwards. –
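Printing the result, as this comment suggests, is one way to keep it in use. Another common pattern is storing it in a static "sink" field, since a store to a static field is an observable side effect the JIT cannot drop. A minimal sketch; `Sink` and `SumIntoSink` are hypothetical names.

```csharp
using System;

class Program
{
    // Hypothetical sink: writing the loop's result here keeps the computation in use.
    public static float Sink;

    public static void SumIntoSink(int n)
    {
        var total = 0.0f;
        for (int i = 0; i < n; i++) total += (float)i;
        Sink = total;   // without this store, total would be unused after the loop
    }

    static void Main()
    {
        SumIntoSink(1000000000);
        Console.WriteLine("total = {0}", Sink);
    }
}
```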
