These days you can find a lot of articles about Unity on the net that tell you to do this or not to do that, especially when it comes to performance and optimization.

I like testing these little bits of knowledge with little Unity projects on various platforms. Sometimes I get interesting results. This post is about one of them, and I still don't understand the results I got.

TL;DR: Shit's complicated, yo!

The question

Look at Vector3.one. You might have heard that since it is a property, every access is a method call, and that IL2CPP does not inline it. So your little for (var i = 0; i < 100000; i++) v += Vector3.one; in Update() might be costly.

But is it?

The test

So, I made a test to compare the performance of the following:

  1. Vector3.one
  2. new Vector3(1, 1, 1)
  3. local variable
  4. static variable

The test measures the timings of five loops:

// field on the test class
static Vector3 staticOne = new Vector3(1, 1, 1);
// local inside the test method
var localOne = new Vector3(1, 1, 1);

// empty
for (var i = 0; i < n; i++) {}
// Vector3.one
for (var i = 0; i < n; i++) { arr[i] = Vector3.one * i; }
// new Vector3(1, 1, 1)
for (var i = 0; i < n; i++) { arr[i] = new Vector3(1, 1, 1) * i; }
// static variable
for (var i = 0; i < n; i++) { arr[i] = staticOne * i; }
// local variable
for (var i = 0; i < n; i++) { arr[i] = localOne * i; }

Results

OS X (Mono) — MacBook Pro

Running on my MBP with a standalone non-development build I got the following timings:

  1. empty: 18 ms,
  2. Vector3.one: 345 ms,
  3. new: 325 ms,
  4. local: 226 ms,
  5. static: 239 ms.

This is about what I expected, though I am more interested in IL2CPP right now.

Android Galaxy S7 IL2CPP 32 bit

Next, I deployed the test to my Android test device:

  1. empty: 0 ms,
  2. Vector3.one: 387 ms,
  3. new: 177 ms,
  4. local: 147 ms,
  5. static: 150 ms.

As you can see, the compiler optimized the empty loop away. I iterated on the test a few times to make sure that it doesn't optimize anything else away. As for Vector3.one, it is more than twice as slow as a local variable, which is still the fastest.

iOS iPhone 6 IL2CPP 64 bit

Next, I ran the test on my iOS test device (Android results after /):

  1. empty: 0 ms / 0 ms,
  2. Vector3.one: 47 ms / 387 ms,
  3. new: 56 ms / 177 ms,
  4. local: 35 ms / 147 ms,
  5. static: 54 ms / 150 ms.

Wait, what?!

I ran this test a lot of times with slight modifications here and there. I'd appreciate it if anybody could give me a hint about what I did wrong.

iOS iPhone 6 IL2CPP 32 bit

Is this a difference in hardware, or 32-bit vs. 64-bit? The Samsung S7 should be a bit faster than the iPhone 6. Let's see how 32-bit code performs (64-bit results after /):

  1. empty: 0 ms / 0 ms,
  2. Vector3.one: 170 ms / 47 ms,
  3. new: 175 ms / 56 ms,
  4. local: 174 ms / 35 ms,
  5. static: 170 ms / 54 ms.

OK, this looks more like the timings I got on Android.

Generated code

So far the only result is that you'd better use a local variable. But let's dig deeper into the code IL2CPP generates.

IL2CPP takes your C# code and emits C++ code instead. This C++ code is easy to find in the Xcode project but is not that easy to read. If you are interested, I recommend reading the series of blog posts about IL2CPP on the Unity Blog.

Let's look at the code generated from our C# test.

Here's Vector3.one C++ code:

extern "C"  Vector3_t4282066566 Vector3_get_one_m2017759730(…) {  
    Vector3_t4282066566 L_0;
    memset(&L_0, 0, sizeof(L_0));
    Vector3__ctor_m2926210380(&L_0, (1.0f), (1.0f), (1.0f), …);
    return L_0;
}

And this is how this method is called from our code:

Vector3_t4282066566  L_14 = Vector3_get_one_m2017759730(…);  

Here's new Vector3(1, 1, 1) C++ code:

Vector3_t4282066566  L_50;  
memset(&L_50, 0, sizeof(L_50));  
Vector3__ctor_m2926210380(&L_50, (1.0f), (1.0f), (1.0f), …);  

BUT...

Since Vector3_get_one_m2017759730 and Vector3__ctor_m2926210380 are in the same compilation unit, the constructor is actually inlined into the Vector3_get_one_m2017759730 method. It is compiled to the following ARM64 assembly:

fmov        s0, #1.00000000  
mov.16b     v1, v0  
mov.16b     v2, v0  
ret  

Is this why Vector3.one is not slow here?

But as our code is in another compilation unit, it is correct to say that Vector3_get_one_m2017759730 will not be inlined into our code.

If you look at any code using static variables, you will see that in IL2CPP it follows the same pattern. For example, this is the access to the static variable in my test:

IL2CPP_RUNTIME_CLASS_INIT(Test_t2603186_il2cpp_TypeInfo_var);  
Vector3_t4282066566 L_26 =  
   ((Test_t2603186_StaticFields*)Test_t2603186_il2cpp_TypeInfo_var    
    ->static_fields)->get_staticOne_2();

Luckily, the whole second statement is inlined. But the IL2CPP_RUNTIME_CLASS_INIT macro has to check whether the class is initialized on every access. Is this why static variables are slow(er)?
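To make that pattern concrete, here is a minimal C++ sketch of a per-access initialization guard compared with copying the static into a local once. It is a simplified model, not what IL2CPP actually emits: TypeInfo, runtimeClassInit, fillFromStatic and fillFromLocal are hypothetical names made up for illustration, and the counters exist only to show how often the guard runs.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-ins; none of these names come from IL2CPP itself.
struct Vec3 { float x, y, z; };

struct TypeInfo {
    bool initialized = false;
    int initCount = 0;     // how many times the static initializer ran
    long checkCount = 0;   // how many times the "is it initialized?" test ran
    Vec3 staticOne{0, 0, 0};
};

// Rough analogue of a class-init guard: a branch on every access,
// with the static initializer running only the first time.
inline void runtimeClassInit(TypeInfo& t) {
    ++t.checkCount;
    if (!t.initialized) {
        t.staticOne = Vec3{1, 1, 1};
        t.initialized = true;
        ++t.initCount;
    }
}

inline Vec3 scale(const Vec3& v, float s) {
    return Vec3{v.x * s, v.y * s, v.z * s};
}

// Reads the static field inside the loop: one guard per iteration.
void fillFromStatic(TypeInfo& t, std::vector<Vec3>& arr) {
    for (std::size_t i = 0; i < arr.size(); ++i) {
        runtimeClassInit(t);
        arr[i] = scale(t.staticOne, static_cast<float>(i));
    }
}

// Copies the static field into a local first: one guard in total.
void fillFromLocal(TypeInfo& t, std::vector<Vec3>& arr) {
    runtimeClassInit(t);
    const Vec3 local = t.staticOne;
    for (std::size_t i = 0; i < arr.size(); ++i) {
        arr[i] = scale(local, static_cast<float>(i));
    }
}
```

Both functions fill the array with the same values. The only difference in this model is how many times the guard branch runs, which is one possible reason a local copy could come out ahead.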

Safety checks

Let's look at a snippet of generated IL2CPP code:

Vector3U5BU5D_t215400611* L_25 = __this->get_arr_5();  
NullCheck(L_25);  
IL2CPP_ARRAY_BOUNDS_CHECK(L_25, L_26);  
Vector3_t4282066566  L_27 = Vector3_get_one_m886467710(…);  
int32_t L_28 = V_3;  
Vector3_t4282066566  L_29 = Vector3_op_Multiply_m973638459(…);  
(*(Vector3_t4282066566 *)((L_25)
   ->GetAddressAt(static_cast<il2cpp_array_size_t>(L_26)))) = L_29;

Notice lines two and three:

NullCheck(L_25);  
IL2CPP_ARRAY_BOUNDS_CHECK(L_25, L_26);  

Unity and IL2CPP add a few safety checks to generated code to make sure that you don't accidentally shoot yourself in the foot. This is usually a good thing. But does it add overhead? Yes, it does.
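As a rough illustration of what those two checks amount to, here is a small C++ sketch of a checked and an unchecked array store. ManagedArray, storeChecked and storeUnchecked are hypothetical names, and the exceptions are plain C++ stand-ins. This is an assumption about the general shape of such checks, not the actual NullCheck or IL2CPP_ARRAY_BOUNDS_CHECK implementation.

```cpp
#include <cstddef>
#include <stdexcept>

struct Vec3 { float x, y, z; };

// Hypothetical managed-style array: a length stored next to the data.
struct ManagedArray {
    std::size_t length;
    Vec3* items;
};

// Checked store: two extra branches before every write.
inline void storeChecked(ManagedArray* arr, std::size_t index, const Vec3& value) {
    if (arr == nullptr) {
        throw std::runtime_error("NullReferenceException");
    }
    if (index >= arr->length) {
        throw std::out_of_range("IndexOutOfRangeException");
    }
    arr->items[index] = value;
}

// Unchecked store: only the write is left.
// A bad pointer or index here is undefined behaviour, not an exception.
inline void storeUnchecked(ManagedArray* arr, std::size_t index, const Vec3& value) {
    arr->items[index] = value;
}
```

In a tight loop the checked version pays for both branches on every iteration, while the unchecked version trades that cost for the risk of silent memory corruption.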

You can turn these safety checks off using the Il2CppSetOption attribute like so:

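// Il2CppSetOptionAttribute.cs ships with the Unity install and has to be copied into the project.
using Unity.IL2CPP.CompilerServices;
using UnityEngine;

[Il2CppSetOption(Option.NullChecks, false)]
[Il2CppSetOption(Option.ArrayBoundsChecks, false)]
[Il2CppSetOption(Option.DivideByZeroChecks, false)]
public class Test : MonoBehaviour {}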

Of course, you must be absolutely sure that your code is correct before doing this. Let's see how much performance I gained in my test on the iPhone 6 (64-bit) by turning these checks off (timings with the checks on after /):

  1. empty: 0 ms / 0 ms,
  2. Vector3.one: 40 ms / 47 ms,
  3. new: 56 ms / 56 ms,
  4. local: 31 ms / 35 ms,
  5. static: 51 ms / 54 ms.

Well, this is definitely good.

What we found out

64-bit

Does the fact that I build for a 64-bit platform give such a speed boost?

Here I'm entering the realm where I don't really have much expertise. From a lot of googling and chatting with more experienced colleagues I got the following:

  1. It's not SIMD. Generated ARM instructions are definitely not SIMD.
  2. The 64-bit machine code seems to use registers instead of the stack to pass function parameters.

This might be just an edge case. So, the question is still open.

Results

  1. It is really important to test things on your platform and target device.
  2. Compilers are smart.
  3. Use a local variable.
  4. You learn a lot while making tests.

... and (sadly) I still don't understand what's going on.