Introduction
Recently I've been optimizing the heck out of Noda Time. Most of the time this has been a case of the normal measurement, find bottlenecks, carefully analyse them, lather, rinse, repeat. Yesterday I had a hunch about a particular cost, and decided to experiment... leading to a surprising optimization.
Noda Time's core types are mostly value types - date/time values are naturally value types, just as DateTime and DateTimeOffset are in the BCL. Noda Time's types are a bit bigger than most value types, however - the largest being ZonedDateTime, weighing in at 40 bytes in an x64 CLR at the moment. (I can shrink it down to 32 bytes with a bit of messing around, although it's not terribly pleasant to do so.) The main reason for the bulk is that we have two reference types involved (the time zone and the calendar system), and in Noda Time 2.0 we're going to have nanosecond resolution instead of tick resolution (so we need 12 bytes just to store a point in time). While this goes against the Class Library Design Guidelines, it would be odd for the smaller types (LocalDate, LocalTime) to be value types and the larger ones to be reference types. Overall, these still feel like value types.
A lot of these value types are logically composed of each other:
- A LocalDate is a YearMonthDay and a CalendarSystem reference
- A LocalDateTime is a LocalDate and a LocalTime
- An OffsetDateTime is a LocalDateTime and an Offset
- A ZonedDateTime is an OffsetDateTime and a DateTimeZone reference
This leads to a lot of delegation, potentially - asking a ZonedDateTime for its Year could mean asking the OffsetDateTime, which would ask the LocalDateTime, which would ask the LocalDate, which would ask the YearMonthDay. Very nice from a code reuse point of view, but potentially inefficient due to copying data.
Why would there be data copying involved? Well, that's where this blog post comes in.
Behaviour of value type member invocations
When an instance member (method or property) belonging to a value type is invoked, the exact behaviour depends on the kind of expression it is called on. From the C# 5 spec, section 7.5.5 (where E is the expression the member M is invoked on, and the type declaring M is a value type):
If E is not classified as a variable, then a temporary local variable of E’s type is created and the value of E is assigned to that variable. E is then reclassified as a reference to that temporary local variable. The temporary variable is accessible as this within M, but not in any other way. Thus, only when E is a true variable is it possible for the caller to observe the changes that M makes to this.
So when is a variable not a variable? When it's readonly... from section 7.6.4 (emphasis mine) :
If T is a struct-type and I identifies an instance field of that class-type:
- If E is a value, or if the field is readonly and the reference occurs outside an instance constructor of the struct in which the field is declared, then the result is a value, namely the value of the field I in the struct instance given by E.
(There's a very similar bullet for T being a class-type; the important part is that the field type is a value type
The upshot is that if you have a method call of:
then it's effectively converted into this:
int result = tmp.Foo();
Now if the type of the field is quite a large value type, but Foo() doesn't modify the value (which it never does within my value types), that's performing a copy completely unnecessarily.
To see this in action outside Noda Time, I've built a little sample app.
Show me the code!
Our example is a simple 256-bit type, composed of 4 Int64 values. The type itself doesn't do anything useful - it just holds the four values, and exposes them via properties. We then measure how long it takes to sum the four properties lots of times.
using System.Diagnostics;
public struct Int256
{
private readonly long bits0;
private readonly long bits1;
private readonly long bits2;
private readonly long bits3;
public Int256(long bits0, long bits1, long bits2, long bits3)
{
this.bits0 = bits0;
this.bits1 = bits1;
this.bits2 = bits2;
this.bits3 = bits3;
}
public long Bits0 { get { return bits0; } }
public long Bits1 { get { return bits1; } }
public long Bits2 { get { return bits2; } }
public long Bits3 { get { return bits3; } }
}
class Test
{
private readonly Int256 value;
public Test()
{
value = new Int256(1L, 5L, 10L, 100L);
}
public long TotalValue
{
get
{
return value.Bits0 + value.Bits1 + value.Bits2 + value.Bits3;
}
}
public void RunTest()
{
// Just make sure it's JITted...
var sample = TotalValue;
Stopwatch sw = Stopwatch.StartNew();
long total = 0;
for (int i = 0; i < 1000000000; i++)
{
total += TotalValue;
}
sw.Stop();
Console.WriteLine("Total time: {0}ms", sw.ElapsedMilliseconds);
}
static void Main()
{
new Test().RunTest();
}
}
Building this from the command line with /o+ /debug- and running (in a 64-bit CLR, but no RyuJIT) this takes about 20 seconds to run on my laptop. We can make it much faster with just one small change:
{
private Int256 value;
// Code as before
}
The same test now takes about 4 seconds - a 5-fold speed improvement, just by making a field non-readonly. If we look at the IL for the TotalValue property, the copying becomes obvious. Here it is when the field is readonly:
get_TotalValue() cil managed
{
// Code size 60 (0x3c)
.maxstack 2
.locals init (valuetype Int256 V_0,
valuetype Int256 V_1,
valuetype Int256 V_2,
valuetype Int256 V_3)
IL_0000: ldarg.0
IL_0001: ldfld valuetype Int256 Test::'value'
IL_0006: stloc.0
IL_0007: ldloca.s V_0
IL_0009: call instance int64 Int256::get_Bits0()
IL_000e: ldarg.0
IL_000f: ldfld valuetype Int256 Test::'value'
IL_0014: stloc.1
IL_0015: ldloca.s V_1
IL_0017: call instance int64 Int256::get_Bits1()
IL_001c: add
IL_001d: ldarg.0
IL_001e: ldfld valuetype Int256 Test::'value'
IL_0023: stloc.2
IL_0024: ldloca.s V_2
IL_0026: call instance int64 Int256::get_Bits2()
IL_002b: add
IL_002c: ldarg.0
IL_002d: ldfld valuetype Int256 Test::'value'
IL_0032: stloc.3
IL_0033: ldloca.s V_3
IL_0035: call instance int64 Int256::get_Bits3()
IL_003a: add
IL_003b: ret
} // end of method Test::get_TotalValue
And here it is when the field's not readonly:
get_TotalValue() cil managed
{
// Code size 48 (0x30)
.maxstack 8
IL_0000: ldarg.0
IL_0001: ldflda valuetype Int256 Test::'value'
IL_0006: call instance int64 Int256::get_Bits0()
IL_000b: ldarg.0
IL_000c: ldflda valuetype Int256 Test::'value'
IL_0011: call instance int64 Int256::get_Bits1()
IL_0016: add
IL_0017: ldarg.0
IL_0018: ldflda valuetype Int256 Test::'value'
IL_001d: call instance int64 Int256::get_Bits2()
IL_0022: add
IL_0023: ldarg.0
IL_0024: ldflda valuetype Int256 Test::'value'
IL_0029: call instance int64 Int256::get_Bits3()
IL_002e: add
IL_002f: ret
} // end of method Test::get_TotalValue
Note that it's still loading the field address (ldflda) four times. You might expect that copying the field onto the stack once via a temporary variable would be faster, but that ends up at about 6.5 seconds on my machine.
There is an optimization which is even faster - moving the totalling property into Int256. That way (with the non-readonly field, still) the total time is less than a second - twenty times faster than the original code!
Conclusion
This isn't an optimization I'd recommend in general. Most code really doesn't need to be micro-optimized this hard, and most code doesn't deal with large value types like the ones in Noda Time. However, I regard Noda Time as a sort of "system level" library, and I don't ever want someone to decide not to use it on performance grounds. My benchmarks show that for potentially-frequently-called operations (such as the properties on ZonedDateTime) it really does make a difference, so I'm going to go for it.
I intend to apply a custom attribute to each of these "would normally be readonly" fields to document the intended behaviour of the field - and then when Roslyn is fully released, I'll probably write a test to validate that all of these fields would still compile if the field were made readonly (e.g. that they're never assigned to outside the constructor).
Aside from anything else, I find the subtle difference in behaviour between a readonly field and a read/write field fascinating... it's something I'd been vaguely aware of in the past, but this is the first time that it's had a practical impact on me. Maybe it'll never make any difference to your code... but it's probably worth being aware of anyway.
Perhaps in a future version of C#, they could add an attribute or keyword to tell the compiler that a particular method (such as someField.Foo() ) does not result in any changes to the object. Then it wouldn't have to assume that every method call is potentially dangerous, and it could eliminate some of these defensive copies while still keeping the field as readonly.
ReplyDeleteEnd result should be about the same as your custom attribute and future test, but built into the compiler instead of requiring an extra test to make sure you're still treating all those fields as read-only.
Thanks for the insight!
ReplyDeleteI keep coming back to this article to scratch my head and contemplate how this can really be what's happening.
ReplyDeleteYou would think that having a more restrictive type of field would allow the static compiler and JIT to optimise reading the value of the field, since it can never change after the scope of the constructor call has ended. It makes the field infinitely cache-able.
I have to assume this temporary variable nonsense exists to accommodate mutable structures, the most evil of structs.