Microsoft CLR Performance Team
Microsoft® .NET Framework
Summary: This article presents a low-level cost model for managed code execution time, based upon measured operation times, so that developers may make better informed coding decisions and write faster code. (30 printed pages)
Download the CLR Profiler. (330KB)
Introduction (and Pledge)
Towards a Cost Model for Managed Code
What Things Cost in Managed Code
Introduction (and Pledge)
There are myriad ways to implement a computation, and some are far better than others: simpler, cleaner, easier to maintain. Some ways are blazingly fast and some are astonishingly slow.
Don't perpetrate slow and fat code on the world. Don't you despise such code? Code that runs in fits and starts? Code that locks up the UI for seconds at a time? Code that pegs the CPU or thrashes the disk?
Don't do it. Instead, stand up and pledge along with me:
"I promise I will not ship slow code. Speed is a feature I care about. Every day I will pay attention to the performance of my code. I will regularly and methodically measure its speed and size. I will learn, build, or buy the tools I need to do this. It's my responsibility."
(Really.) So did you promise? Good for you.
So how do you write the fastest, tightest code day in and day out? It is a matter of consciously choosing the frugal way in preference to the extravagant, bloated way, again and again, and a matter of thinking through the consequences. Any given page of code captures dozens of such small decisions.
But you can't make smart choices among alternatives if you don't know what things cost: you can't write efficient code if you don't know what things cost.
It was easier in the good old days. Good C programmers knew. Each operator and operation in C, be it assignment, integer or floating-point math, dereference, or function call, mapped more or less one-to-one to a single primitive machine operation. True, sometimes several machine instructions were required to put the right operands in the right registers, and sometimes a single instruction could capture several C operations (famously ), but you could usually write (or read) a line of C code and know where the time was going. For both code and data, the C compiler was WYWIWYG—"what you write is what you get". (The exception was, and is, function calls. If you don't know what the function costs, you don't know diddly.)
In the 1990s, to enjoy the many software engineering and productivity benefits of data abstraction, object-oriented programming, and code reuse, the PC software industry made a transition from C to C++.
C++ is a superset of C, and is "pay as you go"—the new features cost nothing if you don't use them—so C programming expertise, including one's internalized cost model, is directly applicable. If you take some working C code and recompile it for C++, the execution time and space overhead shouldn't change much.
On the other hand, C++ introduces many new language features, including constructors, destructors, new, delete, single, multiple and virtual inheritance, casts, member functions, virtual functions, overloaded operators, pointers to members, object arrays, exception handling, and compositions of same, which incur non-trivial hidden costs. For example, virtual functions cost two extra indirections per call, and add a hidden vtable pointer field to each instance. Or consider that this innocuous-looking code:
compiles into approximately thirteen implicit member function calls (hopefully inlined).
Nine years ago we explored this subject in my article C++: Under the Hood. I wrote:
"It is important to understand how your programming language is implemented. Such knowledge dispels the fear and wonder of "What on earth is the compiler doing here?"; imparts confidence to use the new features; and provides insight when debugging and learning other language features. It also gives a feel for the relative costs of different coding choices that is necessary to write the most efficient code day to day."
Now we're going to take a similar look at managed code. This article explores the low-level time and space costs of managed execution, so we can make smarter tradeoffs in our day to day coding.
And keep our promises.
Why Managed Code?
For the vast majority of native code developers, managed code is a better, more productive platform to run their software. It removes whole categories of bugs, such as heap corruptions and array-index-out-of-bound errors that so often lead to frustrating late-night debugging sessions. It supports modern requirements such as safe mobile code (via code access security) and XML Web services, and compared to the aging Win32/COM/ATL/MFC/VB, the .NET Framework is a refreshing clean slate design, where you can get more done with less effort.
For your user community, managed code enables richer, more robust applications—better living through better software.
What Is the Secret to Writing Faster Managed Code?
Just because you can get more done with less effort is not a license to abdicate your responsibility to code wisely. First, you must admit it to yourself: "I'm a newbie." You're a newbie. I'm a newbie too. We're all babes in managed code land. We're all still learning the ropes—including what things cost.
When it comes to the rich and convenient .NET Framework, it's like we're kids in the candy store. "Wow, I don't have to do all that tedious stuff, I can just '+' strings together! Wow, I can load a megabyte of XML in a couple of lines of code! Whoo-hoo!"
It's all so easy. So easy, indeed. So easy to burn megabytes of RAM parsing XML infosets just to pull a few elements out of them. In C or C++ it was so painful you'd think twice, maybe you'd build a state machine on some SAX-like API. With the .NET Framework, you just load the whole infoset in one gulp. Maybe you even do it over and over. Then maybe your application doesn't seem so fast anymore. Maybe it has a working set of many megabytes. Maybe you should have thought twice about what those easy methods cost...
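As a rough illustration of the trade-off described above, the following C# sketch counts elements two ways: by loading the whole document into an `XmlDocument`, and by streaming through it with an `XmlReader`. The names `XmlCostSketch`, `CountWithDom`, and `CountWithReader` are hypothetical, and `XmlReader.Create` assumes .NET Framework 2.0 or later rather than the v1.1 runtime discussed in this article. It is a sketch of the two styles, not a benchmark.

```csharp
using System.IO;
using System.Xml;

public static class XmlCostSketch
{
    // "One gulp": builds a full in-memory DOM before answering the question.
    public static int CountWithDom(string xml, string elementName)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.GetElementsByTagName(elementName).Count;
    }

    // Streaming: visits one node at a time and never materializes the tree.
    public static int CountWithReader(string xml, string elementName)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == elementName)
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Both methods return the same count; the difference lies in how much memory each allocates along the way, which is exactly the kind of cost worth measuring before choosing.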
Unfortunately, in my opinion, the current .NET Framework documentation does not adequately detail the performance implications of Framework types and methods—it doesn't even specify which methods might create new objects. Performance modeling is not an easy subject to cover or document; but still, the "not knowing" makes it that much harder for us to make informed decisions.
Since we're all newbies here, and since we don't know what anything costs, and since the costs are not clearly documented, what are we to do?
Measure it. The secret is to measure it and to be vigilant. We're all going to have to get into the habit of measuring the cost of things. If we go to the trouble of measuring what things cost, then we won't be the ones inadvertently calling a whizzy new method that costs ten times what we assumed it costs.
(By the way, to gain deeper insight into the performance underpinnings of the BCL (base class library) or the CLR itself, consider taking a look at the Shared Source CLI, a.k.a. Rotor. Rotor code shares a bloodline with the .NET Framework and the CLR. It's not the same code throughout, but even so, I promise you that a thoughtful study of Rotor will give you new insights into the goings on under the hood of the CLR. But be sure to review the SSCLI license first!)
If you aspire to be a cab driver in London, you first must earn The Knowledge. Students study for many months to memorize the thousands of little streets in London and learn the best routes from place to place. And they go out every day on scooters to scout around and reinforce their book learning.
Similarly, if you want to be a high performance managed code developer, you have to acquire The Managed Code Knowledge. You have to learn what each low-level operation costs. You have to learn what features like delegates and code access security cost. You have to learn the costs of the types and methods you're using, and the ones you're writing. And it doesn't hurt to discover which methods may be too costly for your application—and so avoid them.
The Knowledge isn't in any book, alas. You have to get out on your scooter and explore—that is, crank up csc, ildasm, the VS.NET debugger, the CLR Profiler, your profiler, some perf timers, and so forth, and see what your code costs in time and space.
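For the "space" half of that exploration, one crude first probe is to compare the managed heap size before and after building an object graph. The sketch below assumes that `GC.GetTotalMemory(true)` gives a stable enough reading for this purpose; the names `SpaceCostSketch` and `MeasureRetainedBytes` are hypothetical. A real profiler such as the CLR Profiler gives a far more detailed picture.

```csharp
using System;

public static class SpaceCostSketch
{
    // Estimates how many heap bytes stay reachable after 'build' runs.
    // Other threads or finalizers can perturb the reading, so treat the
    // result as an approximation.
    public static long MeasureRetainedBytes(Func<object> build)
    {
        long before = GC.GetTotalMemory(true);   // true: allow a collection first
        object graph = build();
        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(graph);                     // keep the graph reachable until measured
        return after - before;
    }
}
```

For example, `SpaceCostSketch.MeasureRetainedBytes(() => new byte[1000000])` should report roughly one million bytes plus a small array header.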
Towards a Cost Model for Managed Code
Preliminaries aside, let's consider a cost model for managed code. That way you'll be able to look at a leaf method and tell at a glance which expressions and statements are more costly; and you'll be able to make smarter choices as you write new code.
(This will not address the transitive costs of calling your methods or methods of the .NET Framework. That will have to wait for another article on another day.)
Earlier I stated that most of the C cost model still applies in C++ scenarios. Similarly, much of the C/C++ cost model still applies to managed code.
How can that be? You know the CLR execution model. You write your code in one of several languages. You compile it to CIL (Common Intermediate Language) format, packaged into assemblies. You run the main application assembly, and it starts executing the CIL. But isn't that an order of magnitude slower, like the bytecode interpreters of old?
The Just-in-Time compiler
No, it's not. The CLR uses a JIT (just-in-time) compiler to compile each method in CIL into native x86 code and then runs the native code. Although there is a small delay for JIT compilation of each method as it is first called, every method called runs pure native code with no interpretive overhead.
Unlike a traditional off-line C++ compilation process, the time spent in the JIT compiler is a "wall clock time" delay, in each user's face, so the JIT compiler does not have the luxury of exhaustive optimization passes. Even so, the list of optimizations the JIT compiler performs is impressive:
- Constant folding
- Constant and copy propagation
- Common subexpression elimination
- Code motion of loop invariants
- Dead store and dead code elimination
- Register allocation
- Method inlining
- Loop unrolling (small loops with small bodies)
The result is comparable to traditional native code—at least in the same ballpark.
As for data, you will use a mix of value types or reference types. Value types, including integral types, floating point types, enums, and structs, typically live on the stack. They are just as small and fast as locals and structs are in C/C++. As with C/C++, you should probably avoid passing large structs as method arguments or return values, because the copying overhead can be prohibitively expensive.
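To make the struct-copying point concrete, here is a minimal sketch with a hypothetical 64-byte struct named `Big`. Passing it by value may copy all eight fields on each call (unless the JIT compiler inlines the call), whereas passing it with `ref` passes only an address. The type and method names are illustrative assumptions.

```csharp
public struct Big
{
    // Eight 8-byte fields: 64 bytes per instance.
    public long A, B, C, D, E, F, G, H;
}

public static class StructPassingSketch
{
    // By value: the callee receives its own copy of the struct.
    public static long SumByValue(Big b)
    {
        return b.A + b.B + b.C + b.D + b.E + b.F + b.G + b.H;
    }

    // By reference: the callee reads the caller's struct through a pointer.
    public static long SumByRef(ref Big b)
    {
        return b.A + b.B + b.C + b.D + b.E + b.F + b.G + b.H;
    }
}
```

Both methods compute the same result; whether the difference matters in a given program is something to measure rather than assume.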
Reference types and boxed value types live in the heap. They are addressed by object references, which are simply machine pointers just like object pointers in C/C++.
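The sketch below shows where boxing sneaks in. An `ArrayList` stores object references, so each `int` added to it is boxed into its own heap object, while a `List<int>` keeps the values inline in its backing array. Note that `List<T>` assumes .NET Framework 2.0 or later (generics are not available in v1.1), and the names `BoxingSketch`, `SumBoxed`, and `SumUnboxed` are hypothetical.

```csharp
using System.Collections;
using System.Collections.Generic;

public static class BoxingSketch
{
    // Each Add boxes the int into a new heap object; each cast unboxes it.
    public static int SumBoxed(int n)
    {
        ArrayList list = new ArrayList();
        for (int i = 0; i < n; i++)
        {
            list.Add(i);            // box
        }
        int sum = 0;
        foreach (object o in list)
        {
            sum += (int)o;          // unbox
        }
        return sum;
    }

    // The ints live directly in the list's int[] storage; no boxing occurs.
    public static int SumUnboxed(int n)
    {
        List<int> list = new List<int>();
        for (int i = 0; i < n; i++)
        {
            list.Add(i);
        }
        int sum = 0;
        foreach (int value in list)
        {
            sum += value;
        }
        return sum;
    }
}
```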
So jitted managed code can be fast. With a few exceptions that we discuss below, if you have a gut feel for the cost of some expression in native C code, you won't go far wrong modeling its cost as equivalent in managed code.
I should also mention NGEN, a tool which "ahead-of-time" compiles the CIL into native code assemblies. While NGEN'ing your assemblies does not currently have a substantial impact (good or bad) on execution time, it can reduce total working set for shared assemblies that are loaded into many AppDomains and processes. (The OS can share one copy of the NGEN'd code across all clients; whereas jitted code is typically not currently shared across AppDomains or processes. But see also .)
Automatic Memory Management
Managed code's most significant departure (from native) is automatic memory management. You allocate new objects, but the CLR garbage collector (GC) automatically frees them for you when they become unreachable. GC runs now and again, often imperceptibly, generally stopping your application for just a millisecond or two—occasionally longer.
Several other articles discuss the performance implications of the garbage collector and we won't recapitulate them here. If your application follows the recommendations in these other articles, the overall cost of garbage collection can be insignificant, a few percent of execution time, competitive with or superior to traditional C++ object new and delete. The amortized cost of creating and later automatically reclaiming an object is sufficiently low that you can create many tens of millions of small objects per second.
But object allocation is still not free. Objects take up space. Rampant object allocation leads to more frequent garbage collection cycles.
Far worse, unnecessarily retaining references to useless object graphs keeps them alive. We sometimes see modest programs with lamentable 100+ MB working sets, whose authors deny their culpability and instead attribute their poor performance to some mysterious, unidentified (and hence intractable) issue with managed code itself. It's tragic. But then an hour's study with the CLR Profiler and changes to a few lines of code cuts their heap usage by a factor of ten or more. If you're facing a large working set problem, the first step is to look in the mirror.
So do not create objects unnecessarily. Just because automatic memory management dispels the many complexities, hassles, and bugs of object allocation and freeing, because it is so fast and so convenient, we naturally tend to create more and more objects, as if they grow on trees. If you want to write really fast managed code, create objects thoughtfully and appropriately.
This also applies to API design. It is possible to design a type and its methods so they require clients to create new objects with wild abandon. Don't do that.
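As one hedged illustration of this API design point, consider a hypothetical `Samples` type. Its `ToArray` method forces a fresh allocation on every call, while the indexer and `CopyTo` let clients read the same data without creating new objects. All names here are invented for the example.

```csharp
using System;

public sealed class Samples
{
    private readonly double[] values;

    public Samples(double[] source)
    {
        values = (double[])source.Clone();
    }

    public int Count
    {
        get { return values.Length; }
    }

    // Allocation-heavy: every call hands the client a brand new array.
    public double[] ToArray()
    {
        return (double[])values.Clone();
    }

    // Frugal: read one element in place, with no allocation.
    public double this[int index]
    {
        get { return values[index]; }
    }

    // Frugal: copy into a buffer the caller owns and can reuse.
    // Returns the number of elements copied.
    public int CopyTo(double[] destination)
    {
        int n = Math.Min(values.Length, destination.Length);
        Array.Copy(values, destination, n);
        return n;
    }
}
```

A client that calls `ToArray` inside a loop allocates on every iteration; one that reuses a single buffer with `CopyTo` allocates once.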
What Things Cost in Managed Code
Now let us consider the time cost of various low-level managed code operations.
Table 1 presents the approximate cost of a variety of low-level managed code operations, in nanoseconds, on a quiescent 1.1 GHz Pentium-III PC running Windows XP and .NET Framework v1.1 ("Everett"), gathered with a set of simple timing loops.
The test driver calls each test method, specifying a number of iterations to perform, automatically scaled to iterate between 2^18 and 2^30 iterations, as necessary to perform each test for at least 50 ms. Generally speaking, this is long enough to observe several cycles of generation 0 garbage collection in a test which does intense object allocation. The table shows results averaged over 10 trials, as well as the best (minimum time) trial for each test subject.
Each test loop is unrolled 4 to 64 times as necessary to diminish the test loop overhead. I inspected the native code generated for each test to ensure the JIT compiler was not optimizing the test away—for example, in several cases I modified the test to keep intermediate results live during and after the test loop. Similarly I made changes to preclude common subexpression elimination in several tests.
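The driver described above might look something like the following sketch. This is not the harness used to produce Table 1: it assumes `System.Diagnostics.Stopwatch` (available from .NET Framework 2.0, not in v1.1), does not unroll the loop or subtract the "Control" loop overhead, and uses hypothetical names (`TimingSketch`, `Measure`, `IntAdd`, `Sink`).

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    public struct Result
    {
        public double AverageNs;   // mean time per iteration over all trials
        public double MinimumNs;   // best (minimum) trial
    }

    // 'test' must perform its operation the requested number of times.
    public static Result Measure(Action<long> test, int trials, double minMilliseconds)
    {
        // Scale from 2^18 up to 2^30 iterations until one run lasts long enough.
        long iterations = 1L << 18;
        while (iterations < (1L << 30) && RunOnce(test, iterations) < minMilliseconds)
        {
            iterations <<= 1;
        }

        double total = 0.0;
        double min = double.MaxValue;
        for (int t = 0; t < trials; t++)
        {
            // Convert milliseconds for the whole run to nanoseconds per iteration.
            double ns = RunOnce(test, iterations) * 1e6 / iterations;
            total += ns;
            if (ns < min) min = ns;
        }
        return new Result { AverageNs = total / trials, MinimumNs = min };
    }

    private static double RunOnce(Action<long> test, long iterations)
    {
        Stopwatch watch = Stopwatch.StartNew();
        test(iterations);
        watch.Stop();
        return watch.Elapsed.TotalMilliseconds;
    }

    // Written after the loop so the result stays live and the loop is not dead code.
    public static int Sink;

    // Sample test subject: one integer add per iteration (plus loop overhead).
    public static void IntAdd(long iterations)
    {
        int x = 0;
        for (long i = 0; i < iterations; i++)
        {
            x += (int)i;
        }
        Sink = x;
    }
}
```

A call such as `TimingSketch.Measure(TimingSketch.IntAdd, 10, 50.0)` mirrors the 10 trials and 50 ms minimum described above. As the article advises, inspect the generated native code before trusting any number such a loop reports.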
Table 1 Primitive Times (average and minimum) (ns)
| Avg | Min | Primitive | Avg | Min | Primitive | Avg | Min | Primitive |
|---|---|---|---|---|---|---|---|---|
| 0.0 | 0.0 | Control | 2.6 | 2.6 | new valtype L1 | 0.8 | 0.8 | isinst up 1 |
| 1.0 | 1.0 | Int add | 4.6 | 4.6 | new valtype L2 | 0.8 | 0.8 | isinst down 0 |
| 1.0 | 1.0 | Int sub | 6.4 | 6.4 | new valtype L3 | 6.3 | 6.3 | isinst down 1 |
| 2.7 | 2.7 | Int mul | 8.0 | 8.0 | new valtype L4 | 10.7 | 10.6 | isinst (up 2) down 1 |
| 35.9 | 35.7 | Int div | 23.0 | 22.9 | new valtype L5 | 6.4 | 6.4 | isinst down 2 |
| 2.1 | 2.1 | Int shift | 22.0 | 20.3 | new reftype L1 | 6.1 | 6.1 | isinst down 3 |
| 2.1 | 2.1 | long add | 26.1 | 23.9 | new reftype L2 | 1.0 | 1.0 | get field |
| 2.1 | 2.1 | long sub | 30.2 | 27.5 | new reftype L3 | 1.2 | 1.2 | get prop |
| 34.2 | 34.1 | long mul | 34.1 | 30.8 | new reftype L4 | 1.2 | 1.2 | set field |
| 50.1 | 50.0 | long div | 39.1 | 34.4 | new reftype L5 | 1.2 | 1.2 | set prop |
| 5.1 | 5.1 | long shift | 22.3 | 20.3 | new reftype empty ctor L1 | 0.9 | 0.9 | get this field |
| 1.3 | 1.3 | float add | 26.5 | 23.9 | new reftype empty ctor L2 | 0.9 | 0.9 | get this prop |
| 1.4 | 1.4 | float sub | 38.1 | 34.7 | new reftype empty ctor L3 | 1.2 | 1.2 | set this field |
| 2.0 | 2.0 | float mul | 34.7 | 30.7 | new reftype empty ctor L4 | 1.2 | 1.2 | set this prop |
| 27.7 | 27.6 | float div | 38.5 | 34.3 | new reftype empty ctor L5 | 6.4 | 6.3 | get virtual prop |
| 1.5 | 1.5 | double add | 22.9 | 20.7 | new reftype ctor L1 | 6.4 | 6.3 | set virtual prop |
| 1.5 | 1.5 | double sub | 27.8 | 25.4 | new reftype ctor L2 | 6.4 | 6.4 | write barrier |
| 2.1 | 2.0 | double mul | 32.7 | 29.9 | new reftype ctor L3 | 1.9 | 1.9 | load int array elem |
| 27.7 | 27.6 | double div | 37.7 | 34.1 | new reftype ctor L4 | 1.9 | 1.9 | store int array elem |
| 0.2 | 0.2 | inlined static call | 43.2 | 39.1 | new reftype ctor L5 | 2.5 | 2.5 | load obj array elem |
| 6.1 | 6.1 | static call | 28.6 | 26.7 | new reftype ctor no-inl L1 | 16.0 | 16.0 | store obj array elem |
| 1.1 | 1.0 | inlined instance call | 38.9 | 36.5 | new reftype ctor no-inl L2 | 29.0 | 21.6 | box int |
| 6.8 | 6.8 | instance call | 50.6 | 47.7 | new reftype ctor no-inl L3 | 3.0 | 3.0 | unbox int |
| 0.2 | 0.2 | inlined this inst call | 61.8 | 58.2 | new reftype ctor no-inl L4 | 41.1 | 40.9 | delegate invoke |
| 6.2 | 6.2 | this instance call | 72.6 | 68.5 | new reftype ctor no-inl L5 | 2.7 | 2.7 | sum array 1000 |
| 5.4 | 5.4 | virtual call | 0.4 | 0.4 | cast up 1 | 2.8 | 2.8 | sum array 10000 |
| 5.4 | 5.4 | this virtual call | 0.3 | 0.3 | cast down 0 | 2.9 | 2.8 | sum array 100000 |
| 6.6 | 6.5 | interface call | 8.9 | 8.8 | cast down 1 | 5.6 | 5.6 | sum array 1000000 |
| 1.1 | 1.0 | inst itf instance call | 9.8 | 9.7 | cast (up 2) down 1 | 3.5 | 3.5 | sum list 1000 |
| 0.2 | 0.2 | this itf instance call | 8.9 | 8.8 | cast down 2 | 6.1 | 6.1 | sum list 10000 |
| 5.4 | 5.4 | inst itf virtual call | 8.7 | 8.6 | cast down 3 | 22.0 | 22.0 | sum list 100000 |
| 5.4 | 5.4 | this itf virtual call | | | | 21.5 | 21.4 | sum list 1000000 |
A disclaimer: please do not take this data too literally. Time testing is fraught with the peril of unexpected second order effects. A chance happenstance might place the jitted code, or some crucial data, so that it spans cache lines, interferes with something else, or what have you. It's a bit like the Uncertainty Principle: times and time differences of 1 nanosecond or so are at the limits of the observable.
Another disclaimer: this data is only pertinent for small code and data scenarios that fit entirely in cache. If the "hot" parts of your application do not fit in on-chip cache, you may well have a different set of performance challenges. We have much more to say about caches near the end of the paper.
And yet another disclaimer: one of the sublime benefits of shipping your components and applications as assemblies of CIL is that your program can automatically get faster every second, and get faster every year. It gets "faster every second" because the runtime can (in theory) retune the JIT compiled code as your program runs; and "faster every year" because with each new release of the runtime, better, smarter, faster algorithms can take a fresh stab at optimizing your code. So if a few of these timings seem less than optimal in .NET 1.1, take heart that they should improve in subsequent releases of the product. It follows that any given native code sequence reported in this article may change in future releases of the .NET Framework.
Tricky C# and SQL Interview Questions for mid-to-senior level Positions
A few readers have left private comments asking me what kinds of interview questions I’ve asked potential candidates. Personally, I have a lot more fun with mid-to-senior level positions as it opens the door to asking tricky C# and SQL interview questions just to see how well they know their stuff, and if they don’t know, how they handle it.
Feel free to use any of them!
Tricky SQL Interview Questions
The results from the SQL:
Describe two actions which can be undertaken with tempdb files to increase SQL Server’s performance.
Primary answers I look for:
i) tempdb files should be moved to a different physical drive from the server’s log files and production database(s) log files because of how active it is and how much I/O occurs with it.
ii) Create multiple tempdb files. It increases the number of physical I/O operations that SQL Server can push to the disk at any one time. The more I/O SQL Server can push down to the disk level, the faster the database will run.
See Increase SQL Server tempdb Performance for more information.
What is the difference between UNION and UNION ALL?
UNION removes duplicate rows from the result set; UNION ALL does not.
You have been tasked with increasing the speed of a stored procedure that runs once a month, deleting approximately 25 million records of stale data from a table called “StaleWorkOrders”.
Your sole job is to increase the speed at which it runs: you don’t care about any sort of logging and there are no transaction blocks that need to be rolled back.
You’ve made an important change. One of the SQL statements below was the original code; the other is your new code:
a) Which SQL statement was originally there? And which one did you change it to?
b) Why did you make the change?
a) DELETE FROM was the original statement which you replaced with the TRUNCATE statement.
b) TRUNCATE TABLE quickly deletes all records in a table by deallocating the data pages used by the table. This reduces the resource overhead of logging the deletions, as well as the number of locks acquired, thus increasing performance. It also does not fire any triggers. In both SQL Server and Oracle, identity columns will be reset to their starting values, but sequences will not be automatically reset – this must still be done manually, as a sequence is not connected to a table.
Bonus points if the interviewee knows this difference between Oracle and SQL Server when using TRUNCATE:
Tricky C# Interview Questions