Caching Vector Sizes
Nearly every program depends on loops running over data which need to be as fast as possible. A very common situation is a loop over a list whose size rarely changes. In my case my script interpreter needs to interpret code like:
j = 0;
i = 1;
while(i<=10000){
j = (i + i) + (i % 2) - (j / i);
if(j == 3)
j = j + 1;
else
j = j - 1;
i = i + 1;
}
In fact I use this little nonsense loop to test the speed of the interpreter. The body of the loop gets turned into a list of expressions to be executed, so every time the scripted loop is run, the interpreter must loop over the statements that make up the body. I want this loop to be as fast as possible; lots of scripts use loops, the interpreter had better not introduce much overhead. Naturally enough I store the body statements in an STL vector. The first time I wrote the loop I wasn't thinking, and evaluated the vector's size each iteration to see if the loop was done. Of course the number of statements in the loop doesn't change while it's running, so far better to evaluate the size once before looping over the vector. Indeed, such a change gave a nice speed up. However I could still do better: the loop still has the same number of statements each iteration, so i could pull the size call farther out. After doing so I realized that i hadn't kept any data to measure the increase in speed, so i wrote a little program to mimic my problem:
#include <iostream>
#include <vector>
#include "/usr/include/mach/mach.h"
#include "/usr/include/mach/mach_time.h"
unsigned int precalc_size;
std::vector<int> list;
void do_stuff_1(){
int dummy;
for(unsigned int i=0; i<list.size(); i++){
dummy=list[i];
}
}
void do_stuff_2(){
int dummy;
unsigned int size=list.size();
for(unsigned int i=0; i<size; i++){
dummy=list[i];
}
}
void do_stuff_3(){
int dummy;
for(unsigned int i=0; i<precalc_size; i++){
dummy=list[i];
}
}
int main(int argc,char* argv[]){
int numIter=10000;
mach_timebase_info_data_t info;
mach_timebase_info(&info);
uint64_t start,end;
//fill up the list with some numbers
list.push_back(0);
list.push_back(1);
for(int i=2; i<1e4; i++){
list.push_back(list[i-1]+i-list[i-2]);
}
//perform our tests
start = mach_absolute_time();
for(int i=0; i<numIter; i++)
do_stuff_1();
end = mach_absolute_time();
std::cout << ((end - start) * info.numer / info.denom)/1e6
<< " milliseconds\n";
start = mach_absolute_time();
for(int i=0; i<numIter; i++)
do_stuff_2();
end = mach_absolute_time();
std::cout << ((end - start) * info.numer / info.denom)/1e6
<< " milliseconds\n";
start = mach_absolute_time();
precalc_size=list.size();
for(int i=0; i<numIter; i++)
do_stuff_3();
end = mach_absolute_time();
std::cout << ((end - start) * info.numer / info.denom)/1e6
<< " milliseconds\n";
return(0);
}
This program mirrors the fact that I have two loops, but this isn't directly obvious because one is invoking the other inside of a function. So, pulling the size call outside of the outer loop requires storing the result somewhere external to the function. (Fear not, my real code is part of an object, so it remembers that cached value instead of using an ugly global variable.) There are three versions of the loop containing function so that the three implementations can be compared side by side, and the stuff that mentions 'mach' provides timing. My results were:
7670 milliseconds
3930 milliseconds
3780 milliseconds
So the first improvement gave a 49% improvement, and the second gave another 4%. I originally spotted this problem while inspecting with shark, so I happen to know that vector::size() is surprisingly expensive, and depends on calling vector::begin() and vector::end(). So, while the example above is rather extreme, as loop bodies are rarely so slight compared to the loop overhead, this can still have a bigger impact than you might guess. Don't pass up making the first optimization, and then, take a look around and see if there are any wider loops that you may not be seeing.
At this point I'd like to take a moment to pat myself on the back. Best running times for the top code exmaple:
| Language | Execution Time |
|---|---|
| Existing Avernuscript | 25000 microseconds |
| My Interpreter | 3300 microseconds |
| Compiled C++ | 245 microseconds |
I'm still 13.5 times slower than compiled code, but 7.5 times faster than the system I seek to replace. I still have a lot more that I need to add to my system, but the basic structure is proving quite satisfactorily nimble.