Script Interpreting: A Use for Unions

While I'm waiting to check all of Torque back out of my team's repository, I thought I'd write a bit about the basis of my script interpreter. One of the first problems is: how do you store values, not just the ones stored in variables, but also from the intermediate steps of operations. I decided to start off simple, with just ints, floats, and strings. After a bit of thought I concluded that unions were the best answer to the problem.

What is curious about unions is that every introductory C++ text I've ever seen has a section about them, usually with a single contrived example, but I have never before seen one used for a serious purpose, or considered using one myself. The thing is, it's obvious what structs are good for; they allow the programmer to make convenient little bundles out of items of data that belong together. For unions, a use is far less clear; after all, you can only really use one of the union's fields at a time, so if that's the case, why not just use an object of the desired type? On the other hand, if polymorphism is what you want, then classes with inheritance are what you want really.

When I sat down to work with my script interpreter idea, though, it was clear that I wanted a single datatype that I could pass around everywhere, but I wanted it to have a minimal memory footprint, since I'm likely to have a lot of them. A union solves the problem neatly, although only if used as only more sophisticated texts usually bother to show. The trouble is that by itself a union is virtually useless, since it can't tell you which of its fields you wrote data to, and thus will want to read data from in order to maintain sanity. What a typical example shows is just a plain union with some fields. It then proceeds to write some data to a field of the union, and then read it back out again on the next line. The way that unions are actually useful is if you wrap them with a struct:

struct UsefulUnion{
    short type; //0 = int, 1 = double
    union{
        int iVal;
        float fVal;
    };
};

Now you can really get some useful work done! You can create a UsefulUnion, and by keeping proper track of the type field make use of the anonymous union it contains. There is the slight bother of always checking and setting the `type', but your whole data structure fits into only 6 bytes (for a typical 32 bit system).

I fact, the UsefulUnion is basically identical to what I'm currently using for integer and floating point values in my interpreter. Strings presented a larger hurdle. First of all, I decided to use std::string as the basis for my strings: it's convenient to use, and I don't have any notion that I could really improve upon it. However, there are a couple of additional concerns that stopped me from just dropping a field std::string sVal into my union. Firstly, an std::string takes up 8 bytes (4 for its pointer to its actual string data, and 4 for its size field). Secondly, unlike with numbers there is a real incentive to make duplicate strings refer to the same object when possible. After all, many strings are used immutably, and just copied around, so there's no sense in copying the whole contents of the string unless one has to. So, my solution was to create this little data type:

struct stringWrapper{
    std::string* str;
    unsigned short count;
};

This way, my value struct only needs a 4 byte pointer to a stringWrapper, so it grows no larger. The stringWrapper is then 6 bytes and the std::string is 8, so a numeric value will require only 8 bytes even though a string value will take up 22 plus the size of the string data. The count field is used to store a reference count, so that multiple values can refer to the same string object, and as long as one of them exists the string will be maintained. The finished version of my value storage type is:

struct value{
    int type; //0=int, 1=float, 2=string
    union{
        int iVal;
        float fVal;
        stringWrapper* sVal;
    };
};

Reference counting does mean that one has to be careful and maintain the correct reference count, so I wrote some convenience functions for doing so:

void assignValue(value &lhs, const value &rhs)
{
    if(lhs.type==2 && lhs.sVal!=0){
        if(lhs.sVal==rhs.sVal && rhs.type==2)
            return;
        lhs.sVal->count--;
        if(lhs.sVal->count<=0){
            delete lhs.sVal->str;
            delete lhs.sVal;
        }
    }
    lhs.type=rhs.type;
    lhs.iVal=rhs.iVal;
    if(lhs.type==2 && lhs.sVal!=0)
        lhs.sVal->count++;
}

void setValueString(value &v,std::string* s)
{
    if(v.type==2 && v.sVal!=0){
        if(v.sVal->str==s)
            return;
        v.sVal->count--;
        if(v.sVal->count<=0){
            delete v.sVal->str;
            delete v.sVal;
        }
    }
    stringWrapper* wrap= new stringWrapper;
    wrap->count=1;
    wrap->str=s;
    v.sVal=wrap;
    v.type=2;
}

void destroyValue(value* v)
{
    if(v->type==2 && v->sVal!=0){
        v->sVal->count--;
        if(v->sVal->count<=0){
            delete v->sVal->str;
            delete v->sVal;
        }
    }
    delete v;
}

There's nothing too special about these; assignValue assigns the contents of one value to another, destroying the lhs value's string if necessary, setValueString places the given string into the given value, again destroying the old string if needed, and destroyValue deallocates the given value and its string if it had one.

On the whole I'm pretty happy with this system; it has proven to work fast and without requiring much memory. Memory mistakes (accessing null pointers and leaking memory) are relatively easy to make, but I've worked out all the ones I know of, and if I work carefully—and check my work—once I'm done none should remain and everything should work smoothly.

No Comments

Comment on this post