Generic Data Structures in C

22 April 2017

See here for a final implementation if you don’t want to read all this.

Generic data structures in C typically have a pretty unfriendly API. They either rely on void pointers and erase type information, or resort to macros to provide a semblance of the templating system found in C++.

This post will look at constructing a macro-based vector in C with a focus on ease of use. We will use modern C11 features and ample compiler extensions to see where we can take this.

A Generic Vector

First, lets define our vector type. We’ll call it qvec because its short and sweet.

#define qvec(T)             \
    struct qvec_##T {       \
        size_t cap, len;    \
        T data[];           \
    }

We take a parameter T which will represent the type that is stored in our vector. This will be templatized at compile-time, similar to how vector<T> is in C++.

The data field is a flexible array member from C99.

Note: We will forgo error checking of malloc and realloc for simplicity.

`new`

The new function should malloc enough memory for some initial members. The size of the required storage will depend on the size of T. A possible implementation could be

#define qvec_new(T, v)                                       \
do {                                                         \
    size_t initial_size = 16;                                \
    v = malloc(sizeof(qvec(T)) + sizeof(T) * initial_size);  \
    v->cap = initial_size;                                   \
    v->len = 0;                                              \
} while (0)

which we can use to initialize a vector of integers as

qvec(int) *v;
qvec_new(int, v);

The flexible array member allows us to get away with a single call to malloc which is a minor nicety. Otherwise, this is a little underwhelming. The separation of declaration and initialization is not ideal.

To make this a bit nicer, we can use statement expressions which allow multiple statements to be evaluated and used as if they were an expression. Our new definition for new would then be

#define qvec_new(T)                                                           \
({                                                                            \
    const size_t initial_size = 16;                                           \
    struct qvec_##T *v = malloc(sizeof(qvec(T)) + sizeof(T) * initial_size);  \
    v->cap = initial_size;                                                    \
    v->len = 0;                                                               \
    v;                                                                        \
})

which gives us the much more natural usage

qvec(int) *v = qvec_new(int);

Standard Functions

Lets now implement the common vector functions push, pop and at.

`pop`

pop doesn’t require any special knowledge of the type T so this is simply

#define qvec_pop(v) (v->data[--v->len])

`at`

at is slightly more interesting. When working with a C++ vector (or a standard C array), the notation array[x] is an lvalue which can be assigned to. It would be nice if our qvec has this property as well.

First, lets define the helper function

#define qvec_ref(v, i) (&v->data[i])

This returns an lvalue and so can be used with a pointer dereference. e.g. *qvec_ref(v, i) = 5.

We can wrap this in another macro to hide this dereference

#define qvec_at(v, i) (*(qvec_ref(v, i)))

`push`

push presents a small problem. If we were to generate a standard implementation

#define qvec_push(v, i)                                 \
({                                                      \
    if (v->len >= v->cap) {                             \
        v->cap *= 2;                                    \
        v = realloc(v, sizeof(?) + v->cap * sizeof(?)); \
    }                                                   \
    v->data[v->len++] = (i);                            \
})

we might be left wondering what to insert into the ? marked locations.

The second ? is less worrying. This should be sizeof(T). We could just pass the type again, but doing it on every push is not ideal. In fact, we don’t need any new information. Recall that the data field of qvec is of type T[]. Performing a dereference of this will give us the size of a single T, exactly what we want!

The first ? is more bothersome. We are interested in determining the value of sizeof(qvec(T)). We can’t use the data field here, since the T required here is the actual typename used during initialization. This would be viable if it were possible to generate a type name from an arbitrary variable but unfortunately we cannot do this.

The way to get this size is first to realise that the data member in a qvec doesn’t actually take up any space within the array, not even for a pointer.

We can confirm this by checking the following

struct {
    char a, b;
    char b[]
} foo;

printf("foo is %zu bytes\n", sizeof(foo));

which will print

foo is 2 bytes

Since this data doesn’t take any space, we can see that the other members (len and cap) have a fixed type and therefore size, regardless of the type of T.

We can separate the type of qvec into

#define qvec_base       \
    struct {            \
        size_t cap, len;\
    }

#define qvec(T)         \
    struct qvec_##T {   \
        qvec_base;      \
        T data[];       \
    }

This now allows us to query the size of the type-independent part of a qvec while retaining access to all the members in the same way.

As an aside, we can define this using less macro-wizardry if we enable the -fplan9-extensions option in GCC as documented here.

struct qvec_base {
    size_t cap, len;
}

#define qvec(T)             \
    struct qvec_##T {       \
        struct qvec_base;   \
        T data[];           \
    }

This allows embedding of existing struct definitions as an anonymous struct.

Now, finally, we can define our push function as:

#define qvec_push(v, i)                                                 \
({                                                                      \
    if (v->len >= v->cap) {                                             \
        v->cap *= 2;                                                    \
        v = realloc(v, sizeof(qvec_base) + v->cap * sizeof(*v->data));  \
    }                                                                   \
    v->data[v->len++] = (i);                                            \
})

`free`

Since we only use a single malloc to initialize the type, this is simply

#define qvec_free(v) free(v)

API so far

Lets see what this gives us so far

qvec(int) *iv = qvec_new(int);
qvec_push(iv, 5);
qvec_push(iv, 8);
printf("%d\n", qvec_at(iv, 0));
qvec_at(iv, 1) = 5;
qvec_free(iv);

and compared similar C++ vector usage

std::vector<int> iv;
iv.push_back(5);
iv.push_back(8);
printf("%d\n", iv[0]);
iv[1] = 5;

Looking okay, but lets go a bit further.

Extended Functions

Generic Printing

It is fairly common that we want to dump the values of a vector to see what is inside. If we wanted to write this for an integer vector, the following would work

#define qvec_int_print(v)               \
({                                      \
    printf("[");                        \
    for (int i = 0; i < v->len; ++i) {  \
        printf("%d", v->data[i]);       \
        if (i + 1 < v->len)             \
            printf(", ");               \
    }                                   \
    printf("]\n");                      \
})

which can be used as

qvec_print(iv); // [5, 5]

This is nice, but since it isn’t generic it has a limited use case. Fortunately for us, C11 brings some new interesting features to the table which we can use.

The C11 _Generic keyword allows rudimentary switching based on the type of its input. Think of it just as a compile-time switch statement on types.

For example, we could construct a macro to print the name of a type

#define type_name(x) _Generic((x), int: "int", float: "float")

printf("This is a %s\n", type_name(5.0f));
printf("This is a %s\n", type_name(5));

which when run would output

This is a float
This is a int

We can use this to generate the appropriate printf format specifier for the passed type.

#define GET_FMT_SPEC(x) _Generic((x), int: "%d", float: "%f", char*: "%s")

and modifying our print function

#define qvec_print(v)                   \
({                                      \
    printf("[");                        \
    for (int i = 0; i < v->len; ++i) {  \
        printf(GET_FMT_SPEC(v->data[i]), v->data[i]);\
        if (i + 1 < v->len)             \
            printf(", ");               \
    }                                   \
    printf("]\n");                      \
 })

This would now work on an integer and float qvec type with no modifications. Of course, we could extend GET_FMT_SPEC with whatever types we need.

You may recall that I mentioned that we could solve an earlier issue regarding our push function if we could generate a type name from a variable. It seems like the _Generic keyword would help is achieve this and indeed it does in part. The problem is that it is evaluated after preprocessing, so we cannot use its output as part of the preprocessor token concatenation process.

This is an easy mistake to make, since _Generic is seen pretty much solely within macro definitions for obvious reasons. This isn’t required though, the following being perfectly valid code.

int a;
float b;

printf("%s\n", _Generic(a, int: "a is an int", float: "a is a float"));
printf("%s\n", _Generic(b, int: "b is an int", float: "b is a float"));

Initializer Lists

Since C++11, vectors can now be initialized with initializer lists

std::vector<int> = {4, 5, 2, 3};

This is pretty nice. Let’s add something similar to our new function using C99 variadic macros with a GCC extension which allows an arbitrary name to be given for them.

#define QVEC_ALEN(a) (sizeof(a) / sizeof(*a))

#define qvec_new(T, xs...)                                                    \
({                                                                            \
    const size_t initial_size = 16;                                           \
    const T _xs[] = {xs};                                                     \
    struct qvec_##T *v = malloc(sizeof(qvec(T)) + sizeof(T) * QVEC_ALEN(_xs));\
    v->cap = initial_size;                                                    \
    v->len = QVEC_ALEN(_xs);                                                  \
    for (int i = 0; i < v->len; ++i)                                          \
        v->data[i] = _xs[i];                                                  \
    v;                                                                        \
})

xs here collects all arguments except the first. We assign these to a temporary array which allows us to work with the values, but also has the effect of typechecking the values.

qvec(int) *v = qvec_new(int, 4, 5, 2, 3);

Complex Objects

Suppose we have the following type

typedef struct {
    char *id;
    bool is_tasty;
} Food;

We might try and utilize C99 struct initializers to perform the following

qvec(Food) *v = qvec_new(Food);
qvec_push(v, { .id = "apple", .is_tasty = true });

This however fails to compile. Under clang, we get the following error

qvec.c:103:34: error: too many arguments provided to function-like macro
      invocation
    qvec_push(v, { "apple", 1 });
                                 ^
qvec.c:42:9: note: macro 'qvec_push' defined here
#define qvec_push(v, i)                                                       \
        ^
qvec.c:103:5: note: cannot use initializer list at the beginning of a macro
      argument
    qvec_push(v, { "apple", 1 });
    ^            ~~~~~~~~~~~~~~~~~~~~
qvec.c:103:5: error: use of undeclared identifier 'qvec_push'
    qvec_push(v, { "apple", 1 });
    ^

The reason this doesn’t work is that the C preprocessor is dumb. It doesn’t know that this is a designated initializer because it doesn’t actually know anything about the C language. Instead, it sees two arguments. The first being { .id = "apple" and the second .is_tasty = true }.

The can get around this is by using the previously mentioned variadic macros once again. Using a similar technique to the previously extended new function.

#define qvec_push(v, xs...)                                             \
({                                                                      \
    const typeof(*v->data) _xs[] = {xs};                                \
    if (v->len + QVEC_ALEN(_xs) >= v->cap) {                            \
        while (v->cap <= v->len + alen(_xs)) {                          \
            v->cap = 2 * v->cap;                                        \
        }                                                               \
        v = realloc(v, sizeof(qvec_base) + v->cap * sizeof(*v->data));  \
    }                                                                   \
    for (int i = 0; i < QVEC_ALEN(_xs); ++i) {                          \
        v->data[v->len++] = _xs[i];                                     \
    }                                                                   \
    v;                                                                  \
})

The reason variadic macros help here is that all macro arguments are gathered at once and treated as input to an array initializer. Even though individual arguments are not valid tokens, it doesn’t matter, since the full set of argments is.

Another thing to note is the use of the typeof keyword. This allows us to retrieve the type of an expression, which can be used to initialize new types. The most common example of its usage is likely within a type-generic swap macro.

#define swap(x, y)              \
do {                            \
    const typeof(x) _temp = y;  \
    y = x;                      \
    x = _temp;                  \
} while (0)

Extensions, Extensions, Extensions

Our code is already filled with compiler-specific C extensions, so we may as well go overboard.

RAII

One of the better features of C++ is the ability to utilize RAII to run destructors on block exit. This reduces the chance that leaks occur within programs and just makes using complex types much more pleasant.

The cleanup variable attribute is a GCC extension which allows a user-defined cleanup function to automatically run when the value goes out of scope.

This attribute takes one argument, a function of type void cleanup(T**) where T is the type which this attribute is declared with.

Using this with our qvec, it may look like

static inline _qvec_free(void **qvec) { free(*qvec); }

int main(void)
{
    qvec(int) __attribute__ ((cleanup(_qvec_free))) *qv = qvec_new(int);
    // No qvec_free here!
}

This is a little verbose however, so lets define our own keyword which we can use instead.

#define raii __attribute__ ((cleanup(_qvec_free)))

int main(void)
{
    raii qvec(int) *qv = qvec_new(int);
}

Note that an attribute doesn’t strictly need to be specified after the type definition.

This is nice, but if you had actually compiled the above you would get a number of type errors.

qvec.c: In function ‘main’:
qvec.c:13:12: warning: passing argument 1 of ‘_qvec_free’ from incompatible pointer type [-Wincompatible-pointer-types]
     struct {                                                                  \
            ^
qvec.c:26:40: note: in expansion of macro ‘qvec_base’
     struct qvec_##T *v = malloc(sizeof(qvec_base) + sizeof(_xs));             \
                                        ^
qvec.c:94:25: note: in expansion of macro ‘qvec_new’
     raii qvec(int) *v = qvec_new(int);
                         ^
qvec.c:88:20: note: expected ‘void **’ but argument is of type ‘struct qvec_int **’
 static inline void _qvec_free(void **qvec) { free(*qvec); }

The compiler complains because we are relying on an implicit cast to void. We know this is actually valid however, since every qvec is going to use a single call to free in order to release its memory.

As far as I’m aware, this requires a pragma at the callsite to disable this locally. This is quite inconvenient, and really loses out any usability that we may have gained from using this. The following will compile without warnings

int main(void)
{
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wincompatible-pointer-types"
    raii qvec(int) *v = qvec_new(int, 5, 4, 3);
#pragma GCC diagnotic pop
}

At this stage though, remembering to just manually free seems like a saner choice.

Type Inference

One of the nice features of C++11 onwards is the revitalization of the auto keyword. This now provides type inference which is very nice in a number of circumstances.

If we look at our vector initialization

qvec(int) *v = qvec_new(int);

we clearly have a bit of redundancy. Unfortunately the C language doesn’t support type inference… as part of the standard at least. An interesting extension is the _auto_type keyword which provides some limited type inference capabilities.

Since the auto keyword is practically useless, lets just redefine it

Redefining keywords is usually a very bad idea. Although, GCC will allow it.

#define auto __auto_type

auto iv = qvec_new(int);

Although yet again, our expectations differ to reality. This will not compile! The reason for this is that previously we were relying on the inline struct definition of qvec(T) that was declared on every initialization. Without this declaration, our new auto keyword cannot find any struct which matches the return type and must fail.

As an example, the following works fine

qvec(int) *a = qvec_new(int);
auto b = qvec_new(int);

because the qvec(int) declared the struct, so the next qvec return type can be deduced correctly. This is simply an inherent limitation with the tools we have. A simple solution would be simply forward declare our structs.

qvec(int);

int main(void)
{
    auto a = qvec_new(int); // Ok!
}

But this is one extra line to type for each qvec type required!

Drawbacks

We have a pretty good set of functions associated with our qvec so far. Usability is ok and we have a few of the more desirable features of C++ in our hands within C.

Undoubtedly however, there are some inherent problems that we just can’t solve.

Complex Container Types

We can do the following in C++

std::vector<std::vector<std::vector<int>>> v;

To do this with our qvec the following is required

typedef qvec(int) qvec_int;
typedef qvec(qvec_int) qvec_qvec_int;
qvec(qvec_qvec_int) *v = qvec_new(qvec_qvec_int);

Recall back to our new implementation. We generate a struct with a name qvec_##T where T is the type. Since this is concatenated to make an identifier, the types must be comprised only of characters which can exist within an identifier ([_0-9A-Za-z]). Any types which use other characters, such as functions, pointers and even our own qvec types must have a typedef before we can use them.

As an example, the following

qvec(char**);

expands to the invalid struct declaration

struct qvec_char** {
    size_t cap, len;
    char* data[];
};

Too Much Inlining

Since we are dealing with macros, every call is going to generate the same code at the call site. This isn’t too big a deal with our qvec, since a vector is inherently pretty simple, but if we wanted to use the same techniques to construct a generic hashmap, for example, the code duplication would be much worse.

This is where the generic containers which rely on simply generating the required functions for each type (see khash) definitely have the upper hand.

These approaches however do lose out a bit in terms of the expressiveness of the resulting API (which is our main focus here).

Which Names are Which?

Say we wanted to do the following contrived thing

void print(qvec(int) *v)
{
    qvec_print(v);
}

int main(void)
{
    qvec(int) *v = qvec_new(int, 1, 2, 3);
    print(v);
}

This will spew our a mess of errors about anonymous structs. The reason being is that the qvec(int) in the print parameter list is declaring a new anonymous struct, and the two qvec(int) declarations are completely different structures.

This can be worked around by doing a typedef at the start of your file and using this, but again at the cost of extra work for the programmer.

How about the following example. Will this qvec_new be aware of the type being used within the Foo struct?

struct Foo {
    qvec(int) *values;
};

void foo_init(Foo *v)
{
    v->value = qvec_new(int);
}

int main(void)
{
    struct Foo f;
    foo_init(&f);
}

This in fact will work potentially to some surprise. Even though this does, it still highlights a pretty important problem. Even though the API is nice and appears easy to use, there are a number of naming issues that the user must be aware of, which greatly limits its usage as a just works type of structure.

A Final Look

#include "qvec.h"

typedef char* string;

typedef struct {
    int x, y;
} Tuple;

int main(void)
{
    qvec(string) *sv = qvec_new(string, "Who", "are", "you?");
    qvec_print(sv);
    qvec_at(sv, 2) = "we?";
    qvec_print(sv);
    qvec_free(sv);

    qvec(int) *iv = qvec_new(int, 1, 2, 3, 4);
    qvec_print(iv);
    printf("%d\n", qvec_pop(iv));
    qvec_free(iv);

    qvec(Tuple) *tv = qvec_new(Tuple, { .x = 0, .y = 1 }, { 4, 2 }, { 5, 4 });
    printf("%d\n", qvec_at(tv, 1).x);
    printf("%d\n", qvec_at(tv, 2).x);
    qvec_free(tv);
}

So would I recommend using this? Probably not. If you were insistent on sticking with C however I think the best compromise would be to generate the specific instantiations (similar to what khash does). This gets rid of most of the problems specified here. Alternatively, if performance and the type-safety isn’t a big deal, then a tried and tested void* implementation would be good too.

At the end of the day though, the pragmatic solution would be to just use C++ if there are no reasons not to and call it a day. Especially if you are considering performing these types of C macro chicanery.