Having fun playing with fire, er…C

This topic takes me back to some of the “good old days”, when I was a C programmer. I found “Top 10 Ways to be Screwed by the C programming language”, by Dave Dyer, on reddit. As I read through it I got a chuckle. Here are some good ones.

Note: none of the code examples display carriage-returns in strings that are printed. For whatever reason my blog software has a real problem with them.

Accidental assignment/Accidental Booleans

        if(a=b) c; /* a always equals b, but c
                      will be executed if b!=0 */

Depending on your viewpoint, the bug in the language is that the assignment operator is too easy to confuse with the equality operator; or maybe the bug is that C doesn’t much care what constitutes a boolean expression: (a=b) is not a boolean expression! (but C doesn’t care).

No, this was not a bug in the language. It’s working the way it was intended. Fundamentally C has a simple structure. Part of that is you can evaluate expressions just about anywhere, no matter what type they are. It works the same way in for-loops:

int i;
for (printf("Starting processes"), i = 0;
     i < 6 && ExecuteProcess(i);
     printf("Processes executed successfully"), i++);

In this example, ExecuteProcess() returns 0 for failure or a non-zero value for success. That’s how it determines “true” and “false”: 0 == “false”. The example above will work. It’s hackerish, and it’s something I tried to avoid doing in my own code just to avoid confounding others, but the language designers wanted to allow this. The way they saw it, a for-loop was just a “form” with an initialization step, a test, a code block that gets executed (if you put one in; leaving it out here is intentional), and a post-execution step. You can pour whatever you want into these slots. The if construct is no different. In this way C is a bit like Lisp, which has similar “open-ended” forms where you can put any sort of expression into the slots of constructs.

Closely related to this lack of rigor in booleans, consider this construction:

        if( 0 < a < 5) c; /* this "boolean"
                             is always true! */

Always true because (0<a) generates either 0 or 1 depending on if (0<a), then compares the result to 5, which is always true, of course. C doesn’t really have boolean expressions, it only pretends to.

I forget when I last saw constructs like this actually work. Maybe BASIC? It’s useful, but he’s right: it won’t fly in C. The reason is that C treats the code inside the if statement as just an expression to evaluate. Nothing more. It doesn’t consider the context. It evaluates expressions according to a fixed set of rules. A book I got years ago, “C: A Reference Manual”, by Harbison and Steele, is an excellent book on C. There was one page I went to more than any other. It had a hierarchical list of all the operators you could use, and what precedence they had in relation to each other. The higher up in the list, the higher their precedence. Some operators were equal in precedence, in which case they’d just be evaluated from left to right. Dyer’s example above is one such case.

Or consider this:

if( a =! b) c;      /* this is compiled as (a = !b)
                       an assignment, rather than
                       (a != b) or (a == !b) */

Yes, it’s a typo, but an easy one to make. I understand that more modern languages, like C# (maybe Java, too), would’ve flagged this. I know C# flags an assignment used inside a conditional statement; I forget if it’s just a warning or a compile error.

Unhygienic macros

        #define assign(a,b)
           a=(char)b  

        assign(x,y>>8)

becomes

        x=(char)y>>8    /* probably not what
                           you want */

I didn’t get this example. Yes, you have to be careful how you construct your macros so they fit the situations you intend to use them in. Just common sense, if you ask me. Just rewrite the macro like so: “#define assign(a, b) a=(char)(b)”. So for “assign(x,y>>8)” you’ll get “x=(char)(y>>8)”. That wasn’t so hard, was it?

Mismatched header files
Suppose foo.h contains:

        struct foo { BOOL a; };

  file F1.c  contains
        #define BOOL char
        #include "foo.h"

  file F2.c contains
        #define BOOL int
        #include "foo.h"

now, F1 and F2 disagree about the fundamental attributes of structure “foo”. If they talk to each other, You Lose!

This is just undisciplined coding at work. The place to define a value like BOOL is in something like a types.h file. You would then include that file inside foo.h. This way whenever you include foo.h you also get a consistent BOOL type built into the inclusion.

Unpredictable struct construction
Consider this bit packing struct:

    struct eeh_type    {
            uint16 size: 10;  /* 10 bits */
            uint16 code: 6;   /* 6 bits */
    };

Depending on which C compiler, and which “endian” flavor of machine you are on, this might actually be implemented as

        <10-bits><6-bits>

or as

        <6-bits><10-bits>

Also, again depending on the C compiler, machine architecture, and various mysterious preference settings, the items might be aligned to the nearest 8, 16, 32, or 64 bits. So what matters? If you are trying to match bits with a real world file, everything!

Need another way to lose big? How about this:

Rect foo = {0,1,2,3}; // assign numbers to
                      // the first four slots

You may think you know what those four slots are, but there’s at least an even chance you’ll have to discover the hard way if the structure ever changes.

Indefinite order of evaluation (contributed by Xavier @ triple-i.com)

        foo(pointer->member, pointer = &buffer[0]);

Works with gcc (and other compilers I used until I tried acc) and does not with acc. The reason is that gcc evaluates function arguments from left to right, while acc evaluates arguments from right to left. K&R and ANSI/ISO C specifications do not define the order of evaluation for function arguments. It can be left-to-right, right-to-left or anything else and is “unspecified”. Thus any code which relies on this order of evaluation is doomed to be non portable, even across compilers on the same platform.

This point of view isn’t entirely uncontroversial. Read the supplementary dialog on the subject.

A fundamental misunderstanding of C I’ve seen occasionally is programmers not recognizing that it’s just a higher level assembly language. This goes for Dyer’s complaint about “fake booleans” as well. If you really want to grok why C does what it does, check out an assembly language sometime.

ANSI C might’ve helped alleviate this misunderstanding some (or maybe it just created more confusion). K&R C was the epitome of the “higher level assembly language” mindset. It’s the original C language, the one that Kernighan & Ritchie wrote (hence the name). It cared about little beyond syntax. It cared almost nothing for types, even though you had to specify them. All types did for K&R C was tell the compiler how much memory you wanted for a variable, and/or how big an offset in memory you wanted between it and other variable values. I don’t know for sure, but I think it originated the concept that an index into an array is just an offset into it, not an enumerator for its elements.

For example, back when I used it in college, an int was 16 bits (2 bytes), and it would allocate that on the stack for me, and keep track of its length, if I declared an int variable. That was it. It didn’t care what I put in it, just as an assembler wouldn’t check what type of value I put into a memory location.

The compiler put adjacent fields in a struct into contiguous memory locations. It would figure the necessary amount of memory for the struct by adding up the length of each field specified in it, and then allocate the total amount. If I referenced a field within that struct, it was just an offset and a length into that memory area. Again, it didn’t care what type of value I put in it.

I discovered a couple years out of college that one could use this property of structs even in ANSI C to parse binary files. To tell you the truth, I think structs and bit fields were made for this sort of thing. Some example file reading code would be: “fread(&structVar, sizeof(structType), 1, fileHandle);”. fread()’s first parameter is of type void *, so it’ll take anything as a blob of memory. After this call, the struct variable is populated with data, and it’s easy enough to parse it by just saying “a = structVar.field1;”. Union types could also be used for parsing in memory.

In K&R C, if I defined a function, it’d look like this:

SomeFunction(arg1, arg2, arg3)
   int arg1;
   char *arg2;
   int arg3;
{/* some code */}

All the type specifiers did was tell the compiler where the offsets in the stack frame were. It didn’t even care how many arguments I passed in to the function! I could’ve passed an 8-byte struct in for arg1, and not filled in arg2 or arg3 for all it cared. I could’ve accessed the first 2 bytes of my struct via arg1, the next four bytes via arg2 (pointers were 4 bytes then), and the next 2 via arg3, because my 8-byte struct filled the stack frame the function was expecting. The struct could’ve been bigger, and it still wouldn’t have cared. That would’ve caused problems, since I’d have been clobbering other data on the stack, but the compiler didn’t watch for such things, just as an assembler wouldn’t have.

It wasn’t even that necessary to declare the function in a header file. If you called the function from another module, the compiler would assume it returned an int, and the linker would resolve the call. The times you really needed to declare a function were when it returned something other than an int, or when you had a circular reference, where Function A called Function B, which called Function A.

Maybe there was a point to this madness, but I know it drove even some very smart people nuts. One possibility was you could use this property of functions to automatically do some of the work for you in separating a chunk of memory into pieces, so you could use the function arguments as a kind of template into it. It’s hackerish, again, but it would’ve worked.

Anyway, continuing:

Easily changed block scope (Suggested by Marcel van der Peijl )

    if( ... )
         foo();
     else
         bar();

which, when adding debugging statements, becomes

    if( ... )
         foo();
         /* the importance of this
            semicolon can't be overstated */
    else
         printf( "Calling bar()" );
         /* oops! the else stops here */
         bar();
         /* oops! bar is always executed */

There is a large class of similar errors, involving misplaced semicolons and brackets.

This is a programming 101 mistake. Come on! Put brackets around your code blocks for cripes sake! 🙂 I’m getting the sneaking suspicion that the people complaining about this stuff are Python programmers.

I know Python determines code blocks via indentation. I personally find this a little dangerous, because while I’m fastidious about style, in a hurry I sometimes put code formatting aside to get something done, and clean it up later. Worrying about formatting slows you down. Formatting doesn’t matter to C; delimiters do.

Unsafe returned values (suggested by Bill Davis <wdavis@dw3f.ess.harris.com>)

char *f() {
   char result[80];
   sprintf(result,"anything will do");
   return(result);    /* Oops! result is
                         allocated on the stack. */
}

int g()
{
   char *p;
   p = f();
   printf("f() returns: %s",p);
}

The “wonderful” thing about this bug is that it sometimes seems to be a correct program; As long as nothing has reused the particular piece of stack occupied by result.

Yeah, this is a mistake often made by beginners who don’t understand stack dynamics that well. He’s right that sometimes this will work, if nothing else happens to overwrite that part of the stack. But there is an easy way to overcome this bug: make “result” static in f(). The buffer then outlives the call (though every call to f() reuses it, so it’s not safe for reentrant use). Another method is to dynamically allocate the buffer for “result” on the heap using malloc(), making “result” a pointer (but you gotta remember to free “p” later in g()!). That way only the pointer in f() is destroyed when the function returns; the buffer remains.

When you’re writing in C or C++ you have to be more concerned with how the computer is executing what you’re going to write.

Undefined order of side effects. (suggested by michaelg@owl.WPI.EDU and others)

Even within a single expression, even with only strictly manifest side effects, C doesn’t define the order of the side effects. Therefore, depending on your compiler, i/++i might be either 0 or 1. Try this:

#include <stdio.h>

int foo(int n) {printf("Foo got %d", n); return(0);}
int bar(int n) {printf("Bar got %d", n); return(0);}                   

int main(int argc, char *argv[])
{
  int m = 0;
  int (*(fun_array[3]))();
  int i = 1;
  int ii = i/++i;
  printf("i/++i = %d, ",ii);
  fun_array[1] = foo;
  fun_array[2] = bar;
  (fun_array[++m])(++m);
}

Prints either i/++i = 1 or i/++i = 0;
Prints either “Foo got 2” or “Bar got 2”

Yeah, this is a common problem with C/C++, and it’s been there forever. I took a brief course on C in college around 1991, and one of the things the teacher explicitly had us try was an example like this. The moral of the story is never to apply two side effects to the same variable in a single expression (++i is an assignment, after all). In other words, don’t write things like i/++i at all. The result depends on the compiler and the hardware, and you never know how it’s going to come out.

The reason I was told this and other such weirdness exists in the language is the language designers wanted to make it possible for the compiler to optimize for the hardware. This meant not locking down certain characteristics like this. If you wanted to predict what the compiler for a particular hardware platform would do, you needed to understand how the CPU handled sequences of operators.

Utterly unsafe arrays

This is so obvious it didn’t even make the list for the first 5 years, but C’s arrays and associated memory management are completely, utterly unsafe, and even obvious cases of error are not detected.

int thisIsNuts[4];
int i;
for ( i = 0; i < 10; ++i )
{
    thisIsNuts[ i ] = 0;
    /* Isn't it great? I can use elements
       1-10 of a 4 element array, and
       no one cares */
}

Of course, there are infinitely many ways to do things like this in C.

Any decent instructional material on C would tell you to watch out for this. Again, C is just a higher level assembly language; no assembler would watch out for this either. If you ran this code on a Unix system, you might get a “segmentation fault” error (the operating system terminating the process), or the out-of-bounds writes might just silently corrupt whatever sits next to the array on the stack.

Octal numbers (suggested by Paul C. Anagnostopoulos)

In C, numbers beginning with a zero are evaluated in base 8. If there are no 8’s or 9’s in the numbers, then there will be no complaints from the compiler, only screams from the programmer when he finally discovers the nature of the problem.

int numbers[] = { 001, // line up numbers for
                       // typographical
                       // clarity, lose big time
                  010,   // 8 not 10
                  014 }; // 12, not 14

I misremembered this at first: the backslash form (like ‘\012’) is an octal escape sequence, and it only appears inside character and string constants. For integer literals, the leading zero itself is the octal marker, just as Dyer describes, so the gripe is a fair one. There are all sorts of markers like this in C and C++. Don’t start a number with 0x unless you mean hexadecimal, don’t put an “f” after a number unless you want it to be floating-point, etc.

Fabulously awful “standard libraries” (suggested by Pietro Gagliardi)

The default libraries in C are leftovers from the stone age of computing, when anything that worked was acceptable. They are full of time bombs waiting to explode at runtime. For an example, look no further than the “standard i/o library”, which, amazingly, is still standard.

{
  int a=1,b=2;
  char buf[10];
  sscanf(buf, "%d %d", a, b);
  // don't you mean &a,&b? Prepare to blow!
  sprintf(buf, "this is the result: %d %d");
  // putting at least 20 characters in
  // a 10 character buffer
  // and fetching a couple random vars
  // from the stack.
}

I ran into this bug rather frequently because I didn’t use fscanf() or sscanf() that much. It makes some sense though, because C passes everything by value; you have to pass in pointers to create “out” values. The reason this fails is that functions like scanf() and printf() accept a variable number of arguments (after the format string). Since the value types could be anything, the language cannot restrict what types go in the argument list. And since variables are not references to objects, as in modern languages, but actual spots in physical memory, you have to distinguish a reference to a memory space (a pointer) from an actual value in memory. In this case, sscanf() wants references to spots in memory, but the example passes it ints. Since C does no runtime type checking, and the values come in through a var-arg list, sscanf() cannot tell that it isn’t getting pointers. It just has to assume that it is. In C++ it would be possible for a function like this to do run-time checking on the parameters and throw an exception if they weren’t of the correct type.

I don’t see this issue with var-args in C as a weakness in the standard library, but rather in the language. Functions like this still exist with modern languages, but today’s languages, with garbage-collected memory, use the concept of references to objects as the default, so there’s less of a problem (though it’s still possible to throw a function like this a curveball by specifying a certain number of arguments in the format string, but passing in fewer arguments in the var-arg list).

All of this reminds me of why I’ve sworn off programming in C as much as I can. I understand it, but I’ve grown beyond it. C is still used extensively in open source programming, from what I hear. I think in a way C is returning to its roots. It’s good for writing things like operating systems, device drivers, and virtual machines, because those things need to interact with the hardware at an intimate level, and C certainly doesn’t get in the way of that.

I programmed in C for a couple years in college, and then for another 4 years out in the work world, and I was mainly using it for writing utilities, applications, and servers. In terms of software engineering for those things it wasn’t the best language, but back in the 90s it’s what a lot of places used for a while, before moving on to C++. I think the main reason it got picked was it represented a “happy medium” between high level abstraction and execution speed. In the 80s, if you needed it to be fast, you wrote it in assembly language. In the 90s, you wrote it in C. From my experience C++ didn’t get reasonably fast on the hardware available at the time until the late 90s.

Language compilers, interpreters, and VM environments are now written in C or translated into it: Ruby, Java, .NET, and Squeak, and I’m sure there are some others. Squeak is a little unusual. The source code for its VM is in Smalltalk, but that gets translated to C and then compiled to generate a new version of the VM. In a way C is becoming the new assembly language.


12 thoughts on “Having fun playing with fire, er…C”

  1. > The reason I was told this and other such weirdness exists in the language is the language designers wanted to make it possible for the compiler to optimize for the hardware.

    That’s an oft-repeated myth, usually in connection with the autoincrement addressing modes on the PDP-11 – but Dennis Ritchie himself says otherwise[1]; according to him, they came straight from B (C’s predecessor, written by Ken Thompson) which was compiled to threaded code, with a one-to-one correspondence between high level operator and machine level subroutine.

    [1] http://www.cs.bell-labs.com/who/dmr/chist.html

  2. @gwenhwyfaer:

    Here is how the issue was stated in the book I cited in my post, “C: A Reference Manual”, 2nd Edition, published in 1987:

    They define the pre/post-increment and the pre/post-decrement operators as “side-effect producing operations”, and, “It is, of course, bad programming style to have two side effects on the same variable in the same expression, because the order of the side effects is not defined.” (my emphasis)

    The point of the argument was not to say that these pre/post operators behave differently depending on the machine, but that you do not know how the side-effects will be applied, unless of course you understand the logic applied to C code inside the compiler itself. My understanding was this was left undefined in order to allow for hardware optimization. If it was not left undefined for this reason, I think the only conclusion that one could draw was that it was left undefined because of carelessness or an imprecise spec., which is possible. Other languages have had problems with consistency due to the use of vague language in their specifications. The book does not suggest there was vagueness in the spec., just that they left some behaviors undefined, which allowed compiler implementors latitude to produce whatever object code they wanted in these areas.

  3. Pingback: Top Posts « WordPress.com

  4. Small correction :
    “There was one page I went to more than any other. It had a hierarchical list of all the operators you could use, and what precedence they had in relation to each other. The higher up in the list, the higher their precedence. Some operators were equal in precedence, in which case they’d just be evaluated from left to right.”
    Operators with the same precedence are not necessarily evaluated left to right.
    For example x = y = z is always evaluated right to left like this : x= (y = z)

  5. Actually, “the C programming language” was written by Kernighan and Ritchie themselves. It is indeed excellent.

    It appears that “C: A Reference Manual” was written by Harbison and Steele; were you thinking of that?

    And yes, I do know the page number of the operator table in K&R’s book by heart (p. 53 in my second edition).

  6. @Monkeyget:

    Yep, you’re right. I was generalizing. The other right-associative operators are the other assignment operators (+=, -=, *=, /=, %=, <<=, >>=, &=, ^=, |=), and the ternary conditional operator (?:).

    @Schipper:

    Yes, I got the title wrong. I was referring to “C: A Reference Manual”. Thanks for pointing that out. I corrected that in my post, and in my comment to gwenhwyfaer. In my book (2nd edition) the operator table is on p. 141. 🙂

  7. Good post. I read Dave Dyer’s blog post, and found myself thinking pretty much the exact same things. A note though – I think the octal notation’s backslash is now optional, and has been since either C89 or C99. Still, the gripe is pretty weak when you can use whitespace to line up numbers.

    On arrays without bounds checking at compile time, I have to wonder how Mr. Dyer would have it implemented. Is it not possible to do the following in any language with a for-loop?

    int array[4];
    for (int i =0; i 3) someFunctionThatKeepsiInBounds(&i);
         array[i] = whatever;
    }

    I think the most we could hope for is runtime bounds checking, and Unix at least segfaults when you try something like the first example. I think that’s all we could want out of a compiled language.

    One question I hope you can answer: if I have

    char i = ‘a’;
    while (i

  8. Ugh, the formatting didn’t like the greater than and less than symbols.

    That was supposed to read:

    int array[4];
    for (int i = 0; i < 4; i++)
    {
         i = 10;
         array[i] = whatever;
    }

    Or:

    int array[4];
    for (int i = 0; i < 10; i++)
    {
         if (i > 3) someFunctionThatKeepsiInBounds(&i);
         array[i] = whatever;
    }

    I think the most we could hope for is runtime bounds checking, and Unix at least segfaults when you try something like the first example. I think that’s all we could want out of a compiled language.

    One question I hope you can answer: if I have

    char i = 'a';
    while (i < 'f')
    {
         int i = 5;
         …
         i++;
    }

    How many times does this loop, and why?

  9. @Dan D.:

    Apologies for the code formatting. I just about tear my hair out trying to format code in my posts using WordPress’s editor! It’s pretty bad. It’s not as bad as in the comments though.

    Re: using a function to keep the index in-bounds, yes it’s possible, but I’ve always found it’s better to use a more tightly coupled solution:

    int array[4];
    int i; /* I don’t think you can declare the variable inside the for loop in C, can you? */
    for (i = 0; i < sizeof(array) / sizeof(array[0]); i++)
    {
    /* do stuff */
    }

    sizeof() gives the size of the array in bytes, so dividing by the size of one element gives its length, directly from the declaration. This is the ideal, since if the size of “array” changes, the computed length will change with it automatically.

    sizeof() is a built-in operator in C, like the others. It only works where the compiler can see the full declaration. If you create a buffer on the heap using malloc(), sizeof() applied to the pointer gives you the size of the pointer, not the buffer. When I used to program in C a lot I’d use sizeof() when I could to measure the size of my buffers. I wrote this macro to copy strings and prevent buffer overflows:

    #define StrCopy(dest, src) \
        strncpy(dest, src, sizeof(dest) < sizeof(src) ? sizeof(dest) : sizeof(src)); \
        dest[sizeof(dest) - 1] = '\0'

    This is safer than strcpy(), since it checks the bounds of both src and dest. I could use it like this:

    char message[12] = "Hello world";
    char copy[10];
    StrCopy(copy, message);

    copy will be “Hello wor”, but no buffer overflow.

    Re: your question, the inner “int i” is not a redefinition; it shadows the outer “char i”. Inside the block, i refers to the new int, which is set to 5, incremented, and discarded at the end of every pass. The outer char i never changes, so the loop condition stays true and the loop never terminates.

    Your use of a char variable in a loop is interesting, though. Even the ++ operator(s) will work with that. Just remove the “int i” declaration inside the loop and it runs five times, as i goes from ‘a’ through ‘e’.

    This is one exception to what I said in my post: even in K&R C the operators pay attention to the type of your variables. So if you tried to increment a char variable with + or ++, etc. it would not overflow your variable by treating it like an int. The mathematical operators will work with most of the native types in C: char, short, int, long, signed, unsigned, * (pointer).

  10. @Dan D.:

    I deleted one of your comments here, because I found the other one you wrote that better explained what you were talking about. It got caught in this blog’s spam filter for some reason. I didn’t realize it until now.

  11. Pingback: The 150th post « Tekkie
