SICP: What is meant by “data”?

I want to discuss what’s introduced in Section 2.1.3 of SICP, because I think it’s a really important concept.

When we think of data in typical CS and IT settings, we think of strings and numbers, tables, column names (or a reference from an ordinal value), XML, etc. We think of things like “Item #63A955F006-006”, “Customer name: Barbara Owens”, “Invoice total: $2,336.00”.

How do we access that data? We typically use SQL, either directly or through stored procedures, using relations to serialize/deserialize data to/from a persistence engine that uses indexing (an RDBMS). This returns an organized series of strings and numbers, which are then typically organized into objects for a time by the programmer, or a framework. And then data is input into them, and/or extracted from them to display. We send data over a network, or put it into the persistence engine. Data goes back and forth between containers, each of which use different systems of access, but the data is always considered to be “dead” numbers and strings.

Section 2.1.2 introduces the idea of data abstraction, and I’m sure CS students have long been taught this concept, that data abstraction leads to more maintainable programs because you “program to the interface, not to the object”. This way you can change the underlying implementation without affecting too much how the rest of the program runs. SICP makes the same point.

In 2.1.2 it uses a simple abstraction, a rational number (a fraction), as an example. It defined 3 functions: make-rat, numer, and denom. Make-rat would create a rational number abstraction out of two numbers: a numerator and denominator. Numer and denom return the numerator and denominator from the abstraction, respectively. It calls make-rat a “constructor”, and numer and denom “selectors”.

In 2.1.3 it questions our notion of “data”, though it starts from a higher level than most people would. It asserts that data can be thought of as this constructor and set of selectors, so long as a governing principle is applied which says a relationship must exist between the selectors such that the abstraction operates the same way as the actual concept would. So if you use (in pseudo-code):

x = (make-rat n d)

then (numer x) / (denom x) must always produce the same result as n / d. The book doesn’t mention this at this point, but in order to be really useful, the abstraction would need to work the same way that a rational number would when arithmetic operators are applied to it.

So rather than creating a decimal out of a fraction, as would be the case in most languages:

x = 1 / 4 (which equals 0.25)

You can instead say:

x = (make-rat 1 4)

and this representation does not have to be a decimal, just the ratio 1 over 4, but it can be used in the same way arithmetically as the decimal value. In other words, it is data!

In Section 2.1.2, the abstraction used for a rational number was a “pair”, a cons structure (created by executing (cons a b)). In 2.1.3 it shows that the same thing that was done to create a rational number data abstraction can be applied to the concept of a “pair”:

This point of view can serve to define not only “high-level” data objects, such as rational numbers, but lower-level objects as well. Consider the notion of a pair, which we used in order to define our rational numbers. We never actually said what a pair was, only that the language supplied procedures cons, car, and cdr for operating on pairs. But the only thing we need to know about these three operations is that if we glue two objects together using cons we can retrieve the objects using car and cdr. That is, the operations satisfy the condition that, for any objects x and y, if z is (cons x y) then (car z) is x and (cdr z) is y. Indeed, we mentioned that these three procedures are included as primitives in our language. However, any triple of procedures that satisfies the above condition can be used as the basis for implementing pairs. This point is illustrated strikingly by the fact that we could implement cons, car, and cdr without using any data structures at all but only using procedures. Here are the definitions:

(define (cons x y)
  (define (dispatch m)
    (cond ((= m 0) x)
          ((= m 1) y)
          (else (error "Argument not 0 or 1 -- CONS" m))))
  dispatch) 

(define (car z) (z 0)) 

(define (cdr z) (z 1))

This use of procedures corresponds to nothing like our intuitive notion of what data should be. Nevertheless, all we need to do to show that this is a valid way to represent pairs is to verify that these procedures satisfy the condition given above.

This example also demonstrates that the ability to manipulate procedures as objects automatically provides the ability to represent compound data. This may seem a curiosity now, but procedural representations of data will play a central role in our programming repertoire. This style of programming is often called message passing.

So, just to make it clear, what “cons” does is construct a procedure that “cons” calls “dispatch” (I say it this way, because “dispatch” is defined in cons’s scope), and return that procedure to the caller of “cons”. The procedure that is returned can then be used as data since the above “car” and “cdr” procedures produce the same result as the “car” and “cdr” procedures we’ve been using in the Scheme language.

SICP mentions in this section that message passing will be used further as a basic tool in Chapter 3 for more advanced concepts. Note that I have talked about message passing before. It is central to the power of object-oriented programming.

Note as well that the text implies an analog between procedures, in a Lisp language, and objects. The above example is perhaps a bad way to write code to get that across. In the corresponding lecture for this section (Lecture 2b, “Compound Data”) Abelson used a lambda ( (lambda (m) (cond …)) ) instead of “(define (dispatch m) …)”. What they meant to get across is that just as you can use “cons” multiple times in Scheme to create as many pairs as you want, you can use this rendition of “cons” to do the same. Instead of creating multiple blobs of memory which just contain symbols and pointers (as pairs are traditionally understood), you’re creating multiple instances of the same procedure, each one with the different pair values that were assigned embedded inside them. These procedure instances exist as their own entities. There is not just one procedure established in memory called “dispatch”, which is called with pieces of memory to process, or anything like that. These procedure instances are entities “floating” in memory, and are activated when parameters are passed to them (as the above “car” and “cdr” do). These procedures are likewise garbage collected when nothing is referring to them anymore (they pass out of existence).

In the above example, “cons” is the constructor (of a procedure/object called “dispatch”), and “car” and “cdr” are the selectors. What we have here is a primitive form of object-orientation, similar to what is known in the Smalltalk world, with the mechanics of method dispatch exposed. The inference from this is that objects can also be data.

What SICP begins to introduce here is a blurring of the lines between code and data, though at this point it only goes one way, “Code is data.” Perhaps at some point down the road it will get to sophisticated notions of the inverse: “Data is code”. The book got into this concept in the first chapter just a little, but it was really simple stuff.

Getting back to our traditional notions of data, I hope this will begin to show how limited those notions are. If procedures and objects can be data, why can’t we transmit, receive, persist, and search for them just as we can with “dead” data (strings and numbers)? In our traditional notions of CS and IT (assuming they’re focused on using OOP) we use classes to define methods for dealing with data, and elemental interactions with other computing functions, such as data storage, display, interaction between users and other computers, etc. We have a clear notion of separating code (functionality) from data, with the exception of objects that get created while a program is running. In that case we put a temporary “persona” over the real data to make it easier to deal with, but the “persona” is removed when data is stored and/or we exit the program. What we in the whole field neglect is that objects can be data, but they can only really be such if we can do everything with them that we now do with the basic “dead” forms of data.

There is at least one solution out there that is designed for enterprises, and which provides inherent object persistence and search capabilities, called Gemstone. It’s been around for 20+ years. The thing is most people in IT have never heard of it, and would probably find the concept too scary and/or strange to accept. From what I’ve read, it sounds like a good concept. I think such a system would make application design, and even IT system design if it was done right, a lot more straightforward, at least for systems where OO architecture is a good fit. Disclaimer: I do not work for Gemstone, and I’m not attempting to advertise for them.

As I think about this more advanced concept, I can see some of where the language designers of Smalltalk were going with it. In it, objects and their classes, containing information, can be persisted and retrieved in the system very easily. A weakness I notice now in the language is there’s no representational way to create instance-specific functionality, as exists in the above Scheme example. Note that the values that are “consed” are inlined in “dispatch”. There’s a way to put instance-specific code into an object in Smalltalk, but it involves creating a closure–an anonymous object–and you would essentially have to copy the code for “dispatch”. You would have to treat Smalltalk like a functional language to get this ability. The reason this interests me is I remember Alan Kay saying once that he doesn’t like the idea of state being stored inside of an object. Well, the above Scheme example is a good one for demonstrating “stateless OOP”.

Edit 7-9-2010: After I posted this article I got the idea to kind of answer my own question, because it felt insensitive of me to just leave it hanging out there when there are some answers that people already know. The question was, “Why don’t we have systems of data that allow us to store, retrieve, search, and transmit procedures, objects, etc. as data?” I’ve addressed this issue before from a different perspective. Alan Kay briefly got into it in one of his speeches that’s been on the internet. Now I’m expanding on it.

The question was somewhat rhetorical to begin with. It’s become apparent to me as I’ve looked at my programming experience from a different perspective that a few major things have prevented it from being resolved. One is we view computers as devices for automation, not as a new medium. So the idea of computers adding meta-knowledge to information, and that this meta-knowledge would be valuable to the user is foreign, though slowly it’s becoming less so with time, particularly in the consumer space. What’s missing is a confidence that end users will be capable of, and want to, manipulate and modify that meta-knowledge. A second reason is that each language has had its own internal modeling system, and for quite a while language designers have gotten away with this, because most people weren’t paying attention to that. This is beginning to change, as we can see in the .Net and Java worlds, where a common modeling system for languages is preferred. A third reason is that with the introduction of the internet, openness to programming is seen as a security risk. This is because of a) bad system design, and b) in the name of trying to be “all things to all people” we have systems that try to compromise, introducing some risk, to promote backward compatibility with older models of interaction that became less secure when wide scale networking was introduced, and to cater to programmers’ desires for latitude and features so that the platform would be more widely accepted, so the thinking goes.

No one’s figured out a way to robustly allow any one modeling system to interact or integrate with other modeling systems. There have been solutions developed to allow different modeling systems to interact, either via. XML using networking protocols, or by developing other languages on top of a runtime. I’ve seen efforts to transmit objects between two specific runtimes: .Net and the JVM, but this area of research is not well developed. Each modeling system creates its own “walled garden”, usually with some sort of “escape hatch” to allow a program written in one system to interact with another (usually using a C interface or some FFI (foreign functionality interface)) which isn’t that satisfactory. It’s become a bit like what used to be a plethora of word processing or spreadsheet document file formats, each incompatible with the other.

Programmers find file format standards easier to deal with, because it involves loading binary data into a data structure, and/or parsing a text file into such a structure, but this limits what can be done with the information that’s in a document, because it takes away control from the recipient in how the information can be used and presented. As I’ve discussed earlier, even our notions of “document” are limiting. Even though it could serve a greater purpose, it seems not much time has been invested in trying to make different modeling systems seamless between each other. I’m not saying it’s easy, but it would help unleash the power of computing.

Code is encoded knowledge. Knowledge should be fungible such that if you want to take advantage of it in a modeling system that is different from the one it was originally encoded in, it should still be accessible through some structure that is compatible with the target system. I understand that if the modeling system is low level, or more primitive, the result is not going to look pretty, but it could still be accessed through some interface. What I think should be a goal is a system that makes such intermingling easy to create, rather than trying to create “runtime adapters”. This would involve understanding what makes a modeling system possible, and focusing on creating a “system for modeling systems”. You can see people striving for this in a way with the creation of alternative languages that run on the JVM, but which have features that are difficult to include with the Java language itself. Java is just used as a base layer upon which more advanced modeling systems can be built. This has happened with C as well, but Java at least allows (though the support is minimal) a relatively more dynamic environment. What I’m talking about would be in a similar vein, except it would allow for more “machine models” to be created, as different modeling systems expect to run on different machine architectures, regardless of whether they’re real or virtual.

I think the reason for this complexity is, as Alan Kay has said, we haven’t figured out what computing is yet. So what language designers create reduces everything down to one or more aspects of computing, which is embedded in the design of languages.

So my question is really a challenge for future reference.

— Mark Miller, https://tekkie.wordpress.com

This is one of a series of “bread crumb” articles I’ve written. To see more like this, go to the Bread Crumbs page.