2013-01-29 12:34:15 (Originally published at: -)

Thoughts about a new system programming language

I started writing this and extending the list as I experience frustration working with C code.

The C language was developed around 1970, when the situation was different from today: programs were written for a single architecture and a single system. Portability wasn't an important concern then. There were also severe memory and storage constraints. For example, B, the predecessor of C, was created on a computer that had only 8 kB of memory.

Nowadays things have changed. Disk space and memory are no longer limiting factors for compilation, and our processors are very fast. Programs must run on multiple architectures and operating systems, must be secure (exploit free), and must support computers with multiple cores. The language must also be simple enough to allow more powerful IDEs.

After some research it seems Go solves many of the mentioned problems. So my language probably remains a pet language.

(This writing assumes the reader is familiar with C and C++.)

Better grammar

Preferably an LL one. A pascalish syntax would be nice: it's quite well known and can be parsed with recursive descent (so basically anyone can write a parser for it by hand). It would also make it easier to build preprocessing/refactoring tools and IDEs that complete the code for you when only one syntactically correct choice exists, making typing the code faster. Error recovery and meaningful diagnostic messages also become easier.

Basic data types should have predefined size

The C and C++ standards are quite loose when defining the sizes of the data types: a long can be 32 or 64 bits, an int can be 16 or 32 bits, and even a char can be a 32-bit number, depending on the implementation.

In practice, for programs written by one person for a single platform, this doesn't matter. But nowadays a lot of programs must be compiled on multiple platforms, and there are file formats and network protocols where the size and encoding of the data are specified.

Though the C standard defines the uintXX_t types, porting to the Microsoft platform is still a pain, as they don't ship a C99 compiler.

I would name my types simply int8, int16, etc.; uint8, uint16, etc.; float32, float64, etc., but not short, long, medium or XXXL... If some weird platform has integers of a different size, other types, such as int24, can be introduced. There would also be two special types: integer and number. The former would be a generic integer, the latter a generic floating point number; these would be there for the "I don't care" situations.

Cleanup section

Functions may have an optional cleanup section, which is executed before they return.

	function blahblah() : none
	{
	}
	cleanup
	{
		/*Do your cleanup here. */
	}

In C the goto end; cleanup idiom is quite common.

In C++ we have RAII, so we do not need to write cleanup code. But the hidden, implicit destructor calls can confuse the reader of the code. C has the advantage that what you read is what will happen: if something is not there, it won't happen. Writing the cleanup code should be a reflex, like writing the closing brace immediately after the opening one.

Safe buffers

Buffer overflow is the worst possible bug. When it just crashes your program, you are lucky; but when it slips through the QA process and the buggy software is released, a potential attacker may forge an input that exploits the vulnerability and overwrites return addresses or function pointers, which can lead to arbitrary code execution. Indeed, nowadays file-infecting viruses are not as mainstream as they once were; instead, black hat crackers seek vulnerabilities in products that are in mainstream use. When they find one, they won't report the bug; instead they create exploit kits and sell them for $100,000 to anyone who needs a botnet. These are the so-called zero-day exploits. Descriptions of the updates my Linux box gets often read "possible arbitrary code execution via...". Buffer overflow bugs are everywhere.

I think there is no valid reason to index an array or a buffer outside its range. To avoid this problem, all buffers must be range checked.

The solution is that the new language should have a buffer type, which is not just a pointer, but a pointer and a length.

So for a stack allocation the size will be known:

	var foo: array[50] of integer;

For a heap allocation the size is recorded when the buffer is allocated:

	var bar: array[] of integer;
	bar := new array[123] of integer;

When a sub-array is needed, there would be a slice operator (or a system function) that sets the beginning pointer and the size accordingly. Its usage would be slice(array, beginning_index, length):

	var x: array[100] of integer;
	var part: array[] of integer;
	part := slice(x, 2, 5);

This creates an array descriptor whose element pointer points to the element at index 2 and whose length is 5. The origins of slices can be tracked, as the slice function knows the source array.

Since the length is stored next to the pointer, index accesses can be checked at run time, terminating the program on an out-of-range index. The optimizer can remove redundant checks, e.g. when the programmer has already done them. This checking feature can be turned off (in that case what we get is no better than what we would get from C).

Pointer arithmetic is about traversing arrays and buffers (again, I don't see any reason to use pointer arithmetic for anything else). It can be checked too. But instead of calling them pointers, they will be iterators:

	var arr: array[10] of integer;
	var it: iterator to integer;
	it := mk_iterator(arr, 2);
	next(it); //< to move the iterator forward
	prev(it); //< to move it backward

Three things are stored in an iterator: the pointer to the current element, the index of the element in the parent array, and the element count. Moving the iterator modifies both the pointer and the index, so running off the end of the array can terminate the program. This kind of check is difficult to optimize away, so it could be turned off if you want your nanoseconds back.

No implicit conversion

Data types are (almost) never converted implicitly; conversions must be requested explicitly.

In portable C code one cannot simply fread a struct from a file or a raw memory buffer: even if the struct is packed correctly, the code breaks when the target architecture uses a different endianness. So one ends up filling the struct fields one by one and byte-swapping where necessary anyway, and then no packing is needed anymore.

out and in/out arguments

I like C#'s ref and out parameters: you don't need to explicitly use pointers in the function body. I would adopt this idea.

Initialization required

C# requires variables to be initialized before use. This can easily be enforced for structs and primitive types (you need to give them a starting value at some point before using them). Things are not so simple with arrays, though.

Type arguments

The language should support type arguments to functions and structures. It's like templates in C++, but it won't be a fully fledged metaprogramming feature.

This feature is useful when making general purpose data structures, like trees.

Allocators

In a system there are many types of allocators. There should be a way to tell the compiler which function allocates and which one deallocates, so that in a debug build allocations and deallocations can be tracked and leaks can be detected without rolling your own allocation tracker. This can be generalized to all resource allocation (like files).

No namespaces or modules

C doesn't have them either. We already have a way to organize our sources: the file system. Whenever a system supports namespaces or modules, the namespace name and the module name are usually the same. Namespaces are also used to solve the problem of name collisions. But I don't think that's the best solution: if your program has two functions with the same name, chances are they do the same thing, or you have chosen a wrong name and need to rename them.

The lack of namespaces makes symbol names unique, making it easier to search the source code; it would also make the code a bit easier to understand. If you see a function called foo, you can be sure it's the only foo in the code, and you don't need to check which namespaces are in the current scope.

(Also, I'm lazy; I don't want to complicate my pet language.)

Set based build system

Source files cannot declare requirements with using, include, etc. After parsing the source code, the referenced function names are stored as implicit, incomplete function declarations. When the parsing of all source files is finished, missing functions and their references can be identified. That's nothing new; even C can work this way if we avoid includes, as it supports implicit declarations.

Using source sets is more flexible than building conditional macro hells in the source code. Source sets can be united, intersected and differenced like any mathematical set. After the set calculation, the final set of sources is compiled. This would make adding new platforms and turning features on or off easier. Let's see how I imagine it.

Trivial source set

It would look like this:

    setconfig
    {"feature1.cl1", "feature2.cl1", "feature3.cl1"}

Between curly braces we simply enumerate the source files that will be compiled.

Set operations

Union, difference and intersection are available as operators:

    setconfig
    let $union = {"X"} + {"Y"}
    let $diff = {"X"} - {"Y"}
    let $intersect = {"X"} * {"Y"}

You can also see the syntax for defining aliases. These are immutable.

Multiple sets

Multiple targets can be built with a single command if multiple sets are specified:

    setconfig
    {"foo.cl1"}, {"bar.cl1"}

The result of the compilation is two programs compiled from those two source files.

Cartesian product

With foreach it's possible to make up source file names from the Cartesian product of two sets:

    setconfig
    let $XSet = {"A", "B"}
    let $YSet = {"X", "Y"}
    { foreach $x in $XSet, $y in $YSet => "$x$y" }

The foreach makes a comma separated list of strings, so it would produce this set: {"AX", "AY", "BX", "BY"}. You can also see how variable substitution works: like in PHP or shell scripts.

A possible real world example

Let's see how the build configuration of a possible multiplatform piece of software would look:

    setconfig
    
    let $freeFeatures = {"feature1.cl1", "feature2.cl1", "feature3.cl1"}
    let $proFeatures = $freeFeatures + {"features4.cl1", "features5.cl1"}
    
    let $architectures = {"-arch i386", "-arch arm"}
    let $products = {$freeFeatures, $proFeatures}
    let $platforms = {"windows", "unix", "linux", "mac"}
    let $windowManagers = {"gdm", "kde", "xfce", "windows", "osx"}
    let $platformSpecifics = {"networking", "io", "sound", "directgraphics"}
    
    foreach 
        $arch in $architectures,
        $prod in $products, 
        $plat in $platforms, 
        $wm in $windowManagers 
    => 
        $prod + 
        {"platforms/$plat/$wm.cl1"} + 
        {
            foreach 
                $ps in $platformSpecifics 
            => 
                "platforms/$plat/$ps.cl1"
        } + 
        {$arch}

This creates source sets for all combinations of platforms. As you can see, compiler options can be passed to the build process too.

Compilation process

First the compiler validates the source sets: the provided source files must exist, and the combination of the passed compiler options must be valid. Invalid sets are dropped.

Referring to set elements

It's possible to refer to the elements of sets by using array-like indexing. This way we can control the platforms better. For example, pairing operating systems with window managers:

    setconfig
    
    let $freeFeatures = {"feature1.cl1", "feature2.cl1", "feature3.cl1"}
    let $proFeatures = $freeFeatures + {"features4.cl1", "features5.cl1"}
    
    let $architectures = {"-arch i386", "-arch arm"}
    let $products = {$freeFeatures, $proFeatures}
    let $unixWms = {"gdm", "kde", "xfce"}
    let $platforms = 
    {
        {"windows", "windows"}, 
        {"unix", $unixWms}, 
        {"linux", $unixWms},
        { "mac", "osx"}
    }
    let $platformSpecifics = {"networking", "io", "sound", "directgraphics"}
    
    foreach 
        $arch in $architectures,
        $prod in $products, 
        $plat in $platforms
    => 
        $prod + 
        {"platforms/$plat[0]/$plat[1].cl1"} + 
        {
            foreach 
                $ps in $platformSpecifics 
            => 
                "platforms/$plat[0]/$ps.cl1"
        } + 
        {$arch}

Preprocessor not needed anymore

The preprocessor is powerful, but I think this new language wouldn't need it.

Foreach and forall loop for concurrency

I think the question of concurrency (I'm thinking of multi-core machines here) should be in the same bucket as GPU programming and instruction pipelining: when you write an if in C, you usually don't worry about the branch predictor. When you write a 3D game, you usually don't program the GPU directly, but use an abstraction layer over it (like OpenGL or DirectX). I also don't think you regularly roll Duff's devices instead of loops; the compiler does it for you when it decides it's worthwhile. The same should be the case for multi-core CPUs: let the compiler figure out whether the control should fork into 4 threads or not.

The implementation would be in the context of foreach loops.

    foreach (index in array)
    {
        /*blah blah*/
    }

This is the foreach loop everyone knows. This loop has the semantics that index starts at 0 and runs to N-1, one by one. The order is important; this can be used, for example, for dumping out the contents of an array.

    forall (index in array)
    {
        /*blah blah*/
    }

The syntax is basically the same, but here we ask the compiler not to worry about the order of traversal. When the order of traversal does not matter, it doesn't matter if all iterations run in parallel. So the compiler can generate threaded code if it wants.

Do-while-next and while-next loops

These are similar to C's for loop, but a bit more intuitive:

    while (condition)
    {
        statements
    }
    next
    {
        statements
    }

and:

    do
    {
        statements
    }
    next
    {
        statements
    }
    while (condition)

The next block executes after the main loop block. The continue command would jump to the next block. The next block can be omitted.

v2.0 features

I will be happy if my compiler is ever able to churn out machine code. But I have further features to add later.
