The PDP-11

Your textbook contains notes about machine languages in general. Here are some notes specifically about the PDP-11; they are basically the content of my lectures on the subject. I hope that these notes, plus the PDP-11 instruction reference handout, will serve as a substitute for textbook coverage of the PDP-11 specifically. Machine languages in general are, I believe, adequately covered by your textbook.


"PDP" refers to a line of computers made by Digital Equipment Corporation (DEC), over nearly two decades.

I used the PDP-11 as my example CPU for several reasons:

The PDP-11 has 16-bit words, 8-bit bytes; it's byte-addressable. It's little-endian.

It has 8 registers you can refer to in your assembly-language program, but unusually, some of them are specialized: R0 through R5 are true GPRs, R6 is the stack pointer (will be discussed later), and R7 is the PC.

Since its addresses are 16 bits, it can address 64K bytes.

Because the words are only 16 bits, there is some difficulty in designing instructions encompassing all of the desired possibilities, especially with respect to addressing. But the 11 has a very comprehensive addressing scheme, with indirect addressing, indexing, and all sorts of stuff. To accomplish this, we have the following fairly nifty scheme which fits address information into six bits, with some auxiliary data where needed.

These six bits are divided as follows: three bits for the addressing mode, then three bits for the register number.

Since we have eight registers, we can specify one with three bits.

Altogether, these six bits specify an "effective address" (EA). So in the case of an ADD instruction, with two operands, we get two effective addresses, and the meaning of the instruction will be EA1 <- [EA1]+[EA2].

The mode bits work as follows:
bits  octal  name           assembly syntax  EA and other semantics
000   0      register       Ri               Ri
010   2      autoincrement  (Ri)+            [Ri], then Ri<-[Ri]+2 or 1
100   4      autodecrement  -(Ri)            first Ri<-[Ri]-2 or 1, then EA=new [Ri]
110   6      index          n(Ri)            [[R7]]+[Ri], then inc PC by 2
                                             (n follows in next memory word)

The third mode bit (always listed as zero above) is "indirect". If it is 1, it adds one extra indirection. It is indicated in the assembly language by adding a "@".

The '@' corresponds very closely to the unary operator "*" in C, for those of you who know C.

The '@' also corresponds quite closely to our square brackets in the register transfer notation. E.g. the difference between "R0 <- [R0] + [R1]" and "R0 <- [R0] + [[R1]]" is the same as the difference between "ADD R1, R0" and "ADD @R1, R0".
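To make the six-bit layout concrete, here is a minimal Python sketch that splits an operand field into its mode and register parts. The function name and the dictionary are my own inventions for illustration; they are not part of any PDP-11 tool.

```python
# Hypothetical helper: names are illustrative only.
MODE_NAMES = {
    0: "register",
    2: "autoincrement",
    4: "autodecrement",
    6: "index",
}

def decode_operand(field):
    """Split a six-bit operand field into (mode name, indirect?, register)."""
    mode = (field >> 3) & 0o7       # upper three bits: the mode
    reg = field & 0o7               # lower three bits: the register number
    indirect = bool(mode & 1)       # third mode bit adds one extra indirection
    name = MODE_NAMES[mode & 0o6]   # mode with the indirect bit masked off
    return name, indirect, reg

print(decode_operand(0o11))  # @R1    -> ('register', True, 1)
print(decode_operand(0o22))  # (R2)+  -> ('autoincrement', False, 2)
print(decode_operand(0o62))  # n(R2)  -> ('index', False, 2)
```

Writing the field in octal is what makes this easy to read: the first digit is the mode and the second is the register.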

An alternate syntax for '@' is to add parentheses. So "ADD @R1, R0" can be written as "ADD (R1), R0" -- the assembler will generate identical machine code for these two symbolic versions of the instruction. You're likely to shy away from this syntax given that those parentheses would mean nothing in a high-level language, and they make such a large difference in the assembly language syntax. Indeed I didn't myself use that syntax back in the days when I did some assembly language programming; I preferred the '@'.

However, this is worth noting because it gives us the assembly syntax for the autoincrement and autodecrement modes. The particular syntax used to indicate autoincrement and autodecrement addressing modes is not crucial; it could just as easily be extra keywords, or funny symbols from your keyboard; but suggestive notation can be helpful, if you understand the idea behind the choice of notation. The '+' and '-' in that syntax indicate a postincrement or a predecrement, respectively, just as in Java or C. (There is no preincrement or postdecrement on the PDP-11.) Then the parentheses indicate an extra level of indirection.

So in our earlier program, where we had:

ADD @R2, R0
ADDI 2, R2    (meaning "add immediate")
we could use these nifty addressing modes to write, instead:
ADD (R2)+, R0
or if we had something like R0:=A[R2]:
ADDI A, R2     (actually this ruins R2)
MOV @R2, R0
(the usual name for the instruction on the 11 is actually "MOV", not "MOVE")
then we could use these nifty addressing modes to write, instead:
MOV A(R2), R0
with no damage to R2.
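The two shortcuts above can be modelled in a few lines of Python. This is only a sketch under my own assumptions: memory is a dictionary from addresses to words, the addresses and data values are made up, and the function names are hypothetical.

```python
# Toy model: memory maps (octal) addresses to words; contents are made up.
memory = {0o1000: 5, 0o1002: 7}
regs = {"R0": 0, "R2": 0o1000}

def autoincrement_word(reg):
    """EA for (Ri)+ on a word operand: use [Ri], then add 2 to Ri."""
    ea = regs[reg]
    regs[reg] += 2
    return ea

def index(n, reg):
    """EA for n(Ri): n + [Ri]; Ri itself is left alone."""
    return n + regs[reg]

# ADD (R2)+, R0   does the work of   ADD @R2, R0 / ADDI 2, R2
regs["R0"] += memory[autoincrement_word("R2")]
after_add = (regs["R0"], regs["R2"])

# MOV A(R2), R0   with A = 1000 (octal) and R2 holding the byte offset 2
regs["R2"] = 2
regs["R0"] = memory[index(0o1000, "R2")]

print(after_add)                 # (5, 514), i.e. R2 is now 1002 octal
print(regs["R0"], regs["R2"])    # 7 2
```

Note that the index version leaves R2 holding 2, which is the "no damage to R2" point.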

 
Now, given all this, how do we get our various addressing modes?

It's all based on the fact that the PC is R7. So if we write:

MOV (R7)+, R4
32    (i.e. the number 32 goes here, in the next memory location)
(next instruction)
then we see that the contents of R7 will be the address of that 32 (the PC gets incremented on the PDP-11 after the fetch but before the operand decoding), so the operand will be 32. Furthermore, we see that the autoincrement aspect of the address will cause the PC to be incremented AGAIN, so the execution will skip over that 32. It works out very nicely and is one of the coolest aspects of the design of the PDP-11.
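Here is a small Python sketch of that double increment, assuming the instruction sits at location 500 (an address I made up) and treating the 32 as octal. The op word 012704 is MOV with source mode 2, register 7 and destination mode 0, register 4.

```python
# Toy memory; location 500 (octal) is an arbitrary choice for this sketch.
memory = {
    0o500: 0o012704,  # MOV (R7)+, R4
    0o502: 0o32,      # the operand itself
    0o504: 0o000000,  # next instruction
}
regs = [0] * 8
regs[7] = 0o500

# Instruction fetch: read the op word, then the PC moves on.
op_word = memory[regs[7]]
regs[7] += 2                      # PC is now 502, pointing at the 32

# Source operand (R7)+ : EA is the current PC, then the PC is bumped again.
ea = regs[7]
regs[7] += 2                      # PC is now 504, past the 32

regs[4] = memory[ea]              # destination is register mode, R4

print(f"R4 = {regs[4]:o}, PC = {regs[7]:o}")  # R4 = 32, PC = 504
```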

Special addressing with register 7 (the PC):
mode bits  octal  name               assembly syntax  meaning
010        2      immediate          #n               EA=[R7], then R7<-[R7]+2
011        3      absolute           @#n              EA=[[R7]], then R7<-[R7]+2
110        6      relative           n                X=[[R7]], then inc R7, then EA=X+[R7]
111        7      relative indirect  @n               X=[[R7]], then inc R7, then EA=[X+[R7]]

Let us illustrate the idea of relative addressing with this example:

loc  instruction
100  MOV 6(R7), R2
102  [that 6 goes at loc 102] ("offset")
104  ...
106  ...
110  HALT
112  <value>
Actually, we would write "MOV 112, R2" for location 100 and the assembler would convert. This is one of the features of the PDP-11 assembler. Counting is for computers.

The value "6" at location 102 is an "offset", the distance from location 104 (after the 6 is fetched) to the desired target location.

(Isn't 112-104 equal to 8, not 6? No... because this is in base eight! It takes some getting used to!)

You can also use negative offsets, to represent relative addresses which are "upwards" from the current location.
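The offset arithmetic is easy to check in Python, which accepts octal literals with a 0o prefix. This is just a calculator-style sketch; the function name is mine, and the second call uses a made-up target to show a negative offset.

```python
def relative_offset(offset_word_loc, target):
    """Offset word for relative mode, as a 16-bit two's-complement value."""
    pc_after_fetch = offset_word_loc + 2   # PC has moved past the offset word
    return (target - pc_after_fetch) & 0xFFFF

# The example above: offset word at 102, target at 112.
print(oct(relative_offset(0o102, 0o112)))        # 0o6

# A hypothetical "upwards" target at 100 needs a negative offset (-4).
print(f"{relative_offset(0o102, 0o100):06o}")    # 177774
```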

Code with only relative addresses in it can be loaded at any location in memory and it will run correctly there.


We also have "pseudo-ops", which are instructions to the assembler. This distinction between ops and pseudo-ops doesn't correspond to anything in high-level languages, unless we wanted to say that everything in a high-level language is a pseudo-op.

Ops generate code. Pseudo-ops don't always. An example of a pseudo-op is "C = 2200", which says that the name C means the value 2200. Another kind of pseudo-op says where you want the following code to be. If you want your assembled program to go into memory starting at location 300, you can say that. This affects the calculations of all the addresses in your program. You can also say to leave a block of space, e.g. for data.

We also need some way to say "put the number blah here", which traditionally on the PDP-11 is called ".WORD", with the dot in front. An assembler could let you just write the number by itself, but that's not the syntax for the PDP-11 assembler, nor for most others.

Now, we're going to be writing down numbers a lot, in assembly-language programming. We've been talking about how much easier octal numbers are to work with than decimal numbers. In the traditional PDP-11 assembly language, numbers are considered to be octal by default, unless you put a decimal point after them, e.g. "MOV #2200., R3" puts the base ten number 2200 into R3 (i.e. binary 100010011000, or octal 4230), rather than the octal number 2200. This is not a floating-point number! It's an integer, but it's two times a thousand plus two times a hundred, rather than two times eight to the third plus two times eight squared.
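If you want to double-check conversions like these, Python works as a quick calculator. Nothing here is PDP-11-specific; the variable names are just for this illustration.

```python
octal_2200 = int("2200", 8)      # what "2200" means to the assembler
decimal_2200 = int("2200", 10)   # what "2200." means to the assembler

print(octal_2200)                # 1152
print(oct(decimal_2200))         # 0o4230
print(bin(decimal_2200))         # 0b100010011000
```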

Here's a sample assembly language program for the 11.

   C = 2200
   .ORG 1200
A: .WORD 3
B: .WORD 4
   .ORG 1300
   MOV A, R4
   ADD B, R4
   MOV R4, C
   HALT
   .END

The ".ORG 1200" is a pseudo-op saying we want to start putting down code at location 1200 (octal). That is, that "3" is at location 1200, and that "4" is at location 1202 (in the next word). So the symbolic constant "A" will mean 1200, and the symbolic constant "B" will mean 1202. Later, when we say "MOV A, R4", this is exactly the same as saying "MOV 1200, R4".

The ".ORG 1300" then means that we skip 60 bytes (yes 60! that is, 60 base ten, which is four less than 100 base eight...).

The subsequent MOV, ADD, and MOV instructions are each two-word instructions, because the op with its addressing mode data is the first word, and then they each contain one relative mode address which needs an offset in the following word. If any of them were to contain two relative mode addresses, e.g. "MOV A, B", they would end up being three words altogether.

So the MOV R4, C will deposit the sum of the two numbers at locations A and B, i.e. 7, into location 2200. Not when assembling! When executed.

HALT stops the CPU until you press some buttons on the console, usually to load a new program and then to execute it. On a modern computer system with a substantial operating system, or indeed even on the PDP-11 in its normal use, this would instead be an operating system call which caused the operating system to terminate this process. But that's material for CSC 209, and to avoid getting into all this here, we'll end our programs with HALT.

So what's ".END"? First of all, it's a pseudo-op -- on the 11, pseudo-ops begin with a dot, and mnemonics (symbolic codes for ops, such as MOV) don't.

".END" indicates the end of the assembly-language program. It doesn't cause any machine-language code to be emitted. The end of the assembly-language program is a different matter than the HALT statement. Of course, in a high-level language, the end of the program causes the appropriate machine-language statements to be emitted. Not so in assembly language -- you have to say everything. It's not entirely clear why the end of the file from which your assembly language program is being read can't suffice to indicate the end of the assembly language program. But people have been very slow to acknowledge the end-of-file concept, and assembly languages are old. Interestingly, FORTRAN also has separate STOP and END statements, whereas a programming language such as C or Java (or Turing) has neither, just the end of the file. One thing is, in assembly languages, sometimes there are other parameters to the END pseudo-op, such as a label representing where you think the program should be begun. This is one reason to have it, although this is hardly the most obvious syntax for such a thing. But whether it's a good idea or not, it's there, it's traditional, and assembly languages often have such a thing.

Altogether, this is also something worth ignoring in this course.

There are many possible ways to design assembly languages, but there's always a general principle of having approximately one op per line.

 
We can hand-assemble this program. For example, "MOV A, R4" becomes the two words 016704 and 177674. Remember when looking at these six-digit octal numbers that the first digit represents only one bit, so it is always zero or one. Each of the other digits represents three bits, for a total of 16 bits. Thus we see that the left two octal digits of a two-operand instruction are the opcode, because they are the most significant four bits. In this case, 01 (in octal).

Page 3 of the PDP-11 reference handout says that "MOV{B} is 0|1". This means that MOVB is 11 and MOV (plain) is 01. This meaning of 0|1 is introduced at the end of the first section on page 1 as a notational shorthand.

The next three bits, i.e. the next digit, are the addressing mode, and the following three bits, i.e. the digit after that, are the register number. That is, these six bits comprise the "src" address, in one of the standard addressing modes from page 1. We want the first operand to use the "relative" addressing mode, which is the default in our assembly language syntax, so we choose 6 for the mode and 7 for the register. This 6 comes from the second table above, which can be used whenever we use register 7, which is the PC.

Then the next word should have the offset.

Finally, for the second operand, we have register mode, which is 0, and register 4. So, the op word is 016704.
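The packing of the op word can be written out as shifts, which is a handy way to check a hand assembly. This is a sketch only; the function name and parameter names are mine.

```python
def two_operand_word(opcode, src_mode, src_reg, dst_mode, dst_reg):
    """Pack a two-operand instruction: 4-bit opcode, then two 6-bit addresses."""
    return (
        (opcode << 12)
        | (src_mode << 9) | (src_reg << 6)
        | (dst_mode << 3) | dst_reg
    )

# MOV A, R4: opcode 01, source = mode 6 / R7 (relative), dest = mode 0 / R4
word = two_operand_word(0o01, 6, 7, 0, 4)
print(f"{word:06o}")   # 016704
```

The same function gives 066704 for "ADD B, R4" (opcode 06) and 010467 for "MOV R4, C".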

Now, the offset word.

The operand address is 1200, and the PC will be 1304 when this is being calculated, having already fetched both words of this instruction. Thus we need a value of -104 (base 8).

The use of -104 is that when it is added to 1304, we will get 1200.

We represent -104 in two's-complement notation. Written out in all 16 bits, 104 (octal) is 0000000001000100. The one's-complement of this is then 1111111110111011, plus 1 is 1111111110111100 or 177674 (remember that the left octal digit is only one bit!).
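The same two's-complement calculation in Python, as a check on the bit-flipping. Python integers are not 16 bits wide, so the sketch masks with 0xFFFF to get a 16-bit word; the helper names are mine.

```python
def to_word(value):
    """Reduce a Python integer to a 16-bit two's-complement word."""
    return value & 0xFFFF

def from_word(word):
    """Read a 16-bit word back as a signed number."""
    return word - 0x10000 if word & 0x8000 else word

offset = to_word(0o1200 - 0o1304)        # target minus updated PC
flipped_plus_one = to_word(~0o104) + 1   # one's complement, then add 1

print(f"{offset:06o}")                   # 177674
print(f"{offset:016b}")                  # 1111111110111100
print(offset == flipped_plus_one)        # True
print(oct(from_word(offset)))            # -0o104
```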

Altogether the assembly is as follows:

loc   value   label  assembly
                     C = 2200
                     .ORG 1200
1200  000003  A      .WORD 3
1202  000004  B      .WORD 4
                     .ORG 1300
1300  016704         MOV A, R4
1302  177674
1304  066704         ADD B, R4
1306  177672
1310  010467         MOV R4, C
1312  000664
1314  000000         HALT

(Why does it jump from 1306 to 1310? Because it is base eight!)
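One way to convince yourself that the listing is right is to run it. Below is a deliberately tiny Python sketch of a PDP-11-like interpreter. It is my own toy, not a real emulator: it knows only MOV (01), ADD (06), and HALT (000000), handles only register mode (0) and index/relative mode (6), and ignores condition codes entirely.

```python
def run(memory, start):
    """Execute from `start` until HALT; returns the register list."""
    regs = [0] * 8
    regs[7] = start

    def fetch():
        word = memory[regs[7]]
        regs[7] = (regs[7] + 2) & 0xFFFF
        return word

    def locate(mode, reg):
        if mode == 0:                      # register mode
            return ("reg", reg)
        if mode == 6:                      # index mode (relative when reg is 7)
            offset = fetch()               # PC is updated before it is used
            return ("mem", (offset + regs[reg]) & 0xFFFF)
        raise NotImplementedError(f"mode {mode} is not part of this sketch")

    def read(place):
        kind, where = place
        return regs[where] if kind == "reg" else memory[where]

    def write(place, value):
        kind, where = place
        if kind == "reg":
            regs[where] = value
        else:
            memory[where] = value

    while True:
        word = fetch()
        if word == 0:                      # HALT
            return regs
        opcode = word >> 12
        src = locate((word >> 9) & 7, (word >> 6) & 7)
        dst = locate((word >> 3) & 7, word & 7)
        if opcode == 0o01:                 # MOV
            write(dst, read(src))
        elif opcode == 0o06:               # ADD
            write(dst, (read(dst) + read(src)) & 0xFFFF)
        else:
            raise NotImplementedError(f"opcode {opcode:o} is not part of this sketch")

memory = {
    0o1200: 0o000003, 0o1202: 0o000004,
    0o1300: 0o016704, 0o1302: 0o177674,
    0o1304: 0o066704, 0o1306: 0o177672,
    0o1310: 0o010467, 0o1312: 0o000664,
    0o1314: 0o000000,
}
regs = run(memory, 0o1300)
print(memory[0o2200])   # 7
```

If any of the three offset words were wrong, the loads would read from the wrong place (typically a location missing from the dictionary) or the result would not land in location 2200.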


Branch instructions

That is the format for most instructions, but branch instructions are different. Let us look at branch instructions now.

We have conditional and unconditional branches. Unconditional branches are simple. For example:

JMP 200
is a branch to location 200. Note that the fact that the PC is incremented before instruction execution is crucial here; otherwise we'd be branching to location 202, after the increment!

So for example

JMP 200
will have the simple semantics
R7 <- 200
Note that we have one less level of indirection than usual. In a way, this is a syntactic issue. If we wanted R7 <- [200], we could write JMP @200. It is much more common to want R7 to become 200, rather than [200], from an instruction like this; thus the default is as shown above.

That instruction takes up two words: the JMP op word, and then a following word which the PC-based addressing mode uses to arrive at 200 (an offset, in the default relative mode), just as "MOV A, R4" needed a second word for A.

There is a relative branch instruction which always fits in one word. Actually, it's a set of instructions, called "branch instructions" (we don't actually normally call "JMP" a "branch instruction", on the PDP-11), all of whose mnemonics begin with a 'B'. They have these properties:

We have conditional and unconditional branches. Unconditional branches are simple; you just start executing at a new place.

Conditional branches are more complicated. An example we've seen is BGT, branch if greater-than.

The relative branch instruction's opcodes take up 8 bits. This only leaves 8 for the offset. Since the PC always has to be even, the least bit of the offset always has to be zero; thus, we'll use those eight bits to represent a nine-bit offset, with the least significant bit implicit. In other words, branch address = [updated R7] + 2*offsetfield

So for example, in our earlier example:

LOOP: ADD (R2)+, R0
      DEC R1
      BGT LOOP
the offset field will be -3, as these three instructions are all one-word instructions.
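The formula branch address = [updated R7] + 2*offsetfield can be checked in both directions with a short Python sketch. The function names are mine, and location 1000 for LOOP is an arbitrary assumption; only the distances matter.

```python
def branch_offset_field(branch_loc, target):
    """Eight-bit offset field for a branch at branch_loc to target."""
    updated_pc = branch_loc + 2
    distance = target - updated_pc
    if distance % 2 != 0:
        raise ValueError("branch targets must be even")
    words = distance // 2
    if not -128 <= words <= 127:
        raise ValueError("target is out of range for a branch instruction")
    return words & 0xFF                 # 8-bit two's complement

def branch_target(branch_loc, field):
    """Inverse: where does a branch with this offset field go?"""
    words = field - 0x100 if field & 0x80 else field
    return branch_loc + 2 + 2 * words

# LOOP at 1000, DEC at 1002, BGT at 1004 (all octal, all one-word instructions)
field = branch_offset_field(0o1004, 0o1000)
print(f"{field:03o}")                        # 375, i.e. -3 in eight bits
print(oct(branch_target(0o1004, field)))     # 0o1000
```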

 
To understand the mechanism for specifying conditional branches, you first have to understand "condition codes".

When we write stuff like this BGT instruction just above, we're comparing something against zero. "GT" stands for "greater than zero", so "BGT" means "branch if greater than zero".

Branch if WHAT is greater than zero? As we said when we wrote this loop some time ago: the result of whatever the previous instruction was.

This means that the computer has some state relating to conditional branches. It doesn't look ahead when doing the decrement to see if it's going to be tested and how; rather, while doing the decrement, certain information is kept for possible use in a subsequent branch instruction. In fact this information is being computed and stored all the time, as we do almost any kind of instruction but certainly any arithmetic instruction; usually this information is discarded as we just replace it by doing another arithmetic instruction, but every once in a while we consult this information. The way we consult it is by doing a conditional branch instruction.

Condition codes on the 11 are N, Z, C, and V. These are very common codes, not just used on the 11, and we should spend a moment discussing their meaning. Each one is some particular bit in a special register in the CPU known as the condition register (or processor status word), and gets set and cleared automatically as instructions get executed.

N means negative. Remember that the leftmost bit of a two's-complement number is the "sign bit". So when we construct the hardware, we'll take the N condition code directly from that.

Thus, for example, in the DEC case, if R1 were previously zero, it would roll over to minus one, which is all bits one, and the N flag would be set. However, if R1 were previously negative, the N flag would also be set, unless it was the most negative number so that decrementing it would overflow. For example, if R1 contained -2, the DEC makes it -3 and the N flag still gets set. Thus, the N flag means the result IS negative, not that it became negative.

Therefore, the N flag is simply the leftmost bit of the result. It just stores this bit in the condition register. A BLT would test for the N flag. Branch if less than zero.

Z means zero. It's there to support a BEQ instruction. Also a BNE instruction, branch if not equal.

C is a different kind of flag. C stands for carry. Remember our multi-bit adder, in which each full adder had a carry output which was an input to the adder on its left. Then the leftmost adder had a carry output, and we weren't really sure what to do with it. Well, this is what we do with it. We throw it into the C flag. This allows an assembly programmer to do multi-word addition. If you add the least-significant word of each number together, this gives you the carry bit to feed into the addition of the next-least-significant words. On the other hand, if you're considering this to be unsigned addition, you would consider this carry bit to be an overflow.
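As a sketch of the multi-word addition idea, here is a Python model of adding two 32-bit numbers held as pairs of 16-bit words. The function names are hypothetical, and this models only the arithmetic, not any particular PDP-11 instruction sequence.

```python
def add_words(a, b, carry_in=0):
    """Add two 16-bit words; return (16-bit result, carry out)."""
    total = a + b + carry_in
    return total & 0xFFFF, total >> 16

def add_32(a_hi, a_lo, b_hi, b_lo):
    """Add two 32-bit numbers given as (high word, low word) pairs."""
    lo, carry = add_words(a_lo, b_lo)           # least-significant words first
    hi, carry_out = add_words(a_hi, b_hi, carry)
    return hi, lo, carry_out

# high word 1, low word 177777 (octal), plus 1: the low word wraps to zero
# and the carry bumps the high word from 1 to 2.
print(add_32(0o1, 0o177777, 0o0, 0o1))   # (2, 0, 0)
```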

Speaking of overflow, the V bit is the overflow bit. We use the second letter of the word "overflow" because O looks a lot like 0 and we have a lot of 0s floating around. If we're doing signed addition, then this indicates an overflow.

Example, 4-bit numbers:
0110+0011=1001
unsigned: 6+3=9
signed: 6+3=-7
So the V bit will be set, but not the C bit. (C cleared)

Example:
1011+0110=0001
unsigned: 11+6=1
signed: -5+6=1
so C set, V cleared.
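Both examples can be reproduced with a four-bit adder modelled in Python. This is a sketch with names of my own choosing; it computes V with the usual rule for addition (the operands have the same sign and the result's sign differs).

```python
def add4(a, b):
    """Add two 4-bit values; return (4-bit result, flags)."""
    raw = a + b
    result = raw & 0b1111
    sign_a, sign_b, sign_r = a >> 3, b >> 3, result >> 3
    flags = {
        "N": sign_r,                                   # leftmost bit of result
        "Z": int(result == 0),
        "V": int(sign_a == sign_b and sign_r != sign_a),
        "C": raw >> 4,                                 # carry out of the top bit
    }
    return result, flags

print(add4(0b0110, 0b0011))  # (9, {'N': 1, 'Z': 0, 'V': 1, 'C': 0})
print(add4(0b1011, 0b0110))  # (1, {'N': 0, 'Z': 0, 'V': 0, 'C': 1})
```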

The table listing opcodes specifies which ops set which condition bits. It's part of the semantics of an op.

Suppose we want to compare to a number other than zero? Or, more generally, suppose we want to do "if (a<b)", i.e. we want to branch if a is less than b, but we have no specific interest in the subtraction?

We could do:

SUB b, a
BLT wherever
but the problem is that this replaces a with a-b. We don't want to CHANGE a, we just want to compare it.

There is an instruction CMP which does this.
We can write:

CMP a, b
BLT wherever
to achieve this effect. Note that CMP computes left-operand minus right-operand, rather than the usual order!

Similarly, we may want to test against zero. If we've JUST performed the arithmetic yielding this potential zero, then we can consult the 'Z' flag; but if not, there is a TST instruction for this purpose.

"TST a" is the same as "CMP a,#0". It fetches the operand a, but it just sets all the condition bits based on a; it doesn't do any computation. CMP doesn't store the result; TST doesn't either, but TST doesn't even compute a result, it just takes its operand and sets the condition bits according to it.

What do V and C mean for a TST? Nothing, obviously.

So perhaps they'd be preserved from the previous computation, so you still could access the conditions resulting from the previous computation.

It so happens they're not; TST just clears the V and C bits. You find this out by looking at a table of the semantics of PDP-11 opcodes, e.g. the PDP-11 handout, which we'll go through next. No reasoning could lead you to the certain conclusion that they're cleared rather than preserved, although it does lead you to realize that there's no obvious way to set them to some result of the test.


What PDP-11 instructions should you know?

When you do PDP-11 programming, like you will for assignment 3, you should have the PDP-11 handout or something similar in front of you.

But here's a quick list of what you should know. Let's go through the PDP-11 handout.

You should find that the overall notes section now makes sense.

You should know the zero-operand instructions HALT and NOP. Look at the "format" box -- zero-operand instructions are all opcode.

NOP is a slightly weird one, but all machine languages have it. It comes in handy sometimes, especially in editing already-assembled machine code. You can basically "comment out" an instruction by changing it to a NOP, without having to adjust all the addresses and offsets and such of the surrounding code.

One-operand instructions:
Note that they also come in word and byte versions in most cases. In the case of a one-operand instruction, we have ten bits for the opcode, and one six-bit address in the addressing mode formats discussed above.

As you can see, condition codes need to be discussed individually for each op. But a lot of them are the same, so I wrote two standard "schemes" down in the intro section.

You should be at least vaguely aware of CLR, INC, DEC, COM, and NEG. I'm not covering shift instructions in this course, but they're rather like the shift register we looked at while discussing sequential circuits.

TST is a one-operand instruction; the other instructions you already know are mostly two-operand instructions, some of which are listed in the "one-and-a-half-operand" section. The difference between the 1.5- and 2-operand instructions is not really the number of operands, but the bit layout. In the 1.5-operand instructions, one operand can be a fully flexible address but the other has to be a register and it only has three bits to describe that operand.

You should be familiar with MOV, ADD, SUB, and CMP. We aren't going to do too much with the "bitwise ops" such as XOR, BIS, BIC, and BIT this term. But they correspond to the use of the bitwise operators "^", "|", and "&" in C or Java.


You'll notice that almost all of the formats have the boundaries on multiples of three. This makes octal coding easy. Remember that since we have 16 bits, and 16 is not a multiple of three, the left digit ends up standing for fewer bits: just one bit in fact. So the left digit is always one or zero. The other octal digits can be anything from zero to seven.

Now when it says, for example, that ADD is 06, since the opcode is four bits wide in that format, you know that all ADD instructions will begin with 06 in octal, and all instructions whose octal code begins with 06 will be ADD instructions.

Unfortunately, the branch instructions are exceptions to this general rule, because they have 8 bits for the opcode and 8 bits for the offset. I've provided the opcode in binary, as one normally does with the PDP-11. (Actually, the PDP-11 manual itself, from DEC, provides all opcodes in binary, but it's easier if you have all the ones which work out well in octal.) So if you're trying to figure out what a given instruction is, you'll have to consider that it might be a branch instruction. And the coding into octal is a pain.

There are 15 branch instructions, 14 of them conditional, which test various possible condition bit combinations. See the handout; they include BLT, BEQ (=0), BLE, etc., and also things like BCS and BCC.

We're going to discuss how the overflow bit relates to this in another week or two. For now, if you can assume that the subtraction does not overflow, then V=0, and if you substitute in V=0 to the weirder formulas, you will get a boolean expression which should make sense.
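As a preview, here is a Python sketch of "CMP a, b" followed by BLT, assuming the handout gives the BLT condition as N xor V (the usual PDP-11 definition). The function names are mine, and only N, Z, and V are modelled.

```python
def signed(word):
    """Read a 16-bit word as a two's-complement number."""
    return word - 0x10000 if word & 0x8000 else word

def cmp_flags(a, b):
    """N, Z, V after computing a - b on 16-bit words (result discarded)."""
    result = (a - b) & 0xFFFF
    true_difference = signed(a) - signed(b)
    n = result >> 15
    z = int(result == 0)
    v = int(not -0x8000 <= true_difference <= 0x7FFF)  # did not fit in 16 bits
    return n, z, v

def blt_taken(n, v):
    """Assumed BLT condition: branch when N xor V is 1."""
    return bool(n ^ v)

# No overflow: V=0, so BLT just looks at N.
n, z, v = cmp_flags(3, 5)
print((n, v), blt_taken(n, v))            # (1, 0) True

# Overflow: -30000 - 10000 doesn't fit, so N alone would give the wrong answer.
n, z, v = cmp_flags(-30000 & 0xFFFF, 10000)
print((n, v), blt_taken(n, v))            # (0, 1) True
```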

Given instructions such as BCS and BCC, we also have instructions such as CLC and SEC which explicitly clear or set the C condition code, and so on for the other condition codes. So you can write subroutines which use the condition bits for other things, such as to indicate an error status. You call the subroutine (procedure), then when it returns you check for carry before anything else.

The last page of the PDP-11 handout contains stuff about priority levels, which we won't get to until later in the course, and the condition codes, which you know about. There's also a list of pseudo-ops, just for reference.

The second-to-last page contains a whole bunch of miscellaneous stuff, pretty much in the order in which it's of interest to us. So far we've mostly seen JMP; "Subtract one and branch" is a nifty loop instruction even though it's highly specialized; and we're in the middle of talking about the JSR and RTS instructions in talking about subroutine linkage.


One other thing we need to discuss as background for subroutine linkage is the use of the stack. We said earlier that R6 was the stack pointer (SP).

Using the autoincrement and autodecrement addressing modes, we can do pushes and pops.

push: MOV x, -(R6)

pop: MOV (R6)+, x

Note that it's an important feature of the PDP-11 design that the autoincrement and autodecrement are opposite in whether the inc/dec occurs before or after the use of the SP in the address calculation.

Thus we see that we can use any register to maintain a stack. However, the specialized subroutine-oriented instructions use R6. So we usually use R6 for the stack.
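Here is the push and pop behaviour as a small Python model, assuming word-sized data and a stack pointer that starts at location 1000 (an arbitrary address for this sketch). The function names are mine.

```python
memory = {}
regs = {"R6": 0o1000}    # hypothetical initial stack pointer

def push(value):
    """MOV x, -(R6): decrement first, then store."""
    regs["R6"] -= 2
    memory[regs["R6"]] = value

def pop():
    """MOV (R6)+, x: load first, then increment."""
    value = memory[regs["R6"]]
    regs["R6"] += 2
    return value

push(0o11)
push(0o22)
first = pop()
second = pop()
print(oct(first), oct(second), oct(regs["R6"]))   # 0o22 0o11 0o1000
```

Because the decrement happens before the store and the increment after the load, R6 always points at the most recently pushed word, and a pop exactly undoes a push.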

We have to initialize R6 with the memory address of the end of a suitable area for storing the data we push on the stack. This is typically done by the operating system, and thus in this course we will not worry about how to find an appropriate memory area for this purpose. We'll suppose that the operating system has already set R6 to point to a wisely-chosen area of memory for use as the stack.


