Background - The Lowest-Level Language
Assembly is among the most feared of programming languages, and it is the closest to machine code you can get. It is a low-level imperative language with no objects or classes, and the if statements and while loops from your favorite languages like Java and C++ are replaced with jumps and labels.
Within programming paradigms, Assembly is at the bottom.
In order to prepare for my upcoming Assembly class at Bloomsburg University, I have started teaching myself how to use an 8-bit Assembly Simulator, created by Marco Schweighauser. It's a great way to see exactly how your code alters the different components of the computer, such as the RAM and registers. I have created this simple tutorial to help any other students or programmers who are struggling with understanding how Assembly works or how to program in it. This is intended to be an introduction, not an advanced tutorial.
To follow this guide, you will need an understanding of a couple of things:
- A Prior Programming Language or Scripting Language like Java, Python, C++, or C
- Bits and Bytes
- Hexadecimal Numbers
- ASCII Characters
- Difference between CPU and RAM
The Layout
When you go to this website, it starts you off with some default code. If you click the green button in the top left corner that says ➤ Run , you'll be able to watch the phrase "Hello World!" appear in the output (slowly, but surely). You will also notice that upon running it, your RAM will fill up with the machine code version of the typed code. This HEX is the "assembled" version of it.
The ➧➧ Step button allows you to go through your code one line at a time. It is great for debugging.
The Reset button empties the RAM and registers, basically setting everything back to 0's.
Understanding "Assembling"
After the default program is run, the output section of the RAM does not show the text "Hello World!" but some HEX numbers. The first five should look like this:
48 65 6C 6C 6F
As you could probably guess, this is actually ASCII for "Hello."
H e l l o
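If you would like to check the conversion yourself, here is a small Python sketch. Python is only acting as a calculator here; this is not simulator code, and the variable names are my own.

```python
# Convert the text "Hello" to the hex bytes shown in RAM.
text = "Hello"
hex_bytes = [format(ord(ch), "02X") for ch in text]
print(" ".join(hex_bytes))  # 48 65 6C 6C 6F

# And back again: hex bytes to characters.
decoded = "".join(chr(int(h, 16)) for h in hex_bytes)
print(decoded)  # Hello
```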
In fact, if you look into the assembly code part of the RAM (which is the top part), you will see the same HEX in the first line. This is the part of the code that saves the string of characters as ASCII.
You can see that the assembler doesn't save any room in the RAM for neatness or formatting.
It utilizes every byte it can! Also, take note that everything in RAM occurs in the exact same order as it does in the code.
A color-coded conversion of code being assembled to HEX.
Let's take a look at how the assembly process works. The first line of code says JMP start . If you look at its HEX conversion, it shows 1F 0F . The "1F" stands for a "Jump" command, and the "0F" stands for the location of where to jump to. We want to jump to the first command of the "start:" label, which happens to be MOV C, hello . The hex for this corresponds with 06 02 02 (don't worry about what this means yet). If you count to the position of the first byte of that command, " 06 ," you will find that it is the 16th byte in RAM. Because computers start counting at 0, its address is 15. If you convert 15 into HEX, you get our friend 0F . The assembler worked out which byte to jump to for us, so we never had to know ahead of time where the MOV command would land when it got assembled.
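The same counting can be done as a quick sum. This is a Python sketch of the arithmetic only, not of how the simulator's assembler is actually written. The sizes come from the bytes visible in RAM, and the variable names are made up for illustration.

```python
# Sizes in bytes of everything placed before the start: label.
jmp_size = 2                       # 1F 0F: one opcode byte + one address byte
string_size = len("Hello World!")  # one byte per ASCII character
terminator_size = 1                # the single 00 byte that follows the string

hello_address = jmp_size           # the string begins right after the JMP
start_address = jmp_size + string_size + terminator_size

print(format(hello_address, "02X"))   # 02
print(start_address)                  # 15
print(format(start_address, "02X"))   # 0F
```

If you made the string longer, start_address would grow with it, and the byte after 1F would change to match. That is the work the label saves you from doing by hand.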
The dynamic nature of labels is one of the more subtle reasons why we would program computers in Assembly, as opposed to pure HEX. Another reason would be to avoid learning the HEX form of every instruction. Rather than learning about the five different kinds of MOV commands, it is easier to let the assembler take care of that for us.
It is important to note that the code used in this Simulator does not represent all other Assembly instruction sets, such as the popular x86 family that began with the Intel 8086.
Variables, Registers, and the Stack
The first thing you should know about variables is how they are stored. In a higher level programming language, when you type,
[String/var] hello = "Hello World!"
...you are not actually storing the string "Hello World!" into the variable hello. You are writing the string "Hello World!" to RAM, and storing its address (location in RAM) under the name hello. That way, you can refer to the string "Hello World!" wherever you are.
In Assembly, we store variables the same way, but in a different syntax. We write data to RAM and use a label so that registers can refer to it. The command for writing data to RAM is DB, as in DB "Hello World!" It can be thought of as this:
Pseudocode showing how registers point to variables.
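For readers who prefer something they can run, here is a rough Python model of the same idea. It treats RAM as a list of bytes and labels as a dictionary of addresses. The helper name db and the labels dictionary are my own inventions for illustration; in the simulator, the assembler does this bookkeeping for you.

```python
ram = []      # RAM modelled as a growing list of byte values
labels = {}   # label name -> address of the first byte stored under it

def db(data, label=None):
    """Append data to RAM, like the DB command; optionally name its address."""
    if label is not None:
        labels[label] = len(ram)
    if isinstance(data, str):
        ram.extend(data.encode("ascii"))
    else:
        ram.append(data)

ram.extend([0x1F, 0x0F])          # the two bytes of JMP start come first
db("Hello World!", label="hello")
db(0)                             # a single 00 byte after the string

register_c = labels["hello"]      # what MOV C, hello stores: an address
print(register_c)                 # 2
print(chr(ram[register_c]))       # H
```

Notice that register_c never holds the text itself, only the number 2. The text is found by looking at that position in RAM.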
To access and use the variables we save, we need to utilize registers. Registers are bytes of memory stored directly in the CPU. They are used to store data that requires the direct attention of the processor.
Registers A, B, C, D, IP, and SP. Flags Z, C, and F.
In this simulator, there are three kinds of registers.
The A, B, C, and D registers are considered General Purpose Registers. These are the ones that we will directly manipulate and store values into. You can store a value, like the location of the "hello:" label, by using a MOV command, such as MOV C, hello . This command stores the address of the first byte at the label hello in Register C.
The IP register is the Instruction Pointer. This register tells the CPU which instruction to execute next. It points to the next instruction by holding the command's address (location in HEX), much like the address we calculated for the JMP start command earlier. In fact, using a JMP command can directly manipulate the IP. You can see it constantly changing as the program is stepped through.
The SP register is the Stack Pointer. This register keeps track of where the top of the stack currently is in RAM. The stack is a dynamically growing and shrinking portion of RAM that is used to store address locations and values directly from registers. Think of the stack as memory for direct use by registers. We use the PUSH and POP commands to PUSH a register's data to the stack and POP the latest stack byte into a register.
The Flags are single bits that are either False or True (0 or 1). They are used for conditional jumps, or the if statements of Assembly. While they are important, I will not discuss in great detail how they are used or manipulated. Just know that we should not need to interact with them directly when programming in assembly.
Stepping through "Hello World!"
We are going to dissect the default program one step at a time to get an understanding of how this program works. It is recommended to have the simulator loaded in a window alongside this blog.
For reference, here is a list of how this program intends to use the four General Registers:
Register A - (Value) Holds copies of the individual characters and places them into the output.
Register B - (Value) Always holds a 00 for reference when looking for the String Terminator.
Register C - (Address) Lets Register A know which character to copy.
Register D - (Address) Lets Register A know where in the output to place the newly copied character.
Click Reset . (Step 00)
Note that RAM is cleared. This should effectively give you a clean slate, along with the SP register being set back to E7 , the start of the stack. IP should be at 00 , the start of RAM.
Click ➧➧ Step . (Step 01)
The first instruction JMP start should execute, and IP will point to the next instruction at address 0F , which happens to be the first command of the start label, MOV C, hello .
You will see RAM fill up with the assembled code. You will also note several highlighted bytes in the RAM. These are the pre-calculated spots for the IP register to point to: the first byte of each assembled instruction. Each command (like JMP and MOV) requires a very specific number of bytes.
Rather than keeping a list of every command's location for the IP register to follow, the CPU knows exactly how many bytes each command requires. It moves the IP to the byte directly after the full length of the current command, then interprets that new byte as a command, which tells the CPU how many of the subsequent bytes are parameters of this new command. This information is part of the instruction set, or the set of commands known by the CPU.
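To see how knowing each command's length is enough to find the next one, here is a Python sketch that walks the start section. The lengths for MOV and JMP come from the bytes we looked at earlier. The lengths for CALL and HLT are my reading of the addresses that show up in the steps that follow, so treat them as assumptions rather than documentation.

```python
# Bytes used by each command in the start: section (opcode + parameters).
program = [
    ("MOV C, hello", 3),   # 06 02 02
    ("MOV D, 232", 3),     # opcode, register, number
    ("CALL print", 2),     # opcode, address (assumed length)
    ("HLT", 1),            # opcode only (assumed length)
]

ip = 0x0F                  # address of the first command after start:
addresses = []
for command, length in program:
    addresses.append(format(ip, "02X"))
    print(format(ip, "02X"), command)
    ip += length           # the next command begins right after this one

print(format(ip, "02X"))   # 18: the first byte after HLT
```

The printed addresses (0F, 12, 15, 17) are the same values you will see the IP take as you step through the start section.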
Click ➧➧ Step . (Step 02)
The command MOV C, hello will execute, putting the location of the first byte at hello into register C. You will see Register C holding 02 .
It is important to note that nothing in a register distinguishes a value from an address. You have to know what kind of data you put in the register, and what you want to do with it.
As discussed before, we are not putting "Hello World!" into register C, but its address in RAM, which happens to be the 3rd byte. This is represented in HEX with 02 .
IP points to the next instruction at address 12 , which happens to be the second command of the start label.
Click ➧➧ Step . (Step 03)
The command MOV D, 232 will execute, moving the HEX value of the number 232 into register D. 232 in HEX happens to be E8 . That value is the starting byte of the output section of RAM.
Anything we put in the output section of RAM will be interpreted as ASCII and displayed in the output. The program will continue to use Register D to hold the location of the output.
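Here is a small Python sketch of that idea: convert 232 to HEX, drop a byte into the first output slot, and read the output area back as text. The 256-byte RAM size is my assumption for an 8-bit machine, and OUTPUT_START is a name I made up.

```python
OUTPUT_START = 232
print(format(OUTPUT_START, "02X"))   # E8

ram = [0] * 256                      # assumption: 256 bytes of RAM
ram[OUTPUT_START] = 0x48             # store the byte for 'H' in the first output slot

# Read the output region back as text, skipping empty (00) slots.
output = "".join(chr(b) for b in ram[OUTPUT_START:] if b != 0)
print(output)                        # H
```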
IP will point to the next instruction at address 15 , which is the second-to-last command of the start label.
Click ➧➧ Step . (Step 04)
Step 04. Note the address 0x17 in the stack.
The command CALL print will execute, moving the IP to the address of the label print.
Calling is different than Jumping, but similar. When you call a label or address, it means you intend on returning to where you made the CALL from. When you Jump, you don't intend on returning to that location. You can think of the CALL command like calling a method or a function in Java or Python, respectively.
In order to remember where to return to after a CALL command is finished, the address of the next instruction (0x17) is pushed to the stack. The command after CALL print is HLT , which is where we will return to once the CALL is completed. The HLT command stops the process successfully.
IP will point to the next instruction at address 18 , which is the first command of the print label.
Click ➧➧ Step twice! (Steps 05 & 06)
From this point on, the color of each command will no longer reflect the color-coded image shown above.
The commands PUSH A and PUSH B will execute, pushing whatever values are held in A and B to the stack in that order. This is a useful trick in case you want to save values for later. Since both A and B were empty with 00 , that is exactly what is written to the stack (twice). The stack should show 00 00 00 17 with SP having the address of E4 . Since the stack grows down (toward lower addresses), it makes sense that the SP is smaller now.
We can get these values back later by POPing them back into registers. You can see the respective POP commands at the end of .loop. Just know that for the purpose of this program, the present PUSH and POP commands are actually useless, as we have no need to retain the null bytes. It is, however, a good example of how to save data from the registers for later.
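The stack so far can be modelled in a few lines of Python. This is a sketch, not the simulator's source: I am assuming each push writes at the SP and then moves the SP down by one, because that reproduces the E7 to E4 change described above. The 256-byte RAM size is also an assumption.

```python
ram = [0] * 256     # assumption: 256 bytes of RAM
sp = 0xE7           # start of the stack, as seen after Reset

def push(value):
    """Write value at SP, then move SP down one byte (assumed order)."""
    global sp
    ram[sp] = value
    sp -= 1

push(0x17)          # CALL print saves the return address
push(0x00)          # PUSH A
push(0x00)          # PUSH B

print(format(sp, "02X"))                                   # E4
print(" ".join(format(b, "02X") for b in ram[0xE4:0xE8]))  # 00 00 00 17
```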
IP will point to the next instruction at address 1C , which is the last command of the print label.
Click ➧➧ Step . (Step 07)
The command MOV B, 0 will execute, writing the value 00 to B. In this program, Register B will be compared against each character of the string "Hello World!" as ASCII HEX within a loop. After we get every character into the output (which we will do one by one), we plan on exiting the loop and ending the program. We need to have a way of knowing when to stop printing characters.
Introducing the String Terminator! What happens to be the HEX value in RAM directly after our string? A 00 that was placed there when the hello section of code was written to RAM! Specifically, it was put there from the line DB 0 . We call this a string terminator because it signifies the end of a string. Within our character printing loop, we will make a comparison to see if we have hit the string terminator, which would then break the loop!
IP will point to the next instruction at address 1F , which is the first command of the .loop label.
Are you still there? If so, pat yourself on the back! You must have some crazy dedication to learn one of the hardest and least used programming languages, and I really appreciate your choice to learn Assembly on my blog of all places. We're about 60% through and this is the more exciting part, so hang in there!
Click ➧➧ Step twice! (Steps 08 & 09)
Moving a character takes two steps.
The commands MOV A, [C] and MOV [D], A will execute. The first thing that you'll probably notice about these MOV commands is the [] brackets. These brackets mean that you should treat the value within the brackets as an address, then get the value at that address.
So let's walk through it. MOV A, [C] begins by looking for the value within Register C, which happens to be 02. It knows by the brackets to interpret 02 as an address, rather than as a value. Next, we will look for that location, which happens to hold the value 48. Lastly, we are copying that value into the Register A, successfully putting the letter 'H' into a place that we can get at easily.
Alright, not too hard to follow... let's take a look at MOV [D], A . We are taking the value at Register A (48) and copying it to the address found at Register D (E8). This address happens to be the first character slot of the output. As soon as it gets copied into RAM, it automatically updates the output with the value's ASCII equivalent.
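If the brackets still feel slippery, this Python sketch may help. A list index plays the role of the brackets. The reg dictionary and the 256-byte RAM are my own modelling choices, not part of the simulator.

```python
ram = [0] * 256                                  # assumption: 256 bytes of RAM
ram[2:14] = "Hello World!".encode("ascii")       # the string starts at address 02

reg = {"A": 0, "B": 0, "C": 0x02, "D": 0xE8}     # register values before step 08

reg["A"] = ram[reg["C"]]      # MOV A, [C]: read the byte at the address held in C
ram[reg["D"]] = reg["A"]      # MOV [D], A: write A to the address held in D

print(format(reg["A"], "02X"))   # 48
print(chr(ram[0xE8]))            # H
```

Without the brackets, MOV A, C would be closer to reg["A"] = reg["C"], which copies the number 02 itself instead of the letter stored there.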
IP will point to the next instruction at address 25 , which is the third command of the .loop label.
Click ➧➧ Step twice! (Steps 10 & 11)
Dual incrementing keeps the letters in order.
The commands INC C and INC D will execute. The Increment command adds one to each of the two registers. Remember that Register C is in charge of locating the letters in our stored string, and Register D is in charge of locating where to plop them down in the output. It is VERY important to keep the two increasing parallel to each other.
If either of these were to get out of sync and increase at different rates, there could be Corruption. If this happens, you might get something weird in the output like "HloWrd." That is... if you are lucky enough to come across a string terminator. Otherwise, the program would keep outputting weird letters until it either ran out of RAM, or it ran out of output space (the second of which being more likely). Trying to write data to an area that has run out of space would result in a Buffer Overflow error.
Without proper safety checks like exception handling (especially at this low level), a buffer overflow could cause some really crazy things to happen. In fact, hackers can take advantage of these flaws to compromise computers by making these programs do malicious things that they were never meant to.
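You can reproduce the "HloWrd" corruption without breaking anything by modelling the loop in Python. In this sketch the string sits at address 0 for simplicity, c_step is a made-up knob for how far Register C moves each pass, and limit is a stand-in for the size of the output area. The guard against running off the end of the list is mine; the real machine has no such courtesy.

```python
def print_string(c_step, limit=24):
    """Copy characters to the output, advancing C by c_step and D by 1."""
    ram = list("Hello World!".encode("ascii")) + [0]   # string plus terminator
    output = []
    c = 0
    while len(output) < limit:            # crude guard so the sketch always stops
        output.append(chr(ram[c]))        # MOV A, [C] then MOV [D], A
        c += c_step                       # INC C (possibly too many times)
        if c >= len(ram) or ram[c] == 0:  # CMP B, [C]: stop at the terminator
            break
    return "".join(output)

print(print_string(1))   # Hello World!
print(print_string(2))   # HloWrd
```

With a step of 2 we happen to land exactly on the terminator, which is the lucky case. Other step sizes can skip right over it.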
IP will point to the next instruction at address 29 , which is the fifth command of the .loop label.
Click ➧➧ Step twice! (Steps 12 & 13)
The commands CMP B, [C] and JNZ .loop will execute. The set of these two commands is practically an if statement. When the Compare (CMP) command is executed, it takes a look at the two values and sets the Flags Z, C, and F depending on how the two are compared. We then do a particular jump based on the conditions of the flags.
In this particular case, we are looking to see if we have NOT come across a zero (00) at the address of Register C. Register B was set to 00 earlier, and will be used as a reference of what we are looking for. Remember that Register C is in charge of pointing to the characters that will eventually be put into the output. If we have run into a zero, it means we have come across the string terminator! This will indicate that we should stop reading the string, exit the loop by skipping the following conditional jump, and move onto the next part of code.
What is done with the newly set flags is up to the following Conditional Jump command. There are numerous conditional jump commands, but we will be using the Jump if Not Zero JNZ command. Instead of looking for the string terminator, we execute the jump if we find anything else. This of course loops us back to the beginning of this section, .loop.
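Here is the compare-then-jump pair as a Python sketch. It models only the Z flag, and the function name cmp_sets_zero is made up. I am assuming Z is set when the two compared values are equal, which is how the walkthrough above uses it.

```python
def cmp_sets_zero(a, b):
    """Model of CMP for the Z flag only: Z is True when both values match."""
    return a == b

ram = list("Hi".encode("ascii")) + [0]   # a short string plus its terminator
b = 0                                    # Register B, the reference value
decisions = []

for c in range(len(ram)):                # C walks over every byte
    zero = cmp_sets_zero(b, ram[c])      # CMP B, [C]
    jump = not zero                      # JNZ jumps only when Z is not set
    decisions.append((format(ram[c], "02X"), jump))

print(decisions)   # [('48', True), ('69', True), ('00', False)]
```

The jump is taken for every real character and skipped only at the 00 byte, which is how the loop ends.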
For more information on flags and conditional jumps, check the corresponding section on this webpage.
Click ➧➧ Step 6 commands times the 11 remaining characters = sixty-six more times!
...Or click ➤ Run until the string "Hello World!" appears in the output. (Steps 14 & 15)
Now that the program has done what we needed it to do, all that's left is some cleaning up. The commands POP B and POP A will execute. These commands move whatever values are in the stack into the specified registers. If you recall, we PUSHed some values in steps 05 & 06 to the stack. You can think of the POP command as a way to recall those stored values.
In this example, we are only moving the two 00's back into their respective registers. Notice that we are popping the data in reverse. This is good practice for making sure that we don't accidentally put the wrong data into the wrong register.
Remember that PUSH, POP, and The Stack allow you to work with way more data than just the four general registers.
Click ➧➧ Step . (Step 16)
The command RET will execute, Returning you to the command after the CALL command that we executed. This will be the case, assuming that everything you PUSHed to the stack has been accounted for by an equal number of POP commands. If there are more POPs than PUSHes or vice versa, there will most definitely be a problem.
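To see why the PUSHes and POPs have to balance, here is a Python sketch of the whole round trip. As before, the write-then-decrement order and the 256-byte RAM are my assumptions. The values AA and BB are made up so that the order is visible, since the real program only pushes 00s.

```python
ram = [0] * 256     # assumption: 256 bytes of RAM
sp = 0xE7

def push(value):
    global sp
    ram[sp] = value     # assumed order: write, then move SP down
    sp -= 1

def pop():
    global sp
    sp += 1             # assumed order: move SP up, then read
    return ram[sp]

push(0x17)              # CALL print saves the return address
push(0xAA)              # PUSH A (made-up value so the order is visible)
push(0xBB)              # PUSH B (made-up value)

b = pop()               # POP B comes first: last pushed, first popped
a = pop()               # POP A
ip = pop()              # RET pops the return address into IP

print(format(a, "02X"), format(b, "02X"), format(ip, "02X"))   # AA BB 17
```

If you left out one of the POPs, the last line would pull AA into the IP instead of 17, and the program would wander off to the wrong address.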
Click ➧➧ Step . (Step 17)
The command HLT will execute, Halting the program. It couldn't be any simpler than that.
Since I don't have anything else to explain about the halt, how about a fun fact? If you look at the assembled code for a Halt command, you should see 00 . Look familiar? That's because it's the same byte value as a string terminator! It just goes to show that the only difference between certain bytes is the way they are interpreted and used. If we were to jump to the location of a string terminator and execute it as a command, it would result in a Halt.
Conclusion
Congratulations! You have successfully completed and walked through your first Assembly Program! Take pride in knowing that you have way more patience and passion to learn than most programmers! It's important to have a strong foundational understanding of where higher programming languages stem from. The next step would be taking what you learned and applying it to something new. It took me a day to learn assembly and understand how this simulator worked, and a night to develop my first program. I encourage you to come up with a simple program to refine your understanding. Here are some ideas:
- A "Hello [Your Name Here]!" Program
- Have a ball move from the left of the output to the right of the output.
- Create a 4 function calculator program that shows you your equation and calculated output
I wish you good luck in your low-level endeavors.
And as always, thanks for reading!
~ Dan