Somebody who keeps track of what's going on in the Asm blocks of my scripts may wonder why I often place my intrinsic assembler subroutines within the .data section of an Asm block rather than at the end of its .code section, after the "ret" instruction that exits the Asm block and returns control to Fbsl.
Well, firstly, when a .data section is defined together with a corresponding .code section, the virtual machine in fact skips everything that resides in the .data section on entry and sets the Asm sub/function entry point to the instruction that immediately follows the .code directive. Thus I point my instruction pointer exactly at the instruction that I want executed first, without any additional jumps.
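To make the idea concrete, here is a minimal Python model of that loader behavior (a hypothetical sketch, not the actual Fbsl source): the entry point is set to the first line after the .code directive, so subroutines parked in the .data section are never walked through on entry.

```python
# Hypothetical sketch of the entry-point rule described above: everything
# between .data and .code is skipped on entry; execution starts right
# after the .code directive.

def find_entry_point(lines):
    """Return the index of the first line after the '.code' directive."""
    for i, line in enumerate(lines):
        if line.strip() == ".code":
            return i + 1
    return 0  # no .code directive: entry point is the top of the block

block = [
    ".data",
    "MySub:",          # intrinsic subroutine lives in the .data section ...
    "  mov eax, 1",
    "  ret",           # ... and returns to the main Asm code when called
    ".code",
    "  call MySub",    # execution starts here; no jump over the sub needed
    "  ret",
]

entry = find_entry_point(block)
print(entry)              # 5
print(block[entry])       #   call MySub
```

The subroutine body sits physically before the entry point, yet costs nothing at run time because the instruction pointer never passes through it.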
OK, that's clear. But why not at the end of the Asm block?
You see, like everything else in the computer world, the assembly process progresses gradually from the beginning of the Asm block to its matching End Asm. Assembler subprocedures within Asm blocks are defined with their labeled names and end with their respective "ret" instructions that return control back to the main Asm code. Labels are in fact mere pointers in human-friendly form, which the asm parser replaces with their exact offset values, calculated once on the fly at app load time. Ideally, to make parsing as fast and loading as short as possible, there should be just one parsing pass that resolves all labels to their respective offsets in one go.
If a call to a subprocedure label occurs before the parser encounters the label proper, which is located at the end of the Asm block, the parser cannot predict the exact offset to that label until it finally meets it in the flesh! So what do I do? I reserve 8 zero bytes for this yet-to-be-determined offset and go on parsing, filling in opcodes, keeping track of my displacement, and resolving other label references whose offsets I already know (e.g. backward references to the data defined in the .data section of the Asm block), until I finally come across the subprocedure label in question. Then I simply store its offset, which I am at last able to determine, and go on parsing to the end of the Asm block.
If, on completing pass 1, I find out that there were no such forward references at all, I finalise the overall loading process and launch execution.
If at least one such forward reference has remained temporarily unresolved during pass 1, I launch pass 2 of my asm code parser, wherein it attends only to the reserved 8-byte spaces, filling them with the required forward offsets that I stored in pass 1. If at least one previously reserved 8-byte space remains unfilled, i.e. doesn't have a matching offset ready, it means the programmer has mistakenly made a reference to a non-existent label. In this case, loading stops and a corresponding error message is sent to the console. If everything is OK, meaning all labels are resolved successfully, then execution may finally begin... And the actual parser speed with such a "one-and-a-half-pass" approach is currently up to 2MB of asm source code per second.
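The whole "one-and-a-half-pass" scheme described above can be sketched in a few lines of Python (a hypothetical model for illustration, not the actual Fbsl Dynamic Assembler): pass 1 resolves backward references on the spot and records a fixup plus an 8-byte blank for each forward one; pass 2 runs only when fixups exist and complains about any label that never materialised.

```python
# Minimal model of the one-and-a-half-pass label resolution scheme.
# items: ('label', name) defines a label, ('ref', name) references one.

def assemble(items):
    code = bytearray()
    labels = {}      # name -> offset, filled in as labels are encountered
    fixups = []      # (patch_position, name) for forward references

    # pass 1: emit, resolve backward refs, reserve blanks for forward refs
    for kind, name in items:
        if kind == "label":
            labels[name] = len(code)
        elif name in labels:                    # backward ref: resolve now
            code += labels[name].to_bytes(8, "little")
        else:                                   # forward ref: 8-byte blank
            fixups.append((len(code), name))
            code += b"\x00" * 8

    # pass 2: runs only if at least one forward reference was left open
    for pos, name in fixups:
        if name not in labels:
            raise ValueError(f"reference to non-existent label: {name}")
        code[pos:pos + 8] = labels[name].to_bytes(8, "little")
    return bytes(code)

out = assemble([("ref", "sub"), ("label", "sub"), ("ref", "sub")])
# the first reference was forward (patched in pass 2), the second backward
```

Note that when the fixup list comes out empty, the second loop is a no-op, which corresponds to skipping pass 2 entirely.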
The actual offset value found may take less space to express than the 8 zero bytes I reserved earlier, and the extra zero bytes left over may stall the CPU, because it regards opcode 0 as a valid instruction while it is an improper instruction at this very place! What can I do to change this situation for the better?
Well, obviously I can substitute opcode &H90 for opcode 0 when filling in the 8-byte blanks. &H90 is a "nop" instruction that has no other manifestation than simply incrementing the CPU instruction pointer until a meaningful instruction is encountered. Is it a solution? Yes, it is. But it is a cheap solution. Each "nop" requires 1 CPU cycle to execute, and the number of such successive "nops" in a reserved blank may reach 6. Six extra CPU cycles per forward reference... An infinitesimal loss by Fbsl standards but an unpardonable and shameful waste by assembler ones!
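The cheap fix looks like this in model form (a hypothetical sketch): the unused head of the 8-byte blank is filled with single-byte nops, one wasted cycle apiece.

```python
# The "cheap" padding described above: if the final encoding of a forward
# reference needs fewer than the 8 reserved bytes, fill the leftover bytes
# with 0x90 ("nop"). Each leftover byte costs one extra nop at run time.

def patch_with_nop_padding(blank_size, encoding):
    pad = blank_size - len(encoding)        # up to 6 wasted bytes in practice
    return b"\x90" * pad + encoding         # nops run first, then the real op

# e.g. a 5-byte near call patched into an 8-byte blank: 3 wasted nops
patched = patch_with_nop_padding(8, b"\xe8\x10\x00\x00\x00")
print(patched.hex())   # 909090e810000000
```

The nops precede the real instruction, matching the "head" placement mentioned later in the post.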
I SWEAR HONEST TO GOD I DIDN'T PEEK AT THE FOLLOWING IDEA ANYWHERE! I CONCEIVED AND NURTURED IT ALL BY MYSELF!
Luckily, asm is very, very old, almost as old as the computer world itself. It has seen many funny things, and some of those things are still available in its structure. Or perhaps somebody is deliberately conserving them for cases like this. There are a number of instructions that are very useful and fast. The funny thing about them is that they can also be empty, dummy, false, or whatever you may call them, i.e. they may do actually nothing to your register values and flags except increment the instruction pointer by the number of opcode bytes they occupy. Thus they appear to be a sort of "multi-byte nop", and they execute at the exact same ultimate speed of 1 CPU clock! And their length can be selectively varied from 2 to 7 bytes, which is exactly what I need. So the current Dynamic Assembler implementation loses only 1 extra CPU clock, while still losing up to 6 extra bytes in the resulting opcode length per forward reference.
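For reference, here is one well-known family of such fillers: the multi-byte "nop" encodings recommended in the Intel manuals (this table is a sketch for illustration; the byte sequences actually chosen inside Fbsl may differ). Whatever its length, the filler decodes and executes as a single instruction.

```python
# Intel-recommended multi-byte nop encodings, one per length in bytes.
# Each entry is a single instruction that changes nothing but the
# instruction pointer.

MULTIBYTE_NOP = {
    1: b"\x90",                              # nop
    2: b"\x66\x90",                          # 66 nop
    3: b"\x0f\x1f\x00",                      # nop dword [eax]
    4: b"\x0f\x1f\x40\x00",                  # nop dword [eax+0]
    5: b"\x0f\x1f\x44\x00\x00",              # nop dword [eax+eax+0]
    6: b"\x66\x0f\x1f\x44\x00\x00",          # 66 nop dword [eax+eax+0]
    7: b"\x0f\x1f\x80\x00\x00\x00\x00",      # nop dword [eax+disp32]
}

def patch_with_multibyte_nop(blank_size, encoding):
    pad = blank_size - len(encoding)
    filler = MULTIBYTE_NOP[pad] if pad else b""
    return filler + encoding        # one filler instruction, then the real op

# same 5-byte call as before, now preceded by a single 3-byte nop
patched = patch_with_multibyte_nop(8, b"\xe8\x10\x00\x00\x00")
print(len(patched))                 # 8
```

The byte cost is identical to a run of single-byte nops, but the run-time cost drops from up to 6 executed instructions to exactly one.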
While studying some MinGW/GCC asm dumps much later on, I noticed that the GCC guys use the same trick to get some of their references straightened out and their opcode sections aligned to dword boundaries for better CPU throughput. So I am not calling myself an inventor; perhaps I just happened to re-invent another wheel.
This is why I am always trying to squeeze as much of my data and code behind my back into the .data section of my Asm blocks. Frankly, I'm still a little ashamed of those extra clocks and wasted bytes...
Having ultimately no forward references at all cuts my opcode shorter and makes it run somewhat faster. And this is exactly why I am not currently reporting code placed into a Dynamic Assembler .data section with either a warning or an error.

My question is:
DOES ANYONE KNOW a better way to get rid of this 1 extra clock of overhead and/or of this 6-byte tail chaser (though it actually is a "head", because those fake-nop bytes precede the instruction that uses the forward reference)?
And if such a better way exists, won't it create the need to run extra assembler passes? My (or rather, our) Dynamic Assembler is JIT, and we can't trade precious app load time for the questionable benefit of a few lazy extra passes.

We can't afford this.