[SOLVED] Revelations Regarding Dynamic Assembler Innards

(Asm- and C-related Info and Discussions)

Moderator: Mike Lobanovsky

[SOLVED] Revelations Regarding Dynamic Assembler Innards

Unread postby Mike Lobanovsky » Tue Mar 01, 2011 3:47 am

Those of you who keep track of what goes on in the Asm blocks of my scripts may wonder why I often place my intrinsic assembler subroutines within the .data section of an Asm block rather than at the end of its .code section, after the "ret" instruction that exits the Asm block and returns control to Fbsl.

Well, firstly, the .data section, when defined together with a corresponding .code section, in fact makes the virtual machine skip everything that resides in the .data section on entry and set the Asm sub/function entry point to the instruction that immediately follows the .code directive. Thus I point the instruction pointer exactly at the instruction I want executed first, without any additional jumps.
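
A rough sketch of the idea in C, with names invented for this post (this is not my actual loader code): whatever the parser assembles for the .data section stays in the opcode buffer, but the recorded entry point makes execution start right after the .code directive.

Code: Select all
/* Sketch only -- invented names, not Fbsl's real loader sources.      */
#include <stddef.h>

typedef struct {
    unsigned char code[4096]; /* assembled opcode bytes                */
    size_t        len;        /* bytes emitted so far                  */
    size_t        entry;      /* offset execution will start from      */
} asm_block;

/* Called when the parser meets the ".code" directive: everything
   emitted so far belongs to the .data section and is never reached
   on entry, because execution starts at 'entry'.                      */
static void on_code_directive(asm_block *b)
{
    b->entry = b->len;        /* entry point = first byte after .data  */
}

/* When the block is run, control goes straight to 'entry'. (In real
   life the buffer must of course live in executable memory.)          */
static void run_block(asm_block *b)
{
    void (*fn)(void) = (void (*)(void))(b->code + b->entry);
    fn();
}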

OK, that's clear. But why not at the end of the Asm block?

You see, like everything else in the computer world, the assembly process progresses gradually from the beginning of the Asm block to its matching End Asm. Assembler subprocedures within Asm blocks are defined by their labeled names and end with their respective "ret" instructions that return control to the main Asm code. Labels are in fact mere pointers in human-friendly form; the asm parser replaces them with their exact offset values, which it calculates on the fly at app load time. Ideally, to make parsing as fast and loading as short as possible, there should be just one parsing pass which resolves all labels to their respective offsets in one go.

But!

If a call to a subprocedure label occurs before the parser encounters the label proper, which is located at the end of the Asm block, the parser cannot know the exact offset to that label until it finally meets it in the flesh! So what do I do? I reserve 8 zero bytes for this yet-to-be-determined offset and go on parsing: filling in opcodes, keeping track of my displacement, and resolving other label references whose offsets I already know (e.g. backward references to the data defined in the .data section of the Asm block), until I finally come across the subprocedure label in question. Then I simply store its offset, which I am now able to determine, and go on parsing to the end of the Asm block.
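
Sketched in C (made-up names, purely to illustrate the bookkeeping, not my actual structures): an unresolved reference gets an 8-byte blank in the opcode stream plus a fixup record telling pass 2 where that blank lives and which label it is waiting for.

Code: Select all
/* Illustration only -- invented names, not Fbsl's real structures.    */
#include <string.h>
#include <stddef.h>

#define RESERVE_BYTES 8            /* blank reserved for an unknown offset */

typedef struct {
    char   label[64];              /* label the parser has not met yet     */
    size_t patch_at;               /* where the blank starts in the buffer */
} fixup;

typedef struct {
    unsigned char buf[4096];       /* opcode being assembled               */
    size_t        len;
    fixup         fixups[128];     /* pending forward references           */
    size_t        nfixups;
} assembler;

/* Forward reference: the target label has not been seen yet, so emit
   8 zero bytes in its place and queue a fixup for pass 2.
   (Bounds checks omitted for brevity.)                                 */
static void emit_forward_ref(assembler *a, const char *label)
{
    fixup *f = &a->fixups[a->nfixups++];
    strncpy(f->label, label, sizeof f->label - 1);
    f->label[sizeof f->label - 1] = '\0';
    f->patch_at = a->len;
    memset(a->buf + a->len, 0, RESERVE_BYTES);   /* the reserved blank */
    a->len += RESERVE_BYTES;
}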

If, on completing pass 1, I find that there were no such forward references at all, I finalise the overall loading process and launch execution.

If at least one such forward reference remained unresolved during pass 1, I launch pass 2 of my asm code parser, which attends to the reserved 8-byte spaces only, filling them with the required forward offsets that I stored in pass 1. If at least one previously reserved 8-byte space remains unfilled, i.e. has no matching offset ready, it means the programmer has mistakenly made a reference to a non-existent label. In this case, loading stops and a corresponding error message is sent to the console. If everything is OK, meaning all labels are resolved successfully, then execution may finally begin... And the actual parser speed with such a "one-and-a-half-pass" approach is currently up to 2MB of asm source code per second.
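
And the "half" pass itself, again as a mere sketch with invented names rather than my real sources: it never re-parses anything, it only walks the fixup list, looks each label up in the table built during pass 1, and either fills the blank or reports the offending label.

Code: Select all
/* Sketch of pass 2 -- invented names. It only visits the reserved
   blanks; nothing gets re-parsed or re-assembled.                      */
#include <stdio.h>
#include <string.h>
#include <stddef.h>

typedef struct { char name[64]; size_t offset; } label_entry;
typedef struct { char label[64]; size_t patch_at; } fixup;

/* Returns 0 on success, -1 if a fixup names a label that was never
   defined in the Asm block (a programmer's error: loading must stop).  */
static int patch_forward_refs(unsigned char *buf,
                              const fixup *fixups, size_t nfixups,
                              const label_entry *labels, size_t nlabels)
{
    for (size_t i = 0; i < nfixups; i++) {
        const label_entry *hit = NULL;
        for (size_t j = 0; j < nlabels; j++) {
            if (strcmp(labels[j].name, fixups[i].label) == 0) {
                hit = &labels[j];
                break;
            }
        }
        if (hit == NULL) {
            fprintf(stderr, "Reference to non-existent label: %s\n",
                    fixups[i].label);
            return -1;
        }
        /* Fill the reserved blank with the offset stored in pass 1.
           Note it may occupy fewer bytes than were reserved.            */
        memcpy(buf + fixups[i].patch_at, &hit->offset, sizeof hit->offset);
    }
    return 0;
}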

But again!

The actual offset value, once found, may take less space to express than the 8 zero bytes I reserved earlier, and the leftover zero bytes may stall the CPU, because it treats opcode 0 as a valid instruction while no valid instruction belongs at this very place! What can I do to change this situation for the better?

Well, obviously I can fill the 8-byte blanks with opcode &H90 instead of opcode 0. &H90 is a "nop" instruction that has no effect other than incrementing the CPU instruction pointer until a meaningful instruction is encountered. Is it a solution? Yes, it is. But it is a cheap solution. Each "nop" takes 1 CPU cycle to execute, and the number of such successive "nops" in a reserved blank may reach 6. Six extra CPU cycles per forward reference... An infinitesimal loss by Fbsl standards but an unpardonable and shameful waste by assembler ones!
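
In the same sketchy C as above, the cheap fix is literally a one-liner (hypothetical helper, of course):

Code: Select all
/* The "cheap" fix in sketch form: whatever part of the reserved blank
   the resolved operand does not use is filled with single-byte nops
   (&H90) instead of zeroes, so the CPU just steps over it.             */
#include <string.h>
#include <stddef.h>

static void pad_with_single_nops(unsigned char *gap, size_t unused)
{
    memset(gap, 0x90, unused);   /* up to 6 one-byte "nop"s in a row    */
}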

I SWEAR, HONEST TO GOD, I DIDN'T PICK UP THE FOLLOWING IDEA ANYWHERE! I CONCEIVED AND NURTURED IT ALL BY MYSELF!

Luckily, asm is very, very old, almost as old as the computer world itself. It has seen many funny things, and some of those things are still available in its structure. Or perhaps somebody is deliberately preserving them for cases like this. There are a number of instructions that are very useful and fast. The funny thing about them is that they can also be empty, dummy, false or whatever you may call them, i.e. they actually do nothing to your register values and flags except increment the instruction pointer by the number of opcode bytes they occupy. Thus they act as a sort of "multi-byte nop", and they execute at the exact same ultimate speed of 1 CPU clock! And their length can be selectively varied from 2 to 7 bytes, which is exactly what I need. So the current Dynamic Assembler implementation loses only 1 extra CPU clock, while still losing up to 6 extra bytes of resulting opcode length per forward reference.
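
For illustration, here is one well-known family of such multi-byte "nops" (these particular encodings are the ones recommended in Intel's optimization manual and emitted by GCC/GAS for code alignment; I list them only to show the principle, not as a dump of what Dynamic Assembler actually emits):

Code: Select all
/* One well-known family of multi-byte "nops" (the encodings from
   Intel's optimization manual, the same ones GCC/GAS emit for code
   alignment). Each entry is a single instruction that does nothing
   except advance the instruction pointer by its own length.           */
#include <string.h>
#include <stddef.h>

static const unsigned char nop2[] = { 0x66, 0x90 };                               /* xchg ax, ax           */
static const unsigned char nop3[] = { 0x0F, 0x1F, 0x00 };                         /* nop dword [eax]       */
static const unsigned char nop4[] = { 0x0F, 0x1F, 0x40, 0x00 };                   /* nop dword [eax+0]     */
static const unsigned char nop5[] = { 0x0F, 0x1F, 0x44, 0x00, 0x00 };             /* nop dword [eax+eax+0] */
static const unsigned char nop6[] = { 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00 };       /* nop word  [eax+eax+0] */
static const unsigned char nop7[] = { 0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00 }; /* nop dword [eax+disp32]*/

static const unsigned char *const nops[8] =
    { NULL, NULL, nop2, nop3, nop4, nop5, nop6, nop7 };

/* Cover a 2..7 byte gap with exactly one do-nothing instruction
   instead of a run of single-byte 0x90s.                              */
static void pad_with_one_nop(unsigned char *gap, size_t unused)
{
    if (unused >= 2 && unused <= 7)
        memcpy(gap, nops[unused], unused);
}

The win over six single-byte nops is that the CPU fetches, decodes and retires just one instruction for the whole gap.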

While studying some MinGW/GCC asm dumps much later on, I noticed that the GCC guys use the same trick to get some of their references straightened out and their opcode sections aligned to dword boundaries for better CPU throughput. So I am not calling myself an inventor; perhaps I just happened to re-invent another wheel. :D

This is why I am always trying to quietly squeeze as much of my data and code as possible into the .data section of my Asm blocks. Frankly, I'm still a little ashamed of those extra clocks and wasted bytes... :oops:

Having no forward references at all makes my opcode shorter and somewhat faster to run. And this is exactly why I am not currently flagging code placed into a Dynamic Assembler .data section with either a warning or an error.

My question is: DOES ANYONE KNOW a better way to get rid of this 1 extra clock of overhead and/or this 6-byte tail chaser (though it is actually a "head", because those fake-nop bytes precede the instruction that uses the forward reference) :?: And if such a better way exists, won't it mean running extra assembler passes, seeing that my (or rather, our) Dynamic Assembler is JIT and we can't trade precious app load time for the questionable benefit of a few lazy extra passes :?: We can't afford this :!:
Mike
"Я старый солдат, мадам, и не знаю слов любви."
"I am an old soldier, ma'am, and I don't know the words of love."
"Je suis un vieux soldat, madame, et je ne connais pas les mots d'amour."
"Ich bin ein alter Soldat, gnädige Frau, und ich weiß nicht die Worte der Liebe."

__________________________________________________________________________________________________________________________________________________
(3.2GHz i5 Core Quad, 8GB RAM / 2 x nVidia GTX 550Ti SLI-bridged, 2GB VRAM)
(x86 Win XP Pro Russian Sp3/x86 Win Vista Ultimate Sp2/x64 Win 7 Ultimate Sp1/Wine in x64 elementaryOS Luna)

Re: Some Revelations Regarding Dynamic Assembler Innards

Unread postby Mike Lobanovsky » Wed Mar 02, 2011 6:55 am

The problem is essentially in the jump/call instructions, which may have varying byte lengths depending on, for example, how far away their jump target actually is. If, say, the jump is short (-128 to +127 bytes), its length is 2 bytes; if it is near (farther than that in either direction), the length is 5 bytes. With future (a.k.a. forward) references and a straightforward brute-force approach, this requires reshuffling everything that has already been assembled every time such a reference is finally resolved, purely because too much within an opcode byte sequence relies heavily on the relative offsets of its constituent parts. This is roughly equivalent to running as many assembly passes as there are future references in your code.
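
In concrete encoding terms (32-bit code, unconditional jmp only; sketchy C with invented names, not my parser sources):

Code: Select all
/* The size problem in concrete terms: the same "jmp target" assembles
   into a different number of bytes depending on how far away the
   target turns out to be.                                              */
#include <string.h>
#include <stdint.h>
#include <stddef.h>

/* Emits a jmp from offset 'at' to offset 'target' and returns its
   length: 2 bytes (EB rel8) if the displacement fits in a signed byte,
   5 bytes (E9 rel32) otherwise. Displacements are counted from the
   end of the jump instruction itself.                                  */
static size_t emit_jmp(unsigned char *buf, int32_t at, int32_t target)
{
    int32_t short_disp = target - (at + 2);
    if (short_disp >= -128 && short_disp <= 127) {
        buf[0] = 0xEB;                        /* jmp rel8  (short jump) */
        buf[1] = (unsigned char)short_disp;
        return 2;
    }
    int32_t near_disp = target - (at + 5);
    buf[0] = 0xE9;                            /* jmp rel32 (near jump)  */
    memcpy(buf + 1, &near_disp, 4);
    return 5;
}

With a forward reference the assembler simply cannot know at emit time which of the two forms it will end up needing, which is exactly what forces either the reserved blank or a full re-shuffle.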

Basically, this is a question of optimum strategy. The current implementation of the so-called "recursive descent parser" I'm using to parse the asm source lets me parse and assemble it at roughly 2 to 3MB of source code per second, sometimes a little faster. That isn't much compared to the 10MB per second and more that can be benchmarked for some existing assemblers, but it is quite enough to make the loading time imperceptible for the relatively short and scarce Asm blocks used in Fbsl. Still, I am cutting corners a little to achieve this. The age-old assembler-design methodology requires that, for a 2-pass assembler like this, pass 1 be used to calculate the individual instruction lengths, build the symbol table of labels, and intercept label-related errors, while pass 2 runs the actual assembly proper, recalculating lengths in view of the references resolved and compacting the output into a dense opcode string without any blanks between adjacent instructions.

Regretfully, almost all of this was written in the days when there were no desperados like us who would want to squeeze an assembler feature into an interpreter. Everybody was happily assembling something for their 8KB-RAM ZX Spectrums, using handy household appliances like cassette recorders for media storage, and time really didn't matter much... :)

Our Dynamic Assembler belongs to the so-called "load-and-go" or "just-in-time" (JIT) family, where time does matter, and it had better be low.

If I follow the methodology prescribed above, pass 2 will be at least as computationally intensive as pass 1. This may bring the assembly speed down to below 1MB per second. It will also expand the Fbsl binary by another 4 or 5KB, which could otherwise be wisely spent on making the Dynamic Assembler more user-friendly and/or richer in features.

For now, pass 2 is as "light" as possible. The penalty is 1 extra CPU clock and up to 6 extra bytes per forward reference, which is pure waste by asm standards but which is IMHO almost undetectable against the speeds of Fbsl's interpreted environment and the overall amount of memory it uses. Currently, my average Asm block assembles to around a hundred bytes of opcode, of which up to some 8 per cent are those extra "nop" bytes, with 2 or 3 extra CPU clocks for the couple or so forward references I absolutely can't avoid.

I think that breaking this vicious circle of speed-for-quality trade-offs and following the academic guidelines is possible only if I rewrite the highly interlaced parser/assembler duo completely, using some other, faster approaches as yet unknown to me. Frankly, this would be a helluva lot of work and alpha/beta/gamma/.../omega headaches all over again...

Please advise.
Mike
"Я старый солдат, мадам, и не знаю слов любви."
"I am an old soldier, ma'am, and I don't know the words of love."
"Je suis un vieux soldat, madame, et je ne connais pas les mots d'amour."
"Ich bin ein alter Soldat, gnädige Frau, und ich weiß nicht die Worte der Liebe."

__________________________________________________________________________________________________________________________________________________
(3.2GHz i5 Core Quad, 8GB RAM / 2 x nVidia GTX 550Ti SLI-bridged, 2GB VRAM)
(x86 Win XP Pro Russian Sp3/x86 Win Vista Ultimate Sp2/x64 Win 7 Ultimate Sp1/Wine in x64 elementaryOS Luna)
User avatar
Mike Lobanovsky
FBSL Administrator
FBSL Administrator
 
Posts: 1823
Joined: Tue Apr 19, 2005 8:22 am
Location: Republic of Belarus

Re: [SOLVED] Revelations Regarding Dynamic Assembler Innards

Unread postby Mike Lobanovsky » Sat Mar 05, 2011 11:35 pm

Hi everybody,

Thanks for your suggestions and general buoyancy. The problem is solved.
Mike
"Я старый солдат, мадам, и не знаю слов любви."
"I am an old soldier, ma'am, and I don't know the words of love."
"Je suis un vieux soldat, madame, et je ne connais pas les mots d'amour."
"Ich bin ein alter Soldat, gnädige Frau, und ich weiß nicht die Worte der Liebe."

__________________________________________________________________________________________________________________________________________________
(3.2GHz i5 Core Quad, 8GB RAM / 2 x nVidia GTX 550Ti SLI-bridged, 2GB VRAM)
(x86 Win XP Pro Russian Sp3/x86 Win Vista Ultimate Sp2/x64 Win 7 Ultimate Sp1/Wine in x64 elementaryOS Luna)
User avatar
Mike Lobanovsky
FBSL Administrator
FBSL Administrator
 
Posts: 1823
Joined: Tue Apr 19, 2005 8:22 am
Location: Republic of Belarus

Re: [SOLVED] Revelations Regarding Dynamic Assembler Innards

Unread postby Gerome » Sun Mar 06, 2011 5:29 pm

Mike Lobanovsky wrote: Hi everybody,

Thanks for your suggestions and general buoyancy. The problem is solved.


Mike,

You're a champion! :)
Yours,

(¯`·._.·[Gerome GUILLEMIN]·._.·´¯)
:: Full SETUP w. HELP 05th of December 2011 ::
http://www.fbsl.net/setup/FBSLv3.exe [full v3.4.10 installation pack]
http://www.fbsl.net/setup/FBSLv3bin.zip [minimal upgrade to v3.4.10]
Laissons les jolies femmes aux hommes sans imagination. / Let us leave pretty women to men without imagination.(M.Proust)
The success is a defeat for the one who does not want to dance any more! (H.F. Thiefaine)