|
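After finishing the first version of the emulator, the test results were not great.
On the STM32F429, even heavily optimized code can barely sustain an emulated clock of about 300 kHz,
which is far short of the 1.2 MHz target.
As folks here pointed out, on an STM32 that actually runs at 168 MHz this is an execution ratio of roughly 560:1, which is far too inefficient ...
So today, before a thorough optimization pass, I am checking the efficiency of the instruction decoder.
Judging from the ARM assembly listing produced by the C compiler, the cause of the inefficiency is mostly clear:
for the 255 8051 instructions, the compiler turned the switch into a table lookup, so dispatch goes almost directly to the handler code,
but the SFR register accesses that follow do not get the same treatment: they become a long chain of cmp instructions with a very low hit rate,
and the operations on the emulated 8051 registers turn into a pile of registers shuffling data back and forth to exchange and compute values,
which is what makes execution so slow.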
- 8002130: f7fe bc6f b.w 8000a12 <alu_instruction_decode+0x322>
- 8002134: 28e0 cmp r0, #224 ; 0xe0
- 8002136: f43f ae5f beq.w 8001df8 <alu_instruction_decode+0x1708>
- 800213a: 28f0 cmp r0, #240 ; 0xf0
- 800213c: f041 8332 bne.w 80037a4 <alu_instruction_decode+0x30b4>
- 8002140: 4b56 ldr r3, [pc, #344] ; (800229c <alu_instruction_decode+0x1bac>)
- 8002142: 4a55 ldr r2, [pc, #340] ; (8002298 <alu_instruction_decode+0x1ba8>)
- 8002144: 7818 ldrb r0, [r3, #0]
- 8002146: 4e52 ldr r6, [pc, #328] ; (8002290 <alu_instruction_decode+0x1ba0>)
- 8002148: 4b52 ldr r3, [pc, #328] ; (8002294 <alu_instruction_decode+0x1ba4>)
- 800214a: 7010 strb r0, [r2, #0]
- 800214c: 7018 strb r0, [r3, #0]
- 800214e: f996 3000 ldrsb.w r3, [r6]
- 8002152: f7fe bc5e b.w 8000a12 <alu_instruction_decode+0x322>
- 8002156: 28d0 cmp r0, #208 ; 0xd0
- 8002158: f001 8666 beq.w 8003e28 <alu_instruction_decode+0x3738>
- 800215c: d83d bhi.n 80021da <alu_instruction_decode+0x1aea>
- 800215e: 2882 cmp r0, #130 ; 0x82
- 8002160: f001 8656 beq.w 8003e10 <alu_instruction_decode+0x3720>
- 8002164: 2883 cmp r0, #131 ; 0x83
- 8002166: d12a bne.n 80021be <alu_instruction_decode+0x1ace>
- 8002168: 4a48 ldr r2, [pc, #288] ; (800228c <alu_instruction_decode+0x1b9c>)
- 800216a: 4b4a ldr r3, [pc, #296] ; (8002294 <alu_instruction_decode+0x1ba4>)
- 800216c: 6811 ldr r1, [r2, #0]
- 800216e: 781a ldrb r2, [r3, #0]
- 8002170: 7808 ldrb r0, [r1, #0]
- 8002172: 4e47 ldr r6, [pc, #284] ; (8002290 <alu_instruction_decode+0x1ba0>)
- 8002174: 4050 eors r0, r2
- 8002176: 7018 strb r0, [r3, #0]
- 8002178: f996 3000 ldrsb.w r3, [r6]
- 800217c: f7fe bc49 b.w 8000a12 <alu_instruction_decode+0x322>
- 8002180: 2881 cmp r0, #129 ; 0x81
- 8002182: f041 8334 bne.w 80037ee <alu_instruction_decode+0x30fe>
- 8002186: 4b43 ldr r3, [pc, #268] ; (8002294 <alu_instruction_decode+0x1ba4>)
- 8002188: 4a3f ldr r2, [pc, #252] ; (8002288 <alu_instruction_decode+0x1b98>)
- 800218a: 7818 ldrb r0, [r3, #0]
- 800218c: 7812 ldrb r2, [r2, #0]
- 800218e: 4e40 ldr r6, [pc, #256] ; (8002290 <alu_instruction_decode+0x1ba0>)
- 8002190: 4310 orrs r0, r2
- 8002192: 7018 strb r0, [r3, #0]
- 8002194: f996 3000 ldrsb.w r3, [r6]
- 8002198: f7fe bc3b b.w 8000a12 <alu_instruction_decode+0x322>
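The optimization goals for now are:
use table lookup for accesses to the 8051 SFR registers as well; if C really cannot do it, fall back to inline asm.
Then there is the switch expansion problem to solve: do I really have to build a giant 64K table to stop the ARM compiler from forcibly expanding it?
I will first write a script that expands the 8051 macro instruction table and see how that goes. |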
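As a rough idea of what table lookup for SFR access could look like in C, here is a minimal sketch. Every name in it (`sfr_mem`, `sfr_read_table`, `sfr_write_acc`, and so on) is hypothetical and not taken from the emulator's real source. It assumes SFRs live at direct addresses 0x80..0xFF and that only a few of them need side effects on write; ACC updating the parity bit in PSW is used as the example.

```c
#include <stdint.h>

/* Hypothetical emulator state; names are illustrative only. */
static uint8_t sfr_mem[128];              /* backing store for 0x80..0xFF */

typedef uint8_t (*sfr_read_fn)(uint8_t addr);
typedef void    (*sfr_write_fn)(uint8_t addr, uint8_t value);

#define SFR_PSW 0xD0
#define SFR_ACC 0xE0
#define SFR_IDX(addr) ((addr) & 0x7F)     /* map 0x80..0xFF to 0..127 */

/* Default handlers: plain load/store, no side effects. */
static uint8_t sfr_read_plain(uint8_t addr)
{
    return sfr_mem[SFR_IDX(addr)];
}

static void sfr_write_plain(uint8_t addr, uint8_t value)
{
    sfr_mem[SFR_IDX(addr)] = value;
}

/* Example of an SFR with a side effect: a write to ACC also
 * refreshes the parity flag P (PSW bit 0). */
static void sfr_write_acc(uint8_t addr, uint8_t value)
{
    uint8_t p = value;
    p ^= p >> 4;
    p ^= p >> 2;
    p ^= p >> 1;                          /* bit 0 is 1 for an odd count of 1 bits */
    sfr_mem[SFR_IDX(addr)] = value;
    sfr_mem[SFR_IDX(SFR_PSW)] =
        (uint8_t)((sfr_mem[SFR_IDX(SFR_PSW)] & 0xFE) | (p & 1));
}

/* One handler per SFR address: a single indexed load replaces the cmp chain. */
static sfr_read_fn  sfr_read_table[128];
static sfr_write_fn sfr_write_table[128];

void sfr_tables_init(void)
{
    for (int i = 0; i < 128; i++) {
        sfr_read_table[i]  = sfr_read_plain;
        sfr_write_table[i] = sfr_write_plain;
    }
    sfr_write_table[SFR_IDX(SFR_ACC)] = sfr_write_acc;
}

static inline uint8_t sfr_read(uint8_t addr)
{
    return sfr_read_table[SFR_IDX(addr)](addr);
}

static inline void sfr_write(uint8_t addr, uint8_t value)
{
    sfr_write_table[SFR_IDX(addr)](addr, value);
}
```

After calling `sfr_tables_init()` once, an instruction handler would call `sfr_write(0xE0, value)` instead of comparing the address against each known SFR in turn. The indirect call has a cost of its own, so for SFRs without side effects a direct access to the array may well be faster. Whether this actually beats the compiler's output would have to be measured on the target.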
|