Heya Acksys, I don't think there's anything yet, so I'll give it a go myself.
I can't explain everything as I don't know myself, but what I do know I'll share:
( A ) on this image I'm not sure about so I'll leave it.
( B ) - This is the per-frame breakdown of all profiler tags in code.
orxPROFILER_PUSH_MARKER("Name Goes Here");
/*... code here ...*/
orxPROFILER_POP_MARKER();
This will show up (if you have the profiler enabled) in either Debug builds, or Profiler builds.
[Render]
ShowProfiler = true
Your Push and Pop markers must be evenly matched (no push without a pop and vice versa) and you cannot embed the same marker 'inside' itself. ("markername1" & "markername2" are fine, but not another "markername1")
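To illustrate the matching/nesting rules, here's a minimal sketch (the "AStar_*" sub-marker names are made up for this example; only "AStar" comes from this thread):
orxPROFILER_PUSH_MARKER("AStar");
{
    orxPROFILER_PUSH_MARKER("AStar_BuildOpenList");
    /* ... build the open list ... */
    orxPROFILER_POP_MARKER();

    orxPROFILER_PUSH_MARKER("AStar_Search");
    /* ... run the actual search ... */
    orxPROFILER_POP_MARKER();
}
orxPROFILER_POP_MARKER();
Every push has a matching pop, and the nested markers use different names than the enclosing "AStar" one, so they'll show up as a small hierarchy in the ( B ) section.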
Okay, that's enough on using them until I can think of more; on to reading the ( B ) section in a bit more detail.
All the sub-B items refer to the "AStar" marker's row (below), in case that's not clear.
(b1) is either a minus (-) symbol, or one or more plus (+) symbols.
A minus means this marker has not been seen in the previous frame.
The number of plus symbols indicates how deep this particular marker sits in the hierarchy of pushed markers. In my example image, "orxRender_RenderAll" is a first-level marker; within its timing, "orxRender_RenderViewport" is timed as a subsection, and so on down to "orxRender_RenderObject", etc.
(b2) is the name of the marker, as defined in the code above. (My own personal marker in that list is the "AStar" marker.)
These next two I might be incorrect about, but my thinking on the matter has proven useful so far; hopefully iarwain will correct any misconceptions.
(b3) This is the average time of each run of this marker over the last frame.
(b4) This is the highest seen time for this marker. (Not sure of the reset/time frame on this)
(b5) This is the number of times this marker was seen within the last frame.
Finally, the ( C ) section. Again, this one I'm not 100% sure on, but I believe this is a breakdown of the total time taken for the last frame. The colours of each block match directly to the colours of each marker in the ( B ) section. Each block is scaled proportionally to the section above it, based upon the percentage of time it takes inside that section. (ie: The grey '-=orxPROFILER=-' section is one frame, and the red section below that is the 'orxRender_RenderAll' marker, taking its hefty percentage of the total frame time.)
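To put rough numbers on that scaling: if the whole frame took 80ms and a marker accounted for ~62ms of it (numbers that come up later in this thread), its block would be drawn at roughly 62 / 80 ≈ 78% of the width of the block above it.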
Thanks, Grey, that's a very useful explanation. I would like to improve performance and am trying to learn Orx Profiler and Mac OS's Instruments utility.
Hopefully iarwain can clarify b3-b5, but for the moment I think it's enough to know that higher numbers mean it's taking longer!
Grey did a pretty good analysis. I'll add a few details:
- A & B are both showing markers stats. On the B side are the markers which are always called by (ie. child of) the same marker, everytime. This way we know where to place them in the tree. On the A side are all the markers that have been called by (ie. child of) more than one unique marker. As we don't know where to place them in the tree for the info to be meaningful, we place them on the side. They'll usually account for gaps that can been seen in the tree representation.
- The top left grey bar is how much time was spent processing the last frame. Its length is also used as a reference for all other bars; this way it's easy to see how much time was spent inside one marker, relative to the full frame.
- b3 is the time spent inside the marker during the frame that has just been processed; it's not an average, it's the very precise time spent there. So this number can vary a lot and very fast, which brings us to:
- b4: it's the longest time spent in this marker over a period of one second, ie. we clear it every second and it then displays the longest time spent in that marker until the next second tick, and so on, like outdoor "memory" thermometers, if you will.
- b5: it's the number of times we entered (and exited) that marker during the last processed frame (ie. the number of calls, if you place a marker so that it starts at the top of a function and ends before its return).
Some more details: by default, as Grey mentioned, the debug and profile versions of orx come instrumented (ie. markers are used internally inside orx). The release version won't have any markers used inside orx, but they can still be used by users for profiling their own code!
If for some reason you don't want orx's internal markers in the debug version, look at the first line of orxProfiler.h: there's a comment telling you how to deactivate it (namely, commenting out a define so that __orxPROFILER__ is not defined).
The same define can be used on the user/game side to enable/disable the marker macros.
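Roughly, the toggle in question looks like this (a sketch of the idea, not the literal contents of orxProfiler.h):
/* Comment this out to compile all orxPROFILER_PUSH_MARKER / orxPROFILER_POP_MARKER macros away */
#define __orxPROFILER__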
If someone wants to dump all the info to disk, to make stats or even to pause the game when one specific marker is going crazy, it can all be queried through the orxProfiler API.
A good example of that is the custom render plugin, where I wrote the code for rendering the profiler on-screen.
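As a rough sketch of what such a dump could look like each frame (the function and constant names below are recalled from memory and should be double-checked against orxProfiler.h before use; the times are assumed to be in seconds):
static void DumpProfilerInfo(void)
{
  orxS32 s32MarkerID;

  /* Iterates over all markers registered so far (iterator/sentinel names to be verified in orxProfiler.h) */
  for(s32MarkerID = orxProfiler_GetNextMarkerID(orxPROFILER_KS32_MARKER_ID_NONE);
      s32MarkerID != orxPROFILER_KS32_MARKER_ID_NONE;
      s32MarkerID = orxProfiler_GetNextMarkerID(s32MarkerID))
  {
    /* Logs name, time spent last frame (b3), max time over the last second (b4) and push count (b5) */
    orxLOG("%s: %.3fms (max %.3fms, %d pushes)",
           orxProfiler_GetMarkerName(s32MarkerID),
           1000.0 * orxProfiler_GetMarkerTime(s32MarkerID),
           1000.0 * orxProfiler_GetMarkerMaxTime(s32MarkerID),
           (int)orxProfiler_GetMarkerPushCounter(s32MarkerID));
  }
}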
Lastly, a greyed-out marker (in either B or A) means that it has been called at least once since the start of the application (that's how we know about it) but wasn't used during the last frame. Markers we never encounter while running the game will never show up there, as they never get registered with the orxProfiler module.
Now, what do we do with all this information? Well, some of it is obvious to interpret (especially the markers added by the user, of course), but here's what might not be as obvious. Let's use Grey's example for a quick analysis.
- This frame took 80ms to process but over the last second, it went as high as 579ms. Something's going on, let's see if we can find something.
- The AStar marker looks pretty stable, nothing to signal there.
- orxRender_RenderViewport is called twice: there are two viewports being rendered on screen. Sometimes one can get a surprise here (it happened to me when I forgot that Scroll created a default viewport and added mine on top of it with the same camera -> everything was rendered twice!). It might not be the case here, of course. Also, we spent ~62ms trying to render those 2 viewports.
- But we only spent ~9ms actually rendering the objects (orxRender_RenderObject). We can see in the tree representation that a lot of time is spent *before* we render the first object. That probably comes from the sorting + culling done before rendering them: as the render plugin isn't using any partitioner yet, this operation can get expensive with huge collections of objects. Here we have 1 object per tile, so a better way to remove probably ~50ms of overhead would be to have only 1 object for the map and then do some custom rendering ourselves by intercepting the render event associated with it (see the sketch after this list), or even do all the rendering in a shader, for even greater efficiency. Also, orxRender_RenderObject is called 2646 times, which means that after sorting/culling we still send 2646 objects for rendering. That's a huge number, even by big 3D game engine standards (for example, I think I remember we were "only" sending ~600-700 objects per frame down the render lane in Splinter Cell: Conviction).
- orxRender_RenderProfiler took ~0.75ms this frame but went as high as 10ms over the last second. As rendering the profiler is a rather constant task, it should never vary that much. That means an external factor affected us somehow (it can be another app needing more resources, disk accesses blocking the rest of the processing, etc...).
- orxDisplay_Swap took 13ms, and up to 24ms over the last second. That's where we actually wait for VSync if we asked for it, in order to be synchronized with the display. If we miss a VSync, this value will greatly increase all of a sudden. When not waiting for VSync, that's the time we have to wait for the GPU to finish working/swapping buffers. If we are GPU bound, that's where it'll show.
- orxConfig_SelectSection / orxBank_Allocate / orxBank_Free / orxFrame_Process: those are used a bit everywhere in the code, but also extensively in the render plugin. Having ~10,000 objects would explain why all of those have been called many, many times. Also, please note that orxBank_Allocate / orxBank_Free are still pretty efficient considering how many times they've been called in that frame. That's because we are using a "low watermark" approach and orx is not doing dynamic allocations every time we create/delete something, otherwise performance would become rather poor very quickly! :P In this case, either using a partitioner in the render plugin or reducing the number of objects in the scene (like 1 object for the map vs 10,000+) will help tremendously!
- orxTexture_CreateFromFile: oh hey you, why have you been called? That means we're probably still reading textures from disk, which would explain the wide variation over the last second. Whether that's because something isn't kept in cache or because we're at the beginning of the application, I can't say just from the screenshot! Though I can make an educated guess from the fact that we've spent 6ms loading config files over the last second, which means it's probably the start of the game (or the start of the level, etc...).
- orxDisplay_DrawArrays: this is the number of batches we have sent to the GPU. The smaller this number, the better. That's also one of the main bottlenecks on mobile platforms and with recent 3D-accelerated engines. But don't fear: the profiler display can be accounted for 19 batches all by itself. Yes, I know, things could probably get organized much better; I'll check it out at some point. But the good news is that we only use 3 batches for the game rendering itself: for displaying 2646 objects, that's not too bad!
If you have any questions, just let me know!
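For the map-rendering idea mentioned above, a minimal sketch of intercepting the render event for a single map object could look like this (the object name "Map" and the handler names are made up for the example; the event/enum names should be checked against orxRender.h):
static orxSTATUS orxFASTCALL MapRenderHandler(const orxEVENT *_pstEvent)
{
  orxSTATUS eResult = orxSTATUS_SUCCESS;

  if(_pstEvent->eID == orxRENDER_EVENT_OBJECT_START)
  {
    orxOBJECT *pstObject = orxOBJECT(_pstEvent->hSender);

    /* Is this our single map object? ("Map" is only an example name) */
    if(orxString_Compare(orxObject_GetName(pstObject), "Map") == 0)
    {
      /* ... custom tile rendering goes here ... */

      /* Returning a failure status tells orx to skip its default rendering for this object */
      eResult = orxSTATUS_FAILURE;
    }
  }

  return eResult;
}

static orxSTATUS InitMapRendering(void)
{
  /* Registers the handler on render events during init */
  return orxEvent_AddHandler(orxEVENT_TYPE_RENDER, MapRenderHandler);
}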
haha, yes, this particular screenshot is taken immediately after the game starts, that is why orxTexture_CreateFromFile is there.
As for the object rendering: I plan to tweak that quite severely over the next few days and, as suggested, write myself a shader to handle the ground rendering.
Ahah, good idea! I didn't mean to sound negative, by the way; I was simply doing an analysis of the screenshot, and I think this was a great case as it showed a few symptoms (like capturing at the start of the app, too many objects, etc...). That makes it the perfect example!
Didn't think you sounded negative -- that particular image was chosen precisely because it showed a nice mix of things. And I'm always happy to have more feedback on my work-in-progress stuff.
Thanks to both for the help here. This is my first real need to optimize game code and I only have a basic idea of how it works.
Running the profiler for the first time my game was running about 28-30 fps with ~90-100 Objects in existence. The great majority of that was the Swap function. I set BlendMode = none wherever I could and reduced the number of objects (I neglected to delete some unneeded invisible ones after transition into the scene). Now we are doing about 42 fps with ~80-90 objects in existence.
(All info so far is on iPod Touch 4, BTW. I'll get info from other devices once I get this one optimized.)
On mobile, blending (especially alpha) is really expensive; that's why it's actually much easier to have fast 3D rendering than 2D.
Any other obvious place where you could grab some ms?
The only other obvious place I see is the UI. I have a few UI info objects whose text/position are updated every frame. I'm going to try disabling the update of those completely and look at the effect on framerate. If it's significant, I'll put the UI object update on a second clock since it's not important they get updated every frame.
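For reference, a minimal sketch of what that second clock could look like (the names UpdateUI/InitUIClock are illustrative, and the exact orxClock_Create signature may differ between orx versions):
static void orxFASTCALL UpdateUI(const orxCLOCK_INFO *_pstClockInfo, void *_pContext)
{
  /* ... refresh UI text/positions here, 10 times per second instead of every frame ... */
}

static orxSTATUS InitUIClock(void)
{
  /* Creates a 10Hz user clock and registers the UI update on it */
  orxCLOCK *pstUIClock = orxClock_Create(orx2F(0.1f), orxCLOCK_TYPE_USER);
  return orxClock_Register(pstUIClock, UpdateUI, orxNULL, orxMODULE_ID_MAIN, orxCLOCK_PRIORITY_NORMAL);
}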
One interesting thing I noticed is just removing the background results in quite a few FPS gained. I guess that's because the GPU can leave all those pixels black instead of drawing them?
I was trying to follow this thread: https://forum.orx-project.org/discussion/3235. I guess this doesn't effect me since I set FrustumWidth and FrustumHeight to the dimensions of the Retina display (640x960)?
Yes, overlapping pixels are also expensive on mobile with 2D rendering. The more times a pixel gets drawn, the worse the performance is going to be: limiting the overdraw is definitely a good way of grabbing a few more FPS.
As for Godexsoft's issue, it's as improbable as yours with the event send counter, so I have no idea why it happened to him! Compiler bugs should not be excluded (I've experienced some first-hand last month on Vita with Sony's compiler), though I doubt that's the case here.
Another thing that can help with framerate: using compressed textures when extra quality doesn't matter (only works on iOS for now). Also, using spritesheets (aka texture atlases) to allow bigger rendering batches is a good idea. Same for texture size: non-power-of-two textures might be slower than power-of-two ones.
Would you mind posting a profiler shot somewhere?
http://imgur.com/jMJpY
http://imgur.com/KIkp1
http://imgur.com/9rfBQ
http://imgur.com/JhLeb
These are after making the first set of changes I mentioned. I have NOT done the analysis of the UI objects I mentioned in my last email yet, though.
- you have a lot of hashtable queries, not sure where they come from; it might be worth tracking down
- you have an unusually high number of batches sent, almost 1 batch per object. Once the reason is found, you should be able to get 4ms back there with no code changes.
For the batching problem, using sprite sheets (aka texture atlas) will help a lot. Also if you use different contexts (defined by applied shaders, smoothing, blending, texture), try to group objects that use the same context at the same depth (Z axis). Not sure if what I just said is clear or not! ^^
EDIT: The unusually large orxDisplay_Swap() in the last frame comes from the fact rendering probably wasn't finished before a VSync and had to wait for the next one (16.6ms later). As far as I know, there's no way to deactivate VSync on iOS.
I have no idea about the hashtable queries, so I guess I can do some debugging and see where they come from.
Our artist sent me individual textures, so currently no sprite sheets are in use. I think making them isn't a problem, though, especially for the uncompressed .png textures. Is there a recommended dimension for sprite sheets (textures are at 326ppi for the Retina display)?
I think I understand about Z-order. So, for example, if I had a group of objects with the same texture, same blending mode, and same alpha color value, it will reduce batching to keep those at the same Z coordinate?
Thanks for taking a look. That'll be very helpful.
> I have no idea about the hashtable queries, so I guess I can do some debugging and see where they come from.
So it probably comes from some orx internal module. Knowing which will be helpful to see if this number of calls can be reduced.
> Our artist sent me individual textures, so currently no sprite sheets are in use. I think making them isn't a problem, though, especially for the uncompressed .png textures. Is there a recommended dimension for sprite sheets (textures are at 326ppi for the Retina display)?
Ah yes, that will be a good improvement here, with not much effort. There are a few tools out there that will allow you to do that effortlessly.
Some are free and some are commercial, such as TexturePacker.
This one apparently supports templates for custom exports, so having it export .ini files should be pretty easy (and the author would be willing to integrate that template by default in his tool; it's just that I haven't had the time to write this template myself).
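As a purely hypothetical example of what such an exported .ini could look like (the section names are illustrative, and key names like TextureOrigin/TextureSize vary between orx versions, so check CreationTemplate.ini): two graphics sharing one atlas texture, each cutting out its own region.
[HeroIdleGraphic]
Texture = atlas.png
TextureOrigin = (0, 0, 0)
TextureSize = (64, 64, 0)

[HeroRunGraphic]
Texture = atlas.png
TextureOrigin = (64, 0, 0)
TextureSize = (64, 64, 0)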
> I think I understand about Z-order. So, for example, if I had a group of objects with the same texture, same blending mode, and same alpha color value, it will reduce batching to keep those at the same Z coordinate?
That's exactly it! Even if the Z coordinate isn't the same, as long as there aren't any objects using a different context that would get interleaved there, you should be good. In the default rendering, I sort all the objects by depth and then by all the context criteria I mentioned earlier, before sending them down the GPU pipe.
> Thanks for taking a look. That'll be very helpful.
My pleasure! I'm pretty excited myself to see Equilibria getting released.
Us, too! Things are taking much longer than originally expected, but I hope that means we have a better chance of releasing something outstanding... more updates soon.
Oh, is there a best way to optimize sprite sheets for Orx? For example, should they be kept to certain dimensions, or should objects that are related in some way be placed on the same sheet (besides the obvious case of consecutive frames in the same animation)?
Use powers of two for width/height; besides that, it doesn't matter where you place things in it.
And no, no other consideration, it's only to reduce the number of batches.
- A & B are both showing markers stats. On the B side are the markers which are always called by (ie. child of) the same marker, everytime. This way we know where to place them in the tree. On the A side are all the markers that have been called by (ie. child of) more than one unique marker. As we don't know where to place them in the tree for the info to be meaningful, we place them on the side. They'll usually account for gaps that can been seen in the tree representation.
- The top left grey bar is how much time was spent for processing the last frame. Its length is also used as a reference for all other bars, this way it's easy to see how much time was spent inside one marker, relatively to the full frame.
- b3 is the time spent inside the marker during the frame that has just been processed, it's not an average it's the very precise time spent there. So this number can vary a lot and very fast, which brings us to:
- b4: it's the longest time spent in this marker over a period of a second, ie. every second we clear it and it'll then display the longest time spent in that marker till the next second tick and so on, like outside "memory" thermometers, if you will.
- b5: it's the number of time we entered (and exited) that marker during the last processed frame (ie. the number of calls if you put a marker so that it'll start with a function and ends before its return).
Some more details: by default, as Grey mentioned, the debug and profile version of orx will come instrumented (ie. markers are internally used inside orx). The release version won't have any markers used inside orx but they still can be used by the users for profiling their own code!
If for some reason you don't want orx's internal markers in the debug version, look at the first line of orxProfiler.h, there's a comment telling you how to deactivate it (namely commenting a define so that __orxPROFILER__ is not defined).
The same define can be used on the user/game side to enable/disable the marker macros.
If someone wants to dump all the info on disk, for making stats or even for pausing the game when one specific marker is going crazy, all the info can be queried through the orxProfiler API.
A good example of that if the custom render plugin where I wrote the code for rendering the profiler on-screen.
Lastly all the greyed out markers (either in B or A) means that they've been called at least once since the start of the application (that's how we know about them) but they haven't been used during the last frame. If there are markers we never encounter while running the game, they'll never show up there as they'll never get registered to the orxProfiler module.
Now, what do we do with all those informations? Well, some are obvious to interprete (especially the markers added by the user of course), but here's also what might not be as obvious. Let's use Grey's example for a quick analysis.
- This frame took 80ms to process but over the last second, it went as high as 579ms. Something's going on, let's see if we can find something.
- The AStar marker looks pretty stable, nothing to signal there.
- orxRender_RenderViewport is called twice: there are two viewports being rendered on screen. Sometimes one can have some surprise here (which happened to me when I forgot that Scroll created a default viewport, for example, and that I added mine on top of it with the same camera -> everything was rendered twice! It might not be the case here, of course. Also we spent ~62ms trying to render those 2 viewports.
- But we only spend ~9ms actually rendering the objects (orxRender_RenderObject). We can see in the tree representation that a lot of time is spent *before* we render the first object. That probably comes from the sorting + culling before rendering them. If you have thousands and thousands of objects, as the render plugin isn't using any partitioner yet, this operation can get expensive with huge collections of objects. So here we have 1 object per tile, a better way to remove probably around ~50ms of overhead here would be to only have 1 object for the map and then do some custom rendering ourself by intercepting the render event associated to it, or even doing all the rendering in a shader, for even greater efficiency. Also orxRender_RenderObject is called 2646 times, which means after sorting/culling, we still send 2646 objects for rendering. That's a huge number, even by big 3D game engines (for example I think I remember we were "only" sending ~600-700 objects per frame down the render lane in Splinter Cell: Conviction).
- orxRender_RenderProfiler took ~0.75ms this frame but went as high as 10ms over the last second. As rendering the profiler is rather a constant task, it should never cary that much. That means an external factor affected us somehow (can be another app needing more resources, disk accesses blocking the rest of the processing, etc...)
- orxDisplay_Swap took 13ms and up to 24ms over the last frame. That's the time where we'll actually wait for VSync if we asked for it, in order to be synchronized with the display. If we miss a VSync, this value will greatly increase all of a sudden. When not waiting for VSync, that's the time we have to wait for the GPU to finish working/swapping buffers. If we are GPU bound, that's where it'll show.
- orxConfig_SelectSection / orxBank_Allocate / orxBank_Free / orxFrame_Process: those will be used a bit everywhere in the code but also extensively in the render plugin. Having ~ 10 000 thousands would explain why all those have been called many many times. Also, please note that orxBank_Allocate / orxBank_Free are still pretty efficient considering how many times they've been called in that frame. That's because we are using a "low watermark" approach and orx is not doing dynamic allocations everytime we create/delete something. Otherwise performance would become rather poor very quickly! :P In this case, either by using a partitioner in the render plugin or by reducing the count of objects in the scene (like 1 object for the map vs 10 000+) will help tremendously!
- orxTexture_CreateFromFile: oh hey you, why have you been called? That means we're still reading texture from disk probably and would explain the wide variation over the last second. Is that because something isn't kept in cache or because we're at the beginning of the application, that I can't say just from the screenshot!
- orxDisplay_DrawArrays: this is the number of batches we have sent to the GPU. The smaller the better. That's also one of the main bottleneck on mobile platforms and with recent 3D-accelerated engines. But don't fear, the profiler display itself can be accounted for 19 batches, just by himself. Yes I know, things could probably get much better organized, I'll check it out at some point. But the good news is that we only use 3 batches for the game rendering itself, for displaying 2646 objects, that's not too bad!
If you have any questions, just let me know!
As for the object rendering; I plan to tweak that quite severely over the next few days, and as suggested, write myself a shader to handle the ground rendering
Running the profiler for the first time my game was running about 28-30 fps with ~90-100 Objects in existence. The great majority of that was the Swap function. I set BlendMode = none wherever I could and reduced the number of objects (I neglected to delete some unneeded invisible ones after transition into the scene). Now we are doing about 42 fps with ~80-90 objects in existence.