Hi everyone,
In orx 1.6 release, I've been struggling with a nasty bug that occasionally crashes my game. I've chased the bug down to the following sequence of events.
I have a ScoreDisplay (scroll)object and it has a IncreaseScore method, which creates ScorePiece (scroll)objects that fly towards the ScoreDisplay. The IncreaseScore creates the ScorePiece objects as an owned child of the ScoreDisplay.
The flying towards the display is handled in ScorePiece::Update, where I query the position of the ScoreDisplay by "GetPos"ing its "Owner".
The crash happens when I delete the ScoreDisplay while there are "ScorePiece"s flying towards it. I'm printing out a lot of stuff for debug purposes and I can see that while the ScoreDisplay is being deleted, the "ScorePiece"s are still among its Owned children and their Refcount is 1. All the other children of the ScoreDisplay get deleted right after the ScoreDisplay but not the "ScorePiece"s. I can also see that in the last ScorePiece::Update that causes the crash, its GetOwner returns null.
I could dodge this bug by simply checking in ScorePiece::Update whether the owner is null, but the trouble is there are many places in the project where I assume no ScrollObject::Update would be called after the owner of an object is deleted. Is that a wrong assumption? If not, why could this be happening and how can I debug it?
Note that I've considered the possibility of memory corruption, but I've run the game under Valgrind and it reports no errors. I can also consistently reproduce the bug under different build configurations (release/debug) and I really doubt a memory corruption could manifest itself so consistently.
Thanks!
Comments
As far as I know, I never make delete calls, I always use SetLifeTime(0). I've just run a quick search on my source code and I never use orxObject_Delete
Now I remember doing some bug fixing in the object hierarchy code with respect to deletion a while ago. So that issue might have been already fixed since the 1.6 release and I'd recommend using the hg latest instead (for both orx and scroll).
That being said, you will probably have to change a couple of function calls by doing so, but it shouldn't be more than a 5-10 min job (can't remember the detail but it should be in the hg logs if you're curious).
I've just confirmed that the issue persists in the nightly build of orx and the latest commit of Scroll. I've also confirmed that the bug existed in my game for a long time, under very different conditions, so I guess this further reduces the chances of a memory corruption.
Maybe I should note at this point that I had to fork Scroll to fit it in my build setup. Scroll carrying around its own of orx headers did not suit me, so I've deleted them and changed the scroll headers to use the external headers. That's all I changed.
It is possible for a child object to be deleted a frame later than its owner. It depends where in the list it's stored as we keep the list ordered in a way that reflects their actual location in memory to minimize cache misses as much as possible.
So it is indeed possible to have an object that is "pending deletion" while its ScrollObtject::Update() function is called. In which case its lifetime will be set to 0.
As you pointed out, your work around is simply to check the owner before using it, or, and that's usually what I do most of the time in my projects, instead of storing direct pointers, I store their GUID.
I then retrieve objects through their GUID when I need them (either through orx's orxStructure_Get() or directly with Scroll's GetObject()), which will either return a valid object or orxNULL if that object doesn't exist anymore.
As a side note, doesn't it cause single frame visual glitches this way?
Thanks for clearing the issue out.
None of the structures linked to the object will get updated by orx as soon as its LifeTime has reached 0, however it might be rendered for the current frame.
The issue you're experiencing is mostly due to how Scroll handles object updates. This isn't a feature built in orx, so it loops over all the objects itself to then call the Update method on them, but this process is disconnected . A simple test in Scroll to prevent calling Update on objects that have a LifeTime set to 0 should do the trick for you and comply with your original assumption.
Thanks iarwain
In the first case, the whole hierarchy will disappear in the same frame, no delay whatsoever as all the objects will be correctly ordered for the deletion. In the second case, as children can possibly be earlier in the object list than the parent, deletion will be delayed 1 frame everytime this happens.
So, worst case scenario, it can go up to N frames to delete a whole hierarchy whose deepest depth is N.
To prevent this from happening, you can modify ScrollObject::SetLifeTime() to recursively set the LifeTime of the whole hierarchy instead, and that should fix any deletion delay as all the objects would then disappear in the same frame.
I'm not sure which issues you would have by accessing "pending deletion" objects through your smart pointers. Those object will disappear during the next object update inside orx, before any of their structures are updated, so the side effects should be minimal, shouldn't they?
As a FYI, children object deletion used to be done instantly when a parent was deleted but that brought up another issue that I can't remember at the moment. I'm sure one could find it by perusing through the changelog or the hg history, with enough motivation.
A simple example of this would be a dialog object which contains an "OK" button object. When clicked, the button object might invoke a click handler that does something with the dialog object (like changing its text). Under normal circumstances, you could assume in the button click handler code that if the button is alive (since it received a click), the dialog must also be alive.
I know these are all edge-cases, and I'm probably manufacturing concerns that will never happen. It's just that I'm trying to make some simple assumptions that would help with reasoning about the whole program so that bugs can't hide in corners.
How about I run a loop at the start of each frame (Scroll::run) and handle orxObject_Delete of all the descendants of objects pending deletion. I don't have many objects so I don't mind the performance loss, would that make sure that all objects die at the same frame with their parents?
I guess I got confused by your notion of "alive". Also I'm not sure one should make a parallel between object ownership/hierarchy and C++ classes. In the first case, it's about two separate entities whereas in the second case it's the same instance (akin to orxSTRUCTURE and its derivatives).
I see two cases with objects:
- the owner has been deleted -> in which case your smart pointer will return null and you know that there's no processing that needs to happen
- the owner has a life time of 0 (pending deletion but not dead) -> that's the one I was referring to, you can manipulate the owner which shouldn't have any serious side effect as it'll get deleted before its next internal (orx) update anyway. Worst case scenario would be a 1-frame display of the wrong text. I could easily prevent orx from displaying any 0-LifeTime object to fix this if need be.
I'm not saying there isn't any case where undesired behaviour would come out of this, it's just that I can't think of any at the moment.
In the specific case of UI, I handle all the menu communication/interaction specifics through orxINPUTs (virtual, not bound to any physical keys/buttons) and timelines. That allows me to keep the whole UI code under 20 lines of C/C++ (which is the case in Little Cells, for example) and everything else is data driven.
Well it depends, how often do you build hierarchies programmatically? It'll only happen in these cases.
Which would be fixed either by the way you propose or with a recursive SetLifeTime(0), which I personally prefer, but it's definitely up to you.
I build them quite often, especially since I've been using the behavior trees. For instance, the event of a group of birds coming together, doing something and leaving together is modeled as a behavior of the level. All the objects created by this event are the children of the level.
How about a deeply nested hierarchy in which neither the object nor its parent has any trace of deletion yet? Say an ancestor 4 layers up is being deleted at this frame, so the ancestor at the 3rd layer has lifetime 0, but the owner and the object doesn't know about that yet. In order to handle this, my smart pointer needs to check parents recursively every time it's accessed, which might be a big performance hit.
That's a nice idea I'll sure look for ways to use this from now on.
In that case yep, better to be prepared.
I often do change ownership this way (like adding particles to a container for quick cleanup).
I guess I just never had the case. All my deep hierarchies are config-made, so their deletion happens within a single frame. When I compose hierarchies in code, it never goes deeper than a single level. Also I don't think I've ever had children making any assumptions about owners, the few times I have this kind os assumptions, it's always from owner/parent to the way down. When I need communication to go up, I've always used timelines and/or inputs just to get the flexibility for the designer to be able to change hierarchies at will, as usually they don't code.
In any case, having those two simples changes should cover all your issues, I believe:
- Add the test in ScrollBase::Update() to not update objects that have a 0-LifeTime. I'll probably add it to Scroll in the end.
- Add recursivity to ScrollObject::SetLifeTime(), assuming that's what you're using, otherwise your option of having a loop check each frame works too.
Or did I miss some cases?
You're right, I agree that trying to minimize direct child->owner reliance is a nice goal on its own. Though some objects exist in a context and they'd be meaningless without it. Like the birds in my game, which try to find food pieces, so they need to access the physical world that the level owns, or the crows that try to pester the sparrows in the same level. It's logically safe to assume they would disappear with the context they're in, so I'd rather avoid checking if the context still exists all around the code.
Thanks a lot iarwain, I'll definitely implement these two. I guess adding the test to ScrollBase::Update (and the smart pointer) is a necessity even if I guaranteed that the hierarchy gets deleted in the same frame, since an object from the hierarchy could remain undeleted with lifetime=0 due to someone holding a reference to it (so its refcount>1).
Totally. That being said, when I have unique objects (like a scene or, in your case the flock (owner) of sparrows or the owner of the food pieces), I store their GUID in a runtime-only config section. Then I can retrieve them from anywhere (including from timelines). The pattern I'm using is this one:
And now I can retrieve them either in code:
or in config:
I use the same system to handle menus (using CurrentMenu/PreviousMenu IDs, but this time I store the previous one directly inside the config section of the current menu, akin to a bread crumb), for example.
Lemme know if that wasn't enough to make it work on your side. I'll add the check inside ScrollBase in the coming days on my side as well.
Adding the check solved my current bug as it manifested itself in a single step hierarchy, but I've also added the code to make sure the entire hierarchy receives SetLifeTime 0 to ensure correctness in the future. I didn't achieve this through ScrollObject::SetLifeTime since I can't be sure that's the only way objects get deleted. Some could have a finite lifetime at creation and some others could be deleted by timelines etc. I'm currently looping through all the objects at the end of every frame, but I think I'll switch to an object deletion handler that sets the lifetime of the ancestors of the deleted object.
Just came back from vacation, sorry for the late reply!
The step 0 can either run during the next object update (default mode) or right after the object has been created if the config property Immediate = true has been set for the track.
I'd love to hear of others' somewhat-hijacked uses of config/timelines.