fix: clear error state on disabled-transitively cells when ancestor recovers#8784
fix: clear error state on disabled-transitively cells when ancestor recovers#8784VishakBaddur wants to merge 2 commits intomarimo-team:mainfrom
Conversation
…ecovers Fixes marimo-team#8072 When a disabled-transitively cell's ancestor had an error and then recovered, the disabled cell permanently showed the ancestor's error state. This happened because run_stale_cells() only re-queues non-disabled cells, so disabled-transitively cells never got a chance to reset their run_result_status from 'exception' to 'disabled'. Fix: - Add is_any_ancestor_errored() to DirectedGraph - In run_stale_cells(), after building cells_to_run, reset run_result_status to 'disabled' for any disabled-transitively cell whose ancestor no longer has an error
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
for more information, see https://pre-commit.ci
| def is_any_ancestor_errored(self, cell_id: CellId_t) -> bool: | ||
| """Check if any ancestor of a cell has an error.""" | ||
| return any( | ||
| self.topology.cells[cid].run_result_status == "exception" |
There was a problem hiding this comment.
is_any_ancestor_errored() only treats run_result_status == "exception" as an error, but the runtime also uses other error-like statuses (e.g. "marimo-error" is set for semantic/registration errors). The method name/docstring says “has an error”, so this narrow check is likely to be reused incorrectly and can cause false negatives when an ancestor is still in an error state.
Consider either (a) broadening the predicate to include all statuses that should be treated as “errored” (at least "exception" and "marimo-error", possibly "interrupted" depending on intent), or (b) renaming/docstring to make it explicit that it only checks for raised exceptions.
| self.topology.cells[cid].run_result_status == "exception" | |
| self.topology.cells[cid].run_result_status in ("exception", "marimo-error") |
| # Clear stale error state from disabled-transitively cells whose | ||
| # ancestor has recovered from an error. Without this, the disabled | ||
| # cell permanently shows the ancestor error even after it is fixed. | ||
| for cid, cell_impl in self.graph.cells.items(): | ||
| if ( | ||
| self.graph.is_disabled(cid) | ||
| and not cell_impl.config.disabled | ||
| and cell_impl.run_result_status == "exception" | ||
| and not self.graph.is_any_ancestor_errored(cid) | ||
| ): |
There was a problem hiding this comment.
This loop calls self.graph.is_disabled(cid) for every cell on every run_stale_cells() invocation. DirectedGraph.is_disabled() walks parents (BFS) and can become a noticeable hot path for large notebooks.
Since this block only targets disabled-transitively cells, consider using the already-tracked runtime state (cell_impl.runtime_state == "disabled-transitively" / cell_impl.disabled_transitively) instead of recomputing is_disabled() each time, or precomputing a disabled set once and reusing it in both loops.
| # Clear stale error state from disabled-transitively cells whose | ||
| # ancestor has recovered from an error. Without this, the disabled | ||
| # cell permanently shows the ancestor error even after it is fixed. | ||
| for cid, cell_impl in self.graph.cells.items(): | ||
| if ( | ||
| self.graph.is_disabled(cid) | ||
| and not cell_impl.config.disabled | ||
| and cell_impl.run_result_status == "exception" | ||
| and not self.graph.is_any_ancestor_errored(cid) | ||
| ): | ||
| cell_impl.set_run_result_status("disabled") | ||
|
|
There was a problem hiding this comment.
This block updates cell_impl.run_result_status but does not emit any CellNotification to the frontend. The frontend’s “errored”/error UI is driven by received cell-op messages (especially error outputs), and it doesn’t observe backend run_result_status directly.
If the goal is to clear the user-visible error state for disabled-transitively cells, this likely also needs an explicit UI update (e.g., clearing/replacing the error output and/or sending a status transition that resets the frontend’s errored flag). An alternative is to include these cells in the normal _run_cells queue so they go through the runner’s standard status transitions, plus explicitly clearing their error output when they’re skipped as disabled.
| def test_is_any_ancestor_errored() -> None: | ||
| """Test that is_any_ancestor_errored correctly detects ancestor errors.""" | ||
| graph = dataflow.DirectedGraph() | ||
| # Create a chain: 0 -> 1 -> 2 | ||
| code = "x = 0" | ||
| first_cell = parse_cell(code) | ||
| graph.register_cell("0", first_cell) | ||
| code = "y = x" | ||
| second_cell = parse_cell(code) | ||
| graph.register_cell("1", second_cell) | ||
| code = "z = y" | ||
| third_cell = parse_cell(code) | ||
| graph.register_cell("2", third_cell) | ||
|
|
||
| # No errors initially | ||
| assert not graph.is_any_ancestor_errored("0") | ||
| assert not graph.is_any_ancestor_errored("1") | ||
| assert not graph.is_any_ancestor_errored("2") | ||
|
|
||
| # Set cell 0 to exception state | ||
| graph.cells["0"].set_run_result_status("exception") | ||
| assert not graph.is_any_ancestor_errored("0") # no ancestors | ||
| assert graph.is_any_ancestor_errored("1") # parent 0 has error | ||
| assert graph.is_any_ancestor_errored("2") # grandparent 0 has error | ||
|
|
||
| # Fix cell 0 - clear the error | ||
| graph.cells["0"].set_run_result_status("success") | ||
| assert not graph.is_any_ancestor_errored("0") | ||
| assert not graph.is_any_ancestor_errored("1") | ||
| assert not graph.is_any_ancestor_errored("2") |
There was a problem hiding this comment.
This test validates the new DirectedGraph.is_any_ancestor_errored() helper, but the PR’s user-facing behavior change is in Kernel.run_stale_cells() (clearing disabled-transitively cells’ stale error state when an ancestor recovers). Consider adding an integration-style runtime test that reproduces #8072 end-to-end (ancestor errors → downstream disabled-transitively cell shows error → ancestor fixed + run_stale_cells() → downstream cell no longer shows error/exception state). This would help ensure the run_stale_cells() logic stays correct as execution/notification behavior evolves.
Fixes #8072
Root Cause
When a disabled-transitively cell's ancestor had an error and then recovered, the disabled cell permanently showed the ancestor's error state.
run_stale_cells()inruntime.pyonly re-queues non-disabled cells:So disabled-transitively cells never got re-queued and never had a chance to reset their
run_result_statusfrom"exception"to"disabled".Fix
is_any_ancestor_errored()toDirectedGraphrun_stale_cells(), after buildingcells_to_run, resetrun_result_statusto"disabled"for any disabled-transitively cell whose ancestor no longer has an errorTesting
Added
test_is_any_ancestor_erroredtotests/_runtime/test_dataflow.pyverifying the new graph method correctly detects and clears ancestor error states.