---
title: "Debugging a Stack Overflow in Rails 7.2.1.1"
description: "Debugging a Stack Overflow in Rails 7.2.1.1"
canonical_url: "https://www.bigbinary.com/blog/debugging-stack-overflow-in-rails"
markdown_url: "https://www.bigbinary.com/blog/debugging-stack-overflow-in-rails.md"
---

# Debugging a Stack Overflow in Rails 7.2.1.1

Debugging a Stack Overflow in Rails 7.2.1.1

- Author: Vishnu M
- Published: November 25, 2025
- Categories: Rails, Ruby

A few weeks ago, after upgrading [NeetoCal](https://www.neeto.com/neetocal) from
Rails 7.1.5.2 to Rails 7.2.1.1, we started seeing mysterious crashes in
production with the error message `SystemStackError: stack level too deep`.

## Identifying the problem

The crash happened inside `Slots::SyncAllCalendarsService`, a service that syncs
multiple calendars concurrently using the `Async` gem. What made it particularly
puzzling was that switching from `Async::Barrier` to `Thread.new` made the error
disappear. This led us down a rabbit hole thinking it was an Async-specific
issue.

The stack trace pointed to a strange location:

```
activerecord (7.2.1.1) lib/arel/visitors/to_sql.rb:898:in `each'
  from lib/arel/visitors/to_sql.rb:898:in `each_with_index'
  from lib/arel/visitors/to_sql.rb:898:in `inject_join'
  from lib/arel/visitors/to_sql.rb:627:in `visit_Arel_Nodes_Or'
  ... 1132 levels...
```

Over a thousand levels deep into `visit_Arel_Nodes_Or`. That's unusual. Why
would visiting OR nodes in a SQL query cause such deep recursion?

The stack trace mentioned `Slots::SyncAllCalendarsService`, but that service
itself had nothing obviously wrong. It was just iterating through calendars and
syncing them. We traced deeper into the code and found the actual culprit in
Neeto's `Integrations::Icloud::SyncEventsService`:

```ruby
def abandoned_events
  pairs = event_params.map { |e| [e[:url], e[:recurrence_id]] }

  @_abandoned_events = calendar.events
    .where("start_time BETWEEN ? AND ?", start_time, end_time)
    .where.not([:url, :recurrence_id] => pairs)
end
```

This code finds "abandoned" events by excluding events that match specific URL
and recurrence ID combinations. In production, `pairs` contained 233 entries.
Nothing immediately suspicious, right?

To understand what was happening, we needed to see what Rails was actually doing
with this query. The `where.not([:url, :recurrence_id] => pairs)` syntax is
Rails' composite key feature. With 233 pairs, Rails generates SQL like this:

```sql
WHERE NOT (
  (url = 'url1' AND recurrence_id = 'rec1') OR
  (url = 'url2' AND recurrence_id = 'rec2') OR
  (url = 'url3' AND recurrence_id = 'rec3') OR
  -- ... 230 more OR clauses
)
```

233 OR clauses. That's a lot, but databases can handle it. The problem wasn't
the SQL itself - it was how Rails built the internal representation of that
query.

## Going into Rails Internals

To really understand what changed, we used `bundle open activerecord` to look at
the Rails source code. We compared the same code in Rails 7.1.5.2 and 7.2.1.1.

### What We Found in Rails 7.1.5.2

In Rails 7.1.5.2, the code that visits OR nodes looked like this:

```ruby
def visit_Arel_Nodes_Or(o, collector)
  stack = [o.right, o.left]

  while o = stack.pop
    if o.is_a?(Arel::Nodes::Or)
      stack.push o.right, o.left
    else
      visit o, collector
      collector << " OR " unless stack.empty?
    end
  end

  collector
end
```

This is an iterative approach using a manual stack. No matter how deeply nested
the OR conditions are, this code only uses a single stack frame for the
`visit_Arel_Nodes_Or` method. It's stack-safe.

### What Changed in Rails 7.2.1.1

In Rails 7.2.1.1, the same method became much simpler:

```ruby
def visit_Arel_Nodes_Or(o, collector)
  inject_join o.children, collector, " OR "
end
```

Where `inject_join` does:

```ruby
def inject_join(list, collector, join_str)
  list.each_with_index do |x, i|
    collector << join_str unless i == 0
    collector = visit(x, collector)  # Recursive call
  end
  collector
end
```

This is a recursive approach. If an OR node contains another OR node, it calls
`visit`, which calls `visit_Arel_Nodes_Or`, which calls `inject_join` again. The
recursion depth grows with the nesting depth of the OR tree.

## The Tree Structure Problem

But why is the OR tree so deeply nested? We found the answer in Active Record's
`PredicateBuilder#grouping_queries`:

```ruby
def grouping_queries(queries)
  if queries.one?
    queries.first
  else
    queries.map! { |query| query.reduce(&:and) }
    queries = queries.reduce { |result, query| Arel::Nodes::Or.new([result, query]) }
    Arel::Nodes::Grouping.new(queries)
  end
end
```

That `reduce` call is the key. It builds a left-deep nested tree:

```
Level 1:  Query1 OR Query2 = OR_Node_1
Level 2:  OR_Node_1 OR Query3 = OR_Node_2
Level 3:  OR_Node_2 OR Query4 = OR_Node_3
...
Level 232: OR_Node_231 OR Query233 = OR_Node_232
```

With 233 pairs, this creates 232 levels of nesting. Each level adds
approximately 5 stack frames when Rails traverses it recursively. That's 1,160
stack frames just for the query building.

## The Async Connection

Why did `Thread.new` work but `Async::Barrier` didn't?

We wrote a test script to measure the actual stack depth limits in different
contexts:

```ruby
def test_recursion_depth(context_name)
  depth = 0

  recurse = lambda do |n|
    depth = n
    recurse.call(n + 1)
  end

  begin
    recurse.call(0)
  rescue SystemStackError
    return depth
  end
end

thread_depth = nil
Thread.new do
  thread_depth = test_recursion_depth("thread")
end.join
puts "Thread recursion limit: ~#{thread_depth} calls"

async_depth = nil
Async do
  async_depth = test_recursion_depth("async")
end
puts "Async fiber recursion limit: ~#{async_depth} calls"
```

When we ran this, the results were revealing:

```
Thread recursion limit:      ~11910 calls
Async fiber recursion limit: ~1482 calls
```

Threads have roughly 8x more stack space than Async fibers. This is because
threads use larger stacks (1MB-8MB) while fibers are designed to be lightweight
with smaller stacks (512KB-1MB). This difference in stack allocation is why
threads can handle deeper recursion before hitting the overflow limit.

With 233 pairs requiring ~1,160 stack frames, we were right at the edge of the
Async fiber limit. But we were still well within the thread limit, which is why
switching to `Thread.new` seemed to fix it.

When we tested with 500 pairs (which some of our larger calendars had), even
`Thread.new` failed with a stack overflow. So it wasn't really a fix, it just
pushed the problem further down the road.

## The GitHub Trail

After we understood the problem, we searched the Rails repository and found the
exact history:

1. **April 4, 2024 - [PR #51492](https://github.com/rails/rails/pull/51492)**:
   This PR changed OR nodes from **binary** to **n-ary** to handle multiple
   children in a single node.

2. **September 24, 2024 -
   [Issue #53031](https://github.com/rails/rails/issues/53031)**: Someone
   reported the exact issue we were seeing. Queries with 800+ OR conditions that
   worked in Rails 7.1 now crashed in Rails 7.2.

3. **September 25, 2024 -
   [PR #53032](https://github.com/rails/rails/pull/53032)**:
   [fatkodima](https://github.com/fatkodima) fixed it with a one-line change in
   `PredicateBuilder#grouping_queries`:

**Before:**

```ruby
queries = queries.reduce { |result, query| Arel::Nodes::Or.new([result, query]) }
```

**After:**

```ruby
queries = Arel::Nodes::Or.new(queries)
```

The difference is profound. The old code using `reduce` created a deeply nested
structure like Russian nesting dolls:

```
Or([Or([Or([Query1, Query2]), Query3]), Query4])  # 232 levels with 233 pairs
```

The new code creates a flat structure:

```
Or([Query1, Query2, Query3, Query4, ...Query233])  # Just 1 level
```

With a flat structure, the recursive visitor only traverses one level instead of
232, eliminating the stack overflow. The fix is available in Rails 7.2.2+.

## Our Solution

We're currently on Rails 7.2.1.1 and upgrading immediately would require
extensive testing. So we implemented a workaround in our code:

```ruby
def abandoned_events
  return @_abandoned_events if defined?(@_abandoned_events)

  pairs = event_params.map { |e| [e[:url], e[:recurrence_id]] }
  return if pairs.empty?

  events_in_time_range = calendar.events
    .where("start_time BETWEEN ? AND ?", start_time, end_time)

  # Get candidate events using simple IN clauses
  urls = pairs.map(&:first).uniq
  recurrence_ids = pairs.map(&:last).uniq

  candidate_events = events_in_time_range
    .where(url: urls)
    .where(recurrence_id: recurrence_ids)

  # Filter to exact pairs in memory
  pairs_set = pairs.to_set
  ids_to_exclude = candidate_events.select { |event|
    pairs_set.include?([event.url, event.recurrence_id])
  }.map(&:id)

  @_abandoned_events = if ids_to_exclude.empty?
    events_in_time_range
  else
    events_in_time_range.where.not(id: ids_to_exclude)
  end
end
```

This approach:

- Uses simple IN clauses instead of complex OR conditions
- Filters the exact pairs in memory (fast with a Set)
- Excludes by ID, which is a flat list
- Never builds deeply nested Arel nodes

The trade-off is we run 2 queries instead of 1, but the queries are simpler and
more efficient. The in-memory filtering is negligible since we're already
constrained by time range.

## Measuring and Verifying

To prove our hypothesis, we wrote verification scripts that patched Rails
internals to measure what was actually happening.

### Counting Arel Visitor Calls

We created a script that patched `visit_Arel_Nodes_Or` to count how many times
it was called and how deep the recursion went:

```ruby
module Arel
  module Visitors
    class ToSql
      alias_method :original_visit_Arel_Nodes_Or, :visit_Arel_Nodes_Or

      def visit_Arel_Nodes_Or(o, collector)
        @or_call_count ||= 0
        @or_call_count += 1
        @max_depth ||= 0
        @current_depth ||= 0
        @current_depth += 1
        @max_depth = [@max_depth, @current_depth].max

        result = original_visit_Arel_Nodes_Or(o, collector)

        @current_depth -= 1
        result
      end
    end
  end
end
```

Results with 50 test pairs:

- **Rails 7.1.5.2**: `visit_Arel_Nodes_Or` was called 1 time and max recursion
  depth was 0.
- **Rails 7.2.1.1**: `visit_Arel_Nodes_Or` was called 49 times and max recursion
  depth was 100.

With 233 pairs in production:

- **Rails 7.1.5.2**: Still called 1 time, depth 0 (iterative approach)
- **Rails 7.2.1.1**: Called 232 times, depth ~466 (exceeds Async fiber limit)

### Testing the Actual Breaking Points

We ran the problematic query with different pair counts to find exactly where it
breaks:

```ruby
# Test with Async
Async do
  pairs = calendar.events.limit(count).pluck(:url, :recurrence_id)
  calendar.events.where.not([:url, :recurrence_id] => pairs).count
end.wait

# Test with Thread
Thread.new do
  pairs = calendar.events.limit(count).pluck(:url, :recurrence_id)
  calendar.events.where.not([:url, :recurrence_id] => pairs).count
end.join
```

Results:

- **Async Fiber**: Breaks at approximately 233 pairs
- **Thread.new**: Breaks at approximately 500 pairs

This confirmed that Thread.new wasn't a real solution - it just had more
headroom before hitting the same problem.

For now, we're sticking with our workaround. It works reliably, performs well,
and doesn't require upgrading Rails immediately. When we do upgrade to Rails
7.2.2+, we can revert to the original clean syntax, knowing that the Rails team
has fixed the underlying issue.

If you're on Rails 7.2.0 through 7.2.1.x and use composite key queries with
large datasets, watch out for this issue. The fix is in Rails 7.2.2+, or you can
work around it like we did.

## References

- [PR #51492 - Bug introduced](https://github.com/rails/rails/pull/51492)
- [Issue #53031 - Bug reported](https://github.com/rails/rails/issues/53031)
- [PR #53032 - Bug fixed](https://github.com/rails/rails/pull/53032)

## Links

- [Human page](https://www.bigbinary.com/blog/debugging-stack-overflow-in-rails)
