Skip to content

drain CQ after moving QP into RESET state#3261

Open
live4thee wants to merge 1 commit intoapache:masterfrom
live4thee:fix-3252
Open

drain CQ after moving QP into RESET state#3261
live4thee wants to merge 1 commit intoapache:masterfrom
live4thee:fix-3252

Conversation

@live4thee
Copy link
Copy Markdown
Contributor

Otherwise there might segfault due to the race below:

Socket::OnInputEvent()          |
  `-- ProcessEvent (bthread)    |
                                |
  [ bthread queueed ]           |  QP error -> SetFailed -> HC -> WaitAndReset()
                                |  Reset() -> _sbuf.clear()
                                |  CheckHealth() -> Revive()
                                |
                                |  Socket is now Addressable!
  RdmaEndpoint:PollCq()         |
  Socket::Address() OK!         |
  RdmaEndpoint:HandleCompletion()
  _sbuf[_sq_sent++].clear()    <= BOOM!  CQ is not drained but _sbuf is cleared.

Another possible fix is to add a generation field in RdmaEndpoint, such that:

  • each RdmaEndpoint::Reset() will advance the generation by 1;
  • the RdmaEndpoint::PollCq(m, orig_gen) will need to compare the generation.

But it will contaminate existing interface, and we need to drain CQ anyway.

What problem does this PR solve?

Issue Number: 3252

Problem Summary: see above.

What is changed and the side effects?

Changed:

  1. drain CQ after moving QP into RESET state;
  2. do not re-use QP if drain failed.

Side effects:

  • Performance effects: n/a

  • Breaking backward compatibility: none


Check List:

Otherwise there might segfault due to the race below:

```txt
Socket::OnInputEvent()          |
  `-- ProcessEvent (bthread)    |
                                |
  [ bthread queueed ]           |  QP error -> SetFailed -> HC -> WaitAndReset()
                                |  Reset() -> _sbuf.clear()
                                |  CheckHealth() -> Revive()
                                |
                                |  Socket is now Addressable!
  RdmaEndpoint:PollCq()         |
  Socket::Address() OK!         |
  RdmaEndpoint:HandleCompletion()
  _sbuf[_sq_sent++].clear()    <= BOOM!  CQ is not drained but _sbuf is cleared.
```

Another possible fix is to add a _generation_ field in RdmaEndpoint, such that:
 - each RdmaEndpoint::Reset() will advance the _generation_ by 1;
 - the RdmaEndpoint::PollCq(m, orig_gen) will need to compare the _generation_.

But it will contaminate existing interface, and we need to drain CQ anyway.

Signed-off-by: David Lee <live4thee@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant