Refine wait_timeout override to be at cut-over only #1406

Merged

timvaillancourt
Collaborator

@timvaillancourt timvaillancourt commented Apr 11, 2024

Related issue: #1407

Description

This PR refines #1401 by overriding the session wait_timeout only where it is needed: at cut-over time, where an idle connection could hold a table lock for a potentially long time if the gh-ost process (or the host running it) "freezes"/"stalls" at the cut-over stage.

The change (at cut-over only):

  1. Before cut-over:
    • The server wait_timeout is fetched (via an existing select that fetched time_zone)
    • The applier session wait_timeout is set to be 3 x the lock-wait timeout
      • This is to ensure the lock is released if gh-ost stalls with a still-active connection here
  2. The cut-over proceeds as normal
  3. After cut-over, the original session wait_timeout is restored to what it was set to pre-cut-over
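
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `sessionExecer`, `withCutOverWaitTimeout`, and `recordingConn` are hypothetical names, and the sketch assumes the original wait_timeout has already been fetched and that the lock-wait timeout is available in whole seconds.

```go
package main

import "fmt"

// sessionExecer is a hypothetical stand-in for the applier's single
// database session; it is not a gh-ost type.
type sessionExecer interface {
	Exec(query string) error
}

// cutOverWaitTimeoutSeconds returns 3 x the lock-wait timeout, the value
// the description says is applied to the session for the cut-over.
func cutOverWaitTimeoutSeconds(lockWaitTimeoutSeconds int64) int64 {
	return 3 * lockWaitTimeoutSeconds
}

// withCutOverWaitTimeout overrides the session wait_timeout, runs cutOver,
// and then restores the original value, even when cutOver fails.
func withCutOverWaitTimeout(conn sessionExecer, originalSeconds, lockWaitTimeoutSeconds int64, cutOver func() error) (err error) {
	override := cutOverWaitTimeoutSeconds(lockWaitTimeoutSeconds)
	if execErr := conn.Exec(fmt.Sprintf("set session wait_timeout = %d", override)); execErr != nil {
		return execErr
	}
	defer func() {
		// Restore the pre-cut-over value; keep the cut-over error if there was one.
		restoreErr := conn.Exec(fmt.Sprintf("set session wait_timeout = %d", originalSeconds))
		if err == nil {
			err = restoreErr
		}
	}()
	return cutOver()
}

// recordingConn is a test double that records statements instead of
// sending them to MySQL.
type recordingConn struct {
	queries []string
}

func (c *recordingConn) Exec(query string) error {
	c.queries = append(c.queries, query)
	return nil
}
```

With a 3-second lock-wait timeout and an original wait_timeout of 28800, the recorded statements would set the session value to 9 before the cut-over and back to 28800 afterwards.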

The --mysql-wait-timeout flag added in #1401 is removed because it is no longer needed. No release has been cut since #1401, so this isn't necessarily a breaking change.

In case this PR introduced Go code changes:

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

@timvaillancourt timvaillancourt added this to the v1.1.7 milestone Apr 11, 2024
@timvaillancourt timvaillancourt changed the title Refine wait_timeout flag to be cut-over only Refine wait_timeout override to be at cut-over only Apr 11, 2024
@timvaillancourt timvaillancourt linked an issue Apr 11, 2024 that may be closed by this pull request
@timvaillancourt timvaillancourt marked this pull request as ready for review April 11, 2024 18:29
Contributor

@shlomi-noach shlomi-noach left a comment

Nice work and in particular great analysis of this stalemate situation! This looks generally correct, but see inline comments:

  • It breaks gh-ost for existing users
  • It adds variables where I think we can do without
  • It adds code complexity which we can delegate to MySQL.

Inline comments on:
go/base/context.go
go/cmd/gh-ost/main.go
go/logic/applier.go (3 comments)
@timvaillancourt
Collaborator Author

timvaillancourt commented Apr 22, 2024

@shlomi-noach I'm curious about your feedback on one consequence of this change

Following this PR, it is possible for the session holding the lock tables to time out (and unlock tables) before the "magic table" is dropped here.

If I understand this right, the lock-session hitting wait_timeout will cause the rename tables to succeed. That all sounds better than before, but the "magic table" will be left behind. I don't fully understand the impact of this.

-initially-drop-old-table
    	Drop a possibly existing OLD table (remains from a previous run?) before beginning operation. Default is to panic and abort if such table exists

Would gh-ost just fix this scenario for users with -initially-drop-old-table? Any other race-condition risks you can see the lock release causing 🤔? 🙇

@github github deleted a comment Apr 29, 2024
@shlomi-noach
Contributor

If I understand this right, the lock-session hitting wait_timeout will cause the rename tables to succeed.

No, actually. The RENAME will not succeed, because the magic table is still in place. The RENAME statement attempts to rename original-table into magic-table. But since magic-table is there, the RENAME will fail.

The next cut-over attempt will first, before placing any locks, attempt to DropAtomicCutOverSentryTableIfExists() before re-creating it.

This should be safe.
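
The reasoning above can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: `schema`, `cutOverRename`, and `dropIfExists` are hypothetical helpers, the table names are illustrative, and locking as well as re-creating the sentry table on retry are left out. It shows only the rule that a rename fails while its destination name is taken.

```go
package main

import "fmt"

// schema is a toy model of the table names in one database.
// It is hypothetical and not part of gh-ost.
type schema map[string]bool

// cutOverRename models the atomic swap: original -> magic, ghost -> original.
// Like RENAME TABLE, it fails without changing anything when the
// destination name (the magic table) already exists.
func (s schema) cutOverRename(original, ghost, magic string) error {
	if !s[original] || !s[ghost] {
		return fmt.Errorf("table %q or %q does not exist", original, ghost)
	}
	if s[magic] {
		return fmt.Errorf("table %q already exists", magic)
	}
	delete(s, ghost)
	s[magic] = true // the original table now lives under the magic name
	// s[original] stays true: the ghost table took over the original name
	return nil
}

// dropIfExists models DROP TABLE IF EXISTS, which the next cut-over
// attempt issues for the sentry table before placing any locks.
func (s schema) dropIfExists(name string) {
	delete(s, name)
}
```

In this model, a rename attempted while the magic table is left behind returns an error and leaves both tables untouched; once the leftover table is dropped, a later attempt goes through.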

@timvaillancourt
Collaborator Author

@shlomi-noach that makes sense (eventually)! Thanks for the validations and explanations

meiji163
meiji163 previously approved these changes Aug 8, 2024
@timvaillancourt
Collaborator Author

@shlomi-noach / @meiji163: I believe I've addressed the PR suggestions and this is ready for another review 🙇

@timvaillancourt timvaillancourt force-pushed the mysql-wait-timeout-applier-only branch 3 times, most recently from 4d670bd to d0854d6 Compare August 14, 2024 00:24
@timvaillancourt
Collaborator Author

Merging. @shlomi-noach let me know if I missed something and I'll make a follow-up PR 👍

@timvaillancourt timvaillancourt merged commit 48cb9ab into github:master Aug 15, 2024
7 checks passed
@timvaillancourt timvaillancourt deleted the mysql-wait-timeout-applier-only branch August 15, 2024 18:29
Development

Successfully merging this pull request may close these issues.

cut-over locks not released when gh-ost pauses mid-cut-over
3 participants