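The primary database is a concept from PG physical replication. Compared with a standby, its defining characteristic is that it is readable and writable. The standby database is the other half of that concept: its defining characteristic is that it is read-only, and it holds a physical copy of the primary's contents (subject to replication lag).
Promoting a PG standby to a primary is very simple: run "SELECT pg_promote()" on the standby that is about to become the primary, and that read-only standby becomes a readable and writable primary. But what happens behind the scenes? This article walks through what actually takes place.
Let's start with the source code of the pg_promote() function: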
/// #define PROMOTE_SIGNAL_FILE "promote"
/*
 * Promotes a standby server.
 *
 * A result of "true" means that promotion has been completed if "wait" is
 * "true", or initiated if "wait" is false.
 */
Datum
pg_promote(PG_FUNCTION_ARGS)
{
    bool        wait = PG_GETARG_BOOL(0);
    int         wait_seconds = PG_GETARG_INT32(1);
    FILE       *promote_file;
    int         i;

    if (!RecoveryInProgress()) /// promote can only be run on a standby.
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("recovery is not in progress"),
                 errhint("Recovery control functions can only be executed during recovery.")));

    if (wait_seconds <= 0)
        ereport(ERROR,
                (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
                 errmsg("\"wait_seconds\" must not be negative or zero")));

    /* create the promote signal file */
    promote_file = AllocateFile(PROMOTE_SIGNAL_FILE, "w");
    if (!promote_file)
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not create file \"%s\": %m",
                        PROMOTE_SIGNAL_FILE)));

    if (FreeFile(promote_file))
        ereport(ERROR,
                (errcode_for_file_access(),
                 errmsg("could not write file \"%s\": %m",
                        PROMOTE_SIGNAL_FILE)));

    /* signal the postmaster */
    if (kill(PostmasterPid, SIGUSR1) != 0) /// The promote file is written first; only then is SIGUSR1 sent to the postmaster.
    {
        (void) unlink(PROMOTE_SIGNAL_FILE);
        ereport(ERROR,
                (errcode(ERRCODE_SYSTEM_ERROR),
                 errmsg("failed to send signal to postmaster: %m")));
    }

    /* return immediately if waiting was not requested */
    if (!wait)
        PG_RETURN_BOOL(true);

    /* wait for the amount of time wanted until promotion */
#define WAITS_PER_SECOND 10
    for (i = 0; i < WAITS_PER_SECOND * wait_seconds; i++)
    {
        int         rc;

        ResetLatch(MyLatch);

        if (!RecoveryInProgress()) /// Once the standby has become the primary, return from the loop.
            PG_RETURN_BOOL(true);

        CHECK_FOR_INTERRUPTS();

        rc = WaitLatch(MyLatch,
                       WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
                       1000L / WAITS_PER_SECOND,
                       WAIT_EVENT_PROMOTE);

        /*
         * Emergency bailout if postmaster has died. This is to avoid the
         * necessity for manual cleanup of all postmaster children.
         */
        if (rc & WL_POSTMASTER_DEATH)
            ereport(FATAL,
                    (errcode(ERRCODE_ADMIN_SHUTDOWN),
                     errmsg("terminating connection due to unexpected postmaster exit"),
                     errcontext("while waiting on promotion")));
    }

    ereport(WARNING,
            (errmsg_plural("server did not promote within %d second",
                           "server did not promote within %d seconds",
                           wait_seconds,
                           wait_seconds)));
    PG_RETURN_BOOL(false);
}
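The logic of this code is not hard to follow. It calls RecoveryInProgress() to determine whether the command is running on a standby or on a primary. A primary obviously does not need to be promoted, so the command can only be executed on a standby. It then creates a signal file named promote in the database cluster's data directory. A signal file is a file whose mere existence is the signal; its contents do not matter. Once the signal file has been created, the function sends SIGUSR1 to the postmaster via kill(PostmasterPid, SIGUSR1); if that fails, it removes the signal file again and reports an error. If wait is false, the function returns true immediately. Otherwise it polls RecoveryInProgress() roughly every 100 milliseconds: as soon as that function returns false, the promotion has succeeded and pg_promote() returns true to the user. If the standby has still not been promoted after wait_seconds seconds, pg_promote() emits a warning and returns false.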
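The request side described above can be sketched outside PostgreSQL in plain POSIX C. This is a minimal sketch under stated assumptions, not PostgreSQL code: request_promotion(), wait_for_promotion(), and the demo_* stand-ins are hypothetical names invented for illustration, fopen()/fclose() replace AllocateFile()/FreeFile(), and nanosleep() replaces the latch-based WaitLatch() call, so there is no postmaster-death detection here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define POLLS_PER_SECOND 10     /* mirrors WAITS_PER_SECOND in pg_promote() */

/* Demo stand-ins (assumptions of this sketch, not PostgreSQL symbols). */
volatile sig_atomic_t demo_got_signal = 0;
int         demo_polls_left = 3;

/* Plays the role of the receiver's signal handler: just records delivery. */
void
demo_on_signal(int signo)
{
    (void) signo;
    demo_got_signal = 1;
}

/* Plays the role of RecoveryInProgress(): true for the first few polls. */
bool
demo_still_recovering(void)
{
    return demo_polls_left-- > 0;
}

/*
 * Hypothetical helper: create an empty signal file, then notify `target`.
 * If the signal cannot be sent, remove the file again so that no stale
 * request is left behind (the same rollback pg_promote() performs).
 */
bool
request_promotion(const char *path, pid_t target, int signo)
{
    FILE       *f = fopen(path, "w");

    if (f == NULL)
        return false;
    if (fclose(f) != 0)
        return false;
    if (kill(target, signo) != 0)
    {
        (void) unlink(path);
        return false;
    }
    return true;
}

/*
 * Hypothetical helper: poll `still_recovering` at most
 * POLLS_PER_SECOND * wait_seconds times, sleeping 100 ms after each poll.
 * Returns true as soon as the predicate reports false, false on timeout.
 */
bool
wait_for_promotion(bool (*still_recovering) (void), int wait_seconds)
{
    const struct timespec tick = {0, 1000000000L / POLLS_PER_SECOND};

    assert(wait_seconds > 0);   /* pg_promote() rejects values <= 0 */
    for (int i = 0; i < POLLS_PER_SECOND * wait_seconds; i++)
    {
        if (!still_recovering())
            return true;
        nanosleep(&tick, NULL);
    }
    return false;
}
```

A caller would install demo_on_signal for SIGUSR1, call request_promotion() with its own pid to see the file appear and the handler fire, and then call wait_for_promotion(demo_still_recovering, 1). The ordering is the point of the design: the file is created before the signal is sent, so the receiver always finds the file when it wakes up.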
Clearly, the next thing to look at is what the postmaster does once it receives SIGUSR1. The relevant code is:
if (StartupPID != 0 && /// This means the startup process is currently running.
    (pmState == PM_STARTUP || pmState == PM_RECOVERY ||
     pmState == PM_HOT_STANDBY) &&
    CheckPromoteSignal()) /// CheckPromoteSignal() checks whether the promote file exists and returns true if it does.
{
    /*
     * Tell startup process to finish recovery.
     *
     * Leave the promote signal file in place and let the Startup process
     * do the unlink.
     */
    signal_child(StartupPID, SIGUSR2);
}
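When the postmaster receives SIGUSR1, it runs the logic above. CheckPromoteSignal() checks whether a promote file exists in the database cluster's data directory. If the file exists, StartupPID != 0, and the postmaster is in one of the three states PM_STARTUP, PM_RECOVERY, or PM_HOT_STANDBY, the postmaster sends SIGUSR2 to the startup process. Note that the postmaster does not delete the promote file; that is left to the startup process.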
Next, we look at what the startup process does once it receives SIGUSR2.