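Analysis of the pg_backup_start() function
As we know, physical backup tools ultimately rely on the low-level pg_backup_start() and pg_backup_stop() machinery to take a physical backup of a PostgreSQL (PG) database. When a user runs SELECT pg_backup_start() in psql, the entry point is the pg_backup_start() function in xlogfuncs.c. Its complete code is as follows: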
/*
 * pg_backup_start: set up for taking an on-line backup dump
 *
 * Essentially what this does is to create the contents required for the
 * backup_label file and the tablespace map.
 *
 * Permission checking for this function is managed through the normal
 * GRANT system.
 */
Datum
pg_backup_start(PG_FUNCTION_ARGS)
{
    text       *backupid = PG_GETARG_TEXT_PP(0);
    bool        fast = PG_GETARG_BOOL(1);
    char       *backupidstr;
    SessionBackupState status = get_backup_status();
    MemoryContext oldcontext;

    backupidstr = text_to_cstring(backupid);

    if (status == SESSION_BACKUP_RUNNING)
        ereport(ERROR,
                (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                 errmsg("a backup is already in progress in this session")));

    /*
     * backup_state and tablespace_map need to be long-lived as they are used
     * in pg_backup_stop().  These are allocated in a dedicated memory context
     * child of TopMemoryContext, deleted at the end of pg_backup_stop().  If
     * an error happens before ending the backup, memory would be leaked in
     * this context until pg_backup_start() is called again.
     */
    if (backupcontext == NULL)
    {
        backupcontext = AllocSetContextCreate(TopMemoryContext,
                                              "on-line backup context",
                                              ALLOCSET_START_SMALL_SIZES);
    }
    else
    {
        backup_state = NULL;
        tablespace_map = NULL;
        MemoryContextReset(backupcontext);
    }

    oldcontext = MemoryContextSwitchTo(backupcontext);
    backup_state = (BackupState *) palloc0(sizeof(BackupState));
    tablespace_map = makeStringInfo();
    MemoryContextSwitchTo(oldcontext);

    register_persistent_abort_backup_handler();

    do_pg_backup_start(backupidstr, fast, NULL, backup_state, tablespace_map);

    PG_RETURN_LSN(backup_state->startpoint);
}
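pg_backup_start() takes two arguments. The first, backupid, is a string that labels this backup. It is meant for human readers only; PG does not interpret it. The second, fast, is a boolean. Recall that the main job of a checkpoint is to write every dirty page in shared buffers out to the corresponding data files on disk. A checkpoint can run at full speed and finish flushing the dirty pages as quickly as possible, but that causes a sudden spike in disk I/O. The alternative mode therefore spreads the flushing as evenly as possible over a period of time, avoiding a burst of disk I/O. The first mode is called fast and the second not fast. The second argument of pg_backup_start() selects between them. If you pass fast = false, pg_backup_start() may take a long time to return.
The logic of this function is straightforward. It first checks whether the session is already in backup mode, using the condition (status == SESSION_BACKUP_RUNNING). If so, this session has already called pg_backup_start(), and the function raises an error and exits.
Next, the function sets up a memory context named backupcontext: it creates the context on the first call and resets it if one is left over from an earlier backup. All memory that the backup needs from then on is allocated in this context, which is deleted in pg_backup_stop().
The real work is done by do_pg_backup_start(), whose logic we analyze below.
pg_backup_start() returns the REDO point of the checkpoint. The LSN of this point is where recovery will start when this backup is later used for a restore.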
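To make the returned value more concrete, here is a minimal sketch of how a 64-bit LSN is conventionally displayed, as two hexadecimal halves separated by a slash. The type DemoLsn and the helper demo_format_lsn() are hypothetical names invented for this illustration; they are not part of the PG source.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-in for PostgreSQL's XLogRecPtr, a 64-bit position in the WAL. */
typedef uint64_t DemoLsn;

/*
 * Hypothetical helper: format an LSN as the high 32 bits and the low
 * 32 bits in hexadecimal, separated by a slash.
 * Returns the number of characters written, excluding the terminator.
 */
static int
demo_format_lsn(DemoLsn lsn, char *buf, size_t buflen)
{
    /* "FFFFFFFF/FFFFFFFF" plus the terminating NUL needs 18 bytes */
    assert(buf != NULL && buflen >= 18);

    return snprintf(buf, buflen, "%X/%X",
                    (unsigned int) (lsn >> 32),
                    (unsigned int) (lsn & 0xFFFFFFFFu));
}
```

For instance, calling demo_format_lsn() with the value 0x0000000103000028 fills the buffer with 1/3000028.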
-
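Understanding the do_pg_backup_start() function
As we know, this function first forces full-page writes on and then performs a checkpoint. Yet when we read its source code, we find no statement that explicitly turns full-page writes on. What we find instead is this code: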
    /*
     * Mark backup active in shared memory.  We must do full-page WAL writes
     * during an on-line backup even if not doing so at other times, because
     * it's quite possible for the backup dump to obtain a "torn" (partially
     * written) copy of a database page if it reads the page concurrently with
     * our write to the same page.  This can be fixed as long as the first
     * write to the page in the WAL sequence is a full-page write.  Hence, we
     * increment runningBackups then force a CHECKPOINT, to ensure there are
     * no dirty pages in shared memory that might get dumped while the backup
     * is in progress without having a corresponding WAL record.  (Once the
     * backup is complete, we need not force full-page writes anymore, since
     * we expect that any pages not modified during the backup interval must
     * have been correctly captured by the backup.)
     *
     * Note that forcing full-page writes has no effect during an online
     * backup from the standby.
     *
     * We must hold all the insertion locks to change the value of
     * runningBackups, to ensure adequate interlocking against
     * XLogInsertRecord().
     */
    WALInsertLockAcquireExclusive();
    XLogCtl->Insert.runningBackups++;
    WALInsertLockRelease();
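The code above is simple: it increments runningBackups by one. Because this value lives in shared memory, the function first acquires all the WAL insertion locks in exclusive mode, performs the increment, and then releases the locks.
The long comment is also informative. It says that an online backup can copy a torn page, that is, a partially written and therefore corrupt block. However, if full-page writes are forced and a checkpoint is then performed, reading a torn page does no harm: during a later recovery the torn page is repaired from the full-page WAL record, which itself holds an intact image of the block.
So it appears that a nonzero runningBackups is enough to guarantee full-page writes. The following code confirms this guess: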
/*
 * doPageWrites is this backend's local copy of (fullPageWrites ||
 * runningBackups > 0).  It is used together with RedoRecPtr to decide whether
 * a full-page image of a page need to be taken.
 *
 * NB: Initially this is false, and there's no guarantee that it will be
 * initialized to any other value before it is first used.  Any code that
 * makes use of it must recheck the value after obtaining a WALInsertLock,
 * and respond appropriately if it turns out that the previous value wasn't
 * accurate.
 */
static bool doPageWrites;

    doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);
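The boolean doPageWrites indicates whether full-page writes are in effect. The code makes it clear that doPageWrites is true when either of two conditions holds (they are combined with OR), and one of them is runningBackups > 0. In other words, as long as runningBackups is nonzero, the database is in full-page-write mode.
With this in mind, the statement in do_pg_backup_start() makes sense: simply incrementing runningBackups forces the database into full-page-write mode.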
Let us continue with the next part of do_pg_backup_start():
    /*
     * Force an XLOG file switch before the checkpoint, to ensure that the
     * WAL segment the checkpoint is written to doesn't contain pages with
     * old timeline IDs.  That would otherwise happen if you called
     * pg_backup_start() right after restoring from a PITR archive: the
     * first WAL segment containing the startup checkpoint has pages in
     * the beginning with the old timeline ID.  That can cause trouble at
     * recovery: we won't have a history file covering the old timeline if
     * pg_wal directory was not included in the base backup and the WAL
     * archive was cleared too before starting the backup.
     *
     * This also ensures that we have emitted a WAL page header that has
     * XLP_BKP_REMOVABLE off before we emit the checkpoint record.
     * Therefore, if a WAL archiver (such as pglesslog) is trying to
     * compress out removable backup blocks, it won't remove any that
     * occur after this point.
     *
     * During recovery, we skip forcing XLOG file switch, which means that
     * the backup taken during recovery is not available for the special
     * recovery case described above.
     */
    if (!backup_started_in_recovery)
        RequestXLogSwitch(false);
This code means: if the backup is taken on the primary, that is, backup_started_in_recovery = false, a WAL file switch is performed.
    /*
     * Force a CHECKPOINT.  Aside from being necessary to prevent torn
     * page problems, this guarantees that two successive backup runs
     * will have different checkpoint positions and hence different
     * history file names, even if nothing happened in between.
     *
     * During recovery, establish a restartpoint if possible.  We use
     * the last restartpoint as the backup starting checkpoint.  This
     * means that two successive backup runs can have same checkpoint
     * positions.
     *
     * Since the fact that we are executing do_pg_backup_start()
     * during recovery means that checkpointer is running, we can use
     * RequestCheckpoint() to establish a restartpoint.
     *
     * We use CHECKPOINT_IMMEDIATE only if requested by user (via
     * passing fast = true).  Otherwise this can take awhile.
     */
    RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
                      (fast ? CHECKPOINT_IMMEDIATE : 0));
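This call performs a checkpoint. Under the hood, in the version analyzed here, it sends a SIGINT signal to the checkpointer process via kill(). On receiving the signal, the checkpointer inspects the relevant settings in shared memory and then runs the checkpoint. The CHECKPOINT_WAIT flag means the caller must wait until the checkpoint has finished before continuing, so by the time RequestCheckpoint() returns, the checkpoint is complete. Whether CHECKPOINT_IMMEDIATE is passed to the checkpointer process depends on whether fast is true or false. This code makes the meaning of the fast parameter clear.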
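The way the fast argument turns into a flag bit can be sketched in isolation. The DEMO_* macros echo the CHECKPOINT_* names used above, but their numeric values are assumptions made for this sketch, and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Illustrative flag bits. Only the fact that each flag occupies its own
 * bit matters here; the concrete values are assumed, not quoted.
 */
#define DEMO_CHECKPOINT_IMMEDIATE   0x0004
#define DEMO_CHECKPOINT_FORCE       0x0008
#define DEMO_CHECKPOINT_WAIT        0x0020

/* Builds the flag word the same way the RequestCheckpoint() call does. */
static int
demo_backup_checkpoint_flags(bool fast)
{
    int         flags = DEMO_CHECKPOINT_FORCE | DEMO_CHECKPOINT_WAIT |
        (fast ? DEMO_CHECKPOINT_IMMEDIATE : 0);

    /* a backup always forces the checkpoint and waits for it to finish */
    assert((flags & DEMO_CHECKPOINT_FORCE) && (flags & DEMO_CHECKPOINT_WAIT));
    return flags;
}

/* True if the requested checkpoint should run at full speed. */
static bool
demo_is_immediate(int flags)
{
    return (flags & DEMO_CHECKPOINT_IMMEDIATE) != 0;
}
```

With these assumed values, fast = true yields 0x2C and fast = false yields 0x28: the two results differ only in the IMMEDIATE bit, which is all that the fast parameter controls.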
-
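How to tell whether a file name is a valid WAL file name?
As we know, WAL file names in PG follow a fixed pattern, and users cannot choose them freely. A valid WAL file name is 24 characters long, and every character is one of 0-9 or A-F. That makes the following check easy to understand: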
/* Length of XLog file name */
#define XLOG_FNAME_LEN     24

static inline bool
IsXLogFileName(const char *fname)
{
    return (strlen(fname) == XLOG_FNAME_LEN && \
            strspn(fname, "0123456789ABCDEF") == XLOG_FNAME_LEN);
}
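The logic is simple: it first checks that the file name is 24 characters long, i.e. strlen(fname) == XLOG_FNAME_LEN, and then uses strspn() to check that the first 24 characters all come from "0123456789ABCDEF". If both conditions hold, the name is judged to be a valid WAL file name.
This check is still incomplete, though. The WAL segment size is at least 1 MB and at most 1 GB. With 1 MB segments, counting from the end of the name (the last character being number 1), characters 4 through 8 are all 0. With 1 GB segments, characters 2 through 8 counted from the end are all 0. The code above does not take these patterns into account.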
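One way to account for these patterns is sketched below. demo_is_xlog_file_name() is a hypothetical, stricter variant written for this article, not a function from the PG source. It assumes that the last 8 hex digits encode the segment number within one 4 GB range of WAL, and that wal_segsz_bytes is a power of two between 1 MB and 1 GB.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_XLOG_FNAME_LEN 24

/*
 * Hypothetical stricter check: besides length and character set, verify
 * that the trailing 8 hex digits are in range for the given segment size.
 */
static bool
demo_is_xlog_file_name(const char *fname, uint64_t wal_segsz_bytes)
{
    uint64_t    segs_per_xlogid;
    unsigned long seg;

    assert(fname != NULL && wal_segsz_bytes > 0);

    /* the original two tests: 24 characters, all upper-case hex digits */
    if (strlen(fname) != DEMO_XLOG_FNAME_LEN ||
        strspn(fname, "0123456789ABCDEF") != DEMO_XLOG_FNAME_LEN)
        return false;

    /* 0x100000000 bytes are divided into equal-sized segments */
    segs_per_xlogid = UINT64_C(0x100000000) / wal_segsz_bytes;

    /* the last 8 characters hold the segment number in hexadecimal */
    seg = strtoul(fname + 16, NULL, 16);

    return (uint64_t) seg < segs_per_xlogid;
}
```

With 1 MB segments there are 4096 segments per 4 GB, so the last group may range from 00000000 to 00000FFF; with 1 GB segments there are only 4, so it may range from 00000000 to 00000003. That matches the zero patterns described above and additionally bounds the digits that are not forced to zero.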