Xiaobu's Source Code Analysis Notes 03: WAL-Related Source Code Analysis - PostgreSQL Database - 樱桃溪学院

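xiaobu 5 months ago 224

Analysis of the function pg_backup_start()

As we know, physical backup tools ultimately rely on the low-level pg_backup_start() and pg_backup_stop() logic to take a physical backup of a PG database. When a user runs SELECT pg_backup_start(...) in psql, the entry point is the pg_backup_start() function in xlogfuncs.c. Its complete code is as follows: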

/*
 * pg_backup_start: set up for taking an on-line backup dump
 *
 * Essentially what this does is to create the contents required for the
 * backup_label file and the tablespace map.
 *
 * Permission checking for this function is managed through the normal
 * GRANT system.
 */
Datum
pg_backup_start(PG_FUNCTION_ARGS)
{
	text	   *backupid = PG_GETARG_TEXT_PP(0);
	bool		fast = PG_GETARG_BOOL(1);
	char	   *backupidstr;
	SessionBackupState status = get_backup_status();
	MemoryContext oldcontext;

	backupidstr = text_to_cstring(backupid);

	if (status == SESSION_BACKUP_RUNNING)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("a backup is already in progress in this session")));

	/*
	 * backup_state and tablespace_map need to be long-lived as they are used
	 * in pg_backup_stop().  These are allocated in a dedicated memory context
	 * child of TopMemoryContext, deleted at the end of pg_backup_stop().  If
	 * an error happens before ending the backup, memory would be leaked in
	 * this context until pg_backup_start() is called again.
	 */
	if (backupcontext == NULL)
	{
		backupcontext = AllocSetContextCreate(TopMemoryContext,
											  "on-line backup context",
											  ALLOCSET_START_SMALL_SIZES);
	}
	else
	{
		backup_state = NULL;
		tablespace_map = NULL;
		MemoryContextReset(backupcontext);
	}

	oldcontext = MemoryContextSwitchTo(backupcontext);
	backup_state = (BackupState *) palloc0(sizeof(BackupState));
	tablespace_map = makeStringInfo();
	MemoryContextSwitchTo(oldcontext);

	register_persistent_abort_backup_handler();
	do_pg_backup_start(backupidstr, fast, NULL, backup_state, tablespace_map);

	PG_RETURN_LSN(backup_state->startpoint);
}

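pg_backup_start() takes two arguments. The first is the string backupid, a label describing this backup. It is meant for human readers only; PG does not interpret it. The second argument, fast, is a boolean. As we know, the main job of a checkpoint is to write all dirty pages in the shared buffer pool (shared buffers) to their data files on disk. A checkpoint can run at full speed and finish flushing dirty pages as quickly as possible, but that causes a sudden spike in disk I/O load. The second mode therefore spreads the flushing as evenly as possible over a time interval, avoiding a burst of disk I/O. The first mode is called fast and the second not fast. The second argument of pg_backup_start() controls which of the two the checkpoint uses. If you specify fast = false, pg_backup_start() may take a long time to return.

The logic of the function is easy to follow. It first checks whether the session is already in backup mode, using the condition (status == SESSION_BACKUP_RUNNING). If it is, this session has already executed pg_backup_start(), so the function raises an error and exits.

The function then prepares a dedicated memory context, backupcontext: it creates the context as a child of TopMemoryContext on first use, or resets it if one is left over from an earlier call. From then on, all memory needed by the backup is allocated in this context, which is deleted in pg_backup_stop().

The real work is done by do_pg_backup_start(), whose logic we analyze later.

The value returned by pg_backup_start() is the REDO point of the checkpoint. The LSN of this point is where recovery will start when we restore from this backup.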

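Latest replies (3)

xiaobu 5 months ago

#2

Understanding the function do_pg_backup_start()

As we know, this function first forces full-page writes on and then runs a checkpoint. Yet when we read through its source code, we find no statement that explicitly forces full-page writes. What we do find is this code: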

	/*
	 * Mark backup active in shared memory.  We must do full-page WAL writes
	 * during an on-line backup even if not doing so at other times, because
	 * it's quite possible for the backup dump to obtain a "torn" (partially
	 * written) copy of a database page if it reads the page concurrently with
	 * our write to the same page.  This can be fixed as long as the first
	 * write to the page in the WAL sequence is a full-page write. Hence, we
	 * increment runningBackups then force a CHECKPOINT, to ensure there are
	 * no dirty pages in shared memory that might get dumped while the backup
	 * is in progress without having a corresponding WAL record.  (Once the
	 * backup is complete, we need not force full-page writes anymore, since
	 * we expect that any pages not modified during the backup interval must
	 * have been correctly captured by the backup.)
	 *
	 * Note that forcing full-page writes has no effect during an online
	 * backup from the standby.
	 *
	 * We must hold all the insertion locks to change the value of
	 * runningBackups, to ensure adequate interlocking against
	 * XLogInsertRecord().
	 */
	WALInsertLockAcquireExclusive();
	XLogCtl->Insert.runningBackups++;
	WALInsertLockRelease();

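This code is simple: it increments runningBackups by one. Because the value lives in shared memory, the increment has to be protected. The function acquires all WAL insertion locks exclusively (these are lightweight locks, not a spinlock), increments the counter, and then releases the locks. The long comment above is also informative. It says that a hot backup can capture a torn page, that is, a partially written, corrupted block. But if full-page writes are forced and a checkpoint is then run, reading a torn page does no harm: during a future recovery the torn page is repaired by the full-page WAL record, because that record contains a complete, undamaged image of the block.

It looks as if a nonzero runningBackups is enough to guarantee full-page writes. The following code confirms this guess: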

/*              
 * doPageWrites is this backend's local copy of (fullPageWrites ||
 * runningBackups > 0).  It is used together with RedoRecPtr to decide whether
 * a full-page image of a page need to be taken.
 *                      
 * NB: Initially this is false, and there's no guarantee that it will be
 * initialized to any other value before it is first used. Any code that
 * makes use of it must recheck the value after obtaining a WALInsertLock,
 * and respond appropriately if it turns out that the previous value wasn't
 * accurate.
 */
static bool doPageWrites;

doPageWrites = (Insert->fullPageWrites || Insert->runningBackups > 0);

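The boolean variable doPageWrites indicates whether full-page writes are in effect. The code above shows clearly that doPageWrites = true is decided by two conditions joined by OR, and one of them is runningBackups > 0. In other words, as long as runningBackups is nonzero, full-page writes are in effect.

With this in mind, the statement in do_pg_backup_start() makes sense: simply incrementing runningBackups is enough to force the database into full-page write mode.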
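
To make the relationship concrete, here is a minimal sketch that models the counter outside of PostgreSQL. The names DemoInsertState, demo_do_page_writes(), demo_backup_start() and demo_backup_stop() are hypothetical, locking is left out entirely, and the decrement at the end of a backup is an assumption that is not shown in the excerpts above.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the shared WAL insert state. */
typedef struct DemoInsertState
{
	bool		fullPageWrites;	/* mirrors the full_page_writes setting */
	int			runningBackups;	/* number of backups currently in progress */
} DemoInsertState;

/* Same expression as the one assigned to doPageWrites in the source. */
static bool
demo_do_page_writes(const DemoInsertState *insert)
{
	return insert->fullPageWrites || insert->runningBackups > 0;
}

/* Models the increment in do_pg_backup_start() (locking omitted). */
static void
demo_backup_start(DemoInsertState *insert)
{
	insert->runningBackups++;
}

/* Assumed counterpart at the end of a backup (locking omitted). */
static void
demo_backup_stop(DemoInsertState *insert)
{
	insert->runningBackups--;
}
```

Because runningBackups is a counter rather than a flag, several concurrent backups can each add one, and full-page writes stay forced until the last of them has finished.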

Let us continue with the code that follows in do_pg_backup_start():

		/*
		 * Force an XLOG file switch before the checkpoint, to ensure that the
		 * WAL segment the checkpoint is written to doesn't contain pages with
		 * old timeline IDs.  That would otherwise happen if you called
		 * pg_backup_start() right after restoring from a PITR archive: the
		 * first WAL segment containing the startup checkpoint has pages in
		 * the beginning with the old timeline ID.  That can cause trouble at
		 * recovery: we won't have a history file covering the old timeline if
		 * pg_wal directory was not included in the base backup and the WAL
		 * archive was cleared too before starting the backup.
		 *
		 * This also ensures that we have emitted a WAL page header that has
		 * XLP_BKP_REMOVABLE off before we emit the checkpoint record.
		 * Therefore, if a WAL archiver (such as pglesslog) is trying to
		 * compress out removable backup blocks, it won't remove any that
		 * occur after this point.
		 *
		 * During recovery, we skip forcing XLOG file switch, which means that
		 * the backup taken during recovery is not available for the special
		 * recovery case described above.
		 */
		if (!backup_started_in_recovery)
			RequestXLogSwitch(false);

This code means: if the backup is taken on the primary, that is, backup_started_in_recovery = false, a WAL file switch is performed.

			/*
			 * Force a CHECKPOINT.  Aside from being necessary to prevent torn
			 * page problems, this guarantees that two successive backup runs
			 * will have different checkpoint positions and hence different
			 * history file names, even if nothing happened in between.
			 *
			 * During recovery, establish a restartpoint if possible. We use
			 * the last restartpoint as the backup starting checkpoint. This
			 * means that two successive backup runs can have same checkpoint
			 * positions.
			 *
			 * Since the fact that we are executing do_pg_backup_start()
			 * during recovery means that checkpointer is running, we can use
			 * RequestCheckpoint() to establish a restartpoint.
			 *
			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
			 * passing fast = true).  Otherwise this can take awhile.
			 */
			RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
							  (fast ? CHECKPOINT_IMMEDIATE : 0));

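This call runs a checkpoint. Underneath, it sends a SIGINT signal to the checkpointer process via kill(). On receiving the signal, the checkpointer inspects the request settings in shared memory and then performs the checkpoint. The CHECKPOINT_WAIT flag means the caller must wait until the checkpoint has finished before continuing, so by the time RequestCheckpoint() returns, the checkpoint is complete. Whether CHECKPOINT_IMMEDIATE is passed to the checkpointer process depends on whether fast is true or false. This code makes the meaning of the fast parameter clear.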
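
The way the flag word is assembled can be sketched in isolation. The DEMO_ constants below are illustrative bit values chosen for this example; the real definitions live in PostgreSQL's headers and should be checked there. demo_backup_checkpoint_flags() is a hypothetical helper, not a PostgreSQL function.

```c
#include <stdbool.h>

/* Illustrative bit values only; see PostgreSQL's xlog.h for the real ones. */
#define DEMO_CHECKPOINT_IMMEDIATE	0x0004
#define DEMO_CHECKPOINT_FORCE		0x0008
#define DEMO_CHECKPOINT_WAIT		0x0020

/* Builds the flag word the same way the RequestCheckpoint() call above does. */
static int
demo_backup_checkpoint_flags(bool fast)
{
	return DEMO_CHECKPOINT_FORCE | DEMO_CHECKPOINT_WAIT |
		(fast ? DEMO_CHECKPOINT_IMMEDIATE : 0);
}
```

FORCE and WAIT are always set for a backup; the only bit that the user's fast argument changes is the IMMEDIATE one.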

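xiaobu 4 months ago

#3
How do we decide whether a file name is a valid WAL file name?

As we know, WAL file names in PG follow a fixed pattern, and users cannot choose them. A valid WAL file name is 24 characters long, and each character is one of 0-9 or A-F. With that, the following check function is easy to understand: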
```
/* Length of XLog file name */
#define XLOG_FNAME_LEN	   24

static inline bool
IsXLogFileName(const char *fname)
{
	return (strlen(fname) == XLOG_FNAME_LEN && \
			strspn(fname, "0123456789ABCDEF") == XLOG_FNAME_LEN);
}
```
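The logic is simple. It first checks that the name is 24 characters long, i.e. strlen(fname) == XLOG_FNAME_LEN, and then uses strspn to check that all 24 characters come from "0123456789ABCDEF". If both conditions hold, the name is considered a valid WAL file name.

This check is still looser than it could be. The minimum WAL file size is 1MB and the maximum is 1GB. The last 8 characters number the WAL files within each 4GB range of WAL, so they run only from 00000000 to 00000FFF for 1MB files and from 00000000 to 00000003 for 1GB files. When the WAL file size is 1MB, counting from the end (the last character is number 1), characters 4, 5, 6, 7 and 8 are all 0. When it is 1GB, characters 2 through 8 counting from the end are all 0. The code above does not take these rules into account.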
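
A stricter check along these lines could look like the sketch below. demo_is_xlog_file_name_strict() is a hypothetical function written for this note and is not part of PostgreSQL; it assumes the caller knows the WAL file size in bytes.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_XLOG_FNAME_LEN 24

/*
 * Hypothetical stricter check: besides the 24 uppercase hex characters,
 * require that the last 8 characters encode a segment index smaller than
 * the number of WAL files per 4GB range for the given WAL file size.
 */
static bool
demo_is_xlog_file_name_strict(const char *fname, uint64_t wal_segsz_bytes)
{
	uint64_t	segs_per_id;
	unsigned long seg;

	if (wal_segsz_bytes == 0)
		return false;			/* avoid dividing by zero */

	if (strlen(fname) != DEMO_XLOG_FNAME_LEN ||
		strspn(fname, "0123456789ABCDEF") != DEMO_XLOG_FNAME_LEN)
		return false;

	segs_per_id = UINT64_C(0x100000000) / wal_segsz_bytes;
	seg = strtoul(fname + 16, NULL, 16);	/* last 8 hex digits */

	return seg < segs_per_id;
}
```

For example, with 16MB WAL files the name 0000000100000ABC00000100 passes the original IsXLogFileName() test but fails this one, because only the values 00 to FF can appear in the last part.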

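xiaobu 4 months ago

#4
Source code analysis of the function pg_walfile_name()

When we have an LSN and want to know which WAL file it is stored in, we can use the function pg_walfile_name(). The following example shows how it is used: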
```
postgres=# SELECT pg_walfile_name('ABC/12345DEF');
     pg_walfile_name      
--------------------------
 0000000100000ABC00000012
(1 row)
```
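From the result above, we can see that the LSN ABC/12345DEF lives in the WAL file 0000000100000ABC00000012. This assumes that the WAL file size is 16MB and that the current timeline is 1.

Now let us look at the source code of this function: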
```
/*
 * Compute an xlog file name given a WAL location,
 * such as is returned by pg_backup_stop() or pg_switch_wal().
 */
Datum
pg_walfile_name(PG_FUNCTION_ARGS)
{
	XLogSegNo	xlogsegno;
	XLogRecPtr	locationpoint = PG_GETARG_LSN(0);
	char		xlogfilename[MAXFNAMELEN];

	if (RecoveryInProgress())
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("recovery is in progress"),
				 errhint("%s cannot be executed during recovery.",
						 "pg_walfile_name()")));

	XLByteToSeg(locationpoint, xlogsegno, wal_segment_size);
	XLogFileName(xlogfilename, GetWALInsertionTimeLine(), xlogsegno,
				 wal_segment_size);

	PG_RETURN_TEXT_P(cstring_to_text(xlogfilename));
}
```
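The function takes a single argument, locationpoint, which is an 8-byte LSN. It can only be run on the primary, that is, when RecoveryInProgress() returns false. I think this restriction is unnecessary, though.

The function has two main steps: the first is XLByteToSeg and the second is XLogFileName. Let us look at XLByteToSeg first: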
```
/*
 * Compute a segment number from an XLogRecPtr.
 *
 * For XLByteToSeg, do the computation at face value.  For XLByteToPrevSeg,
 * a boundary byte is taken to be in the previous segment.  This is suitable
 * for deciding which segment to write given a pointer to a record end,
 * for example.
 */
#define XLByteToSeg(xlrp, logSegNo, wal_segsz_bytes) \
	logSegNo = (xlrp) / (wal_segsz_bytes)
```
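XLByteToSeg is a simple macro that computes the number of the WAL file containing an LSN. This number is one-dimensional, a single flat index. For example, take the 10 numbers 0 to 9 and put every 2 elements into one group, which gives 5 groups numbered 0 to 4. Similarly, dividing the LSN by the WAL file size gives the number of the WAL file that contains it.

In our example, wal_segsz_bytes is 2^24 bytes and the LSN is 0xABC12345DEF. Dividing this number by 2^24 is the same as shifting it right by 24 bits, which gives 0xABC12. That is the number of the WAL file containing this LSN.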
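
The arithmetic can be checked with a small standalone sketch. The demo_ functions are hypothetical helpers written for this note. demo_byte_to_prev_seg() follows the comment's description of XLByteToPrevSeg, whose definition is not quoted above, so treat it as an assumption.

```c
#include <stdint.h>

/* Combine the two halves of the textual "hi/lo" LSN form into 64 bits. */
static uint64_t
demo_make_lsn(uint32_t hi, uint32_t lo)
{
	return ((uint64_t) hi << 32) | lo;
}

/* Function form of XLByteToSeg: plain integer division. */
static uint64_t
demo_byte_to_seg(uint64_t lsn, uint64_t wal_segsz_bytes)
{
	return lsn / wal_segsz_bytes;
}

/*
 * Variant matching the comment's description of XLByteToPrevSeg: a position
 * exactly on a WAL file boundary counts as part of the previous file.
 * Assumes lsn > 0.
 */
static uint64_t
demo_byte_to_prev_seg(uint64_t lsn, uint64_t wal_segsz_bytes)
{
	return (lsn - 1) / wal_segsz_bytes;
}
```

The two functions differ only for a position that sits exactly on a boundary. With 16MB files, position 0x2000000 is the first byte of file 2, yet a record ending there was written into file 1.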

With this number in hand, the function XLogFileName() is called next. Here is its source code:
```
/*
 * Generate a WAL segment file name.  Do not use this function in a helper
 * function allocating the result generated.
 */
static inline void
XLogFileName(char *fname, TimeLineID tli, XLogSegNo logSegNo, int wal_segsz_bytes)
{
	snprintf(fname, MAXFNAMELEN, "%08X%08X%08X", tli,
			 (uint32) (logSegNo / XLogSegmentsPerXLogId(wal_segsz_bytes)),
			 (uint32) (logSegNo % XLogSegmentsPerXLogId(wal_segsz_bytes)));
}
```
The timeline is not our main concern here. The key question is how the WAL file name is built from the values of logSegNo and wal_segsz_bytes.
```
#define XLogSegmentsPerXLogId(wal_segsz_bytes)	\
	(UINT64CONST(0x100000000) / (wal_segsz_bytes)) 
```
Clearly, XLogSegmentsPerXLogId(2^24) evaluates to 256.

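So when logSegNo is 0xABC12, we get 0xABC12 / 256 = 0xABC and 0xABC12 % 256 = 0x12. With timeline 1, the WAL file for the LSN ABC/12345DEF is as follows (for clarity, the three parts of the name are separated with "."):

00000001.00000ABC.00000012

Removing the "." separators gives the final result: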

0000000100000ABC00000012

Walk through the code above, doing the arithmetic by hand where needed, and the algorithm will become clear.
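
For readers who prefer to let the machine do the arithmetic, the whole computation fits in one standalone function. demo_walfile_name() is a hypothetical helper that combines the XLByteToSeg and XLogFileName steps shown above; unlike pg_walfile_name(), it takes the timeline and the WAL file size as explicit arguments instead of reading them from the server.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical standalone version of the two steps in pg_walfile_name():
 * compute the WAL file number, then format the 24-character name.
 */
static void
demo_walfile_name(char *fname, size_t fname_len, uint32_t tli,
				  uint64_t lsn, uint64_t wal_segsz_bytes)
{
	uint64_t	segno = lsn / wal_segsz_bytes;
	uint64_t	segs_per_id = UINT64_C(0x100000000) / wal_segsz_bytes;

	snprintf(fname, fname_len, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
			 tli,
			 (uint32_t) (segno / segs_per_id),
			 (uint32_t) (segno % segs_per_id));
}
```

Calling it with timeline 1, LSN 0xABC12345DEF and a WAL file size of 2^24 reproduces the name from the psql example. Changing the size to 2^20 changes only the last part of the name, since the same LSN now falls into file 0x123 of the same 4GB range.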
