小布源码分析札记-00 : 随笔-PostgreSQL数据库-樱桃溪学院

小布源码分析札记-00 : 随笔

xiaobu 5月前 427

WAL文件的合法尺寸是多少？

我们知道，在使用initdb创建数据库集群时，有一个参数可以指定WAL文件的大小：

postgres@ip-172-31-29-179:~$ initdb --help | grep wal-segsize
      --wal-segsize=SIZE    size of WAL segments, in megabytes

缺省情况下，SIZE的大小是16，因为单位是megabytes，所以大小是16MB。如果我想指定别的值，怎么指定呢？让我们翻看源代码，在源代码中有一个IsValidWalSegSize()的宏定义，它可以帮我们检测一个size是否是合法的WAL文件的大小，其定义在src/include/access/xlog_internal.h给出：

/* wal_segment_size can range from 1MB to 1GB */
#define WalSegMinSize 1024 * 1024
#define WalSegMaxSize 1024 * 1024 * 1024

/* check that the given size is a valid wal_segment_size */
#define IsPowerOf2(x) (x > 0 && ((x) & ((x)-1)) == 0)
#define IsValidWalSegSize(size) \
	 (IsPowerOf2(size) && \
	 ((size) >= WalSegMinSize && (size) <= WalSegMaxSize))

上述代码不难理解。其中IsPowerOf2()检测这个数值是否是2的指数倍。这种数值的二进制表示形式的特点是：只有最高位为1，其余都是0，如100,1000,10000,10000000等。IsValidWalSegSize()的另外一个检查条件是判断这个值是否在1MB和1GB之间。

根据上述代码，我们可以得知：如果指定的WAL文件的大小(单位是byte)SIZE是2的指数倍，且在1MB和1GB之间，则为合法的WAL文件的大小，譬如16MB, 32MB, 64MB, 128MB, 256MB, 512MB，1GB都是合法的大小。

具体多大的WAL文件体积能带来最好的性能，这个问题似乎没有讨论过。我维护的生产库，因为负荷很大，所以16MB的WAL文件，2秒钟就会被填满，然后创建一个新的WAL文件。在这种情况下，似乎更大尺寸的WAL文件更好，譬如64MB或者128MB。

最新回复 (14)

xiaobu 5月前

引用 2楼
CreateSharedMemoryAndSemaphores()函数的整体分析

我们知道，Oracle, PostgreSQL都会维系一个共享内存，这个共享内存占用的体积巨大，是整个数据库集群的核心。PostgreSQL在postmaster主进程中，在创建任何子进程之前，会调用CreateSharedMemoryAndSemaphores()函数创建共享内存。在一个数据库实例(database instance)的生命周期内，这个函数只调用一次。它的调用关系是：

main() -> PostmasterMain() -> CreateSharedMemoryAndSemaphores()

在本文中，我们分析一下这个函数的逻辑。

综管这个函数，它基本上分为三步，第一步是计算整块共享内存的尺寸，
```
size = CalculateShmemSize(&numSemas);
```
共享内存的尺寸是由很多配置参数决定的，如shared_buffers, wal_buffers, max_wal_senders等等。具体细节可以参考CalculateShmemSize()函数的实现。这个函数返回值size是一个8字节的无符号整数。

第二步就是根据这个尺寸创建一整块巨大的共享内存，由如下函数实现：
```
seghdr = PGSharedMemoryCreate(size, &shim);
```
我们可以把这块共享内存理解为一个巨大的C语言数组，postmaster主进程，它派生(fork)出来的子进程都可以访问这个数组。第三步是对数组初始化：
```
	/*
	 * Set up shared memory allocation mechanism
	 */
	InitShmemAllocation();
```
在InitShmemAllocation()函数中，会创建一个自旋锁
```
ShmemLock = (slock_t *) ShmemAllocUnlocked(sizeof(slock_t));
```
自旋锁ShmemLock创建成功以后，任何在这块共享内存中分配内存的操作，都要由该自旋锁进行保护，确保共享内存的分配是串行执行的。所以有两个共享内存分配函数ShmemAllocUnlocked()和ShmemAlloc()，分别是无锁分配共享内存和有锁分配共享内存。无锁分配是在ShmemLock创建之前使用，有锁分配是ShmemLock创建之后使用。读者可以对比这两个函数的相同点和不同点。整体分配内存的逻辑是一样的，不过有锁分配会使用加锁和解锁两个步骤来保护对共享内存的操作：
```
SpinLockAcquire(ShmemLock); /// 加锁

/* alloc the shared memory here */

SpinLockRelease(ShmemLock); /// 解锁
```
对共享内存的分配就是简单的指针移动。在共享内存的头部有一个结构PGShmemHeader，它里面有一个freeoffset位置指针，在这个指针的左边，低地址的内存是已经分配的；在它的右边，高地址的内存是没有分配的空闲共享内存。通过不断增加freeoffset的值来进行内存的分配。具体细节请参考共享内存分配函数ShmemAllocUnlocked()和ShmemAlloc()两个函数的细节，其实非常简单，不难理解。
xiaobu 5月前

引用 3楼
PGSharedMemoryCreate()的分析

这个函数是真正创建共享内存的函数。它的输入就是共享内存的尺寸size,单位是byte。PG有两个函数可以控制分配共享内存的行为，一个是huge_pages，它的合法值是off, on, try，表示是否使用Huge Page，缺省值是try。关于Huge Page，请自行搜索其含义。第二个参数是shared_memory_type，它的合法值是sysv和mmap，缺省值是mmap。共享内存有三种类型，具体可以参考这篇文章：

https://rhaas.blogspot.com/2012/06/absurd-shared-memory-limits.html

缺省值就是最好的配置。PG会使用mmap()函数创建匿名的共享内存，它会尝试在Huge Page上创建，如果创建失败，且huge_pages为on，就报错退出了。如果huge_pages为try，就再次使用mmap()创建非Huge Page的匿名共享内存。

在通过mmap()函数调用创建真正的共享内存后，PG还会使用shmget()再创建一个56字节的System V类型的共享内存，这是因为System V共享内存有一个独有的特性，就是可以知道有多少进程贴(attach)到本共享内存中，这个功能是通过shmctl()函数中指定IPC_STAT，然后从结构shmid_ds的成员变量shm_nattch处获得。所以当PG的数据库实例启动后，我们通过ipcs会看到56字节的共享内存：
```
postgres@ip-172-31-29-179:~$ ipcs

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages    

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x00054636 98349      postgres   600        56         8                    

------ Semaphore Arrays --------
key        semid      owner      perms      nsems     
```
PG只使用了共享内存，并没有使用seamphore和message queue的技术。PG锁的全部技术都是基于自旋锁和共享内存配合来实现的。
xiaobu 5月前

引用 4楼
PG进程的当前目录辨析

我们知道，PG的数据库实例(database instance)就是一捆后台进程，加上某个目录下的一捆文件。这个目录叫做数据库集群(database cluster)，或者叫做data directory。我们在很多PG的程序中通过-D参数来指定它。

在我们操作电脑时，有一个基本概念叫做“当前目录”(current working directory)。在《The Linux Programming Interface》这本书上对“当前目录”有过定义，原文如下：

Each process has a current working directory (sometimes just referred to as the process’s working directory or current directory). This is the process’s “current location” within the single directory hierarchy, and it is from this directory that relative pathnames are interpreted for the process.

A process inherits its current working directory from its parent process. A login shell has its initial current working directory set to the location named in the home directory field of the user’s password file entry. The shell’s current working directory can be changed with the cd command.

上述这段英文精确地描述了“当前目录”的一些特性，以及子进程当前目录来自父进程的规律。在Postmaster主进程的启动阶段会调用ChangeToDataDir()函数，其代码如下：
```
/*
 * Change working directory to DataDir.  Most of the postmaster and backend
 * code assumes that we are in DataDir so it can use relative paths to access
 * stuff in and under the data directory.  For convenience during path
 * setup, however, we don't force the chdir to occur during SetDataDir.
 */
void
ChangeToDataDir(void)
{
	Assert(DataDir);

	if (chdir(DataDir) < 0) 
		ereport(FATAL,
				(errcode_for_file_access(),
				 errmsg("could not change directory to \"%s\": %m",
						DataDir)));
}
```
上述代码使用chdir()系统调用把本进程的当前目录变成指定的DataDir，由此可见，postmaster的当前目录是数据库集群的目录，因为其余的进程都是postmaster的子进程，所以它们的当前目录也是数据库集群目录。我们在设置archive_command参数时，经常要使用的两个特殊变量%p和%f，%p指的是pg_wal/00000001000000000000000A，则%f指00000001000000000000000A。因为所有的PG后台进程的当前目录都是数据库集群目录，所以pg_wal指的当然是相对数据库集群目录。

系统调用chdir()的使用细节请参考：

https://man7.org/linux/man-pages/man2/chdir.2.html

这个系统调用用法非常简单，传入的参数就是一个目录的字符串，在此不再赘述。
xiaobu 5月前

引用 5楼
PG源码的一些风格

PG的源代码相当古老，最初是上个世纪90年代，它的代码中大量使用全局变量，而且每个进程只有一个线程，即你可以把一个进程的代码理解为串行执行的指令流，不存在并发和同步的问题。

这种设计风格比较古老，谈不上好和坏。如果说好的一方面，就是整个系统非常稳定，恼人的多线程并发问题在PG中不存在。维护过PG的DBA都觉得PG非常稳定。坏的一方面，就是无法充分享有多线程带来的拼命压榨硬件性能的好处。

我们来看一段代码：
```
static ControlFileData *ControlFile = NULL;
void LocalProcessControlFile(bool reset)
{
	Assert(reset || ControlFile == NULL);
	ControlFile = palloc(sizeof(ControlFileData)); 
	/// 这个时候ControlFile是指向私有内存中的指针。
	ReadControlFile(); /// 读取控制文件中的内容到ControlFile中。
}
```
这段代码中，ControlFile是一个指针，当它分配完内存后，并没有作为参数传递给后面的ReadControlFile()函数中，而实际上这个函数使用了ControlFile这个指针。所以你在阅读源代码时要记忆一些常用的全局变量，否则你不大容易理解代码的逻辑。

我觉得ReadControlFile()函数把ControlFile指针作为输入参数，整体代码的可读性会更好。

xiaobu 5月前

6楼

文件postmaster.opts的作用

当我们使用initdb创建完一个崭新的数据库集群后，它的目录结构是这样滴：

postgres@ip-172-31-29-179:~$ ls -l data2
total 120
-rw------- 1 postgres postgres     3 Jul 27 22:26 PG_VERSION
drwx------ 5 postgres postgres  4096 Jul 27 22:26 base
drwx------ 2 postgres postgres  4096 Jul 27 22:26 global
drwx------ 2 postgres postgres  4096 Jul 27 22:26 pg_commit_ts
drwx------ 2 postgres postgres  4096 Jul 27 22:26 pg_dynshmem
-rw------- 1 postgres postgres  5711 Jul 27 22:26 pg_hba.conf
-rw------- 1 postgres postgres  2640 Jul 27 22:26 pg_ident.conf
drwx------ 4 postgres postgres  4096 Jul 27 22:26 pg_logical
drwx------ 4 postgres postgres  4096 Jul 27 22:26 pg_multixact
drwx------ 2 postgres postgres  4096 Jul 27 22:26 pg_notify
drwx------ 2 postgres postgres  4096 Jul 27 22:26 pg_replslot
drwx------ 2 postgres postgres  4096 Jul 27 22:26 pg_serial
drwx------ 2 postgres postgres  4096 Jul 27 22:26 pg_snapshots
drwx------ 2 postgres postgres  4096 Jul 27 22:26 pg_stat
drwx------ 2 postgres postgres  4096 Jul 27 22:26 pg_stat_tmp
drwx------ 2 postgres postgres  4096 Jul 27 22:26 pg_subtrans
drwx------ 2 postgres postgres  4096 Jul 27 22:26 pg_tblspc
drwx------ 2 postgres postgres  4096 Jul 27 22:26 pg_twophase
drwx------ 4 postgres postgres  4096 Jul 27 22:26 pg_wal
drwx------ 2 postgres postgres  4096 Jul 27 22:26 pg_xact
-rw------- 1 postgres postgres    88 Jul 27 22:26 postgresql.auto.conf
-rw------- 1 postgres postgres 30690 Jul 27 22:26 postgresql.conf

我们看到，并没有postmaster.opts和postmaster.pid两个文件。当我们使用pg_ctl或者systemctl等启动工具启动数据库实例后，就会有postmaster.opts这个文件。其实这个文件不是很重要，它的作用是记录最近依次启动postgres主进程时所使用的参数。这个文件在我们诊断错误时，会提供一些有用的信息，让我们知道当时主进程使用了哪些参数来启动的。

但是作为一个简单分析源码的例子，我们分析一下这个文件的创建过程，还是一个不错的小练习。我们在postmaster.c中会看到主进程启动时会执行如下逻辑：

	/*
	 * Record postmaster options.  We delay this till now to avoid recording
	 * bogus options (eg, unusable port number).
	 */
	if (!CreateOptsFile(argc, argv, my_exec_path))
		ExitPostmaster(1);

函数CreateOptsFile()有三个入口参数，其中argc和argv就是从main()函数一路传进来的。my_exec_path是包含绝对路径的主进程的文件，如：/home/postgres/pg175/bin/postgres

上述代码的逻辑很简单，如果创建postmaster.opts失败了，就退出整个主进程。当然这种情况比较罕见，除非是因为权限问题，PG的主进程无法往数据库集群的目录中写文件。然后我们看看这个函数的代码：

/*
 * Create the opts file
 */
static bool
CreateOptsFile(int argc, char *argv[], char *fullprogname)
{
	FILE	   *fp;
	int			i;

#define OPTS_FILE	"postmaster.opts"

	if ((fp = fopen(OPTS_FILE, "w")) == NULL)
	{
		ereport(LOG,
				(errcode_for_file_access(),
				 errmsg("could not create file \"%s\": %m", OPTS_FILE)));
		return false;
	}

	fprintf(fp, "%s", fullprogname);
	for (i = 1; i < argc; i++)
		fprintf(fp, " \"%s\"", argv[i]);
	fputs("\n", fp);

	if (fclose(fp))
	{
		ereport(LOG,
				(errcode_for_file_access(),
				 errmsg("could not write file \"%s\": %m", OPTS_FILE)));
		return false;
	}

	return true;
}

上述代码的逻辑非常清晰，就是使用fopen/fprintf/fputs/fclose这几个常用的文件操作函数打开文件，往里面写入argv[]参数的值，记录以后仅供未来排错时参考使用。这个文件在数据库实例运行期间，没有任何作用。

xiaobu 5月前

引用 7楼
PG启动子进程的顺序

我们知道，postmaster主进程启动后，会首先根据配置文件中的各种参数，计算出共享内存的尺寸，然后通过CreateSharedMemoryAndSemaphores()函数创建这块共享内存。共享内存创建以后，主进程就会启动子进程。主进程启动子进程的顺序是什么样的呢？

在PostmasterMain()函数中，它首先会执行如下代码：
```
	/*
	 * If enabled, start up syslogger collection subprocess
	 */
	SysLoggerPID = SysLogger_Start(); 
/// 根据参数配置来决定是否启动日志搜集进程，缺省情况下是不启动这个进程。
```
所以第一个启动的子进程是syslogger子进程。但是要启动这个子进程，需要设置logging_collector参数，这是一个布尔变量的参数，合法的值有on/off两种，确定是否启动syslogger进程。其代码见下：
```
/*
 * Postmaster subroutine to start a syslogger subprocess.
 */
int
SysLogger_Start(void)
{
	pid_t		sysloggerPid;
	char	   *filename;

	if (!Logging_collector) /// logging_collector参数的值是 true 或者 false，决定是否启动日志进程。
		return 0;
```
从上面的代码逻辑可以知道，如果logging_collector参数的值为off，则不启动syslogger子进程。

主进程接着往下走，执行如下代码：
```
	/* Start bgwriter and checkpointer so they can help with recovery */
/// 先启动 CheckPoint 进程和 Bgwriter 进程，确保 StartUp 进程能够正常工作。
	if (CheckpointerPID == 0) 
		CheckpointerPID = StartChildProcess(B_CHECKPOINTER);
	if (BgWriterPID == 0)
		BgWriterPID = StartChildProcess(B_BG_WRITER);

	/*
	 * We're ready to rock and roll...
	 */
/// 启动 Startup 进程，如果失败，整个数据库集群无法启动。
	StartupPID = StartChildProcess(B_STARTUP); 
	Assert(StartupPID != 0);
	StartupStatus = STARTUP_RUNNING;
	pmState = PM_STARTUP;

	/* Some workers may be scheduled to start now */
	maybe_start_bgworkers();

	status = ServerLoop(); /// 这个是postmaster主进程的主要循环逻辑。

	/*
	 * ServerLoop probably shouldn't ever return, but if it does, close down.
	 */
	ExitPostmaster(status != STATUS_OK);

	abort();					/* not reached */
}
```
从上面的代码逻辑，我们可以非常清楚地看出：主进程会首先启动checkpointer子进程，然后是background writer子进程，然后是startup子进程。这三个子进程的启动都是无条件的。先启动头两个，是因为startup进程需要它们先运行。在正常的主库下，我们看不到startup子进程，这是因为这个进程完成了数据库的恢复后，就悄无声息地退出了。我们在备库和PITR时会看到startup子进程的身影。

这三个子进程启动完毕后，主进程就进入到ServerLoop()函数中，这是一个无限循环，主进程会侦听网络端口，一旦有客户进程连接进来，主进程就会派生出子进程postgres来和客户端进程一对一的对接。

所以我们得出一个结论：按照启动的顺序，主进程会启动如下子进程：

主进程会首先启动syslogger子进程，也可能不启动它，这取决于参数设置。

然后主进程无条件启动checkpointer子进程。

然后主进程会无条件启动background writer子进程。

然后主进程会无条件启动startup子进程。

xiaobu 5月前

8楼

PG主进程是如何启动子进程的？

首先我们要记住一个基本事实：

在PG的数据库实例中，只有一个主进程postmaster，其余的任何PG进程都是主进程的子进程，不存在子进程再创建孙进程的情况。如果一个子进程想创建一个子进程，如startup进程想创建walreceiver进程，startup进程会往共享内存的某个地方写一个标记，并通过信号，告知主进程。主进程接到通知后，检查它和startup进程约定的地方有个标记，它就明白了startup子进程的意图，就创建子进程walreceiver。所以walreceiver虽然是在startup子进程的请求下创建的，但它和startup子进程是兄弟关系，它们的父进程都是postmaster.

主进程创建子进程的函数是StartChildProcess()，我们看一下这个函数的代码：

/*
 * MyBackendType indicates what kind of a backend this is.
 *
 * If you add entries, please also update the child_process_kinds array in
 * launch_backend.c.
 */
typedef enum BackendType
{
	B_INVALID = 0,

	/* Backends and other backend-like processes */
	B_BACKEND,
	B_AUTOVAC_LAUNCHER,
	B_AUTOVAC_WORKER,
	B_BG_WORKER,
	B_WAL_SENDER,
	B_SLOTSYNC_WORKER,

	B_STANDALONE_BACKEND,

	/*
	 * Auxiliary processes. These have PGPROC entries, but they are not
	 * attached to any particular database, and cannot run transactions or
	 * even take heavyweight locks. There can be only one of each of these
	 * running at a time.
	 *
	 * If you modify these, make sure to update NUM_AUXILIARY_PROCS and the
	 * glossary in the docs.
	 */
	B_ARCHIVER,
	B_BG_WRITER,
	B_CHECKPOINTER,
	B_STARTUP,
	B_WAL_RECEIVER,
	B_WAL_SUMMARIZER, /// 这是PG 17才新增加的进程
	B_WAL_WRITER,

	/*
	 * Logger is not connected to shared memory and does not have a PGPROC
	 * entry.
	 */
	B_LOGGER,
} BackendType;

/*
 * StartChildProcess -- start an auxiliary process for the postmaster
 *
 * "type" determines what kind of child will be started.  All child types
 * initially go to AuxiliaryProcessMain, which will handle common setup.
 *
 * Return value of StartChildProcess is subprocess' PID, or 0 if failed
 * to start subprocess.
 */
static pid_t
StartChildProcess(BackendType type)
{
	pid_t		pid;

	pid = postmaster_child_launch(type, NULL, 0, NULL);
	if (pid < 0)
	{
		/* in parent, fork failed */
		ereport(LOG,
				(errmsg("could not fork \"%s\" process: %m", PostmasterChildName(type))));

		/*
		 * fork failure is fatal during startup, but there's no need to choke
		 * immediately if starting other child types fails.
		 */
		if (type == B_STARTUP)
			ExitPostmaster(1); /// 如果 startup 进程无法启动，就退出本进程。
		return 0;
	}

	/*
	 * in parent, successful fork
	 */
	return pid;
}

从上述代码中我们不难看出，底层是使用postmaster_child_launch()函数来执行子进程的创建。如果子进程创建失败，且要创建的子进程是B_STARTUP，这种情况比较严重，主进程就退出了，整个数据库实例无法启动。我们看看postmaster_child_launch()函数的代码逻辑：

/*
 * Start a new postmaster child process.
 *
 * The child process will be restored to roughly the same state whether
 * EXEC_BACKEND is used or not: it will be attached to shared memory if
 * appropriate, and fds and other resources that we've inherited from
 * postmaster that are not needed in a child process have been closed.
 *
 * 'startup_data' is an optional contiguous chunk of data that is passed to
 * the child process.
 */
pid_t
postmaster_child_launch(BackendType child_type,
						char *startup_data, size_t startup_data_len,
						ClientSocket *client_sock)
{
	pid_t		pid;

	Assert(IsPostmasterEnvironment && !IsUnderPostmaster);

#ifdef EXEC_BACKEND
	pid = internal_forkexec(child_process_kinds[child_type].name,
							startup_data, startup_data_len, client_sock);
	/* the child process will arrive in SubPostmasterMain */
#else							/* !EXEC_BACKEND */
	pid = fork_process(); /// 核心就是调用 fork()来创建子进程
	if (pid == 0)				/* child */
	{
		/* Close the postmaster's sockets */
		ClosePostmasterPorts(child_type == B_LOGGER);

		/* Detangle from postmaster */
		InitPostmasterChild();

		/* Detach shared memory if not needed. */
		if (!child_process_kinds[child_type].shmem_attach)
		{
			dsm_detach_all();
			PGSharedMemoryDetach();
		}

		/*
		 * Enter the Main function with TopMemoryContext.  The startup data is
		 * allocated in PostmasterContext, so we cannot release it here yet.
		 * The Main function will do it after it's done handling the startup
		 * data.
		 */
		MemoryContextSwitchTo(TopMemoryContext);

		if (client_sock)
		{
			MyClientSocket = palloc(sizeof(ClientSocket));
			memcpy(MyClientSocket, client_sock, sizeof(ClientSocket));
		}

		/*
		 * Run the appropriate Main function
		 */
		child_process_kinds[child_type].main_fn(startup_data, startup_data_len); /// 通过函数指针调用不同的子进程入口函数。
		pg_unreachable();		/* main_fn never returns */
	}
#endif							/* EXEC_BACKEND */
	return pid;
}

嘟从上面的逻辑可以看出，创建子进程使用了fork_process()函数，我们再来看看这个函数的代码：

/*
 * Wrapper for fork(). Return values are the same as those for fork():
 * -1 if the fork failed, 0 in the child process, and the PID of the
 * child in the parent process.  Signals are blocked while forking, so
 * the child must unblock.
 */
pid_t
fork_process(void)
{
	pid_t		result;
	const char *oomfilename;
	sigset_t	save_mask;

#ifdef LINUX_PROFILE
	struct itimerval prof_itimer;
#endif

	/*
	 * Flush stdio channels just before fork, to avoid double-output problems.
	 */
	fflush(NULL);

#ifdef LINUX_PROFILE

	/*
	 * Linux's fork() resets the profiling timer in the child process. If we
	 * want to profile child processes then we need to save and restore the
	 * timer setting.  This is a waste of time if not profiling, however, so
	 * only do it if commanded by specific -DLINUX_PROFILE switch.
	 */
	getitimer(ITIMER_PROF, &prof_itimer);
#endif

	/*
	 * We start postmaster children with signals blocked.  This allows them to
	 * install their own handlers before unblocking, to avoid races where they
	 * might run the postmaster's handler and miss an important control
	 * signal. With more analysis this could potentially be relaxed.
	 */
	sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
	result = fork(); /// 这个是关键点，调用 fork()系统调用来创建子进程。
	if (result == 0)
	{
		/* fork succeeded, in child */
		MyProcPid = getpid();
#ifdef LINUX_PROFILE
		setitimer(ITIMER_PROF, &prof_itimer, NULL);
#endif

		/*
		 * By default, Linux tends to kill the postmaster in out-of-memory
		 * situations, because it blames the postmaster for the sum of child
		 * process sizes *including shared memory*.  (This is unbelievably
		 * stupid, but the kernel hackers seem uninterested in improving it.)
		 * Therefore it's often a good idea to protect the postmaster by
		 * setting its OOM score adjustment negative (which has to be done in
		 * a root-owned startup script).  Since the adjustment is inherited by
		 * child processes, this would ordinarily mean that all the
		 * postmaster's children are equally protected against OOM kill, which
		 * is not such a good idea.  So we provide this code to allow the
		 * children to change their OOM score adjustments again.  Both the
		 * file name to write to and the value to write are controlled by
		 * environment variables, which can be set by the same startup script
		 * that did the original adjustment.
		 */
		oomfilename = getenv("PG_OOM_ADJUST_FILE");

		if (oomfilename != NULL)
		{
			/*
			 * Use open() not stdio, to ensure we control the open flags. Some
			 * Linux security environments reject anything but O_WRONLY.
			 */
			int			fd = open(oomfilename, O_WRONLY, 0);

			/* We ignore all errors */
			if (fd >= 0)
			{
				const char *oomvalue = getenv("PG_OOM_ADJUST_VALUE");
				int			rc;

				if (oomvalue == NULL)	/* supply a useful default */
					oomvalue = "0";

				rc = write(fd, oomvalue, strlen(oomvalue));
				(void) rc;
				close(fd);
			}
		}

		/* do post-fork initialization for random number generation */
		pg_strong_random_init();
	}
	else
	{
		/* in parent, restore signal mask */
		sigprocmask(SIG_SETMASK, &save_mask, NULL);
	}

	return result;
}

这里面最最关键的地方就是result = fork(); 自此实锤了，我们可以得出结论：

PG的主进程是通过fork()这个著名的系统调用来创建子进程的。子进程会继承父进程的很多资源，譬如在父进程中创建的共享内存，会被子进程自动拥有。还有主进程的一些私有内存，也会被子进程拥有。具体细节可以查看父子进程的关系相关的文档。

xiaobu 5月前

引用 9楼
PageGetMaxOffsetNumber()函数分析

这个函数是或者一个数据页上有多少条记录。它的代码非常容易理解，具体如下：
```
/*
 * PageGetMaxOffsetNumber
 *		Returns the maximum offset number used by the given page.
 *		Since offset numbers are 1-based, this is also the number
 *		of items on the page.
 *
 *		NOTE: if the page is not initialized (pd_lower == 0), we must
 *		return zero to ensure sane behavior.
 */
static inline OffsetNumber
PageGetMaxOffsetNumber(Page page)
{
	PageHeader	pageheader = (PageHeader) page;

	if (pageheader->pd_lower <= SizeOfPageHeaderData)
		return 0;
	else
		return (pageheader->pd_lower - SizeOfPageHeaderData) / sizeof(ItemIdData);
}
```
根据我们对一个数据页的结构的理解，一个数据页共计8192字节，其中开始的24字节就是数据页的头部，即SizeOfPageHeaderData为24，数据页头部24字节中有一个指针pg_lower指向了空闲空间的下限，低地址，所以pg_lower - 24即为记录指针的总体积s，每个记录指针的大小是4字节，所以s/4即为这个数据页上的记录条数。上述代码完美地展示了这个逻辑。

xiaobu 4月前

10楼

全局变量IsUnderPostmaster的含义

这个变量是一个布尔型变量，其定义在backend/utils/init/globals.c中

bool		IsUnderPostmaster = false;

然后它在backend/utils/init/miscinit.c的InitPostmasterChild()函数中被唯一地赋值为true。

/*
 * Initialize the basic environment for a postmaster child
 *
 * Should be called as early as possible after the child's startup. However,
 * on EXEC_BACKEND builds it does need to be after read_backend_variables().
 */
void
InitPostmasterChild(void)
{
	IsUnderPostmaster = true;	/* we are a postmaster subprocess now */

这个函数在postmaster_child_launch()中被调用：

/*      
 * Start a new postmaster child process.
 *      
 * The child process will be restored to roughly the same state whether
 * EXEC_BACKEND is used or not: it will be attached to shared memory if
 * appropriate, and fds and other resources that we've inherited from
 * postmaster that are not needed in a child process have been closed.
 *
 * 'child_slot' is the PMChildFlags array index reserved for the child
 * process.  'startup_data' is an optional contiguous chunk of data that is
 * passed to the child process.
 */      
pid_t    
postmaster_child_launch(BackendType child_type, int child_slot, 
                                                const void *startup_data, size_t startup_data_len,
                                                ClientSocket *client_sock)
{       
        pid_t           pid;
 
        Assert(IsPostmasterEnvironment && !IsUnderPostmaster);
         
        /* Capture time Postmaster initiates process creation for logging */
        if (IsExternalConnectionBackend(child_type))
                ((BackendStartupData *) startup_data)->fork_started = GetCurrentTimestamp();

        pid = fork_process();
        if (pid == 0)                           /* child */
        {
                /* Capture and transfer timings that may be needed for logging */
                if (IsExternalConnectionBackend(child_type))
                {
                        conn_timing.socket_create =
                                ((BackendStartupData *) startup_data)->socket_created;
                        conn_timing.fork_start =
                                ((BackendStartupData *) startup_data)->fork_started;
                        conn_timing.fork_end = GetCurrentTimestamp();
                }

                /* Close the postmaster's sockets */
                ClosePostmasterPorts(child_type == B_LOGGER);

                /* Detangle from postmaster */
                InitPostmasterChild();

                /* Detach shared memory if not needed. */
                if (!child_process_kinds[child_type].shmem_attach)
                {

从上述代码我们可以看出，postmaster主进程通过postmaster_child_launch()产生子进程，子进程启动后，在开始阶段就调用了InitPostmasterChild()，从而设置IsUnderPostmaster为true。

从上述代码的分析中我们可以看出：

IsUnderPostmaster，就是在Postmaster之下的意思，凡是由postmaster主进程派生的子进程，其IsUnderPostmaster必定为true。所以通过判断这个变量是否为true，就可以知道这个进程是否为postmaster的亲儿子。

xiaobu 4月前

引用 11楼

什么是MVCC？

Percona面试时问了一个问题：请你谈谈MVCC。

这篇blog介绍了MVCC的总体概念：

https://rhaas.blogspot.com/2017/12/mvcc-and-vacuum.html

我引述其中精彩的片段：

The basic idea is simple: instead of locking a row that we want to update, let’s just create a new version of it which, initially, is visible only to the transaction which created it. Once the updating transaction commits, we’ll make the new row visible to all new transactions that start after that point, while existing transactions continue to see the old row. That way, every transaction sees what is in essence a consistent snapshot of the database. Finally, when there are no transactions remaining that can see the old version of the row, we can discard it. This approach has much better concurrency than two-phase locking, which is why it is used by many modern database systems.

以上介绍了MVCC的基本思想。

xiaobu 4月前

12楼

pg_waldump工具是如何寻找WAL文件所在的目录的？

工具pg_waldump是一个帮助我们查看WAL记录内容的有力工具，下面是使用pg_waldump的一个例子：

postgres@ip-172-31-29-179:~$ pg_waldump -n 1 -s 0/01B9CFE0
rmgr: Transaction len (rec/tot):     34/    34, tx:        754, 
lsn: 0/01B9CFE0, prev 0/01B9CFA0, 
desc: COMMIT 2025-08-08 17:04:50.945780 UTC

上述命令就是查看LSN等于0/01B9CFE0的WAL记录的信息，很显然，当用户指定某个LSN时，pg_waldump要到某个地方寻找对应的WAL文件，那么pg_waldump是如何寻找WAL文件的呢？具体的逻辑在identify_target_directory()中，我们看如下的代码：

/*
 * Identify the target directory.
 *
 * Try to find the file in several places:
 * if directory != NULL:
 *	 directory /
 *	 directory / XLOGDIR /
 * else
 *	 .
 *	 XLOGDIR /
 *	 $PGDATA / XLOGDIR /
 *
 * The valid target directory is returned.
 */
static char *
identify_target_directory(char *directory, char *fname)
{
	char		fpath[MAXPGPATH];

	if (directory != NULL)
	{
		if (search_directory(directory, fname))
			return pg_strdup(directory);

		/* directory / XLOGDIR */
		snprintf(fpath, MAXPGPATH, "%s/%s", directory, XLOGDIR);
		if (search_directory(fpath, fname))
			return pg_strdup(fpath);
	}
	else
	{
		const char *datadir;

		/* current directory */
		if (search_directory(".", fname))
			return pg_strdup(".");
		/* XLOGDIR */
		if (search_directory(XLOGDIR, fname))
			return pg_strdup(XLOGDIR);

		datadir = getenv("PGDATA");
		/* $PGDATA / XLOGDIR */
		if (datadir != NULL)
		{
			snprintf(fpath, MAXPGPATH, "%s/%s", datadir, XLOGDIR); /// #define XLOGDIR				"pg_wal"
			if (search_directory(fpath, fname))
				return pg_strdup(fpath);
		}
	}

	/* could not locate WAL file */
	if (fname)
		pg_fatal("could not locate WAL file \"%s\"", fname);
	else
		pg_fatal("could not find any WAL file");

	return NULL;				/* not reached */
}

上述代码就是pg_waldump寻找WAL文件的逻辑。如果我们没有使用-p参数指定WAL文件所在的目录，上述代码中，directory为NULL，就走下面的逻辑。我们可以很清楚地看到，pg_waldump会先在当前目录下寻找任何一个合法的WAL文件，如果没有找到，就去当前目录下的pg_wal目录下去找任何一个合法的WAL文件。如果当前目录下没有任何合法的WAL文件，当前目录下也没有一个子目录叫做pg_wal，则在PGDATA指定的目录下的pg_wal目录下去寻找任何一个合法的WAL文件。如果这几个地方都找不到，则报错退出。

什么叫做合法的WAL文件呢？只要这个文件的文件名是24个字符长，且都是0-9,A-F这些字符组成，就视为合法的WAL文件名。

我们再看看search_directory()函数的逻辑：

/*
 * Try to find fname in the given directory. Returns true if it is found,
 * false otherwise. If fname is NULL, search the complete directory for any
 * file with a valid WAL file name. If file is successfully opened, set the
 * wal segment size.
 */
static bool
search_directory(const char *directory, const char *fname)
{
	int			fd = -1;
	DIR		   *xldir;

	/* open file if valid filename is provided */
	if (fname != NULL)
		fd = open_file_in_directory(directory, fname);

	/*
	 * A valid file name is not passed, so search the complete directory.  If
	 * we find any file whose name is a valid WAL file name then try to open
	 * it.  If we cannot open it, bail out.
	 */
	else if ((xldir = opendir(directory)) != NULL)
	{
		struct dirent *xlde;

		while ((xlde = readdir(xldir)) != NULL)
		{
			if (IsXLogFileName(xlde->d_name))
			{
				fd = open_file_in_directory(directory, xlde->d_name);
				fname = pg_strdup(xlde->d_name);
				break;
			}
		}

		closedir(xldir);
	}

	/* set WalSegSz if file is successfully opened */
	if (fd >= 0)
	{
		PGAlignedXLogBlock buf;
		int			r;

		r = read(fd, buf.data, XLOG_BLCKSZ);
		if (r == XLOG_BLCKSZ)
		{
			XLogLongPageHeader longhdr = (XLogLongPageHeader) buf.data;

			WalSegSz = longhdr->xlp_seg_size;

			if (!IsValidWalSegSize(WalSegSz))
			{
				pg_log_error(ngettext("invalid WAL segment size in WAL file \"%s\" (%d byte)",
									  "invalid WAL segment size in WAL file \"%s\" (%d bytes)",
									  WalSegSz),
							 fname, WalSegSz);
				pg_log_error_detail("The WAL segment size must be a power of two between 1 MB and 1 GB.");
				exit(1);
			}
		}
		else if (r < 0)
			pg_fatal("could not read file \"%s\": %m",
					 fname);
		else
			pg_fatal("could not read file \"%s\": read %d of %d",
					 fname, r, XLOG_BLCKSZ);
		close(fd);
		return true;
	}

	return false;
}

这个函数的逻辑也非常好理解。它打开一个目录，搜索任何一个合法的WAL文件，判断一个WAL文件是否是合法的，就是通过IsXLogFileName()函数来进行的。打开一个合法的WAL文件后，读取开始的8KB的数据，获取开始的XLogLongPageHeader结构，然后判断其中的xlp_seg_size是否是合法的WAL文件的大小，通过IsValidWalSegSize()来判断。如果是，则这个目录里可能有我们未来要寻找的WAL文件。

xiaobu 4月前

13楼

控制文件的尺寸

PG的控制文件是数据库集群目录下的global/pg_control文件。这个文件的尺寸是8192字节，但实际上只使用了几百个字节。

/*
 * Physical size of the pg_control file.  Note that this is considerably
 * bigger than the actually used size (ie, sizeof(ControlFileData)).
 * The idea is to keep the physical size constant independent of format
 * changes, so that ReadControlFile will deliver a suitable wrong-version
 * message instead of a read error if it's looking at an incompatible file.
 */
#define PG_CONTROL_FILE_SIZE		8192

上述常量就定义了控制文件的尺寸。

xiaobu 4月前

引用 14楼

一个不错的插件 pg_wal_recovery，通过这个插件，你可以深入理解PG的恢复过程。

https://dev.to/misachi/fun-with-postgres-recovery-f5e
xiaobu 4月前

引用 15楼
内存池中数据页的编号问题

PG对内存中的数据页的编号设计很简单，我们看如下源代码：
```
/*
 * Buffer identifiers.
 *
 * Zero is invalid, positive is the index of a shared buffer (1..NBuffers),
 * negative is the index of a local buffer (-1 .. -NLocBuffer).
 */
typedef int Buffer;

#define InvalidBuffer	0
```
从上面的源代码中我们可以看到，数据类型Buffer就是一个整型，当它为正数时，表示在共享内存中的数据页的编号，当它为负数时，表示为在私有内存中的数据页的编号，0表示为无效的数据页编号。

共享池的大小由shared_buffers这个参数决定，它的单位是字节。如果每个数据页的大小是BLCKSZ，单位也是字节，则共享池中共计有shared_buffers / BLCKSZ个数据页，这个值就是NBuffers。

私有内存中也有数据页，它的大小是NLocBuffer来决定，编号是从-1到-NLocBuffer。整体设计可以用下图来表示：

理解了这个设计，我们很容易理解下面的代码：
```
static inline bool
BufferIsValid(Buffer bufnum)
{
	Assert(bufnum <= NBuffers);
	Assert(bufnum >= -NLocBuffer);

	return bufnum != InvalidBuffer; /// #define InvalidBuffer	0
}
```
这个代码判断一个数据页的编号是否有效，很简单，就是判断它不为0，且在-NLocBuffer和NBuffers之间。

发新帖

xiaobu

主题数
49

帖子数
165

注册排名
19

小布源码分析札记-00 : 随笔

WAL文件的合法尺寸是多少？

CreateSharedMemoryAndSemaphores()函数的整体分析

PGSharedMemoryCreate()的分析

PG进程的当前目录辨析

PG源码的一些风格

文件postmaster.opts的作用

PG启动子进程的顺序

PG主进程是如何启动子进程的？

PageGetMaxOffsetNumber()函数分析

全局变量IsUnderPostmaster的含义

什么是MVCC？

pg_waldump工具是如何寻找WAL文件所在的目录的？

控制文件的尺寸

内存池中数据页的编号问题

xiaobu