Oracle为了记录数据的老版本,采用了UNDO机制,但PG没有采用这个机制,而是把老版本的数据都塞进了数据块中。当用户执行UPDATE操作时,PG实际上是先删除老记录,再插入一条新记录,即:
UPDATE = DELETE + INSERT
因为有了这个设计,所以PG需要Vacuum,这是PG特有的操作,也是面试PG岗位必考的问题。我在本帖中对这部分代码进行了分析,总结了一些心得体会,供大家参考。
Vacuum的工作内容
Vacuum的工作内容,下面的链接总结的很好:
https://www.interdb.jp/pg/pgsql06/01.html
- 去掉死亡记录,包括堆表和索引上的死亡记录。
- 防止Transaction ID Wraparound的事情发生。
- 其它操作,更新FSM/VM,更新统计信息,方便查询优化器选择最优的执行计划。
Vacuum的相关参数maintenance_work_mem
这个参数控制着一个进程的私有内存(就是通过malloc/free申请和释放的内存)体积的上限。如果没有指定单位,它的缺省单位是KB。这块内存主要用于一些数据库的维护工作,常见的维护工作有三个:
- VACUUM
- CREATE INDEX
- ALTER TABLE ADD FOREIGN KEY
在VACUUM的开始阶段,会构造一个“死亡记录的指针”数组。当我们想定位一条死亡记录时,需要知道它所在的块号,和它在这个数据块上是第几条记录,所以需要6个字节来描述这个信息,即4+2结构,4个字节描述块号,2个字节描述是第几条记录。死亡记录指针数组就是一个巨大的数组,其每个成员都是6字节的指针。这个数组是在Vacuum阶段一次性分配的,它的总体大小即为maintenance_work_mem控制。
Vacuum的相关参数autovacuum_work_mem
这个参数的含义和maintenance_work_mem是一样的,但它仅适用于自动vacuum的进程。我们知道我们可以手工vacuum,也可以依靠自动autovacuum。自动vacuum就是在后台创建几个进程,定期扫描表进行vacuum。自动vacuum的后台进程是一个autovacuum launcher进程和多个autovauum worker进程组成,前者起到管理和协调作用,后者是真正干活的进程。如果参数autovacuum_work_mem的值是-1,则autovacuum_worker进程使用maintenance_work_mem来控制死亡记录指针数组的体积。我们可以看如下代码:
/*
* Allocate dead_items and dead_items_info (either using palloc, or in dynamic
* shared memory). Sets both in vacrel for caller.
*
* Also handles parallel initialization as part of allocating dead_items in
* DSM when required.
*/
static void
dead_items_alloc(LVRelState *vacrel, int nworkers)
{
VacDeadItemsInfo *dead_items_info;
int vac_work_mem = AmAutoVacuumWorkerProcess() &&
autovacuum_work_mem != -1 ?
autovacuum_work_mem : maintenance_work_mem;
变量vac_work_mem表示用于死亡记录指针的数组可用的内存体积。从上面的代码可以很清楚地看到它的值是如何设置的。
- 如果本进程是autovacuum worker进程,且autovacuum_work_mem的值不是-1,则使用autovacuum_work_mem的值。
- 如果第一条不成立,则使用maintenance_work_mem的值。
那么怎么判断本进程是不是autovacuum worker进程呢?这就要看AmAutoVacuumWorkerProcess()的定义了。
/*
* MyBackendType indicates what kind of a backend this is.
*
* If you add entries, please also update the child_process_kinds array in
* launch_backend.c.
*/
typedef enum BackendType
{
B_INVALID = 0,
/* Backends and other backend-like processes */
B_BACKEND,
B_AUTOVAC_LAUNCHER,
B_AUTOVAC_WORKER,
B_BG_WORKER,
B_WAL_SENDER,
B_SLOTSYNC_WORKER,
B_STANDALONE_BACKEND,
/*
* Auxiliary processes. These have PGPROC entries, but they are not
* attached to any particular database, and cannot run transactions or
* even take heavyweight locks. There can be only one of each of these
* running at a time.
*
* If you modify these, make sure to update NUM_AUXILIARY_PROCS and the
* glossary in the docs.
*/
B_ARCHIVER,
B_BG_WRITER,
B_CHECKPOINTER,
B_STARTUP,
B_WAL_RECEIVER,
B_WAL_SUMMARIZER, /// 这是PG 17才新增加的进程
B_WAL_WRITER,
/*
* Logger is not connected to shared memory and does not have a PGPROC
* entry.
*/
B_LOGGER,
} BackendType;
#define BACKEND_NUM_TYPES (B_LOGGER + 1)
extern PGDLLIMPORT BackendType MyBackendType;
#define AmRegularBackendProcess() (MyBackendType == B_BACKEND)
#define AmAutoVacuumLauncherProcess() (MyBackendType == B_AUTOVAC_LAUNCHER)
#define AmAutoVacuumWorkerProcess() (MyBackendType == B_AUTOVAC_WORKER)
#define AmBackgroundWorkerProcess() (MyBackendType == B_BG_WORKER)
#define AmWalSenderProcess() (MyBackendType == B_WAL_SENDER)
#define AmLogicalSlotSyncWorkerProcess() (MyBackendType == B_SLOTSYNC_WORKER)
#define AmArchiverProcess() (MyBackendType == B_ARCHIVER)
#define AmBackgroundWriterProcess() (MyBackendType == B_BG_WRITER)
#define AmCheckpointerProcess() (MyBackendType == B_CHECKPOINTER)
#define AmStartupProcess() (MyBackendType == B_STARTUP)
#define AmWalReceiverProcess() (MyBackendType == B_WAL_RECEIVER)
#define AmWalSummarizerProcess() (MyBackendType == B_WAL_SUMMARIZER)
#define AmWalWriterProcess() (MyBackendType == B_WAL_WRITER)
原来每个进程都有一个全局变量MyBackendType,它的初始值为0,当这个进程启动后,会把这个变量设置为本进程的类型。通过判断这个变量的值,就可以知道本进程是何方神圣。
dead_items_info->max_bytes = vac_work_mem * 1024L;
通过这个代码我们知道,vac_work_mem的单位是1KB,dead_items_info->max_bytes中记录的值的单位是字节,所以需要乘以1024,把KB转化成字节。
参数autovacuum_max_workers表示在全局范围内,可以同时存在多少个autovacuum worker进程,所以在进行autovacuum操作时,所需要的内存是autovacuum_max_workers X autovacuum_work_mem,或者autovacuum_max_worker X maintenance_work_mem。
-
参数autovacuum_naptime
这个参数是一个整型,其含义在官方文档中的描述如下:
Specifies the minimum delay between autovacuum runs on any given database. In each round the daemon examines the database and issues VACUUM and ANALYZE commands as needed for tables in that database. If this value is specified without units, it is taken as seconds. The default is one minute (1min). This parameter can only be set in the postgresql.conf file or on the server command line.
我们知道关于autovacuum有两种类型的进程,一个是autovacuum launcher,它是一个管理进程,数据库实例启动后它就会启动,我们简称它为AVL。另外一种进程是autovacuum worker进程,它是具体干活的,我们简称之为AVW进程。
AVL进程的主程序就是一个无限循环,直到整个数据库实例被关闭时才停止。每次循环,它都要休眠一段时间,这个时间就是autovacuum_naptime这个参数规定的。你可以通过把这个参数变小,来提高AVL的活动频度,对于一些需要经常vacuum操作的情景比较有利。下面是代码细节:
/* loop until shutdown request */ while (!ShutdownRequestPending) { struct timeval nap; /* * This loop is a bit different from the normal use of WaitLatch, * because we'd like to sleep before the first launch of a child * process. So it's WaitLatch, then ResetLatch, then check for * wakening conditions. */ launcher_determine_sleep(!dlist_is_empty(&AutoVacuumShmem->av_freeWorkers), false, &nap); /* * Wait until naptime expires or we get some type of signal (all the * signal handlers will wake us by calling SetLatch). */ (void) WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, (nap.tv_sec * 1000L) + (nap.tv_usec / 1000L), WAIT_EVENT_AUTOVACUUM_MAIN); ResetLatch(MyLatch); }
上述代码就是循环的主体部分。其中有一个时间变量nap,它的值由launcher_determin_sleep函数来计算出来,然后通过WaitLatch来进行休眠。下面我们看看launcher_determin_sleep()函数的代码:
/* * Determine the time to sleep, based on the database list. * * The "canlaunch" parameter indicates whether we can start a worker right now, * for example due to the workers being all busy. If this is false, we will * cause a long sleep, which will be interrupted when a worker exits. */ static void launcher_determine_sleep(bool canlaunch, bool recursing, struct timeval *nap) { /* * We sleep until the next scheduled vacuum. We trust that when the * database list was built, care was taken so that no entries have times * in the past; if the first entry has too close a next_worker value, or a * time in the past, we will sleep a small nominal time. */ if (!canlaunch) { nap->tv_sec = autovacuum_naptime; nap->tv_usec = 0; } else if (!dlist_is_empty(&DatabaseList)) { TimestampTz current_time = GetCurrentTimestamp(); TimestampTz next_wakeup; avl_dbase *avdb; long secs; int usecs; avdb = dlist_tail_element(avl_dbase, adl_node, &DatabaseList); next_wakeup = avdb->adl_next_worker; TimestampDifference(current_time, next_wakeup, &secs, &usecs); nap->tv_sec = secs; nap->tv_usec = usecs; } else { /* list is empty, sleep for whole autovacuum_naptime seconds */ nap->tv_sec = autovacuum_naptime; nap->tv_usec = 0; } /* * If the result is exactly zero, it means a database had an entry with * time in the past. Rebuild the list so that the databases are evenly * distributed again, and recalculate the time to sleep. This can happen * if there are more tables needing vacuum than workers, and they all take * longer to vacuum than autovacuum_naptime. * * We only recurse once. rebuild_database_list should always return times * in the future, but it seems best not to trust too much on that. */ if (nap->tv_sec == 0 && nap->tv_usec == 0 && !recursing) { rebuild_database_list(InvalidOid); launcher_determine_sleep(canlaunch, true, nap); return; } /* The smallest time we'll allow the launcher to sleep. */ if (nap->tv_sec <= 0 && nap->tv_usec <= MIN_AUTOVAC_SLEEPTIME * 1000) { nap->tv_sec = 0; nap->tv_usec = MIN_AUTOVAC_SLEEPTIME * 1000; } /* * If the sleep time is too large, clamp it to an arbitrary maximum (plus * any fractional seconds, for simplicity). This avoids an essentially * infinite sleep in strange cases like the system clock going backwards a * few years. */ if (nap->tv_sec > MAX_AUTOVAC_SLEEPTIME) nap->tv_sec = MAX_AUTOVAC_SLEEPTIME; }
如果入口参数canlaunch为false,则整个代码非常简单,就是nap的值被设置为autovacuum_naptime规定的值,在函数最后,检查一下这个值,防止它过大,最大值不能超过MAX_AUTOVAC_SLEEPTIME,其定义如下:
/* the minimum allowed time between two awakenings of the launcher */ #define MIN_AUTOVAC_SLEEPTIME 100.0 /* milliseconds */ #define MAX_AUTOVAC_SLEEPTIME 300 /* seconds */
由此可知,你能控制的autovacuum_naptime的取值范围是100毫秒到300秒之间,缺省值是1分钟。大家可以参考Robert Haas的一篇优秀的blog:
https://rhaas.blogspot.com/2019/02/tuning-autovacuumnaptime.html
在这个blog中,他提到了一个很好的案例,你可以在面试时吹嘘一下这是你的功劳:
Since the previous paragraph suggests that increase autovacuum_naptime is a really bad idea, you might reasonably wonder whether decreasing autovacuum_naptime is a really good idea. I have run across one case where I recommended such a settings change and it helped a lot. In that case, the customer had a very small table which was used as a queue, with items added and removed constantly at very high speed. It bloated very rapidly with the default settings, but changing autovacuum_naptime to 15 seconds fixed the problem.
结论:
我们可以调整autovacuum_naptime的值来控制auto vacuum活动的频度。对于数据修改非常激烈的数据库,这个值应该越小越好。
-
参数autovacuum_work_mem和maintenance_work_mem的关系
这两个参数在PG17以后,进行了巨大的改变。在PG17之前,这两个参数最大不能超过1GB,但是17取消了这个限制。
具体可以参考这个文档:
当你决定对一张表进行vacuum时,你需要考虑maintenance_work_mem的值的大小。你可以使用SET SESSION命令来设置。VACUUM会分配一个大数组,里面每个成员都是死亡记录的位置信息,每个成员占据6个字节,这个数组被称为“死亡记录数组”。假设表中有10条死亡记录,则就需要60个字节。所以maintenance_work_mem设置为多大呢,取决于表中有多少死亡记录。你可以通过下面的SQL来查询一个表中的死亡记录:
SELECT relname AS table_name, n_dead_tup AS dead_records FROM pg_stat_all_tables WHERE relname='state';
你把n_deap_tup的数值乘以6,就是死亡数组的体积,就是你应该设置的maintenance_work_mem的值的大小。
-
autovacuum的参数log_autovacuum_min_duration
这个参数的官方解释如下:
Causes each action executed by autovacuum to be logged if it ran for at least the specified amount of time. Setting this to zero logs all autovacuum actions. -1 disables logging autovacuum actions. If this value is specified without units, it is taken as milliseconds. For example, if you set this to 250ms then all automatic vacuums and analyzes that run 250ms or longer will be logged. In addition, when this parameter is set to any value other than -1, a message will be logged if an autovacuum action is skipped due to a conflicting lock or a concurrently dropped relation. The default is 10min. Enabling this parameter can be helpful in tracking autovacuum activity. This parameter can only be set in the postgresql.conf file or on the server command line; but the setting can be overridden for individual tables by changing table storage parameters.
其含义是:
这是一个时间参数,缺省值是10分钟,缺省单位是毫秒(millisecond),即千分之一秒。当一次autovacuum的持续时间等于或者超过这个参数规定的时间后,PG将在日志中记录本次autovacuum的信息。
int Log_autovacuum_min_duration = 600000; /// 缺省值是10分钟
上述定义是在autovacuum.c中给出的。