Work on things that accumulate over time

MySQL B-tree Height Issues in Large Single Tables

2024-09-04T00:00:00+00:00

Some older DBAs may remember that in the past, it was recommended that a MySQL table should not exceed 5 million rows. Many DBAs worry that as tables grow larger, the B-tree height will increase dramatically, thus affecting performance.

In reality, the B-tree is a very flat structure. Most B-trees do not exceed 4 levels. Let’s examine this with an example of a common sysbench table:

CREATE TABLE `sbtest1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `k` int NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=10958 DEFAULT CHARSET=latin1;

In InnoDB, there are two main types of pages: leaf pages and non-leaf pages.

In a leaf page, each record consists mainly of a Record Header and a Record Body. The Record Header is used together with DD (data dictionary) information to interpret the Record Body, which holds the record's actual content.

In a 16KB page of a sysbench-like table, the approximate number of rows that can be stored in a leaf page is calculated as:

(16 * 1024 - 200 (for the page header, tail, and directory slot length)) / ((4 + 4 + 120 + 60) (row data length) + 5 (row header) + 6 (Transaction ID) + 7 (Roll Pointer)) = 78.5 rows

In a non-leaf page, each record consists of a record header, the cluster key, and the page number of the child page.

Since the sysbench primary key id is an integer (4 bytes), the number of records that can be stored in a 16KB non-leaf page is calculated as:

(16 * 1024 - 200) / (5 (row header) + 4 (cluster key) + 4 (child page number)) = 1244 rows

The following table shows the page counts, row counts, and size of a full B-tree at each height:

Height	Non-leaf Pages	Leaf Pages	Rows	Size
1	0	1	79	16KB
2	1	1,244	98,276	19MB
3	1,245	1,547,536	122,255,344	23.6GB
4	1,548,781	1,925,134,784	152,085,647,936	28.7TB

From the above, we can see that for a sysbench-like table with about 152 billion rows and a size of 28.7TB, the B-tree height does not exceed 4 levels. Therefore, you do not need to worry about performance issues caused by B-tree height, even with large datasets.

Impact of Using BIGINT as the Primary Key

If the primary key is changed to BIGINT (8 bytes), the number of rows per leaf page changes slightly:

(16 * 1024 - 200) / ((8 + 4 + 120 + 60) + 13) = 78.9 rows

The number of rows in non-leaf pages changes as well:

(16 * 1024 - 200) / (5 + 8 + 4) = 952 rows

Height	Non-leaf Pages	Leaf Pages	Rows	Size
1	0	1	79	16KB
2	1	952	75,208	15MB
3	953	906,304	71,598,016	13.8GB
4	907,257	862,801,408	68,161,311,232	12.8TB

After switching to BIGINT, a four-level B-tree can store about 68 billion rows and about 12.8TB of data.
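The arithmetic behind the BIGINT table can be reproduced with a few lines of Python. This is only a sketch of the estimate used in this post: the 200-byte page overhead and the per-record sizes are the approximations given above, the helper names `records_per_page` and `btree_capacity` are made up for illustration, and every page is assumed to be completely full.

```python
PAGE_SIZE = 16 * 1024      # InnoDB default page size in bytes
PAGE_OVERHEAD = 200        # page header, tail and directory slots (this post's estimate)


def records_per_page(record_bytes: int) -> float:
    """Approximate number of records of the given size that fit in one page."""
    return (PAGE_SIZE - PAGE_OVERHEAD) / record_bytes


def btree_capacity(height: int, fanout: int, rows_per_leaf: int) -> dict:
    """Page and row counts for a completely full B-tree of the given height."""
    leaf_pages = fanout ** (height - 1)
    # Levels above the leaves hold 1 + fanout + fanout^2 + ... pages.
    non_leaf_pages = sum(fanout ** level for level in range(height - 1))
    return {
        "non_leaf_pages": non_leaf_pages,
        "leaf_pages": leaf_pages,
        "rows": leaf_pages * rows_per_leaf,
    }


# Non-leaf record with a BIGINT key: 5-byte header + 8-byte key + 4-byte child page number.
fanout = int(records_per_page(5 + 8 + 4))
# Leaf record: 192 bytes of column data + 13 bytes of per-row overhead, rounded as in the table.
rows_per_leaf = round(records_per_page((8 + 4 + 120 + 60) + 13))

for height in range(1, 5):
    print(height, btree_capacity(height, fanout, rows_per_leaf))
```

Real pages are rarely 100% full, so these figures are upper bounds rather than what a production table would reach.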

Example of a More Complex Table (Polarbench)

For more complex tables, such as those used in SaaS scenarios, we use the following structure:

CREATE TABLE `prefix_off_saas_log_10` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `saas_type` varchar(64) DEFAULT NULL,
  `saas_currency_code` varchar(3) DEFAULT NULL,
  `saas_amount` bigint(20) DEFAULT '0',
  `saas_direction` varchar(2) DEFAULT 'NA',
  `saas_status` varchar(64) DEFAULT NULL,
  `ewallet_ref` varchar(64) DEFAULT NULL,
  `merchant_ref` varchar(64) DEFAULT NULL,
  `third_party_ref` varchar(64) DEFAULT NULL,
  `created_date_time` datetime DEFAULT NULL,
  `updated_date_time` datetime DEFAULT NULL,
  `version` int(11) DEFAULT NULL,
  `saas_date_time` datetime DEFAULT NULL,
  `original_saas_ref` varchar(64) DEFAULT NULL,
  `source_of_fund` varchar(64) DEFAULT NULL,
  `external_saas_type` varchar(64) DEFAULT NULL,
  `user_id` varchar(64) DEFAULT NULL,
  `merchant_id` varchar(64) DEFAULT NULL,
  `merchant_id_ext` varchar(64) DEFAULT NULL,
  `mfg_no` varchar(64) DEFAULT NULL,
  `rfid_tag_no` varchar(64) DEFAULT NULL,
  `admin_fee` bigint(20) DEFAULT NULL,
  `ppu_type` varchar(64) DEFAULT NULL,
  PRIMARY KEY (`id`),
   KEY `saas_log_idx01` (`user_id`) USING BTREE,
  KEY `saas_log_idx02` (`saas_type`) USING BTREE,
  KEY `saas_log_idx03` (`saas_status`) USING BTREE,
  KEY `saas_log_idx04` (`merchant_ref`) USING BTREE,
  KEY `saas_log_idx05` (`third_party_ref`) USING BTREE,
  KEY `saas_log_idx08` (`mfg_no`) USING BTREE,
  KEY `saas_log_idx09` (`rfid_tag_no`) USING BTREE,
  KEY `saas_log_idx10` (`merchant_id`)
  ) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8

Since this table contains variable-length fields, and most references are assumed to have values, let’s assume all varchar fields are fully used.

When we add up all these fields, including the extra space for the Record Header, it comes to approximately 974 bytes per record.

Therefore, the number of records that can be stored in a leaf page is:

(16 * 1024 - 200) / 974 = 16.6 rows

For non-leaf pages, the capacity is the same as for the sysbench table with a BIGINT primary key: 952 records.

Height	Non-leaf Pages	Leaf Pages	Rows	Size
1	0	1	16	16KB
2	1	952	15,232	15MB
3	953	906,304	14,500,864	13.8GB
4	907,257	862,801,408	13,804,822,528	12.8TB

It can be seen that even for a table where each row is about 1KB, if the primary key is still BIGINT, the B-tree height remains within 4 levels for data sizes under 10TB, allowing the table to store about 13.8 billion rows.

Thus, storing tens of billions of rows in MySQL is not an issue.

MySQL best practices suggest avoiding UUIDs as primary keys.

For example, if the primary key of the prefix_off_saas_log_10 table is changed to a 32-byte UUID, the number of records that can be stored in a non-leaf page is:

(16 * 1024 - 200) / (5 + 32 + 4) = 394 rows

Height	Non-leaf Pages	Leaf Pages	Rows	Size
1	0	1	16	16KB
2	1	394	6,304	6MB
3	395	155,236	2,483,776	2GB
4	155,631	61,162,984	978,607,744	981GB
5	61,318,615	24,098,215,696	385,571,451,136	386TB

From the table above, we can see that with a UUID primary key a four-level B-tree can store about 980 million rows, while with BIGINT the same four levels can store 13.8 billion rows. However, even if a UUID is mistakenly used as the primary key, the depth of MySQL’s B-tree will not exceed five levels, which is enough for about 385 billion rows and 386TB of data. That size cannot actually be reached, as MySQL supports a maximum of 64TB per table.
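To see where the extra level appears, the comparison can be turned around: given a row count, find the smallest height that can hold it. The sketch below uses the fanout values computed above (952 for BIGINT, 394 for a 32-byte UUID) and 16 rows per leaf page; `min_height` is a hypothetical helper, and pages are again assumed to be completely full.

```python
def min_height(rows: int, fanout: int, rows_per_leaf: int) -> int:
    """Smallest B-tree height whose full capacity is at least `rows`."""
    height = 1
    capacity = rows_per_leaf          # a height-1 tree is a single leaf page
    while capacity < rows:
        capacity *= fanout            # each extra level multiplies capacity by the fanout
        height += 1
    return height


ROWS_PER_LEAF = 16                    # ~1KB rows, from the leaf-page estimate above
for label, fanout in (("BIGINT", 952), ("32-byte UUID", 394)):
    for rows in (10**8, 10**9, 10**10):
        print(label, rows, min_height(rows, fanout, ROWS_PER_LEAF))
```

Under these assumptions a table of one billion rows fits in four levels with a BIGINT key but needs a fifth level with a UUID key.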

Conclusion

In general, there’s no need to worry about increased B-tree height impacting performance as the data size grows. For tables under 10TB, the B-tree height will always be within four levels, and even above 10TB, it will remain at five levels because MySQL tables have a maximum size of 64TB.

PolarDB supports many large tables in production, and tables exceeding 10TB are common. I’ve also seen real-world cases shared by DBAs from major companies: Weibo’s “6B (billion) Ge” described a single Weibo table with 6 billion rows, and the founder of NineData shared examples of overseas WeChat-like businesses running well with billions of rows in a single table. So, if the table structure is designed reasonably, large tables are completely manageable, and there’s no need to let database vendors steer you away from them.

B-tree Height in MySQL for Large Single Tables

2024-08-30T00:00:00+00:00

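Some veteran DBAs still remember that long ago the folk wisdom was that a single MySQL table should not exceed 5 million rows, and that any table over 5 million rows had to be sharded across databases and tables. Many DBAs worry that once a MySQL table grows large, the B-tree becomes very tall and instance performance suffers.

In fact, a B-tree is a very flat tree, and the vast majority of B-trees have no more than 4 levels. Let's look at the actual numbers.

Take the common sysbench table as an example: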

CREATE TABLE `sbtest1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `k` int NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k_1` (`k`)
) ENGINE=InnoDB AUTO_INCREMENT=10958 DEFAULT CHARSET=latin1

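InnoDB has two main types of pages: leaf pages and non-leaf pages.

In a leaf page, each record consists mainly of a Record Header and a Record Body. The Record Header is used together with DD (data dictionary) information to interpret the Record Body, which holds the record's main content.

For a table like sysbench with 16KB pages, the approximate number of rows that fit in one leaf page is:

(16 * 1024 - 200 (page header, tail, and directory slot length)) / ((4 + 4 + 120 + 60) (row data length) + 5 (row header) + 6 (Transaction ID) + 7 (Roll Pointer)) = 78.5

In a non-leaf page, each record consists of a row header, the cluster key, and the child page number.

Because the sysbench primary key id is an int (4 bytes), the number of records a 16KB non-leaf page can hold is:

(16 * 1024 - 200) / (5 (row header) + 4 (Cluster Key) + 4 (Child Page Number)) = 1244

The figures for different heights are then:

Height	Non-leaf pages	Leaf pages	Rows	Size
1	0	1	79	16KB
2	1	1244	98276	19MB
3	1245	1547536	122255344	23.6GB
4	1548781	1925134784	152085647936	28.7TB

As shown above, for a table like sysbench, a single table with about 152 billion rows and 28.7TB of data still has a B-tree of no more than 4 levels. So there is no need to worry about the B-tree growing taller as the data grows.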

What if the sysbench primary key is BIGINT, that is, 8 bytes?

The number of records a leaf page can hold is:

(16 * 1024 - 200) / ((8 + 4 + 120 + 60) + 13) = 78.9

The number of records per leaf page barely changes.

The number of records a non-leaf page can hold changes somewhat more:

(16 * 1024 - 200) / (5 + 8 + 4) = 952

Height	Non-leaf pages	Leaf pages	Rows	Size
1	0	1	79	16KB
2	1	952	75208	15MB
3	953	906304	71598016	13.8GB
4	907257	862801408	68161311232	12.8TB

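As shown above, once the sysbench primary key is changed to BIGINT, a 4-level B-tree can hold about 68 billion rows, roughly 12.8TB of data.

If a table like sysbench is not representative, consider a more complex table, such as the log table commonly used in the SaaS scenario of Polarbench (a tool that simulates database usage scenarios across industries):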

CREATE TABLE `prefix_off_saas_log_10` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `saas_type` varchar(64) DEFAULT NULL,
  `saas_currency_code` varchar(3) DEFAULT NULL,
  `saas_amount` bigint(20) DEFAULT '0',
  `saas_direction` varchar(2) DEFAULT 'NA',
  `saas_status` varchar(64) DEFAULT NULL,
  `ewallet_ref` varchar(64) DEFAULT NULL,
  `merchant_ref` varchar(64) DEFAULT NULL,
  `third_party_ref` varchar(64) DEFAULT NULL,
  `created_date_time` datetime DEFAULT NULL,
  `updated_date_time` datetime DEFAULT NULL,
  `version` int(11) DEFAULT NULL,
  `saas_date_time` datetime DEFAULT NULL,
  `original_saas_ref` varchar(64) DEFAULT NULL,
  `source_of_fund` varchar(64) DEFAULT NULL,
  `external_saas_type` varchar(64) DEFAULT NULL,
  `user_id` varchar(64) DEFAULT NULL,
  `merchant_id` varchar(64) DEFAULT NULL,
  `merchant_id_ext` varchar(64) DEFAULT NULL,
  `mfg_no` varchar(64) DEFAULT NULL,
  `rfid_tag_no` varchar(64) DEFAULT NULL,
  `admin_fee` bigint(20) DEFAULT NULL,
  `ppu_type` varchar(64) DEFAULT NULL,
  PRIMARY KEY (`id`),
   KEY `saas_log_idx01` (`user_id`) USING BTREE,
  KEY `saas_log_idx02` (`saas_type`) USING BTREE,
  KEY `saas_log_idx03` (`saas_status`) USING BTREE,
  KEY `saas_log_idx04` (`merchant_ref`) USING BTREE,
  KEY `saas_log_idx05` (`third_party_ref`) USING BTREE,
  KEY `saas_log_idx08` (`mfg_no`) USING BTREE,
  KEY `saas_log_idx09` (`rfid_tag_no`) USING BTREE,
  KEY `saas_log_idx10` (`merchant_id`)
  ) ENGINE=InnoDB AUTO_INCREMENT=0 DEFAULT CHARSET=utf8

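This table has variable-length fields, but most of the ref columns hold values, so assume the varchar fields are fully used.

Adding up all these fields, plus the Record Header overhead, gives roughly 974 bytes.

The number of records a leaf page can hold is then (16 * 1024 - 200) / 974 = 16.6

For non-leaf pages the figure is the same as for sysbench with BIGINT: 952 records.

Height	Non-leaf pages	Leaf pages	Rows	Size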
1	0	1	16	16KB
2	1	952	15232	15MB
3	953	906304	14500864	13.8GB
4	907257	862801408	13804822528	12.8TB

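As shown, even for a table whose rows are about 1KB each, as long as the primary key is still BIGINT, the B-tree height stays within 4 levels for up to 10TB of data, and within those 4 levels the table can hold about 13.8 billion rows.

So storing billions of rows in MySQL is perfectly fine.

MySQL does have a best practice: "avoid using a UUID as the primary key". Let's see why.

Take prefix_off_saas_log_10 above: if the primary key is changed to a 32-byte UUID, then with the leaf page unchanged,

the number of records a non-leaf page can hold is:

(16 * 1024 - 200) / (5 + 32 + 4) = 394

Height	Non-leaf pages	Leaf pages	Rows	Size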
1	0	1	16	16KB
2	1	394	6304	6MB
3	395	155236	2483776	2GB
4	155631	61162984	978607744	981GB
5	61318615	24098215696	385571451136	386TB

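The table above shows that with the same 4-level B-tree, BIGINT can hold 13.8 billion rows, while a UUID primary key can hold only about 970 million rows.

But even if a UUID is mistakenly used as the primary key, the depth of MySQL's B-tree does not exceed 5 levels; 5 levels can hold up to about 385 billion rows, or 386TB of data. That size cannot actually be reached, because a single MySQL table supports at most 64TB.

Overall, in MySQL there is no need to worry that a taller B-tree will hurt performance as data grows. With up to 10TB of data the B-tree height is always within 4 levels, and beyond 10TB it stays at 5 levels and goes no higher, because a single MySQL table supports at most 64TB.

PolarDB supports a great many large-table instances in production, and tables of 10+TB are common. I have also seen real-world accounts from DBA friends at major companies, such as Weibo's "6B (billion) Ge", who described a single Weibo table with 6 billion rows, and Doufo (斗佛), the founder of NineData, whose WeChat official account 大圣聊数据库 described an overseas WeChat-like business running well with billions of rows in a single table. So if the business table schema is designed sensibly, large tables are not a problem at all, and there is no need to be pushed around by today's database vendors.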

Lehman Blink-tree and Vladimir Lanin's Concurrent B-tree

2024-07-26T00:00:00+00:00

PostgreSQL's Blink-tree implementation cites two papers:

Lehman and Yao’s high-concurrency B-tree management algorithm

V. Lanin and D. Shasha, A Symmetric Concurrent B-Tree Algorithm

The B-tree implementation in MySQL InnoDB is mainly based on:

R. Bayer & M. Schkolnick Concurrency of operations on B-trees March 1977

Lehman Blink-tree

The two core changes in the Blink-tree

Adding a single “link” pointer field to each node.

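Some historical context: most B-tree implementations we see today have left/right pointers to the left/right sibling pages, but the standard definition of a B-tree at the time did not require them. In a B-tree, non-leaf nodes also store data; in a B+tree, only leaf nodes store data, which keeps the tree height as low as possible. Neither strictly requires the leaf nodes to be linked together.

Overall, a B-tree was not required to have left/right pointers to its sibling pages.

The B-tree in InnoDB already has left-page and right-page pointers, and at every level, leaf and non-leaf nodes alike, these pointers point to the node's siblings.

So today the right-page pointer can double as the link pointer.

Adding a high key field to each node. During a search, if the target value exceeds the node's high key, the search has to follow the link pointer to the successor node.

So the biggest difference from PolarDB's blink-tree today is that lock coupling is removed: a search holds no lock while moving from one node to the next.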

PolarDB blink-tree

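Search uses lock coupling, acquiring and releasing locks from the top down.

SMO does not use lock coupling: it locks the child node, releases it, and then locks the parent node. Specifically:

After locking the leaf page and finishing its work there, when an entry must be inserted into the parent node, the child's page lock is released, the B-tree is searched again, and the parent node is located, page-locked, and modified. The second search can be avoided by saving the pointer to the parent node, but that is an optimization.

In the standard blink-tree, that is, PostgreSQL's Blink-tree:

Search does not use lock coupling; it only latches the node at the current level. Between reading the child page id and obtaining the child page, no latch is held at all: the parent node's latch has been released and the child node's latch has not yet been acquired. If the child page undergoes an SMO in that window and the record being searched for is no longer in the child page, what happens?

In PolarDB's blink-tree, lock coupling ensures that a search holds the latches of both the parent node and the child node at the same time, so this situation cannot occur.

The following example shows exactly this situation:

A search for 15 runs concurrently with an insert of 9 that triggers an SMO.

15 was originally in y. While find(15) is in progress, y splits into y and y’, and 15 moves to the new y’.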

# This is not how it works in postgres. This demonstrates the problem:
"Thread A, searching for 15"   |   "Thread B, inserting 9"
                               |   node2 = read(x);
node = read(x);                |
"Examine node, 15 lies in y"   |   "Examine node2, 9 belongs in y"
                               |   node2 = y;
                               |   # 9 does not fit in y
                               |   # Split y into (8,9,10) and (12,15)
                               |   y = (8,9,10); y_prime = (12,15)
                               |   x.add_pointer(y_prime)
                               |   
"y now points to (8,9,10)!"    |
node = read(y)                 |
find(15) "15 not found in y!"  |

In this example, PolarDB's blink-tree solves the problem with lock coupling: after read(x), the search also holds an S lock on node(y). Thread B's SMO needs an X lock on node(y), so the SMO is blocked and the problem above is avoided.

How does Lehman's blink-tree solve it?

By adding a link page pointer and a high key to node(y).

The find(15) above sees that 15 > node(y)’s high key, so it continues the search on node(y)’s link page, which is y’, and finds 15 there.
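The "move right" rule can be sketched in a few lines of Python. This is an illustration of the idea only, not PostgreSQL's nbtree code: the `Node` class and `find_in_leaf` function are hypothetical, latching is left out, and the node contents are taken from the example above (y = (8, 9, 10) with high key 10, y’ = (12, 15)).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    keys: List[int]
    high_key: Optional[int] = None      # upper bound of keys this node may hold; None = unbounded
    link: Optional["Node"] = None       # right sibling created by a split


def find_in_leaf(node: Node, key: int) -> Optional[Node]:
    """Search a leaf, moving right along link pointers while key > high_key."""
    while node.high_key is not None and key > node.high_key:
        if node.link is None:
            return None
        node = node.link
    return node if key in node.keys else None


# State after Thread B's split: y keeps (8, 9, 10) and links to y' = (12, 15).
y_prime = Node(keys=[12, 15])
y = Node(keys=[8, 9, 10], high_key=10, link=y_prime)

# Thread A arrives at y through a stale pointer and still finds 15 via the link.
print(find_in_leaf(y, 15) is y_prime)
```

The reader never needs the parent's latch to recover from the split; the high key alone tells it that the key has moved to the right.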

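How, then, is an SMO carried out?

In Lehman's blink-tree, an SMO locks the parent node while still holding the child node, coupling latches from the bottom up. Because searches do not use lock coupling, latching bottom-up causes no problem, so the SMO can hold the child latch while requesting the parent node latch.

At this point the latches of both the child and the parent are held.

If the parent node also has a link page, that is, the entry must be inserted into parent node -> link page, then the latches of three pages are held at once: child, parent, and parent -> link page.

If the insert position is still not found in parent -> link page and the search must move on to parent -> link page -> link page, the parent node can be released before latching link page -> link page.

So at most three nodes are latched at any one time.

In most cases there is only one link page, which simplifies many operations.

Vladimir Lanin's concurrent B-tree optimizes this further.

In the current PG implementation, where the child node stays locked while the parent node is updated, only one link page can appear: the page lock is not released until the first page's split has finished, so no new insert can proceed in the meantime.

Multiple link pages can appear only with an approach like PolarDB's blink-tree, where the child node latch is released once the insert into the child node is done and only then is the parent node updated, so the link page can receive further inserts while the parent node update is in progress.

My understanding is that PG made a trade-off here to avoid the complexity of multiple link pages.

Although multiple link pages do not arise here, a search/insert may still need to traverse several link pages to reach the target page.

The PolarDB blink-tree approach could also be used here: once the child node has been updated, release its lock and re-traverse the B-tree to insert into the parent node, so that the child node's latch is released as early as possible.

The blink-tree paper also mentions a remembered list: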

We then proceed back up the tree (using our “remembered” list of nodes through which we searched)

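Vladimir Lanin Concurrent B-tree

The paper begins by summarizing how B-tree concurrency was implemented before the Blink-tree.

Searches use top-down lock coupling; SMOs lock a subtree, also from the top down. Since both search and SMO operations proceed top-down, deadlocks are avoided.

What were the drawbacks of the concurrency control methods that preceded this paper?

It is hard to determine exactly how large the locked subtree must be. This is also one of the most tedious parts of the existing MySQL code.
Lock coupling still limits concurrency. Note that lock coupling does not have to be paired with a blink-tree; it works with a standard B-tree as well. In this paper it is used with a B+tree.

Both methods sacrifice concurrency for safety.

There is also an optimization on top of lock coupling + lock subtree: lock optimistically first, then pessimistically. The optimistic path takes S locks all the way down, finds the leaf node, and takes an X lock only on the leaf node. In (k-1)/k of cases (where 2k is the number of records in a page) the optimistic path suffices. InnoDB in fact works this way, optimistic first and pessimistic second.

Otherwise the approach is similar to Lehman's blink-tree, except that an SMO locks only one node at a time. PostgreSQL's actual implementation does not do this, which I take to be mainly for safety.

The paper also notes:

Although it is not apparent in [Lehman, Yao 81] itself, the B-link structure allows inserts and searches to lock only one node at a time.

That is, inserts and searches can lock only one node at a time, which is also what I had in mind.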

Each action holds no more than one read lock at a time during its descent, an insertion holds no more than one write lock at a time during its ascent, and a deletion needs no more than two write locks at a time during its ascent.

After the completion of a half-split or a half-merge, all locks are released.

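This is indeed what the paper does: after a half-split, all locks are released, so the insert into the parent node works like PolarDB's current approach, releasing all locks and then inserting into the next level up, which guarantees that an SMO locks only one level at any moment.

Normally, finding the node with the right coverlet for the add-link or remove-link is done as in [Lehman, Yao 81], by reaccessing the last node visited during the locate phase on the level above the current node. Sometimes (e.g. when the locate phase had started at or below the current level) this is not possible, and the node must be found by a new descent from the top.

The insert into the parent node can use the saved remembered list, or re-traverse the tree.

In addition, the paper uses an idea similar to the link page to add the delete operation, which the Lehman paper did not implement.

Compared with MySQL's InnoDB alone, PG's Blink-tree implementation is clearly finer-grained in its locking: it avoids an index lock on the whole B-tree and also sidesteps the conflicts between search and SMO operations that arise from locking subtrees.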

PolarDB read future page

2024-06-20T00:00:00+00:00

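Background:

When users run PolarDB/Aurora, which are built on a shared-storage, one-writer/multi-reader architecture, they often expect the PolarDB RW (read-write) node and RO (read-only) nodes to behave like a traditional MySQL primary and replica. They assume any complex operation can be run on the replica: even if the replica gets into trouble, for example a complex query drives up CPU usage and causes replication lag, the primary should not be affected.

In PolarDB, however, this is not the case. A complex query on an RO node does affect the RW node. Because of data-consistency constraints, the RW node's dirty-page flushing is constrained when an RO node lags in replication, and the RW node may become unable to flush dirty pages.

PolarDB's current approach is this: if an RO node's replication lag grows too high and blocks RW flushing, the RO node is automatically crashed and restarted so that the RW node is not harmed.

Still, some users want to use PolarDB's RW and RO nodes the way they use a MySQL primary and replica. If an RO node lags and they do not want it restarted, is there a way?

The intuitive idea is to stop constraining flushing on the RW node, in which case an RO node may read a future page.

What goes wrong if an RO node reads a future page?

Although an architecture like Aurora has multi-version support in storage, it has a problem similar to PolarDB's and must solve it too.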

https://repost.aws/knowledge-center/aurora-read-replica-restart

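In answering this question, Aurora also stresses that its RW/RO architecture differs from the traditional MySQL primary/replica architecture.

Aurora/Socrates relies on multiple versions in the page server, so the page server must retain the oldest version to guarantee that the desired version can be read. The page server therefore cannot apply redo + page => new_page at will; it must wait until all RO nodes have synchronized up to the same redo log before the page can be updated to new_page. This is much the same as constraining RW flushing in PolarDB.

PolarDB is the same: the storage node keeps the oldest version so that RO nodes can read the version they need.

So although the PolarDB and Aurora architectures differ, both have this problem.

Both carry a risk as well: if an RO node lags too far, the flush constraint may crash the node in PolarDB, and in Aurora it keeps the page server from advancing.

So both have the same logic: if a slow RO node lags too far, that RO node is restarted automatically.

Aurora is affected much less, though, because the delayed pages are spread across multiple page servers, whereas in PolarDB they are concentrated on a single node.

The problem can be solved from two sides: on the RW node or on the RO node.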

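Solving it on the RW node

PolarDB and Aurora currently take similar approaches, both constraining the RW side. PolarDB calls it the flush constraint; Aurora restricts the page server from generating new page versions.

But this approach has 2 problems.
1. Memory is limited, so if an RO node lags too far, memory may not hold out. That is why both PolarDB and Aurora have automatic restart logic.
2. Because the newest page cannot be advanced for a long time, reading the newest page requires old_page + redo => new page, which may hurt performance. The approach described next, which keeps the redo log on disk, avoids the memory problem but adds extra redo IO, so the performance impact is even greater.

Both systems can persist the redo log: PolarDB can write only the log index at flush time without writing the page, and Aurora can offload the redo log in the page server's memory to disk so that memory does not fill up.

But either way latency suffers.

Alternatively, a .mibd-like scheme could be implemented. The core idea is still not to update the old page in place: new_page is written to a new file, and once the old RO node's LSN advances, the .mibd content is written back to the .ibd file.

A multi-version engine could implement a similar scheme, but the problem is that page write IO is amplified 2x and an extra page read IO is added, so the performance impact is very large.

Solving it on the RO node

Socrates appears to take this kind of approach: it places no constraint on RW flushing and lets the RW node flush freely, so the RO node has to handle the inconsistency. Socrates describes a very simple way to handle access to a future page: a plain retry. A plain retry is the most direct approach, but it affects performance, and finer-grained handling is needed.

The inconsistency here has 2 main aspects:
1. Logical inconsistency, that is, the visibility-check problem.
2. Physical inconsistency, that is, an SMO makes the accessed page inconsistent.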

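RO Reading a Future Page

To remove the flush constraint and allow RO nodes to read future pages, the kernel has to handle two problems.

Logical consistency, that is, the visibility-check problem

Why doesn't the RW node have this problem?

The RW node also flushes dirty pages early, before the transaction commits, so it can likewise read a page that is too new. But the trx_id stored in a record on an early-flushed page is guaranteed to be in the active transaction array, so the record is known to be invisible and the historical version can be found through the readview.

In essence, the RW node can guarantee the order between updating the readview and flushing, but the RO node cannot. A page may be flushed before the corresponding trx_id has reached the RO node, which is how the RO node ends up reading a future page.

Why does the flush constraint solve this?

Because the flush constraint guarantees that before a page is flushed, the corresponding redo log has been sent to the RO node and the corresponding trx_id has been synchronized to it. The RO node then already has the correct readview, so when the RW node flushes, the RO node behaves the same as the RW node.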
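The flush constraint can be pictured as a simple LSN comparison. The sketch below is an assumption-laden illustration, not PolarDB's implementation: it assumes the constraint is expressed as "the page's newest modification LSN must not exceed the smallest LSN applied by any RO node", and the names `can_flush`, `page_newest_lsn`, and `ro_applied_lsns` are hypothetical.

```python
from typing import Iterable


def can_flush(page_newest_lsn: int, ro_applied_lsns: Iterable[int]) -> bool:
    """RW may write a dirty page only if every RO node has already applied
    redo up to the page's newest modification."""
    applied = list(ro_applied_lsns)
    if not applied:                 # no RO nodes: nothing constrains flushing
        return True
    return page_newest_lsn <= min(applied)


# One RO node lagging at LSN 120 holds back every page modified after LSN 120.
print(can_flush(100, [150, 120]))   # page is old enough to flush
print(can_flush(130, [150, 120]))   # page would be a future page for the slow RO
```

The slowest RO node alone decides which pages stay pinned in the RW node's buffer pool, which is why a single lagging replica can stall flushing.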
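
Physical consistency problem

Likewise, why doesn't the RW node have this problem?

Because when an SMO occurs on the RW node, it cannot proceed while a query holds the page's S latch; only after the query releases the page's S latch can the SMO proceed.

A query on the RO node, however, cannot hold back an SMO: even if the query on the RO node has locked next_page, next_page may still be updated.

How does the flush constraint solve this?

With the flush constraint, when an SMO occurs, then according to the description of PolarDB's sync_counter the index X lock is taken, which makes the SMO mutually exclusive with queries on the RO node and achieves an effect similar to the RW node.

Without the flush constraint, how can it be solved?

By retrying inside the mtr, similar to the Socrates solution, which guarantees that the same version of the B-tree is accessed. The retry still has a cost and needs finer-grained handling.
1. An SMO has occurred. There are 2 cases:
  1. The accessed record is still in the current page.
  2. The accessed record is no longer in the current page.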

InnoDB B-tree Latch Optimization History

2024-06-09T00:00:00+00:00

In general, in a database, “latch” refers to a physical lock, while “lock” refers to a logical lock in transactions. In this article, the terms are used interchangeably.

In the InnoDB implementation, there are two main types of locks in the B-tree: index lock and page lock.

Index lock refers to the lock on the entire index, which is represented in the code as dict_index->lock.
Page lock refers to the lock present on each page within the B-tree.

When we refer to B-tree locks, we generally mean both the index lock and the page lock working together.

In the 5.6 implementation, the process of B-tree latching is relatively simple, as follows:

1. For a query request:

First, acquire an S LOCK on btree index->lock.
Then, after finding the leaf node, acquire an S LOCK on the leaf node as well, and release the index->lock.

2. For a leaf page modification request:

Similarly, acquire an S LOCK on btree index->lock.
Then, after finding the leaf node, acquire an X LOCK on it because the page needs to be modified. After that, release the index->lock. At this point, there are two scenarios depending on whether the modification of this page will cause a change in the B-tree structure:
- If it doesn’t, that’s good. Once the X LOCK on the leaf node is acquired, modify the data and return.
- If it does, you will need to perform a pessimistic insert operation and re-traverse the B-tree. Acquire an X LOCK on the B-tree index and execute btr_cur_search_to_nth_level to the specified page.
  
  Since modifying the leaf node may cause changes to the B-tree all the way up to the root node, other threads must be prevented from accessing the B-tree during this time. Therefore, an X LOCK is required on the entire B-tree, meaning no other query requests can access it. Moreover, since an X LOCK is held on the index, and record insertion into the page might cause the upper-level pages to change, this process may involve disk I/O, potentially making the X LOCK last for an extended time. During this time, all read-related operations will be blocked.
  
  The specific code for this is in row_ins_clust_index_entry. Initially, an optimistic insert operation is attempted:
```
err = row_ins_clust_index_entry_low(
    0, BTR_MODIFY_LEAF, index, n_uniq, entry, n_ext, thr,
    &page_no, &modify_clock);
```
  If the insert fails, a pessimistic insert operation is attempted:
```
return(row_ins_clust_index_entry_low(
    0, BTR_MODIFY_TREE, index, n_uniq, entry, n_ext, thr,
    &page_no, &modify_clock));
```
  As you can see, the only difference here is that the latch_mode is either BTR_MODIFY_LEAF or BTR_MODIFY_TREE. Since btr_cur_search_to_nth_level is executed in the row_ins_clust_index_entry_low function, the B-tree is re-traversed when the pessimistic insert is retried after a failed optimistic attempt.

As shown above, in 5.6, the index lock is only applied to the entire B-tree index, and the page lock is applied only to leaf node pages in the B-tree. Non-leaf node pages in the B-tree are not locked.

This simple implementation makes the code easy to understand, but it has obvious disadvantages. During SMO (Structure Modification Operation), read operations cannot proceed, and because SMOs may involve disk I/O, the resulting performance fluctuations are quite noticeable. We have often observed such phenomena in production.

The 8.0 Improvements

In response, upstream MySQL introduced changes starting in 5.7. Here, we’ll take 8.0 as an example. The main improvements include:

The introduction of SX LOCK.
The introduction of non-leaf page locks.

SX LOCK Introduction

Let’s first introduce SX LOCK. SX LOCK can be used for both index locks and page locks.

SX LOCK does not conflict with S LOCK but does conflict with X LOCK. SX LOCKs also conflict with each other.
The purpose of an SX LOCK is to indicate the intention to modify the protected area, but the modification has not yet started. Therefore, the resource is still accessible, but once the modification begins, access will no longer be allowed. Since an intention to modify exists, no other modifications can occur, so it conflicts with X LOCKs.
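The compatibility rules above can be written down as a small lookup table. This is a minimal Python sketch for illustration; `COMPATIBLE` and `compatible` are hypothetical names and not part of InnoDB's rw_lock API.

```python
# Whether two lock modes held by different threads can coexist on the same object.
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "SX"): True,      # readers may proceed while a writer only intends to modify
    ("S", "X"): False,
    ("SX", "SX"): False,    # only one thread may hold the intention at a time
    ("SX", "X"): False,
    ("X", "X"): False,
}


def compatible(held: str, requested: str) -> bool:
    """Return True if `requested` can be granted while another thread holds `held`."""
    key = (held, requested) if (held, requested) in COMPATIBLE else (requested, held)
    return COMPATIBLE[key]


for held in ("S", "SX", "X"):
    print(held, [compatible(held, requested) for requested in ("S", "SX", "X")])
```

The only row that differs from a plain shared/exclusive lock is SX: it admits S but excludes both SX and X.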

The main usage now is that index SX LOCK does not conflict with S LOCK, which allows reads and optimistic writes to proceed even during pessimistic insert operations.

SX LOCK was introduced through this work log: WL#6363.

SX LOCK was primarily introduced to optimize read operations. Since SX LOCK conflicts with X LOCK but not with S LOCK, places that previously required X LOCKs were changed to SX LOCKs, making the system more read-friendly.

Non-leaf Page Lock Introduction

In fact, this is how most commercial databases operate—both leaf pages and non-leaf pages have page locks.

The main idea is Latch Coupling, where during a top-down traversal of the B-tree, the page lock on the parent node is released only after acquiring the lock on the child node. This minimizes the lock coverage. To implement this, non-leaf pages must also have page locks.
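The acquire-child-then-release-parent order can be sketched as follows. This is a simplified Python illustration, not InnoDB code: the `Page` class and `descend` function are hypothetical, each page has a single child to keep the example short, and a plain `threading.Lock` stands in for an S/X page latch.

```python
import threading
from typing import List, Optional


class Page:
    def __init__(self, name: str, child: Optional["Page"] = None):
        self.name = name
        self.child = child              # single child keeps the sketch short
        self.latch = threading.Lock()   # stand-in for an S/X page latch


def descend(root: Page, trace: List[str]) -> Page:
    """Walk to the leaf, latching each child before releasing its parent."""
    node = root
    node.latch.acquire()
    trace.append(f"latch {node.name}")
    while node.child is not None:
        child = node.child
        child.latch.acquire()           # take the child latch first ...
        trace.append(f"latch {child.name}")
        node.latch.release()            # ... then let go of the parent
        trace.append(f"unlatch {node.name}")
        node = child
    return node                         # the caller still holds the leaf latch


leaf = Page("leaf")
root = Page("root", Page("internal", leaf))
trace: List[str] = []
found = descend(root, trace)
found.latch.release()
print(trace)
```

At no point is the traversal unlatched, and at most two page latches are held at once.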

However, InnoDB did not completely remove the index->lock, which means that only one BTR_MODIFY_TREE operation can occur at a time. Therefore, when B-tree structure modifications are highly concurrent, performance can degrade significantly.

Back to the 5.6 Problem

As we can see, in 5.6, the worst-case scenario is when modifying a B-tree leaf page triggers a change in the B-tree structure. In this case, an X LOCK on the entire index is required. However, we know that such changes often affect only the current page and the page one level above it. If we can reduce the lock scope, it will undoubtedly help improve concurrency.

In MySQL 8.0

1. For a query request:

First, acquire an S LOCK on btree index->lock.
Then, during the B-tree traversal, acquire an S LOCK on the non-leaf node pages encountered.
After reaching the leaf node, acquire an S LOCK on the leaf node page and release the index->lock.

2. For a leaf page modification request:

Similarly, acquire an S LOCK on btree index->lock and S LOCKs on the non-leaf node pages.
After reaching the leaf node, acquire an X LOCK on the leaf node because the page needs to be modified, and then release the index->lock. At this point, the situation branches into two scenarios depending on whether the page modification triggers a B-tree structure change:
- If it doesn’t, then the X LOCK on the leaf node is sufficient. After modifying the data, return as normal.
- If it does, a pessimistic insert operation is performed by re-traversing the B-tree. At this point, the index->lock is acquired with an SX LOCK.
  - Since the B-tree now has an SX LOCK, the pages along the search path do not require locks. However, the pages encountered during the search process need to be saved, and X LOCKs are applied to the pages that may undergo structural changes.
  - This ensures that read operations are minimally affected during the search process.
  - Only after confirming the scope of the B-tree changes at the final stage, and acquiring X LOCKs on the affected pages, will the operation proceed.

In 8.0, the duration of holding the SX LOCK is as follows:

Holding the SX LOCK: After the first btr_cur_optimistic_insert fails, row_ins_clust_index_entry calls row_ins_clust_index_entry_low(flags, BTR_MODIFY_TREE ...) to insert. Inside row_ins_clust_index_entry_low, the SX LOCK is acquired in the btr_cur_search_to_nth_level function. At this point, the B-tree is locked by the SX LOCK, preventing further SMO operations. An optimistic insert is still attempted at this stage, with the SX LOCK still being held. If that fails, a pessimistic insert is attempted.
Releasing the SX LOCK: In a pessimistic insert, the SX LOCK is held until a new page (page2) is created and connected to the parent node. If the page undergoing SMO is a leaf page, the SX LOCK is released only after the SMO operation is completed, and the insert is successful.

The function responsible for executing the SMO and inserting is btr_page_split_and_insert.

The btr_page_split_and_insert operation consists of approximately 8 steps:

1. Find the record to split from the page that is about to be split. Ensure the split location is at the record boundary.

2. Allocate a new index page.

3. Calculate the boundary record for both the original page and the new page.

4. Add a new index entry for the new page to the parent index page. If the parent page does not have enough space, it triggers the split of the parent page.

5. Connect the current index page, the current page’s prev_page, next_page, father_page, and the newly created page. The connection order is to first connect the parent page, then prev_page/next_page, and finally connect the current page and the new page. (At this point, the index->sx lock can be released.)

6. Move some records from the current index page to the new index page.

7. The SMO operation is complete, and the insertion location for the current insert operation is calculated.

8. Perform the insert operation. If the insert fails, try reorganization of the page and attempt the insert again.
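The data movement in these steps can be illustrated with a much simplified Python sketch. It is not how btr_page_split_and_insert is written: the `split_and_insert` helper is hypothetical, a page is modeled as a sorted list of keys, the split point is always the middle, and latching, sibling links, and parent updates are left out.

```python
import bisect
from typing import List, Tuple


def split_and_insert(page: List[int], key: int) -> Tuple[List[int], List[int], int]:
    """Split a full, sorted page in the middle and insert `key` into the proper half.

    Returns (left page, right page, separator key for the parent).
    """
    mid = len(page) // 2                  # step 1: split point on a record boundary
    left, right = page[:mid], page[mid:]  # steps 2 and 6: the new page takes the upper half
    separator = right[0]                  # steps 3 and 4: boundary key handed to the parent
    target = right if key >= separator else left
    bisect.insort(target, key)            # steps 7 and 8: find the position and insert
    return left, right, separator


print(split_and_insert([1, 2, 3, 4], 5))
print(split_and_insert([10, 20, 30, 40], 15))
```

In the real function the expensive parts are the ones omitted here: updating the parent (which may split in turn) and relinking prev_page/next_page while the relevant latches are held.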

In the existing code, there is only one scenario in which an X lock is taken on index->lock:

if (lock_intention == BTR_INTENTION_DELETE && trx_sys->rseg_history_len > BTR_CUR_FINE_HISTORY_LENGTH && buf_get_n_pending_read_ios()) {

// If the lock_intention is BTR_INTENTION_DELETE and the history list is too long, the index will acquire an X lock.

Summary:

Improvements in 8.0 compared to 5.6

In 5.6, during a write operation, if an SMO (structure modification operation) is in progress, the entire index->lock would be locked with an X lock. During this time, all read operations would be blocked.

In 8.0, read operations and optimistic write operations are allowed to proceed during an SMO.

However, in 8.0 there is still a limitation: only one SMO can occur at a time because the SX lock must be acquired during an SMO. Since SX locks conflict with other SX locks, this remains one of the main issues in 8.0.

Optimization Points:

Of course, there are still some optimization opportunities here.

There is still a global index->lock. Although it is an SX LOCK, in theory, according to the 8.0 implementation, it is possible to fully release the index lock. However, many details need to be handled.
During the actual split operation, can the holding of the index lock inside btr_page_split_and_insert be optimized further?
- For example, with a suitable ordering of the steps, could the index->lock be released as soon as the newly created page has been linked into the tree?
- Another consideration is the holding time of the X LOCK on the page where the SMO (structure modification operation) occurs.
  
  Currently, the X LOCK is held on all pages along the path until the SMO is completed, and the current insert operation is finished. Meanwhile, the father_page, prev_page, and next_page also hold X LOCKs. Could the number of locked pages be reduced? For example, this optimization is mentioned in BUG#99948.
- In btr_attach_half_pages, multiple traversals of the B-tree using btr_cur_search_to_nth_level could be avoided. This function is responsible for establishing links like the father link, prev link, and next link. However, it redundantly executes btr_page_get_father_block to traverse the B-tree to find the parent node, which internally calls btr_cur_search_to_nth_level. This step could be avoided since the index is already SX LOCKed, and the father node won’t change. The result from the previous btr_cur_search_to_nth_level call could be reused.
- Can we mark pages undergoing SMO with a state similar to a B-link tree, where the page is still readable? Although the record to be read might not exist on the current page, the reader could attempt to retrieve it from the page’s next_page. If the record can be found, the read operation is still valid.
Can the pages encountered during the btr_cur_search_to_nth_level search be preserved? This way, even for repeated searches, only the max trx_id of the upper-level pages needs to be checked. If unchanged, the entire search path hasn’t changed, so no full traversal is necessary.
Is it still necessary to retain the optimistic insert followed by a pessimistic insert approach?

My understanding is that this process exists because the cost of pessimistic inserts was too high in the 5.6 implementation. To minimize pessimistic inserts, this process was carried over into the current 8.0 implementation. However, multiple insert attempts require multiple B-tree traversals, leading to additional overhead.

References

https://dom.as/2011/07/03/innodb-index-lock/

https://dev.mysql.com/worklog/task/?id=6326

MySQL deadlock caused by lock inheritance

2024-03-20T00:00:00+00:00

In one of our user environments, we found a deadlock that can be reproduced with the example below.

In this case, session 2 and session 3 sometimes deadlock and sometimes do not.

create table t(a int AUTO_INCREMENT, b int, PRIMARY KEY (a));
insert into t(a, b) values(10, 8);
insert into t(a, b) values(5, 8); 

session 1: begin; delete from t where a=5;
session 2: insert into t(a, b) values (5, 8) on duplicate key update b = 11;
session 3: insert into t(a, b) values (5, 8) on duplicate key update b = 11;

session 1: commit;

# After the commit, session 2 and session 3 sometimes deadlock and sometimes do not.

The root cause of the deadlock is lock inheritance.

Session 1 holds the X, REC_NOT_GAP lock on record 5.

Session 2 then waits for the X, REC_NOT_GAP lock on record 5.

Session 3 also waits for the X, REC_NOT_GAP lock on record 5.

+-----------+------------+-----------+---------------+-------------+-----------+
| thread_id | index_name | lock_type | lock_mode     | LOCK_STATUS | lock_data |
+-----------+------------+-----------+---------------+-------------+-----------+
|       148 | PRIMARY    | RECORD    | X,REC_NOT_GAP | WAITING     | 5         |
|       150 | PRIMARY    | RECORD    | X,REC_NOT_GAP | WAITING     | 5         |
|       146 | PRIMARY    | RECORD    | X,REC_NOT_GAP | GRANTED     | 5         |
+-----------+------------+-----------+---------------+-------------+-----------+

When session 1 commits, there are two scenarios.

They differ in whether record 5 is purged after session 1 commits and before session 2 and session 3 continue.

If the record has not been purged, record 5 is not physically deleted. Only one of session 2 and session 3 gets the X, REC_NOT_GAP lock on record 5, and the two sessions perform their inserts one after the other.

This scenario does not cause a deadlock.

If the record has been purged, record 5 is physically deleted and the waiting locks are inherited by the next record, so session 2 and session 3 now hold locks on record 10 in X, GAP mode. Both sessions are granted the X, GAP lock on record 10. When they go on to insert, each holds the X, GAP lock and waits for the other's gap lock to be released in order to get its own X, GAP, INSERT_INTENTION lock. A deadlock follows.

The lock information below was captured by setting a breakpoint just before the deadlock check.

+-----------+------------+-----------+------------------------+--------------+-----------+
| thread_id | index_name | lock_type | lock_mode              | LOCK_STATUS  | lock_data |
+-----------+------------+-----------+------------------------+--------------+-----------+
|       148 | PRIMARY    | RECORD    | X,GAP                  | GRANTED      | 10        |
|       148 | PRIMARY    | RECORD    | X,GAP,INSERT_INTENTION | WAITING      | 10        |
|       150 | PRIMARY    | RECORD    | X,GAP                  | GRANTED      | 10        |
|       150 | PRIMARY    | RECORD    | X,GAP,INSERT_INTENTION | WAITING      | 10        |
+-----------+------------+-----------+------------------------+--------------+-----------+

[Figure: the deadlock information reported by MySQL]

The default behaviour of lock inheritance is that a REC_NOT_GAP lock on the deleted record is inherited as a GAP lock on the next record.

In this case, however, an X, REC_NOT_GAP lock conflicts with another X, REC_NOT_GAP lock, while after inheritance it becomes an X, GAP lock on the next record, and an X, GAP lock does not conflict with another X, GAP lock. Both sessions are therefore granted the X, GAP lock, and the deadlock happens.
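
The two lock tables above can be replayed with a toy wait-for graph. This is a minimal sketch under simplified conflict rules (only the modes that appear in this example, all locks on the same record); the function names are invented here and do not correspond to InnoDB functions.

```python
def conflicts(requested, held):
    """Simplified conflict rules between two different transactions on one record."""
    if requested == "X,GAP,INSERT_INTENTION":
        # An insert must wait for any plain gap lock another transaction holds.
        return held in ("X,GAP", "S,GAP")
    if requested in ("X,GAP", "S,GAP"):
        # Plain gap locks never wait for each other.
        return False
    if requested == "X,REC_NOT_GAP":
        return held == "X,REC_NOT_GAP"
    raise ValueError(requested)


def wait_for_edges(granted, requests):
    """granted/requests are lists of (trx, mode); returns {waiter: set(blockers)}."""
    edges = {}
    for trx, mode in requests:
        blockers = {t for t, m in granted if t != trx and conflicts(mode, m)}
        if blockers:
            edges[trx] = blockers
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the wait-for graph."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True
        visiting.add(node)
        found = any(visit(n) for n in edges.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(n) for n in list(edges))


# Before session 1 commits: 148 and 150 both wait for 146 on record 5.
before = wait_for_edges(
    [(146, "X,REC_NOT_GAP")],
    [(148, "X,REC_NOT_GAP"), (150, "X,REC_NOT_GAP")],
)
# After purge: both hold X,GAP on record 10 and both request insert intention.
after = wait_for_edges(
    [(148, "X,GAP"), (150, "X,GAP")],
    [(148, "X,GAP,INSERT_INTENTION"), (150, "X,GAP,INSERT_INTENTION")],
)
```

In this model `before` has no cycle (both waiters point at 146), while `after` contains the cycle 148 -> 150 -> 148, which is the deadlock shown in the second table.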

My suggestion is that when an X, REC_NOT_GAP lock is inherited, the next record should inherit a NEXT-KEY lock instead. A NEXT-KEY lock conflicts with another NEXT-KEY lock, so the deadlock would not happen.

Common MySQL Deadlock Scenarios: Concurrent Inserts with the Same Primary Key

2024-03-19T00:00:00+00:00

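A previous article covered deadlocks caused by unique keys on secondary indexes. The primary key is unique as well, so the primary-key uniqueness check can cause deadlocks in the same way.

The primary-key uniqueness check is performed in

row_ins_clust_index_entry_low

which contains this condition:

if (!index->allow_duplicates && n_uniq && (cursor->up_match >= n_uniq || cursor->low_match >= n_uniq)) {

The condition means:

If the current index is a unique index and (cursor->up_match >= n_uniq || cursor->low_match >= n_uniq), the cursor has found a record identical to the one being inserted, so row_ins_duplicate_error_in_clust must be called. For an ordinary INSERT, the primary-key uniqueness check takes an S record lock. For REPLACE INTO or INSERT ... ON DUPLICATE KEY UPDATE, it takes an X record lock.

Otherwise the index does not contain this record, i.e. this is the first insert of this primary key, so the duplicate-check logic is skipped and no lock is needed.

Example 1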

create table t1 (a int primary key);

# Then there are three different sessions:

session1: begin; insert into t1(a) values (2);

session2: insert into t1(a) values (2);

session3: insert into t1(a) values (2);

session1: rollback;

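Before the rollback:

At this point session2/session3 wait on record 2 for an S record lock. When a session executes a delete, it goes through row_update_for_mysql => lock_clust_rec_modify_check_and_lock

which places an X record lock on the record being modified.

An insert, as session1 does here, in fact also places an X record lock on the record. Most of the time this is only an implicit lock; it is converted into an explicit X lock only when a real conflict occurs.

Question 1: why does record 2 have both an X record lock and an S record lock among the granted locks?

After session1 executes rollback, session2/session3 are granted the S record lock. When the inserts go on to commit, a deadlock is detected; one transaction is rolled back and the other commits. The deadlock information shows the following.

trx1 wants an X insert intention lock.

But trx2 holds an S next-key lock, which conflicts with trx1's X insert intention lock.

Meanwhile trx2 is also waiting for an X insert intention lock. From the held locks above, it must be waiting for trx1's S next-key lock.

Question: while waiting, the lock was an S gap lock, but at deadlock time it is an S next-key lock. When was it upgraded?

The reason is that this table contains only record 2. Looking closely, the deadlock wait is on supremum. Supremum is special: it has no gap lock, only a next-key lock.

0: len 8; hex 73757072656d756d: asc supremum; // waiting on the supremum record

After inserting a 3 behind 2, record 3 carries an S gap lock rather than a next-key lock.

So where does this gap lock come from?

The gap lock is on record 3. Where does the S lock on record 3 come from? And where did the S record lock that session2/3 were waiting for on record 2 go?

This involves lock inheritance, which happens in two main scenarios:

Inserting a record: the inserted record inherits locks from the next record. See lock_update_insert.
Deleting a record (not a delete mark; it must be the physical delete done by purge): the locks on the record are handed over to the next record. See lock_update_delete.

In addition, because the delete removes the record, any record locks waiting on it must also be migrated to the next record, for example the S record lock waiting on record 2 in this example.

Transactions waiting on the deleted record are woken up through lock_rec_reset_and_release_wait(block, heap_no);

For details, see InnoDB Trx lock.

Summary:

Two transactions, trx2 and trx3, both wait on the primary key while the lock is held by trx1. After trx1 rolls back, trx2 and trx3 both hold an S lock on the record, which through lock inheritance becomes a GAP lock on the next record. Both transactions then need an insert intention lock (LOCK_X | LOCK_GAP | LOCK_INSERT_INTENTION) to insert, and each is blocked by the S GAP lock held by the other.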

Example 2

mysql> select * from t1;
+---+
| a |
+---+
| 2 |
+---+

Then there are three different sessions:

session1: begin; delete from t1 where a = 2;

session2: insert into t1(a) values (2);

session3: insert into t1(a) values (2);

session1: commit;

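Before the commit:

At this point session2/3 are both waiting for the S lock on record 2. The wait lasts at most innodb_lock_wait_timeout.

After the commit:

After session1 commits, session2/session3 are granted the S record lock they were waiting for. At commit time a deadlock is detected; one transaction is rolled back and the other commits. The deadlock information shows the following.

trx1 waits for an X record lock while trx2 holds an S record lock (granted to both session2/3 once session1 committed).

The difference from the previous example is that the locks are all on record 2 rather than record 3. Why?

The fundamental reason is that this delete is a delete mark. The record is not physically removed from the B-tree, so the transactions' locks can stay on record 2. Had the record been physically deleted, these record locks would all have migrated to the next record.

Question: why does the insert here not use an insert intention lock?

A secondary-key insert, for example, uses an insert intention lock. Why is it an S record lock here?

After record 2 is deleted, the record is only delete-marked and still exists, so the insert turns the delete-marked record into the record being written (this is not an optional optimization; B-tree uniqueness requires it). The insert therefore becomes row_ins_clust_index_entry_by_modify

So it is not an insert operation, and there is no insert intention lock.

A secondary-key insert, in contrast, is not allowed to reuse a delete-marked record, because that record may still be read by another read view.

Using GDB + call srv_debug_loop(), the process can be stopped after session1 commits but before session2/3 enter the deadlock. Querying performance_schema at that point shows that session2/3 have obtained an S lock on record 10. How was this lock obtained?

As in the example above: the wait was long enough for purge to run, so record 2 was physically deleted. Lock inheritance therefore took place, and the locks on record 2 were handed to the next record, which here is 10.

Summary:

Largely the same as the previous example.

Two transactions, trx2 and trx3, both wait on the primary-key uniqueness check while the lock is held by trx1. After trx1 commits, trx2 and trx3 both hold an S record lock on the record. Because the delete-marked record still exists, the insert becomes a modify, so both want an X record lock and each is blocked by the S record lock held by the other.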
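
The S-to-X upgrade deadlock in this summary can be shown with a very small model. This sketch assumes only the standard S/X record-lock rules (S is compatible with S, X conflicts with everything); blocked_by and the transaction names are illustrative.

```python
def blocked_by(trx, wanted, granted):
    """Transactions whose granted record lock conflicts with `wanted` ('S' or 'X')."""
    return {t for t, mode in granted if t != trx and (wanted == "X" or mode == "X")}


# Both transactions passed the duplicate check and hold S on record 2.
granted = [("trx2", "S"), ("trx3", "S")]

# The insert became a modify of the delete-marked record, so each now wants X.
waits = {trx: blocked_by(trx, "X", granted) for trx, _ in granted}

# Each waits for the other: a two-node cycle in the wait-for graph.
mutual_wait = "trx3" in waits["trx2"] and "trx2" in waits["trx3"]
```

Neither transaction can release its S lock before it gets the X lock, so one of them has to be rolled back.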

SMO Synchronization in PolarDB Physical Replication

2024-03-12T00:00:00+00:00

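Background:

sync_counter was introduced mainly to handle the following scenario.

If one mtr modifies multiple pages (most commonly a B-tree split/merge) while a search is running on the replica, the search may reach a page whose next_page pointer is wrong.

For example, a Search 97 on the RO node has reached Page 5 and obtained Page 8 as the child page. Physical replication then applies an insert of 90, which splits Page 8 and moves 97 from Page 8 into Page 9. When the search arrives at Page 8, it cannot find 97.

Would the problem occur if Search 97 ran on the RW node?

No. On RW, in 5.6 and earlier the split is protected by the index X lock. A search must hold the index S lock, which conflicts with X, so it waits until the split finishes. In 5.7/8.0, with the SX lock, the search's S lock does not conflict with SX, but 5.7/8.0 locks the affected subtree: during the split, Pages 5/8/9 are held under X locks, so Search 97 cannot get the S lock on Page 5, and the problem does not arise.

blink-tree handles this scenario similarly. If an SMO is in progress when the search reaches Page 8, the search waits on the address lock of Page 8. When the SMO finishes, the search is woken up and retried, which ensures it finds 97 on Page 9.

Existing solution: sync counter

Earlier approach: index lock

The original index lock mechanism works as follows: when a batch of redo log is applied and an SMO has occurred on an index, the index X lock is taken and released only after the whole batch has been applied.

Taking the index X lock guarantees that RO readers see the B-tree structure either entirely before the split or entirely after it.

How is this guaranteed?

A search must hold the index S lock, so the SMO can only proceed after all in-flight searches finish, and new searches must wait until the SMO completes. This preserves the integrity of the B-tree as seen by readers.

[Figure: a search that accesses the B-tree after the SMO has completed]

This is the same way the earliest 5.6 handled SMO versus search on the RW node. Because the index X lock is held for the duration of the physical-replication apply batch, it hurts both replication performance and user requests.

The sync counter mechanism:

Compared with holding the index lock until the whole apply batch completes, the new mechanism holds it only while updating m_sync_counter, but it still needs the index X lock.

Existing problems:

The sync counter mechanism still has to hold the index lock while updating the index's sync_counter.

In IndexLockRepl::index_sync_all() => index_sync_with_id():

/* Update the sync counter under protection of index lock. */
rw_lock_x_lock(rw_lock);
index->sync_counter = m_sync_counter;

A normal search holds the index S lock, and updating the sync counter needs the X lock, so the update must wait for searches to finish. If the RO node serves many reads, the apply phase has to wait, which lowers physical replication efficiency. (Note that all waits here are at mtr granularity, not trx granularity, because the index lock is released when the mtr commits.)
On the other hand, updating the sync counter requires the index X lock, and the apply phase may wait a long time for other mtrs to finish before getting it. New search operations then also wait too long.

In effect, the sync counter cuts all mtr operations into several serial segments.

There is one more problem. The optimistic mechanism may force a search to do frequent store_position && restore_position when moving to the next page, which re-traverses the B-tree often. When accessing a child page, apply_runtime_redo must bring that child page to the latest version. This is especially wasteful when the pages affected by the SMO are unrelated to the pages being accessed.

So sync_counter locks a wider range than the 5.7/8.0 SX lock mechanism, and even more so compared with blink-tree. The SX lock mechanism at least confines SMO/search conflicts to the affected subtree.

Although the current sync_counter is optimistic, one SMO still affects searches over the whole B-tree, and when the RW node performs many SMOs it can cause frequent store_position && restore_position and hurt performance.

Is there a better way?

Yes: an SMO page queue, or LogIndex.

SMO page queue

We already know when an SMO happens. If the IDs of the pages involved in the SMO are shipped to the RO node and kept in an SMO_array, then the apply_runtime_redo on child-page access and the store && restore on next-page access can add a filter: if the page is not in SMO_array, those operations are skipped and performance is not affected.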
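
A minimal sketch of such a filter is shown below. It is hypothetical throughout: the class SmoPageFilter, its method names, and the choice to clear the set when a batch finishes are assumptions made for illustration and do not exist in PolarDB.

```python
class SmoPageFilter:
    """Hypothetical RO-side set of page ids touched by SMOs in the batch being applied."""

    def __init__(self):
        self._smo_pages = set()

    def on_smo_logged(self, page_ids):
        # Called when the RO node parses redo describing an SMO (assumed hook).
        self._smo_pages.update(page_ids)

    def on_batch_applied(self):
        # Once the batch is fully applied, no page is mid-SMO any more (assumption).
        self._smo_pages.clear()

    def needs_sync(self, page_id, in_apply_phase):
        """True only if the page may be in an intermediate SMO state."""
        return in_apply_phase and page_id in self._smo_pages


smo_filter = SmoPageFilter()
smo_filter.on_smo_logged([5, 8, 9])            # the split from the example above
sync_page_8 = smo_filter.needs_sync(8, True)   # child page under SMO: must sync
sync_page_3 = smo_filter.needs_sync(3, True)   # unrelated page: skip the extra work
```

The point of the design is that the cost of apply_runtime_redo and store/restore is paid only for pages that actually took part in an SMO, instead of for every access on the index.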

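This reduces the frequent store && restore operations, but one problem remains: index_sync_all still has to take the index X lock. Long-running mtrs therefore still slow down physical replication, and SMOs still conflict with searches.

In fact it is entirely possible to behave like the primary node, where SMO and search do not conflict.

Remove index_sync_all completely and only check membership in SMO_array. This gives an effect similar to the RW node: a search conflicts only with the pages undergoing SMO and not with any other page.

Moreover, with index_sync_all removed, SMOs no longer conflict with each other either, which is similar to blink-tree.

The SMO page queue resolves the SMO/search conflict and ensures that an SMO only affects its own pages. One problem remains unsolved: RO queries sometimes reach a page that is too new, although RW seems to have this problem as well.

LogIndex

Another approach is LogIndex: each page access carries the required LSN, so the reader gets the specified version of the page and never reaches a page that does not exist.

RW could use a similar approach, which is essentially bw-tree.

This also resolves, outside SMO scenarios, page conflicts between search operations and the normal physical-replication page apply.

LogIndex and bw-tree look very similar. Is there a difference?

bw-tree keeps the oldest version of a page in memory plus a delta chain per page, so any version of the page can be read.

LogIndex currently keeps the newest version of the page in memory and the oldest version on disk. To read a specific version, it reads the on-disk page and applies parsed redo log to reach that version.

It also looks similar to the getPage(lsn) protocol of Socrates. Is there a difference?

The Socrates getPage(lsn) protocol returns any page version >= lsn.

LogIndex and bw-tree return the page version with the largest LSN that is <= lsn.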
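
The difference between the two contracts can be made concrete with a list of version LSNs for one page. This is only a sketch of the two lookup rules as described above; the function names and the sample LSNs are invented for illustration.

```python
import bisect


def get_page_at_or_before(version_lsns, lsn):
    """LogIndex / bw-tree style: newest version whose LSN is <= lsn, or None.

    `version_lsns` must be sorted in ascending order.
    """
    i = bisect.bisect_right(version_lsns, lsn)
    return version_lsns[i - 1] if i > 0 else None


def acceptable_socrates_versions(version_lsns, lsn):
    """Socrates getPage(lsn) style: any version with LSN >= lsn may be returned."""
    return [v for v in version_lsns if v >= lsn]


versions = [100, 140, 180]            # LSNs at which the page was modified
exact = get_page_at_or_before(versions, 150)
candidates = acceptable_socrates_versions(versions, 150)
```

For a reader at LSN 150, the first rule yields the version at 140 and never a newer one, while the second rule may hand back the version at 180, which is a future page from the reader's point of view.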

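Do Aurora and Socrates have this problem?

They have the same problem.

Aurora/Socrates use a getPage(lsn)-like protocol that returns any page >= lsn, so a reader can also reach a page that is too new, which leads to inconsistency.

For details, see Socrates.

Note: the Socrates getPage protocol returns any page whose LSN is >= the requested LSN, not the first such page, so the returned page may be too new, i.e. a future page.

One question:

Why doesn't getPage(lsn) return the newest page <= LSN? That seems more reasonable: future pages would not occur, and there would be no need to read historical versions through the undo log.

Socrates handles a future page very simply, with a plain retry. Could we do the same?

Could bw-tree be the B-tree best suited to this one-writer, many-readers scenario?

TODO:

RO SMO page queue
Allow reading future pages

The sync_counter mechanism in code:

The underlying cause is that redo log is applied in parallel, so the pages within one mtr are modified concurrently. If a read arrives on the replica in the meantime, the apply of all pages of that mtr is not atomic, so the read may observe an intermediate state of the mtr, such as a page whose next_page is wrong.

All three directions of B-tree access can be affected, so all need handling:

child page
next page
prev page
child page: in buf_page_get_gen(), apply_runtime_redo() decides whether the page must be brought up to the latest redo log.
next page: in btr_pcur_move_to_next_page(), the next page recorded in the current page may be wrong because the next page may already have been modified, so store_position and restore_position are used to re-locate the current page and make sure its next-page pointer is correct.
prev page: accessing the prev page always does store_position and restore_position by default, so no extra handling is needed.

In the code, the condition for deciding whether an SMO may have happened is the same for child page and next page.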

bool poss_restore = (log_sched->apply_phase_flag.is_set() && (log_sched->index_lock_handler()->sync_counter() == index->sync_counter) && (log_sched->next_apply_lsn() != page_applied_lsn));

Both conditions look like this. The reason for this form is explained below.
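
To make the three parts of the condition easier to reason about, here is a direct transcription into a plain function. It is only a restatement of the C++ expression above with the values passed in as arguments; the parameter names are chosen for this sketch.

```python
def poss_restore(apply_phase_set, global_sync_counter, index_sync_counter,
                 next_apply_lsn, page_applied_lsn):
    """Mirror of the poss_restore expression: may this page be in a mid-SMO state?"""
    return (apply_phase_set                                  # a batch is being applied
            and global_sync_counter == index_sync_counter   # this index had an SMO in the batch
            and next_apply_lsn != page_applied_lsn)          # the page is not yet at the batch LSN


# Apply phase running, index marked in this batch, page still at an older LSN.
stale_page = poss_restore(True, 101, 101, 2000, 1500)
# Same situation, but the page has already been applied up to the batch LSN.
fresh_page = poss_restore(True, 101, 101, 2000, 2000)
# Index not touched by an SMO in this batch (its counter is from an older batch).
other_index = poss_restore(True, 101, 100, 2000, 1500)
```

Only the first case requires the extra work; the other two short-circuit to False.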

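Why distinguish apply_phase from parse_phase?

In physical replication the parse phase and the apply phase are strictly separated, and no redo is applied during the parse phase. Conflicts with user requests only occur during the apply phase, not the parse phase. So only during the apply phase, because of SMO, do we need to decide whether to run runtime_apply_redo(), which has a cost.

The split between apply_phase and parse_phase can be seen as an optimization that was introduced because of page SMO.

The benefit is that during the parse phase, all previously parsed redo log is known to have been applied, i.e. every page has reached m_applied_lsn.

Once the apply phase starts, this no longer holds.

Starting the apply phase sets m_next_apply_lsn = the LSN at which the last parse finished.

At that moment m_next_apply_lsn > m_applied_lsn.

When the whole batch of redo log has been applied, m_applied_lsn is set to m_next_apply_lsn, which completes one round of redo apply.

During the apply phase, user reads and page updates run concurrently.

During the parse phase, however, all page versions are consistent, because every page has been applied to the same version m_applied_lsn and no background apply is running.

So conflicts with user reads are all handled in the apply phase.

With the current SMO strategy, when there is a conflict with a user read, pages are aligned to one version by default. That version is the m_next_apply_lsn of the redo batch being applied, which guarantees that all accessed pages are at the same version.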
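
The relationship between the two LSNs across the phases can be sketched as a small state holder. This is an illustrative model, not PolarDB code; the class and method names are assumptions, and only the two LSN fields correspond to m_applied_lsn and m_next_apply_lsn.

```python
class ReplicaApplyState:
    """Toy model of the parse/apply phases on the RO node."""

    def __init__(self, applied_lsn=0):
        self.applied_lsn = applied_lsn       # plays the role of m_applied_lsn
        self.next_apply_lsn = applied_lsn    # plays the role of m_next_apply_lsn
        self.in_apply_phase = False

    def start_apply(self, parsed_up_to_lsn):
        # Entering the apply phase: the target is the LSN where parsing stopped.
        self.next_apply_lsn = parsed_up_to_lsn
        self.in_apply_phase = True

    def finish_apply(self):
        # The whole batch has been applied: every page is at the same version again.
        self.applied_lsn = self.next_apply_lsn
        self.in_apply_phase = False

    def target_lsn_for_read(self):
        """Version a conflicting read aligns its pages to."""
        return self.next_apply_lsn if self.in_apply_phase else self.applied_lsn


state = ReplicaApplyState(applied_lsn=100)
lsn_in_parse_phase = state.target_lsn_for_read()
state.start_apply(parsed_up_to_lsn=150)
lsn_in_apply_phase = state.target_lsn_for_read()
state.finish_apply()
```

During the parse phase a read can rely on every page being at applied_lsn; during the apply phase it has to align to next_apply_lsn, which is where the runtime apply cost comes from.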

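The downside is that every access on that index, conflicting or not, has to align to the latest version.

Where MLOG_INDEX_LOCK_ACQUIRE is generated:

A new mtr log record type, MLOG_INDEX_LOCK_ACQUIRE, was added. When the primary generates an mtr whose changes span multiple pages, it emits such a record. See mtr/mtr0mtr.cc: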

    /* Append the index lock to local buffer */
    if (m_impl.m_modifications && m_impl.m_n_log_recs > 0
        && m_impl.m_log_mode != MTR_LOG_NO_REDO
        && m_impl.m_log_mode != MTR_LOG_NONE) {
      log_sched->index_lock_handler()->append_log(this);
    }

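But there is a question here: does m_impl.m_n_log_recs > 0 really indicate that the mtr modified multiple pages?

Currently the vast majority of mtrs modify only one page. If an mtr modifies multiple pages, it is most likely an SMO.

child page path

In btr_cur_search_to_nth_level() => buf_page_get_gen() => apply_runtime_redo.

In many cases apply_runtime_redo does not need to bring the page to the latest version:

If we are not in the apply_phase
If the page's page.applied_lsn >= next_apply_lsn()
If the parse buffer holds no redo that needs to be applied to the page
If the index is not involved in any SMO. How do we know the index is not involved in an SMO? See the following code: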

    ulint sync_cnt = mtr->get_index_sync_counter();
    if (!access_undo && (sync_cnt > 0 && (sync_cnt != index_lock_handler()->sync_counter() || sync_cnt <= index_lock_handler()->prev_sync_counter.load()))) {
      return;
    }

There are three sync_counters here.

mtr->sync_cnt: the sync_counter at the start of the mtr, copied from index->sync_counter. Accessor: mtr->get_index_sync_counter();

index->sync_counter: whenever the index is involved in an SMO, index->sync_counter is set to global_sync_counter. Variable: index->sync_index_sync_counter; this value is copied from index->sync_counter.

sync_cnt != index_lock_handler()->sync_counter()

sync_cnt <= index_lock_handler()->prev_sync_counter.load()

Why this condition?

The apply_hashes function executes:

index_lock_handler()->inc_sync_counter();
apply_phase_flag.set();
index_lock_handler()->index_sync_all();

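line1 increments the global global_sync_counter, i.e. m_sync_counter++;

line2 sets apply_phase_flag. For why apply_phase_flag is needed, see physical copy.md

line3 updates every index involved in an SMO in this batch (index->sync_counter = m_sync_counter);

Which indexes get marked depends on the MLOG[…] the RO node receives; otherwise no marking is needed.

Why is apply_runtime_redo unnecessary when mtr->sync_cnt < global_sc?

Because global_sc is incremented at every apply_phase. When mtr->sync_cnt < global_sc, the mtr started in a batch before the current apply.

What if an SMO happened between batch1, where the mtr started, and the current batch2? Is that still not a problem?

For example, when the mtr starts […]. Although the mtr holds the S lock on page 8 while running, it does not hold the X locks on page 9 and page 10. If the background apply […] then performs an SMO and global_sc = 101, is there a problem? […]arch_to_nth_level() holds index->lock in S mode while it runs, so the apply_phase is blocked at this point: the index->lock S lock is released only after the B-tree traversal, i.e. btr_cur_search_to_nth_level(), has finished. The two operations are mutually exclusive, so the background apply phase can never modify pages halfway through a B-tree traversal. It always starts after the traversal completes.

So as long as syn[…]o() to the latest version.

Another question:

If an mtr on the RO node never finishes, how does the background physical-replication redo batch apply proceed?

There are two cases:

If the current index never has an SMO, physical replication keeps proceeding normally. The current mtr->sync_count staying at 100 is not a problem either, […] the undo log finds the specified version.
If the current index does have […] version of son node B

This would cause one B-tree traversal to access pages of different versions.

This no longer seems to happen, because the apply thread must obtain the index lock before it can […]

bool poss_restore = (log_sched->apply_phase_flag.is_set() && (log_sched->index_lock_handler()->sync_counter() == index->sync_counter) && (log_sched->next_apply_lsn() != page_applied_lsn));

This is the same as the condition above for apply_r[…], because log_sched->next_apply_lsn is the LSN reached once the current redo batch has been fully applied.

So if the two values differ, the page is an old page.

If they are equal, the page has already been updated by this redo batch and is at the latest version, so no restore is needed.

Also, restore_position here has to hold the index S lock. Why?

  mtr_s_lock(dict_index_get_lock(index), mtr);

  btr_pcur_restore_position(BTR_SEARCH_LEAF | BTR_ALREADY_S_LATCHED, cursor, mtr);

Holding the index S lock makes the restore mutually exclusive with the apply phase, mirroring the way btr_cur_search_to_nth_level() holds the index S lock when accessing a child page. The background apply phase cannot proceed in the meantime, because it needs the index X lock to update index->sync_counter.

prev page path

Accessing prev_page inherently does store_position and restore_position, so no change is needed.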

Issue #111538: MySQL 8.0 instant add/drop column performance regression

2023-12-10T00:00:00+00:00

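Issue link: https://bugs.mysql.com/bug.php?id=111538

Impact: starting with 8.0.29, read-heavy workloads may see a 5%~10% performance regression.

In 8.0.29 MySQL added the instant add/drop column capability, which can instantly add or drop a column at any position in the table. PolarDB builds on this to support instant column modification; see our monthly report for details.

Official description of the implementation:

https://dev.mysql.com/blog-archive/mysql-8-0-instant-add-and-drop-columns/

The core idea of instant DDL is a single one: don’t touch any row but update the metadata only. Only the Data Dictionary (DD) is modified and the data itself is left untouched, which is what makes instant possible.

Concretely, each row gets a row_version, and the DD itself is multi-versioned, so different rows are parsed with different DD versions.

First, whether a record has row_version information is stored in the record's info bits.

The info bits contain the deleted flag, the min-record flag, and so on. In 8.0.12 a bit was added for whether the record has Instant ADD column information. In 8.0.29 a bit was added for whether the record has row_version information.

That is the background of the issue and the principle behind instant add/drop column. But where does the regression come from?

From the flame graphs Markus submitted, rec_get_offsets/cmp_dtuple_rec/rec_get_nth_field and others take a noticeably larger share in 8.0.33 than in 8.0.28. The overall cost of row_search_mvcc has increased as well.

The core cause is that records now carry row_version information, which adds many extra checks to the record-parsing functions such as rec_get_offsets/rec_get_nth_field. In addition, many inline functions were changed to non-inline.

To verify this, we made three changes; see the code submitted on the issue:

1. Change some non-inline functions back to inline

The following functions went from inline to non-inline: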

Function	8.0.27	8.2.0
rec_get_nth_field	inline	non-inline
rec_get_nth_field_offs	inline	non-inline
rec_init_offsets_comp_ordinary	inline	non-inline
rec_offs_nth_extern	inline	non-inline

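In our tests with oltp_read_only, changing these non-inline functions back to inline yields roughly a 3~5% performance gain. The code changes are available in the issue.

2. Simplify the get_rec_insert_state logic

8.0.29 added the get_rec_insert_state function, which determines which version a record originates from so that the right DD logic is used to parse it. If the table has row_version, it must also check whether the record carries version information and, if not, whether it comes from the 8.0.12 instant add column implementation, and so on. The logic is very fiddly.

Hence REC_INSERT_STATE has many states.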

enum REC_INSERT_STATE {
  /* Record was inserted before first instant add done in the earlier
  implementation. */
  INSERTED_BEFORE_INSTANT_ADD_OLD_IMPLEMENTATION,
  /* Record was inserted after first instant add done in the earlier
  implementation. */
  INSERTED_AFTER_INSTANT_ADD_OLD_IMPLEMENTATION,
  /* Record was inserted after upgrade but before first instant add done in the
  new implementation. */
  INSERTED_AFTER_UPGRADE_BEFORE_INSTANT_ADD_NEW_IMPLEMENTATION,
  /* Record was inserted before first instant add/drop done in the new
  implementation. */
  INSERTED_BEFORE_INSTANT_ADD_NEW_IMPLEMENTATION,
  /* Record was inserted after first instant add/drop done in the new
  implementation. */
  INSERTED_AFTER_INSTANT_ADD_NEW_IMPLEMENTATION,
  /* Record belongs to table with no version no instant */
  // The index has had neither the old instant add nor the new row_version-based instant add/drop
  INSERTED_INTO_TABLE_WITH_NO_INSTANT_NO_VERSION,
  NONE
};

The code that obtains insert_state:

static inline enum REC_INSERT_STATE get_rec_insert_state(
    const dict_index_t *index, const rec_t *rec, bool temp) {
  ut_ad(dict_table_is_comp(index->table) || temp);

  if (!index->has_instant_cols_or_row_versions()) {
    return INSERTED_INTO_TABLE_WITH_NO_INSTANT_NO_VERSION;
  }
  /* Position just before info-bits where version will be there if any */
  const byte *v_ptr =
      (byte *)rec -
      ((temp ? REC_N_TMP_EXTRA_BYTES : REC_N_NEW_EXTRA_BYTES) + 1);
  const bool is_versioned =
      (temp) ? rec_new_temp_is_versioned(rec) : rec_new_is_versioned(rec);
  // When the record is versioned, the version value is stored in the 1 byte between the info bits and the null field bitmap
  const uint8_t version = (is_versioned) ? (uint8_t)(*v_ptr) : UINT8_UNDEFINED;

  const bool is_instant = (temp) ? rec_get_instant_flag_new_temp(rec)
                                 : rec_get_instant_flag_new(rec);
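  // A record cannot have been processed by both the old instant add and the row_version-based instant add/drop
  // Presumably the row_version-based instant add/drop is the default going forward and the old one will be phased out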
  if (is_versioned && is_instant) {
    ib::error() << "Record has both instant and version bit set in Table '"
                << index->table_name << "', Index '" << index->name()
                << "'. This indicates that the table may be corrupt. Please "
                   "run CHECK TABLE before proceeding.";
  }
  enum REC_INSERT_STATE rec_insert_state = REC_INSERT_STATE::NONE;
  if (is_versioned) {
    ut_a(is_valid_row_version(version));
    if (version == 0) {
      ut_ad(index->has_instant_cols());
      // is_versioned means the record has a row_version. version == 0 means it was inserted before the row_version DDL; the DDL was then run and the instance upgraded, so row_version is set to 0 for these records
      rec_insert_state =
          INSERTED_AFTER_UPGRADE_BEFORE_INSTANT_ADD_NEW_IMPLEMENTATION;
    } else {
      // The normal case: inserted after the row_version DDL, carrying its own row_version
      ut_ad(index->has_row_versions());
      rec_insert_state = INSERTED_AFTER_INSTANT_ADD_NEW_IMPLEMENTATION;
    }
  } else if (is_instant) {
    // The record has no row_version flag, only the instant add flag
    // so it was inserted after an instant add, and no row_version DDL has been done
    ut_ad(index->table->has_instant_cols());
    rec_insert_state = INSERTED_AFTER_INSTANT_ADD_OLD_IMPLEMENTATION;
  } else if (index->table->has_instant_cols()) {
    // The record has neither the row_version nor the instant add flag, but the index has the instant add flag
    // so the record was inserted before the instant add
    rec_insert_state = INSERTED_BEFORE_INSTANT_ADD_OLD_IMPLEMENTATION;
  } else {
    // The record has neither the row_version nor the instant add flag, and the index has no instant add either
    // so the record was inserted before any row_version DDL or instant add was done
    rec_insert_state = INSERTED_BEFORE_INSTANT_ADD_NEW_IMPLEMENTATION;
  }

  ut_ad(rec_insert_state != REC_INSERT_STATE::NONE);
  return rec_insert_state;
}

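Although get_rec_insert_state is declared inline, that is only a hint to the compiler; whether the function is actually inlined is decided by the compiler. At run time this function was in fact not inlined: it shows up in the flame graph, which means it has symbol information, so it cannot have been inlined.

3. Replace switch/case with if/else and hint the likely branch to the compiler

Finally, we found that switch/case is unfriendly to branch prediction when some branches are clearly much more likely than others. With if/else we can manually order the branches by likelihood and thereby guide the compiler's choices.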

What Aurora Announced at AWS re:Invent 2023

2023-12-04T00:00:00+00:00

For last year's AWS re:Invent 2022 content, see: Aurora re:Invent 2022

AWS re:Invent 2023 has just ended. As a database practitioner, I mainly followed what changed in AWS Aurora this year.

I will cover four areas of interest:

Aurora Limitless
Global Database
Performance
Storage billing

Aurora Limitless

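The biggest announcement this year is probably Aurora Limitless, which targets database scaling. There are already many similar products, such as Spanner/TiDB/OceanBase/PolarDB-X.

In terms of product capability, it supports sharded tables and reference tables.

A sharded table is partitioned across multiple shards.

A reference table copies the data to every shard, so each shard has the complete data. It mainly targets joins, where it enables local joins and improves performance.

In terms of usage, users must explicitly set create_table_mode, create_table_shard_key, create_table_collocate_with and so on, so sharding is visible to the user.

-- Create sharded table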
SET rds_aurora.limitless_create_table_mode='sharded';
SET rds_aurora.limitless_create_table_shard_key='{"cust_id"}';
CREATE TABLE customer (
    cust_id INT PRIMARY KEY NOT NULL,
    name TEXT,
    email VARCHAR (100)
);

SET rds_aurora.limitless_create_table_mode='sharded';
SET rds_aurora.limitless_create_table_shard_key='{"cust_id"}';
SET rds_aurora.limitless_create_table_collocate_with='customer';

SET rds_aurora.limitless_create_table_mode ='reference';

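Technical implementation:

For distributed transactions, EC2 TimeSync service is used to build a solution similar to Google's TrueTime.

The core logic of the TrueTime approach is adding latency in the commit time. In Spanner this is called commit wait. Once earliest possible time > t110, the transaction can safely be committed. This necessarily adds latency to the commit. The more precise EC2 TimeSync service is, i.e. the narrower the [earliest possible time, latest possible time] interval, the smaller the impact on transaction commit.

Aurora Limitless optimizes this: the commit wait runs in parallel with the disk IO. In a compute-storage separated architecture, disk IO goes over the network and pays network latency; a single IO over TCP may typically take around 300~400us. EC2 TimeSync service guarantees accuracy at the microsecond level, so in most cases the commit wait is negligible, because the disk IO has usually not finished by the time the commit wait is over.

Note: T2 starts waiting after it has obtained commit@t110.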
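
The effect of overlapping the commit wait with the disk IO is simple arithmetic, sketched below. The function names and the sample numbers (a 50us commit wait, a 350us IO) are assumptions for illustration; the source only states that the clock accuracy is at the microsecond level and that a networked IO takes roughly 300~400us.

```python
def can_acknowledge(earliest_possible_time, commit_ts):
    """Commit wait rule: the commit may be acknowledged once earliest > commit timestamp."""
    return earliest_possible_time > commit_ts


def commit_latency_us(commit_wait_us, disk_io_us, overlapped):
    """Time until the commit can be acknowledged, with or without overlapping the two waits."""
    if overlapped:
        return max(commit_wait_us, disk_io_us)
    return commit_wait_us + disk_io_us


serial = commit_latency_us(commit_wait_us=50, disk_io_us=350, overlapped=False)
parallel = commit_latency_us(commit_wait_us=50, disk_io_us=350, overlapped=True)
hidden_cost = serial - parallel
```

As long as the commit wait is shorter than the IO, the overlapped latency equals the IO latency alone, which is why the wait can be ignored in most cases.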

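Author's view:

Aurora Limitless is in a somewhat awkward position and may not do well. It currently only supports an explicitly specified shard key, whereas PolarDB-X supports both an explicit shard key and fully transparent distribution, and TiDB likewise supports fully transparent distribution.

In practice, cloud distributed databases have long faced this awkward situation. Small customers have modest data volumes and write loads and do not need a distributed database; in most cases shared-storage products such as PolarDB/Aurora are enough. The few who do want a distributed database want it to be fully transparent, but without a specified shard key performance may be worse than on a single-node database. Small customers will most likely not adopt the way Aurora Limitless is used.

Large customers may have use cases for a distributed database and are willing to learn to specify a shard key for better performance, but they worry about vendor lock-in and similar issues. With distributed databases not yet standardized, they are reluctant to use a cloud vendor's distributed database and prefer to self-host open-source databases.

Global Database

Aurora Global Database introduced a planned primary switch called Switchover. In PolarDB's cross-AZ switching, primary AZ switchover is a similar capability.

In this scenario, Switchover waits until the write LSNs of the two regions are fully aligned before switching, which guarantees RPO = 0. It also ensures the standby region's resources match the primary region's, so performance is not affected after the switch.

One of their use cases: a customer performs three cross-region switches per day. Their business is global and daytime is the peak, so they keep switching to make sure the nearest region delivers the best read performance.

Aurora also retains the original Failover feature.

Aurora's cross-region failover has an RTO of 1~2 minutes. After the switch, Region A re-establishes the primary/standby relationship with Region B: Region B becomes the primary region and Region A becomes the standby region.

Region A also takes a snapshot at the moment of the crash, which makes it easier for users to query the data.

Performance

On performance, this release adds local NVMe SSDs to the compute nodes to offset the IO latency of cloud storage. PolarDB already has a similar capability called External BufferPool.

Author's view:

Today's storage engines such as InnoDB/RocksDB/ClickHouse were all designed for local disks and are not optimized for cloud storage, so a lot of optimization on the IO path is needed to reduce the impact of cloud storage latency. See the CloudJump paper for details.

I also believe the next step for storage engines is to go cloud native: the engine itself should make sensible use of cloud SSD, cloud storage, OSS and other resources to achieve the best price/performance. We call this Cloud-Tier-Engine.

Temporary tables are accelerated with local NVMe storage.

Tiered Cache: read acceleration for Aurora storage using the local disk.

The tiered cache keeps its metadata in the buffer pool. A read first checks the metadata: on a hit the page is read from the local disk, otherwise from Aurora storage.

So when is content written into the tiered cache?

As with External BufferPool, when a page is evicted from the LRU list (the page must not be dirty), it is not simply dropped from memory but added to the tiered cache. The implementation has to take the cost of the LRU list mutex into account.

The read path does not proactively update the tiered cache, which protects read performance.

An update only needs to change the tiered cache metadata to mark the cached page as invalid. Later reads will then not use the page in the tiered cache.

How are entries evicted from the tiered cache itself?

Aurora chose random eviction. PolarDB's implementation selects pages for eviction using an LRU algorithm.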
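
The tiered cache flow described above (metadata lookup on read, admission of clean pages on LRU eviction, invalidation on update, random eviction when full) can be sketched as a toy model. Everything here is hypothetical: the class TieredCache, its method names, and the use of a dict for both metadata and page images are simplifications for illustration, not Aurora's or PolarDB's implementation.

```python
import random


class TieredCache:
    """Toy model: `meta` plays the in-buffer-pool metadata, its values the local-disk copies."""

    def __init__(self, capacity, rng=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.meta = {}                          # page_id -> page image on local disk
        self.rng = rng or random.Random(0)

    def read(self, page_id, remote_storage):
        """Return (image, source). The read path never fills the cache."""
        if page_id in self.meta:
            return self.meta[page_id], "local"
        return remote_storage[page_id], "remote"

    def on_lru_evict(self, page_id, image, dirty):
        """Called when the buffer pool evicts a page; only clean pages are admitted."""
        if dirty:
            return False
        if page_id not in self.meta and len(self.meta) >= self.capacity:
            victim = self.rng.choice(sorted(self.meta))   # random eviction
            del self.meta[victim]
        self.meta[page_id] = image
        return True

    def on_update(self, page_id):
        """An update invalidates the cached copy so later reads go to remote storage."""
        self.meta.pop(page_id, None)
```

A typical sequence is: read a page from remote storage, let the buffer pool evict it later (which admits it into the cache), serve the next read locally, and drop the entry again when the page is updated. Swapping the random victim choice for an ordered structure would give the LRU variant.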

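Storage billing

On storage, Aurora finally released Aurora I/O-Optimized, which bills directly by storage size. The original billing mode, which also charges for I/O, is now called Aurora Standard.

Author's view:

Aurora's earlier storage billing had long been criticized. Most storage is billed by capacity, while Aurora storage was billed by both capacity and IOPS, which made it very hard for users to estimate their costs. That has finally changed.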