Bigtable Data Model and Architecture

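Preface

I have been reading the Bigtable paper lately, and the data model section never quite clicked for me. Below I first quote the relevant passages from the paper, each followed by my rendering of it, and then organize my own notes on the Bigtable data model.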

Introduction

In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Parallel databases and main-memory databases have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.

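In many respects Bigtable looks like a database: it borrows many database implementation strategies. Parallel databases and main-memory databases already offer scalability and high performance, but Bigtable exposes a different interface from those systems. It does not support a full relational data model. Instead it gives clients a simple data model with which they can dynamically control the layout and format of their data and reason about the locality of that data in the underlying storage. Data is indexed by row and column names, which can be arbitrary strings. Bigtable treats stored data as strings but does not interpret them; client programs usually serialize structured or semi-structured data into these strings. By choosing their schemas carefully, clients can control data locality. Finally, schema parameters let clients control whether data is served from memory or from disk.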

Data Model

A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

(row:string, column:string, time:int64) → string

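Bigtable is a sparse, distributed, persistent, multidimensional sorted map. The map is indexed by row key, column key, and timestamp, and every value in it is an uninterpreted byte string: (row:string, column:string, time:int64) -> string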
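To make the definition concrete for myself, here is a minimal Python sketch of the map as a plain dictionary. It is only an illustration, not how Bigtable is implemented: the integer timestamps 3, 5, 6, 8, and 9 stand in for t3, t5, t6, t8, and t9 in Figure 1, the byte values are placeholders, and the `sort_key` helper (including the way it orders columns within a row) is my own assumption.

```python
from typing import Dict, List, Tuple

# A cell key is (row, column, timestamp); the value is an uninterpreted byte string.
CellKey = Tuple[str, str, int]


def sort_key(key: CellKey) -> Tuple[str, str, int]:
    """Sort by row, then column, then newest timestamp first."""
    row, column, timestamp = key
    return (row, column, -timestamp)


# Sparse: only cells that actually exist take up an entry.
webtable: Dict[CellKey, bytes] = {
    ("com.cnn.www", "contents:", 3): b"<html>v3",
    ("com.cnn.www", "contents:", 6): b"<html>v6",
    ("com.cnn.www", "contents:", 5): b"<html>v5",
    ("com.cnn.www", "anchor:cnnsi.com", 9): b"CNN",
    ("com.cnn.www", "anchor:my.look.ca", 8): b"CNN.com",
}

ordered_keys: List[CellKey] = sorted(webtable, key=sort_key)

for key in ordered_keys:
    print(key, webtable[key])
```

Iterating over `ordered_keys` visits the two anchor cells first and then the three contents versions from newest to oldest, which is the "sorted map" part of the definition.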

We settled on this data model after examining a variety of potential uses of a Bigtable-like system. As one concrete example that drove some of our design decisions, suppose we want to keep a copy of a large collection of web pages and related information that could be used by many different projects; let us call this particular table the Webtable. In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they were fetched, as illustrated in Figure 1.

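We chose this data model after carefully examining the many potential uses of a Bigtable-like system. Take one concrete example that drove a number of our design decisions: suppose we want to store a huge collection of web pages and related information for use by many different projects, and call this particular table the Webtable. In Webtable, URLs serve as row keys, various aspects of a page serve as column names, and the page content is stored in the "contents:" column, versioned by the timestamp at which the page was fetched, as shown in Figure 1.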

Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN’s home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.

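Figure 1: a slice of the Webtable example. The row name is a reversed URL. The contents column family holds the page contents, and the anchor column family holds the text of the anchors that reference the page. CNN's home page is referenced by the home pages of Sports Illustrated and MY-look, so the row contains columns named "anchor:cnnsi.com" and "anchor:my.look.ca". Each anchor cell has a single version (timestamps identify the version; t9 and t8 mark the versions of the two anchor cells), while the contents column has three versions, marked by timestamps t3, t5, and t6.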

Rows

The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system’s behavior in the presence of concurrent updates to the same row.

Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses. For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. For example, we store data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.

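Row keys in a table can be arbitrary strings (currently up to 64KB, although 10-100 bytes is enough for most users). Every read or write under a single row key is atomic, no matter how many different columns in the row are read or written. This design decision makes it easy for users to understand how the system behaves when the same row is updated concurrently.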


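Bigtable organizes data in lexicographic order of row key. The row range of a table is partitioned dynamically. Each partition is called a tablet, which is the unit of data distribution and load balancing. As a result, reading a short range of rows is efficient and usually needs communication with only a few machines. Users can exploit this property by choosing row keys that give their data accesses good locality. In Webtable, for example, reversing the hostname components of each URL groups pages from the same domain into contiguous rows. Concretely, the data for maps.google.com/index.html is stored under the key com.google.maps/index.html. Storing pages from the same domain close together makes host- and domain-based analyses more efficient.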
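The hostname reversal is easy to try out. The sketch below is my own illustration, not code from the paper: `row_key` is a hypothetical helper, the URLs other than the paper's maps.google.com example are made up, and schemes such as "http://" are assumed to have been stripped already.

```python
from typing import List


def row_key(url: str) -> str:
    """Reverse the hostname components of a scheme-less URL.

    "maps.google.com/index.html" becomes "com.google.maps/index.html".
    """
    host, sep, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + sep + path


urls: List[str] = [
    "maps.google.com/index.html",
    "www.cnn.com/",
    "mail.google.com/inbox",
    "www.cnn.com/sports",
]

# Sorting lexicographically, as Bigtable does for row keys, puts pages
# from the same domain next to each other.
keys: List[str] = sorted(row_key(u) for u in urls)

for k in keys:
    print(k)
```

After sorting, the two cnn.com pages are adjacent and so are the two google.com pages, so a scan over one domain touches a single contiguous row range.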

Column Families

Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation. In contrast, a table may have an unbounded number of columns.

A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. An example column family for the Webtable is language, which stores the language in which a web page was written. We use only one column key in the language family, and it stores each web page’s language ID. Another useful column family for this table is anchor; each column key in this family represents a single anchor, as shown in Figure 1. The qualifier is the name of the referring site; the cell contents is the link text.

Access control and both disk and memory accounting are performed at the column-family level. In our Webtable example, these controls allow us to manage several different types of applications: some that add new base data, some that read the base data and create derived column families, and some that are only allowed to view existing data (and possibly not even to view all of the existing families for privacy reasons).

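Column keys are grouped into sets called column families, and the column family is the basic unit of access control. All data stored in one column family is usually of the same type (data in the same column family is compressed together). A column family must be created before data can be stored under any column key in it; once it exists, any column key within the family can be used. By design, a table should have only a small number of column families (a few hundred at most), and families should rarely change during operation. By contrast, a table may have an unbounded number of columns.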

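A column key is named with the syntax family:qualifier. The family name must be printable, while the qualifier can be an arbitrary string. For example, Webtable has a column family named language, which stores the language a web page was written in. Only one column key is used in the language family, and it holds each page's language ID. Another useful column family in Webtable is anchor: each column key in this family represents one anchor, as shown in Figure 1. The qualifier is the name of the referring site, and the cell content is the link text.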

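Access control and both disk and memory accounting are done at the column-family level. In the Webtable example, these controls let us manage several kinds of applications: some may add new base data, some may read base data and create derived column families, and some may only view existing data (and, for privacy reasons, perhaps not even all of it).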
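Two of the rules above, the family:qualifier syntax and "a family must be created before data can be stored under it", can be sketched in a few lines of Python. Everything here is hypothetical and for illustration only: the `Table` class, its methods, and the choice of `KeyError` are my own and are not Bigtable's client API.

```python
from typing import Dict, Set, Tuple


def split_column_key(column_key: str) -> Tuple[str, str]:
    """Split "family:qualifier" into its two parts; the qualifier may be empty."""
    family, sep, qualifier = column_key.partition(":")
    if not sep:
        raise ValueError("column key must have the form family:qualifier")
    if not family or not family.isprintable():
        raise ValueError("column family name must be a non-empty printable string")
    return family, qualifier


class Table:
    """Toy table that enforces 'create the family before writing to it'."""

    def __init__(self) -> None:
        self.families: Set[str] = set()
        self.cells: Dict[Tuple[str, str, int], bytes] = {}

    def create_family(self, name: str) -> None:
        self.families.add(name)

    def put(self, row: str, column_key: str, timestamp: int, value: bytes) -> None:
        family, _qualifier = split_column_key(column_key)
        if family not in self.families:
            raise KeyError(f"column family {family!r} has not been created")
        self.cells[(row, column_key, timestamp)] = value


table = Table()
table.create_family("anchor")
# Any qualifier works once the family exists; no per-column declaration is needed.
table.put("com.cnn.www", "anchor:my.look.ca", 8, b"CNN.com")
print(split_column_key("anchor:my.look.ca"))
```

Because the check happens on the family part only, this is also the natural place where per-family access control would sit.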

Timestamps

Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent "real time" in microseconds, or be explicitly assigned by client applications. Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.

To make the management of versioned data less onerous, we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days).

In our Webtable example, we set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled. The garbage-collection mechanism described above lets us keep only the most recent three versions of every page.

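Each cell in a Bigtable can hold multiple versions of the same data, indexed by timestamp. Bigtable timestamps are 64-bit integers. Bigtable can assign them, in which case they represent "real time" in microseconds, or client programs can assign them explicitly. An application that needs to avoid version collisions must generate unique timestamps itself. Within a cell, versions are stored in decreasing timestamp order, so the newest data comes first.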

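To ease the management of versioned data, each column family has two settings that tell Bigtable to garbage-collect obsolete versions automatically. The user can specify either that only the last n versions be kept, or that only "new enough" versions be kept (for example, only data written in the last seven days).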

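In the Webtable example, the timestamp stored in the contents: column is the time at which the crawler fetched the page. The garbage-collection mechanism described above lets us keep only the three most recent versions of each page.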
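The two garbage-collection settings can be written down as a small function. This is a sketch under my own assumptions rather than Bigtable's implementation: `gc_versions`, `max_versions`, and `min_timestamp` are hypothetical names, a cell is modelled as a dict from integer timestamp to value, and although the paper describes the two settings as alternatives, the function simply applies whichever ones are given.

```python
from typing import Dict, Optional


def gc_versions(
    versions: Dict[int, bytes],
    max_versions: Optional[int] = None,
    min_timestamp: Optional[int] = None,
) -> Dict[int, bytes]:
    """Return the versions of one cell that survive garbage collection.

    max_versions:  keep only the newest n versions.
    min_timestamp: keep only versions at least this new.
    """
    kept = sorted(versions, reverse=True)  # newest first
    if min_timestamp is not None:
        kept = [t for t in kept if t >= min_timestamp]
    if max_versions is not None:
        kept = kept[:max_versions]
    return {t: versions[t] for t in kept}


# Four crawls of the same page; keep only the three most recent, as in Webtable.
contents = {1: b"crawl-1", 3: b"crawl-3", 5: b"crawl-5", 6: b"crawl-6"}
print(gc_versions(contents, max_versions=3))
```

With `max_versions=3` the oldest crawl (timestamp 1) is dropped and the remaining versions come back newest first.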

Notes on the Bigtable Data Model

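Bigtable is not a relational database, yet it borrows much relational terminology, such as table, row, and column. Bigtable is meant for semi-structured data with fairly complex relationships. Data can be roughly divided into three kinds:

Unstructured data: office documents of every format, text, pictures, images, audio, video, and so on.

Structured data: usually stored in a relational database and representable as two-dimensional relational tables. The schema (attributes, data types, and relationships between data) is separate from the content and must be defined in advance.

Semi-structured data: sits between unstructured and structured data; HTML documents are an example. It is usually self-describing. The biggest difference from structured data is that schema and content are mixed together with no clear boundary, and the schema does not need to be defined in advance.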

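At its core, Bigtable is a key/value map. In the authors' words, it is a sparse, distributed, persistent, multidimensional sorted map. The key has three dimensions: row key, column key, and timestamp. Row keys and column keys are byte strings, the timestamp is a 64-bit integer, and the value is a byte string.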

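Unlike a two-dimensional relational table, a Bigtable key is three-dimensional. A two-dimensional table is like an ordinary everyday table or a database table: a row and a column locate a piece of information, and they identify it uniquely. The three-dimensional structure is more like a bookshelf in a library. A row and a column identify a slot on the shelf (adjacent slots hold books of the same kind, and as noted above, "similar" data in Bigtable is also stored close together) but not a single book. The timestamp gives the books in the slot an order, and the reader uses that order to pick the book they want. For example, t5 in Figure 1 can be thought of as the fifth book from the left.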

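A row key can be any byte string, and reads and writes of a row are atomic. Bigtable stores data in lexicographic order of row key. A table is automatically partitioned by row key into tablets, and the tablet is the unit of load balancing. A table starts with a single tablet; as it grows, tablets split automatically so that each stays at roughly 100-200 MB. Bigtable groups columns into column families, so a column name has two parts: (column family, qualifier). The column family is the basic unit of access control, meaning that access permissions are set at the family level. Column families must be defined in advance, when the table is created, and there should not be too many of them; the qualifiers within each family, however, need not be defined in advance.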
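Since tablets are contiguous ranges of sorted row keys, finding the tablet for a key amounts to a binary search over the split points. The sketch below is only my mental model, not Bigtable's actual tablet-location scheme: `TabletMap`, `tablet_for`, and `split_at` are hypothetical names, and the size-based decision about when to split is left out.

```python
import bisect
from typing import List


class TabletMap:
    """Toy row-range partitioning: tablet i holds keys in [split[i-1], split[i])."""

    def __init__(self) -> None:
        # A new table has no split points, i.e. a single tablet covering all keys.
        self.split_points: List[str] = []

    def tablet_for(self, row_key: str) -> int:
        """Return the index of the tablet whose row range contains row_key."""
        return bisect.bisect_right(self.split_points, row_key)

    def split_at(self, row_key: str) -> None:
        """Split the tablet containing row_key into two at that key."""
        if row_key not in self.split_points:
            bisect.insort(self.split_points, row_key)

    def tablet_count(self) -> int:
        return len(self.split_points) + 1


tablets = TabletMap()
print(tablets.tablet_for("com.cnn.www"))  # only one tablet so far

tablets.split_at("com.g")  # the table grew, so split somewhere in the middle
print(tablets.tablet_for("com.cnn.www"), tablets.tablet_for("com.google.maps/index.html"))
```

Keys that sort close together land in the same tablet, which is why a scan over a short row range usually talks to only a few machines.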

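What first struck me as odd when reading about Bigtable was the three-dimensional key. The maps we normally use are one-dimensional, and even a database table is only two-dimensional. This multi-level structure reminded me of multi-level page tables in an operating system: building an index in levels keeps the memory footprint small, still covers a large amount of data, and avoids blindly scanning page-table entries in sequence.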

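By definition, a Bigtable's row, column, and timestamp are all indexes. The row is the first-level index. If we treat a row's columns, timestamps, and values as a single unit (structure 1), the table reduces to a one-dimensional key/value map like this: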

table{
  "11111" : {structure 1}, // row 1
  "aaaaa" : {structure 1}, // row 2
  "asdaa" : {structure 1}, // row 3
  "bbbbb" : {structure 1}, // row 4
  "zzzzz" : {structure 1}  // row 5
}

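The column is the second-level index. The number of columns in a row is unrestricted, and columns can be added or removed at any time. Columns in the same family usually hold data of the same type. The column families of a row rarely change, but columns within a family can be added and removed freely. Column keys are named in the form family:qualifier. Pulling the columns out and treating timestamp and value as one unit (structure 2), we get a two-dimensional key/value map like this: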

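table{
  // ...
  "aaaaa" : { // row 1
    "A:foo" : {structure 2}, // column 1: family A, qualifier foo
    "A:bar" : {structure 2}, // column 2: family A, qualifier bar
    "B:" : {structure 2}     // column 3: family B, empty qualifier
  },
  "bbbbb" : { // row 2
    "A:foo" : {structure 2}, // column 1: family A, qualifier foo
    "B:asd" : {structure 2}  // column 2: family B, qualifier asd
  },
  // ...
}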

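The timestamp is the third-level index. Bigtable can keep multiple versions of a piece of data, and the timestamp is what distinguishes them. Versions are stored in decreasing timestamp order, so the newest version is read first. Adding the timestamp gives the complete Bigtable data model, like this: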

table{
  // ...
  "aaaaa" : { // row
    "A:foo" : { // column 1
      6 : "y", // version 1
      5 : "m"  // version 2
    },
    "A:bar" : { // column 2
      15 : "d" // version 1
    },
    "B:" : { // column 3
      12 : "w", // version 1
      10 : "o", // version 2
      9 : "w"   // version 3
    }
  },
  // ...
}

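On a query, if only the row and column are given, the newest version is returned; if a timestamp is given as well, the returned version is the newest one whose time is less than or equal to that timestamp. For example, querying "aaaaa"/"A:foo"/6 returns "y"; querying "aaaaa"/"A:foo"/5 returns "m"; querying "aaaaa"/"A:foo"/2 returns nothing. In Figure 1, the "contents:" column holds three versions of the page, and we can use ("com.cnn.www", "contents:", t5) to look up the content of CNN's home page at time t5.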
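The lookup rule is short enough to check in Python. This is a sketch of the rule as I describe it here, using the nested-map example with string values; `read` is a hypothetical helper of my own and not Bigtable's client API.

```python
from typing import Dict, Optional

# The complete example from above: row -> column -> timestamp -> value.
table: Dict[str, Dict[str, Dict[int, str]]] = {
    "aaaaa": {
        "A:foo": {6: "y", 5: "m"},
        "A:bar": {15: "d"},
        "B:": {12: "w", 10: "o", 9: "w"},
    },
}


def read(
    data: Dict[str, Dict[str, Dict[int, str]]],
    row: str,
    column: str,
    timestamp: Optional[int] = None,
) -> Optional[str]:
    """Return the newest version at or before `timestamp`.

    With no timestamp, return the newest version overall.
    Return None when no version qualifies or the cell does not exist.
    """
    versions = data.get(row, {}).get(column, {})
    candidates = [t for t in versions if timestamp is None or t <= timestamp]
    if not candidates:
        return None
    return versions[max(candidates)]


print(read(table, "aaaaa", "A:foo"))     # newest version
print(read(table, "aaaaa", "A:foo", 5))  # version at timestamp 5
print(read(table, "aaaaa", "A:foo", 2))  # nothing that old exists
```

The three calls correspond to the three queries in the text: the first returns "y", the second returns "m", and the third returns None because no version has a timestamp of 2 or less.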


Published 2024-09-22 13:38:30.
