A while back, my advisor asked me to look into the GreenPlum database. I installed it and tried it out, felt it was no different from any other database, and let the matter drop. Now I'm going back over GPDB's features and actually trying them out.
In truth, the features written on the official marketing page are exactly what I need to understand.
Features
First: what are GPDB's features, and where do I find them? The product's marketing page is the obvious place to look.
The Biggest Feature
GPDB's defining feature is MPP, i.e., Massively Parallel Processing. On the home page, the two most prominent words are:
- Massively Parallel
- Analytics
Scrolling down, two more prominent claims are:
- Power at scale: high performance on petabyte-scale data volumes.
- True flexibility: deploy anywhere.
Key Features
Further down, the home page spells out GPDB's key features:
- MPP Architecture
- Petabyte-Scale Loading: loading speed grows with each additional node, exceeding 10 TB per hour per rack (roughly 2.8 GB/s).
- Innovative Query Optimization: the industry's first cost-based query optimizer for big-data workloads.
- Polymorphic Data Storage: full control over the configuration of table and partition storage, execution, and compression (see the sketch after this list).
- Integrated In-Database Analytics: a library provided by Apache MADlib for scalable in-database analytics, extending Greenplum's SQL capabilities through user-defined functions.
- Federated Data Access: query external data sources, including Hadoop, cloud storage, ORC, AVRO, Parquet, and other polyglot data stores, using the Greenplum optimizer and query processing engine.
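To make polymorphic storage concrete, here is a minimal sketch using the append-optimized, column-oriented storage options from the GPDB docs (the table itself is my own illustration):

```sql
-- Hypothetical table: column-oriented, zlib-compressed, append-optimized storage,
-- distributed across segments by id.
CREATE TABLE sales_ao (
    id  bigint,
    amt numeric
)
WITH (appendonly=true, orientation=column, compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (id);
```

Options like these can also be set per partition, which is what "polymorphic" refers to.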
Unfortunately, none of these icons on the home page are clickable, so to experience the features I have to explore on my own.
TODO
- Query data from Hadoop.
GPDB vs. Hadoop + Hive
Source: Greenplum介绍, a report Alibaba gave on 2011-02-17, against GPDB 4.x.
Compared with Hadoop + Hive, GPDB's query performance is better, but it supports far fewer cluster nodes: at most about 1000 segments, whereas Hive can scale to tens of thousands of nodes.
Data Loading
There are three main loading methods:
- the SQL INSERT statement: inefficient for large volumes; suited to small datasets.
- the COPY command: lets you specify the format of the text file so it can be parsed into columns and rows. Faster than INSERT, but not a parallel process (a short sketch follows this list).
- gpfdist and gpload: efficiently move external data into tables with fast, parallel loading. An administrator can enable single row error isolation mode so that well-formed rows keep loading. gpload requires a YAML-formatted control file written in advance, describing the source data location, format, required transformations, participating hosts, database destination, and so on; this allows complex loading jobs.
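As promised, a minimal COPY sketch (the table and file path are made up for illustration):

```sql
-- Parse a pipe-delimited text file into rows and columns.
-- COPY runs through the master, so it is not parallel.
COPY expenses FROM '/data/expenses.txt'
WITH (FORMAT text, DELIMITER '|', NULL ' ');
```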
Below, only gpfdist and gpload usage is demonstrated:
gpfdist
```
gpfdist [-d directory] [-p http_port] [-l log_file] [-t timeout]
```
Example:
This uses the gpfdist protocol to create a readable external table, ext_expenses, from all files with the txt extension. The column delimiter is a pipe (|) and NULL (' ') is a space. Access to the external table is in single row error isolation mode. If the error count on a segment is greater than five (the SEGMENT REJECT LIMIT value), the entire external table operation fails and no rows are processed.
```sql
=# CREATE EXTERNAL TABLE ext_expenses ( name text,
     date date, amount float4, category text, desc1 text )
   LOCATION ('gpfdist://etlhost-1:8081/*.txt')  -- host/port illustrative
   FORMAT 'TEXT' (DELIMITER '|' NULL ' ')
   LOG ERRORS SEGMENT REJECT LIMIT 5;
```
To create the readable ext_expenses table from CSV-formatted text files:
```sql
=# CREATE EXTERNAL TABLE ext_expenses ( name text,
     date date, amount float4, category text, desc1 text )
   LOCATION ('gpfdist://etlhost-1:8081/*.txt')  -- host/port illustrative
   FORMAT 'CSV' ( DELIMITER ',' );
```
Below is my own experiment.
Create a data directory ~/Datasets/baike and move the dataset baike_triples.txt into it. The dataset comes from CN-DBpedia and contains 9M+ encyclopedia entities and 67M+ triples, including 1.1M+ mention2entity records, 4M+ abstracts, 19.8M+ tags, and 41M+ infobox entries, about 4.0 GB in total. Sample rows:
```
"1+8"时代广场 中文名 "1+8"时代广场
"1+8"时代广场 地点 咸宁大道与银泉大道交叉口
"1+8"时代广场 实质 城市综合体项目
"1+8"时代广场 总建面 约11.28万方
"1.4"河南兰考火灾事故 中文名 "1.4"河南兰考火灾事故
"1.4"河南兰考火灾事故 地点 河南<a>兰考县</a>城关镇
"1.4"河南兰考火灾事故 时间 2013年1月4日
```
Start gpfdist in the background:

```bash
gpfdist -d ~/Datasets/baike -p 8081 > /tmp/gpfdist.log 2>&1 &
ps -A | grep gpfdist             # find the process ID
# 30693 pts/8 00:00:00 gpfdist   -> the PID is 30693
```

Option notes:
- -d directory: the directory from which gpfdist serves files for readable external tables, or in which it creates output files for writable external tables. Defaults to the current directory.
- -p http_port: the HTTP port on which gpfdist serves files. Defaults to 8080.
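Incidentally, the same gpfdist instance can also receive data unloaded from the database through a writable external table; a minimal sketch (the table and output file names are made up):

```sql
CREATE WRITABLE EXTERNAL TABLE ext_baike_out (head text, rel text, tail text)
    LOCATION ('gpfdist://vm1:8081/baike_out.txt')
    FORMAT 'TEXT' (DELIMITER E'\t');
-- INSERT INTO ext_baike_out SELECT head, rel, tail FROM tb_baike;
```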
View the log:
```
gt@vm1:~$ more /tmp/gpfdist.log
2020-01-13 02:49:38 30693 INFO Before opening listening sockets - following listening sockets are available:
2020-01-13 02:49:38 30693 INFO IPV6 socket: [::]:8081
2020-01-13 02:49:38 30693 INFO IPV4 socket: 0.0.0.0:8081
2020-01-13 02:49:38 30693 INFO Trying to open listening socket:
2020-01-13 02:49:38 30693 INFO IPV6 socket: [::]:8081
2020-01-13 02:49:38 30693 INFO Opening listening socket succeeded
2020-01-13 02:49:38 30693 INFO Trying to open listening socket:
2020-01-13 02:49:38 30693 INFO IPV4 socket: 0.0.0.0:8081
Serving HTTP on port 8081, directory /home/gt/Datasets/baike
```
As gpadmin, open a psql session and create the tables: ext_baike, an external table for reading the loaded data, with LOG ERRORS capturing the rows that fail to parse (in GPDB 6 these go to an internal error log rather than a separate error table such as ext_load_baike_err), and tb_baike for internal storage.

```sql
psql -h localhost -d db_kg  # connect to database db_kg
# create the external table
CREATE EXTERNAL TABLE ext_baike (
    head text, rel text, tail text)
LOCATION ('gpfdist://vm1:8081/baike_triples.txt')
FORMAT 'TEXT' (DELIMITER E'\t')
LOG ERRORS SEGMENT REJECT LIMIT 50000;
# create the internal storage table
CREATE TABLE tb_baike (
    id SERIAL PRIMARY KEY, head text, rel text, tail text);
```

For the full external-table syntax, see CREATE EXTERNAL TABLE.
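With single row error isolation enabled, any rejected rows can be inspected after a load; a quick check using GPDB's built-in error log (my addition, not from the original session):

```sql
-- count the malformed rows captured for this external table
SELECT count(*) FROM gp_read_error_log('ext_baike');
-- peek at a few of them
SELECT cmdtime, errmsg, rawdata
FROM gp_read_error_log('ext_baike')
LIMIT 5;
```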
Once the external table exists, you can read from it directly, for example:
```
db_kg=# select * from ext_baike limit 10;
      head       |   rel    |     tail
-----------------+----------+---------------
 *裔*            | 中文名   | *裔*
 *裔*            | 作者     | Amarantine
 *裔*            | 小说进度 | 暂停
 *裔*            | 连载网站 | 晋江文学城
 *西方犯罪学概论 | BaiduTAG | 书籍
 *西方犯罪学概论 | ISBN     | 9787811399967
 *西方犯罪学概论 | 作者     | 李明琪 编
 *西方犯罪学概论 | 出版时间 | 2010-4
 *西方犯罪学概论 | 定价     | 22.00元
 *西方犯罪学概论 | 页数     | 305
(10 rows)
```
Load the external table's data into the internal table:

```sql
INSERT INTO tb_baike(head, rel, tail) SELECT * FROM ext_baike;
```
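If the load completes, a quick row-count comparison verifies it (a trivial check I'd run, not from the original session):

```sql
SELECT (SELECT count(*) FROM ext_baike) AS source_rows,
       (SELECT count(*) FROM tb_baike)  AS loaded_rows;
```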
In my case the VM ran out of storage, so the run ultimately failed (it started at 15:33:30 and ended abnormally around 16:03). The log shows:

```
more greenplum/data/data1/primary/gpseg0/pg_log/gpdb-2020-01-13_000000.csv
2020-01-13 04:10:14.623480 UTC,,,p24832,th930718592,,,,0,,,seg0,,,,,"PANIC","53100","could not write to file ""pg_xlog/xlogtemp.24832"": No space left on device",,,,,,,0,,"xlog.c"
```
Other errors encountered:

```
db_kg=# INSERT INTO tb_baike(head, rel, tail) SELECT * FROM ext_baike;
ERROR:  gpfdist error: unknown meta type 108 (url_curl.c:1635)  (seg0 slice1 127.0.1.1:6000 pid=15880) (url_curl.c:1635)
CONTEXT:  External table ext_baike, file gpfdist://vm1:8081/baike_triples.txt
```

A second attempt with LIMIT 10000 appended (started 16:10:00, ended 17:12:42) failed with:

```
ERROR:  interconnect encountered a network error, please check your network  (seg3 slice1 192.168.56.6:6001 pid=11612)
DETAIL:  Failed to send packet (seq 1) to 127.0.1.1:56414 (pid 15913 cid -1) after 3562 retries in 3600 seconds
```
Cost-based Query Optimizer
When the master receives a SQL statement, it parses it into an execution-plan DAG. Parts of the DAG that need no data exchange are grouped into slices; joins, aggregates, and sorts all require redistributing data between slices, and a motion operator performs that redistribution. The slices are then dispatched to the segments involved. Source: GreenPlum:基于PostgreSQL的分布式关系型数据库.
See also About Greenplum Query Processing:
slice: To achieve maximum parallelism during query execution, Greenplum divides the work of the query plan into slices. A slice is a portion of the plan that segments can work on independently. A query plan is sliced wherever a motion operation occurs in the plan, with one slice on each side of the motion.
motion: A motion operation involves moving tuples between the segments during query processing. Note that not every query requires a motion; for example, a targeted query plan does not require data to move across the interconnect.
- A redistribute motion moves tuples between the segments to complete a join.
- A gather motion is when the segments send results back up to the master for presentation to the client. Because a query plan is always sliced wherever a motion occurs, such a plan also has an implicit slice at the very top of the plan (slice 3).
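EXPLAIN makes the slices and motions visible; a sketch against the tables above (the plan shape is indicative, not a captured run):

```sql
EXPLAIN SELECT rel, count(*) FROM tb_baike GROUP BY rel;
-- Expect nodes along the lines of:
--   Gather Motion 4:1 (slice2; segments: 4)            -- segments send results to the master
--     HashAggregate
--       Redistribute Motion 4:4 (slice1; segments: 4)  -- re-hash rows by rel
-- each motion marks a slice boundary.
```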
Federated Data Access
MADlib Extension
Adding the madlib extension
Download MADlib 1.16+8 for RHEL 7 from Pivotal Network.
Install it with gppkg. An error occurred:
```
gt@vm1:~/madlib-1.16+8-gp6-rhel7-x86_64$ gppkg -i madlib-1.16+8-gp6-rhel7-x86_64.gppkg
```
In the end, I reinstalled the OS with CentOS 7. Pivotal Network states which systems madlib can be installed on: it only provides binary packages for Redhat 6.x and Redhat 7.x, so the Ubuntu 18.04 I had installed earlier was no use. Redhat Enterprise Linux is beyond my budget, so the free options were Fedora or CentOS; I went with CentOS 7:
```
[gpadmin@vm1 ~]$ cat /etc/os-release
```
The installed Greenplum is version 6.3, so the file to download is madlib-1.16+9-gp6-rhel7-x86_64.tar.gz. After extraction:
```
[gpadmin@vm1 madlib-1.16+9-gp6-rhel7-x86_64]$ ls -la
```
The madlib-1.16+9-gp6-rhel7-x86_64.gppkg file inside must be installed with GreenPlum's gppkg tool. Its usage is:
```
$GPHOME/bin/gppkg [-i <package> | -u <package> | -r <name-version> | -c]
```
Install:

```
$GPHOME/bin/gppkg -i madlib-1.16+9-gp6-rhel7-x86_64.gppkg
```
Then install the MADlib objects into a database:

```
$GPHOME/madlib/bin/madpack install -s madlib -p greenplum -c gpadmin@vm1:5432/testdb
```
The install log follows (m4 must be installed first: yum install m4):

```
[gpadmin@vm1 madlib-1.16+9-gp6-rhel7-x86_64]$ $GPHOME/madlib/bin/madpack install -p greenplum
```
Using madlib
An error (in database db_kg):
ERROR: schema "madlib" does not exist
Because no database was specified when installing madlib with madpack, using madlib in db_kg failed. Check with madpack install-check:

```
[gpadmin@vm1 ~]$ $GPHOME/madlib/bin/madpack install-check -p greenplum -c gpadmin@vm1:5432/db_madlib_demo
```
This confirmed that madlib was not installed in database db_kg. Looking back at the madpack install log, it had defaulted to the gpadmin database, i.e., the database named after the user:

```
madpack.py: INFO : MADlib database version = None (host=localhost:5432, db=gpadmin, schema=madlib)
```
Testing against that default database showed madlib was indeed installed there:

```
[gpadmin@vm1 ~]$ $GPHOME/madlib/bin/madpack install-check -p greenplum
```

So the fix is simply to install madlib into db_kg as well.
To check from inside a database whether madlib is installed:

```
[gpadmin@vm1 ~]$ psql -d db_madlib_demo
```
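Once connected, a simple probe works too (my addition; madlib.version() errors out if the schema is missing):

```sql
SELECT madlib.version();
-- or just check that the schema exists:
SELECT count(*) FROM pg_namespace WHERE nspname = 'madlib';
```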
Now for the main content…
Using MADlib is documented in detail in the MADlib Documentation. There is far too much to cover, so I'll pick out just a few cases to study.
Array Operations
Array operations, copied from Array Operations:
Operation | Description |
---|---|
array_add() | Adds two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
sum() | Aggregate, sums vector element-wisely. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_sub() | Subtracts two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_mult() | Element-wise product of two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_div() | Element-wise division of two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_dot() | Dot-product of two arrays. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_contains() | Checks whether one array contains the other. This function returns TRUE if each non-zero element in the right array equals to the element with the same index in the left array. |
array_max() | This function finds the maximum value in the array. NULLs are ignored. Return type is the same as the input type. |
array_max_index() | This function finds the maximum value and corresponding index in the array. NULLs are ignored. Return type is array in format [max, index], and its element type is the same as the input type. |
array_min() | This function finds the minimum value in the array. NULLs are ignored. Return type is the same as the input type. |
array_min_index() | This function finds the minimum value and corresponding index in the array. NULLs are ignored. Return type is array in format [min, index], and its element type is the same as the input type. |
array_sum() | This function finds the sum of the values in the array. NULLs are ignored. Return type is the same as the input type. |
array_sum_big() | This function finds the sum of the values in the array. NULLs are ignored. Return type is always FLOAT8 regardless of input. This function is meant to replace array_sum() in cases when a sum may overflow the element type. |
array_abs_sum() | This function finds the sum of abs of the values in the array. NULLs are ignored. Return type is the same as the input type. |
array_abs() | This function takes an array as the input and finds abs of each element in the array, returning the resulting array. It requires that all the values are NON-NULL. |
array_mean() | This function finds the mean of the values in the array. NULLs are ignored. |
array_stddev() | This function finds the standard deviation of the values in the array. NULLs are ignored. |
array_of_float() | This function creates an array of set size (the argument value) of FLOAT8, initializing the values to 0.0. |
array_of_bigint() | This function creates an array of set size (the argument value) of BIGINT, initializing the values to 0. |
array_fill() | This functions set every value in the array to some desired value (provided as the argument). |
array_filter() | This function takes an array as the input and keep only elements that satisfy the operator on specified scalar. It requires that the array is 1-D and all the values are NON-NULL. Return type is the same as the input type. By default, this function removes all zeros. |
array_scalar_mult() | This function takes an array as the input and executes element-wise multiplication by the scalar provided as the second argument, returning the resulting array. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_scalar_add() | This function takes an array as the input and executes element-wise addition of the scalar provided as the second argument, returning the resulting array. It requires that all the values are NON-NULL. Return type is the same as the input type. |
array_sqrt() | This function takes an array as the input and finds square root of each element in the array, returning the resulting array. It requires that all the values are NON-NULL. |
array_pow() | This function takes an array and a float8 as the input and finds power of each element in the array, returning the resulting array. It requires that all the values are NON-NULL. |
array_square() | This function takes an array as the input and finds square of each element in the array, returning the resulting array. It requires that all the values are NON-NULL. |
normalize() | This function normalizes an array as sum of squares to be 1. It requires that the array is 1-D and all the values are NON-NULL. |
array_unnest_2d_to_1d() | This function takes a 2-D array as the input and unnests it by one level. It returns a set of 1-D arrays that correspond to rows of the input array as well as an ID column with values corresponding to row positions occupied by those 1-D arrays within the 2-D array. |
Example:
Create a table and insert data:
```sql
CREATE TABLE array_tbl ( id integer NOT NULL PRIMARY KEY,
                         array1 integer[],
                         array2 integer[]
);
INSERT INTO array_tbl VALUES
    ( 1, '{1,2,3,4,5,6,7,8,9}', '{9,8,7,6,5,4,3,2,1}' ),
    ( 2, '{1,1,0,1,1,2,3,99,8}','{0,0,0,-5,4,1,1,7,6}' );
```
Use the functions:

```
db_madlib_demo=# select id,
                        madlib.array_sum(array1),
                        madlib.array_sub(array2, array1),
                        madlib.array_max(array1),
                        madlib.array_min(array1),
                        madlib.array_mean(array1),
                        madlib.normalize(array1)
                 from array_tbl group by id;
 id | array_sum |          array_sub          | array_max | array_min |    array_mean    | normalize
----+-----------+-----------------------------+-----------+-----------+------------------+-----------
  1 |        45 | {8,6,4,2,0,-2,-4,-6,-8}     |         9 |         1 |                5 | {0.0592348877759092,0.118469775551818,0.177704663327728,0.236939551103637,0.296174438879546,0.355409326655455,0.414644214431365,0.473879102207274,0.533113989983183}
  2 |       116 | {-1,-1,0,-6,3,-1,-2,-92,-2} |        99 |         0 | 12.8888888888889 | {0.0100595273380576,0.0100595273380576,0,0.0100595273380576,0.0100595273380576,0.0201190546761152,0.0301785820141728,0.995893206467704,0.0804762187044609}
(2 rows)
```
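Two more operations from the table above, tried on the same data (my own sketch; output omitted):

```sql
SELECT id,
       madlib.array_dot(array1::float8[], array2::float8[]) AS dot,
       madlib.array_contains(array1, array2)                AS contains
FROM array_tbl;
```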
Low-Rank Matrix Factorization
This module implements factorization of an incomplete matrix using a low-rank approximation.
Mathematically, this model seeks to find matrices $U$ and $V$ (also referred to as factors) that, for any given incomplete matrix $A$, minimize
$\|A - UV^T\|_2$, s.t. $\operatorname{rank}(UV^T) \le r$,
where $\|\cdot\|_2$ denotes the Frobenius norm. Let $A$ be an $m \times n$ matrix; then $U$ will be $m \times r$ and $V$ will be $n \times r$ in dimension, with $1 \le r \ll \min(m, n)$. This model is not intended to do the full decomposition, or to be used as part of an inverse procedure. It has been widely used in recommendation systems (e.g., Netflix [2]) and feature selection (e.g., image processing [3]).
Function Syntax
```sql
lmf_igd_run( rel_output,     -- TEXT. The name of the table to receive the output.
             rel_source,     -- TEXT. The source table of (row, column, value) entries.
             col_row,        -- TEXT. Name of the row-index column.
             col_column,     -- TEXT. Name of the column-index column.
             col_value,      -- TEXT. Name of the value column.
             row_dim,        -- INTEGER. Number of rows in the matrix.
             column_dim,     -- INTEGER. Number of columns in the matrix.
             max_rank,       -- INTEGER. The rank r of the factorization.
             stepsize,       -- DOUBLE PRECISION. Gradient step size.
             scale_factor,   -- DOUBLE PRECISION. Initialization scale factor.
             num_iterations, -- INTEGER. Number of passes over the data.
             tolerance )     -- DOUBLE PRECISION. Convergence tolerance.
-- Parameter list reconstructed from the call below and the PL/pgSQL signature in its output.
```
Examples
Create a table and insert data:

```sql
CREATE TABLE lmf_data (row INT, col INT, val FLOAT8, primary key (row, col));
INSERT INTO lmf_data VALUES (1, 1, 5.0), (3, 100, 1.0), (999, 10000, 2.0);
```
Execute the function:

```
db_madlib_demo=# SELECT madlib.lmf_igd_run('lmf_model', 'lmf_data', 'row', 'col', 'val', 999, 10000, 3, 0.1, 2, 10, 1e-9);
NOTICE:  Matrix lmf_data to be factorized: 999 x 10000
NOTICE:  Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'id' as the Greenplum Database data distribution key for this table.
HINT:  The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
CONTEXT:  SQL statement "
CREATE TABLE lmf_model (
    id SERIAL,
    matrix_u DOUBLE PRECISION[],
    matrix_v DOUBLE PRECISION[],
    rmse DOUBLE PRECISION)"
PL/pgSQL function madlib.lmf_igd_run(character varying,regclass,character varying,character varying,character varying,integer,integer,integer,double precision,double precision,integer,double precision) line 47 at EXECUTE statement
NOTICE:
Finished low-rank matrix factorization using incremental gradient
DETAIL:
 * table : lmf_data (row, col, val)
Results:
 * RMSE = 3.61661832699015e-06
Output:
 * view : SELECT * FROM lmf_model WHERE id = 1
 lmf_igd_run
-------------
           1
(1 row)
```
Check the result.
```
db_madlib_demo=# SELECT array_dims(matrix_u) AS u_dims, array_dims(matrix_v) AS v_dims
                 FROM lmf_model
                 WHERE id = 1;
    u_dims    |     v_dims
--------------+----------------
 [1:999][1:3] | [1:10000][1:3]
(1 row)
```
Query the result values:
```
db_madlib_demo=# SELECT matrix_u[2:2][1:3] AS row_2_features
                 FROM lmf_model
                 WHERE id = 1;
                     row_2_features
---------------------------------------------------------
 {{1.97037281095982,0.312463999725878,1.06016968935728}}
(1 row)
```
Predict a missing entry (row=2, col=7654):
```
db_madlib_demo=# SELECT madlib.array_dot(
                     matrix_u[2:2][1:3],
                     matrix_v[7654:7654][1:3]
                 ) AS row_2_col_7654
                 FROM lmf_model
                 WHERE id = 1;
  row_2_col_7654
------------------
 2.37682774869935
(1 row)
```
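The same dot-product pattern can sanity-check a known entry, e.g. (row=1, col=1), which was loaded as 5.0; given the reported RMSE of ~3.6e-06, the reconstruction should come out very close (my own follow-up, not from the original session):

```sql
SELECT madlib.array_dot(
    matrix_u[1:1][1:3],
    matrix_v[1:1][1:3]
) AS row_1_col_1   -- expect a value very close to 5.0
FROM lmf_model
WHERE id = 1;
```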
Singular Value Decomposition
```
SVD Function for Dense Matrices
```
Neural Network
Classification Training Function
```
mlp_classification(
```
Regression Training Function
```
mlp_regression(
```
Optimizer Parameters
```
'learning_rate_init = <value>,
```
Prediction Function
```
mlp_predict(
```
…
There are many more functions that I haven't tried one by one; I'll look them up when I need them.
All in all, this feels to me like an edge use case for a database. GreenPlum's core selling point is analytics, but MADlib is an Apache project, so can it really count as Greenplum's own selling point? And analyzing data inside a database still feels a little off to me. On top of that, Greenplum itself isn't very stable and has quite a few bugs; the bulk load earlier failed (gpfdist).
That said, the tool is convenient: the data gets processed right where it is stored.
In short, it's enough to know that it exists.